Fruit Classification Based on Improved YOLOv7 Algorithm

With the rapid development of technology, unmanned vending machines have emerged as a primary contactless retail method. Efficient and accurate automated identification of agricultural products during distribution and sale has become an urgent problem. This article presents an improved YOLOv7 (You Only Look Once) algorithm for fruit detection in complex environments. By replacing the 3×3 convolutions in the backbone of YOLOv7 with Deformable ConvNet v2 (DCNv2), the recognition accuracy and efficiency of fruit classification in YOLOv7 are enhanced. The results indicate that the overall recognition accuracy of this system for ten types of fruits is 98.3%, demonstrating high precision and stability.
Keywords— fruit classification, YOLOv7, DCNv2


I. INTRODUCTION
At present, fruit classification mostly relies on manual identification, which suffers from low information collection efficiency, inaccurate classification, and high labor costs. Fruit is mainly circulated through wholesale markets and supermarkets of various types. Small and medium-sized supermarkets and wholesale markets enter the unit price of each fruit into electronic platform scales, electronic scales, and other equipment for weighing and billing. Input errors are possible in this process, and when the number of customers increases, the time lost to such errors greatly degrades the shopping experience. The common defect of traditional sales methods is that they are time-consuming and error-prone.
To automate fruit recognition and improve its accuracy, many researchers have applied image recognition techniques to fruit classification. Arivazhagan et al. [1] used color and texture features to identify fruits and validated their effectiveness on 15 different fruits. Rocha et al. [2] combined different features and classifiers to reduce the classification error on 15 classes of fruits and vegetables to 15% with limited training data. Dubey et al. [3] proposed an improved hue-difference histogram texture feature descriptor and verified it on a fruit and vegetable database. Dubey et al. [4] presented a fruit classification method using K-means for image segmentation and a Support Vector Machine (SVM) for classification; the experimental results demonstrated the effectiveness of this method. Sachin C et al. [5] applied the YOLO algorithm to vegetable recognition by feeding multiple images of different vegetables into the network, manually drawing bounding boxes around the vegetables using OpenCV, and preprocessing the images before training. The recognition accuracy of this method reached 61.6%.
As a real-time object detector, YOLO has been widely applied in various fields. Rahul Sharma [6] proposed the application of YOLOv3 for detecting anthracnose lesions on apple surfaces. Yu JM et al. [7] applied YOLOv4 to mask detection, achieving a mask recognition mAP of 98.3% at a frame rate of 54.57 FPS, demonstrating good robustness. Hu XL et al. [8] improved YOLOv4 for real-time detection of uneaten feed particles in underwater aquaculture images, achieving an average precision improvement of 27.21% and a reduction of about 30% in computation, with satisfactory recognition results. This article applies an improved YOLOv7 algorithm to fruit recognition, replacing the 3×3 convolutions in the backbone of YOLOv7 with deformable convolutions (DCNv2). It achieves high recognition accuracy in tests with various types of fruits.

A. DCNv2 Network Architecture
The DCN-v2 network comes in two types: the Stacked structure and the Parallel structure. The difference between them lies in their architecture. In the Stacked structure, the input vector passes through multiple Cross network layers for feature interaction and then goes through a deep neural network (DNN) layer. In the Parallel structure, similar to the DCN structure, the input vector goes through the Cross layer and the DNN layer simultaneously, and the outputs are then concatenated.
In the Cross network of DCN-v2, let the output vector of the $l$-th layer be denoted as $x_l$. The output vector of the $(l+1)$-th layer, $x_{l+1}$, can be represented as follows:

$$x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l$$

Here $x_0$ is the input vector of the Cross network, $W_l$ and $b_l$ are the weight matrix and bias of the $l$-th layer, and $\odot$ denotes element-wise multiplication. The weights of the Cross layer change from an $n$-dimensional vector to an $n \times n$ matrix. Analysis of the matrices learned by the Cross network in the DCN-v2 model shows that computational costs can be reduced by matrix decomposition. One approach is to decompose the dense matrix $W_l$ into two low-rank matrices $U_l$ and $V_l$ of size $n \times r$, with rank $r$ smaller than $n$. As a result, the computation in the Cross network can be represented by the following formula:

$$x_{l+1} = x_0 \odot \left(U_l \left(V_l^{\top} x_l\right) + b_l\right) + x_l$$

In the feature utilization part, YOLOv7 extracts three feature layers located at different levels of the backbone. These feature layers are in the middle, lower, and bottom layers. By utilizing the FPN feature pyramid, we obtain three enhanced feature layers with shapes of (20,20,512), (40,40,256), and (80,80,128). These feature layers are then passed into the YOLO Head to obtain the prediction results.

YOLOv7 utilizes the RepConv structure before the YOLO Head. The basic idea is to introduce a special residual structure for training, which can be equivalently simplified to a regular 3×3 convolution during actual prediction. After the above network structure, we obtain the prediction results of the three feature layers, and the positions of the prediction boxes on the image are obtained after decoding.
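The three feature-layer shapes quoted above can be checked with a short calculation. The sketch below assumes a 640×640 input image and downsampling strides of 32, 16, and 8; neither value is stated in the text, but together they reproduce the reported grid sizes. The function names are hypothetical and used only for illustration.

```python
def feature_map_shapes(input_size=640, strides=(32, 16, 8), channels=(512, 256, 128)):
    """Return (height, width, channels) for each detection scale.

    input_size and strides are assumptions for illustration; the channel
    counts are the ones quoted in the text.
    """
    shapes = []
    for stride, ch in zip(strides, channels):
        if input_size % stride != 0:
            raise ValueError("input_size must be divisible by every stride")
        side = input_size // stride
        shapes.append((side, side, ch))
    return shapes


def total_grid_cells(shapes):
    """Count the grid cells over all scales (each cell yields predictions)."""
    return sum(h * w for h, w, _ in shapes)


shapes = feature_map_shapes()
print(shapes)                    # [(20, 20, 512), (40, 40, 256), (80, 80, 128)]
print(total_grid_cells(shapes))  # 8400
```

Under these assumptions, the coarsest 20×20 grid corresponds to the deepest backbone layer, and the 80×80 grid to the shallowest of the three.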

A. Experimental Environment
The experiments were conducted on a computer running the Windows 10 operating system and equipped with a GeForce RTX 2060 GPU. The CUDA version used is 11.7, and the

B. Datasets
We use a self-built fruit dataset to evaluate whether the model improvements are effective. It includes 2956 pictures of six kinds of fruits, including kiwi, rockmelon, strawberry, and mango. A picture may contain a variety of fruits, and the label statistics are shown in Figure 3.
The dataset suffers from sample imbalance, which can be mitigated to a certain extent through image weight redistribution and data augmentation such as cropping, rotation, and flipping. The dataset is divided into a training set, a validation set, and a test set in a ratio of 6:2:2.
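A 6:2:2 split of this kind can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_dataset`, the fixed seed, and the use of integer indices in place of image files are assumptions, and the split is purely random rather than stratified by class.

```python
import random


def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Shuffle items and split them into train/validation/test subsets.

    The ratios follow the 6:2:2 split described in the text; the seed is an
    arbitrary choice that makes the split reproducible.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Integer indices stand in for the 2956 images of the dataset.
train, val, test = split_dataset(range(2956))
print(len(train), len(val), len(test))  # 1773 591 592
```

Because a picture may contain several fruit classes, a random split does not guarantee identical class proportions in each subset; a stratified split would need per-image label information.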

C. Experimental results and analysis
To test our system, we identified 4 fruits: mango, strawberry, orange and cantaloupe. Figure 4 (a), (c), and (e) show the DCNv2-YOLOv7 recognition results for cantaloupe, strawberry, and kiwifruit, respectively, and Figure 4 (b), (d), and (f) show the corresponding recognition results of YOLOv7. For Figure 4 (a) and (b), DCNv2-YOLOv7 recognizes the cantaloupe with a probability of 96%, while YOLOv7 recognizes it with a probability of 90%. For Figure 4 (c) and (d), DCNv2-YOLOv7 recognizes the strawberry with a probability of 95%, while YOLOv7 recognizes strawberry with probabilities of 88% and 93%. For Figure 4 (e) and (f), DCNv2-YOLOv7 recognizes the kiwi with a probability of 92%, whereas YOLOv7 recognizes the fruit as orange with a probability of 91%. In addition, the recognition time for these fruits is 54 ms with DCNv2-YOLOv7 and 58 ms with YOLOv7.

D. Experimental results
In practical application scenarios, the brightness of camera-captured images can be affected by environmental factors such as occlusions. Therefore, we conducted recognition tests on fruits under different brightness conditions. The brightness levels were categorized into dark, normal, and bright, with 20 trials for each group. We used accuracy as the evaluation metric for our model, which can be obtained from the confusion matrix in Table I. Accuracy is calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Table II shows the recognition of five fruits under different brightness conditions; the recognition probability of each fruit gradually increases with brightness. Under bright conditions, all fruits are successfully identified. Under dark conditions, the probability of correctly recognizing kiwi is lower, with an accuracy of 90%. Over all 240 tests, the overall accuracy reaches 98.3%. Table II thus shows that DCNv2-YOLOv7 completes the classification task well and meets the expected improvement requirements. In practical applications, it can maintain high recognition accuracy and efficiency in most cases. Improving the low recognition rate caused by the detection environment will be the next research direction.
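The accuracy metric derived from a confusion matrix can be computed as in the sketch below. The function name and the example counts are hypothetical; the only figure taken from the text is the total of 240 tests, for which 236 correct recognitions is the count that rounds to the reported 98.3% (the text does not list this count explicitly).

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts chosen only so that 236 of 240 decisions are correct.
example = accuracy(tp=118, tn=118, fp=2, fn=2)
print(f"{example:.1%}")  # 98.3%
```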
We set the number of epochs to 100 and trained YOLOv7 and DCNv2-YOLOv7 separately. We use a test set independent of the training and validation sets to evaluate model performance. The test results are shown in Table III. DCNv2-YOLOv7 improves on all four metrics, which indicates that our model improvement is effective.

E. Ablation Study
For the DCNv2-YOLOv7 model, two loss curves and two metric curves on the training set are plotted in Figure 5. Each loss tends to be stable and eventually converges to a small value, and the metric curves eventually stabilize, which suggests that the parameters set for the model are reasonable. As shown in Fig. 6, we also present some test-set pictures detected by DCNv2-YOLOv7 as actual results.
B. Establishment of DCNv2-YOLOv7 Network
The YOLOv7 object detection algorithm mainly consists of three components: the Backbone feature extraction structure, the FPN feature fusion structure (in this paper, the SPPCSPC structure is included in FPN), and the head for regression and classification. It can detect and classify objects simultaneously. Modify_Multi_Concat_Block is a modified version of the Multi_Concat_Block network structure. Their architecture diagrams are shown in Figure 1.

Fig. 1. Comparison before and after modification: (a) Multi_Concat_Block structure; (b) Modified_Multi_Concat_Block structure.

The DCNv2-YOLOv7 network replaces the last convolution layer in the Backbone with DCNv2, where the convolution kernel size is 3. The structure of DCNv2-YOLOv7 is shown in Figure 2.
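To make the substituted operator concrete, the following minimal sketch evaluates a 3×3 modulated deformable convolution (the Deformable ConvNet v2 operator) at a single output location of a single-channel feature map. It is an illustration under stated assumptions, not the authors' implementation: the function names `bilinear` and `modulated_deform_conv_at` are hypothetical, and the offsets and modulation masks are supplied by hand, whereas in DCNv2 they are predicted from the input features by an additional convolution layer.

```python
import math


def bilinear(img, y, x):
    """Bilinearly sample a 2-D list at fractional (y, x); zero outside."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * img[yy][xx]
    return value


def modulated_deform_conv_at(img, kernel, offsets, masks, cy, cx):
    """3x3 modulated deformable convolution output at location (cy, cx).

    kernel:  3x3 list of weights
    offsets: nine (dy, dx) pairs, one per kernel tap (hand-set here)
    masks:   nine modulation scalars in [0, 1] (hand-set here)
    """
    out = 0.0
    k = 0
    for ky in (-1, 0, 1):
        for kx in (-1, 0, 1):
            dy, dx = offsets[k]
            # Sample at the regular grid position shifted by the offset.
            sample = bilinear(img, cy + ky + dy, cx + kx + dx)
            out += kernel[ky + 1][kx + 1] * sample * masks[k]
            k += 1
    return out


img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

# Zero offsets and unit masks reduce to a regular 3x3 convolution.
print(modulated_deform_conv_at(img, ones, [(0.0, 0.0)] * 9, [1.0] * 9, 1, 1))  # 45.0
# Shifting every tap half a pixel to the right changes the sampled values.
print(modulated_deform_conv_at(img, ones, [(0.0, 0.5)] * 9, [1.0] * 9, 1, 1))  # 39.0
```

With zero offsets and unit masks the result equals that of an ordinary 3×3 convolution, which is why the operator can be dropped into the position of an existing convolution layer with kernel size 3.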

Fig. 6. Visualization results from DCNv2-YOLOv7 on the test set.

IV. CONCLUSIONS
This paper improves YOLOv7 by modifying its backbone feature extraction network. The experimental results of the improved DCNv2-YOLOv7 show that the recognition accuracy for 4 kinds of fruits can reach 98.3% under three lighting conditions: dark, normal and bright. This work shows that algorithm research based on deep learning theory has important practical significance for the intelligent identification of fruit species.

TABLE I. CONFUSION MATRIX

TABLE II. FRUIT RECOGNITION RESULTS OF DCNV2-YOLOV7

TABLE III. ABLATION STUDY ON TEST-SET OF FRUIT