AFC-ResNet18: A Novel Real-Time Image Semantic Segmentation Network for Orchard Scene Understanding
Jian Zhang1, Jingwei Yang1, Ting An1, Pengxin Wu1, Chen Ma1, Cong Zhang1, Ying Zhao2, Lihong Wang1,*, Chengsong Li1,**
Published in Journal of the ASABE 67(2): 493-500 (doi: 10.13031/ja.15682). Copyright 2024 American Society of Agricultural and Biological Engineers.
1 College of Engineering and Technology, Southwest University, Chongqing, China.
2 Mechanical and Electrical Engineering College, Hainan University, Haikou, Hainan, China.
Correspondence: *wlh_shz@163.com, **lcs_shz@163.com
Submitted for review on 23 May 2023 as manuscript number MS 15682; approved for publication as a Research Article and as part of the Artificial Intelligence Applied to Agricultural and Food Systems Collection by Community Editor Dr. Yiannis Ampatzidis of the Machinery Systems Community of ASABE on 14 November 2023.
Highlights
- A novel real-time image semantic segmentation network for orchards, termed AFC-ResNet18, was designed and tested.
- The AFC-ResNet18 model outperformed the SwiftNet network in terms of segmentation depth.
- The AFC-ResNet18 model achieved the highest accuracy in the architecture performance testing.
- The AFC-ResNet18 model won first place in the orchard scene test with 72.5% accuracy.
Abstract. Semantic segmentation is a fundamental prerequisite for the real-time understanding of scenes. This understanding is essential for developing automated devices that can enhance productivity. Orchards, being labor-intensive and time-consuming workplaces, urgently require automated equipment to boost efficiency. Therefore, the objective of this article is to develop a real-time image semantic segmentation network tailored for orchard environments. This development aims to offer significant new insights into the design of automated maintenance and harvesting equipment. Based on ResNet, the 2015 classification champion network, a novel real-time image semantic segmentation network termed AFC-ResNet18, which used an attentional feature complementary module (AFC) to fuse RGB and depth image information, was designed and systematically tested. Interestingly, in the segmentation ability tests, the AFC-ResNet18 model outperformed the SwiftNet network in terms of segmentation depth. Surprisingly, in the architecture performance testing, the AFC-ResNet18 model achieved the highest accuracy. Noteworthily, in the orchard scene test, the AFC-ResNet18 model won first place with 72.5% accuracy. Predictably, these findings may accelerate the development of novel automated equipment, especially AFC-ResNet18-based robots, for maintaining orchards worldwide.
Keywords. Attentional feature complementary module, Orchard, Real-time, Robots, Semantic segmentation.

The orchard industry provides an environment in which a large number of human laborers are currently employed for maintenance and fruit harvesting. Remarkably, these operations are labor- and time-intensive (Li et al., 2021a; Xiong et al., 2018b). With the gradual reduction of the labor force, these factors have created unprecedented challenges to orchard production. As expected, there is an urgent need for the study of automated orchard equipment, which may contribute to improving productivity. Notably, using machine vision to develop robot automation technology is noted to have great potential to improve productivity (Li et al., 2021a; Patrício and Rieder, 2018; Tripathi and Maktedar, 2020).
Indeed, the ability to understand scenes in real time is crucial for enabling robots to navigate steadily and perform tasks autonomously in orchard environments (Li et al., 2021b; Xiong et al., 2018b). As a result, vision systems are extensively used in the applications of image segmentation (Ghielmetti et al., 2022; Lv et al., 2023; Wang et al., 2019; Xiao et al., 2022; Xu et al., 2019; Zou et al., 2023), fruit recognition (Xiong et al., 2018a; Xiong et al., 2018b), path planning (Davidson et al., 2016; Li et al., 2021a; Silwal et al., 2017), etc. Remarkably, the understanding of scenes, a fundamental aspect of computer vision, has increasingly been demonstrated through applications developed for image reasoning (Cordts et al., 2016; Li et al., 2021a; Oberweger et al., 2016; Yoon et al., 2015). Importantly, semantic segmentation is the basic premise of target recognition and a necessary way to achieve complete scene understanding (Majeed et al., 2020; Sui et al., 2017).
Recently, semantic segmentation based on deep Convolutional Neural Networks (CNNs) has been commonly used in scene understanding and has achieved remarkable results (Garcia-Garcia et al., 2018; Gupta et al., 2014; Wang et al., 2019). Therefore, many computer vision problems are semantically segmented via depth architectures (Borji and Dundar, 2017; Garcia-Garcia et al., 2018; Gupta et al., 2014). Depth models respond strongly and vary greatly with the gradient of obstacles (Enjarini and Graser, 2012). Depth images contain more location and contour information, which can be used as key indicators of objects in real orchard scenes. However, compared with other computer vision and machine learning methods, deep learning (DL) is far from mature (Garcia-Garcia et al., 2018). For one thing, several attempts by means of geometric calculation and CNNs showed that small-scale obstacles cannot be detected by relying solely on the obvious information in RGB images. For another thing, small and medium-scale obstacles in reality are commonly ignored by current mainstream obstacle datasets, which mainly assume fixed categories of objects in the scene (Hani et al., 2020). Indeed, there are obstacles not easily detected in the orchard domain, such as debris, bricks, and stones, which are usually small in size and of various shapes and types. Overall, the combined use of RGB and depth image information to detect obstacles in the orchard domain may contribute to improving the performance of image segmentation models (Bajcsy et al., 2020).
Orchards, being labor- and time-intensive domains, urgently need automated equipment to improve productivity. However, there are many difficult-to-detect obstacles in orchards, which pose great challenges to the development of automatic equipment. To address the above problem, this study created a new real-time image semantic segmentation network (AFC-ResNet18) based on ResNet18. Specifically, the AFC-ResNet18 adopted an attention feature complementary module (AFC) to fuse the features of RGB and depth images. After being trained with a multi-data set training strategy, the AFC-ResNet18 model could classify various obstacle categories in an orchard scene, including pixel-level, small-scale obstacles. Findings of this work may provide basic insights for the future design of automatic equipment to use in orchards, which may contribute to improving global orchard productivity.
Materials and Methods
Experimental Device
A computer with an NVIDIA GTX 1660 Ti GPU and an Intel Core i5-9400F CPU was used to run the AFC-ResNet18 model. Its software support environment was configured with CUDA 10.0, cuDNN 7.6.0, and PyTorch 1.1. With a learning rate of 0.0004 for the Adam optimizer, each model underwent 200 training iterations with a batch size of 128 (Kingma and Ba, 2014).
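The training configuration above can be summarized in a short PyTorch sketch. The snippet below is a minimal illustration, assuming a generic model, dataset, and loss supplied by the caller; it only reproduces the reported optimizer, learning rate, batch size, and iteration count, not the full training pipeline.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, criterion, epochs=200, batch_size=128, lr=4e-4):
    """Minimal training loop with the reported settings:
    Adam optimizer, learning rate 0.0004, batch size 128, 200 iterations."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)  # e.g., pixel-wise cross-entropy
            loss.backward()
            optimizer.step()
    return model
```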
Materials
Datasets
A composite dataset composed of a large-scale open-source image dataset (https://research.libraries.wsu.edu/?xmlui/handle/2376/17721) and a group of 800 supplementary images dominated by small-scale obstacles was used to develop the AFC-ResNet18. The large-scale open-source image dataset contained 2975/500/1525 images in the training/validation/test subsets, with several categories of fine tags. These images covered different orchard scenes in different seasons. The supplementary images were captured in June 2021 at an outdoor orchard in Beibei, Chongqing, China. Subsequently, 476, 80, and 244 captured images were randomly assigned to the training, validation, and test subsets, respectively, of the open-source image dataset. As a result, a composite dataset comprising 3,451 training, 580 validation, and 1,769 test images, each with a resolution of 2048×1024, was created for the respective subsets.
Methods
Data Processing and Enhancement
A huge number of data samples with different sizes, perspectives, and lighting conditions are the basic premise for a CNN to obtain robustness (Almomani and Ormeci, 2016; González-Camejo et al., 2021; Rossi et al., 2020). In fact, only limited scene data can be collected, resulting in a lack of data, which poses a great challenge to the training of a CNN. Therefore, enlarging the limited dataset is required. Geometric and optical transformations, such as motion, rotation, flip, zoom, distortion, brightness, and color changes, are effective data extension methods suggested by numerous studies (Shen et al., 2019; Taylor and Nitschke, 2017). Following previous studies, a Python image processing program was used to process the images (fig. 1a) in the composite dataset with operations like random clipping (fig. 1b), random flipping (fig. 1c), color change (fig. 1d), etc. After these operations, the composite dataset had been enlarged fourfold.
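As an illustration only, the sketch below shows how the random clipping, flipping, and color-change operations of figure 1 could be applied with Pillow. The crop ratio and brightness range are assumed values rather than the settings used in this study, and for segmentation the same geometric transform must also be applied to the label image.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

def augment(img: Image.Image) -> Image.Image:
    """Random crop, random horizontal flip, and random brightness (color) change
    with illustrative parameter ranges."""
    w, h = img.size
    cw, ch = int(w * 0.75), int(h * 0.75)            # crop to 3/4 of the original size
    left, top = random.randint(0, w - cw), random.randint(0, h - ch)
    img = img.crop((left, top, left + cw, top + ch))
    if random.random() < 0.5:                         # random flip
        img = ImageOps.mirror(img)
    factor = random.uniform(0.7, 1.3)                 # random brightness change
    return ImageEnhance.Brightness(img).enhance(factor)
```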
Design of the AFC-ResNet18
Experience with CNNs has demonstrated that proposing a completely new system architecture is a huge challenge; common networks and methods are extensively used to construct deep semantic segmentation systems (Garcia-Garcia et al., 2018). Among them, ResNet won the ILSVRC-2015 classification task with 96.43% accuracy (He et al., 2016). Notably, ResNet18 has a medium depth, a residual structure, and a small operation occupancy, which may be compatible with real-time operation. Consequently, this case study, based on ResNet18, proposed a neural network structure (AFC-ResNet18), using an attention feature complementary module (AFC) of RGB and depth images, to achieve fast and accurate pixel-level semantic reasoning segmentation of orchard scenes.
Semantic Segmentation Mechanism
Semantic segmentation is a critical step toward achieving fine-grained reasoning goals. The quality of segmentation directly affects the accuracy of target recognition (Li et al., 2021a). The essence of semantic segmentation is to encode and decode the high-level semantic information of images. Encoding is to record the high-level semantic information of images using an encoder composed of convolution, pooling, and other operations in DL technology. Decoding is to parse the coding results and achieve pixel classification results of the same size as the input image using a decoder that consists of linear interpolation, de-pooling, or trans-convolution operations. Remarkably, in the coding process, the extracted features can be fused to make full use of the spatial position information of low-level semantic features to obtain more accurate image semantic segmentation results. Figure 2a shows the basic symmetric image semantic segmentation network coding-decoding model, which integrates the coding characteristics of the corresponding coding layer with the decoding structure.
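To make the coding-decoding mechanism concrete, the toy PyTorch sketch below shows a symmetric encoder-decoder in which pooling and convolution encode semantics, bilinear interpolation decodes back to the input resolution, and a skip connection fuses the corresponding encoder features; the layer sizes are illustrative and unrelated to the actual AFC-ResNet18 layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoderDecoder(nn.Module):
    """Toy symmetric encoder-decoder: convolution + pooling encode semantics,
    bilinear interpolation decodes toward input resolution, and a skip
    connection fuses the corresponding encoder features (illustrative sizes)."""
    def __init__(self, num_classes=8):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.dec = nn.Conv2d(64 + 32, 32, 3, padding=1)
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                                   # low-level features
        e2 = self.enc2(F.max_pool2d(e1, 2))                 # encode: pool then convolve
        d = F.interpolate(e2, size=e1.shape[2:], mode="bilinear", align_corners=False)
        d = F.relu(self.dec(torch.cat([d, e1], dim=1)))     # fuse skip connection, decode
        return self.head(d)                                 # per-pixel class scores
```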
Figure 1. Representative images of (a) orchards with their (b) random clipping, (c) flipping, and (d) color change images outputted by the Python image processing program.

The AFC-ResNet18 takes RGB input as the main branch and depth input as the auxiliary branch (details of its architecture are shown in fig. S). The coding process (b1) was implemented mainly by operations like fusion and pooling. The fusion operation was conducted by fusing the output features of the depth branch to the RGB branch via the attention feature complementary module (AFC). The decoding operations (b2) were performed by a space pool module (SPP) and three up-sampling modules. The space pool module (SPP) is used to average the features on the aligned grids with different granularity and to generate feature maps with multi-scale information based on the fused image features. Furthermore, after linear interpolation was performed to up-sample the low-resolution feature maps via the first two up-sampling blocks (UP 1, UP 2), the feature mappings were averaged and then mixed by a 3×3 convolution. Finally, the last up-sampling block (UP 3) was used to restore the resolution to the same level as the input.
The action of the attention feature complementary module was to generate fused images (c5) by fusing the output features of the depth branch (c4) to the RGB branch (c2). In the fusing process, two squeeze-and-excitation (SE) modules were used to exploit global information to emphasize useful channel information and suppress useless channel information. Moreover, the sigmoid function was used to activate the convolution results and restrict the weight vector to values between 0 and 1. Furthermore, the weight vector was cross-multiplied with the input feature mapping of each of the two branches via the element superposition module. Finally, the results of the RGB branch and depth branch were added together to obtain the fused feature mapping (eq. 3).
Architecture and Mechanism of the AFC-ResNet18
The entire network architecture of the AFC-ResNet18 is given in figure 2b, and its details are displayed in figure S. The AFC-ResNet18 uses RGB as the main branch and depth as the auxiliary branch to extract features from RGB and depth images, respectively.
For coding, an attention feature complementary model (AFC) was designed after each layer of ResNet18 to fuse RGB and depth information to obtain more complementary features effectively. The feature map, which was fused by four ResNet18 blocks and AFC modules, had rich, high-level semantic information.
For decoding, before up-sampling, a space pool module (SPP), used to average the features on the aligned grids with different granularities, was designed to improve the model's use of input image pixels while maintaining real-time speed. Furthermore, feature maps with multi-scale information were generated based on the fused image features from the two branches collected by the SPP. Moreover, an efficient up-sampling module with three blocks was designed with reference to SwiftNet (Orsic et al., 2019), which restores the resolution of these feature mappings through the jump connection. When the up-sampling module operates, the decoder samples the compressed image semantics upwards to obtain the input resolution.
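A minimal sketch of such a space pool module is given below, assuming pyramid grid sizes of 1, 2, and 4 and a 1×1 projection back to the input channel count; both are illustrative choices rather than the values used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpacePoolModule(nn.Module):
    """Average the input features over grids of different granularity,
    up-sample each pooled map back to the input size, and project the
    concatenated multi-scale maps to the original channel count."""
    def __init__(self, channels, grids=(1, 2, 4)):
        super().__init__()
        self.grids = grids
        self.proj = nn.Conv2d(channels * (len(grids) + 1), channels, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        pyramid = [x]
        for g in self.grids:
            pooled = F.adaptive_avg_pool2d(x, g)            # average on a g x g grid
            pyramid.append(F.interpolate(pooled, size=(h, w),
                                         mode="bilinear", align_corners=False))
        return self.proj(torch.cat(pyramid, dim=1))         # multi-scale feature map
```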
AFC
To make combined use of the RGB and depth image information, the AFC (fig. 2c) was designed to fuse the output features of the depth branch with the RGB branch; details of the architecture of the AFC are shown in figure S. The RGB input feature mapping and the depth input feature mapping can be expressed as equation 1 and equation 2, respectively.
X ∈ ℝ^(C×H×W)	(1)
Y ∈ ℝ^(C×H×W)	(2)
Figure 2. Architecture of the AFC-ResNet18. (a) A representative basic symmetric image semantic segmentation network coding-decoding model, used to explain the semantic segmentation mechanism, (b) architecture and mechanism of the AFC-ResNet18, and (c) the attention feature complementary module (AFC). In the AFC, an SE block was used as the channel attention method to utilize global information, emphasize useful information, and suppress useless channel information (Jie et al., 2019). Then, a global average pool was used to describe the channel in the channel attention mechanism, and a 1×1 convolution layer with the same channel count as the input was added to excavate the correlation between channels. Moreover, the sigmoid function was used to activate the convolution results and to restrict the weight vector values between 0 and 1. Furthermore, the weight vector was cross-multiplied, and the feature mapping was inputted in the two branches. Finally, the results of the RGB branch and depth branch were added together to obtain the feature mapping (Z ∈ ℝ^(C×H×W)). Additionally, since the attention mechanism was employed in image fusion, higher weights can be obtained for features with a large amount of information; then, the complementary information in depth could be utilized more effectively.
Z = X ⊗ s(φ(f(X))) + Y ⊗ s(φ(f(Y)))	(3)
where
f = pooling operation
φ = convolution operation
s = sigmoid function.
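The fusion described by equation 3 can be sketched in PyTorch as follows. This is an illustrative reading of the AFC, assuming the channel attention in each branch is formed by global average pooling (f), a 1×1 convolution (φ), and a sigmoid (s), with the two re-weighted feature mappings then added element-wise.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Weight vector s(phi(f(X))): global average pool (f), 1x1 convolution
    with the same channel count (phi), and sigmoid activation (s)."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.Sigmoid()

    def forward(self, x):
        return self.act(self.conv(self.pool(x)))            # values restricted to (0, 1)

class AFC(nn.Module):
    """Attention feature complementary fusion of an RGB feature mapping X and a
    depth feature mapping Y: Z = X * s(phi(f(X))) + Y * s(phi(f(Y)))."""
    def __init__(self, channels):
        super().__init__()
        self.att_rgb = ChannelAttention(channels)
        self.att_depth = ChannelAttention(channels)

    def forward(self, x, y):
        return x * self.att_rgb(x) + y * self.att_depth(y)  # element superposition
```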
Up-Sampling Module
The up-sampling module consists of three blocks connected to the coding section by a jump connection. The first two blocks of the module performed a linear interpolation analysis to up-sample the low-resolution feature maps so that they have the same resolution as the feature mapping of the jump connection. Subsequently, the feature mappings were averaged and mixed by a 3×3 convolution. Then, before the activation function (ReLU) of the residual block, the last up-sampling block was inserted into the connection and used to restore the resolution to the same level as the input.
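A simplified sketch of one such up-sampling block is shown below; it assumes the decoder and jump-connection features share the same channel count, which is an illustrative simplification.

```python
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """One up-sampling block: bilinear interpolation of the low-resolution
    features to the jump-connection resolution, element-wise averaging of the
    two mappings, then mixing with a 3x3 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, low_res, skip):
        up = F.interpolate(low_res, size=skip.shape[2:],
                           mode="bilinear", align_corners=False)
        fused = (up + skip) / 2.0          # average the two feature mappings
        return F.relu(self.mix(fused))     # 3x3 convolution mixing
```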
Multi-Source Data Learning
Although annotated tags are necessary for semantic segmentation, annotating all classes in the real world is impossible. Fortunately, multi-source data learning (Liu et al., 2022), an effective method, can utilize as much diverse data as possible to increase the number of recognizable classes from tens to almost any scenario. Therefore, the multi-source data learning method was employed to accomplish the learning of the created heterogeneous datasets. To facilitate learning, the datasets were denoted as D1-Dn, and the class-sets as A and B. Two non-conflicting classes, the small obstacle class and the road class, were set as class-set A, and the remaining classes were set as class-set B. Apart from small-scale obstacles, the orchard composite dataset was annotated in different classes and set as Dc, where C represents the total number of classes. Then, the loss function for multiple dataset training can be represented as equation 4, where λ was set as equation 5.
L(f(x), y) = L_A(f(x), y) + λ·L_B(f(x), y)	(4)
(5)
where
x = image
y = corresponding label of x
L_A(·,·) = cross-entropy loss function of class-set A
L_B(·,·) = cross-entropy loss function of class-set B
f = segmentation model
λ = hyperparameter to balance the weight of different classes.
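Since the exact forms of equations 4 and 5 are not reproduced here, the sketch below only illustrates the idea of equation 4 under the stated definitions: a cross-entropy term over class-set A and a λ-weighted cross-entropy term over class-set B, with pixels that carry no label in a given class-set ignored.

```python
import torch.nn.functional as F

IGNORE_INDEX = 255  # pixels with no annotation in the given class-set

def multi_source_loss(logits_a, logits_b, target_a, target_b, lam=1.0):
    """Cross-entropy over class-set A plus a lambda-weighted cross-entropy
    over class-set B; unlabeled pixels are ignored in each term."""
    loss_a = F.cross_entropy(logits_a, target_a, ignore_index=IGNORE_INDEX)
    loss_b = F.cross_entropy(logits_b, target_b, ignore_index=IGNORE_INDEX)
    return loss_a + lam * loss_b
```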
Evaluation Metrics
Among all the proposed metrics, the mean intersection over union (MIoU; eq. 6) stands out for its simplicity and representativeness, making it the most commonly used (Garcia-Garcia et al., 2017). Thus, the MIoU was used to estimate the accuracy of the AFC-ResNet18 for semantic segmentation.
MIoU = (1/K) Σ_i X_ii / (P_i + Σ_j X_ji − X_ii)	(6)
where
K = total number of pixel classes
P_i = total number of pixels belonging to class i
X_ji = number of pixels belonging to class j but predicted as class i
X_ii = number of pixels belonging to class i and correctly predicted as class i.
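For illustration, the MIoU of equation 6 can be computed from a confusion matrix as sketched below, assuming integer class maps for the prediction and the label; classes absent from both are skipped in the average.

```python
import numpy as np

def mean_iou(pred: np.ndarray, label: np.ndarray, num_classes: int) -> float:
    """MIoU: for each class i, IoU_i = X_ii / (P_i + sum_j X_ji - X_ii),
    averaged over the classes that occur; X is the confusion matrix and
    P_i is the number of pixels labeled as class i."""
    conf = np.bincount(num_classes * label.flatten() + pred.flatten(),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    ious = []
    for i in range(num_classes):
        tp = conf[i, i]
        denom = conf[i, :].sum() + conf[:, i].sum() - tp  # P_i + predicted-as-i - X_ii
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious))
```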
Results and Discussion
Dataset Training
To evaluate the impact of the proposed multi-data set training strategy on the model accuracy, the AFC-ResNet18 model was trained without considering tag heterogeneity (fig. 3). Surprisingly, a result with a 20% reduction in accuracy was obtained. Obviously, the segmentation results for similar object models were inconsistent because classes in datasets conflicted with each other. These conflicts could arise from a variety of sources, such as different annotation types or one class being a subset of another. Importantly, the proposed multi-data set training strategy significantly improved the accuracy of the model, which was consistent with previous reports.
Fused Image
Figure 4 reports the feature mapping images processed by the first two layers of the ResNet18 and the fusion feature mapping image prepared by the AFC (fig. 2c). Obviously, small-scale obstacles (trunks, fruits) are clearer in the depth feature image (fig. 4b) than in the RGB feature image (fig. 4a). Importantly, their features were integrated using the AFC and presented in the fused feature image (fig. 4c). Reasonably, the AFC-ResNet18 model (fig. 2b) performed better than the single RGB model. Additionally, the AFC enabled the presented model to utilize depth features in a complementary way, which would appear to improve the accuracy of obstacle detection.
Segmentation Depth
To verify the segmentation ability of the model, the segmentation details of AFC-ResNet18 and SwiftNet (Orsic et al., 2019) in different depth ranges were compared and analyzed, as presented in figure 5. Observably, the AFC-ResNet18 model (fig. 5d) outperformed the SwiftNet model (fig. 5c) in terms of segmentation depth and significantly improved the recognition accuracy of small-scale roadblocks, such as debris, bricks, and stones. In fact, the appearance and surface texture of small-scale obstacles are irregular and easy to mix with the orchard background, which makes it difficult to distinguish them by purely RGB-based methods. Conversely, in depth images, where texture is ignored, the contours of small-scale obstacles are clear. Therefore, the AFC-ResNet18, with an AFC to fuse RGB and depth image information, could detect different types of obstacles, especially small-scale ones. Although small-scale obstacles are not limited to the types of obstacles in this dataset, the definition of this specific category allows this article to use the method of deep learning to meet the requirements. When confronted with innumerable possible detection situations, the network can use multi-source learning strategies to generalize far beyond its training data.

Figure 3. Representatives of (a) the original scene image and (b) its reference segmentation image, as well as (c) the segmentation image prepared by AFC-ResNet18.

Figure 4. Representative mapped images of (a) RGB and (b) depth features prepared by the first two layers, and (c) their fused features reported by the AFC (fig. 2c); the architecture details of the AFC are given in figure S.

Figure 5. Representative semantic segmentation images of (a) the original scene, generated by (c) the SwiftNet model and (d) the AFC-ResNet18 model, respectively; (b) parallax image corresponding to (a) the original scene.
Architecture Performance
Table 1 displays the performance of models with different architectures and fusion schemes. Particularly, only the RGB branch of the AFC-ResNet18 model was used as the RGB approach presented in table 1. It differs from SwiftNet (Orsic et al., 2019) in that an SE block follows each ResNet18 layer, which controls the image features and makes it possible to determine whether the depth information can improve accuracy. Obviously, the accuracy of the image-stack method is the lowest, indicating that this method cannot use the depth information effectively. In practice, this method superimposes the depth image on the respective RGB images to form a 4-channel input to a branch of the model. Notably, the image fusion method performs better than the image-stack approach. The reason is that image fusion splices the RGB feature mapping and depth feature mapping into a high-dimensional feature mapping and restores it to the original dimension after a 1×1 convolution, which can effectively utilize the depth information. Importantly, the AFC-ResNet18 achieves the highest accuracy, owing to its ability to integrate the features extracted from the RGB and depth maps effectively.
Segmentation Accuracy
Table 2 shows the semantic segmentation results of orchard scenes using a set of different models consisting of ERF-PSPNet (Yang et al., 2018), SwiftNet (Orsic et al., 2019), and AFC-ResNet18. As comparison models, ERF-PSPNet and SwiftNet only accept RGB input. After testing, the AFC-ResNet18 model, which benefited from complementary depth information, won first place with 72.5% accuracy.
Table 1. Performance of models with different architectures and fusion schemes.
Architecture | Image fusion | Double branch | Element superposition | Accuracy | Parameters
RGB | | | | 69.20% | 12.17 M
Image stack | ✓ | ✓ | | 65.20% | 12.17 M
RGB and depth images fusion | ✓ | ✓ | ✓ | 68.67% | 25.08 M
AFC-ResNet18 | ✓ | ✓ | ✓ | 72.22% | 23.69 M
Table 2. Semantic segmentation performance of orchard scenes with different models.
Model | Fusion model | Accuracy (%) | Velocity (FPS)
ERF-PSPNet | × | 64.1 | 20.4
SwiftNet | ✓ | 72.0 | 41.0
AFC-ResNet18 | ✓ | 72.5 | 22.2

Conclusions
In the present study, AFC-ResNet18, a novel real-time image semantic segmentation fusion network for orchard scenes, was constructed and tested. Fortunately, benefiting from the multi-source data training strategy and the superior performance of the network architecture, the AFC-ResNet18 was able to predict small-scale obstacles with 72.5% accuracy and a 22.2 FPS reasoning speed. Notably, a comparative analysis verified that the AFC-ResNet18 model outperforms the SwiftNet model in segmentation depth. Predictably, in the near future, the AFC-ResNet18 model will be compressed and deployed on portable devices to understand a robot's working environment, especially the orchard scene.
Data Availability
The data sets of this study may be made available from the corresponding author on reasonable request.
Supplemental Material
The supplemental materials mentioned in this article are available for download from the ASABE Figshare repository at: https://doi.org/10.13031/25134386
Acknowledgments
This study was supported by the research and development of key technologies and equipment for walnut mechanized harvesting (2022B02028-3) and the development of a precise variable fertilizer application robot for mountain citrus (2022-158-13).
References
AlMomani, F. A., & Örmeci, B. (2016). Performance of Chlorella vulgaris, Neochloris oleoabundans, and mixed indigenous microalgae for treatment of primary effluent, secondary effluent and centrate. Ecol. Eng., 95, 280-289. https://doi.org/10.1016/j.ecoleng.2016.06.038
Bajcsy, P., Feldman, S., Majurski, M., Snyder, K., & Brady, M. (2020). Approaches to training multiclass semantic image segmentation of damage in concrete. J. Microsc., 279(2), 98-113. https://doi.org/10.1111/jmi.12906
Borji, A., & Dundar, A. (2017). A new look at clustering through the lens of deep convolutional neural networks. arXiv preprint arXiv: 1706.05048. https://doi.org/10.48550/arXiv.1706.05048
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,... Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (pp. 3213-3223). IEEE. https://doi.org/10.1109/CVPR.2016.350
Davidson, J., Silwal, A., Karkee, M., Mo, C., & Zhang, Q. (2016). Hand-picking dynamic analysis for undersensed robotic apple harvesting. Trans. ASABE, 59(4), 745-758. https://doi.org/10.13031/trans.59.11669
Enjarini, B., & Gräser, A. (2012). Planar segmentation from depth images using gradient of depth feature. Proc. 2012 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (pp. 4668-4674). IEEE. https://doi.org/10.1109/IROS.2012.6385521
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., & Garcia-Rodriguez, J. (2017). A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv: 1704.06857. https://doi.org/10.48550/arXiv.1704.06857
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Martinez-Gonzalez, P., & Garcia-Rodriguez, J. (2018). A survey on deep learning techniques for image and video semantic segmentation. Appl. Soft Comput., 70, 41-65. https://doi.org/10.1016/j.asoc.2018.05.018
Ghielmetti, N., Loncar, V., Pierini, M., Roed, M., Summers, S., Aarrestad, T.,... Harris, P. (2022). Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml. Mach. Learn. Sci. Technol., 3(4), 045011. https://doi.org/10.1088/2632-2153/ac9cb5
González-Camejo, J., Ferrer, J., Seco, A., & Barat, R. (2021). Outdoor microalgae-based urban wastewater treatment: Recent advances, applications, and future perspectives. WIREs Water, 8(3), e1518. https://doi.org/10.1002/wat2.1518
Gupta, S., Girshick, R., Arbeláez, P., & Malik, J. (2014). Learning rich features from RGB-D images for object detection and segmentation. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, (Vol. 8695, pp. 345-360). Cham: Springer. https://doi.org/10.1007/978-3-319-10584-0_23
Häni, N., Roy, P., & Isler, V. (2020). MinneApple: A benchmark dataset for apple detection and segmentation. IEEE Robot. Autom. Lett., 5(2), 852-858. https://doi.org/10.1109/LRA.2020.2965061
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (pp. 770-778). IEEE. https://doi.org/10.1109/CVPR.2016.90
Hu, J., Shen, L., Albanie, S., Sun, G., & Wu, E. (2020). Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell., 42(8), 2011-2023. https://doi.org/10.1109/TPAMI.2019.2913372
Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv: 1412.6980. https://doi.org/10.48550/arXiv.1412.6980
Li, Q., Jia, W., Sun, M., Hou, S., & Zheng, Y. (2021a). A novel green apple segmentation algorithm based on ensemble U-Net under complex orchard environment. Comput. Electron. Agric., 180, 105900. https://doi.org/10.1016/j.compag.2020.105900
Li, Y., Li, M., Qi, J., Zhou, D., Zou, Z., & Liu, K. (2021b). Detection of typical obstacles in orchards based on deep convolutional neural network. Comput. Electron. Agric., 181(8), 105932. https://doi.org/10.1016/j.compag.2020.105932
Liu, H., Li, L., Jiang, H., Yang, Y., & Liu, Y. (2022). Small object detection based on multi-source data learning fusion network. In S. C. Chu, S. H. Chen, Z. Meng, K. H. Ryu, & G. A. Tsihrintzis (Eds.), Advances in Intelligent Information Hiding and Multimedia Signal Processing. Smart Innovation, Systems and Technologies (Vol. 277, pp. 59-67). Singapore: Springer. https://doi.org/10.1007/978-981-19-1057-9_7
Lv, N., Zhang, Z., Li, C., Deng, J., Su, T., Chen, C., & Zhou, Y. (2023). A hybrid-attention semantic segmentation network for remote sensing interpretation in land-use surveillance. Int. J. Mach. Learn. Cybern., 14(2), 395-406. https://doi.org/10.1007/s13042-022-01517-7
Majeed, Y., Zhang, J., Zhang, X., Fu, L., Karkee, M., Zhang, Q., & Whiting, M. D. (2020). Deep learning based segmentation for automated training of apple trees on trellis wires. Comput. Electron. Agric., 170, 105277. https://doi.org/10.1016/j.compag.2020.105277
Oberweger, M., Wohlhart, P., & Lepetit, V. (2016). Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv: 1502.06807. http://arxiv.org/abs/1502.06807
Oršic, M., Krešo, I., Bevandic, P., & Šegvic, S. (2019). In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images. Proc. 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR) (pp. 12599-12608). IEEE. https://doi.org/10.1109/CVPR.2019.01289
Patrício, D. I., & Rieder, R. (2018). Computer vision and artificial intelligence in precision agriculture for grain crops: A systematic review. Comput. Electron. Agric., 153, 69-81. https://doi.org/10.1016/j.compag.2018.08.001
Rossi, S., Díez-Montero, R., Rueda, E., Castillo Cascino, F., Parati, K., García, J., & Ficara, E. (2020). Free ammonia inhibition in microalgae and cyanobacteria grown in wastewaters: Photo-respirometric evaluation and modelling. Bioresour. Technol., 305, 123046. https://doi.org/10.1016/j.biortech.2020.123046
Shen, Y., Yin, Y., Zhao, C., Li, B., Wang, J., Li, G., & Zhang, Z. (2019). Image recognition method based on an improved convolutional neural network to detect impurities in wheat. IEEE Access, 7, 162206-162218. https://doi.org/10.1109/ACCESS.2019.2946589
Silwal, A., Davidson, J. R., Karkee, M., Mo, C., Zhang, Q., & Lewis, K. (2017). Design, integration, and field evaluation of a robotic apple harvester. J. Field Rob., 34(6), 1140-1159. https://doi.org/10.1002/rob.21715
Sui, X., Zheng, Y., Wei, B., Bi, H., Wu, J., Pan, X.,... Zhang, S. (2017). Choroid segmentation from Optical Coherence Tomography with graph-edge weights learned from deep convolutional neural networks. Neurocomputing, 237, 332-341. https://doi.org/10.1016/j.neucom.2017.01.023
Taylor, L., & Nitschke, G. (2017). Improving deep learning with generic data augmentation. arXiv preprint arXiv:1708.06020, 1542-1547. https://doi.org/10.48550/arXiv.1708.06020
Tripathi, M. K., & Maktedar, D. D. (2020). A role of computer vision in fruits and vegetables among various horticulture products of agriculture fields: A survey. Inf. Process. Agric., 7(2), 183-203. https://doi.org/10.1016/j.inpa.2019.07.003
Wang, W., Fu, Y., Dong, F., & Li, F. (2019). Semantic segmentation of remote sensing ship image via a convolutional neural networks model. IET Image Proc., 13(6), 1016-1022. https://doi.org/10.1049/iet-ipr.2018.5914
Xiao, C., Hao, X., Li, H., Li, Y., & Zhang, W. (2022). Real-time semantic segmentation with local spatial pixel adjustment. Image Vision Comput., 123, 104470. https://doi.org/10.1016/j.imavis.2022.104470
Xiong, J., Liu, Z., Lin, R., Chen, S., Chen, W., & Yang, Z. (2018a). Unmanned aerial vehicle vision detection technology of green mango on tree in natural environment. Trans. CSAM, 49(11), 23-29. https://doi.org/10.6041/j.issn.1000-1298.2018.11.003
Xiong, J., Liu, Z., Tang, L., Lin, R., Bu, R., & Peng, H. (2018b). Visual detection technology of green citrus under natural environment. Trans. CSAM, 49(4), 45-52. https://doi.org/10.6041/j.issn.1000-1298.2018.04.005
Xu, W., Chen, H., Su, Q., Ji, C., Xu, W., Memon, M.-S., & Zhou, J. (2019). Shadow detection and removal in apple image segmentation under natural light conditions using an ultrametric contour map. Biosyst. Eng., 184, 142-154. https://doi.org/10.1016/j.biosystemseng.2019.06.016
Yang, K., Wang, K., Bergasa, L. M., Romera, E., Hu, W., Sun, D.,... López, E. (2018). Unifying terrain awareness for the visually impaired through real-time semantic segmentation. Sensors, 18(5), 1506. https://doi.org/10.3390/s18051506
Yoon, Y., Jeon, H. G., Yoo, D., Lee, J. Y., & Kweon, I. S. (2015). Learning a deep convolutional network for light-field image super-resolution. Proc. 2015 IEEE Int. Conf. on Computer Vision Workshop (ICCVW) (pp. 57-65). IEEE. https://doi.org/10.1109/ICCVW.2015.17
Zou, K., Wang, H., Yuan, T., & Zhang, C. (2023). Multi-species weed density assessment based on semantic segmentation neural network. Precis. Agric., 24(2), 458-481. https://doi.org/10.1007/s11119-022-09953-9