Vision Transformer with Masked Autoencoder Pretraining for Quantification of Grape Downy Mildew 

Published by the American Society of Agricultural and Biological Engineers, St. Joseph, Michigan www.asabe.org

Citation:  2022 ASABE Annual International Meeting 2200508. (doi:10.13031/aim.202200508)
Authors:   Ertai Liu, Kaitlin M Gold, David Combs, Lance Cadle-Davidson, Yu Jiang
Keywords:   few-shot learning, computer vision, plant disease, downy mildew, machine learning, proximal sensing, vineyard management.

Abstract. The high-value grape industry has suffered from Downy Mildew (DM) worldwide for centuries. Assessing DM damage is critical for growers, who must take appropriate action to minimize losses, and for researchers, who study the disease and develop more advanced treatments. Traditionally, large-scale DM assessment has required laborious and costly in-field human observation. Automated methods based on computer vision and machine learning have been proposed to reduce the cost and increase the efficiency of DM assessment. However, most of these methods rely on supervised learning algorithms and require training sets with abundant, high-quality manual annotations. Although thousands of images can be collected in the vineyard with automated data collection systems, annotating them with reliable training labels is challenging and time-consuming. This study explores reducing the size of the labeled training dataset by using the state-of-the-art Masked Autoencoder (MAE) pretraining method, and quantitatively evaluates the performance of a DM detection and severity estimation pipeline trained with a limited number of annotated samples. The unlabeled images were first intentionally occluded by randomly generated masks and used to train the MAE network to recover each original image from its masked counterpart. The encoder of the MAE network was then used as the backbone of the classification network and fine-tuned on a significantly smaller labeled training dataset. The training results suggest that MAE pretraining may effectively reduce the number of carefully labeled images required to train the network for DM detection.
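
The occlusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 75% mask ratio, the patch size, and the toy 8x8 image are all assumptions chosen for illustration, and the abstract does not state the values used in the study. The sketch only shows how a random subset of image patches might be hidden before an MAE-style network is asked to reconstruct them.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio=0.75, seed=None):
    """Split the patch indices of a grid_h x grid_w patch grid into
    (visible, masked) lists. mask_ratio is an assumed value, not one
    taken from the paper."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def apply_mask(image, patch_size, masked, fill=0):
    """Return a copy of a 2D image (list of rows) in which every masked
    patch is overwritten with `fill`. The input image is not modified."""
    grid_w = len(image[0]) // patch_size
    out = [row[:] for row in image]
    for idx in masked:
        top = (idx // grid_w) * patch_size
        left = (idx % grid_w) * patch_size
        for y in range(top, top + patch_size):
            for x in range(left, left + patch_size):
                out[y][x] = fill
    return out


# Hypothetical 8x8 single-channel image split into 2x2 patches (a 4x4 patch grid).
image = [[1] * 8 for _ in range(8)]
visible, masked = random_patch_mask(4, 4, mask_ratio=0.75, seed=0)
occluded = apply_mask(image, patch_size=2, masked=masked)
```

In a pretraining setup of this kind, the occluded image (or only its visible patches) would be the network input and the original image the reconstruction target; after pretraining, the encoder would be reused as the classifier backbone and fine-tuned on the smaller labeled set, as the abstract describes.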
