An Introduction to Automated Visual Sensemaking for Animal Production Systems
Published in Case Studies and Modules for Data Science Instruction (): 1-7 (doi: ). Copyright American Society of Agricultural and Biological Engineers.
- Students learn basic concepts for a machine learning algorithm used in the context of computer vision
- Opportunities for student reflection and improvement of a data pipeline and workflow are presented
- This learning exercise utilizes a primary video data source that could also be collected by students
Abstract. This paper presents a learning exercise for automatically identifying and tracking individual livestock animals within a video data set. Consistent, reliable, and accurate tracking and monitoring of individual animals is one of the biggest challenges faced by the livestock industry today; furthermore, access to these data for analysis and modeling purposes may be limited by security protocols and/or the format in which the data might be available. This work uses a computer vision approach with a machine learning algorithm (Mask R-CNN) to automatically identify and visually track individual animals in a pen. Results are converted into a tabular format that can be translated into geometric input files for modeling purposes. This fast, approximation approach to assembling individual animal data could be of interest to animal science-related researchers and practitioners, to improve understanding and operation in animal production systems. The exercise is part of a special topics graduate class taught to science and engineering students who have a limited background in programming; the exercise occurs approximately halfway through the semester, and basic familiarity with image and video data, along with some classical computer vision knowledge, is required.
Course materials are in a zip file for download.
Keywords. Computer vision, Machine learning, Livestock, Video, R-CNN, Education.
1 Iowa State University, Ames, Iowa, USA.
2 San Jose State University, San Jose, California, USA.
* Correspondence: firstname.lastname@example.org.
Submitted for review on 1 December 2020 as manuscript number EOPD 14424; approved for publication as a Teaching Module by the Education, Outreach, & Professional Development Community of ASABE on 14 April 2021.
Continuous monitoring of livestock can inform the measurement of the decision-making variables to enable the detection of health status changes in animal production systems. New methods that improve continuous monitoring are therefore useful for animal health and welfare management. Cameras and video technologies, combined with improved cloud computation capabilities, are a non-invasive means of providing data to develop decision-making models for individual animals (Figure 1).
The motivation for this type of activity is the critical need for new and innovative tools that expand precision livestock farming for production-scale operations. As one basis of comparison, row crop producers have aggressively implemented assistive tools, such as robotics and computer vision techniques (Young et al., 2019), to improve yields and reduce costs, but the livestock industry has traditionally lagged behind in this part of the production agriculture community. Consistent, reliable, and accurate tracking and monitoring of individual animals is one of the biggest challenges faced by the livestock industry today. The overarching goal of continuous monitoring is to move from human observation, and even beyond hardware-based solutions such as radio frequency tags, to visual-only tracking through continuous recognition of individual animal identity, nutritional intake (feed and water), and behavior activity patterns (social, and aggression), from arrival to departure at a production facility.
Visual sensemaking is a non-invasive method that relies on a camera node (or array of cameras) to automatically detect, track, and measure objects of interest, often at high frequency, using machine learning techniques. Visual sensemaking tools have been developed for a variety of domains such as infrastructure, transportation, and even human tracking. Upscaling a visual-only tracking infrastructure to the pen scale and beyond is challenging because pigs are similar in appearance while confined in a space with uncertain environmental conditions. Producers currently rely on human caretakers to observe animals daily to detect feed, water, illness, and air quality issues. At present, the industry has few objective tools to identify and/or track individual animals throughout their growth phases, and those that exist are often invasive, costly, and susceptible to failure. Individual animal identification and tracking are central to optimal production goals, and this important step is the focus of this learning exercise.
Figure 1. Illustration of a practical application of automated visual sensemaking to observe, track, and estimate individual pig health and welfare. On the left, behavior time-series of visually identified pigs could be recorded (e.g., green might be normal; yellow is problematic; red is urgent); on the right, illness probabilities are assigned to individual pigs as a function of the state time histories. When a pig exceeds a probability threshold for a particular illness exhibited through behavior patterns, it could be isolated to prevent infection of other animals, reducing the amount of antibiotics needed and preventing the disease from progressing, shortening the illness time. All of this process would begin with the automated visual sensemaking of each animal in the video.
Materials and Methods
Most machine learning approaches involve statistical methods. The algorithms can be implemented in different programming languages, such as R, Java, C++, MATLAB®, and Python. Among these, Python is the most common programming language for machine learning due to its usability and wide variety of libraries. Useful machine learning Python libraries include Scikit-learn (scikit-learn.org), PyTorch (pytorch.org), and TensorFlow (tensorflow.org). Object detection models based on machine learning approaches can perform with very high accuracy. Among the most common are Fast Region-Based Convolutional Neural Network (Fast R-CNN), Faster R-CNN, You Only Look Once (YOLO), Single Shot Detector (SSD), and Mask R-CNN (Table 1). In this learning exercise, students will use Python Version 3.5 (python.org) and Mask R-CNN (discussed in the following sections), though any model could be used. A note to the reader: this learning material was developed with Python Version 3.5; although newer Python versions will likely not create programming problems through deprecations or library compatibility issues, the authors can only guarantee that Version 3.5 is stable for these activities.
Table 1. Comparison of four contemporary object detection models that can be used for video data sets.
Model Name   | Speed     | Accuracy       | Output                 | Training Data Processing Time
Faster R-CNN | Very Slow | High           | Box                    | Low
Mask R-CNN   | Slow      | Medium         | Mask / Object Boundary | Very High
SSD          | Fast      | Relatively Low | Box                    | Low
YOLO         | Fast      | Relatively Low | Box                    | Low
In this exercise, the instructor will guide students through six learning steps required to build a custom pig object detector. This same object detection framework can then additionally be used to train other custom object detectors. There are four learning objectives for this exercise:
Learn how to extract individual images from video. Pre-processing of visual data is a normal part of engaging with visual sensemaking workflows. The ability to gather the necessary data sets for a model is a critical first step.
Annotate image data sets for model development. Once the data set has been established, it is important to create different training and testing data sets for the successful development of a machine learning model.
Train a machine learning model to detect a specific object. There are several machine learning approaches available as libraries; being able to set up and execute a specific model (e.g., Mask R-CNN as used in this exercise) enables use of the data.
Evaluate the machine learning model. After the model produces a result, it is important to be able to assess the accuracy of the predictions. This is done so that the model can be applied to subsequent and/or similar data sets.
The following learning materials are provided or available for use by the students who participate in this exercise; materials that are specific to the instructor are designated as well. These files are provided through a link in the ASABE Technical Library due to their size.
ExtractImages (.py): Python script to read the video and extract images.
Pig(.py): Python script to train on the toy pig data set and implement color splash effect.
Inspect_Pig_Images (.ipynb): Code worksheet to analyze individual pig images.
Sample_Annotation_Example (.ipynb): Code and visualizations to test, debug, and evaluate the Mask R-CNN model.
maskrcnn (folder): All supporting data, weight, etc. files for the exercise.
Summary Document/Recorded Lesson: An outline and video of the lesson being used in a classroom environment is provided; the target audience of this resource is the instructor.
Lecture Outline and Learning Steps
The presentation of this exercise includes six specific learning steps. First, machine learning, its definition, and use is discussed through a collaborative group process with the students. Following a basic understanding of machine learning, the specific algorithm used in this exercise called Mask R-CNN is presented along with its general operational parts. A detailed description of the data is then given. Next, an overview of the transfer learning technique is given to aid in the model preparation process. An exercise of annotating the visual information for model inputs to be used as part of the transfer learning process is described. Finally, the students train the machine learning model.
Learning Step 1: Introduction to the Learning Exercise
The lecture for this exercise begins with the opening question: “What is machine learning?” and student responses are captured by a designated recorder. Typically, the answers will converge towards a description that machine learning is a computational approach that provides computers with the ability to learn. Algorithms that utilize machine learning construct mathematical models based on a sample (or series of samples over time) of various types of information, usually referred to as “training data”; this is done to enable prediction (or decision-making) without the algorithm being explicitly implemented to perform a specific task. One example of prediction with machine learning in visual sensemaking might include developing an algorithm to differentiate between a bicycle and a dog in a given image; in this exercise, students will use an existing algorithm (called Mask R-CNN and discussed in Learning Step 2) that uses machine learning to identify and track individual pigs in continuous video from a pen at an animal production facility. Further background information on machine learning, with additional examples if desired, can be provided by a short video shown to the students in either a synchronous or asynchronous manner (e.g., The Royal Society, 2020).
Learning Step 2: The Mask R-CNN Machine Learning Algorithm
Mask R-CNN is a machine learning algorithm that predicts the presence of an object in an image by generating bounding boxes and masks. There are two primary stages of Mask R-CNN: i) generation of proposals (or guesses) about the regions in an image under evaluation where there might be an object based on the input image data, and ii) prediction of the classification of an object, refinement of the bounding box, and generation of a mask at the pixel level of the object based on the first stage proposal. To provide a more general context and usage of Mask R-CNN, refer to Figure 1.
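The relationship between the second stage's pixel-level mask and its bounding box can be illustrated with a small sketch; this is not part of the exercise code, and `mask_to_bbox` is a hypothetical helper, but it shows that a box is simply the tightest rectangle enclosing a mask's nonzero pixels.

```python
import numpy as np

def mask_to_bbox(mask):
    """Tightest (y1, x1, y2, x2) box enclosing a binary mask's pixels."""
    ys, xs = np.nonzero(mask)
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1

# A toy 8x8 "pig" mask occupying rows 2-4 and columns 3-6.
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:7] = True
print(mask_to_bbox(mask))  # (2, 3, 5, 7)
```

Mask R-CNN predicts masks and boxes jointly rather than deriving one from the other, but the sketch conveys how the two outputs relate for each detected object.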
Learning Step 3: Description of the Visual Data Inputs
The raw data set provided with this learning exercise consists of a continuous video of 20 pigs in one standard production-sized pen at a commercial pig farm in Iowa. The videos were recorded in 4K color night vision using a custom LaView camera system (laviewsecurity.com) acting as the visual sensing node at each pen; this permitted consistent recording regardless of the environmental conditions (e.g., lights on or off; sun-up or down); this is necessary for continuous and precise measurements of each animal, and animals within groups, especially during dark or nighttime conditions. A note to the reader: any camera system that is high-definition or better could, in theory, work for this type of data acquisition as long as the visibility allowed for the pigs to be adequately viewed by a human. To process the data for the Mask R-CNN algorithm, it is first necessary to extract images from the videos. The included Python script file titled ExtractImages.py will complete this task. If another data set were desired (e.g., dairy cattle), a new and different video could be downloaded from an online image repository source or collected as a primary data source from a specific domain, and a corpus of images extracted, depending on the exercise goals.
Normally, a significant number of images, perhaps hundreds if not thousands, are required to train a machine learning model for visual sensemaking. In this exercise, students will take advantage of a technique called Transfer Learning that is explained in more detail in Learning Step 4. For this exercise, students might initially use, e.g., 70 images of pigs; however, if this proves too computationally intensive and thus time consuming on the available hardware, the number of images can likely be reduced to 48 (40 training images and 8 validation images) without error; these assumptions might not be valid under more rigorous research conditions and requirements. One final consideration in the use of any image or video data is that reducing the spatial dimensions (i.e., resolution) of the images will typically improve computational times, assuming the objects of interest remain readily discernable in the images.
Learning Step 4: The Transfer Learning Technique
A Transfer Learning approach means that a model is not required to be trained from scratch. In this exercise, students will rely on an included weights file that has already been trained to identify a variety of other classes, specifically on the Common Objects in Context (COCO) data set, which is a standardized data set for transfer learning (Microsoft, 2014). The COCO data set contains roughly 120,000 different images, so these weights are based on prior efforts that have already been utilized to identify 91 different classes of objects such as horses, forks, and stop signs. A weights file is a numerical set of data, produced through an iterative computational process, that represents the strengths of connections and biases between units in a neural network. It should be noted that the COCO data set does not contain a class for pig identification; therefore, the next step in the learning exercise is to identify the pigs in the training images so that the algorithm learns to find pigs.
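The core idea of transfer learning (keep the pretrained weights fixed and fit only a small task-specific part) can be demonstrated in miniature. The following is a numpy toy, not Mask R-CNN itself; the "backbone" below is a stand-in for the frozen COCO-trained layers, and the data are random placeholders for the annotated pig images.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained backbone": a fixed nonlinear feature extractor whose
# weights are frozen, standing in for the COCO-trained layers.
W_backbone = rng.normal(size=(16, 8))

def backbone(x):
    return np.tanh(x @ W_backbone)

# A small new-task data set (stand-in for the annotated pig images).
X = rng.normal(size=(40, 16))
y = rng.normal(size=(40, 1))

# Transfer learning: only the small task head is fit by least squares;
# the backbone weights above are never updated.
F = backbone(X)
W_head, *_ = np.linalg.lstsq(F, y, rcond=None)
residual = np.linalg.norm(F @ W_head - y)
baseline = np.linalg.norm(y)  # error of a head that predicts all zeros
```

Because only the small head is optimized, far less data and computation are needed than training the whole network from scratch, which is why 40-70 annotated images can suffice in this exercise.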
Learning Step 5: Annotation of the Visual Data
The Mask R-CNN algorithm requires that the input images be annotated (i.e., having each object of interest in the image labeled) to specify the area(s) of interest in an image. This step can often be time consuming and depends in part on the complexity of the scene and number of objects that need to be identified; therefore, a pre-annotated image data set of pigs is provided with the learning materials. Figure 3 illustrates an example of how an annotated image of pigs from this data set should appear once the annotation process is completed. If there is additional time allocated for image annotations as part of this learning exercise, students are recommended to use the Visual Geometry Group (VGG) Image Annotator software available as an online resource with detailed instructions (Dutta et al., 2016; Dutta and Zisserman, 2019). Once an annotation is completed, students will use (or download, if making a new data set) the resulting annotation file (in JSON format) and split the annotated images into training and validation sets. The training set is used to build a model for this data set, while the validation set is used to validate the model; the data points from the training set are not included in the validation set.
Figure 3. Annotated pigs in an image sample used in this exercise. Care should be taken to outline each of the pigs (or general objects of interest) as to not include features that might cause errors.
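Splitting the annotated images into training and validation sets can be done with a few lines of standard-library Python. This sketch assumes a VIA-style annotation dictionary keyed by image filename and is illustrative rather than the provided code; `split_annotations` is a hypothetical helper.

```python
import random

def split_annotations(annotations, n_val=8, seed=0):
    """Split an annotation dict (keyed by image) into train/val dicts."""
    keys = sorted(annotations)
    random.Random(seed).shuffle(keys)  # reproducible shuffle
    val_keys = set(keys[:n_val])
    train = {k: annotations[k] for k in keys if k not in val_keys}
    val = {k: annotations[k] for k in val_keys}
    return train, val

# 48 stand-in entries, matching the 40/8 split suggested earlier.
fake = {"img_%02d.jpg" % i: {"regions": []} for i in range(48)}
train, val = split_annotations(fake, n_val=8)
print(len(train), len(val))  # 40 8
```

Shuffling with a fixed seed keeps the split reproducible, and building the two dictionaries from disjoint key sets guarantees that no training image leaks into the validation set.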
Learning Step 6: Training the Machine Learning Model
The machine learning model to locate pigs in an image is now ready to be trained. To do so, students can use the included Python script titled Pig.py to train the model (Abdulla, 2017), or they can use the provided Jupyter notebook either locally or via Google Colab (colab.research.google.com). Students should set 30 iterations of 100 epochs each for the model training parameters. For the model to be trained, the included folder maskrcnn must be located in the same directory as the included Python script Pig.py. This folder provides other code dependencies needed by the software during the training process. Students should follow the instructions provided in the Python script files to begin the training process.
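Before training, the Mask R-CNN implementation needs a configuration object describing the new class. A sketch of what such a configuration might look like in the Matterport implementation (Abdulla, 2017) is shown below; it assumes the mrcnn package from that repository is importable, and the names and values are illustrative rather than the exercise's exact settings.

```python
from mrcnn.config import Config

class PigConfig(Config):
    """Illustrative training configuration for a single 'pig' class."""
    NAME = "pig"
    NUM_CLASSES = 1 + 1   # background + pig
    IMAGES_PER_GPU = 1    # reduces memory use on modest hardware
    STEPS_PER_EPOCH = 100 # illustrative value
```

The provided scripts contain the exercise's actual configuration; the point of the sketch is that the pretrained COCO head (91 classes) is replaced with one sized for the pig task.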
Results and Discussion
Training a model through transfer learning can be computationally intensive. It is recommended that a graphics processing unit (GPU), not a central processing unit (CPU), be used for this purpose; a GPU can be leveraged by using cloud computing services (e.g., Google Colab), as training this model locally could otherwise take several hours, depending on the local hardware. The distinction between CPU and GPU architecture is that a CPU is designed to handle a wide variety of tasks quickly but is constrained in the number of tasks that can run concurrently, while a GPU is designed to perform many simpler operations in parallel, which suits high-resolution image and video processing; the GPU does not affect CPU processes in this learning example.
As mentioned before, the resolution of the images also plays a role in the computational time involved; one good activity during this learning exercise is to have students resize their images so that each student trains with a different (but internally consistent) image resolution, and the resulting computation times can then be compared among students. After the training is complete, each student will have a trained weights file; this information is included in the data files. A result similar to that shown in Figure 4, indicating a loss (a measure of how poorly the model performs) of 0.2082 or similar, should be the outcome; the loss function assesses the difference between the predicted and actual model values.
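Resizing images for the timing comparison can be done in many ways; one minimal, dependency-light sketch uses numpy block averaging. `downscale` is a hypothetical helper, not part of the provided materials.

```python
import numpy as np

def downscale(image, factor=2):
    """Reduce spatial resolution by averaging factor x factor pixel blocks."""
    h, w = image.shape[:2]
    h2, w2 = h - h % factor, w - w % factor  # crop to a multiple of factor
    img = image[:h2, :w2].astype(float)
    blocks = img.reshape(h2 // factor, factor, w2 // factor, factor, -1)
    return blocks.mean(axis=(1, 3))

frame = np.ones((480, 640, 3))
small = downscale(frame, factor=4)
print(small.shape)  # (120, 160, 3)
```

Halving the resolution quarters the pixel count, so students should see computation time fall roughly in proportion, provided the pigs remain discernable at the reduced size.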
Figure 4. Screenshot of the output results of the machine learning model training step.
Figure 5. Final evaluation of the visual data by the model. As the video streams, the algorithm tracks masks and bounding boxes over each identified pig. The x-y coordinates of these objects are recorded to provide a tracking time series.
The bounding boxes shown in Figure 5 contain labels indicating the accuracy of the detection; students should make note of the accuracies and why they might be happening. For example, individual animals that might become occluded by another animal might receive a poor accuracy score or be inadvertently merged with another animal; this will vary with different video data that might be used in the exercise.
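Converting per-frame detections into the tabular tracking format described in the abstract can be sketched as follows; `detections_to_rows` is a hypothetical helper, and the box coordinates follow the common (y1, x1, y2, x2) convention.

```python
def detections_to_rows(frame_idx, boxes):
    """Turn per-frame (y1, x1, y2, x2) boxes into (frame, id, cx, cy) rows."""
    rows = []
    for pig_id, (y1, x1, y2, x2) in enumerate(boxes):
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # box center coordinates
        rows.append((frame_idx, pig_id, cx, cy))
    return rows

# Two detected pigs in frame 0.
rows = detections_to_rows(0, [(10, 20, 30, 60), (40, 50, 80, 90)])
print(rows)  # [(0, 0, 40.0, 20.0), (0, 1, 70.0, 60.0)]
```

Accumulating such rows over all frames yields the tracking time series that can later be translated into geometric input files for modeling purposes.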
An assessment is given as part of this learning exercise. Students are instructed to obtain (or capture as their own primary data set) video to which to apply this code for object recognition. Students prepare a one-page summary describing why they chose their data set and citing where it was obtained. Students should include at least one screenshot as a figure demonstrating the code working. Annotated figures should be included as needed. In the reports, students should describe, both verbally and numerically, how they had to change any of the parameters in the code (if necessary), and how effective the code was at recognizing the objects they chose to find. Finally, students are asked to describe ways in which the code might need to be improved (i.e., where, if at all, did the code fail to perform well). For the submission, students create and upload one zip file that contains the one-page report, the source code used, and the image(s) to which they applied the code.
This learning exercise utilized the Mask R-CNN algorithm to identify and track individual pigs in a video. Transfer learning utilized the COCO data set. The model was trained for 30 iterations each containing 100 epochs. A best-case loss function of 0.1298 was obtained at the end of training. The trained weights of this project can be used by students having similar data sets where there is a need to identify and track pigs (or this could be extended through a separate exercise to another video of different animals, plants, or even people). The desired output product of the exercise is to obtain an output video containing bounding boxes, accuracy labels, and masks, and to be able to explain the general steps involved and the consequences of changes at each step.
The data used in this exercise were made possible through two different projects funded by the Iowa Pork Producers Association and the National Pork Board.
All supporting files for this exercise are downloadable from a link at the ASABE technical library.
Abdulla, W. (2017). Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN
Dutta, A., Gupta, A., & Zisserman, A. (2016). VGG Image Annotator (VIA). http://www.robots.ox.ac.uk/~vgg/software/via/
Dutta, A., & Zisserman, A. (2019). The VIA Annotation Software for Images, Audio and Video. In Proceedings of the 27th ACM International Conference on Multimedia, doi: 10.1145/3343031.3350535.
Microsoft. (2014). Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, doi:10.1007/978-3-319-10602-1_48.
The Royal Society. (2020). What is Machine Learning? Machine Learning. https://royalsociety.org/topics-policy/projects/machine-learning/videos-and-background-information/
Young, S.N., Peschel, J.M., & Kayacan, E. (2019). Design and Field Evaluation of a Ground Robot for High-Throughput Phenotyping of Energy Sorghum. Precision Agriculture, 20(4): 697–772. doi:10.1007/s11119-018-9601-6.