Case Studies and Modules for Data Science Instruction
Problem-Centered Data Science Education in the Agricultural and Biological Engineering Classroom: Analyzing Air Quality Index Data in R

Natalie Nelson

Published in Case Studies and Modules for Data Science Instruction (): 1-6 (doi: ). Copyright American Society of Agricultural and Biological Engineers.


Abstract. In the presented lesson, students will wrangle and visualize U.S. Environmental Protection Agency Air Quality Index (AQI) data from North Carolina during a time period when a peat bog fire produced an expansive smoke plume that impacted large swaths of the state. The lesson includes a lecture and exercise, with the exercise requiring students to write code in R, an open-source statistical software environment. Lesson learning objectives include (1) Describe the goals of an exploratory data analysis, (2) Apply an exploratory data analysis in R, (3) Explain what the AQI is and how it is calculated, and (4) Assess how the spatial resolution of data influences conclusions. Materials provided with the lesson include lecture slides, data, an R script, and a recorded lesson synopsis.

A video accompanies this article. Course materials are available for download as a zip file.

Keywords. Active learning, problem-centered education, data acumen, agricultural and biological engineering, wildfires.


In 2018, the National Academies of Sciences, Engineering, and Medicine (NASEM) published the report Data Science for Undergraduates: Opportunities and Options (NASEM, 2018), in which the authors called on academic institutions to incorporate data science education across all undergraduate curricula, specifically by providing training in skills that underlie data acumen, such as:

The lesson presented here aims to provide Agricultural and Biological Engineering students with the opportunity to develop data acumen and the skills outlined above by analyzing U.S. Environmental Protection Agency (EPA) Air Quality Index (AQI) observations (USEPA, 2014) using R, an open-source statistical software environment (R Core Team, 2020). The lesson implements problem-centered and active learning approaches by motivating the analysis with a real-world case study, involving students in a critical assessment of the ways in which data are used in practice, and engaging students in the creation of code to analyze real observations collected with environmental monitors (Chi, 2009; Chin and Chia, 2004; Woods, 2014). Active and problem-centered instructional strategies have proven effective at facilitating student learning and fostering enthusiasm for the subject matter, as well as helping to close achievement gaps (Freeman et al., 2014; Haak et al., 2017).

This article provides an overview of the lesson, including the learning objectives, case study, results of the analysis, and ideas as to how this lesson could be further refined and enhanced by other instructors. Additionally, lecture slides, sample data, R code, an instructor key, and discussion questions are provided.

Materials and Methods

Lesson Learning Objectives and Course Context

The learning objectives for the lesson are to: (1) Describe the goals of an exploratory data analysis (EDA), (2) Apply the EDA checklist in R, (3) Explain what the AQI is and how it is calculated, and (4) Assess how the spatial resolution of data influences conclusions. This lesson is currently taught as part of an asynchronous online three-credit hour course on “R Coding for Data Management and Analysis” that includes graduate and undergraduate sections. The course aims to provide students with foundational coding skills in R. No prior coding experience is required or expected. The course primarily uses functions in the Tidyverse packages (Wickham et al., 2019). Prior to the lesson presented here, the students have learned how to read, summarize, and visualize data in RStudio using functions in the readr, dplyr, and ggplot2 packages (all included in the Tidyverse). R for Data Science (Wickham and Grolemund, 2017), which is available as a free e-book, is the primary reference text for instruction on Tidyverse packages and functions. This lesson is delivered in the fourth week of a 16-week semester.

Provided Materials


The lesson begins with a lecture in which the instructor introduces the concepts of EDA and the AQI. The Art of Data Science (Peng and Matsui, 2015) is used as the primary reference to support instruction of EDA; the text is available in print, but also online as a free e-book. In The Art of Data Science, Peng and Matsui offer an “EDA Checklist”, which includes the following steps: (1) Formulate your question, (2) Read in your data, (3) Check the packaging, (4) Look at the top and bottom of your data, (5) Check your “n”s, (6) Validate with at least one external data source, (7) Make a plot, (8) Try the easy solution first, and (9) Follow up. To illustrate the use of the EDA Checklist in practice, the text works through an example in which the authors evaluate ozone levels across the United States. This example served as inspiration for the lesson described here, which also focuses on air pollution, but across a smaller geographic area and in the context of a specific case study.
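Steps 3 through 5 of the checklist map onto a handful of base R calls. The sketch below is illustrative only: the data frame `aqi_toy`, its column names (`state`, `county`, `date`, `aqi`), and its values are invented for demonstration and are not the lesson's dataset. In the lesson itself, the data would first be read from the provided file, for example with `readr::read_csv()`.

```r
# Toy stand-in for a daily AQI table.
# Column names and values are hypothetical, invented for illustration.
aqi_toy <- data.frame(
  state  = c("North Carolina", "North Carolina", "Virginia", "North Carolina"),
  county = c("Pitt", "Wayne", "Fairfax", "Pitt"),
  date   = as.Date(c("2008-06-10", "2008-06-10", "2008-06-10", "2008-06-11")),
  aqi    = c(120, 95, 40, NA)
)

# (3) Check the packaging: dimensions and column types
dim(aqi_toy)
str(aqi_toy)

# (4) Look at the top and bottom of the data
head(aqi_toy, n = 2)
tail(aqi_toy, n = 2)

# (5) Check your "n"s: row counts per state, missing values, date coverage
n_by_state <- table(aqi_toy$state)
n_missing  <- sum(is.na(aqi_toy$aqi))
date_range <- range(aqi_toy$date)
```

Comparing `n_by_state` and `date_range` against what the data source is expected to contain is one simple way to act on step 5 before any plotting begins.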

When introducing the AQI, the instructor describes how the AQI varies from 0 to 500 for regulated pollutants, with values of 51-100 corresponding to moderate levels of health concern for the public, 101-150 to unhealthy levels for sensitive groups (e.g. those with asthma), 151-200 to unhealthy levels, 201-300 to very unhealthy levels, and 301-500 to hazardous conditions (USEPA, 2014). The AQI is calculated in relation to the national air quality standard for a given pollutant. When AQI values are greater than 100, conditions exceed air quality standards (USEPA, 2014). The AQI is specifically calculated as follows:

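$$\mathrm{AQI} = \frac{\mathrm{AQI}_{Hi} - \mathrm{AQI}_{Lo}}{\mathrm{PM2.5}_{Hi} - \mathrm{PM2.5}_{Lo}} \left(\mathrm{PM2.5}_{Obs} - \mathrm{PM2.5}_{Lo}\right) + \mathrm{AQI}_{Lo}$$

where $\mathrm{PM2.5}_{Obs}$ is the observed concentration of PM2.5 (µg m⁻³). Concentrations fall in pre-defined ranges specified by the U.S. EPA, which are bracketed by values of $\mathrm{PM2.5}_{Hi}$ and $\mathrm{PM2.5}_{Lo}$. $\mathrm{AQI}_{Hi}$ and $\mathrm{AQI}_{Lo}$ are the AQI values corresponding to $\mathrm{PM2.5}_{Hi}$ and $\mathrm{PM2.5}_{Lo}$, respectively. $\mathrm{PM2.5}_{Hi}$, $\mathrm{PM2.5}_{Lo}$, $\mathrm{AQI}_{Hi}$, and $\mathrm{AQI}_{Lo}$ are tabulated in Technical Assistance Document for the Reporting of Daily Air Quality – the Air Quality Index (AQI) (USEPA, 2018).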
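The calculation can be sketched in R as a linear interpolation within the bracket that contains the observed concentration. This is a minimal sketch, not the lesson's code: the function name `pm25_to_aqi` and the table `pm25_breaks` are hypothetical, and the breakpoint values are transcribed as I understand them from USEPA (2018), so they should be checked against that document before use. The sketch also assumes concentrations are already reported to one decimal place and applies simple rounding to the result.

```r
# PM2.5 (24-hour) brackets and the AQI values they map to.
# ASSUMPTION: values transcribed from USEPA (2018); verify before use.
pm25_breaks <- data.frame(
  pm_lo  = c(0.0, 12.1, 35.5, 55.5, 150.5, 250.5, 350.5),
  pm_hi  = c(12.0, 35.4, 55.4, 150.4, 250.4, 350.4, 500.4),
  aqi_lo = c(0, 51, 101, 151, 201, 301, 401),
  aqi_hi = c(50, 100, 150, 200, 300, 400, 500)
)

# Convert PM2.5 concentrations (ug/m3) to AQI values.
# Returns NA for missing inputs and for values outside every bracket.
pm25_to_aqi <- function(pm_obs, breaks = pm25_breaks) {
  vapply(pm_obs, function(x) {
    if (is.na(x)) return(NA_real_)
    i <- which(x >= breaks$pm_lo & x <= breaks$pm_hi)
    if (length(i) == 0) return(NA_real_)
    b <- breaks[i, ]
    # AQI = (AQI_Hi - AQI_Lo) / (PM_Hi - PM_Lo) * (PM_Obs - PM_Lo) + AQI_Lo
    round((b$aqi_hi - b$aqi_lo) / (b$pm_hi - b$pm_lo) * (x - b$pm_lo) + b$aqi_lo)
  }, numeric(1))
}

# Example call on a few concentrations, including a missing value
pm25_to_aqi(c(10.0, 35.5, 45.0, NA))
```

Passing the breakpoints as an argument keeps the interpolation separate from the table, so a different pollutant or a revised table can be swapped in without touching the function body.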

Additionally, the instructor should emphasize that the AQI is calculated from measurements collected with ground instruments at specific locations, referred to as “air quality monitors” by the U.S. EPA. For more information on the sensing approaches used to monitor PM2.5, refer to Guidance for Using Continuous Monitors in PM2.5 Monitoring Networks (USEPA, 1998). Because the lesson is taught in a coding course, the lecture emphasizes programming rather than instrumentation and does not currently provide an overview of the instruments used to monitor air pollutants.

Lastly, the lecture introduces the case study that serves as the backdrop for the coding exercise in which students will run an EDA on an AQI dataset.

Case Study: Peat Bog Fire Smoke Plume in North Carolina, 2008

The exercise accompanying this lesson involves analysis of AQI data produced from observations collected in North Carolina (NC) in June 2008, when a peat bog fire in the Pocosin Lakes National Wildlife Refuge produced a large smoke plume over much of the eastern and central regions of the state. A multidisciplinary team of researchers investigated relationships between the smoke plume and cardiopulmonary emergency department visits in areas affected by the smoke, and found significant increases in emergency department visits in counties exposed to the smoke plume (Rappold et al., 2011). To determine which counties were exposed to the smoke plume, satellite measurements of aerosol optical depth (AOD) were evaluated in relation to county boundaries. The AOD product is gridded with 16 km² pixels. AOD values are unitless and range from 0 to 2, with larger values corresponding to higher concentrations of atmospheric particles and reduced visibility. In eastern NC, background AOD levels are generally less than 0.5. In the Rappold et al. (2011) study, the smoke plume was assumed to correspond to areas where AOD values were greater than or equal to 1.25. The article includes a figure presenting AOD measurements across eastern NC for three dates: June 10, 11, and 12, 2008. The authors argue that using AOD to identify smoke-exposed counties is effective since AOD is known to correlate with the AQI.

For the exercise, the students are asked: Do AQI values, calculated from on-the-ground measurements, match with the AOD measurements captured by satellite during the peat bog fires of 2008? More specifically, among the counties where AQI data are available, what were the maximum AQI values observed during the peat fires of 2008 in counties identified as “exposed” in the Rappold et al. (2011) study? What were the maximum AQI values during this time period in counties that were not labelled as exposed in the Rappold et al. study?

The students are provided with a comma-separated values (CSV) file containing daily AQI values for PM2.5 across all U.S. EPA air pollution monitoring sites in the country in 2008. The students are also provided with an R script template, which includes subheaders for the different steps to be completed in the analysis. The subheaders serve as analysis guideposts and correspond to the steps outlined in the EDA Checklist (Peng and Matsui, 2015). In R, the students produce visualizations and summary tables to address the questions outlined above.
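A summary table addressing the exercise questions could be built with dplyr along the following lines. This is a sketch under stated assumptions, not the provided R script: the column names (`state`, `county`, `date`, `aqi`), the placeholder county `OtherCounty`, and all AQI values are invented for illustration, and the list of exposed counties is limited to the four named in this article's Figure 2 caption rather than the full set identified by Rappold et al. (2011).

```r
library(dplyr)

# Counties labelled "exposed"; only the four named in the Figure 2 caption
exposed_counties <- c("Duplin", "Lenoir", "Pitt", "Wayne")

# Toy data: hypothetical column names and invented values
aqi_toy <- tibble(
  state  = c("North Carolina", "North Carolina", "North Carolina",
             "North Carolina", "Virginia"),
  county = c("Pitt", "Pitt", "Wayne", "OtherCounty", "Fairfax"),
  date   = as.Date(c("2008-06-10", "2008-06-11", "2008-06-11",
                     "2008-06-11", "2008-06-11")),
  aqi    = c(80, 160, 120, 60, 45)
)

# Maximum June 2008 AQI per NC county, flagged by exposure label
max_aqi <- aqi_toy %>%
  filter(state == "North Carolina",
         date >= as.Date("2008-06-01"),
         date <= as.Date("2008-06-30")) %>%
  mutate(exposed = county %in% exposed_counties) %>%
  group_by(exposed, county) %>%
  summarise(max_aqi = max(aqi, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(max_aqi))

max_aqi
```

Keeping the exposure label as a column, rather than filtering to exposed counties first, lets the same table answer both questions posed in the exercise.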

Lesson Delivery

The instructor delivers the lecture in which EDA and AQI are reviewed, and then the instructor works through the exercise in R. The instructor narrates each step of the exercise and demonstrates how code is written in R to implement the EDA Checklist. Because this lesson is currently delivered through an asynchronous online course, the coding demonstration occurs through a video recording. However, the exercise could easily be delivered in-person instead of through a video recording, and could also be converted into a homework assignment as long as the students had sufficient prior coding experience such that they could work independently.

Results and Discussion

Results from Exercise

From the exercise, students generate multiple plots and one summary table. The main findings include:

Figure 2. AQI values calculated from PM2.5 observations collected in June 2008 from monitoring sites located in Duplin, Lenoir, Pitt, and Wayne Counties, NC. These counties were all labelled as “exposed” to the smoke plume in the Rappold et al. (2011) study. The blue vertical lines mark June 10 and 12, 2008, bracketing the period for which Rappold et al. (2011) analyzed satellite images (June 10, 11, and 12, 2008). The horizontal red lines are shown at 100 and 150, as these are the AQI thresholds corresponding to unhealthy air quality conditions for sensitive groups and unhealthy conditions for all groups, respectively.
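A plot with the reference lines described in the caption could be sketched in ggplot2 as follows. The data here are invented toy values with hypothetical column names, and the choice to facet by county is an assumption; the layout of the original figure may differ.

```r
library(ggplot2)

# Toy data: hypothetical column names and invented AQI values
aqi_toy <- data.frame(
  county = rep(c("Duplin", "Lenoir", "Pitt", "Wayne"), each = 3),
  date   = rep(as.Date(c("2008-06-09", "2008-06-11", "2008-06-13")), times = 4),
  aqi    = c(60, 140, 70, 55, 155, 65, 50, 120, 60, 45, 110, 58)
)

p <- ggplot(aqi_toy, aes(x = date, y = aqi)) +
  geom_line() +
  geom_point() +
  # Blue vertical lines at June 10 and 12, 2008
  geom_vline(xintercept = as.Date(c("2008-06-10", "2008-06-12")),
             colour = "blue") +
  # Red horizontal lines at the AQI thresholds of 100 and 150
  geom_hline(yintercept = c(100, 150), colour = "red") +
  facet_wrap(~ county) +
  labs(x = "Date", y = "AQI (PM2.5)")

p
```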


After going through the analysis, the students are asked to reflect on the results. The instructor could consider posing questions to encourage reflection on broader repercussions of the analysis, including:


The lesson presented here exemplifies how students can be engaged in real-world problem-solving and data analysis with fairly basic coding skills. Instructors interested in implementing this lesson may consider the following ideas as to how the lesson could be improved and adapted:


Acknowledgements

This work is supported by the USDA National Institute of Food and Agriculture, Hatch project 1016068. The author thanks Sierra Young and Dennis Buckmaster for the invitation to participate in a special collection of data science lessons for agricultural and biological engineering students.


References

Chi, M.T.H., 2009. Active-Constructive-Interactive: A Conceptual Framework for Differentiating Learning Activities. Top. Cogn. Sci. 1, 73–105.

Chin, C., Chia, L.G., 2004. Problem-based learning: Using students’ questions to drive knowledge construction. Sci. Educ. 88, 707–727.

Freeman, S., Eddy, S.L., McDonough, M., Smith, M.K., Okoroafor, N., Jordt, H., Wenderoth, M.P., 2014. Active learning increases student performance in science, engineering, and mathematics. Proc. Natl. Acad. Sci. 111, 8410–8415.

Haak, D.C., Hillerislambers, J., Pitre, E., Freeman, S., 2017. Increased Structure and Active Learning Reduce the Achievement Gap in Introductory Biology. Science 332, 1213–1216.

NASEM, 2018. Data Science for Undergraduates: Opportunities and Options. Washington, DC.

Peng, R., Matsui, E., 2015. The Art of Data Science. Leanpub and Skybrude Consulting, LLC.

R Core Team, 2020. R: A language and environment for statistical computing.

Rappold, A.G., Stone, S.L., Cascio, W.E., Neas, L.M., Kilaru, V.J., Carraway, M.S., Szykman, J.J., Ising, A., Cleve, W.E., Meredith, J.T., Vaughan-Batten, H., Deyneka, L., Devlin, R.B., 2011. Peat bog wildfire smoke exposure in rural North Carolina is associated with cardiopulmonary emergency department visits assessed through syndromic surveillance. Environ. Health Perspect. 119, 1415–1420.

USEPA, 2018. Technical Assistance Document for the Reporting of Daily Air Quality – the Air Quality Index (AQI), EPA 454/B-18-007. Research Triangle Park, NC.

USEPA, 2014. Air Quality Index (AQI): A Guide to Air Quality and Your Health, EPA-456/F-14-002. Research Triangle Park, NC.

USEPA, 1998. Guidance for Using Continuous Monitors in PM2.5 Monitoring Networks, EPA-454/R-98-012. Research Triangle Park, NC.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T., Miller, E., Bache, S., Müller, K., Ooms, J., Robinson, D., Seidel, D., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., Yutani, H., 2019. Welcome to the Tidyverse. J. Open Source Softw. 4, 1686.

Wickham, H., Grolemund, G., 2017. R for Data Science, 1st ed. O’Reilly Media.

Woods, D.R., 2014. Problem-oriented learning, problem-based learning, problem-based synthesis, process oriented guided inquiry learning, Peer-Led team learning, model-eliciting activities, and project-based learning: What is best for you? Ind. Eng. Chem. Res. 53, 5337–5354.