Case Studies and Modules for Data Science Instruction
Problem-Centered Data Science Education in the Agricultural and Biological Engineering Classroom: Analyzing Air Quality Index Data in R
Published in Case Studies and Modules for Data Science Instruction (): 1-6 (doi: ). Copyright American Society of Agricultural and Biological Engineers.
- Students analyze real-world observations in R by writing code for data wrangling and visualization
- Students are provided with opportunities to think critically about the spatial resolution of data
- The exercise is presented in the context of a real-world case study involving wildfires
Abstract. In the presented lesson, students will wrangle and visualize U.S. Environmental Protection Agency Air Quality Index (AQI) data from North Carolina during a time period when a peat bog fire produced an expansive smoke plume that impacted large swaths of the state. The lesson includes a lecture and exercise, with the exercise requiring students to write code in R, an open-source statistical software environment. Lesson learning objectives include (1) Describe the goals of an exploratory data analysis, (2) Apply an exploratory data analysis in R, (3) Explain what the AQI is and how it is calculated, and (4) Assess how the spatial resolution of data influences conclusions. Materials provided with the lesson include lecture slides, data, an R script, and a recorded lesson synopsis.
See this video: https://vimeo.com/434334322. Course materials are in a zip file for download.
Keywords. Active learning, problem-centered education, data acumen, agricultural and biological engineering, wildfires.
In 2018, the National Academies of Sciences, Engineering, and Medicine (NASEM) published the report Data Science for Undergraduates: Opportunities and Options (NASEM, 2018), in which the authors called on academic institutions to incorporate data science education across all undergraduate curricula, specifically by providing training in skills that underlie data acumen, such as:
- Coupling programs or codes into computational workflows;
- Ingesting, cleaning, and wrangling data;
- Evaluating how data quality may impact data processing and analysis workflows;
- Questioning the use of analytical methods and thinking critically about their applications;
- Communicating about the use of computational workflows and challenges associated with data and analyses.
The lesson presented here aims to provide Agricultural and Biological Engineering students with the opportunity to develop data acumen and the skills outlined above by analyzing U.S. Environmental Protection Agency (EPA) Air Quality Index (AQI) observations (USEPA, 2014) using R, an open-source statistical software environment (R Core Team, 2020). The lesson implements problem-centered and active learning approaches by motivating the analysis with a real-world case study, involving students in a critical assessment of the ways in which data are used in practice, and engaging students in the creation of code to analyze real observations collected with environmental monitors (Chi, 2009; Chin and Chia, 2004; Woods, 2014). Active and problem-centered instructional strategies have proven effective at facilitating student learning and fostering enthusiasm for the subject matter, as well as helping to close achievement gaps (Freeman et al., 2014; Haak et al., 2017).
This article provides an overview of the lesson, including the learning objectives, case study, results of the analysis, and ideas as to how this lesson could be further refined and enhanced by other instructors. Additionally, lecture slides, sample data, R code, an instructor key, and discussion questions are provided.
Materials and Methods
Lesson Learning Objectives and Course Context
The learning objectives for the lesson are to: (1) Describe the goals of an exploratory data analysis (EDA), (2) Apply the EDA checklist in R, (3) Explain what the AQI is and how it is calculated, and (4) Assess how the spatial resolution of data influences conclusions. This lesson is currently taught as part of an asynchronous online three-credit-hour course on “R Coding for Data Management and Analysis” that includes graduate and undergraduate sections. The course aims to provide students with foundational coding skills in R. No prior coding experience is required or expected. The course primarily uses functions in the Tidyverse packages (Wickham et al., 2019). Prior to the lesson presented here, the students have learned how to read, summarize, and visualize data in RStudio using functions in the readr, dplyr, and ggplot2 packages (all included in the Tidyverse). R for Data Science (Wickham and Grolemund, 2017) is the primary reference text for instruction on Tidyverse packages and functions and is available as a free e-book (https://r4ds.had.co.nz/). This lesson is delivered in the fourth week of a 16-week semester.
The materials provided with the lesson include:
- Lecture slides (.ppt): The lecture slides provide a brief overview of the goals of an EDA, the EDA Checklist, and the AQI.
- Data (.csv): The data file was downloaded from the U.S. EPA for the year 2008 and includes AQI observations for particulate matter of diameter 2.5 micrometers or less (PM2.5) for all monitoring stations across the U.S. Although the provided data file spans many states and includes measurements from January through December, only the measurements collected in North Carolina during the month of June are analyzed.
- Metadata (.csv): The metadata file describes the information presented in each column of the data file. The metadata are sourced directly from the U.S. Environmental Protection Agency AirData Download Files Documentation, Version 3.0.0 (December 1, 2015).
- R script template (.R): The script template includes comments and a few lines of code. The students are provided the template and write the code in the template directly.
- R script key (.R): The script key includes the exercise with all code completed. The key is intended to be used by the instructor.
- Recorded lesson overview for instructors (.mp4): A video recording providing a lesson synopsis. Instructors are the target audience. A recording of the coding exercise is also available upon request.
The lesson begins with a lecture in which the instructor introduces the concepts of EDA and the AQI. The Art of Data Science (Peng and Matsui, 2015) is used as the primary reference to support instruction of EDA; the text is available in print, but also online as a free e-book. In The Art of Data Science, Peng and Matsui offer an “EDA Checklist”, which includes the following steps: (1) Formulate your question, (2) Read in your data, (3) Check the packaging, (4) Look at the top and bottom of your data, (5) Check your “n”s, (6) Validate with at least one external data source, (7) Make a plot, (8) Try the easy solution first, and (9) Follow up. To illustrate the use of the EDA Checklist in practice, the text works through an example in which the authors evaluate ozone levels across the United States. This example served as inspiration for the lesson described here, which also focuses on air pollution, but across a smaller geographic area and in the context of a specific case study.
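Steps 2 through 5 of the EDA Checklist map onto a few lines of R. The following is a minimal sketch, not the lesson's script: the column names (`State Name`, `County Name`, `Date Local`, `AQI`) and the four records are invented for illustration, and a temporary file stands in for the provided data file.

```r
library(readr)
library(dplyr)

# Write a tiny stand-in CSV to a temporary file (all values are invented)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "State Name,County Name,Date Local,AQI",
  "North Carolina,Wake,2008-06-10,95",
  "North Carolina,Wake,2008-06-11,140",
  "North Carolina,Pitt,2008-06-11,160",
  "Virginia,Fairfax,2008-06-11,60"
), csv_path)

# Step 2: read in the data
aqi <- read_csv(csv_path, show_col_types = FALSE)

# Step 3: check the packaging (dimensions and column types)
dim(aqi)
glimpse(aqi)

# Step 4: look at the top and bottom of the data
head(aqi, 2)
tail(aqi, 2)

# Step 5: check the "n"s, here the number of records per state
n_by_state <- count(aqi, `State Name`)
n_by_state
```

With the full data file, the same calls would show whether the number of rows, the date range, and the number of states match what is expected before any filtering or plotting takes place.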
When introducing the AQI, the instructor describes how the AQI varies from 0 to 500 for regulated pollutants, with values of 0-50 corresponding to good air quality, 51-100 to moderate levels of health concern for the public, 101-150 to unhealthy levels for sensitive groups (e.g., those with asthma), 151-200 to unhealthy levels, 201-300 to very unhealthy levels, and 301-500 to hazardous conditions (USEPA, 2014). The AQI is calculated in relation to the national air quality standard for a given pollutant. When AQI values are greater than 100, conditions exceed air quality standards (USEPA, 2014). The AQI is specifically calculated as follows:
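$$\mathrm{AQI} = \frac{\mathrm{AQI}_{Hi} - \mathrm{AQI}_{Lo}}{\mathrm{PM2.5}_{Hi} - \mathrm{PM2.5}_{Lo}} \left( \mathrm{PM2.5}_{Obs} - \mathrm{PM2.5}_{Lo} \right) + \mathrm{AQI}_{Lo}$$

where $\mathrm{PM2.5}_{Obs}$ is the observed concentration of PM2.5 (µg m⁻³). Concentrations fall in pre-defined ranges specified by the U.S. EPA, which are bracketed by values of $\mathrm{PM2.5}_{Hi}$ and $\mathrm{PM2.5}_{Lo}$. $\mathrm{AQI}_{Hi}$ and $\mathrm{AQI}_{Lo}$ are the AQI values corresponding to $\mathrm{PM2.5}_{Hi}$ and $\mathrm{PM2.5}_{Lo}$, respectively. $\mathrm{PM2.5}_{Hi}$, $\mathrm{PM2.5}_{Lo}$, $\mathrm{AQI}_{Hi}$, and $\mathrm{AQI}_{Lo}$ are tabulated in Technical Assistance Document for the Reporting of Daily Air Quality – the Air Quality Index (AQI) (USEPA, 2018).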
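The calculation is a linear interpolation between the AQI values that bracket the observed concentration, which can be sketched in base R. The function names `pm25_to_aqi()` and `aqi_category()` are hypothetical. The breakpoint values are assumed to correspond to the 24-hour PM2.5 table in USEPA (2018) and should be checked against that document before use. The sketch also skips the truncation of concentrations to one decimal place that the EPA guidance is understood to apply before the bracket lookup.

```r
# 24-hour PM2.5 breakpoints (µg m^-3) and matching AQI values.
# ASSUMPTION: these values mirror the table in USEPA (2018); verify before use.
pm25_breakpoints <- data.frame(
  pm_lo  = c(0.0, 12.1, 35.5, 55.5, 150.5, 250.5, 350.5),
  pm_hi  = c(12.0, 35.4, 55.4, 150.4, 250.4, 350.4, 500.4),
  aqi_lo = c(0, 51, 101, 151, 201, 301, 401),
  aqi_hi = c(50, 100, 150, 200, 300, 400, 500)
)

# Convert observed PM2.5 concentrations to AQI values (vectorized)
pm25_to_aqi <- function(pm_obs, bp = pm25_breakpoints) {
  # Index of the bracket whose lower bound is the largest one <= pm_obs
  i <- findInterval(pm_obs, bp$pm_lo)
  # Missing, negative, or beyond-table concentrations return NA
  out_of_range <- is.na(pm_obs) | pm_obs < 0 | pm_obs > max(bp$pm_hi)
  i[out_of_range] <- NA
  # Linear interpolation within the bracket
  aqi <- (bp$aqi_hi[i] - bp$aqi_lo[i]) / (bp$pm_hi[i] - bp$pm_lo[i]) *
    (pm_obs - bp$pm_lo[i]) + bp$aqi_lo[i]
  round(aqi)
}

# Label AQI values with the health categories described in the lecture
aqi_category <- function(aqi) {
  cut(aqi,
      breaks = c(-Inf, 50, 100, 150, 200, 300, 500),
      labels = c("Good", "Moderate", "Unhealthy for Sensitive Groups",
                 "Unhealthy", "Very Unhealthy", "Hazardous"))
}

# Example: a concentration of 40 µg m^-3 falls in the 35.5-55.4 bracket
example_aqi <- pm25_to_aqi(40)
example_category <- as.character(aqi_category(example_aqi))
```

Working one value by hand alongside the function (for example, 40 µg m⁻³) gives students a way to confirm that they understand which bracket applies and how the interpolation proceeds.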
Additionally, the instructor should emphasize that the AQI is calculated from measurements collected with ground instruments at specific locations, referred to as “air quality monitors” by the U.S. EPA. For more information on the sensing approaches used to monitor PM2.5, refer to Guidance for Using Continuous Monitors in PM2.5 Monitoring Networks (USEPA, 1998). Because the lesson is taught in a coding course, the emphasis is on programming rather than instrumentation, and the lecture does not currently provide an overview of the instruments used to monitor air pollutants.
Lastly, the lecture introduces the case study that serves as the backdrop for the coding exercise in which students will run an EDA on an AQI dataset.
Case Study: Peat Bog Fire Smoke Plume in North Carolina, 2008
The exercise accompanying this lesson involves analysis of AQI data produced from observations collected in North Carolina (NC) in June 2008, when a peat bog fire in the Pocosin Lakes National Wildlife Refuge produced a large smoke plume over much of the eastern and central regions of the state. A multidisciplinary team of researchers investigated relationships between the smoke plume and cardiopulmonary emergency department visits in areas affected by the smoke, and found significant increases in emergency department visits in counties exposed to the smoke plume (Rappold et al., 2011). To determine which counties were exposed to the smoke plume, satellite measurements of aerosol optical depth (AOD) were evaluated in relation to county boundaries. The AOD product is gridded with 16 km² pixels. AOD values are unitless and range from 0 to 2, with larger values corresponding to higher concentrations of atmospheric particles and reduced visibility. In eastern NC, background AOD levels are generally less than 0.5. In the Rappold et al. (2011) study, the smoke plume was assumed to correspond to areas where AOD values were greater than or equal to 1.25. The article includes a figure presenting AOD measurements across eastern NC for three dates: June 10, 11, and 12, 2008. The authors argue that using AOD to identify smoke-exposed counties is effective since AOD is known to correlate with the AQI.
For the exercise, the students are asked: Do AQI values, calculated from on-the-ground measurements, agree with the AOD measurements captured by satellite during the peat bog fires of 2008? More specifically, among the counties where AQI data are available, what were the maximum AQI values observed during the peat bog fires of 2008 in counties identified as “exposed” in the Rappold et al. (2011) study? What were the maximum AQI values during this time period in counties that were not labeled as exposed in the Rappold et al. (2011) study?
The students are provided with a comma-separated values (CSV) file including daily AQI values for PM2.5 across all U.S. EPA air pollution monitoring sites in the country in 2008. The students are also provided with an R script template, which includes subheaders for different steps to be completed in the analysis. The subheaders serve as analysis guideposts and correspond to the steps outlined in the EDA Checklist (Peng and Matsui, 2015). In R, the students produce visualizations and summary tables to address the questions outlined above.
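A summary table addressing these questions could be built with dplyr along the following lines. This is a sketch under stated assumptions, not the script key: the column names and AQI values are invented, and `exposed_counties` holds only the four exposed counties named in this article (Duplin, Lenoir, Pitt, and Wayne), not the full list from Rappold et al. (2011).

```r
library(dplyr)

# Invented daily AQI records; column names are assumed for illustration
aqi_daily <- tibble(
  state  = c("North Carolina", "North Carolina", "North Carolina",
             "North Carolina", "North Carolina", "Virginia"),
  county = c("Duplin", "Duplin", "Wake", "Wake", "Pitt", "Fairfax"),
  date   = as.Date(c("2008-06-10", "2008-06-11", "2008-06-11",
                     "2008-06-12", "2008-05-30", "2008-06-11")),
  aqi    = c(60, 75, 155, 120, 40, 58)
)

# Partial, illustrative list of counties labeled as exposed
exposed_counties <- c("Duplin", "Lenoir", "Pitt", "Wayne")

max_aqi_by_county <- aqi_daily %>%
  # Keep NC records that fall within the June 10-12, 2008 study period
  filter(state == "North Carolina",
         date >= as.Date("2008-06-10"),
         date <= as.Date("2008-06-12")) %>%
  # Flag counties labeled as exposed
  mutate(exposed = county %in% exposed_counties) %>%
  # Maximum AQI per county, highest first
  group_by(county, exposed) %>%
  summarise(max_aqi = max(aqi), .groups = "drop") %>%
  arrange(desc(max_aqi))

max_aqi_by_county
```

Keeping the exposure flag as its own column lets students compare exposed and non-exposed counties in one table rather than in two separate filtering passes.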
The instructor delivers the lecture in which EDA and AQI are reviewed, and then the instructor works through the exercise in R. The instructor narrates each step of the exercise and demonstrates how code is written in R to implement the EDA Checklist. Because this lesson is currently delivered through an asynchronous online course, the coding demonstration occurs through a video recording. However, the exercise could easily be delivered in person instead of through a video recording, and could also be converted into a homework assignment, provided the students have sufficient prior coding experience to work independently.
Results and Discussion
Results from Exercise
From the exercise, students generate multiple plots and one summary table. The main findings include:
- AQI data are only available for a subset of the 100 counties in NC. Because AQI values are calculated from ground measurements at discrete monitoring locations, the available data only represent a subset of the counties in the state (Figure 1). Moreover, of the counties analyzed in the Rappold et al. (2011) study, only eight are represented in the AQI dataset.
Figure 1. AQI values calculated from PM2.5 observations collected in June 2008 from all monitoring sites (n = 29) located in NC. The blue vertical lines mark June 10 and 12, 2008, bounding the study period, because the Rappold et al. (2011) study analyzed satellite images from June 10, 11, and 12, 2008. The horizontal red lines are shown at 100 and 150 as these are AQI thresholds corresponding to unhealthy air quality conditions for sensitive groups and unhealthy conditions for all groups, respectively.
- AQI values remained below 100 during the study period for one of the counties labeled as “exposed” to the smoke plume in the Rappold et al. (2011) study. The AQI values calculated from data collected at a station in Duplin County, one of the counties identified as exposed to the smoke plume in the Rappold et al. (2011) study, did not exceed 80 (Figure 2). AQI values that fall between 51 and 100 correspond to “moderate” air quality conditions. The discrepancy between observed PM2.5 and satellite-derived AOD creates an opportunity for students to think critically about how differences in spatial resolution and measurement type may have produced this disparity. Whereas the AQI data were collected at discrete locations in space, the AOD data were gridded and had constant values for pixels with an area of 16 km². Additionally, the AQI data were derived from direct PM2.5 measurements, whereas AOD values served as proxy estimates of PM2.5.
Figure 2. AQI values calculated from PM2.5 observations collected in June 2008 from monitoring sites located in Duplin, Lenoir, Pitt, and Wayne Counties, NC. These counties were all labeled as “exposed” to the smoke plume in the Rappold et al. (2011) study. The blue vertical lines mark June 10 and 12, 2008, bounding the study period, because the Rappold et al. (2011) study analyzed satellite images from June 10, 11, and 12, 2008. The horizontal red lines are shown at 100 and 150 as these are AQI thresholds corresponding to unhealthy air quality conditions for sensitive groups and unhealthy conditions for all groups, respectively.
- AQI values for several counties not considered in the Rappold et al. (2011) study were elevated during the time following the peat bog fires. For several counties, including Wake, Forsyth, Chatham, and Guilford, AQI values exceeded 150 during the study period, meaning air quality conditions were considered unhealthy (Figure 1). These counties had not been considered in the Rappold et al. (2011) study, so the occurrence of elevated AQI values in these counties provides an opportunity for the students to consider how the spatial extent of a study area influences conclusions.
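Plots in the style of Figures 1 and 2 could be assembled with ggplot2 roughly as follows. The data frame, its column names, and all AQI values below are invented for illustration, so the resulting plot does not reproduce the published figures. The reference lines follow the figure captions: blue vertical lines at June 10 and 12, 2008, and red horizontal lines at AQI values of 100 and 150.

```r
library(ggplot2)

# Invented daily AQI values for two counties (illustration only)
aqi_nc <- data.frame(
  county = rep(c("Duplin", "Wake"), each = 5),
  date   = rep(seq(as.Date("2008-06-08"), by = "day", length.out = 5),
               times = 2),
  aqi    = c(45, 55, 70, 78, 60, 50, 90, 130, 155, 110)
)

# Dates bounding the study period, stored as data so they map onto the
# date axis like any other variable
study_period <- data.frame(date = as.Date(c("2008-06-10", "2008-06-12")))

p <- ggplot(aqi_nc, aes(x = date, y = aqi)) +
  geom_line() +
  geom_point() +
  # Blue vertical lines mark the start and end of the study period
  geom_vline(data = study_period, aes(xintercept = date), color = "blue") +
  # Red horizontal lines mark the AQI thresholds of 100 and 150
  geom_hline(yintercept = c(100, 150), color = "red") +
  # One panel per county
  facet_wrap(~ county) +
  labs(x = "Date (2008)", y = "AQI (PM2.5)")
```

Calling `print(p)` draws the plot. Faceting by county keeps each time series readable when many monitoring sites are shown at once.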
After going through the analysis, the students are asked to reflect on the results. The instructor could consider posing questions to encourage reflection on broader repercussions of the analysis, including:
- Could the results of the Rappold et al. (2011) study have differed if the researchers used AQI values based on measurements collected via field sensing instead of AOD measurements captured via satellite sensors?
- In the Rappold et al. (2011) study, what were the advantages and disadvantages of using AOD measurements captured via satellite instead of AQI values?
- How could AQI data be used to complement AOD measurements collected via satellite when performing analyses like those in the Rappold et al. (2011) study?
The lesson presented here exemplifies how students can be engaged in real-world problem-solving and data analysis with fairly basic coding skills. Instructors interested in implementing this lesson may consider the following ideas as to how the lesson could be improved and adapted:
- Analyze AQI data corresponding to a local wildfire or air pollution event. To maximize student engagement in the exercise, consider identifying a case study local to your college or university. By connecting the exercise to an air pollution event that occurred locally, there will be greater opportunity for the students to personally connect with the content.
- Require the students to download the data on their own. The lesson as described here does not require students to download AQI data from the U.S. EPA. Instead, the data were provided to the students directly in an effort to reduce the total time required for the exercise, and also to avoid issues related to data cleaning and formatting. However, engaging students in the process of accessing, downloading, and preparing data for analysis could prove valuable in further developing students’ data acumen. Therefore, time permitting, including data download as a first step in the exercise could be beneficial to supporting agricultural and biological engineering students’ data science education.
- Couple this lesson with lessons related to environmental sensing. This lesson could be particularly impactful in a course focused on sensing and field instrumentation. By including this lesson in an instrumentation course, the instructor could link sensing with sense-making and teach students to think more critically about the use and value of data.
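For instructors who choose to have students download the data themselves, the step could look something like the sketch below. The URL pattern is an assumption based on the layout of the EPA AirData pre-generated daily files, with 88101 assumed to be the parameter code for PM2.5, and should be confirmed on the AirData download page. The function names are hypothetical, and the download itself is left commented out because it needs a network connection.

```r
# Build the (assumed) URL of a pre-generated daily file on EPA AirData.
# ASSUMPTION: files follow the pattern daily_<parameter code>_<year>.zip
airdata_daily_url <- function(year, parameter_code = "88101") {
  sprintf("https://aqs.epa.gov/aqsweb/airdata/daily_%s_%d.zip",
          parameter_code, as.integer(year))
}

# Download, unzip, and read one year of daily data into a data frame
download_daily_aqi <- function(year, dest_dir = tempdir()) {
  url <- airdata_daily_url(year)
  zip_path <- file.path(dest_dir, basename(url))
  download.file(url, zip_path, mode = "wb")
  csv_path <- unzip(zip_path, exdir = dest_dir)
  # check.names = FALSE keeps the column names exactly as in the file
  read.csv(csv_path[1], check.names = FALSE)
}

# Not run here (requires a network connection):
# pm25_2008 <- download_daily_aqi(2008)
```

Having students inspect the downloaded file before analysis also gives them practice with the data cleaning and formatting issues that the provided file avoids.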
This work is supported by the USDA National Institute of Food and Agriculture, Hatch project 1016068. The author thanks Sierra Young and Dennis Buckmaster for the invitation to participate in a special collection of data science lessons for agricultural and biological engineering students.
Chi, M.T.H., 2009. Active-Constructive-Interactive: A Conceptual Framework for Differentiating Learning Activities. Top. Cogn. Sci. 1, 73–105. https://doi.org/10.1111/j.1756-8765.2008.01005.x
Chin, C., Chia, L.G., 2004. Problem-based learning: Using students’ questions to drive knowledge construction. Sci. Educ. 88, 707–727. https://doi.org/10.1002/sce.10144
Freeman, S., Eddy, S.L., McDonough, M., Smith, M.K., Okoroafor, N., Jordt, H., Wenderoth, M.P., 2014. Active learning increases student performance in science, engineering, and mathematics. Proc. Natl. Acad. Sci. 111, 8410–8415. https://doi.org/10.1073/pnas.1319030111
Haak, D.C., HilleRisLambers, J., Pitre, E., Freeman, S., 2017. Increased Structure and Active Learning Reduce the Achievement Gap in Introductory Biology. Science 332, 1213–1216.
NASEM, 2018. Data Science for Undergraduates: Opportunities and Options. Washington, DC. https://doi.org/10.17226/25104
Peng, R., Matsui, E., 2015. The Art of Data Science. Leanpub and Skybrude Consulting, LLC, https://bookdown.org/rdpeng/artofdatascience/.
R Core Team, 2020. R: A language and environment for statistical computing.
Rappold, A.G., Stone, S.L., Cascio, W.E., Neas, L.M., Kilaru, V.J., Carraway, M.S., Szykman, J.J., Ising, A., Cleve, W.E., Meredith, J.T., Vaughan-Batten, H., Deyneka, L., Devlin, R.B., 2011. Peat bog wildfire smoke exposure in rural North Carolina is associated with cardiopulmonary emergency department visits assessed through syndromic surveillance. Environ. Health Perspect. 119, 1415–1420. https://doi.org/10.1289/ehp.1003206
USEPA, 2018. Technical Assistance Document for the Reporting of Daily Air Quality – the Air Quality Index (AQI), EPA 454/B-18-007. Research Triangle Park, NC.
USEPA, 2014. Air Quality Index (AQI): A Guide to Air Quality and Your Health, EPA-456/F-14-002. Research Triangle Park, NC. https://doi.org/10.1007/978-94-007-0753-5_100115
USEPA, 1998. Guidance for Using Continuous Monitors in PM2.5 Monitoring Networks, EPA-454/R-98-012. Research Triangle Park, NC.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T., Miller, E., Bache, S., Müller, K., Ooms, J., Robinson, D., Seidel, D., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., Yutani, H., 2019. Welcome to the Tidyverse. J. Open Source Softw. 4, 1686. https://doi.org/10.21105/joss.01686
Wickham, H., Grolemund, G., 2017. R for Data Science, 1st ed. O’Reilly Media.
Woods, D.R., 2014. Problem-oriented learning, problem-based learning, problem-based synthesis, process oriented guided inquiry learning, Peer-Led team learning, model-eliciting activities, and project-based learning: What is best for you? Ind. Eng. Chem. Res. 53, 5337–5354. https://doi.org/10.1021/ie401202k