Analyzing Sensor Data at the Source
Published in Case Studies and Modules for Data Science Instruction (): 1-14 (doi: ). Copyright American Society of Agricultural and Biological Engineers.
Abstract. Data science spans a broad array of activities, including data collection, storage, integration, analysis, inference, communication, and ethics. Exposure to these concepts in the context of real-world applications is critical so students can understand the limitations and other considerations that may arise when applying data science concepts. Of particular importance for biological and agricultural engineering data science applications is the acquisition and availability of sensor data. Sensor technologies are continuously being incorporated into modern agricultural and environmental practices and measure physical quantities such as temperature, light, pressure, sound, and humidity. When using sensors, physical phenomena are converted into detectable electrical signals, which must be ultimately processed into meaningful measurements. It is important that students have some understanding of how most sensor responses are first represented by a change in electrical property, and how that electrical property may be converted into the physical, measured phenomenon. Therefore, the purpose of this two-part instructional material is to familiarize students with the basic concepts of raw sensor output data by providing activities focused on linear sensor calibration and characterizing nonlinear temperature responses.
See this video: https://vimeo.com/434519511
Course materials are in a zip file for download, listed alongside this abstract at https://elibrary.asabe.org/textbook.asp?confid=sci2021
Keywords. data science, instruction, sensors, calibration, measurements.
Background and Context
Data science spans a broad array of activities, including data collection, storage, integration, analysis, inference, communication, and ethics (1). The development of data science curriculum may include creating new courses or academic programs that focus intensively on teaching data science; however, there are also many opportunities to integrate activities that teach a variety of data science skills in existing courses. Such skills that may be incorporated into biological and agricultural engineering curriculum include applying mathematical, statistical, and computational foundations, data management and curation, data visualization, modeling and assessment, workflow development, domain-specific considerations, and ethical problem solving (1). Exposure to these concepts in the context of real-world applications is critical so students can understand the limitations and other considerations that may arise when applying data science concepts.
Of particular importance for biological and agricultural engineering data science applications is the acquisition and availability of sensor data. Sensor technologies are continuously being incorporated into modern agricultural and environmental practices and measure physical quantities such as temperature, light, pressure, sound, and humidity. These physical phenomena are converted into detectable electrical signals, which must be ultimately processed into meaningful measurements. Sensors themselves range from relatively simple thermocouple structures to the detection of specific compounds using advanced biochemical principles. Regardless of the sensor type, however, it is important that students have some understanding of how most sensor responses are first represented by a change in electrical property, and how we must convert that electrical property to the physical, measured phenomenon. Therefore, the purpose of this two-part instructional module is to familiarize students with the basic concepts of raw sensor output data and highlight what often happens “behind the scenes” when relying on sensor outputs to measure physical phenomena of interest.
These materials are an introductory application of statistical concepts in the domain-specific application of sensor calibration and step response analysis, with a focus on pressure and temperature sensor data. Note that there is an opportunity to pair these materials with a simple lab activity such that the analyses are performed on data collected by the students themselves, although the laboratory exercise is not included in this material. Prerequisites: It is assumed that students have background knowledge on the following concepts: descriptive statistics, tests of significance, outliers, linear regression, and introductory circuit analysis (e.g., Ohm’s Law). Additionally, basic programming skills are required. Previous exposure to MATLAB is desirable.
Overview of the Materials
These materials include two activities that deal with analyzing raw sensor data. The first activity includes a set of calibration data points as voltage measurements for a linear pressure sensor, with questions focusing on the detection of “false” sensor measurements and on creating and applying a linear calibration curve for the sensor. The second activity includes step response measurements (voltages) for a nonlinear temperature sensor, and the module includes data pre-processing and a comparison of two different methods for calculating a sensor parameter that describes sensor responsiveness. Complete code for both activities is provided as MATLAB scripts, and the data are included as ‘.mat’ files. A student worksheet and supplementary slides for teaching each activity are also included as part of these materials. The following sections describe the data and activities in greater detail.
Activity I: Outlier Detection and Calibration
Background on the Data:
It is common for sensors to produce a change in electrical property, normally measured as a voltage, in response to a change in some physical property. This voltage can be measured by a data logger or other data acquisition system. The voltage is then converted back to a measured physical value by applying a relationship between the sensor change and physical parameter change (Figure 1). This general relationship is often in the form of a characteristic curve, which defines the sensor response to an input. While standards and characteristic curves for a sensor type are informative, calibration should ideally be performed for each individual sensor. The calibration process involves comparing a sensor reading to a known, standard physical reference, and aims to account for sources of error that may be due to a non-zero offset, variation in sensor physical properties, and possible changes in sensor behavior over time. By limiting errors in the voltage measurements and representing the relationship between voltage and physical parameter more accurately through a calibration process, sensor readings can generally become more accurate.
Figure 1: General calibration curve with the input value, normally measured as a voltage signal, on the x axis, and the output value, or inferred sensor response value, on the y axis. (Source: Theory and Design for Mechanical Measurements, 6th Ed., Figliola and Beasley).
An established calibration method involves defining a set of calibration points over the required operating range, measuring those points with a standard or laboratory-grade reference, and observing the sensor response at each calibration point. Given that sensors (especially low-cost sensors) may be noisy, multiple sensor output readings may be taken at each individual calibration point.
For calibration purposes, standards fall under two main categories: standards used to produce accurate physical quantities, and standards used to accurately measure physical quantities. In the case of pressure sensor calibration, a standard may produce a known pressure or measure pressures created by another source. For example, the National Institute of Standards and Technology (NIST) has piston gauge pressure standards for calibrating pressure sensing devices (2). It is important to note that any calibration is only as good as the standard used. The accuracies of calibration standards vary, and the required accuracy for a given application will depend on what is being calibrated; however, a good rule of thumb is to select a standard that is approximately 3-4 times more accurate than the sensor or device being calibrated.
The data included in this activity represent data points used for generating a calibration curve for a pressure sensor and include output voltage readings at a set of eight known pressure values for a sensor with linear behavior. Multiple readings were taken at each calibration point. Also, normally distributed noise has been manually added to each calibration point for the purpose of illustrating concepts related to outlier detection and removal. The data are stored in a .mat file named ‘calibration_data_raw.mat’. The data are in a 1x9 cell array, where each cell index corresponds to a known pressure (0-8 atm), and the data in each cell are the voltage sensor output values associated with that known pressure. (Note: the data are also provided as a ‘.csv’ file with pressure values in the header row.)
Objectives of this Activity:
By the end of this activity, students will be able to
- Visually explore a data set for potential outliers;
- Apply several outlier detection techniques to a data set;
- Generate and evaluate linear sensor calibration curves.
Description of Activity:
This activity can be divided into four components: 1) data visualization and exploration, 2) outlier detection and method comparison, 3) linear regression (calibration curve generation), and 4) calibration evaluation.
1. Data visualization and exploration: A common first step in an exploratory analysis is visualizing the data. The first visualization step is to plot the raw sensor data as a scatterplot (see Figure 2a). After creating this plot, it should be visually evident that the calibration data follow a linear relationship; however, there is some variation in sensor output at each pressure value. To further explore the data points at each calibration point, two statistical plots should be generated: boxplots and histograms. These are two commonly used visual approaches for detecting outliers and understanding the underlying distributions of the data. Sample plots generated in MATLAB are shown in Figures 2b and 2c below.
Figure 2: (a) Scatterplot of the sensor output data vs. known calibration pressure, (b) box plots for each calibration point, and (c) histograms for each calibration point.
Within the supplied worksheet, students will have the opportunity to comment on any observations they may have after generating each of the above plots. Possible observations include a clear outlier in the sensor output data at a calibration pressure of 3 atm, perhaps due to measurement error, and apparently greater variance in sensor output at higher pressure readings.
2. Outlier detection: This section of the activity requires students to apply, evaluate, and compare several different methods for univariate outlier detection. The MATLAB function isoutlier (3) has several options to invoke statistical outlier detection methods, including using the Median Absolute Deviation (MAD), interquartile ranges, standard deviations, Grubbs’s test, and the generalized extreme Studentized deviate test. The table below contains a summary of these methods available for use in MATLAB.
Table 1: Outlier detection methods available for use in MATLAB (3).
Method | Description
‘median’ | Returns true for elements more than three scaled MAD from the median. The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)) (approximately 1.4826).
‘mean’ | Returns true for elements more than three standard deviations from the mean.
‘quartiles’ | Returns true for elements more than 1.5 interquartile ranges above the upper quartile or below the lower quartile. This method does not assume the data are normally distributed.
‘grubbs’ | Applies Grubbs’s test for outliers, which removes one outlier per iteration based on hypothesis testing. This method assumes that the data are normally distributed.
‘gesd’ | Applies the generalized extreme Studentized deviate test for outliers. This iterative method is similar to ‘grubbs’, but can perform better when there are multiple outliers masking each other.
In this portion of the activity, students have the freedom to choose a subset of the above methods for implementation and explain their choice, drawing on previously learned concepts from statistics, as well as considering the application-specific reasons for wanting to detect outliers in this data set (e.g., eliminating measurements that are due to instrument or other error that may introduce unnecessary error during calibration). Considerations for method choice include the fact that the mean is highly affected by outliers and whether the underlying data can be assumed to be normally distributed (possibly applying a test for normality before selecting a particular method). Note that outlier detection and removal also involve ethical decision-making. Rejecting data values without attempting to understand why they may exist is not considered an ethical practice. Further, any data reduction should always be reported and explained.
After selecting and applying several outlier detection methods, students will compare the results of these methods. Some metrics for comparison may include the total number of outliers removed at each calibration point, the change in the mean or median value at each calibration point, or the difference in outlier removal between methods. Students will select one set of calibration data to use before moving on to the next section and generating a calibration curve.
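As an illustration of how two of the rules in Table 1 operate, the following is a minimal Python sketch of the scaled-MAD and interquartile-range tests applied to the readings at a single calibration point. It is not the provided MATLAB code: the function names and the `readings` values are hypothetical, and the quartile convention used here (Python's "inclusive" method) may differ slightly from the one MATLAB applies.

```python
from statistics import NormalDist, median, quantiles


def mad_outliers(values, threshold=3.0):
    """Flag values more than `threshold` scaled MADs from the median."""
    # c = -1/(sqrt(2)*erfcinv(3/2)) ~= 1.4826, written here through the
    # standard normal quantile function because the stdlib has no erfcinv.
    c = 1.0 / NormalDist().inv_cdf(0.75)
    med = median(values)
    scaled_mad = c * median([abs(v - med) for v in values])
    return [abs(v - med) > threshold * scaled_mad for v in values]


def iqr_outliers(values, k=1.5):
    """Flag values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [v < q1 - k * iqr or v > q3 + k * iqr for v in values]


# Hypothetical voltage readings (V) at one calibration pressure.
readings = [1.50, 1.52, 1.49, 1.51, 1.50, 1.48, 1.53, 2.40]
mad_flags = mad_outliers(readings)
iqr_flags = iqr_outliers(readings)
cleaned = [v for v, bad in zip(readings, mad_flags) if not bad]
print(sum(mad_flags), sum(iqr_flags), len(cleaned))
```

Counting the flags per calibration point for each method gives one of the comparison metrics suggested above. Flagged values should still be examined and reported rather than silently discarded.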
3. Creating a calibration curve: For this exercise, students will generate a calibration curve to relate sensor output voltage to pressure using linear regression (4) with sensor output voltage as the independent variable, and pressure as the dependent variable. Two regressions will be performed: one assuming a zero offset (ideal characteristic curve) of the form $P = \beta_1 V$, and the other including a non-zero offset of the form $P = \beta_0 + \beta_1 V$, where $V$ is the sensor output voltage and $P$ is the pressure (see Figure 3 below). This exercise will highlight the importance of accounting for offset error that may arise.
Figure 3: Illustration of an ideal characteristic curve compared to the characteristic curve after accounting for two sources of error: offset error (non-zero offset) and span error.
Additionally, students will perform a third linear regression using sensor output data that have been averaged at each of the calibration points. Comparing this curve with the calibration curve generated from all sensor output values will illustrate that the regression coefficients and coefficients of determination (R-squared values) differ little, but that averaging the sensor data discards variance that is important in understanding sensor performance and gives a somewhat false impression of the sensor's precision.
Figure 4: Example linear regression curves, including an ideal curve and a non-zero offset curve, for all of the calibration data points (a) and averaged calibration data points (b) using results after applying the median outlier removal method.
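The two regression forms described in this section can be sketched in Python as follows. This is an illustrative sketch rather than the provided MATLAB script: the function names are hypothetical, and the example data are synthetic values generated from an assumed line with a small negative offset, not the activity data.

```python
def fit_through_origin(x, y):
    """Least-squares slope b1 for the zero-offset model y = b1 * x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def fit_with_offset(x, y):
    """Least-squares (b0, b1) for the model y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Synthetic, noise-free example: pressures (atm) and the voltages (V) that an
# assumed sensor with P = -0.2 + 1.7 * V would produce.
pressure = [0.0, 1.0, 2.0, 3.0, 4.0]
voltage = [(p + 0.2) / 1.7 for p in pressure]

slope_ideal = fit_through_origin(voltage, pressure)
offset, slope = fit_with_offset(voltage, pressure)
print(slope_ideal, offset, slope)
```

Because the zero-offset model cannot represent the intercept, its slope absorbs part of the offset error. The same functions can be applied to the per-point averages to compare against the fit that uses every reading.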
4. Evaluating the calibration curves: To evaluate the linear regression results, students are first asked to calculate the coefficients of determination (R-squared values). These values for both curves should be very close to 1, indicating a strong linear relationship. Next, because the linear regression model coefficients are calculated from data and subject to error, students are asked to evaluate the coefficients by performing a hypothesis test on both the slope and intercept. To conduct this test, a two-tailed t test is used to determine if the coefficients differ significantly from zero (i.e., there is a significant relationship between the independent and dependent variables). Finally, confidence intervals on the predicted values from the calibration equations are calculated (this calculation is included in the MATLAB script). A brief example of expected results is included below in Table 2, which compares the coefficients and R-squared values for each curve (with and without an intercept). Table 3 shows example p values for the calibration coefficients for the model with the intercept included (note that additional parameters and confidence intervals are calculated in the MATLAB script).
Table 2: Example results from linear regression.
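Dataset | Ideal (zero offset) line β0 (atm) | Ideal line β1 (atm/V) | Ideal line R² | Non-zero offset line β0 (atm) | Non-zero offset line β1 (atm/V) | Non-zero offset line R²
Complete dataset | - | 1.68 | 0.992 | -0.203 | 1.73 | 0.993
Averaged dataset | - | 1.67 | 0.994 | -0.198 | 1.73 | 0.996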
Table 3: Example t test results for calibration coefficients.
Coefficient | Estimate | Standard error | t statistic | p value
β0 (intercept) | -0.2023 atm | 1.14 × 10^-2 | -17.154 | 6.382 × 10^-64
β1 | 1.7336 atm/V | 3.832 × 10^-3 | 452.35 | 0
R-squared: 0.993
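The quantities reported in Table 3 follow from the standard formulas for simple linear regression. The sketch below shows one way to compute the standard errors and t statistics in Python; the function name and example data are hypothetical, and this is not the provided MATLAB code. The two-tailed p value additionally requires the Student's t distribution with n - 2 degrees of freedom, which the Python standard library does not provide, so it is left out here.

```python
import math


def coefficient_t_stats(x, y):
    """Return {name: (estimate, standard_error, t_statistic)} for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual variance estimated with n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    se_b1 = math.sqrt(s2 / sxx)
    se_b0 = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))
    # t statistic for the null hypothesis that the coefficient equals zero.
    return {"b0": (b0, se_b0, b0 / se_b0), "b1": (b1, se_b1, b1 / se_b1)}


# Hypothetical example data (not the activity data).
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [1.1, 1.9, 3.2, 3.8, 5.0]
stats = coefficient_t_stats(x_demo, y_demo)
for name, (estimate, std_err, t_stat) in stats.items():
    print(name, estimate, std_err, t_stat)
```

A large absolute t statistic relative to the critical value for n - 2 degrees of freedom indicates that the coefficient differs significantly from zero.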
Materials Included and Implementation:
A student worksheet is provided for this activity, and all code for this activity is found in the MATLAB script file, ‘ActivityOneScript.m’. Depending on students’ comfort with MATLAB and their programming experience, instructors may either provide a script “shell” by removing specific portions of the provided code, or provide only the data and student worksheet. Additionally, the analyses could be adapted to R or other statistical software if desired using the data provided in the ‘.csv’ format.
The student worksheet is intended to be a self-explanatory set of instructions and is designed to be a hands-on assignment. It contains a set of steps and questions that students can answer in a separate document, along with plots and other descriptors of their data (the exact format for student responses is up to the discretion of each individual instructor).
Activity II: Determining Time Constants for Nonlinear Sensors
Background on the Data:
When selecting a sensor, it is important to understand how quickly it responds to a change in environment. A common way to characterize the responsiveness of a sensor is its time constant, or tau ($\tau$) (5). It is normally defined as the time required for the sensor output to reach 63.2% of the difference between the final value and initial value after being exposed to a step input (a sudden, immediate change). Without this information, it is possible to select a sensor that responds too slowly (or too quickly, although that is less common) for a given application.
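The 63.2% figure follows from the step response of a first-order system. As a brief sketch, assuming the sensor behaves as a first-order system with initial value $y_0$ and steady state value $y_\infty$ (symbols chosen here for illustration):

```latex
y(t) = y_\infty + (y_0 - y_\infty)\, e^{-t/\tau}
\quad\Longrightarrow\quad
\frac{y(\tau) - y_0}{y_\infty - y_0} = 1 - e^{-1} \approx 0.632
```

That is, after one time constant the output has covered about 63.2% of the total change.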
This activity will focus on using provided experimental data to calculate the time constant for a negative temperature coefficient (NTC) thermistor (6). Thermistors are semiconductor temperature sensors made from mixtures of metal oxides, and their resistance decreases in a non-linear manner with an increase in temperature. There are various formulas used for estimating the resistance of a thermistor at a particular temperature (or, equivalently, the temperature for a particular resistance). The simplest such formula, used on many specification sheets for thermistors, is the “β” equation:
$$\frac{1}{T} = \frac{1}{T_0} + \frac{1}{\beta} \ln\left(\frac{R}{R_0}\right)$$
where $T$ is the sensor temperature in Kelvin, $R$ is the resistance, $\beta$ is a parameter that depends on the material used in the thermistor (usually provided by the manufacturer), and $R_0$ is the resistance at some calibration temperature $T_0$ (typically 25° Celsius, expressed in Kelvin in the equation).
The data provided represent the thermistor response (change in voltage drop across the sensor) after exposure to a sudden increase in temperature, starting from room temperature. These data were collected using the thermistor in a voltage divider circuit, given a supply voltage of 12 V, β parameter of 3972, $R_0$ of 5 kΩ, and resistor value of 5 kΩ. (Note that a detailed discussion of operation and design of voltage divider circuits is outside the scope of this activity; however, it is worth mentioning that in practice, resistor values have tolerances and are subject to potential drift, and the values of resistor components used in voltage divider circuits should always be measured.) Using these data, students will estimate the time constant for this sensor using two methods: 1) the graphical (visual) method, and 2) the error fraction method.
Objectives of this Activity:
By the end of this activity, students will be able to
- Analyze temperature sensor step response data;
- Calculate the time constant parameter using two different methods.
Description of this Activity:
This activity can be divided into two components: 1) data visualization and preparation, and 2) time constant determination using two methods.
1. Data visualization and preparation: The first step in this analysis is visualizing the data by plotting the sensor output voltage vs. time (Figure 5a). After creating this plot, it should be visually evident that the data generally follow a nonlinear step response curve. While the voltage data are useful, the voltage measurements need to be converted to resistance measurements for use in the β equation. For this step, the thermistor resistance can be calculated from the known parameters of the voltage divider. After this conversion, the β equation can be used to calculate the thermistor temperature as a function of time (Figure 5b).
Figure 5: Plot of (a) voltage output across the thermistor, and (b) thermistor temperature, vs. time.
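The voltage-to-temperature conversion can be sketched in Python as follows. This is an illustration under stated assumptions, not the provided MATLAB script: it assumes the measured voltage is the drop across the thermistor, with the fixed resistor in series between the thermistor and the supply, and it assumes the calibration temperature is 25 °C (298.15 K). The function and constant names are hypothetical.

```python
import math

V_SUPPLY = 12.0    # supply voltage (V)
R_FIXED = 5000.0   # fixed divider resistor (ohm)
R0 = 5000.0        # thermistor resistance at the calibration temperature (ohm)
BETA = 3972.0      # beta parameter (K)
T0_K = 298.15      # assumed calibration temperature, 25 degrees C, in kelvin


def thermistor_resistance(v_out):
    """Thermistor resistance from the voltage measured across it.

    Assumes v_out = V_SUPPLY * R_t / (R_t + R_FIXED), solved for R_t.
    """
    return R_FIXED * v_out / (V_SUPPLY - v_out)


def beta_temperature_k(resistance):
    """Temperature (K) from the beta equation 1/T = 1/T0 + ln(R/R0)/beta."""
    inv_t = 1.0 / T0_K + math.log(resistance / R0) / BETA
    return 1.0 / inv_t


# Example with hypothetical voltage samples (V).
for v in (6.0, 5.0, 4.0):
    r = thermistor_resistance(v)
    print(v, r, beta_temperature_k(r) - 273.15)
```

If the divider is wired the other way around (voltage measured across the fixed resistor), the resistance expression changes accordingly, so the circuit configuration should be confirmed before applying the conversion.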
2. Time constant determination: In this section, students will calculate the time constant of the sensor using two different methods, and compare and discuss the results. The first method is a visual approximation, while the second method uses the concept of error fraction and linear regression to directly compute the time constant.
(a) Visual approximation of time constant, $\tau$: The time constant is defined as the time required for the sensor output to reach 63.2% of the difference between the final value and initial value after being exposed to a step input and can be directly estimated from the step response plot. Students are asked to use their step response plot to estimate $\tau$ and visually illustrate their estimation methods. An example of this is shown in Figure 6 below.
Figure 6: Illustration of using the step response curve to graphically estimate $\tau$.
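The graphical estimate can also be approximated numerically by locating the time at which the response crosses the 63.2% level. The Python sketch below is an illustration only: it assumes the step occurs at the first sample and that the last sample represents the final value, and it uses synthetic first-order data with an assumed time constant of 1.25 s rather than the activity data.

```python
import math


def time_constant_from_step(t, y, fraction=0.632):
    """Time after t[0] at which y first covers `fraction` of its total change."""
    y0, y_final = y[0], y[-1]
    target = y0 + fraction * (y_final - y0)
    rising = y_final > y0
    for i in range(1, len(y)):
        crossed = y[i] >= target if rising else y[i] <= target
        if crossed:
            # Linear interpolation between the two samples around the crossing.
            frac = (target - y[i - 1]) / (y[i] - y[i - 1])
            return t[i - 1] + frac * (t[i] - t[i - 1]) - t[0]
    raise ValueError("response never reaches the target level")


# Synthetic step response: 20 to 50 (arbitrary units), tau assumed to be 1.25 s.
t_demo = [0.01 * k for k in range(1001)]
y_demo = [20.0 + 30.0 * (1.0 - math.exp(-ti / 1.25)) for ti in t_demo]
tau_est = time_constant_from_step(t_demo, y_demo)
print(tau_est)
```

With noisy measured data, the choice of initial and final values strongly affects the result, which is one reason the estimate differs between methods.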
(b) Error fraction method to estimate time constant, $\tau$: The error fraction represents the amount of error in the measurement system after a step input change relative to the final value as a function of time. The error fraction can be calculated as:
$$\Gamma(t) = \frac{y(t) - y_\infty}{y_0 - y_\infty}$$
where $y(t)$ is the response value at a given time $t$, $y_0$ is the initial value, and $y_\infty$ is the steady state value. An example plot of the error fraction vs. time is shown in Figure 7 below. Note that, depending on the exact final value selected, there may be oscillations around the final value.
Figure 7: Error fraction vs. time (a) including oscillations about the final value, which leads to negative error fraction values, and (b) with these oscillations and negative error fraction values removed.
For a first-order response, the error fraction equation can be simplified to $\Gamma(t) = e^{-t/\tau}$, which can be rearranged to the linear form $\ln \Gamma(t) = -t/\tau$. Students will use this relationship to perform a linear regression and relate the slope to the time constant $\tau$ (the slope equals $-1/\tau$). Note that as the error fraction approaches zero, its natural log behaves nonlinearly near the end of the step response; therefore, students will have to select the linear portion of the curve before performing linear regression. An example of this is shown in Figures 8a and 8b below. Additionally, two linear regression curves should be generated: one that is an “ideal” line with a zero y-intercept, and a best linear fit that includes a non-zero y-intercept.
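The error fraction regression can be sketched in Python as follows. This is an illustrative sketch, not the provided MATLAB script: the function names are hypothetical, time is assumed to be measured from the instant of the step, and the `min_fraction` cutoff is an assumed stand-in for the manual selection of the linear portion (it also excludes zero or negative error fractions, whose logarithm is undefined). The example uses synthetic first-order data with an assumed time constant of 1.25 s.

```python
import math


def error_fraction(y, y0, y_inf):
    """Error fraction (y(t) - y_inf) / (y0 - y_inf) for each sample."""
    return [(yi - y_inf) / (y0 - y_inf) for yi in y]


def tau_from_error_fraction(t, y, y0, y_inf, min_fraction=0.05):
    """Return (tau_ideal, tau_fit) from regressions of ln(error fraction) on t."""
    gamma = error_fraction(y, y0, y_inf)
    # Keep only the approximately linear portion of ln(gamma) vs. t.
    pts = [(ti, math.log(g)) for ti, g in zip(t, gamma) if g > min_fraction]
    n = len(pts)
    # "Ideal" line through the origin: slope = sum(t * ln) / sum(t^2).
    slope_ideal = sum(ti * li for ti, li in pts) / sum(ti * ti for ti, _ in pts)
    # Best linear fit with a non-zero intercept.
    t_bar = sum(ti for ti, _ in pts) / n
    l_bar = sum(li for _, li in pts) / n
    sxx = sum((ti - t_bar) ** 2 for ti, _ in pts)
    sxy = sum((ti - t_bar) * (li - l_bar) for ti, li in pts)
    slope_fit = sxy / sxx
    # In both cases the slope equals -1/tau.
    return -1.0 / slope_ideal, -1.0 / slope_fit


# Synthetic step response from 20 to 50 (arbitrary units), tau assumed 1.25 s.
t_demo = [0.05 * k for k in range(200)]
y_demo = [50.0 - 30.0 * math.exp(-ti / 1.25) for ti in t_demo]
tau_ideal, tau_fit = tau_from_error_fraction(t_demo, y_demo, y0=20.0, y_inf=50.0)
print(tau_ideal, tau_fit)
```

On noise-free data the two estimates coincide; on measured data they typically differ because the fitted intercept absorbs errors in the chosen initial value and step time.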
A comparison between these methods should be completed after estimating $\tau$. The time constant estimates will likely vary among the three methods but should be comparable. A summary of example time constant values from this activity is shown in Table 4 below. Finally, students should compare their values for the time constant with the manufacturer specification of 1.25 s. Potential sources of error include the experimental setup for initiating a step response, the selection of the time for the step response final value, and human error when reading time values from the plot, among others. Also, instructors may ask domain-specific questions regarding the thermistor specification, and whether or not it is adequate for a particular application.
Figure 8: (a) The natural log of the error fraction vs. time (note that the natural log of small values near zero causes nonlinear behavior in the curve near the end of the step response); and (b) the same plot including only the approximately linear portion of the curve, in addition to two linear regression curves (an ideal and best linear fit).
Table 4: Example results from estimating the time constant $\tau$.
Method | $\tau$ (s)
Graphical estimation | 1.34
‘Ideal’ linear regression | 1.394
Best fit linear regression | 1.348
Materials Included and Implementation:
A student worksheet is included for this activity, and analysis code is included in the ‘ActivityTwoScript.m’ file. The voltage output data for this activity are saved in the ‘Vout.mat’ file, with the corresponding time vector saved in the ‘t.mat’ file (data are also provided in a ‘.csv’ file). Depending on students’ comfort with MATLAB and their programming experience, instructors may either provide a script “shell” by removing specific portions of the provided code, or provide only the data and student worksheet.
The student worksheet is intended to be a self-explanatory set of instructions and is designed to be a hands-on assignment. It contains a set of steps and questions that students can answer in a separate document, along with plots and other descriptors of their data (the exact format for student responses is up to the discretion of each individual instructor).
(1) National Academies of Sciences, Engineering, and Medicine. (2018). "Data Science for Undergraduates: Opportunities and Options." The National Academies Press, Washington, DC.
(2) National Institute of Standards and Technology. "Special Publication 250-39: NIST Calibration Services for Pressure Using Piston Gauge Standards." 2009.
(3) MATLAB Help Center. "Find outliers in data". MathWorks. Available: https://www.mathworks.com/help/matlab/ref/isoutlier.html
(4) MATLAB Help Center. "Linear Regression". MathWorks. Available: https://www.mathworks.com/help/stats/linear-regression-model-workflow.html
(5) Figliola, R. S., and Beasley, D. E. Chapter 3, Measurement System Behavior. In: Theory and Design for Mechanical Measurements, 6th Edition. John Wiley & Sons, Inc. Hoboken, New Jersey, USA.
(6) Texas Instruments. "The Engineer’s Guide to Temperature Sensing." Temperature Sensors Support & Training Web Page, Available: https://www.ti.com/sensors/temperature-sensors/support-training.html