ASABE Logo

Article Request Page ASABE Journal Article

Relative and Unified Skill of Environmental, Edaphic, and Management Factors to Explain Crop Yield Variance Using Machine Learning

Meetpal Singh Kukal1,2,*


Published in Journal of the ASABE 67(4): 995-1011 (doi: 10.13031/ja.15755). Copyright 2024 American Society of Agricultural and Biological Engineers.


1 Department of Soil and Water Systems, University of Idaho-Boise, Boise, Idaho, USA.

2 Department of Agricultural and Biological Engineering, Pennsylvania State University, University Park, Pennsylvania, USA.

* Correspondence: mkukal@uidaho.edu

The authors have paid for open access for this article. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License https://creative commons.org/licenses/by-nc-nd/4.0/

Submitted for review on 28 July 2023 as manuscript number NRES 15755; approved for publication as a Research Article by Associate Editor Dr. Kendall DeJonge and Community Editor Dr. Kati Migliaccio of the Natural Resources & Environmental Systems Community of ASABE on 25 September 2023.

Citation: Kukal, M. S. (2024). Relative and unified skill of environmental, edaphic, and management factors to explain crop yield variance using machine learning. J. ASABE, 67 (4), 995-1011. https://doi.org/10.13031/ja.15755

Highlights

Abstract. Crop yields are dictated by a complex interplay between environment, edaphic (soils), and management that are subject to change across space and time. However, to what extent each of these influences and their interactions have been important in explaining yield variance is limitedly understood. The convoluted nature of this question motivates the application of modern machine learning approaches to decipher these influences and elucidate crop yield relations with their critical drivers. Here, we used random forest modeling with recursive feature elimination (RFE) to discern the diverse drivers of historical (1981-2019) county-level maize and soybean yields in the U.S. within the realms of environment (growing and extreme degree days, precipitation, vapor pressure deficit, evaporative demand, crop water use, soil moisture), soils (sand, silt, and clay contents, bulk density, soil organic carbon, available water capacity), and management (irrigation intensity, tile drainage, nitrogen input, depth to groundwater). We found that the most effective models selected predictors from all three realms and achieved remarkable explanatory capability, accounting for 75%-80% of the spatial and temporal variance combined. Environmental predictors exhibited a non-negotiable role in determining model performance, while their combination with either soil or management predictors approached the efficacy of the best-performing model. Specific variables within each individual predictor set and their combinations were analyzed for their relative importance to the model skill. The best performing model that used soils, management, and seasonal environmental predictors evaluated extreme heat as the most influential predictor for both crops. On further inclusion of sub-seasonal environmental variables with soil and management predictors, the relative importance shifted, with irrigation intensity assuming prominence as the most influential predictor, accompanied by tile drainage, silt content, and July precipitation for both crops. RFE revealed that only eight of the most relevant predictors were sufficient to explain >70% of the yield variance, even when the total predictors included were as high as 52. that Visual representations were developed to offer insights into the functional response of crop yield to changes in critical predictors, thereby facilitating predictions across diverse agricultural production systems and over time. This research underscores the value of including soil and management indicators alongside environmental predictors to improve understanding and predictability of the intricate dynamics governing crop yield variability.

Keywords.Climate variability, Drainage, Irrigation, Maize, Nitrogen, Random forest, Soils, Soybean.

Biomass and economic yield attained from croplands dictate a wide range of ecosystem processes and services, including carbon sequestration, weather modulation, water, energy, and nutrient cycling, as well as providing for food security and rural livelihoods. Crop yields are a derivative of a complex interplay among the uptake of resources such as water, radiative energy, and nutrients. These complex interactions are influenced by both the availability of each resource and the ability of plants to use it. Weather conditions dictate availability of resources such as precipitation, radiative energy for crop growth, the atmosphere’s demand for water, etc. Different regions, crops, and growth stages show non-linear and varying sensitivity to changes in weather. Thus, similar weather patterns can be detrimental or beneficial to crops by different degrees in different environments. Many of these crop processes and weather impacts are mediated or worsened by certain critical edaphic (soil) factors, as soils dictate nutrient and water retention. Furthermore, crop production in inconducive climates and soils often see large adaptive investments by producers to alleviate weather impacts, especially in on-farm water management practices such as irrigation and drainage (Frisvold and Bai, 2016; Lambert et al., 2021; Ma et al., 2023). Thus, the regional performance of agroecosystems in the long run is a compounded function of environmental, edaphic, and management factors.

Numerous environmental, edaphic, and management attributes are explicitly accounted for in process-based models to predict crop biomass, yield, and resource use. However, much of the learning from crop model simulations stems from predefined interrelationships between biophysical indicators. Long-term observational data collected from producer fields that are distributed across vast swaths provides an unprecedented opportunity to detect new signals in crop productivity vs. soil and ambient environments and adaptation space. However, unlike physically-based models, statistical modeling of crop yield rarely considers all determinants of crop productivity, with weather/climate receiving the most emphasis as drivers of crop yields and their variability. A vast body of work has enhanced our knowledge of the historical behavior of crop yields against weather variability, both nationally and internationally. These include impacts of change in heat accumulation for crop growth (Lin et al., 2020; Kukal and Irmak, 2018a; Zhu et al., 2018; Miller et al., 2021), extreme heat (Lobell et al., 2013; Zaveri and Lobell, 2019; Butler and Huybers, 2015), precipitation amount (Kukal and Irmak, 2018b; Fishman, 2016; Li et al., 2019; Eck et al., 2020) and precipitation intensity (Troy et al., 2015; Proctor, 2023), solar radiation (Tollenaar et al., 2017), soil water availability (Rigden et al., 2020; Urban et al., 2015; Peichl et al., 2018; Ortiz-Bobea et al., 2019), vapor pressure deficit (Rigden et al., 2020; Kukal et al., 2023) and other environmental indicators.

In addition to environmental variables, certain soil characteristics and adaptation/management strategies have also been studied to some extent for their ability to mitigate impacts from adverse weather attributes. Huang et al. (2021) found that yields of corn, winter wheat, soybean, cotton, barley, oats, and rice in the U.S. between 1958 and 2019 show greater sensitivity to temperature and precipitation under coarse-textured soils than medium and fine-textured soils. Additionally, they found that soils with higher soil organic carbon show reduced sensitivity to climate variability. Recently, a mediating effect of soils’ available water capacity was also revealed by Kukal et al. (2023) for U.S. maize and soybean, establishing that soils with higher available water capacity buffered negative response of yields to atmospheric dryness (high vapor pressure deficit). Qiao et al. (2022) used >12,000 site -year observations from on-farm trials across China to establish that higher quality soils, as attributed to soil organic matter, available phosphorous, and soil texture, resulted in higher and more stable crop yields, and that soil improvement can offset yield reductions under the most drastic concentration pathway. On a global scale, Oldfield et al. (2019) conducted a meta-analysis of published research concluding that maize and wheat yields increase with soil organic carbon up to 2%, which represents two-thirds of global maize and wheat acreage. Within adaptation strategies, irrigation has been discussed widely for its impacts on regional and global crop yields. Irrigated yields have been consistently shown to be greater than rainfed yields by varying magnitudes (Kukal and Irmak, 2019; Kucharik et al., 2020; Kucharik and Ramankutty, 2005), as well as have greater stability to climate variability by alleviating crop water stress (Troy et al., 2015; Li and Troy, 2018; Zhu et al., 2019; Kukal and Irmak, 2018b; Kukal and Irmak, 2020a; Zhang et al., 2015). While yield impacts of other important management aspects such as drainage and fertilization have been a subject of extensive investigation at the field scale (Drury et al., 2009; Jeong and Bhattarai, 2018; Pittelkow et al., 2017; Nasielski et al., 2020; Tenorio et al., 2020), these have not been addressed at regional to national scales amidst environmental and soil impacts.

Within the existing body of research, there are three major knowledge gaps that hinder comprehensive understanding of crop yield behavior in response to environments and regional adaptation: (a) What is the combined skill of environment, soil, and management to explain variance in crop yield and what additional value it lends when compared to the skill of each individual domain? (b) What is the relative importance of critical environmental, soil, and management attributes in affecting crop yields, i.e., is any of these factors relatively more or less important in explaining variance in crop yields? (c) Finally, what is the nature of the functional relationship between crop yield and the abovementioned drivers within their ranges observed in the long-term across regions? In this research, we address these knowledge gaps by leveraging a widely used machine learning approach to build robust models of county-level crop yields using a wide suite of environmental, soils, and management factors as predictor variables. The specific objectives of this research are to (1) build random forest models to predict county-year maize and soybean yields across conterminous United States (CONUS) during 1981-2019 that sufficiently capture yield variance using indicators that represent broader domains of environment, soils, and management influences; (2) understand the relative importance of these indicators to predict maize and soybean yields and achieve model parsimony; (3) understand controls of seasonal versus sub-seasonal environmental conditions on crop yields and its implications for modeling; and (4) visualize functional relationships between yield and its critical drivers to understand optimal and suboptimal performance environments and adaptations.

Materials and Methods

Datasets and Preprocessing

Crop Yield

Corn and soybean yield data during 1981-2019 were obtained from the National Agricultural Statistics Service (NASS)-USDA for all available counties in CONUS. A total of 73,774 and 48,608 county-year data records were obtained for corn and soybean, respectively. Yields were originally reported in bushels ac-1 and were converted to kg ac-1 for consistency.

Environmental Data

In this research, we used a total of seven variables in the environmental domain, which were either obtained directly from existing public repositories or were derived from certain fundamental variables. All variables were obtained for the growing season months (April until September) during the 1981-2019 period. Daily precipitation (Prcp), vapor pressure deficit (VPD), and ASCE grass-reference evapotranspiration (ETo, representing evaporative demand) were obtained for the entire geographical extent of CONUS from the GRIDMET dataset (Abatzoglou, 2013). The GRIDMET provides daily surface meteorological data at a spatial resolution of 1/24th degree (~4 km) for the United States. In addition, daily maximum and minimum air temperatures at 2 m were also obtained from GRIDMET, which were not employed directly in modeling but were tailored to report heat accumulation necessary and detrimental to crop growth and development, i.e., growing degree days (GDD) and extreme degree days (EDD), respectively. Monthly total actual evapotranspiration (ETa) and mean soil moisture (?) data were obtained from the TerraClimate datasets (Abatzoglou et al., 2018). These estimates are based on climatic water balance model outputs and are available globally at ~4-km spatial resolution.

Soils Data

A total of six variables were employed in the soils domain and were obtained from global OpenLandMap datasets hosted on Google Earth Engine. The soil properties obtained included sand, silt, and clay fractions (Hengl, 2018), soil organic carbon (SOC) (Hengl and Wheeler, 2018), field capacity and permanent wilting point (Hengl and Gupta, 2019), and soil bulk density (BD) (Hengl, 2018). Although field capacity and permanent wilting point were not employed directly in the models, the available water capacity, AWC (field capacity minus permanent wilting point), was calculated and used. Each soil property has been mapped globally at 250 meters spatial resolution and is available for soil surface (0 cm) and depths of 10, 30, 60, and 100 cm in the soil profile. None of the soil properties data had any temporal dimension and only varied spatially.

Management Data

Four major management regimes were represented in this research, namely intensity of irrigation, tile drainage, nitrogen (N) input, and depth to groundwater. Irrigation intensity refers to the percent of land within a land parcel that is irrigated and is reported by the Moderate Resolution Imaging Spectroradiometer (MODIS) Irrigated Agriculture Datasets for the Conterminous United States (MIrAD-US) at 250 meter spatial resolution (Brown et al., 2019). The dataset reports irrigation intensity for years 2002, 2007, 2012, and 2017, and thus, temporal changes in irrigation intensity were incorporated. We used the 2002 product to represent years 1981-2006; the 2007 product for years 2007-2011; the 2012 product for years 2012-2016, and the 2017 product for years 2017-2019. Tile drainage was represented using the AgTile-US tile drainage map presented by Valayamkunnath et al. (2020), which maps US croplands for the presence of tile drained cropland at 20 meters spatial resolution based on census, soil survey, land cover, and topographical information. There is no temporal dimension to this dataset, and the map was used to represent conditions during the entire study period. Nitrogen input to corn crop was obtained from estimates given by Zhang et al. (2021), who quantified multiple constituents of N input to corn: chemical N fertilizer use, manure N application, atmospheric N deposition, and N residues from previous-year soybean fixation. These data were available during 1970-2019 at the county level. Finally, depth to groundwater was obtained from simulations at 250 meters spatial resolution presented by Zell and Sanford (2020) and was used to represent the entire study period. Although depth to groundwater is not traditionally thought of as a management factor, it represents access to irrigation water or shallow water tables, which might in turn affect irrigation or drainage management, and thus, it was grouped within the management domain.

Data Preprocessing and Rescaling

Yield Data Normalization

Both corn and soybean yields in the U.S. have been subject to drastic improvement owing to technological, management, and breeding interventions (Lobell and Asner, 2003; Kucharik et al., 2020; Rizzo et al., 2022; Kukal and Irmak, 2019). To build robust models that are transferrable in space and time, it is critical that this time-dependency of crop yields due to factors that are challenging to quantify (technological and breeding factors) is minimized. To effectively achieve this, I normalized all county-year yields by recorded maxima in the corresponding year. By doing so, we assume that at any given time, maximal crop yields will be recorded under the most optimal settings of soils and weather. The year-specific yield maxima were calculated as the 95th percentile of crop yields recorded in all counties for each individual year. A linear function was fit to these year-specific maxima, and the predicted values of this function were used as the quantity by which yield data was normalized.

Spatiotemporal Consistency of Yield and Predictor Variables

All environmental variables were reported as either totals (Prcp, ETo, GDD, EDD, and ETa) or mean values (VPD and ?) for individual months during the growing season window (1 April until 30 September) for all years. CONUS-wide rasterized surfaces of all environmental variables were spatially aggregated to county-means using spatial reducer tools in Google Earth Engine. GDD was calculated using the approach proposed by McMaster and Wilhelm (1997), with a base temperature and upper cutoff imposed at 10°C and 30°C for both crops. Extreme-degree days were calculated as the daily mean temperature minus 30°C (Lobell et al., 2013).

Global soil data rasters were masked for the geographical extent of CONUS using Google Earth Engine and downloaded. Depth-specific soil property rasters at 0, 10, 30, 60, and 100 cm were weighted using equation 1 to quantify root-zone (0-100 cm) aggregated rasters. This aggregation was performed using maptools (Bivand et al., 2022) and raster packages (Hijmans et al., 2015) in R (R Core Team, 2022). Finally, county-specific mean values for all root zone aggregated soil properties were extracted from corresponding rasters using the exactextractr package (Baston et al., 2022) in R.

(1)

where S0, S10, S30, S60, and S100 are any given soil property reported at the surface, and depths of 10 cm, 30 cm, 60 cm, and 100 cm, respectively. A soil profile of 0-100 cm was selected to weigh soil data because it represents the effective root zone for maize and soybean. It has been experimentally shown that a 0-100 cm soil layer is responsible for dominant soil water extraction by maize and soybean crops (Hao et al., 2015; Irmak et al., 2014; Kukal and Irmak, 2020b).

Irrigation intensity and depth to groundwater rasters were sampled to extract county-level mean data using the exactextractr package (Baston et al., 2022) in RStudio. On the other hand, a tile drainage raster was sampled to calculate the sum of cropland parcels (30 m × 30 m) in each county that were equipped with tile drainage.

Random Forest Modeling

A random forest analysis was used to identify how and to what degree environmental, soils, and management variables contribute to crop yield and its variation across space and time. A primary reason why random forest technique was selected for these analyses is that, in addition to being recognized for their predictive accuracy, they are well suited for applications where predictor collinearity is expected. As observed in figure 1, many predictors among the environmental, soils, and management domains show a significant moderate-to-high correlation due to their dependence on interconnected fundamental physical processes. Due to these reasons, random forests are one of the most used data-driven approaches in the vegetation productivity modeling area (Aghighi et al., 2018; Vergopolan et al., 2021; Khanal et al., 2018; Lischeid et al., 2022; Jeong et al., 2016; Kang et al., 2020; Folberth et al., 2019).

Experiments

The following experiments were conducted: (1) management only; (2) soils only; (3) growing season-level environment only; (4) no soils, i.e., environment and management; (5) no environment, i.e., management and soils; (6) no management, i.e., environment and soils; (7) soils, growing season-level environment, and management; and (8) soils, sub-seasonal-level environment, and management. Each experiment was conducted separately, with normalized corn and soybean yields as the target (predicted) variable.

Evaluation of Predictor Importance and Feature Selection

We used the RandomForestRegressor package of scikit-learn in Python with a default parameter setting and 500 decision trees trained. For each experiment, 80% of the data records were used for training, and model performance was assessed using the remaining 20% of the data. A total of 73,697 and 46,409 yield observations were available for corn and soybean, respectively, of which 66,327 and 41,768 observations were used for training and 7,370 and 4,641 for testing. Overall model performance was assessed and reported using the squared Pearson correlation coefficient between the observations in the testing pool and the corresponding simulations, averaged across the 10 cross-validation models. The R2 represents the proportion of variance in crop yields that is captured by the predictor variables in a given model. To achieve greater rigorousness, a ten-fold cross-validation was followed, where the training and testing samples were randomly and independently selected ten times and the model was validated on ten independent samples. In addition to the feature selection that is conducted inherently within the random forest algorithm at the individual node and decision tree level, features were evaluated to be either retained or eliminated from the models based on their relative contribution and importance to model performance. This evaluation was performed using an iterative recursive feature elimination (RFE) approach outlined by the following steps:

  1. Train the model using a 10-fold cross-validation.
  2. Measure the cross-validation mean performance (R2) and set as baseline R2.
  3. Evaluate the performance of each predictor in the following sequence:
  4. Re-train the model without the given predictor.
  5. Measure the new cross-validation mean performance (R2).
  6. Calculate delta R2=baseline R2-new R2. A greater positive change in R2 implies greater importance of the predictor.
    1. Rank predictors from highest delta R2 to lowest delta R2.
  7. Remove the predictor with the lowest delta R2, updating the predictor set.
  8. Repeat steps 1-5 until the lowest delta R2 is <0.001.
  9. For this final predictor set, repeat steps 3 and 4 to rank the predictors by their importance.

Eliminating predictors that are non-predictive (show delta R2<0.001) can improve model transferability by reducing the model’s risk for overfitting as well as reducing computational cost. The RFE approach was applied to each of the experiments conducted. Lastly, the functional relationships between crop yields and all predictors in the final best-performing predictor set were visualized using partial dependence plots (Friedman, 2001).

Results and Discussion

Homogenization of Crop Yield Observations

The 95th percentile yield for each year showed a strong linear fit (p<0.01) for maize (fig. 2a) and soybean (fig. 2c), implicitly capturing the long-term yield improvement due to technological, management, and breeding interventions. The rate of 95th percentile yield improvement was 150 kg ha-1 yr-1 and 46 kg ha-1 yr-1 for maize and soybean, respectively. Over the period of record (39 years), a total improvement of 67% and 64% was recorded for maize and soybean, respectively, relative to baseline yields (1981-1990). Predicted values of year-specific 95th percentiles were used to normalize county-level yield data to remove these long-term effects. The resulting time series was free of long-term trends and presented yields as normalized values ranging from 0 to 1.6 for maize (fig. 2b) and 0 to 1.3 for soybean (fig. 2d).

Figure 1. A correlation matrix between all predictors and predicted variables. The number shown in the box represents the correlation coefficient (r) and is highlighted with a colored circle if there is a significant correlation. Blank cells imply an insignificant correlation. The size of the circle is scaled based on r, and the color of the circle shows the nature of the correlation (blue if positive and red if negative).

The normalized yield datasets eliminate the need to include time as an additional variable in the models. The annual maxima identified by the 95th percentile of crop yields represent long-term progress in crop breeding, agricultural management, and adaptation. These effects, although important, were factored out prior to RF modeling and feature elimination so that spatial and temporal variance can be considered equivalent for modeling. Moreover, normalization by annual maxima ensures that crop yields (predicted variable) are represented in the model in a way that is relative to an empirical productivity ceiling (potential) that is optimized by a combination of environment, soils, and management. This approach is different from the detrending techniques used in studies that only consider temporal variance in individual spatial units by calculating residuals from a linear fit (Lu et al., 2017; Eck et al., 2020; Kukal and Irmak, 2018b).

Individual Versus Unified Skill of Hydroclimate, Soil, and Management Factors

A comparison of experiments that included only soils, only hydroclimate, or only management factors as predictor sets showed that model performance was roughly similar across all models and both crops (fig. 3), as measured by model R2. Each individual variable class was able to explain 54%-55% of variance in crop yields, with a slight exception of the environment predictor set for soybean (R2 of 0.60). This underscores the importance of each of these predictor sets to explain more than half of the space-time variance in crop yields, which is significant, especially given the relatively coarse-scale resolution of the data (county-level). Soils and management factors were on par with space-time varying environmental parameters in terms of performance, which is promising. It's worth noting that most of these factors do not vary over time; only spatial variation is considered. Environmental predictors being responsible for explaining 60% of yield variance highlights the usefulness of environmental predictors beyond temperature and precipitation (i.e., GDD, EDD, VPD, ETo, ETa, ?), which have received the most focus in related literature. At both the U.S. (Kukal and Irmak, 2018b) and global scales (Ray et al., 2015), it has been shown that temperature and precipitation explain only between 25% and 33% of yield variance.

Figure 2. Time series of (a) observed maize yields and associated 95th percentiles for each year during 1981-2019; (b) normalized maize yields; (c) observed soybean yields and associated 95th percentiles for each year during 1981-2019; and (d) normalized soybean yields. The 95th percentiles in (a) and (c) are fit with a linear function that was used to calculate normalized yields in (b) and (d).

    (a) Maize

(b) Soybean
Figure 3. Performance of random forest models with carefully selected predictors based on recursive feature elimination applied to different experiments. The performance is measured by coefficient of determination (R2) between observed and simulated yields for the testing datasets.

Next, not accounting for each predictor set one-by-one, i.e., including two predictor sets at a time in the model development, impacted performance by different degrees. Combinations of predictor sets resulted in substantial improvement in performance (by up to 0.20) for both crops, except for the experiment where environment was not included as a predictor set (fig. 3). This implies that combining soils and management provided no additional skill as compared to what was observed when each predictor set was individually considered. Such behavior is expected since soils and some management predictors are entirely temporally static and cannot account for interannual variation, where environmental predictors are most instrumental. Combining environment predictor sets with either soils or management improved performance, even though each of these predictor sets only explained 54%-60% variance when considered individually. This observation demonstrates the synergistic effects of environment and soils, as well as environment and management, that further improve the predictive skill of crop yield. This synergy emerges when these predictor sets are used in the same model and represent biophysical interactions that influence plant functions and translate into crop yields. These synergies went unaccounted for by any of the predictor sets when used individually.

Figure 4. Observed versus predicted maize and soybean yields derived from the best-performing model that uses seasonal-level environmental, soils, and management predictor sets. Only those predictors have been included that were retained after recursive feature elimination. The color of the datapoints represent their density. Both the best fit linear function and the 1:1 line are shown. The linear function, mean squared error (MSE), and R2 are shown.

Lastly, we investigate how predictive performance increases when all three predictor sets are considered in model development. This experiment resulted in a 76% and 74% correlation between model predictions and observations of maize and soybean yields, respectively (fig. 4). Overall, the MSE was less than 0.01 units, and a 26%-28% underestimation was recorded (slope of best fit line in fig. 4). The RMSE was 0.101 and 0.093 units for maize and soybean yields, respectively. We found that model performance increased slightly (1-4 percentage points) relative to no soils and no management experiments (fig. 3). Even though a slight improvement was found, the added complexity may mean that it may be sufficient to account for soils and climate (no management) to explain crop yield variance. Nevertheless, in scenarios where data on each predictor set is available, this experiment appears to be most accurate, especially when only seasonal-level environmental information is available.

When the seasonal environment was decomposed to sub-seasonal (monthly) predictors and combined with soils and management, performance increased reasonably for both crops, where 80% of variance was explained (fig. 3). This demonstrates that a given environmental predictor can be relatively more important during a certain crop growth stage than others, and growing season mean environmental predictors can overlook these growth stage-specific dynamics (Singh et al., 2021; Wang et al., 2016; Kang et al., 2020). Decomposing the mean environment into month-specific predictors not only allows for addressing growth stage-specific crop impacts but also accounting for synergies that may have emerged with other predictor sets. The observation that 80% of crop yield variance is explainable by a combination of relatively simple, publicly available predictors (fig. 3) shows significant promise for machine learning models, especially acknowledging the spatial and temporal coarseness of datasets as well as the non-dynamic nature of soils data that was used over a long period of time. Even in carefully designed single-site, long-term experiments (no spatial component) with no soils or management differences, where interannual yield variation is solely attributable to weather, it is challenging to obtain such levels of model performance. In a recent study, statistical models developed using a 31-year long experimental dataset in the Western Corn Belt showed that seasonal and sub-seasonal weather predictors were only able to explain 64-66% of rainfed maize yield variance (Singh et al., 2021).

Figure 5. Relative importance of predictors when environmental (seasonal), soils, and management predictor sets are used to predict (a) maize (testing R2 = 0.72) and (b) soybean (testing R2=0.71) yields.

Model Parsimony and Relative Importance of Factors

Environment, soils, and management represent three major domains of variables that can influence crop yield; however, it is possible and likely that not all variables considered within each domain are equally important. Using the RFE procedure, we eliminated the predictors that did not result in significant improvement in model performance when dropped as predictors in a cross-validation approach. Each predictor set as well as all combinations of predictor sets were evaluated using RFE to identify the most important predictors. Within the management only class, tile drainage and depth to groundwater were the only two predictors that were retained following RFE for both crops. Within the soils only class, bulk density was the only retained predictor for soybean, while both bulk density and soil organic carbon were retained for maize. The environment only predictor set was the only one where all seven original predictors were retained after RFE. Within these predictors, GDD and EDD were the two most important predictors for both crops, which resulted in a significant drop of model explanatory power (R2 drop of 0.09 and 0.04, respectively) when ignored. Following GDD and EDD, the most important predictors were ?, ETo, ETa, VPD, and Prcp for maize, and ETa, ?, ETo, Prcp, and VPD for soybean. Thus, the most important predictors of crop yields within the environment domain were heat accumulation, extreme heat, and hydroclimatology.

When environment and management predictor sets were used in the model (fig. 5), tile drainage and irrigation intensity emerged as the most important predictors for both crops. While VPD was dropped for maize, all predictors were retained after RFE for soybean. Tile drainage and irrigation intensity resulted in a roughly similar loss of model explanatory power (0.04) for maize; however, tile drainage was much more important (0.06 vs. 0.03) than irrigation intensity for soybean. For both crops, water management factors were followed in relative importance by GDD, EDD, and depth to groundwater, although in different orders. Hydroclimatic factors such as ETo, ETa, ?, and Prcp were ranked low for both crops. When soils and management predictor sets were evaluated in combination, all soil factors were eliminated upon RFE, and only tile drainage (more important for maize) and depth to groundwater (more important for soybean) were retained. Thus, with only two management factors retained, the performance was the poorest (R2 of 0.54) of the three combinations with two predictor sets. Lastly, a combination of environment and soils predictor sets (fig. 6) showed the best performance of the three (R2: 0.73-0.74). Within these predictor sets, extreme heat (EDD) was the most important predictor across both crops. GDD followed closely, roughly being similarly important as EDD for soybean, but about half as important as EDD for maize. Interestingly, soil structure (BD, SOC) was relatively more important than soil texture (silt, clay) for maize, whereas the opposite was true for soybean. Moreover, evaporative demand (ETo) was more important than soil texture for maize, and water use (ETa) was more important than soil structure for soybean. Beyond this, other hydroclimatology indicators were positioned similarly (low importance) for both crops. Sand content and VPD were ranked the least important for both crops, likely because they did not add much new information to the model beyond what was already accounted for by other predictors. Since the sum of sand, silt, and clay is 100, information on clay and silt should be sufficient in describing the particle size distribution of the soil profile, as supported by high correlation seen among sand vs. clay and sand vs. silt (fig. 1). Similarly, VPD is encapsulated within ETo formulation, and hence was eliminated.

Figure 6. Relative importance of predictors when environmental (seasonal), soils, and management predictor sets are used to predict (a) maize (testing R2 = 0.74) and (b) soybean (testing R2=0.73) yields.

Finally, when all three predictor sets were used as model inputs for RFE, extreme heat still prevailed as the most important predictor for both crops (fig. 7). Beyond EDD, both crops showed less consistency in relative importance among predictors. For maize, the top three predictors represented environment (EDD), management (tile drainage), and soil texture (silt), while for soybean, the top three predictors were all environmental attributes (EDD, ETa, and GDD). In fact, any of the soil factors did not appear as an important predictor for soybean yield until the eighth place. Both tile drainage and irrigation intensity were among the top predictors for crop yields. For both crops, tile drainage was more important than irrigation intensity. Hydroclimate indicators were relatively more important for soybean than maize, with ETa being the second most important predictor. Additionally, N input was moderately important for maize, while it was eliminated for soybean when included as a predictor. For both maize and soybean, the first half (8) predictors that were selected by RFE were sufficient to explain >70% of variance in yields, and further increases in the number of predictors only increased model performance slightly (by 0.03-0.04). Sand was eliminated by RFE for both crops due to sufficient textural information conveyed by silt and clay.

Upon using sub-seasonal environmental, soils, and management predictor classes, the total predictors included in the model increased substantially (up to 52), however, not all of them were useful. A total of 31 and 29 predictors were eliminated by RFE for maize and soybean, respectively. Furthermore, only 8 predictors were required to achieve model performance of >0.70, and 12 predictors were required to achieve model performance of >0.75 for both crops. The relative importance of top predictors was very similar for maize and soybean yields. In contrast to what was observed when seasonal environment was used, irrigation intensity was the most important predictor, followed by tile drainage, silt, and July precipitation for both crops (fig. 8). For soybean, silt and clay were the two soil predictors included in the top 12 predictors, while SOC was also important for maize. Precipitation during June and July for maize and during July, August, and September for soybean were important predictors. In addition, other management factors like N input and depth to groundwater were also included in the top predictors for maize yield. Overall, the relative importance observed here underscores that the top predictors belong to all three predictor sets and signifies the synergy and importance of accounting for them as drivers of crop yield. Figure 8 presents the relative importance of all predictors that were retained by RFE for sub-seasonal environment, soils, and management.

Functional Relationships Between Crop Yield and Predictors

Using the best-performing predictor selection (seasonal environment, soils, and management), we developed partial dependence plots to individually visualize predictors’ relationships with yields as determined by the model. Figures 9 and 10 present these functional relationships between maize and soybean yields, respectively, and their predictors. The plots are arranged in the order of predictor importance as determined by delta R2. The x-axes for each plot represent a range of the 5th to 95th percentile of predictor data to minimize any aberrations originating from outliers. The vertical dashes on the X-axes indicate the deciles of the predictors, which convey the distribution of predictor values, allowing for identifying sparsely populated ranges. Accordingly, the interpretations of the plots should only be limited within the range covered by the deciles.

Figure 7. Relative importance of predictors when environmental (seasonal), soils, and management predictor sets are used to predict (a) maize (testing R2 = 0.76) and (b) soybean (testing R2=0.74) yields.
Figure 8. Relative importance of predictors when environmental (seasonal), soils, and management predictor sets are used to predict (a) maize (testing R2 = 0.80) and (b) soybean (testing R2=0.80) yields.

EDD, which was the most important predictor for both crops, affected yields negatively throughout its observed range. The response was non-linear, with an initial increase in EDD resulting in a greater loss, which consequently changed to a lower rate of loss at higher EDD. Based on these observations, virtually any EDD accumulation has resulted in yield penalties historically. There is a strong non-linear response of yields to GDD, where yields increase until a certain number of seasonal GDD (around 2100 °C), thereby decreasing abruptly. For maize, an aberration in yield response between 1700 and 2000 °C was observed but was minor compared to the behavior across the complete GDD range. This aberration may be a result of a constant growing season window used for GDD aggregation, which may not coincide with the actual growing season in some regions. The negative yield response at higher GDD increases implies that the cutoff temperature used for GDD calculation may be higher than what is ideal and that crop growth and development may be negatively impacted at lower temperatures than those considered here (30°C). Precipitation affected crop yields in a strong non-linear fashion: a sharp increase until about 600 mm and thereby decreasing at a smaller rate. Yields increased linearly with both ETo and ETa, indicating that evaporative demand translated into actual crop water use and that no water limitations were encountered on a seasonal level. This is supported by the crop yield response to ?, which represents no yield loss at a lower range of seasonal-level soil water. The monotonic decrease in crop yields in response to ? found here is atypical to the physical understanding of crop water stress. This is because ETa and ? used in this research are calculated using a simple one-dimensional monthly climatic water balance that does not explicitly account for plant water uptake (Abatzoglou et al., 2018). In this case, soil water is only extracted when ETo exceeds precipitation, ignoring biological factors such as crop growth and water uptake patterns. Given the availability of observed soil moisture to be included as a predictor, it is likely that a non-linear response is derived that is analogous to precipitation. However, measured spatiotemporal soil moisture data acquisition has only recently (since 2015) been initiated with satellite missions such as SMAP (Entekhabi et al., 2010) and was not available for the study duration of this research.

Figure 9. Partial dependence plots for maize yields (normalized) and all selected predictors as identified from the best-performing model that uses seasonal-level environmental, soils, and management predictor sets. Only those predictors have been included that were retained after recursive feature elimination. The slopes of the partial dependence plots indicate the sensitivity of the response variable to the given predictor. Tick marks on the X-axes indicate the deciles calculated from the distribution of each predictor in the training observations, and thus the plots should be interpreted only within the extreme ticks.

Besides environmental predictors, the RF model was also able to capture signals in crop yield vs. soil parameters, despite the temporally stationary nature of these predictors. Among soil textural indicators, silt and clay were retained by RFE for both crops, which responded positively and negatively, respectively, as the percent composition of each increased. Yields were relatively stable until an initial increase in% clay (until 23%-24%), after which they responded negatively throughout the remainder of the range. Among soil structural predictors, yield responded negatively as bulk density increased within the range observed (1.55-1.65 g cm-3), and this response was much clearer in soybean than maize, which showed some aberrations. Yields also increased as SOC increased between 1.75% and 3%, signifying the soil carbon sequestration benefit to plant productivity (Oldfield et al., 2019). Soil texture and structure are conjunctively responsible for soil AWC via micro and microporosity (Libohova et al., 2018). As AWC increased, maize yields increased after a threshold AWC was reached (around 14%), while soybean yield increased even with initial increase in AWC (Minasny and McBratney, 2018). Although soil texture, soil structure, and AWC are moderately correlated (fig. 1), using them concurrently in the RF modeling framework provides for the ability to leverage both their independent effects and interactions to explain crop yield variance. The fact that each of them was retained in the model after RFE speaks to their ability to add significant value to the model.

Figure 10. Partial dependence plots for soybean yields (normalized) and all selected predictors as identified from the best-performing model that uses seasonal-level environmental, soils, and management predictor sets. Only those predictors have been included that were retained after recursive feature elimination. The slopes of the partial dependence plots indicate the sensitivity of the response variable to the given predictor. Tick marks on the X-axes indicate the deciles calculated from the distribution of each predictor in the training observations, and thus the plots should be interpreted only within the extreme ticks.

Regional representation of water management practices, i.e., irrigation intensity and acreage of tile drainage in partial dependence plots, revealed a strong positive response of yields to both. County-aggregated irrigation intensity was concentrated mostly below 5% across the U.S., and both crops showed a sharp increase in yield within this range. Similarly, yields responded positively to an increase in the acreage of tile drained land within a county. This response is anticipated as the presence of irrigation and drainage can mitigate impacts from dry and flooded conditions and improve soil water availability conditions to enhance yield potential. Although irrigation intensity was obtained at several temporal intervals, tile drainage data were temporally static, and the spatial distribution of large-scale investment in tile drainage infrastructure in a region is mostly limited to the Midwest. Regions with higher irrigation intensity, however, are distributed across many regions in the nation. Previous research has shown that irrigation does not only improve yields but can also absorb shocks from extreme weather (Kukal and Irmak, 2019; Kukal and Irmak, 2020; Troy et al., 2015; Wang et al., 2021). This is an ideal example of how management and environmental conditions can interact to affect crop yields, and such interactions are captured by the model’s partial dependences. Additionally, crop yields increased as the depth to groundwater increased, and a depth of a few meters can be detrimental to root functions. This effect was also observed by Huang et al. (2021) for seven major crops in the U.S.; however, they showed that an intermediate depth to groundwater was ideal and deeper depths negatively affected crop yields, although our model did not capture any such negative yield impacts. For maize, higher nitrogen input rates also translated into greater yields until around 200 lb ha-1, after which they plateaued, representing diminishing returns upon excessive N input.

The partial dependence plots deduced from what the model learned from observations (n= 73,774 and 48,608 for maize and soybean, respectively) represent a 16-dimensional (16D) response surface (15 predictors and yield). Since a 16D response surface cannot be easily plotted and interpreted, figures 9 and 10 simplify these individual responses as 2D plots. The relationships deciphered by these plots highlight the limitations of traditionally used linear, polynomial, and logarithmic models, which may oversimplify yield dependencies on their biophysical drivers.

A significant contribution of this research is to document the ability of modern machine learning algorithms to extract patterns of complex interactions in biophysical drivers from rather coarsely collected datasets. This ability is evidenced as different variants in these models explaining 70%-80% of variance in maize and soybean yields, which is on the higher end of performance in the literature. Quite interestingly, this performance is very similar to recent performance (R2=0.78) of a popular process-based model, the Agricultural Production Systems sIMulator (APSIM), to predict NASS county-level maize yields in Illinois, Indiana, and Iowa (Shahhosseini et al., 2021). In another recent study, APSIM was used to estimate NASS county-level maize yields across 12 Corn Belt states, explaining 67% of variance (Sajid et al., 2022). Jiang et al. (2019) used more sophisticated predictors based on remotely sensed crop phenology combined with meteorology in a long short-term memory model to explain 76% of maize yield variation in the US Corn Belt. Jeong et al., (2016) used RF models to predict US maize yields by using weather variables as well as year as a predictor, resulting in 76% of variance explained. A primary takeaway is that regional crop yield variance can only be limitedly explained by either of the environmental, soils, or management predictor sets, and that synergy between at least two of these predictor sets is required to enhance the predictive skill of crop yields to an above-satisfactory level (three-quarters of variance explained). Furthermore, such synergy can be leveraged at a much lower cost of data acquisition since soils and management data do not necessarily have to be collected at high temporal frequencies. As evident from our findings, features such as soil texture, soil structure, and tile drainage represent only a single point in time but emerge as important predictors. Other management features such as irrigation intensity (collected every 5 years) and N inputs (annual) are easily retrievable via remote sensing (Pervez and Brown, 2010) and fertilizer sales data (Zhang et al., 2021; Fixen et al., 2012), respectively. It appears from the literature that studies relying on costlier (time, effort, and computationally) variables such as remotely sensed crop conditions do not substantially improve performance. Thus, it might be advantageous to use simple and cost-effective predictor variables against more involved variables that require significant preprocessing prior to inclusion. In fact, this was explicitly encountered in this research, as the “no management” experiment resulted in only slightly lower performance than the “all predictor sets” experiment. An integral task for successful machine learning models is feature selection, i.e., selecting a small but relevant group of variables among candidates to avoid the risk of overfitting and general robustness to different datasets. Our final models presented here are derived from an exhaustive RFE procedure with cross-validation to ensure the retention of important features across independent subsets of training/testing data. Even prior to RFE, our selection of candidate variables was aware of known mechanisms in the soil-plant-atmosphere continuum that affect crop yields, which is often termed as expert knowledge. Informed selection was confirmed by the modeling as well, since RFE procedures eliminated only 2-3 predictors for seasonal models. Such expert oversight over candidate predictor selection in machine learning has been emphasized in the literature (Lischeid et al., 2022).

Overall, our framework driven by machine learning applied to relating both spatial and temporal variation in crop yields to diverse features can be advantageous to traditional crop yield modeling efforts. Largely, crop yield modeling studies focus on capturing temporal variation at single locations or a few locations because such efforts require extensive site characterization to calibrate and parameterize models (Leng and Hall, 2020; Lobell and Asseng, 2017). Such studies are incapable of equivalently treating spatial and temporal variation in the crop models, especially since they assume a constant hybrid/variety for simulations over a long period of time during which yields have increased substantially (fig. 2). In this scenario, using normalized survey yield records to expand the pool of observations across space and time for the model to train on is a major advantage. The strategy of treating spatial and temporal yield variation equivalently allows for inferring crop response to predictors within a range that captures the historical extent of its variability. These predictor ranges can be well beyond what is encountered or projected for a single site. This approach leverages the diversity in environment, soils, and management as a “natural experiment” to benchmark multidimensional impacts on crop yields, which otherwise is challenging to determine via experimentation or physically-based crop models. For example, monitoring temporal change in soil characteristics is an important need but is also very challenging (Wills et al., 2017), especially improvement in soil structure change upon soil carbon sequestration assisted by climate-smart management. In such scenarios, understanding how crop yields have responded to spatial change in SOC, BD, and AWC can greatly inform our understanding of the projected impacts of temporal soil change.

Interpretation of the findings from this research requires some caution. First, predictors of low importance should not be thought of as irrelevant for influencing crop yields. A low relative importance only suggests that the given predictor did not provide novel information to the model beyond what was already conveyed by other predictors. For example, the low relative importance of VPD for both maize and soybean is because other predictors such as ETo and ETa already encapsulate the effects of VPD in their observational records. Similarly, although AWC has been shown to be a critical determinant of crop functions and hence yield, other soil characteristics such as silt, clay, BD, and SOC capture the spatial representation of AWC in the model. Second, it is acknowledged that seasonally aggregated and monthly aggregated environmental predictors may be too coarse to capture finer-scale impacts on crop yields. In some instances, individual sporadic extreme events such as high-intensity precipitation episodes, hailstorms, frosts, and extremely high heat may significantly affect plant functions. Similarly, our analysis did not distinguish between crop growth stages, which can show varying sensitivity to the same environmental patterns. The primary reasoning behind overlooking these aspects was to maintain the simplicity of predictors and avoid complex preconditioning of predictors. For example, growing season initiation, termination, and duration vary substantially not only across the U.S. but also temporally (both long-term shifts and interannual variability). Such preconditioning of environmental predictors that account for this variation would drastically increase the complexity and processing cost. Previous work has shown that monthly aggregation of predictors was equivalent to 8-16 days aggregation (Kang et al., 2020). Lastly, the findings of this research are based on the variability encountered in crop yields and its drivers aggregated across the entire conterminous U.S. To infer whether certain predictors have stronger controls in specific regions, it may be useful to develop independent regional models as an avenue for future work. One challenge that may be encountered in this direction is lower training datapoints when split by region, in which case model robustness may be compromised. However, novel remotely sensed yield/biomass datasets (Deines et al., 2021; Guan et al., 2017) at much higher resolutions (km scale) seem promising to help increase the amount of training data by several fold and match the much higher spatial resolution of environmental, soil, and management layers.

Conclusions

The random forest modeling approach was combined with robust recursive feature selection procedures to identify diverse drivers of crop yield variability that belong to environmental, soil, and management domains. Based on analyses of selected predictors and their importance, it was concluded that it is indispensable to consider synergies among environmental, soil, and management factors to explain a substantial portion of maize and soybean yield variance in the U.S. counties during 1981-2019. The best-performing models explained 75%-80% of variance across space and time combined after accounting for long-term crop yield improvements and using critically selected predictors from all three domains. Environmental predictors were non-negotiable for model performance, while a combination of environment with either soils or management approached the performance of the best-performing model. At the seasonal level, extreme heat represented by EDD was the most important predictor for both crops, followed by tile drainage and soil texture for maize, and ETa, and GDD for soybean. When a sub-seasonal environment was used with soils and management predictors, the relative importance changed, with irrigation intensity being the most important predictor, followed by tile drainage, silt, and July precipitation for both crops. Although the inclusion of sub-seasonal weather increased the total predictors drastically, only 8 and 12 predictors were sufficient to achieve model performance of >0.70 and 0.75, respectively. A model understanding of multidimensional empirical associations among observations was presented as 2D partial dependence plots to elucidate the nature of relationships between crop yields and critical predictor variables. These visual aids allow for projecting the functional aspects of crop yield variation with any change in its drivers that is expected across different agricultural production systems as well as time. This research documents the value and necessity of including soils and/or management indicators in addition to commonly employed environmental/climate predictors in modeling efforts aimed at explaining agricultural productivity over large regions and long periods. With such synergy appropriately represented in models, machine learning approaches can not only supplement but may also present at par success to that of physically-based crop models, largely increasing the efficiency and reducing the data acquisition and computational costs incurred in these extensive undertakings.

Acknowledgments

This work was supported by USDA-NIFA Grant Number 2023-67019-39837 and USDA-NRCS Grant Number NR233A750023C025.

References

Abatzoglou, J. T. (2013). Development of gridded surface meteorological data for ecological applications and modelling. Int. J. Climatol., 33(1), 121-131. https://doi.org/10.1002/joc.3413

Abatzoglou, J. T., Dobrowski, S. Z., Parks, S. A., & Hegewisch, K. C. (2018). TerraClimate, a high-resolution global dataset of monthly climate and climatic water balance from 1958–2015. Sci. Data, 5(1), 170191. https://doi.org/10.1038/sdata.2017.191

Aghighi, H., Azadbakht, M., Ashourloo, D., Shahrabi, H. S., & Radiom, S. (2018). Machine learning regression techniques for the silage maize yield prediction using time-series images of Landsat 8 OLI. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 11(12), 4563-4577. https://doi.org/10.1109/JSTARS.2018.2823361

Baston, D. & ISciences L.L.C. (2022). Package ‘exactextractr’. terra.

Bivand, R. (2022). R packages for analyzing spatial data: A comparative case study with areal data. Geogr. Anal., 54(3), 488-518. https://doi.org/10.1111/gean.12319

Brown, J. F., Howard, D. M., Shrestha, D., & Benedict, T. D. (2019). Moderate Resolution Imaging Spectroradiometer (MODIS) irrigated agriculture datasets for the conterminous United States (MIrAD-US). US Geological Survey. https://doi.org/10.5066/P9NA3EO8

Butler, E. E., & Huybers, P. (2015). Variations in the sensitivity of US maize yield to extreme temperatures by region and growth phase. Environ. Res. Lett., 10(3), 34009. https://doi.org/10.1088/1748-9326/10/3/034009

Deines, J. M., Patel, R., Liang, S.-Z., Dado, W., & Lobell, D. B. (2021). A million kernels of truth: Insights into scalable satellite maize yield mapping and yield gap analysis from an extensive ground dataset in the US Corn Belt. Remote Sens. Environ., 253, 112174. https://doi.org/10.1016/j.rse.2020.112174

Drury, C. F., Tan, C. S., Reynolds, W. D., Welacky, T. W., Oloya, T. O., & Gaynor, J. D. (2009). Managing tile drainage, subirrigation, and nitrogen fertilization to enhance crop yields and reduce nitrate loss. J. Environ. Qual., 38(3), 1193-1204. https://doi.org/10.2134/jeq2008.0036

Eck, M. A., Murray, A. R., Ward, A. R., & Konrad, C. E. (2020). Influence of growing season temperature and precipitation anomalies on crop yield in the southeastern United States. Agric. For. Meteorol., 291, 108053. https://doi.org/10.1016/j.agrformet.2020.108053

Entekhabi, D., Njoku, E. G., O’ Neill, P. E., Kellogg, K. H., Crow, W. T., Edelstein, W. N.,... Van Zyl, J. (2010). The soil moisture active passive (SMAP) mission. Proc. IEEE, 98(5), 704-716. https://doi.org/10.1109/JPROC.2010.2043918

Fishman, R. (2016). More uneven distributions overturn benefits of higher precipitation for crop yields. Environ. Res. Lett., 11(2), 24004. https://doi.org/10.1088/1748-9326/11/2/024004

Fixen, P. E., Williams, R., & Rund, Q. B. (2012). NUGIS: A nutrient use geographic information system for the US. International Plant Nutrition Institute.

Folberth, C., Baklanov, A., Balkovic, J., Skalský, R., Khabarov, N., & Obersteiner, M. (2019). Spatio-temporal downscaling of gridded crop model yield estimates based on machine learning. Agric. For. Meteorol., 264, 1-15. https://doi.org/10.1016/j.agrformet.2018.09.021

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Ann. Stat., 29(5), 1189-1232, 44.

Frisvold, G., & Bai, T. (2016). Irrigation technology choice as adaptation to climate change in the western United States. J. Contemp. Water Res. Educ., 158(1), 62-77. https://doi.org/10.1111/j.1936-704X.2016.03219.x

Guan, K., Wu, J., Kimball, J. S., Anderson, M. C., Frolking, S., Li, B.,... Lobell, D. B. (2017). The shared and unique values of optical, fluorescence, thermal and microwave satellite data for estimating large-scale crop yields. Remote Sens. Environ., 199, 333-349. https://doi.org/10.1016/j.rse.2017.06.043

Hao, B., Xue, Q., Marek, T. H., Jessup, K. E., Hou, X., Xu, W.,... Bean, B. W. (2015). Soil water extraction, water use, and grain yield by drought-tolerant maize on the Texas High Plains. Agric. Water Manag., 155, 11-21. https://doi.org/10.1016/j.agwat.2015.03.007

Hengl, T. (2018). Soil texture classes (USDA system) for 6 soil depths (0, 10, 30, 60, 100 and 200 cm) at 250 m (Version v2) - Data set. Zenodo. https://doi.org/10.5281/zenodo.1475451

Hengl, T., & Gupta, S. (2019). Soil water content (volumetric %) for 33kPa and 1500kPa suctions predicted at 6 standard depths (0, 10, 30, 60, 100 and 200 cm) at 250 m resolution (v0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.2629589

Hengl, T., & Wheeler, I. (2018). Soil organic carbon content in x 5 g / kg at 6 standard depths (0, 10, 30, 60, 100 and 200 cm) at 250 m resolution (v0.2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.2525553

Hijmans, R. J., Van Etten, J., Cheng, J., Mattiuzzi, M., Sumner, M., Greenberg, J. A.,... Shortridge, A. (2015). Package ‘raster’. 734, 473. R package.

Huang, J., Hartemink, A. E., & Kucharik, C. J. (2021). Soil-dependent responses of US crop yields to climate variability and depth to groundwater. Agric. Syst., 190, 103085. https://doi.org/10.1016/j.agsy.2021.103085

Irmak, S., Specht, J. E., Odhiambo, L. O., Rees, J. M., & Cassman, K. G. (2014). Soybean yield, evapotranspiration, water productivity, and soil water extraction response to subsurface drip irrigation and fertigation. Trans. ASABE, 57(3), 729-748. https://doi.org/10.13031/trans.57.10085

Jeong, H., & Bhattarai, R. (2018). Exploring the effects of nitrogen fertilization management alternatives on nitrate loss and crop yields in tile-drained fields in Illinois. J. Environ. Manag., 213, 341-352. https://doi.org/10.1016/j.jenvman.2018.02.062

Jeong, J. H., Resop, J. P., Mueller, N. D., Fleisher, D. H., Yun, K., Butler, E. E.,... Kim, S.-H. (2016). Random forests for global and regional crop yield predictions. PLoS One, 11(6), e0156571. https://doi.org/10.1371/journal.pone.0156571

Jiang, H., Hu, H., Zhong, R., Xu, J., Xu, J., Huang, J., ... & Lin, T. (2020). A deep learning approach to conflating heterogeneous geospatial data for corn yield estimation: A case study of the US Corn Belt at the county level. Global change biology, 26(3), 1754-1766.

Kang, Y., Ozdogan, M., Zhu, X., Ye, Z., Hain, C., & Anderson, M. (2020). Comparative assessment of environmental variables and machine learning algorithms for maize yield prediction in the US Midwest. Environ. Res. Lett., 15(6), 064005. https://doi.org/10.1088/1748-9326/ab7df9

Khanal, S., Fulton, J., Klopfenstein, A., Douridas, N., & Shearer, S. (2018). Integration of high resolution remotely sensed data and machine learning techniques for spatial prediction of soil properties and corn yield. Comput. Electron. Agric., 153, 213-225. https://doi.org/10.1016/j.compag.2018.07.016

Kucharik, C. J., & Ramankutty, N. (2005). Trends and variability in U.S. corn yields over the twentieth century. Earth Interact, 9(1), 1-29. https://doi.org/10.1175/EI098.1

Kucharik, C. J., Ramiadantsoa, T., Zhang, J., & Ives, A. R. (2020). Spatiotemporal trends in crop yields, yield variability, and yield gaps across the USA. Crop Sci., 60(4), 2085-2101. https://doi.org/10.1002/csc2.20089

Kukal, M. S., & Irmak, S. (2018a). U.S. agro-climate in 20th century: Growing degree days, first and last frost, growing season length, and impacts on crop yields. Sci. Rep., 8(1), 6977. https://doi.org/10.1038/s41598-018-25212-2

Kukal, M. S., & Irmak, S. (2018b). Climate-driven crop yield and yield variability and climate change impacts on the U.S. Great Plains agricultural production. Sci. Rep., 8(1), 3450. https://doi.org/10.1038/s41598-018-21848-2

Kukal, M. S., & Irmak, S. (2019). Irrigation-limited yield gaps: Trends and variability in the United States post-1950. Environ. Res. Commun., 1(6), 061005. https://doi.org/10.1088/2515-7620/ab2aee

Kukal, M. S., & Irmak, S. (2020b). Characterization of water use and productivity dynamics across four C3 and C4 row crops under optimal growth conditions. Agric. Water Manag., 227, 105840. https://doi.org/10.1016/j.agwat.2019.105840

Kukal, M. S., & Irmak, S. (2020a). Impact of irrigation on interannual variability in United States agricultural productivity. Agric. Water Manag., 234, 106141. https://doi.org/10.1016/j.agwat.2020.106141

Kukal, M. S., Irmak, S., Dobos, R., & Gupta, S. (2023). Atmospheric dryness impacts on crop yields are buffered in soils with higher available water capacity. Geoderma, 429, 116270. https://doi.org/10.1016/j.geoderma.2022.116270

Lambert, L. H., English, B. C., Clark, C. C., Lambert, D. M., Menard, R. J., Hellwinckel, C. M.,... Papanicolaou, A. (2021). Local effects of climate change on row crop production and irrigation adoption. Clim. Risk Manag., 32, 100293. https://doi.org/10.1016/j.crm.2021.100293

Leng, G., & Hall, J. W. (2020). Predicting spatial and temporal variability in crop yields: An inter-comparison of machine learning, regression and process-based models. Environ. Res. Lett., 15(4), 044027. https://doi.org/10.1088/1748-9326/ab7b24

Li, X., & Troy, T. J. (2018). Changes in rainfed and irrigated crop yield response to climate in the western US. Environ. Res. Lett., 13(6), 064031. https://doi.org/10.1088/1748-9326/aac4b1

Li, Y., Guan, K., Schnitkey, G. D., DeLucia, E., & Peng, B. (2019). Excessive rainfall leads to maize yield loss of a comparable magnitude to extreme drought in the United States. Global Change Biol., 25(7), 2325-2337. https://doi.org/10.1111/gcb.14628

Libohova, Z., Seybold, C., Wysocki, D., Wills, S., Schoeneberger, P., Williams, C.,... Owens, P. R. (2018). Reevaluating the effects of soil organic matter and other properties on available water-holding capacity using the National Cooperative Soil Survey Characterization Database. J. Soil Water Conserv., 73(4), 411-421. https://doi.org/10.2489/jswc.73.4.411

Lin, T., Zhong, R., Wang, Y., Xu, J., Jiang, H., Xu, J.,... Li, H. (2020). DeepCropNet: A deep spatial-temporal learning framework for county-level corn yield estimation. Environ. Res. Lett., 15(3), 034016. https://doi.org/10.1088/1748-9326/ab66cb

Lischeid, G., Webber, H., Sommer, M., Nendel, C., & Ewert, F. (2022). Machine learning in crop yield modelling: A powerful tool, but no surrogate for science. Agric. For. Meteorol., 312, 108698. https://doi.org/10.1016/j.agrformet.2021.108698

Lobell, D. B., & Asner, G. P. (2003). Climate and management contributions to recent trends in U.S. agricultural yields. Science, 299(5609), 1032-1032. https://doi.org/10.1126/science.1078475

Lobell, D. B., & Asseng, S. (2017). Comparing estimates of climate change impacts from process-based and statistical crop models. Environ. Res. Lett., 12(1), 015001. https://doi.org/10.1088/1748-9326/aa518a

Lobell, D. B., Hammer, G. L., McLean, G., Messina, C., Roberts, M. J., & Schlenker, W. (2013). The critical role of extreme heat for maize production in the United States. Nat. Clim. Change, 3(5), 497-501. https://doi.org/10.1038/nclimate1832

Lu, J., Carbone, G. J., & Gao, P. (2017). Detrending crop yield data for spatial visualization of drought impacts in the United States, 1895–2014. Agric. For. Meteorol., 237-238, 196-208. https://doi.org/10.1016/j.agrformet.2017.02.001

Ma, Z., Guan, K., Peng, B., Sivapalan, M., Li, L., Pan, M.,... Zhang, J. (2023). Agricultural nitrate export patterns shaped by crop rotation and tile drainage. Water Res., 229, 119468. https://doi.org/10.1016/j.watres.2022.119468

McMaster, G. S., & Wilhelm, W. W. (1997). Growing degree-days: One equation, two interpretations. Agric. For. Meteorol., 87(4), 291-300. https://doi.org/10.1016/S0168-1923(97)00027-0

Miller, N., Tack, J., & Bergtold, J. (2021). The impacts of warming temperatures on?US sorghum yields and the potential for adaptation. Am. J. Agric. Econ., 103(5), 1742-1758. https://doi.org/10.1111/ajae.12223

Minasny, B., & McBratney, A. B. (2018). Limited effect of organic matter on soil available water capacity. Eur. J. Soil Sci., 69(1), 39-47. https://doi.org/10.1111/ejss.12475

Nasielski, J., Grant, B., Smith, W., Niemeyer, C., Janovicek, K., & Deen, B. (2020). Effect of nitrogen source, placement and timing on the environmental performance of economically optimum nitrogen rates in maize. Field Crops Res., 246, 107686. https://doi.org/10.1016/j.fcr.2019.107686

Oldfield, E. E., Bradford, M. A., & Wood, S. A. (2019). Global meta-analysis of the relationship between soil organic matter and crop yields. SOIL, 5(1), 15-32. https://doi.org/10.5194/soil-5-15-2019

Ortiz-Bobea, A., Wang, H., Carrillo, C. M., & Ault, T. R. (2019). Unpacking the climatic drivers of US agricultural yields. Environ. Res. Lett., 14(6), 064003. https://doi.org/10.1088/1748-9326/ab1e75

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., & Dubourg, V. (n.d.). sklearn. ensemble. RandomForestRegressor. sklearn. ensemble. RandomForestRegressor. Retrieved from https://scikit-learn.org/0.16/modules/generated/sklearn.ensemble.RandomForestRegressor.html

Peichl, M., Thober, S., Meyer, V., & Samaniego, L. (2018). The effect of soil moisture anomalies on maize yield in Germany. Nat. Hazards Earth Syst. Sci., 18(3), 889-906. https://doi.org/10.5194/nhess-18-889-2018

Pervez, M. S., & Brown, J. F. (2010). Mapping irrigated lands at 250-m scale by Merging MODIS data and national agricultural statistics. Remote Sens., 2(10), 2388-2412.

Pittelkow, C. M., Clover, M. W., Hoeft, R. G., Nafziger, E. D., Warren, J. J., Gonzini, L. C., & Greer, K. D. (2017). Tile drainage nitrate losses and corn yield response to fall and spring nitrogen management. J. Environ. Qual., 46(5), 1057-1064. https://doi.org/10.2134/jeq2017.03.0109

Proctor, J. (2023). Extreme rainfall reduces rice yields in China. Nat. Food, 4(5), 360-361. https://doi.org/10.1038/s43016-023-00757-2

Qiao, L., Wang, X., Smith, P., Fan, J., Lu, Y., Emmett, B.,... Fan, M. (2022). Soil quality both increases crop production and improves resilience to climate change. Nat. Clim. Change, 12(6), 574-580. https://doi.org/10.1038/s41558-022-01376-8

Ray, D. K., Gerber, J. S., MacDonald, G. K., & West, P. C. (2015). Climate variation explains a third of global crop yield variability. Nat. Commun., 6(1), 5989. https://doi.org/10.1038/ncomms6989

Rigden, A. J., Mueller, N. D., Holbrook, N. M., Pillai, N., & Huybers, P. (2020). Combined influence of soil moisture and atmospheric evaporative demand is important for accurately predicting US maize yields. Nat. Food, 1(2), 127-133. https://doi.org/10.1038/s43016-020-0028-7

Rizzo, G., Monzon, J. P., Tenorio, F. A., Howard, R., Cassman, K. G., & Grassini, P. (2022). Climate and agronomy, not genetics, underpin recent maize yield gains in favorable environments. Proc. Natl. Acad. Sci., 119(4), e2113629119. https://doi.org/10.1073/pnas.2113629119

Sajid, S. S., Shahhosseini, M., Huber, I., Hu, G., & Archontoulis, S. V. (2022). County-scale crop yield prediction by integrating crop simulation with machine learning models. Front. Plant Sci., 13. https://doi.org/10.3389/fpls.2022.1000224

Shahhosseini, M., Hu, G., Huber, I., & Archontoulis, S. V. (2021). Coupling machine learning and crop modeling improves crop yield prediction in the US Corn Belt. Scientific reports, 11(1), 1606.

Singh, A., Kukal, M. S., Shapiro, C. A., Snow, D. D., Irmak, S., & Iqbal, J. (2021). Growth phase-specific evaporative demand and nighttime temperatures determine Maize (Zea Mays L.) yield deviations as revealed from a long-term field experiment. Agric. For. Meteorol., 308-309, 108543. https://doi.org/10.1016/j.agrformet.2021.108543

Tenorio, F. A., McLellan, E. L., Eagle, A. J., Cassman, K. G., Andersen, D., Krausnick, M.,... Grassini, P. (2020). Benchmarking impact of nitrogen inputs on grain yield and environmental performance of producer fields in the western US Corn Belt. Agric. Ecosyst. Environ., 294, 106865. https://doi.org/10.1016/j.agee.2020.106865

Tollenaar, M., Fridgen, J., Tyagi, P., Stackhouse Jr, P. W., & Kumudini, S. (2017). The contribution of solar brightening to the US maize yield trend. Nat. Clim. Change, 7(4), 275-278. https://doi.org/10.1038/nclimate3234

Troy, T. J., Kipgen, C., & Pal, I. (2015). The impact of climate extremes and irrigation on US crop yields. Environ. Res. Lett., 10(5), 054013. https://doi.org/10.1088/1748-9326/10/5/054013

Urban, D. W., Roberts, M. J., Schlenker, W., & Lobell, D. B. (2015). The effects of extremely wet planting conditions on maize and soybean yields. Clim. Change, 130(2), 247-260. https://doi.org/10.1007/s10584-015-1362-x

Valayamkunnath, P., Barlage, M., Chen, F., Gochis, D. J., & Franz, K. J. (2020). Mapping of 30-meter resolution tile-drained croplands using a geospatial modeling approach. Sci. Data, 7(1), 257. https://doi.org/10.1038/s41597-020-00596-x

Vergopolan, N., Xiong, S., Estes, L., Wanders, N., Chaney, N. W., Wood, E. F.,... Sheffield, J. (2021). Field-scale soil moisture bridges the spatial-scale gap between drought monitoring and agricultural yields. Hydrol. Earth Syst. Sci., 25(4), 1827-1847. https://doi.org/10.5194/hess-25-1827-2021

Wang, R., Bowling, L. C., & Cherkauer, K. A. (2016). Estimation of the effects of climate variability on crop yield in the Midwest USA. Agric. For. Meteorol., 216, 141-156. https://doi.org/10.1016/j.agrformet.2015.10.001

Wang, X., Müller, C., Elliot, J., Mueller, N. D., Ciais, P., Jägermeyr, J.,... Piao, S. (2021). Global irrigation contribution to wheat and maize yield. Nat. Commun., 12(1), 1235. https://doi.org/10.1038/s41467-021-21498-5

Wills, S., Williams, C., Seybold, C., Scheffe, L., Libohova, Z., Hoover, D.,... Brown, J. (2017). Using soil survey to assess and predict soil condition and change. In D. J. Field, C. L. Morgan, & A. B. McBratney (Eds.), Global Soil Security (pp. 123-135). Cham: Springer. https://doi.org/10.1007/978-3-319-43394-3_11

Zaveri, E., & Lobell, D. B. (2019). The role of irrigation in changing wheat yields and heat sensitivity in India. Nat. Commun., 10(1), 4144. https://doi.org/10.1038/s41467-019-12183-9

Zell, W. O., & Sanford, W. E. (2020). Calibrated simulation of the long-term average surficial groundwater system and derived spatial distributions of its characteristics for the contiguous United States. Water Resour. Res., 56(8), e2019WR026724. https://doi.org/10.1029/2019WR026724

Zhang, J., Cao, P., & Lu, C. (2021). Half-century history of crop nitrogen budget in the conterminous United States: variations over time, space and crop types. Global Biogeochem. Cycles, 35(10), e2020GB006876. https://doi.org/10.1029/2020GB006876

Zhang, T., Lin, X., & Sassenrath, G. F. (2015). Current irrigation practices in the central United States reduce drought and extreme heat impacts for maize and soybean, but not for wheat. Sci. Total Environ., 508, 331-342. https://doi.org/10.1016/j.scitotenv.2014.12.004

Zhu, P., Jin, Z., Zhuang, Q., Ciais, P., Bernacchi, C., Wang, X.,... Lobell, D. (2018). The important but weakening maize yield benefit of grain filling prolongation in the US Midwest. Global Change Biol., 24(10), 4718-4730. https://doi.org/10.1111/gcb.14356

Zhu, X., Troy, T. J., & Devineni, N. (2019). Stochastically modeling the projected impacts of climate change on rainfed and irrigated US crop yields. Environ. Res. Lett., 14(7), 074021. https://doi.org/10.1088/1748-9326/ab25a1