Application of Sediment Fingerprinting to Apportion Sediment Sources: Using Machine Learning Models
Kritika Malhotra1, Jingyi Zheng2, Ash Abebe2, Jasmeet Lamba1,*
Published in Journal of the ASABE 66(5): 1205-1221 (doi: 10.13031/ja.14906). Copyright 2023 American Society of Agricultural and Biological Engineers.
1Biosystems Engineering, Auburn University, Auburn, Alabama, USA.
2Mathematics and Statistics, Auburn University, Auburn, Alabama, USA.
*Correspondence: jsl0005@auburn.edu
Submitted for review on 12 October 2021 as manuscript number NRES 14906; approved for publication as a Research Article and as part of the Streambank Erosion, Sediment Dynamics and Restoration: Monitoring, Modeling and Case Studies Collection by Associate Editor Dr. Fouad Jaber and Community Editor Dr. Kyle Mankin of the Natural Resources & Environmental Systems Community of ASABE on 17 March 2023.
Highlights
- Relative source contributions to stream bed sediment from construction sites and stream banks were quantified.
- Two machine-learning techniques were used to select composite fingerprinting properties.
- The MixSIR Bayesian model was employed for source apportionment.
- Statistical methods employed for fingerprinting properties selection have the potential to impact source apportionments.
- Management strategies to reduce sediment mobilization should be targeted depending on the dominant source of sediment in each sub-watershed.
Abstract. Sediment fingerprinting is an extensively used approach for investigating sediment sources by linking in-stream sediment mixtures with watershed source materials. The overall goal of this research was to estimate the relative source contributions of stream banks and construction sites to the stream bed sediment in an urbanized watershed (Alabama, USA) using a fingerprinting technique based on composite fingerprints selected by two different machine learning techniques at a sub-watershed scale. The two statistical approaches employed to select the subset of fingerprinting properties were: (1) the Random Forest algorithm (RF) with Gini importance ranking of variables; and (2) logistic regression with the least absolute shrinkage and selection operator (LASSO). A Bayesian mixing model was then used to estimate the distribution of mixing proportions along with the associated uncertainty. The models were built based on the composite fingerprints selected using the two machine learning methods. Overall, using the subset of fingerprints selected by RF and LASSO, the relative contribution of stream banks ranged from 14±9% to 97±2% and from 24±18% to 94±5%, respectively, throughout the watershed. The stream bank contributions were compared with a previous study conducted in the watershed that utilized a two-step statistical procedure (which involved a Mann-Whitney U-test as the first step and discriminant function analysis (DFA) as the second step) to select the composite of fingerprinting properties and a frequentist mixing model to calculate the source apportionments. The relative contributions of stream banks to stream bed sediment reported in the previous study ranged from 9±8% to 100±1%. Therefore, the study demonstrated the dependence of source attributions on the statistical procedures used to select the optimum composite fingerprints for sediment fingerprinting applications. Furthermore, the results underscored the importance of using different mixing model structures to obtain reliable estimates of source contributions.
Keywords. Least absolute shrinkage and selection operator (LASSO), MixSIR Bayesian model, Random Forest (RF), Statistical techniques.
An increase in urbanization results in the alteration of hydrological and sedimentological processes, posing a risk to aquatic ecosystems such as streams and lakes (Russell et al., 2019; Taylor and Owens, 2009). Impervious land cover combined with effective stormwater drainage systems not only promotes increased and more responsive runoff patterns but also increases the in-stream potential for channel erosion and scour, thus elevating the sediment load contribution from the channel itself (Chin, 2006; Russell et al., 2019). Instead of infiltrating into the soil, runoff is quickly funneled by stormwater drainage systems directly to streams or rivers. The augmented supply of sediment to the channel can lead to the degradation of water quality and reduced biodiversity (Sherriff et al., 2018). Therefore, identifying the dominant source of in-stream sediment in a watershed can help design cost-effective sediment management strategies to achieve meaningful reductions in sediment load and mitigate the negative impacts of sediment and sediment-associated contaminants on water quality (Pulley et al., 2015).
Sediment fingerprinting techniques are frequently used to assess the sediment sources within agricultural and urbanized watersheds (Franz et al., 2014; Malhotra et al., 2018; Rose et al., 2018; Russell et al., 2019; Sherriff et al., 2018). Principally, sediment fingerprinting is based on the characterization and comparison of the physical or geochemical properties of in-stream sediment to their corresponding potential sources across the watershed (Rose et al., 2018). The fingerprinting properties of in-stream sediment after its delivery to a river, lake, or floodplain, will be reflective of their concentration in the sediment sources (Pulley et al., 2015). Therefore, the comparison of the sediment fingerprinting properties of in-stream sediment with those of the soil collected from the erosional source areas can be used to determine the relative contributions from the potential sediment sources to waterbodies (Rose et al., 2018). Individual source type (such as uplands versus stream banks) contributions are fundamental to targeting relevant management strategies aimed at mitigating sediment delivery to streams (Gellis and Noe, 2013; Massoudieh et al., 2013; Raigani et al., 2019).
In traditional sediment fingerprinting studies, source apportionments have typically been obtained by the application of the “frequentist” mixing model approach (Cooper and Krueger, 2017), which involves optimization to minimize the sum of squares of the weighted relative errors (Collins et al., 1997). One of the limitations of this approach was that it did not incorporate a variety of sources of uncertainty, as these models are inadequate in terms of structural flexibility to reasonably convert all sources of error into the model results (Cooper et al., 2014, 2015; Moore and Semmens, 2008). The various sources of uncertainty include errors related to the analytical instrument or sampling methods, spatial and temporal variability of fingerprinting properties in source and sediment, mixing model error, or error associated with non-conservative sediment transport (Barthod et al., 2015; Cooper et al., 2014; Sherriff et al., 2015; Stewart et al., 2015). The traditional mixing model incorporates the use of the means of fingerprinting properties only, and the observed means might fail to capture the true variability of the source groups due to natural variation (D’Haen et al., 2013). However, the mean and variance of the fingerprinting properties of the respective sources provide a good representation of the actual source composition (D’Haen et al., 2013; Small et al., 2002). In furtherance of this, the Bayesian approach allows for the rational translation of all known and residual sources of uncertainty associated with the mixing model and the data into parameter probability distributions within a hierarchical framework (Cooper et al., 2014, 2015; Cooper and Krueger, 2017).
Recent sediment fingerprinting investigations have adopted Bayesian uncertainty techniques, which differ in their interpretation as contributions from different sources to in-stream sediment are presented as probability distributions, and prior knowledge for model parameters (proportions of source contributions) could be incorporated. Therefore, a Bayesian mixing model (MixSIR) is considered a powerful tool for assessing the contributions of different sources to a particular sediment mixture since it is based on Bayesian estimation of the probability distributions of source contributions to the mixture (Moore and Semmens, 2008; Nosrati and Collins, 2019b; Stock et al., 2018). The uncertainties linked to the input data can be included in the model using prior probability distributions, and the spread of posterior probability distributions indicates the uncertainty related to the source contributions (D’Haen et al., 2013). Furthermore, the model has been recommended to be used as a stand-alone probabilistic tool to investigate the catchment erosional processes (Fox and Papanicolaou, 2008; Nosrati et al., 2014).
The Bayesian paradigm employs Markov Chain Monte Carlo random walks to estimate the mixing proportions and the most probable source composition (D’Haen et al., 2013). The procedure for quantifying the relative contributions from various sources using the MixSIR Bayesian statistical approach involves three steps: (1) incorporating the uncertainty associated with the input data into the model as parameters of the prior probability distributions, (2) generating a likelihood function for the model, and (3) deriving the posterior probability distributions for the source proportions (by combining the prior knowledge with the likelihood), which provides an estimate of the uncertainty linked to the mixing proportion computation (Davies et al., 2018; D’Haen et al., 2013; Moore and Semmens, 2008; Nosrati et al., 2014). It should be noted that the uncertainty in the mixing proportions is a function of the overlap between the probability distributions of the composite fingerprinting properties in the sources.
Although the two-step statistical procedure outlined by Collins et al. (1997) to generate a composite of fingerprinting properties involving Kruskal-Wallis H-test (KW-H) or Mann-Whitney U-test (U-test) as the first step along with discriminant function analysis (DFA) as the second step has been extensively used in the past (Schuller et al., 2013; Yu and Rhoads, 2018), other procedures have also been used for the selection of composite signatures of fingerprinting properties (Palazón and Navas, 2017; Stone et al., 2014; Tiecher et al., 2015). Depending on the statistical method employed to determine relative source contributions, differences/similarities in sediment source apportionment results have been reported. For instance, Nosrati et al. (2018) demonstrated the importance of statistical procedures employed in fingerprinting studies by reporting statistical differences in the predicted source contributions generated using three different combinations of statistical techniques (namely, KW-H, a combination of KW-H and DFA, and a combination of KW-H and principal components and classification analysis). Similarly, Palazón et al. (2017) reported different source contribution estimates using different statistical methods. Different statistical procedures lead to different results, which motivated us to investigate the uncertainties in source allocations predicted by alternative composite fingerprints.
Statistical learning methods search combinations of fingerprinting properties to find the set of properties that minimizes the misclassification error. This makes them optimal for source discrimination and often vastly superior to tests on individual fingerprinting properties (Hastie et al., 2009). Furthermore, to our knowledge, popular machine learning algorithms such as least absolute shrinkage and selection operator (LASSO) and the Random Forest algorithm (RF) with Gini importance ranking of variables have not been previously used in sediment fingerprinting studies to select the composite of fingerprinting properties for source discrimination. Random Forest (RF) is a machine learning method that generally performs well with high-dimensional data and can model nonlinear relationships between features (input variables) (Darst et al., 2018). Using RF to train the models effectively avoids overfitting and restrains the negative impacts of noise. This leads to improved accuracy and stability of model classification and prediction. LASSO is a popular high-dimensional data analysis technique that can simultaneously perform regularization and variable selection and is more accurate and stable than traditional variable selection methods, such as stepwise, forward, or backward selection (Leng et al., 2006; Tibshirani, 1996; Vasquez et al., 2016). LASSO offers several advantages over other penalized regression methods, such as ridge regression. The LASSO model can handle multicollinearity and increase the model interpretability by eliminating unimportant features that are not associated with the response variable and reducing overfitting. It also provides very good prediction accuracy and is especially useful with a small number of observations and a large number of attributes, as shrinking and removing the coefficients can decrease variance without a significant increase in bias (Kumar et al., 2019). 
By using the techniques mentioned above, we can eliminate the need to use two steps for selecting the composite of fingerprinting properties and use single-step methods based on the prediction accuracy of a combination of properties.
Feature selection (i.e., fingerprinting properties selection) methods, such as the Mann-Whitney U-test, estimate the discriminating ability of each feature separately, without considering the interplay among features. However, features are often correlated with each other, a situation that LASSO and RF handle well because they choose features based on their performance in the model while accounting for the correlation among features. Therefore, the U-test may not select the best feature sets and may disregard features that are less informative on their own but are informative when combined with others. Furthermore, DFA assumes that the data are normally distributed and hence cannot be used reliably for small datasets for which the assumption of normality does not hold. Employing DFA for such datasets might reduce its accuracy.
Therefore, the overall goal of this study was to understand how the different statistical methods influence the selection of fingerprinting properties to identify the dominant sources of stream bed sediment at a sub-watershed scale. The specific objectives of this study were to: (1) determine the relative contributions to stream bed sediment from construction sites and stream banks using two different machine learning techniques and a Bayesian mixing model; and (2) determine how the source apportionment results for the sediment deposited on the stream bed and associated uncertainty in this study compare with a previous study that employed a two-step statistical procedure (U-test + DFA) and a frequentist mixing model. This study builds on the previous sediment fingerprinting studies (Malhotra et al., 2018, 2020) conducted by the authors in an urbanized watershed in the eastern part of Alabama, USA. The results of this study, in conjunction with previous studies conducted in this watershed, can demonstrate that machine learning methods can be employed in sediment fingerprinting and assess the uncertainty associated with the source apportionments in a better way. It was hypothesized that there would be differences in the relative source contributions to the stream bed sediment estimated in both studies.
Materials and Methods
Study Area
The study site was Moore's Mill Creek Watershed located in Alabama, USA (fig. 1). According to the Cropland Data Layer for 2017 and 2008 developed by USDA-NASS (https://www.nass.usda.gov/Research_and_Science/Cropland/SARS1a.php), urbanization has occurred in this watershed, with the developed area increasing from 44% to 65% between 2008 and 2017. Throughout the study period, Moore’s Mill Creek was listed on the Alabama Department of Environmental Management’s (ADEM) 303(d) list of impaired waterbodies due to excessive siltation caused by land development and urban runoff/storm sewers (http://adem.alabama.gov/programs/water/wquality/2016AL303dList.pdf). Our visual inspections during sampling visits confirmed the issues reported in the ADEM report, including unstable banks, poor sinuosity, lack of riparian buffer protection, and heavy sediment deposition.
The average annual precipitation in this basin is 1430 mm (1997-2017), with average annual high and low temperatures of 25°C and 11°C, respectively (1997-2017) (https://www.ncei.noaa.gov/). The catchment is characterized by bedrock lithology ranging from schist, gneiss, and mylonite to quartzite and sandstone. The soil series details of the catchment are shown in figure 2.
Sampling Strategy
The main land uses in the study watershed are developed (65%), forested (23%), pasture (5%), and shrubland (4%) (fig. 1). The potential sediment sources considered in this study (fig. 1) were identified through observations of sediment mobilization and transport processes within the watershed. Source sampling points were determined during the investigation conducted from August to September 2016 (fig. 1). The source and stream bed sediment samples were collected as part of the Malhotra et al. (2020) study. During the study period, significant sediment mobilization from the construction sites that were not managed adequately was observed, which could have resulted in a substantial sediment contribution from these sites toward in-stream sediment (fig. 3). Visual observations also indicated the existence of unstable and severely eroding stream banks marked by bare surfaces with little or no vegetation, exposed soils, and vertical walls that appeared to be susceptible to erosion (Malhotra et al., 2020) (fig. 3).
For forested areas, we observed leaf litter/organic material layer almost 3 cm deep covering the soil at the sites inspected, providing minimal chances for erosion to occur. Additionally, since erosion rates from pastures are typically low as grass cover on soil minimizes soil erosion, no samples were collected from pasture areas. Therefore, we decided to focus the sampling effort on those sources that represented a reasonable source of sediment mobilization, and samples were not collected from forested areas or pastures. Hence, construction sites and stream banks were considered potential sources of stream bed sediment in this study. However, future research could benefit from incorporating multiple sources (e.g., forests, pastures, road dust) to better understand the uncertainty in the fingerprinting model results.
Figure 1. (a) Land use distribution in the Moore’s Mill Creek watershed and (b) location of stream bed sediment, stream banks, and construction sampling sites.
Figure 2. Distribution of soil series in the Moore’s Mill Creek watershed.
Individual source sampling locations were selected based on the following criteria: (1) accessibility; (2) potential connectivity or proximity of sampling locations to stormwater drains/streams; and (3) visible evidence of erosion or the presence of exposed actively eroding stream banks.
For details on the methodology used for source and stream bed sediment sample collection, refer to Malhotra et al. (2020). Briefly, thirteen different construction sites were sampled within this watershed, and stream bank sampling involved the collection of soil samples from seventeen different sites. To increase the representativeness of the potential heterogeneity of the samples at each construction site, composite samples made of 10 sub-samples (top 2.5 cm of surface soil) were collected from each site. At each stream bank sediment sampling site, five different vertical profiles of the actively eroding banks at an interval of 2-3 m were sampled. From each full vertical profile of the actively eroding banks, stream bank samples (~ 5 cm deep) were collected from three different points (top to bottom), which were 10-20 cm apart, depending upon the height of the stream bank at each location (fig. 4). Stream bank soil samples at each sampling site were composited for analysis (Malhotra et al., 2018, 2020).
Stream bed target samples were collected at the outlets of different sub-watersheds of distinctive land-use characteristics (fig. 1). The collection of stream bed sediment samples was done monthly at three different locations within this watershed to encompass both the spatial and temporal dynamics of sediment recently deposited on the stream bed and capture multiple runoff-generating storm events (fig. 1). The stream bed sediment sample collection was done from December 2016 to September 2017 (table 1). Apart from the first month (December 2016) of the sampling period, the top 10 cm of the sediment deposited on the stream bed was collected. We primarily focused on the top few centimeters, which was likely the fresh sediment that was mobilized from sources and deposited during the study period. Cores from the top 20 cm, top 15 cm, and top 10 cm at sites 1, 2, and 3, respectively, were collected for the first month of the sampling period, depending upon the amount of sediment deposited at each site (table 1). The purpose of analyzing the sectioned profile of the deeper core for the first month of the sampling period was to collect information on whether there is a difference in the dominant source of newly deposited sediment and the sediment that has not been deposited lately. To improve the representativeness of the bed sediment, each sample comprised a composite of 3-5 sub-samples at each site.
Figure 3. Illustrations of source and stream bed sample collection areas. (a, b, c) Construction sites; (d, e, f) stream banks; and (g, h) stream beds. (i) Improper installation of silt fence at one of the construction sites; and (j) sediment tracked onto the road adjoining a construction site through the tires of vehicles leaving the construction site.
Before measuring the concentration of geochemical fingerprinting properties, all the soil and sediment samples were oven-dried at 60° C for 48 hours to a constant weight, and the dried samples were disaggregated using a pestle and mortar and sieved down to 63-212 µm (fine sand) particle size fraction. Given the focus of this study on stream bed sediment, the fine sand fraction was specifically considered due to its coarser nature compared to the material that remains suspended. Elemental analysis was performed on the samples for the following list of elements: S, K, Ca, Li, Be, B, Sc, Ti, Mg, Na, Al, P, V, Cr, Mn, Fe, Co, Ga, Rh, Pd, Ag, As, Sb, Cs, Se, Rb, Sr, Y, Zr, Nb, Nd, Ni, Cu, Zn, Gd, Dy, Ho, Yb, Lu, Hf, Ta, W, Ir, Sn, Ba, La, Ce, Pr, Pt, Hg, Tl, Pb, Bi, Sm, Eu, Mo, Cd, Th, and U. The analysis was performed at the Wisconsin State Laboratory of Hygiene, Wisconsin, USA, using the ICP-MS microwave-aided digestion procedure (using ultrahigh purity nitric, hydrochloric, and hydrofluoric acids) based on the United States Environmental Protection Agency Method 3052 (USEPA, 1996).
Figure 4. Stream bank profile for sample collection at each location. Three cores (C1, C2, and C3; ~5 cm deep) were collected from top to bottom at an interval of 10-20 cm at each location, and five different locations were sampled laterally along the bank at each site.
Statistical Discrimination of Sediment Sources
A statistical analysis was conducted to choose the fingerprinting properties that best discriminate between construction sites and stream banks. A range test was conducted to identify fingerprinting properties that behaved non-conservatively during transport from the source to the in-stream sampling point. This test compares the concentration of fingerprinting properties in the stream bed sediment samples with the corresponding ranges of concentration of fingerprinting properties in the source samples (Franz et al., 2014; Gellis and Noe, 2013; Lamba et al., 2015; Smith and Blake, 2014; Wilkinson et al., 2013). All the non-conservative properties that did not meet this criterion (i.e., the concentration of fingerprints for stream bed sediment samples falling outside the corresponding ranges of upstream source sample fingerprinting properties concentration) were removed from further analysis.
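The range test described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure coded by the authors: it assumes that the range is the minimum to maximum of the pooled source samples, and the function name `range_test` and all concentration values are hypothetical.

```python
def range_test(source_values, sediment_values):
    """Return the properties whose sediment concentrations all fall
    within the min-max range of the pooled source samples."""
    conservative = []
    for prop, src in source_values.items():
        lo, hi = min(src), max(src)
        # A property passes only if every sediment sample is inside the range.
        if all(lo <= v <= hi for v in sediment_values[prop]):
            conservative.append(prop)
    return conservative

# Hypothetical concentrations (arbitrary units), for illustration only.
sources = {"Fe": [2.1, 3.4, 2.8, 4.0], "Zn": [10.0, 14.0, 12.5, 11.0]}
sediment = {"Fe": [2.5, 3.9], "Zn": [9.0, 13.0]}
kept = range_test(sources, sediment)
print(kept)
```

In this toy case Zn would be dropped because one sediment value (9.0) lies below the lowest source value (10.0), whereas Fe is retained.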
Recent studies based on sediment fingerprinting approaches have demonstrated the importance of using more than one composite fingerprint selected by different statistical procedures (Nosrati and Collins, 2019a, b; Palazón and Navas, 2017). Therefore, after checking the conservativeness of the fingerprinting properties, we employed two different machine-learning approaches to select composite fingerprinting properties. In the first approach, the Random Forest algorithm along with the Gini importance ranking of variables was used. The second method we applied was logistic regression with the least absolute shrinkage and selection operator (LASSO). Thus, two final composite fingerprints were selected based on these machine-learning approaches and were used to determine sediment sources at each site.
Random Forest Algorithm
The analysis for feature selection (i.e., fingerprinting properties selection) was performed using the ranger package of R (Wright et al., 2016). The ranger random forest algorithm provides a fast and efficient calculation using a recursive partitioning framework (Alabdulwahhab et al., 2021). In this study, the model was run 500 times, and the feature importance based on the mean of the Gini coefficient for each independent feature was calculated. The mean decrease in the Gini coefficient is a measure of how each variable contributes to the homogeneity of nodes and the resulting random forest (Menze et al., 2009). A higher mean decrease in Gini indicates higher variable importance, i.e., the variables that result in nodes with higher purity have a higher decrease in Gini coefficient (Balzter et al., 2015).
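As a worked illustration of the Gini-based ranking described above, the following minimal Python sketch computes the Gini impurity of a node and the decrease produced by a split. It is not the ranger implementation; the function names and the toy labels are hypothetical. A Random Forest would accumulate such decreases over all splits and trees that use a given property.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_c^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """Impurity decrease from splitting `parent` into `left` and `right`,
    with child impurities weighted by their share of the samples."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical node: 4 stream bank ("B") and 4 construction site ("C") samples.
parent = ["B"] * 4 + ["C"] * 4
pure_split = gini_decrease(parent, ["B"] * 4, ["C"] * 4)
weak_split = gini_decrease(parent, ["B", "B", "B", "C"], ["B", "C", "C", "C"])
print(pure_split, weak_split)
```

A property that separates the two sources perfectly yields the largest possible decrease for this node (0.5), while a property that separates them only partly yields a smaller decrease (0.125), which is why a higher mean decrease in Gini indicates a more important property.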
Random-Forest-Recursive Feature Elimination (RF-RFE) algorithm along with the validation set approach was used to perform the feature selection based on the accuracies obtained (Granitto et al., 2006; Sohil et al., 2022). RF models with a different number of features (selected based on ranking) were trained, and the model with the highest accuracy was selected. The validation set approach was repeated 500 times to obtain the mean accuracy of each model. The validation set approach estimates the error rate of the model by separating a subset of the data from the fitting process (generating a testing dataset), and the model is built using the other set of observations that constitutes the training dataset. The model is then applied to the testing dataset, in which the error is calculated (testing dataset error). Hyperparameter tuning was performed to obtain the combination of hyperparameters that could improve the accuracy of the model. Since the number of decision trees greatly influences the model, the number of decision trees was set to 400 to guarantee the stability of the model.
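The repeated validation set approach can be sketched as follows. This is an illustration of the resampling logic only: to stay self-contained it uses a nearest-centroid classifier as a stand-in for Random Forest, and the 30% test fraction, the function names, and the data are assumptions that do not come from the study.

```python
import random

def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT Random Forest): assign x to the class
    whose training mean is closest in squared Euclidean distance."""
    best, best_d = None, float("inf")
    for c in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == c]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        d = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if d < best_d:
            best, best_d = c, d
    return best

def mean_validation_accuracy(X, y, features, repeats=500, test_frac=0.3, seed=0):
    """Average held-out accuracy over repeated random train/test splits,
    using only the feature columns listed in `features`."""
    rng = random.Random(seed)
    Xs = [[row[j] for j in features] for row in X]
    n_test = max(1, int(round(test_frac * len(Xs))))
    total = 0.0
    for _ in range(repeats):
        idx = list(range(len(Xs)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        tX, ty = [Xs[i] for i in train], [y[i] for i in train]
        hits = sum(nearest_centroid_predict(tX, ty, Xs[i]) == y[i] for i in test)
        total += hits / n_test
    return total / repeats

# Hypothetical data: column 0 separates the sources, column 1 is noise.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [1.1, 6.0], [1.3, 5.5], [0.8, 3.5],
     [3.0, 4.5], [3.2, 5.0], [2.9, 3.2], [3.1, 6.1], [3.3, 4.0], [2.8, 5.2]]
y = ["bank"] * 6 + ["construction"] * 6
acc_signal = mean_validation_accuracy(X, y, features=[0])
acc_noise = mean_validation_accuracy(X, y, features=[1])
print(acc_signal, acc_noise)
```

In a recursive feature elimination loop, a function of this kind would be called once per candidate feature subset (ordered by importance ranking), and the subset with the highest mean accuracy would be retained.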
Least Absolute Shrinkage and Selection Operator (LASSO)
In this study, LASSO was used for the feature selection process, where it penalized the coefficients of the logistic regression variables, shrinking some of them to zero with a tuning parameter λ, using the glmnet package in R. LASSO uses penalized logistic regression, which minimizes the penalized negative of a binomial log-likelihood given by (eq. 1):
$$\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i\left(\beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j\right) - \log\left(1 + e^{\beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j}\right)\right] + \lambda \sum_{j=1}^{p} \left|\beta_j\right| \tag{1}$$
where
j varies from 1 to p, where p = the number of fingerprinting properties
i varies from 1 to n, where n = the number of samples
y = either 0 or 1 indicating the source of the sample
x's = measured values of the fingerprinting features
β0 and βj = the intercept and the logistic regression coefficients.
As λ increases, the number of estimated logistic regression coefficients that are shrunk to zero also increases.
Cross-validation was performed to obtain the optimal lambda and returned a model with the smallest misclassification error. Following the LASSO estimation process, the features with non-zero coefficients were selected to be part of the Bayesian model.
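To illustrate how the L1 penalty sets coefficients exactly to zero, the following Python sketch fits a LASSO-penalized logistic regression by proximal gradient descent. It is not the glmnet algorithm (glmnet uses coordinate descent along a path of λ values); the objective is assumed to be the mean negative binomial log-likelihood plus λ times the sum of absolute coefficients, with an unpenalized intercept. The step size, iteration count, and data are hypothetical.

```python
import math

def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrinks z toward zero, exactly zero if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_logistic(X, y, lam, step=0.1, iters=2000):
    """Fit an L1-penalized logistic regression by proximal gradient descent.
    Minimizes -(1/n)*loglik + lam*sum(|beta_j|); the intercept is unpenalized."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        # Fitted probabilities under the current coefficients.
        probs = [1.0 / (1.0 + math.exp(-(b0 + sum(b * v for b, v in zip(beta, row)))))
                 for row in X]
        resid = [pr - yi for pr, yi in zip(probs, y)]
        g0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b0 -= step * g0
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return b0, beta

# Hypothetical standardized data: two properties, binary source label.
X = [[-1.2, 0.3], [-0.8, -0.5], [-0.5, 0.8], [-0.1, -0.9],
     [0.2, 0.4], [0.6, -0.2], [0.9, 0.7], [1.3, -0.6]]
y = [0, 0, 1, 0, 0, 1, 1, 1]
_, beta_weak = lasso_logistic(X, y, lam=0.05)
_, beta_strong = lasso_logistic(X, y, lam=0.5)
print(beta_weak, beta_strong)
```

With the larger penalty every coefficient stays at zero, so no property would be selected; with the smaller penalty the first property enters the model. Cross-validation over a grid of λ values would then pick the penalty with the smallest misclassification error.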
It should be noted that both tracer selection methods (Random Forest and LASSO) can be used when considering more than two sources (multi-class problems).
Source Apportionment Using the MixSIR Bayesian Mixing Model
The set of variables selected by the two approaches was analyzed using a Bayesian mixture model. The Bayesian analysis using Markov Chain Monte Carlo (MCMC) simulations for estimating source apportionments was programmed in the open-source software Just Another Gibbs Sampler (JAGS) within the R environment (Plummer, 2003). Information regarding JAGS is provided at http://mcmc-jags.sourceforge.net/. Using a Gibbs sampling MCMC algorithm on prior parameter distributions and a likelihood function, JAGS performed hierarchical Bayesian inference to estimate posterior distributions and the uncertainty associated with the mixing model inputs. The Bayesian model was run for 50,000 iterations with 10,000 burn-in samples and resulted in posterior probability distributions for the estimated contributions from the sources using the two different composites of fingerprinting properties. We specified the thinning rate as 5, implying that the procedure retains every fifth parameter value. Trace plots and Gelman-Rubin statistics were used for convergence diagnostics.
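The burn-in, thinning, and Gelman-Rubin steps can be sketched in Python as follows. This is not JAGS output or the authors' code: the count of retained draws assumes that the 50,000 iterations include the 10,000 burn-in draws (the text does not say), the Gelman-Rubin statistic is the basic form without a degrees-of-freedom correction, and the chains are simulated placeholders.

```python
import random

def thin(chain, burn_in, thin_by):
    """Drop the first `burn_in` draws, then keep every `thin_by`-th draw."""
    return chain[burn_in::thin_by]

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for equal-length chains."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5

# Retained draws per chain under the stated assumption.
total_iter, burn_in, thin_by = 50_000, 10_000, 5
n_kept = len(thin(list(range(total_iter)), burn_in, thin_by))

# Simulated placeholder chains for the stream bank proportion f.
rng = random.Random(42)
mixed = [[rng.uniform(0.6, 0.8) for _ in range(2000)] for _ in range(2)]
stuck = [[rng.uniform(0.1, 0.3) for _ in range(2000)],
         [rng.uniform(0.7, 0.9) for _ in range(2000)]]
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(n_kept, rhat_mixed, rhat_stuck)
```

Chains that explore the same region give a statistic close to 1, whereas chains stuck in different regions give a value well above 1, which would signal a lack of convergence.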
The relative contributions of stream banks and construction sites were calculated by approximating the mixture probabilities of a Gaussian mixture distribution using Bayesian Monte Carlo sampling (Moore and Semmens, 2008). This approach quantified the relative contribution of the two sources to the stream bed sediment by calculating the probability distribution for the proportional contribution (f) of source 1 to the stream bed sediment mixture, with the proportional contribution of source 2 as (1-f). It should be noted that the contributions from stream banks and construction sites are relative to each other and do not constitute all the sediment sources that are contributing to stream bed sediment. However, the Bayesian model can be used to calculate the proportions of source contributions even if more than two sources are considered in a study.
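The Bayes rule states that the posterior probability distribution for f is proportional to the likelihood of the data given f based on the estimated mean and standard deviation of the fingerprinting properties in the sources, $\hat{\mu}$ and $\hat{\sigma}$ respectively, and prior probability distribution of f (Nosrati et al., 2014; Nosrati and Collins, 2019a). The posterior likelihood is given by (eq. 2):
$$P(f \mid x) \propto L(x \mid \hat{\mu}, \hat{\sigma}) \times P(f \mid \alpha, \beta) \tag{2}$$
where $L(x \mid \hat{\mu}, \hat{\sigma})$ is the likelihood of the data and $P(f \mid \alpha, \beta)$ is the prior probability distribution that depends on hyperparameters α and β. In the formulation (eqs. 3 and 4),
$$\hat{\mu} = (\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_J) \tag{3}$$
$$\hat{\sigma} = (\hat{\sigma}_1, \hat{\sigma}_2, \ldots, \hat{\sigma}_J) \tag{4}$$
where $\hat{\mu}$ and $\hat{\sigma}$ are the mean and standard deviation vectors for the J selected fingerprinting properties. These can be calculated as (eqs. 5 and 6):
$$\hat{\mu}_j = f\, m_{j1} + (1 - f)\, m_{j2} \tag{5}$$
$$\hat{\sigma}_j = \sqrt{f^2 s_{j1}^2 + (1 - f)^2 s_{j2}^2} \tag{6}$$
where $m_{j1}$ and $m_{j2}$ represent the means of the jth sediment fingerprinting property in source 1 and source 2, respectively; $s_{j1}^2$ and $s_{j2}^2$ are the variances of the jth sediment fingerprinting property in source 1 and 2, respectively.
Based on $\hat{\mu}_j$ and $\hat{\sigma}_j$, the likelihood of the data given the proposed sediment mixture is calculated as (Nosrati et al., 2019) (eq. 7):
$$L(x \mid \hat{\mu}, \hat{\sigma}) = \prod_{k} \prod_{j=1}^{J} \frac{1}{\hat{\sigma}_j \sqrt{2\pi}} \exp\left[-\frac{(x_{kj} - \hat{\mu}_j)^2}{2 \hat{\sigma}_j^2}\right] \tag{7}$$
where $x_{kj}$ represents the jth fingerprinting property of the kth sediment sample. Then, the prior information is specified according to the beta distribution f ~ Beta(α, β), which has the probability density function given by (eq. 8):
$$P(f \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, f^{\alpha - 1} (1 - f)^{\beta - 1} \tag{8}$$
where α and β are shape parameters taken to be (1,1) to make the prior non-informative (uniform on [0,1]).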
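For a two-source problem, the posterior of f can be illustrated without MCMC by evaluating the likelihood on a grid. The Python sketch below is a simplified stand-in for the JAGS model used in the study: it assumes the usual MixSIR form, in which the mixture mean of each property is f·m1 + (1−f)·m2 and the mixture variance is f²·s1² + (1−f)²·s2², together with a uniform Beta(1, 1) prior. The function names and all numerical values are hypothetical.

```python
import math

def mixture_log_likelihood(f, sediment, m1, s1, m2, s2):
    """Gaussian log-likelihood of sediment samples for source-1 proportion f.
    m1, s1, m2, s2 are per-property source means and standard deviations."""
    ll = 0.0
    for j in range(len(m1)):
        mu = f * m1[j] + (1.0 - f) * m2[j]
        var = (f * s1[j]) ** 2 + ((1.0 - f) * s2[j]) ** 2
        for sample in sediment:
            ll += -0.5 * math.log(2.0 * math.pi * var) - (sample[j] - mu) ** 2 / (2.0 * var)
    return ll

def grid_posterior(sediment, m1, s1, m2, s2, n_grid=999):
    """Posterior mean and SD of f on an interior grid of (0, 1), Beta(1, 1) prior."""
    grid = [(i + 1) / (n_grid + 1) for i in range(n_grid)]
    logs = [mixture_log_likelihood(f, sediment, m1, s1, m2, s2) for f in grid]
    top = max(logs)
    # The uniform prior only adds a constant, so the weights are the likelihoods.
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(f * p for f, p in zip(grid, probs))
    sd = math.sqrt(sum((f - mean) ** 2 * p for f, p in zip(grid, probs)))
    return mean, sd

# Hypothetical two-property example constructed so that f is close to 0.7.
m1, s1 = [10.0, 50.0], [0.5, 2.0]   # source 1 (e.g., stream banks)
m2, s2 = [20.0, 30.0], [0.5, 2.0]   # source 2 (e.g., construction sites)
sediment = [[13.0, 44.0], [13.2, 43.5], [12.9, 44.3]]
post_mean, post_sd = grid_posterior(sediment, m1, s1, m2, s2)
print(round(post_mean, 3), round(post_sd, 3))
```

The posterior standard deviation plays the role of the uncertainty attached to each reported contribution; it widens as the source distributions overlap more.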
It should be noted that some of the recent work in sediment fingerprinting (Liu et al., 2016; Martínez-Carreras et al., 2008; Martínez-Carreras et al., 2010; Palazón and Navas, 2017; Smith and Blake, 2014a) questioned the use of particle size and organic matter correction factors as they may bias the estimated contribution results; therefore, such corrections were not performed on the fingerprinting property datasets in this study. Standard deviations provided by MixSIR were considered uncertainties in the source apportionment results (Gateuille et al., 2019).
Uncertainty and Equivalence Testing
The previous study employed a Monte Carlo simulation approach to calculate the uncertainty in source apportionment results obtained from the frequentist mixing model. This approach involved generating 1000 random values for each fingerprinting property within the 5th and 95th percentile range for both source groups. The mixing model was then run 1000 times, using one random value at a time for each fingerprinting property. Subsequently, the mean contributions of stream banks were determined for each site (Malhotra et al., 2020).
The predicted source apportionment results based on composite fingerprinting properties selected by all the statistical approaches were compared using the root mean squared difference (RMSD). This analysis computes the magnitude of the difference between the source proportions predicted with the composite fingerprinting properties selected by the different statistical approaches (Nosrati et al., 2018) (eq. 9):
$$\mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_{i1} - Y_{i2})^2} \qquad (9)$$
where $Y_{i1}$ and $Y_{i2}$ are the relative contributions of stream banks to a specific stream bed sediment sample (i) based on the composite signatures selected using two different statistical approaches. For example, $Y_{11}$ is the relative contribution of stream banks to the first stream bed sediment sample based on RF and Gini, $Y_{12}$ is the relative contribution of stream banks to the same stream bed sediment sample based on fingerprinting properties selected using LASSO, and n is the number of stream bed sediment samples (i.e., 11, 10, and 9 for sites 1, 2, and 3, respectively).
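As a worked illustration of the RMSD calculation, the short Python sketch below applies it to the mean stream bank contributions at site 2 that are listed in table 6 for RF and LASSO, expressed in percent. The function name `rmsd` is chosen here for illustration only.

```python
import math

def rmsd(y1, y2):
    """Root mean squared difference between two equally long series."""
    if len(y1) != len(y2):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y1, y2)) / len(y1))

# Mean stream bank contributions (%) at site 2, as listed in table 6.
rf = [89, 38, 28, 51, 42, 42, 49, 86, 47, 27]
lasso = [88, 36, 26, 47, 35, 39, 44, 81, 47, 29]

site2_rmsd = rmsd(rf, lasso)
print(f"RMSD, RF vs. LASSO at site 2: {site2_rmsd:.1f}%")
```

With these rounded means the result is about 3.7%, which agrees with the value of 4% reported for RF vs. LASSO at site 2 in table 8.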
In addition to RMSD, a two one-sided tests (TOST) approach was applied to test whether the results from the three approaches were in agreement. The TOST approach has been used in various scientific disciplines, especially to establish the bioequivalence of drugs (Berger and Hsu, 1996; Lakens, 2017). This test differs from tests such as the Kolmogorov-Smirnov test, which seek to establish differences among methods. The null hypothesis of the TOST approach is that the mean values of two datasets are not equivalent, and the test attempts to demonstrate that they are equivalent within a predefined limit (Lakens, 2017). This is conceptually opposite to the two-sample t-test procedure, in which the null hypothesis is that the means are equal. If the confidence interval for the difference between the mean values is within the predefined limits, the mean values of the two datasets are considered to be equivalent (Mara and Cribbie, 2012). The predefined limits define a range of values for which the results are “close enough” to be considered equivalent. Therefore, to check whether the methods show equivalence within 10% of the source apportionment results, the upper and lower equivalence bounds were set to 10 and -10, respectively. This defines the acceptance criterion, i.e., if the confidence interval for the difference between the two mean values is completely contained within [-10, 10], the mean values of the two datasets are deemed to be equivalent. The statistical approaches used in both methods are illustrated in figure 5.
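The acceptance rule can be sketched in Python as a paired comparison: compute the 90% confidence interval of the mean paired difference and check whether it lies inside the equivalence bounds. This is a minimal sketch under stated assumptions, not the authors' code. The data are the rounded site 2 means from table 6, all function names are hypothetical, and the Student's t quantile is obtained by numerical integration only to keep the example free of third-party dependencies (`scipy.stats.t.ppf` would give the same quantile).

```python
import math
from statistics import mean, stdev

def t_pdf(t, df):
    """Density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """CDF for x >= 0, integrating the density with Simpson's rule."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Quantile for p > 0.5, found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def paired_tost(y1, y2, bound=10.0, alpha=0.05):
    """Return the (1 - 2*alpha) CI of the mean paired difference and the verdict."""
    diffs = [a - b for a, b in zip(y1, y2)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)
    margin = t_quantile(1 - alpha, n - 1) * se
    lower, upper = mean(diffs) - margin, mean(diffs) + margin
    return lower, upper, (-bound < lower and upper < bound)

# Mean stream bank contributions (%) at site 2, as listed in table 6.
rf = [89, 38, 28, 51, 42, 42, 49, 86, 47, 27]
lasso = [88, 36, 26, 47, 35, 39, 44, 81, 47, 29]

lower, upper, equivalent = paired_tost(rf, lasso)
print(f"90% CI: ({lower:.2f}, {upper:.2f}); equivalent: {equivalent}")
```

With the rounded table values the interval is roughly (1.2, 4.2), which lies inside [-10, 10], so the two series would be judged equivalent under the stated bounds.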
Results and Discussions
Composite Fingerprinting Properties Selected for Discriminating the Potential Sediment Sources
The results of the range test showed that the majority of the fingerprinting properties were conservative in terms of their concentrations during sediment mobilization and transport within this watershed. The fingerprinting properties that did not pass the range test are listed in table 2. The properties that passed the test were retained for the subsequent statistical analysis. The variable importance based on the mean decrease in the Gini index obtained using RF at all sites is shown in figure 6. The higher the mean decrease in the Gini index, the higher the importance of the variable in the model. For example, at site 1, fingerprinting properties V, Ag, and Nb had a higher mean decrease in the Gini index than the other properties, indicating their higher importance. The mean accuracy obtained by models built with different numbers of features (selected based on ranking) is shown for all sites in figure 7. The order in which features entered each model was based on the largest decrease in the Gini index. For example, for site 2, model 1 with Ag, Be, Zr, and Pb had the highest accuracy. The composite of fingerprinting properties that provided the highest accuracy was therefore selected at each site.
Hyperparameter tuning was performed to optimize the performance of the model, and the hyperparameters obtained from tuning are listed in table 3. The parameter ‘minimum node size’ controls the structure of each individual tree, and the parameter ‘mtry’ controls the level of randomness of the forest. ‘Minimum node size’ implicitly sets the depth of the trees. It specifies the minimum number of observations in a terminal node, so a larger value produces smaller trees (which take less time to grow). The highest accuracy was obtained with a minimum node size of 1, which is also the default value for classification. ‘Mtry’ indicates the number of features considered for each split of the tree. The highest accuracy was obtained with an mtry value of 1.
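The quantity behind the ranking can be illustrated with a small Python sketch that computes the decrease in Gini impurity produced by the best single threshold split on one property. This is not a random forest: a forest averages such decreases over every split in which a property is used, across many trees grown on bootstrap samples with mtry randomly chosen candidate properties per split. The tracer names, concentrations, and function names below are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_gini_decrease(values, labels):
    """Largest impurity decrease achievable by one threshold split on a property."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini(labels)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best

# Hypothetical concentrations; 0 = stream bank, 1 = construction site.
labels = [0, 0, 0, 1, 1, 1]
tracers = {
    "A": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],   # separates the sources cleanly
    "B": [1.0, 7.0, 2.0, 8.0, 3.0, 9.0],   # overlaps between sources
}
decrease = {name: best_gini_decrease(vals, labels) for name, vals in tracers.items()}
ranking = sorted(decrease, key=decrease.get, reverse=True)
print(decrease, ranking)
```

Tracer A removes all impurity (a decrease of 0.5 for two balanced classes), whereas tracer B removes only half of it (0.25), so A ranks first. Library implementations such as scikit-learn's `feature_importances_` report an impurity-based importance of this kind aggregated over the whole forest.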
The LASSO model provides the coefficient of each feature after the shrinking process. Figure 8 shows the path of the LASSO regression coefficients of the features at varying log-transformed lambda values. As lambda grew larger, the coefficients of less important features gradually shrank to zero. Features with non-zero LASSO coefficients are treated as important. In this study, the validation set approach was repeated 500 times, and in each of the 500 iterations, the features with non-zero coefficients were recorded (fig. 9). Features that had non-zero coefficients in a sufficiently large share of the iterations (a threshold of 40%) were considered important, and the remaining features were eliminated from the feature selection process. The fingerprinting properties selected as optimum composite fingerprints by the two statistical operations (i.e., RF and LASSO) and by the two-step statistical approach used in the previous study are summarized for each site in table 4. The number of fingerprinting properties selected using the different statistical operations varied among sites.
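The repeated selection procedure can be sketched in Python as follows. This is a dependency-free illustration under stated assumptions rather than the authors' implementation: the L1-penalized logistic regression is fitted by proximal gradient descent, the penalty `lam` is fixed instead of tuned, only 50 random training subsets are drawn instead of 500, and the data are synthetic. The 40% retention threshold is the one quoted in the text; all function names are hypothetical.

```python
import math
import random

def soft_threshold(v, t):
    """Shrink v toward zero by t; values within t of zero become exactly zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_l1_logistic(X, y, lam, lr=0.1, n_iter=200):
    """L1-penalized logistic regression fitted by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            resid = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += resid / n
            for j in range(p):
                grad_w[j] += resid * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w

def selection_frequency(X, y, lam, n_repeats=50, train_frac=0.7, seed=0):
    """Share of random training subsets in which each coefficient is non-zero."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_repeats):
        idx = rng.sample(range(n), int(train_frac * n))
        w = fit_l1_logistic([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            counts[j] += w[j] != 0.0
    return [c / n_repeats for c in counts]

# Hypothetical standardized data: property 0 differs between sources, 1-3 are noise.
rng = random.Random(42)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(1.0 if yi else -1.0, 0.5)] + [rng.gauss(0.0, 1.0) for _ in range(3)]
     for yi in y]

freq = selection_frequency(X, y, lam=0.1)
selected = [j for j, fr in enumerate(freq) if fr >= 0.4]
print(freq, selected)
```

The soft-thresholding step is what sets coefficients exactly to zero, so counting non-zero coefficients across repeated subsets gives the kind of selection ratio plotted in figure 9.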
Certain geochemical properties (such as P) are surface-elevated, whereas many are surface-depleted. This could be due to weathering or pedogenetic effects. Stream banks consist mostly of less weathered subsoil material, with the inclusion of some surface soil due to sampling of the entire bank profile (Smith and Blake, 2014b). Therefore, they tend to have higher concentrations of various metals than surface soils (construction sites). This was apparent for Zr, which had a higher concentration in stream bank samples than in construction site samples. Higher concentrations of lanthanides (such as Sm and Ho) were also observed in the stream bank samples than in the construction site samples. The concentrations of lanthanides are greater in subsoils than in surface soils (Tyler, 2004; Zhou et al., 2020). Heavy metals such as V, Ga, Ag, and Fe, which were selected as part of the composite fingerprint and are associated with anthropogenic sources, had greater concentrations in soils sampled from construction sites than in stream bank soil samples. Similarly, Cr was found in higher concentrations in construction sites than in stream banks, as it is used in automobile parts and lubricating oils (Mukundan et al., 2010). It should be noted that additional factors, such as the number of source samples collected, parent material, previous land uses (such as mining), geomorphic landform, and hydrology, can also affect the concentrations of elements within different sources (Koiter et al., 2013; Lamba et al., 2015).
Figure 6. Scaled variable importance based on the mean decrease in Gini index at sites (a) 1, (b) 2, and (c) 3.
Source Apportionments
The results for the credible intervals and the relative mean values of stream bank contributions to stream bed sediment at all sites obtained by the Bayesian model are listed in tables 5-7. Using RF, the relative source contributions from stream banks to stream bed sediment ranged from 60±14% to 94±5%, from 27±16% to 89±4%, and from 14±9% to 97±2% at sites 1, 2, and 3, respectively. Using the alternative composite signature selected by LASSO, the corresponding contributions ranged from 60±14% to 94±5%, from 26±5% to 88±5%, and from 24±18% to 79±18% at sites 1, 2, and 3, respectively. However, based on the composite signature selected using a combination of U-test + DFA and the frequentist mixing model in the previous study (Malhotra et al., 2020), the relative contributions from stream banks to stream bed sediment at sites 1, 2, and 3 ranged from 88±1% to 100±1%, from 97±3% to 100±1%, and from 9±8% to 90±11%, respectively. At site 1, stream banks were recognized as the dominant source of stream bed sediment using both machine-learning approaches throughout the sampling period (table 5). Streams of the southern Piedmont region experienced heavy anthropogenic disturbance (such as land degradation and stream impoundments) in the mid-20th century, which ultimately caused channelization, increased stream power, and caused streams to go through phases of incision and accelerated bank erosion (Haney and Davis, 2015; Mukundan et al., 2010). A combination of historical changes and recent urbanization has resulted in over-widened channels and altered flow regimes, which may be responsible for accelerated stream bank erosion. At sites 2 and 3, both construction sites and stream banks were the dominant sources of stream bed sediment (tables 6 and 7). The temporal variability of the relative contribution of stream banks and construction sites to stream bed sediment depends on a variety of factors, such as the phase of construction activities, intensity of storm events, stream flow conditions, and the ‘lag time’ between sediment mobilization from construction sites and subsequent delivery to the stream, which depends upon the hydrological connectivity between sediment sources and the stream channel.
Table 3. Hyperparameters obtained from tuning.
| Hyperparameter | Values |
| --- | --- |
| mtry | 1 |
| Splitting rule | Gini |
| Minimum node size | 1 |
Figure 7. Mean accuracy obtained by different models using Random Forest algorithm based on the different number of features at sites (a) 1, (b) 2, and (c) 3.
It was observed that the greater the contribution of a particular source group to stream bed sediment, the lower the difference between the concentration of fingerprinting properties selected by the machine learning approaches in stream bed sediment samples and the mean value of the corresponding fingerprinting properties concentration in that sediment source group. For example, the mean concentration of V, Nb, and Ag (properties selected by RF and LASSO at site 1) in stream banks was 36.9 µg g-1, 12.3 µg g-1, and 0.59 µg g-1, respectively. The mean concentrations of V, Nb, and Ag in samples collected from construction sites were 74.3 µg g-1, 19.4 µg g-1, and 1.48 µg g-1, respectively. The corresponding mean concentrations of V, Nb, and Ag in the stream bed sediment samples for site 1 were 28.6 µg g-1, 12.9 µg g-1, and 0.88 µg g-1, respectively. Since the stream bank was the dominant source for this site, the concentrations of the optimum set of fingerprinting properties selected by RF and LASSO in the stream bed samples were closer to the mean concentrations of these properties in the stream banks. Similar trends were observed for all the composite fingerprinting properties selected by machine learning approaches at all sites and for all months of the sampling period.
Figure 8. Profiles of LASSO coefficients as a function of tuning parameter lambda at sites (a) 1, (b) 2, and (c) 3.
Although stream banks generally made a greater relative contribution to the stream bed sediment than construction sites in both studies, differences were observed in the estimated source contributions between the composite signatures selected by machine learning methods in this study and the U-test + DFA approach used in the previous study. The relative contributions from both sources varied spatially and temporally within the watershed. For example, at site 1, for the last month of sample collection, the relative source contribution of stream banks was quantified as 100% using the combination of U-test + DFA and the frequentist mixing model (Malhotra et al., 2020). However, the stream bank source contribution computed using RF/LASSO and the Bayesian mixing model was 86%. Likewise, source apportionment differences were observed between the two studies at other sites throughout the sampling period, with the largest difference (~74%) obtained at site 2 for the stream bed sediment sample collected on December 2, 2016 (core 3). The differences could be attributed either to the selection of different composites of fingerprinting properties using different statistical procedures (RF, LASSO, U-test + DFA) or to the application of different mixing model approaches, i.e., Bayesian versus frequentist. As suggested by Collins et al. (2017) and Cooper et al. (2014), the structure of the mixing model impacts the source apportionment estimates, and a comparison of source apportionment using different models is advisable to ensure robust source discrimination. Since the Bayesian method accounts for uncertainty and uses the Beta prior distribution, which is continuous, contributions of exactly 100% are not possible. Instead, such cases correspond to high percentage values due to a ceiling effect. This demonstrates the need for robust uncertainty analysis and quantification in the results.
Figure 9. Ratio of coefficients being non-zero among 500 iterations for the features during the shrinkage process in LASSO regression at sites (a) 1, (b) 2, and (c) 3.
The source apportionments generated in this study using different composite signatures underscored the importance of steep or vertical stream banks devoid of vegetation as a significant source of sediment loss. Also, understanding the relative differences in upland (construction sites) erosion versus stream bank erosion is important, as the strategies to reduce the sediment flux may differ considerably based on whether the source is stream banks or uplands. For details regarding watershed characteristics (e.g., riparian area, surface runoff generated, streamflow) that influence the contribution from different sources to stream bed sediment, please refer to Malhotra et al. (2020).
Table 4. Optimum composite fingerprinting properties obtained with the machine learning techniques utilized in this study and the two-step statistical approach used in the previous study (Malhotra et al., 2020).
| Site | Test | Fingerprinting Properties |
| --- | --- | --- |
| 1 | Random Forest | Nb, V, Ag |
| 1 | LASSO | Nb, V, Ag |
| 1 | U-test + DFA | Nb, V |
| 2 | Random Forest | Ag, Be, Zr, Pb |
| 2 | LASSO | Ag, Zr |
| 2 | U-test + DFA | Ag, Be |
| 3 | Random Forest | V, Ga, Cr, Zr, Sc, Fe, Nb |
| 3 | LASSO | V, Ag, Zr, Mo |
| 3 | U-test + DFA | V, Sm, Y, Fe, Zr, Ta, Ho |
Uncertainty and Equivalence Testing
The uncertainties calculated using Monte Carlo simulation in the previous study ranged from 1% to 14% for all sites throughout the sampling period. However, the uncertainty assessed as the standard deviation from the Bayesian mixing model in this study ranged from 4% to 32% for all sites over the entire sampling period.
Depending on the composite signature used, the root mean squared difference was found to be 16% for site 1 and ranged from 4% to 55% and 30% to 33% for sites 2 and 3, respectively (table 8). Using TOST for site 1, the three methods showed statistical non-equivalence with a 90% confidence interval of (-18.0, -9). At site 1, it is worth noting that the apportionment results from RF and LASSO were identical, and therefore, only the confidence interval derived from U-test + DFA and RF/LASSO is provided. For site 2, the paired TOST showed statistical equivalence between the apportionment results calculated by RF and LASSO but not for LASSO and U-test + DFA and RF and U-test + DFA. For example, the 90% confidence interval obtained by RF and LASSO, LASSO and U-test + DFA, and RF and U-test + DFA were (1.4, 4.3), (-64, -40), and (-61, -37), respectively. For site 3, none of the methods showed statistical equivalence, with the 90% confidence interval varying between (-10, 27) for RF and LASSO, and (-16, 26) and (-4, 31) for LASSO and U-test + DFA, and RF and U-test + DFA, respectively. As suggested by the high values of RMSD and statistical non-equivalence obtained at the majority of the sites, the three methods used to select the composite fingerprints, as well as the Bayesian and frequentist mixing model frameworks for sediment fingerprinting applications, were judged to be different.
Hence, fingerprinting properties selection using various statistical procedures has the potential to impact source apportionments. Accordingly, we believe that it is important to incorporate robust statistical approaches when selecting fingerprinting properties to obtain reliable results.
Table 5. Credible intervals and mean values of stream bank contribution to stream bed sediment obtained from Bayesian mixing model after using RF and LASSO procedures at site 1 for the entire sampling period.
| Sampling Period (2016 December – 2017 September) | Credible Interval of Stream Bank Contribution (RF)[a] | Mean Stream Bank Contribution (RF)[a] | Credible Interval of Stream Bank Contribution (LASSO)[a] | Mean Stream Bank Contribution (LASSO)[a] |
| --- | --- | --- | --- | --- |
| 2 December (core 1) | 0.81 – 0.99 | 0.91 | 0.81 – 0.99 | 0.91 |
| 2 December (core 2) | 0.65 – 0.99 | 0.80 | 0.65 – 0.99 | 0.80 |
| 2 December (core 3) | 0.73 – 0.99 | 0.86 | 0.73 – 0.99 | 0.86 |
| 2 December (core 4) | 0.78 – 0.99 | 0.90 | 0.78 – 0.99 | 0.90 |
| 2 December – 5 February | 0.82 – 0.99 | 0.91 | 0.82 – 0.99 | 0.91 |
| 5 February – 10 March | 0.83 – 0.99 | 0.92 | 0.83 – 0.99 | 0.92 |
| 10 March – 15 April | 0.41 – 0.99 | 0.70 | 0.41 – 0.99 | 0.70 |
| 14 April – 18 May | 0.84 – 0.99 | 0.93 | 0.84 – 0.99 | 0.93 |
| 18 May – 28 June | 0.36 – 0.84 | 0.60 | 0.36 – 0.84 | 0.60 |
| 28 June – 28 July | 0.86 – 0.99 | 0.94 | 0.86 – 0.99 | 0.94 |
| 28 July – 22 September | 0.72 – 0.99 | 0.86 | 0.72 – 0.99 | 0.86 |
[a] Proportional contribution of construction sites to the stream bed sediment mixture would be calculated as (1-f ), where f is the proportional contribution of stream banks.
Table 6. Credible intervals and mean values of stream bank contribution to stream bed sediment obtained from Bayesian mixing model after using RF and LASSO procedures at site 2 for the entire sampling period.
| Sampling Period (2016 December – 2017 September) | Credible Interval of Stream Bank Contribution (RF)[a] | Mean Stream Bank Contribution (RF)[a] | Credible Interval of Stream Bank Contribution (LASSO)[a] | Mean Stream Bank Contribution (LASSO)[a] |
| --- | --- | --- | --- | --- |
| 2 December (core 1) | 0.82 – 0.96 | 0.89 | 0.82 – 0.97 | 0.88 |
| 2 December (core 2) | 0.28 – 0.47 | 0.38 | 0.27 – 0.45 | 0.36 |
| 2 December (core 3) | 0.17 – 0.37 | 0.28 | 0.16 – 0.35 | 0.26 |
| 2 December – 5 February | 0.38 – 0.65 | 0.51 | 0.33 – 0.62 | 0.47 |
| 5 February – 10 March | 0.23 – 0.60 | 0.42 | 0.17 – 0.54 | 0.35 |
| 10 March – 14 April | 0.28 – 0.56 | 0.42 | 0.26 – 0.53 | 0.39 |
| 14 April – 18 May | 0.33 – 0.65 | 0.49 | 0.29 – 0.60 | 0.44 |
| 18 May – 28 June | 0.79 – 0.92 | 0.86 | 0.73 – 0.91 | 0.81 |
| 28 June – 28 July | 0.28 – 0.66 | 0.47 | 0.29 – 0.64 | 0.47 |
| 28 July – 22 September | 0.03 – 0.53 | 0.27 | 0.04 – 0.53 | 0.29 |
[a] Proportional contribution of construction sites to the stream bed sediment mixture would be calculated as (1-f ), where f is the proportional contribution of stream banks.
Table 7. Credible intervals and mean values of stream bank contribution to stream bed sediment obtained from Bayesian mixing model after using RF and LASSO procedures at site 3 for the entire sampling period.
| Sampling Period (2016 December – 2017 September) | Credible Interval of Stream Bank Contribution (RF)[a] | Mean Stream Bank Contribution (RF)[a] | Credible Interval of Stream Bank Contribution (LASSO)[a] | Mean Stream Bank Contribution (LASSO)[a] |
| --- | --- | --- | --- | --- |
| 2 December (core 1) | 0.33 – 0.99 | 0.64 | 0.61 – 0.99 | 0.79 |
| 2 December (core 2) | 0.15 – 0.58 | 0.38 | 0.17 – 0.84 | 0.51 |
| 2 December – 5 February | 0.01 – 0.47 | 0.27 | 0.02 – 0.79 | 0.49 |
| 5 February – 10 March | 0.01 – 0.27 | 0.14 | 0.02 – 0.60 | 0.33 |
| 10 March – 14 April | 0.16 – 0.70 | 0.42 | 0.06 – 0.60 | 0.33 |
| 14 April – 18 May | 0.42 – 0.91 | 0.65 | 0.06 – 0.89 | 0.53 |
| 18 May – 28 June | 0.62 – 0.99 | 0.79 | 0.37 – 0.97 | 0.64 |
| 28 June – 28 July | 0.88 – 0.99 | 0.94 | 0.02 – 0.47 | 0.24 |
| 28 July – 22 September | 0.94 – 0.99 | 0.97 | 0.22 – 0.99 | 0.55 |
[a] Proportional contribution of construction sites to the stream bed sediment mixture would be calculated as (1-f ), where f is the proportional contribution of stream banks.
Conclusions and Recommendations
Different statistical approaches were used to investigate the relative importance of stream banks and construction sites as sediment sources in the study area. Two different composites of fingerprinting properties were selected using different machine learning approaches, and another composite was selected using the two-step statistical approach. Sediment source contributions were estimated using a Bayesian mixing model framework, and the estimates were compared with those obtained using a frequentist mixing model. Sediment source apportionment results varied with the statistical method employed. Therefore, care is needed when selecting statistical tests to obtain the optimum composite fingerprints, as sediment source apportionment results are sensitive to the fingerprinting properties selected. The present study utilized non-informative priors for the model parameters; in future work, prior scientific knowledge could be incorporated to gain insight into the sediment dynamics of a watershed. It should be noted, however, that informative priors can lead to potentially biased results and should be justified when specifying the Bayesian model.
Table 8. Root mean squared difference (RMSD) of the source proportions predicted using different composite signatures for all sites.
| Comparison | Site 1 RMSD (%) | Site 2 RMSD (%) | Site 3 RMSD (%) |
| --- | --- | --- | --- |
| RF vs. LASSO | 0 | 4 | 30 |
| LASSO vs. U-test + DFA | 16 | 55 | 33 |
| U-test + DFA vs. RF | 16 | 53 | 31 |
Limitations and Future Work
In this study, only the two sources (stream banks and construction sites) from which active sediment mobilization was observed during our sampling trips were considered. Future studies should therefore consider additional potential sources of sediment (e.g., road dust, forested areas) in this watershed to better assess the uncertainty in the model. It would be interesting to see how the number of sources considered influences the source apportionment results. Future research could also incorporate the physical and chemical basis (process-based reasoning) for source discrimination by fingerprinting properties, in addition to the statistical basis that has been common in most fingerprinting studies. Sediment source apportionment results varied with the statistical method employed in this study. To our knowledge, no other sediment fingerprinting study has utilized a combination of machine learning techniques such as RF and LASSO, so it would not be wise to recommend a particular statistical method based on the results of one study alone. Further studies are required to test the utility of these methods for sediment source apportionment in different watershed settings to target management solutions.
Acknowledgments
Funding for this project was provided by the USGS State Water Competitive Grants Program (Project No. 2016AL176B), the USDA-NIFA Hatch project (ALA014-1-19052), the Alabama Agricultural Experiment Station, and the Auburn University Intramural Grant Program. We are also thankful to Stephanie Shepherd, Puneet Srivastava, staff at the City of Auburn’s Watershed Division, and Moore’s Mill Club, Auburn, AL, for their assistance in this study.
References
Alabdulwahhab, K. M., Sami, W., Mehmood, T., Meo, S. A., Alasbali, T. A., & Alwadani, F. A. (2021). Automated detection of diabetic retinopathy using machine learning classifiers. Eur. Rev. Med. Pharmacol. Sci., 25(2), 583-590. https://doi.org/10.26355/eurrev_202101_24615
Balzter, H., Cole, B., Thiel, C., & Schmullius, C. (2015). Mapping CORINE land cover from Sentinel-1A SAR and SRTM digital elevation model data using random forests. Remote Sens., 7(11), 14876-14898. https://doi.org/10.3390/rs71114876
Barthod, L. R., Liu, K., Lobb, D. A., Owens, P. N., Martínez-Carreras, N., Koiter, A. J.,... Gaspar, L. (2015). Selecting color-based tracers and classifying sediment sources in the assessment of sediment dynamics using sediment source fingerprinting. J. Environ. Qual., 44(5), 1605-1616. https://doi.org/10.2134/jeq2015.01.0043
Berger, R. L., & Hsu, J. C. (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Stat. Sci., 11(4), 283-319. https://doi.org/10.1214/ss/1032280304
Chin, A. (2006). Urban transformation of river landscapes in a global context. Geomorphology, 79(3), 460-487. https://doi.org/10.1016/j.geomorph.2006.06.033
Collins, A. L., Walling, D. E., & Leeks, G. J. (1997). Source type ascription for fluvial suspended sediment based on a quantitative composite fingerprinting technique. CATENA, 29(1), 1-27. https://doi.org/10.1016/S0341-8162(96)00064-1
Cooper, R. J., & Krueger, T. (2017). An extended Bayesian sediment fingerprinting mixing model for the full Bayes treatment of geochemical uncertainties. Hydrol. Processes, 31(10), 1900-1912. https://doi.org/10.1002/hyp.11154
Cooper, R. J., Krueger, T., Hiscock, K. M., & Rawlins, B. G. (2014). Sensitivity of fluvial sediment source apportionment to mixing model assumptions: A Bayesian model comparison. Water Resour. Res., 50(11), 9031-9047. https://doi.org/10.1002/2014WR016194
Cooper, R. J., Krueger, T., Hiscock, K. M., & Rawlins, B. G. (2015). High-temporal resolution fluvial sediment source fingerprinting with uncertainty: A Bayesian approach. Earth Surf. Process. Landforms, 40(1), 78-92. https://doi.org/10.1002/esp.3621
Darst, B. F., Malecki, K. C., & Engelman, C. D. (2018). Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet., 19(1), 65. https://doi.org/10.1186/s12863-018-0633-8
Davies, J., Olley, J., Hawker, D., & McBroom, J. (2018). Application of the Bayesian approach to sediment fingerprinting and source attribution. Hydrol. Process., 32(26), 3978-3995. https://doi.org/10.1002/hyp.13306
D’Haen, K., Verstraeten, G., Dusar, B., Degryse, P., Haex, J., & Waelkens, M. (2013). Unravelling changing sediment sources in a Mediterranean mountain catchment: A Bayesian fingerprinting approach. Hydrol. Process., 27(6), 896-910. https://doi.org/10.1002/hyp.9399
Fox, J. F., & Papanicolaou, A. N. (2008). An un-mixing model to study watershed erosion processes. Adv. Water Resour., 31(1), 96-108. https://doi.org/10.1016/j.advwatres.2007.06.008
Franz, C., Makeschin, F., Weiß, H., & Lorz, C. (2014). Sediments in urban river basins: Identification of sediment sources within the Lago Paranoá catchment, Brasilia DF, Brazil – using the fingerprint approach. Sci. Total Environ., 466-467, 513-523. https://doi.org/10.1016/j.scitotenv.2013.07.056
Gateuille, D., Owens, P. N., Petticrew, E. L., Booth, B. P., French, T. D., & Déry, S. J. (2019). Determining contemporary and historical sediment sources in a large drainage basin impacted by cumulative effects: The regulated Nechako River, British Columbia, Canada. J. Soils Sediments, 19(9), 3357-3373. https://doi.org/10.1007/s11368-019-02299-2
Gellis, A. C., & Noe, G. B. (2013). Sediment source analysis in the Linganore Creek watershed, Maryland, USA, using the sediment fingerprinting approach: 2008 to 2010. J. Soils Sediments, 13(10), 1735-1753. https://doi.org/10.1007/s11368-013-0771-6
Granitto, P. M., Furlanello, C., Biasioli, F., & Gasperi, F. (2006). Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemom. Intell. Lab. Syst., 83(2), 83-90. https://doi.org/10.1016/j.chemolab.2006.01.007
Haney, N. R., & Davis, L. (2015). Potential controls of alluvial bench deposition and erosion in southern Piedmont streams, Alabama (USA). Geomorphology, 241, 292-303. https://doi.org/10.1016/j.geomorph.2015.04.005
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer Science & Business Media. https://doi.org/10.1007/978-0-387-21606-5
Koiter, A. J., Owens, P. N., Petticrew, E. L., & Lobb, D. A. (2013). The behavioural characteristics of sediment properties and their implications for sediment fingerprinting as an approach for identifying sediment sources in river basins. Earth Sci. Rev., 125, Suppl. C, 24-42. https://doi.org/10.1016/j.earscirev.2013.05.009
Kumar, S., Attri, S. D., & Singh, K. K. (2019). Comparison of Lasso and stepwise regression technique for wheat yield prediction. J. Agrometeorol., 21(2).
Lakens, D. (2017). Equivalence tests: A practical primer for t tests, correlations, and meta-analyses. Soc. Psychol. Personal. Sci., 8(4), 355-362. https://doi.org/10.1177/1948550617697177
Lamba, J., Karthikeyan, K. G., & Thompson, A. M. (2015). Apportionment of suspended sediment sources in an agricultural watershed using sediment fingerprinting. Geoderma, 239-240, 25-33. https://doi.org/10.1016/j.geoderma.2014.09.024
Leng, C., Lin, Y., & Wahba, G. (2006). A note on the lasso and related procedures in model selection. Stat. Sin., 16(4), 1273-1284. http://www.jstor.org/stable/24307787
Liu, B., Storm, D. E., Zhang, X. J., Cao, W., & Duan, X. (2016). A new method for fingerprinting sediment source contributions using distances from discriminant function analysis. CATENA, 147, 32-39. https://doi.org/10.1016/j.catena.2016.06.039
Malhotra, K., Lamba, J., & Shepherd, S. (2020). Sources of stream bed sediment in an urbanized watershed. CATENA, 184, 104228. https://doi.org/10.1016/j.catena.2019.104228
Malhotra, K., Lamba, J., Srivastava, P., & Shepherd, S. (2018). Fingerprinting suspended sediment sources in an urbanized watershed. Water, 10(11), 1573. https://doi.org/10.3390/w10111573
Mara, C. A., & Cribbie, R. A. (2012). Paired-samples tests of equivalence. Commun. Stat. - Simul. Comput., 41(10), 1928-1943. https://doi.org/10.1080/03610918.2011.626545
Martinez-Carreras, N., Gallart, F., Iffly, J. F., Pfister, L., Walling, D. E., & Krein, A. (2008). Uncertainty assessment in suspended sediment fingerprinting based on tracer mixing models: A case study from Luxembourg. IAHS Publ., 325, 94.
Martínez-Carreras, N., Udelhoven, T., Krein, A., Gallart, F., Iffly, J. F., Ziebel, J.,... Walling, D. E. (2010). The use of sediment colour measured by diffuse reflectance spectrometry to determine sediment sources: Application to the Attert River catchment (Luxembourg). J. Hydrol., 382(1), 49-63. https://doi.org/10.1016/j.jhydrol.2009.12.017
Massoudieh, A., Gellis, A., Banks, W. S., & Wieczorek, M. E. (2013). Suspended sediment source apportionment in Chesapeake Bay watershed using Bayesian chemical mass balance receptor modeling. Hydrol. Process., 27(24), 3363-3374. https://doi.org/10.1002/hyp.9429
Menze, B. H., Kelm, B. M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., & Hamprecht, F. A. (2009). A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinf., 10(1), 213. https://doi.org/10.1186/1471-2105-10-213
Moore, J. W., & Semmens, B. X. (2008). Incorporating uncertainty and prior information into stable isotope mixing models. Ecol. Lett., 11(5), 470-480. https://doi.org/10.1111/j.1461-0248.2008.01163.x
Mukundan, R., Radcliffe, D. E., Ritchie, J. C., Risse, L. M., & McKinley, R. A. (2010). Sediment fingerprinting to determine the source of suspended sediment in a southern Piedmont stream. J. Environ. Qual., 39(4), 1328-1337. https://doi.org/10.2134/jeq2009.0405
Nosrati, K., & Collins, A. L. (2019a). Fingerprinting the contribution of quarrying to fine-grained bed sediment in a mountainous catchment, Iran. River Res. Appl., 35(3), 290-300. https://doi.org/10.1002/rra.3408
Nosrati, K., & Collins, A. L. (2019b). Investigating the importance of recreational roads as a sediment source in a mountainous catchment using a fingerprinting procedure with different multivariate statistical techniques and a Bayesian un-mixing model. J. Hydrol., 569, 506-518. https://doi.org/10.1016/j.jhydrol.2018.12.019
Nosrati, K., Collins, A. L., & Madankan, M. (2018). Fingerprinting sub-basin spatial sediment sources using different multivariate statistical techniques and the Modified MixSIR model. CATENA, 164, 32-43. https://doi.org/10.1016/j.catena.2018.01.003
Nosrati, K., Fathi, Z., & Collins, A. L. (2019). Fingerprinting sub-basin spatial suspended sediment sources by combining geochemical tracers and weathering indices. Environ. Sci. Pollut. Res., 26(27), 28401-28414. https://doi.org/10.1007/s11356-019-06024-x
Nosrati, K., Govers, G., Semmens, B. X., & Ward, E. J. (2014). A mixing model to incorporate uncertainty in sediment fingerprinting. Geoderma, 217-218, 173-180. https://doi.org/10.1016/j.geoderma.2013.12.002
Palazón, L., & Navas, A. (2017). Variability in source sediment contributions by applying different statistic test for a Pyrenean catchment. J. Environ. Manag., 194, 42-53. https://doi.org/10.1016/j.jenvman.2016.07.058
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Proc. 3rd Intl. Workshop Distributed Statistical Computing, 124, pp. 1-10.
Pulley, S., Foster, I., & Antunes, P. (2015). The uncertainties associated with sediment fingerprinting suspended and recently deposited fluvial sediment in the Nene river basin. Geomorphology, 228, 303-319. https://doi.org/10.1016/j.geomorph.2014.09.016
Raigani, Z. M., Nosrati, K., & Collins, A. L. (2019). Fingerprinting sub-basin spatial sediment sources in a large Iranian catchment under dry-land cultivation and rangeland farming: Combining geochemical tracers and weathering indices. J. Hydrol.: Reg. Stud., 24, 100613. https://doi.org/10.1016/j.ejrh.2019.100613
Rose, L. A., Karwan, D. L., & Aufdenkampe, A. K. (2018). Sediment fingerprinting suggests differential suspended particulate matter formation and transport processes across hydrologic regimes. J. Geophys. Res.: Biogeosci., 123(4), 1213-1229. https://doi.org/10.1002/2017JG004210
Russell, K. L., Vietz, G. J., & Fletcher, T. D. (2019). Urban sediment supply to streams from hillslope sources. Sci. Total Environ., 653, 684-697. https://doi.org/10.1016/j.scitotenv.2018.10.374
Schuller, P., Walling, D. E., Iroumé, A., Quilodrán, C., Castillo, A., & Navas, A. (2013). Using 137Cs and 210Pbex and other sediment source fingerprints to document suspended sediment sources in small forested catchments in south-central Chile. J. Environ. Radioact., 124, 147-159. https://doi.org/10.1016/j.jenvrad.2013.05.002
Sherriff, S. C., Franks, S. W., Rowan, J. S., Fenton, O., & Ó hUallacháin, D. (2015). Uncertainty-based assessment of tracer selection, tracer non-conservativeness and multiple solutions in sediment fingerprinting using synthetic and field data. J. Soils Sediments, 15(10), 2101-2116. https://doi.org/10.1007/s11368-015-1123-5
Sherriff, S. C., Rowan, J. S., Fenton, O., Jordan, P., & Ó hUallacháin, D. (2018). Sediment fingerprinting as a tool to identify temporal and spatial variability of sediment sources and transport pathways in agricultural catchments. Agric. Ecosyst. Environ., 267, 188-200. https://doi.org/10.1016/j.agee.2018.08.023
Small, I. F., Rowan, J. S., & Franks, S. W. (2002). Quantitative sediment fingerprinting using a Bayesian uncertainty estimation framework. In The Structure, Function and Management Implications of Fluvial Sedimentary Systems (pp. 443-450). International Association of Hydrological Sciences.
Smith, H. G., & Blake, W. H. (2014). Sediment fingerprinting in agricultural catchments: A critical re-examination of source discrimination and data corrections. Geomorphology, 204, Suppl. C, 177-191. https://doi.org/10.1016/j.geomorph.2013.08.003
Sohil, F., Sohail, M. U., & Shabbir, J. (2022). An introduction to statistical learning with applications in R. Stat. Theory Related Fields, 6(1), 87. https://doi.org/10.1080/24754269.2021.1980261
Stewart, H. A., Massoudieh, A., & Gellis, A. (2015). Sediment source apportionment in Laurel Hill Creek, PA, using Bayesian chemical mass balance and isotope fingerprinting. Hydrol. Process., 29(11), 2545-2560. https://doi.org/10.1002/hyp.10364
Stock, B. C., Jackson, A. L., Ward, E. J., Parnell, A. C., Phillips, D. L., & Semmens, B. X. (2018). Analyzing mixing systems using a new generation of Bayesian tracer mixing models. PeerJ, 6, e5096. https://doi.org/10.7717/peerj.5096
Stone, M., Collins, A. L., Silins, U., Emelko, M. B., & Zhang, Y. S. (2014). The use of composite fingerprints to quantify sediment sources in a wildfire impacted landscape, Alberta, Canada. Sci. Total Environ., 473-474, 642-650. https://doi.org/10.1016/j.scitotenv.2013.12.052
Taylor, K. G., & Owens, P. N. (2009). Sediments in urban river basins: A review of sediment-contaminant dynamics in an environmental system conditioned by human activities. J. Soils Sediments, 9(4), 281-303. https://doi.org/10.1007/s11368-009-0103-z
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc.: Ser. B (Methodol.), 58(1), 267-288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Tiecher, T., Caner, L., Minella, J. P., & Santos, D. R. (2015). Combining visible-based-color parameters and geochemical tracers to improve sediment source discrimination and apportionment. Sci. Total Environ., 527-528, 135-149. https://doi.org/10.1016/j.scitotenv.2015.04.103
Tyler, G. (2004). Rare earth elements in soil and plant systems - A review. Plant Soil, 267(1), 191-206. https://doi.org/10.1007/s11104-005-4888-2
USEPA. (1996). Method 3052: Microwave assisted acid digestion of siliceous and organically based matrices. SW-846. Washington, DC: United States Environmental Protection Agency.
Vasquez, M. M., Hu, C., Roe, D. J., Chen, Z., Halonen, M., & Guerra, S. (2016). Least absolute shrinkage and selection operator type methods for the identification of serum biomarkers of overweight and obesity: Simulation and application. BMC Med. Res. Methodol., 16(1), 154. https://doi.org/10.1186/s12874-016-0254-8
Wilkinson, S. N., Hancock, G. J., Bartley, R., Hawdon, A. A., & Keen, R. J. (2013). Using sediment tracing to assess processes and spatial patterns of erosion in grazed rangelands, Burdekin River basin, Australia. Agric. Ecosyst. Environ., 180, Suppl. C, 90-102. https://doi.org/10.1016/j.agee.2012.02.002
Wright, M. N., Wager, S., & Probst, P. (2016). ranger: A fast implementation of random forests. R package version 0.5.0. Retrieved from https://CRAN.R-project.org/package=ranger
Yu, M., & Rhoads, B. L. (2018). Floodplains as a source of fine sediment in grazed landscapes: Tracing the source of suspended sediment in the headwaters of an intensively managed agricultural landscape. Geomorphology, 308, 278-292. https://doi.org/10.1016/j.geomorph.2018.01.022
Zhou, W., Han, G., Liu, M., Song, C., & Li, X. (2020). Geochemical distribution characteristics of rare earth elements in different soil profiles in Mun River Basin, Northeast Thailand. Sustainability, 12(2), 457. https://doi.org/10.3390/su12020457