Prediction Of Chlorophyll-a As an Index of Harmful Algal Blooms Using Machine Learning Models

Ibrahim Busari1, Debabrata Sahoo1,*, R. Daren Harmel2, Brian E. Haggard3


Published in Journal of Natural Resources and Agricultural Ecosystems 2(2): 53-61 (doi: 10.13031/jnrae.15812). 2024 American Society of Agricultural and Biological Engineers.


1    Department of Agricultural Sciences, Clemson University, Clemson, South Carolina, USA.

2    USDA ARS, Fort Collins, Colorado, USA.

3    Arkansas Water Resources Center, University of Arkansas, Fayetteville, Arkansas, USA.

*    Correspondence: dsahoo@clemson.edu

The authors have paid for open access for this article. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License https://creativecommons.org/licenses/by-nc-nd/4.0/

Submitted for review on 10 September 2023 as manuscript number NRES 15812; approved for publication as a Research Article and as part of the “Digital Water: Computing Tools, Technologies, and Trends” Collection by Associate Editor Dr. Sushant Mehan and Community Editor Dr. Kati Migliaccio of the Natural Resources & Environmental Systems Community of ASABE on 12 January 2024.

Mention of company or trade names is for description only and does not imply endorsement by the USDA. The USDA is an equal opportunity provider and employer.

Highlights

Abstract. The complex dynamics of freshwater harmful algal blooms (HABs) necessitate proactive monitoring approaches to mitigate their impacts. Rapid advances in computing power and statistics are driving the development of data-driven techniques such as machine learning (ML) models, which have been shown in different fields to be instrumental in finding patterns that explain relationships in observed data. This study assesses the ability of ML models to monitor HABs in a lake using chlorophyll-a concentration as the index. The selected models were regression tree, random forest (RF), multilayer perceptron (MLP), support vector regression (SVR), long short-term memory (LSTM), and gated recurrent unit (GRU) models, with the last two able to consider the temporal sequence of the water quality datasets. The results showed that the RF model, with a coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE) of 0.87, 0.97 µgL-1, and 3.53 µgL-1, respectively, outperformed the SVR, MLP, and regression tree models. The LSTM model, with MAE and RMSE of 2.39 µgL-1 and 3.29 µgL-1, respectively, predicted the temporal dynamics of chlorophyll-a better than the GRU model, although with a longer runtime, and showed potential for developing real-time HAB monitoring and early warning systems. The findings reveal the robustness of the chosen ML models and highlight factors that researchers and policymakers should weigh when determining the most suitable approaches for monitoring HABs.

Keywords. Cyanobacteria, Early warning systems, Freshwater, HABs, Machine learning models.

Achieving clean water and sanitation, as stated in the sixth sustainable development goal (SDG) designed by the United Nations (UN), entails effective monitoring of lakes and ponds, especially in the face of threats such as climate change and anthropogenic activities (e.g., urbanization and agriculture) (Janssen et al., 2019). Harmful Algal Blooms (HABs) in lakes and ponds are detrimental to freshwater systems' health and threaten their sustainability. HABs affect aquatic ecosystems through biomass production/decay and toxin release (Grattan et al., 2016). Toxins can contaminate drinking water sources, impacting the health of the ecosystem, pets, livestock, humans, and cause fish kills (Carmichael and Boyer, 2016). These toxins include neurotoxins and hepatotoxins, which affect exposed animals' nervous systems and livers (Christensen and Khan, 2020). The effects of HABs on the economy include increased water treatment costs, reduced property value, and biodiversity loss (Christa et al., 2021). Climate change poses significant threats to the expansion of HABs, including increased runoff from uneven precipitation patterns and high temperatures due to global warming (Gobler, 2020). Excess nutrient influx, such as phosphorus (P) and nitrogen (N), into waterbodies after rainfall events could trigger algal expansion, especially after drought conditions (Reichwaldt and Ghadouani, 2012). The detrimental impacts of HABs on the ecosystem, economy, and public health necessitate their robust monitoring.

Monitoring strategies for HABs often entail laboratory analysis of water samples and quantification of various algal-related parameters such as chlorophyll-a (Katin et al., 2021), phycocyanin (Giere et al., 2020), cyanobacteria or other algal cells (Ponjavic et al., 2019), and various algal toxins (Greer et al., 2016). The analyses include microscopic and analytical approaches using a spectrophotometer, liquid chromatography, whole-organism bioassays, and biochemical and immunological assays (Lombard et al., 2019). These techniques are vital for identifying algae up to the species level and require a high degree of technical expertise. Using remotely sensed images derived from satellites and unmanned air vehicles (UAV) is also useful for spatial monitoring of HAB expansion (Bresciani et al., 2014; Lekki et al., 2019), especially for observing bloom conditions over various water bodies. This method could be combined with other Geographical Information Systems (GIS) to provide real-time monitoring systems.

Biogeochemical processes, such as the cycling of P, N, and carbon (C), which influence HABs in freshwater systems, can also be simulated in mechanistic models using mathematical expressions. These computer models include the Water Quality Analysis Simulation Program (WASP), Soil and Water Assessment Tool (SWAT), and QUAL2K, which help explore the impacts of Best Management Practices (BMPs) and climate changes on water quality dynamics (Costa et al., 2021; Mbuh et al., 2019; Privette and Smink, 2017). However, these models are limited in toxin modeling, requiring multiple datasets and intense calibrations, often leading to uncertain predictions (Rousso et al., 2020; Zhang and Rao, 2012).

Due to the limitations of computer models, strengthening the understanding of complex biogeochemical processes and various HABs drivers relies on water quality monitoring using multiparameter sensors, thereby enhancing the development of machine learning (ML) models. ML models can find patterns in datasets usable for making reasonable inferences (Zhu et al., 2023). These models have been applied in predicting HABs based on the relationship between different algal-related factors and water quality parameters (Busari et al., 2023a; Cho et al., 2018; Shin et al., 2020). Various ML models have been developed and are in practice for determining drivers of algal dynamics and multistep prediction. Despite advances in ML models for predicting HABs, there is still a need to assess the performance of various ML models in the context of early warning systems. Constructing accurate predictive models requires in-depth research of the specific algal-related factors and water quality parameters that significantly influence HABs dynamics. Water quality parameters such as turbidity, temperature, pH, and dissolved oxygen (DO) influence the chlorophyll-a concentration in lakes and ponds. Turbidity levels impact chlorophyll-a levels due to high suspended solids from phytoplankton and other inorganic materials. Chlorophyll-a dynamics also impact DO concentration in waterbodies due to respiration and photosynthesis engaged by algae (Maslukah et al., 2022). Chlorophyll-a level is an essential index of HABs whose distribution and concentration highly relate to algal expansion and their toxicity (Humbert and Fastner, 2016; Marlian et al., 2015). While cyanobacteria growth often corresponds to increased chlorophyll-a concentration, their toxicity may vary with site-specific environmental factors (Hartshorn et al., 2016), suggesting a less absolute relationship between chlorophyll-a and algal toxins. 
Additionally, higher chlorophyll-a concentrations do not necessarily mean a high toxin concentration but may be predictive of the exceedance probability of a certain threshold (Hollister and Kreakie, 2016).

This study is the second of two articles exploring and reviewing the application of ML models for HABs monitoring in freshwater ecosystems (see Busari et al., 2023b). Effective monitoring of HABs using these ML models requires an accurate understanding of their limitations so that the models can be improved. Identifying the strengths and weaknesses of these models will help determine the data or scenario that suits each model. Decision-making about HABs can be improved by accurately predicting chlorophyll-a once a suitable scenario for each model has been identified. The current study compares the efficacy of several ML models in predicting HABs based on relationships between chlorophyll-a concentration and various high-frequency water quality parameters. The models selected in this study include regression trees (RT), random forest (RF), support vector regression (SVR), multilayer perceptron (MLP), long short-term memory (LSTM), and gated recurrent units (GRU). These models were selected because they are frequently used in HABs prediction and to be consistent with those reviewed in Busari et al. (2023b), the first part of this two-part manuscript. Two data splitting ratios were used for model development, and the data sequence was also considered to explore the potential of ML models for the continuous prediction of HABs. By considering these factors, a comprehensive analysis of the strengths and limitations of each ML model can be achieved, providing valuable insights for effective HABs monitoring and prediction.

Methodology

Data Acquisition and Processing

The data used for model comparison were obtained from the Greenville County Department of Land Development, Stormwater Division. A multiparameter YSI sensor collected near-surface pH, chlorophyll-a (µgL-1), turbidity (NTU), dissolved oxygen (DO) (mgL-1), saturated dissolved oxygen (%), oxidation-reduction potential (mV), specific conductivity (µS/cm), and temperature (°C) data every 15 min from 2014 to 2021. In addition to continuous chlorophyll-a data, discrete chlorophyll-a values were measured from grab samples. The dataset was normalized using the min-max method shown in equation 1, which rescales each parameter to the range 0 to 1 and reduces the bias that could arise from differences in the ranges of the parameters.

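$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$

where x is the observed value of a parameter, x_min and x_max are the minimum and maximum of that parameter, and x' is the scaled value.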
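The min-max scaling described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; the function name `min_max_scale` and the temperature readings are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range to rescale; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical readings, not values from the study.
temperature_c = [12.0, 18.5, 25.0, 31.0]
scaled_temperature = min_max_scale(temperature_c)
```

Each parameter would be scaled separately, so that a variable with a wide numeric range (for example, oxidation-reduction potential in mV) does not dominate one with a narrow range (for example, pH).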

Figure 1. Reedy River watershed and its land use characteristics truncated at Boyd Millpond.

The missing values in the datasets were imputed using the K-nearest neighbors (KNN) algorithm, which estimates missing values based on Euclidean distance to known values.
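A distance-based fill of missing values can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function name `knn_impute`, the choice of k, and the sample rows are hypothetical, and the rows are assumed to be already scaled to a common range.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest rows that observed that feature.

    Distance is Euclidean over the features that both rows have observed.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical scaled rows: [pH, turbidity]; the second row lacks turbidity.
sample_rows = [[0.10, 0.20], [0.12, None], [0.90, 0.80], [0.11, 0.22]]
filled_rows = knn_impute(sample_rows, k=2)
```

In this example the missing turbidity value is filled with the average of the two rows whose pH is closest to 0.12.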

Model Architectures

The models evaluated for HABs prediction in this study are RT, RF, SVR, MLP, LSTM, and GRU, as shown in table 1. The independent variables were near-surface pH, DO, specific conductivity, saturated DO, temperature, turbidity, and oxidation-reduction potential, recorded continuously at 15 min intervals. The chlorophyll-a concentration was used as the output and as the index of HABs. The model structures varied based on the features of each model. Descriptive statistics were also computed to obtain the mean and standard deviation of each model's predictions.

Table 1. Summary of models developed in this study.
| Models | Splitting Ratio | Learning Rate |
|---|---|---|
| RT, RF, SVR, and MLP | 70:30; 80:20 | |
| LSTM and GRU | | 0.01; 0.001 |

RT

The RT model was constructed using a maximum depth of 15, obtained after several iterations. The maximum depth is a parameter that determines the number of splits allowed for the tree. Predictions using an RT aim to predict continuous values as the output of unseen sets of new input data using trained patterns observed in the sample data, as shown in equation 2.

$$O = f(x;\, \theta) \qquad (2)$$

where

O = output

x = new set of input observations

f = trained regression function

θ = trained model parameters.
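How a depth-limited regression tree maps new inputs to a continuous output can be sketched with a small recursive implementation. This is an illustrative sketch, not the study's code: the function names, the squared-error split criterion, and the toy pH and chlorophyll-a values are assumptions, and a depth of 1 is used here only to keep the example easy to follow.

```python
def fit_tree(X, y, max_depth):
    """Recursively grow a regression tree by minimizing the squared error of each split."""
    mean = sum(y) / len(y)
    if max_depth == 0 or len(set(y)) == 1:
        return {"value": mean}
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: every unique value except the largest,
        # so that both child nodes are always non-empty.
        for threshold in sorted(set(row[j] for row in X))[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= threshold]
            right = [i for i, row in enumerate(X) if row[j] > threshold]
            sse = 0.0
            for idx in (left, right):
                m = sum(y[i] for i in idx) / len(idx)
                sse += sum((y[i] - m) ** 2 for i in idx)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left, right)
    if best is None:
        return {"value": mean}
    _, j, threshold, left, right = best
    return {
        "feature": j,
        "threshold": threshold,
        "left": fit_tree([X[i] for i in left], [y[i] for i in left], max_depth - 1),
        "right": fit_tree([X[i] for i in right], [y[i] for i in right], max_depth - 1),
    }


def predict_tree(tree, x):
    """Follow the split rules down to a leaf and return its mean value."""
    while "value" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]


# Hypothetical data: one input (pH) and chlorophyll-a in µgL-1.
X_toy = [[7.0], [7.2], [8.5], [8.8]]
y_toy = [5.0, 6.0, 20.0, 22.0]
toy_tree = fit_tree(X_toy, y_toy, max_depth=1)
low_prediction = predict_tree(toy_tree, [7.1])
high_prediction = predict_tree(toy_tree, [9.0])
```

Raising `max_depth` allows more splits and a closer fit to the training data, which is the trade-off the maximum depth parameter controls.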

The data were split in the ratio 70:30, where 70% of the data were randomly selected for model training and 30% were used for validation (Nguyen et al., 2021). The study also used a splitting ratio of 80:20 to observe the increased performance when an increased training dataset was used.
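The random splitting described above can be illustrated with a short sketch. The function name `random_split`, the sample count, and the seed are hypothetical; the study does not state how the random selection was implemented.

```python
import random


def random_split(n_samples, train_fraction, seed=0):
    """Shuffle sample indices and cut them into training and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]


# Hypothetical dataset of 1000 records.
train_70, test_30 = random_split(1000, 0.70)
train_80, test_20 = random_split(1000, 0.80)
```

Because the indices are shuffled, the time order of the records is discarded, which is appropriate only for models that treat each observation independently.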

RF

The same splitting criteria were used for the RF model. The RF model was developed using several regression trees constructed with random subsets of the whole dataset. A 3-fold cross-validation was adopted for efficient model training and to prevent overfitting. An RF model's performance depends on the quality of the individual trees and the correlation among the trees (Breiman, 2001). The correlation between the trees refers to the ordinary correlation of predictions on the samples not included during the development of the current tree. These samples, known as the Out of Bag sample, are used to estimate the prediction error and evaluate variable importance (Genuer et al., 2010). The number of trees that make up the forest and the maximum depth of each tree are important parameters when applying the RF model and were optimized using the grid search cross-validation method. The grid search approach tunes parameters over a specified grid to obtain the optimum parameter combination for the model.
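The relationship between the random subset used to grow one tree and its Out of Bag sample can be sketched as below. The sketch assumes sampling with replacement (bootstrap), as in Breiman's formulation; the function name and sample count are hypothetical.

```python
import random


def bootstrap_with_oob(n_samples, seed=0):
    """Draw one bootstrap sample and list the indices that were never drawn."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag_set = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, out_of_bag


# Hypothetical training set of 1000 records for a single tree.
in_bag, out_of_bag = bootstrap_with_oob(1000)
oob_fraction = len(out_of_bag) / 1000
```

For large samples, roughly 1/e (about 37%) of the records are left out of each bootstrap draw, so every tree has a sizeable held-out set on which its prediction error can be estimated.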

SVR

The SVR model was also developed using hyperparameters tuned with the grid search cross-validation method. The optimized parameters are 'epsilon,' which defines the width of the tube around the hyperplane, and 'C,' which determines the tradeoff between the training error and the flatness of the model. The Radial Basis Function (RBF) was selected as the kernel function. Kernel functions are essential tools in machine learning for dealing with nonlinear problems while still relying on linear algebra concepts (Johnson et al., 2020). While various kernel functions exist to map the input data into the feature space, the RBF is the most often used due to its ease of implementation and its capability to map the training datasets nonlinearly into an infinite-dimensional space.
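The two ingredients named above, the RBF kernel and the epsilon tube, can be written out directly. This is a sketch of the standard definitions rather than the study's code; the kernel width `gamma` is not mentioned in the source and is introduced here as an assumption, as are the numeric values.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors inside the epsilon tube cost nothing; larger errors cost their excess."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical values for illustration only.
same_point = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
far_point = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
inside_tube = epsilon_insensitive_loss(10.0, 10.05, epsilon=0.1)
outside_tube = epsilon_insensitive_loss(10.0, 10.5, epsilon=0.1)
```

The kernel equals 1 for identical points and decays toward 0 with distance, while the loss shows why a wider epsilon tube tolerates more prediction error before a penalty weighted by C is applied.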

MLP

A fully connected MLP model was also developed for chlorophyll-a prediction. The MLP model consisted of one input layer, two hidden layers with 64 nodes each, and one output layer. The layer sizes were obtained after several trials with different combinations. The learning rate, batch size, and number of epochs, which are important hyperparameters, were tuned using the grid search cross-validation method. The datasets were split into training and testing sets as in the previous models, where 70% of the data were randomly selected for training and the rest were used for testing.
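As a rough check on the size of this network, the number of trainable parameters can be counted from the layer widths. The sketch assumes seven inputs (the seven water quality variables listed earlier), one output, and a bias term on every layer; the function name is hypothetical.

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 7 inputs -> 64 hidden -> 64 hidden -> 1 output (chlorophyll-a).
mlp_layers = [7, 64, 64, 1]
n_parameters = dense_parameter_count(mlp_layers)
```

Under these assumptions the count is 512 + 4160 + 65 = 4737 parameters, most of which sit in the connection between the two hidden layers.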

LSTM and GRU

It is important to note that the previous models were structured such that the time dependency of the data was ignored. This plays to the models' strengths and avoids their limitation of being unable to model sequential data, which matters when HABs monitoring is performed using non-sequential grab sampling techniques. To account for the temporal dynamics of the data and capture its time series nature, the LSTM and GRU models were developed. The propagation of information across time in individual cell states is enabled through the linear interaction of the components. Information in the LSTM cell is regulated by three gates, each implemented by a sigmoid and a pointwise multiplication (Bianchi et al., 2017).

The three gates work to keep relevant water quality information for long periods and discard irrelevant information. The LSTM model was developed using four stacked LSTM layers with 64 cells each. The learning rate, dropout value, and optimizer were 0.001, 0.2, and adaptive moment estimation (Adam), respectively. The learning rate controls how quickly the model learns from the data during training. Dropout reduces overfitting, while the Adam optimizer adapts the step size of each parameter update as training progresses. The data were split into training and testing sets in a way that preserves the data sequence: the 2014-2018 dataset was used to train the model, and the 2019-2021 dataset was used to test it. The GRU model was constructed using the same architecture as the LSTM model. The difference between the two models lies in their inherent structure, namely the number of gates that control the flow of information in the model.
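A split that respects the data sequence can be sketched as follows. The function name, record layout (timestamp, chlorophyll-a), and values are hypothetical; only the 2018 cutoff comes from the text.

```python
from datetime import datetime


def chronological_split(records, last_training_year):
    """Keep time order: earlier years train the model, later years test it."""
    train = [r for r in records if r[0].year <= last_training_year]
    test = [r for r in records if r[0].year > last_training_year]
    return train, test


# Hypothetical (timestamp, chlorophyll-a in µgL-1) records.
records = [
    (datetime(2014, 6, 1, 0, 0), 9.8),
    (datetime(2018, 12, 31, 23, 45), 12.1),
    (datetime(2019, 1, 1, 0, 0), 11.4),
    (datetime(2021, 7, 15, 12, 30), 14.0),
]
train_records, test_records = chronological_split(records, last_training_year=2018)
```

Unlike a random split, no test observation precedes any training observation, so the test error reflects forecasting on genuinely unseen future data.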

Performance Evaluation

Model performance was evaluated using statistical metrics that compare the model predictions with the actual observations: root mean square error (RMSE), coefficient of determination (R2), and mean absolute error (MAE). The RMSE describes the standard deviation of the unexplained variance and takes the unit of the response variable (Chai and Draxler, 2014). The lower the RMSE, the more accurately the model predicts the observed data, indicating a better fit. The coefficient of determination, R2, measures the proportion of variance in the dependent variable that is predictable from the independent variables. Therefore, a higher R2 value signifies a better fit of the model to the data, as it explains a greater amount of variance (Chicco et al., 2021). The MAE measures the average absolute difference between the predicted and observed values (Oswalt Manoj et al., 2022). Low MAE and RMSE values indicate that the predictions are close to the observed values, while high values indicate the opposite.
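The three metrics can be computed with their standard definitions, sketched below. The function names and the observed and predicted values are hypothetical and serve only to show the arithmetic.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the unit of the response variable."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def mae(observed, predicted):
    """Mean absolute error, in the unit of the response variable."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical chlorophyll-a values in µgL-1.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 17.0]
example_rmse = rmse(observed, predicted)
example_mae = mae(observed, predicted)
example_r2 = r_squared(observed, predicted)
```

For these values the MAE is 0.75 µgL-1, the RMSE is about 0.87 µgL-1, and R2 is 0.85. The RMSE exceeds the MAE because squaring weights larger errors more heavily, and R2 is unitless.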

Results and Discussion

For all the models developed, the mean of the observed chlorophyll-a ranged from 11.85 µgL-1 to 11.87 µgL-1, with the standard deviation ranging between 5.01 µgL-1 and 5.03 µgL-1. The mean of predicted chlorophyll-a also ranged from 11.74 µgL-1 to 11.94 µgL-1, and the standard deviation ranged between 4.17 µgL-1 and 4.80 µgL-1. The performance of the models is further shown in table S1. According to the observed R2 in the table, the RT could train the model such that 89% of the variance was explained during the training stage. This performance decreased to about 72% when validated with the testing dataset, as shown in figure 2. The same was observed with the RMSE and MAE between the observed and predicted chlorophyll-a values.

The performance, however, became better when more regression trees were introduced. This is evident in the RF model trained with different trees. The RF model explained 96% of the variation between observed and predicted values. This performance improvement can be related to efficient training using random samples of the training sets by the various regression trees making up the RF. Adding more trees enables the random selection of different ranges of the training set, allowing the model to learn various patterns (Yajima and Derot, 2018). The model's performance decreased when the test dataset was used to validate the models, as shown in figure 3. Increasing the training set to 80% of the dataset allows the model to learn more efficiently. This is evident in the R2, RMSE, and MAE of the RF model, as shown in table S1. The RF model with the 80% training set could explain 99% of the variation in chlorophyll-a prediction during the training stage and about 87% when validated with the testing datasets. Chlorophyll-a predictions with both tree-based algorithms provide good HABs detection and monitoring potential. Both models captured high peaks of chlorophyll-a, which are often indicators of algal biomass. While the RF model showed increased chlorophyll-a prediction performance, the RT model represents simple decision rules that can easily be understood and applicable for finding input parameters' relative importance while requiring less runtime (Zhu et al., 2023).

Figure 2. Prediction of chlorophyll-a concentration using RT model with 70% (left) and 80% (right) data splitting.
Figure 3. Prediction of chlorophyll-a concentration using RF model with 70% (left) and 80% (right) data splitting.

The MLP model's performance changed slightly as the training dataset increased. The MLP model trained with 70% of the whole dataset explained 74% of the variation between predicted and observed chlorophyll-a concentrations in the training set, compared with the 77% explained by the model trained with 80% of the dataset. In the testing stage, the explained variance decreased to 73% and 76%, respectively, as shown in figure 4. The SVR model performance was slightly lower than that of the RT, RF, and MLP models. The R2, RMSE, and MAE values were 0.75, 6.64 µgL-1, and 1.49 µgL-1, respectively, for the training period. SVR model performance did not change when the training set was increased to 80% of the whole dataset, with 72% of the variance explained in the testing stage, as shown in figure 5. This suggests that improving SVR performance requires considerably more data, which are necessary to obtain an efficient hyperplane that defines the boundary for effective learning. MLP and SVR models are considered black box models whose internal mechanisms are difficult to understand (Martínez-Comesaña et al., 2020; Stollfuss and Bacher, 2022), although research into model explainability is gaining attention (Molnar et al., 2020). The reduced performance observed in the MLP and SVR models could be related to the complexity of their model parameters. The SVR performance contradicts the findings of Mamun et al. (2020), where the SVR model performed better than other models. That better performance was attributed to the ability of SVR to capture nonlinear relationships better than other models, such as ANN and multilinear regression. The reduced performance of these models could be related to their complex structure, which makes it difficult to obtain stable parameters using the chosen optimization methods (Tharwat and Gabel, 2020). The improved chlorophyll-a prediction observed in the RF model agrees with the studies of Huang et al. (2022), Kim and Ahn (2022), and Li et al. (2018). These studies showed the RF model's ability to surpass other models, such as ANN, SVR, and generalized linear regressions, in chlorophyll-a prediction.

The LSTM and GRU models for predicting the sequential chlorophyll-a concentration were compared using the RMSE and MAE, as shown in table 2. The evaluation metrics for this section were selected based on the recommendation of Hewamalage et al. (2023), and R2 was omitted because it can be misleading when evaluating time-dependent models. The data were split with explicit consideration of the data sequence: the models were trained on the 2014-2018 data and validated on the 2019-2021 data. The models used a prediction horizon of 96 time steps, corresponding to one day of previously observed input parameters at the 15 min sampling interval (96 × 15 min = 24 h), for multistep prediction of chlorophyll-a concentration.
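One way to turn a 15 min series into model inputs made of one day of previous observations is a sliding window, sketched below. The source does not describe how the windows or targets were constructed, so the function name, the single-step target, and the toy series are assumptions.

```python
def make_windows(inputs, target, lookback):
    """Pair each block of `lookback` consecutive input rows with the target that follows it."""
    X, y = [], []
    for end in range(lookback, len(inputs)):
        X.append(inputs[end - lookback:end])
        y.append(target[end])
    return X, y


# One day of 15 min observations.
STEPS_PER_DAY = 24 * 60 // 15

# Hypothetical series of 100 time steps with a single input variable.
toy_inputs = [[float(t)] for t in range(100)]
toy_target = [float(t) * 0.1 for t in range(100)]
X_windows, y_windows = make_windows(toy_inputs, toy_target, STEPS_PER_DAY)
```

With 100 time steps and a 96-step window, only four input and target pairs result, which illustrates why long, continuous records are needed to train sequence models.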

Figure 4. Prediction of chlorophyll-a concentration using MLP model with 70% (left) and 80% (right) data splitting.
Figure 5. Prediction of chlorophyll-a concentration using SVR model.
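Table 2. Prediction performance for models considering data sequence.
| Model | Learning Rate | Metric | Training | Testing |
|---|---|---|---|---|
| LSTM | 0.01 | RMSE (µgL-1) | 4.15 | 5.23 |
| LSTM | 0.01 | MAE (µgL-1) | 3.30 | 3.27 |
| LSTM | 0.001 | RMSE (µgL-1) | 1.63 | 3.29 |
| LSTM | 0.001 | MAE (µgL-1) | 1.24 | 2.39 |
| GRU | 0.01 | RMSE (µgL-1) | 5.64 | 4.09 |
| GRU | 0.01 | MAE (µgL-1) | 3.73 | 3.36 |
| GRU | 0.001 | RMSE (µgL-1) | 3.45 | 3.73 |
| GRU | 0.001 | MAE (µgL-1) | 1.98 | 2.63 |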

Two learning rates (LR) were tested to determine the efficient hyperparameter. LR determines how the model changes due to errors estimated as the model weights are learned. For the GRU and LSTM models, the 0.001 LR performed better than the 0.01 models. However, when the GRU and LSTM models were compared using the RMSE as the error metric, the LSTM performed better. This is unsurprising because the three gates in the LSTM model control information flow compared to the two gates in the GRU model, although the former takes more run time, as shown in table 3.
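The link between gate count and computational cost can be made concrete by counting parameters in one recurrent layer. In the textbook formulations, an LSTM layer has four weight blocks (three gates plus the candidate cell update) and a GRU layer has three (two gates plus the candidate hidden state). The sketch below assumes seven inputs and 64 units for the first layer and uses a hypothetical function name; some library implementations add extra bias terms, so exact counts may differ.

```python
def recurrent_parameter_count(n_inputs, n_units, n_blocks):
    """Each block has input weights, recurrent weights, and one bias vector."""
    return n_blocks * (n_units * (n_inputs + n_units) + n_units)


# First recurrent layer: 7 water quality inputs, 64 units.
lstm_params = recurrent_parameter_count(7, 64, n_blocks=4)
gru_params = recurrent_parameter_count(7, 64, n_blocks=3)
```

Under these assumptions the LSTM layer has 18432 parameters and the GRU layer 13824, a ratio of 4:3, which is consistent with the GRU needing less training time.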

Table 3. Training time for each optimized model in seconds (s).
| Model | Training Runtime |
|---|---|
| RT | 0.75 s |
| RF | 17 s |
| SVR | 18306.0 s |
| MLP | 437.0 s |
| LSTM | 42900.0 s |
| GRU | 32820.0 s |

Both LSTM and GRU models are useful for making time series predictions and creating early warning systems for HABs monitoring. The complex structure of the LSTM captures nonlinear chlorophyll-a dynamics and trends in different algal parameters based on their relationship with other water quality parameters. HABs monitoring with these time-dependent models will assist in the real-time monitoring of HABs and in observing lag time effects of water quality parameters on algal growth. LSTM prediction is enabled by using lagged parameters as input variables, which is instrumental when dealing with water quality parameters that exhibit autocorrelation. While both models are suitable for time series prediction, GRU may struggle with long-term dependencies compared to LSTM models. This limits the number of previous timesteps of water quality observations that a GRU can use for making multistep predictions. However, GRU models are simpler and more computationally efficient than LSTM, resulting in faster training and inference. The simpler structure of GRU compared to LSTM aids generalization during training, and GRU could perform better on smaller datasets (Yang et al., 2020). Contrary to the result of this study showing LSTM as the better of the time-dependent models, GRU performed better in the study of Xiao et al. (2022). The lower LSTM performance in their study could be because the LSTM model was not as well optimized as the GRU model, which may have biased the comparison.

The early prediction of chlorophyll-a as an index of HABs is vital for protecting freshwater systems, especially in lentic ecosystems. Observing the trend of chlorophyll-a predictions, especially with time-dependent models like LSTM and GRU, favors understanding HABs' diurnal and seasonal variations. This could be linked to other variable selection algorithms for the derivation of drivers of HABs at different periods, which is essential for creating early warning systems and implementing management strategies. The result of this study provides important factors to be considered when selecting ML models for HABs monitoring. The choice of model is highly dependent on the project's objective and the nature of available data. While RF models show good chlorophyll-a prediction prowess, they are unsuitable for making time-dependent chlorophyll-a predictions. Models such as LSTM and GRU can learn long-term dependency in data, making them suitable for chlorophyll-a time series multistep ahead predictions. Predicting chlorophyll-a concentration using water quality datasets with irregular temporal patterns can be efficiently performed using RF models, which are often trained using random datasets. The identified limitations and strengths of the models will assist in decision-making about the kind of ML models to adopt for tasks such as identifying HABs drivers and predicting HABs occurrence at different time scales.

Conclusion

The early detection of HABs is essential for ecosystem protection and the prevention of public health issues. ML models are decision support tools for finding patterns in water quality datasets and making predictions based on these patterns. Models such as regression trees are simple and convenient, although their performance could be limited. The RF model had the highest R2 (0.87) and the lowest RMSE and MAE (3.53 µgL-1 and 0.97 µgL-1, respectively), indicating good chlorophyll-a prediction ability when the temporal sequence of the data is not considered. SVR and MLP were also shown to have reduced prediction performance compared to the RT and RF models. The reduced performance was associated with complexity in SVR and MLP and difficulty in finding optimum model parameters.

Time-dependent models like LSTM and GRU showed the potential to consider the temporal dynamics of water quality variables and their influence on HABs. Improved performance was observed in LSTM with RMSE and MAE of 3.29 µgL-1 and 2.39 µgL-1, respectively, indicating better performance than the GRU for making time-dependent chlorophyll-a prediction for HABs monitoring. Ultimately, ML models are crucial for observing algal biomass and hold high potential for real-time monitoring. Future studies can focus on incorporating the impacts of input dataset errors on model predictions. Multiparameter water quality sensors used for data acquisition are often affected by sensor drifting, which compromises data quality and influences model predictions. Efficient decision-making using ML models will hugely benefit from acknowledging and incorporating the measurement uncertainties. These ML models can be further expanded to understand the triggers of HABs-related toxins to develop strategies for controlling and mitigating their impacts on the environment, human health, and the ecosystem.

Supplemental Material

The supplemental materials mentioned in this article are available for download from the ASABE Figshare repository at: https://doi.org/10.13031/25114571

Acknowledgments

The authors sincerely thank the Greenville County Department of Land Development, Stormwater Division, for providing the dataset utilized for this study. The authors also thank Clemson Computing and Information Technology (CCIT) for the cyberinfrastructure resources and advanced research computing capabilities provided through its Cyber Infrastructure Technology Integration (CITI) group. The authors acknowledge the Southern Sustainable Agriculture Research and Education (S-SARE) program under subaward number LS21-2595 and the S.C. Sea Grant, NOAA Federal Award No. NA22OAR4170114, for supporting parts of this project. USDA-ARS is an equal-opportunity employer and provider.

References

Bianchi, F. M., Maiorino, E., Kampffmeyer, M. C., Rizzi, A., & Jenssen, R. (2017). Recurrent neural networks for short-term load forecasting - An overview and comparative analysis. Cham: Springer. https://doi.org/10.1007/978-3-319-70338-1

Breiman, L. (2001). Random forests. Mach. Learn., 45(1), 5-32. https://doi.org/10.1023/A:1010933404324

Bresciani, M., Adamo, M., De Carolis, G., Matta, E., Pasquariello, G., Vaiciute, D., & Giardino, C. (2014). Monitoring blooms and surface accumulation of cyanobacteria in the Curonian Lagoon by combining MERIS and ASAR data. Remote Sens. Environ., 146, 124-135. https://doi.org/10.1016/j.rse.2013.07.040

Busari, I., Sahoo, D., Harmel, R. D., & Haggard, B. E. (2023b). A review of machine learning models for harmful algal bloom monitoring in freshwater systems. J. Nat. Resour. Agric. Ecosyst., 1(2), 63-76. https://doi.org/10.13031/jnrae.15647

Busari, I., Sahoo, D., Jana, R., & Privette, C. (2023a). Chlorophyll a predictions in a Piedmont Lake in Upstate South Carolina using machine-learning approaches. J. S. C. Water Resour., 9(1), 1-14.

Carmichael, W. W., & Boyer, G. L. (2016). Health impacts from cyanobacteria harmful algae blooms: Implications for the North American Great Lakes. Harmful Algae, 54, 194-212. https://doi.org/10.1016/j.hal.2016.02.002

Chai, T., & Draxler, R. R. (2014). Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geosci. Model Dev., 7(3), 1247-1250. https://doi.org/10.5194/gmd-7-1247-2014

Chicco, D., Warrens, M. J., & Jurman, G. (2021). The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci., 7, e623. https://doi.org/10.7717/peerj-cs.623

Cho, H., Choi, U.-J., & Park, H. (2018). Deep learning application to time-series prediction of daily chlorophyll-a concentration. WIT Trans. Ecol. Environ, 215, 157-163. https://doi.org/10.2495/EID180141

Christa, C., Ferreira, J., Andrew, R., Xiaohui, Q., & Bijeta, S. (2021). Quantifying the socio-economic impacts of harmful algal blooms in Southwest Florida in 2018. SSRN Electron. J., 2.

Christensen, V. G., & Khan, E. (2020). Freshwater neurotoxins and concerns for human, animal, and ecosystem health: A review of anatoxin-a and saxitoxin. Sci. Total Environ., 736, 139515. https://doi.org/10.1016/j.scitotenv.2020.139515

Costa, C. M., Leite, I. R., Almeida, A. K., & de Almeida, I. K. (2021). Choosing an appropriate water quality model — a review. Environ. Monit. Assess., 193(1), 38. https://doi.org/10.1007/s10661-020-08786-1

DHEC. (2008a). Development of a comprehensive watershed water quality model for the Reedy River Phase I - Existing data review.

DHEC. (2008b). Development of a comprehensive watershed water quality model for the Reedy River Phase III - Model calibration/validation.

Genuer, R., Poggi, J.-M., & Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognit. Lett., 31(14), 2225-2236. https://doi.org/10.1016/j.patrec.2010.03.014

Giere, J., Riley, D., Nowling, R. J., McComack, J., & Sander, H. (2020). An investigation on machine-learning models for the prediction of cyanobacteria growth. Fundam. Appl. Limnol., 192(2), 85-94. https://doi.org/10.1127/fal/2020/1306

Gobler, C. J. (2020). Climate change and harmful algal blooms: Insights and perspective. Harmful Algae, 91, 101731. https://doi.org/10.1016/j.hal.2019.101731

Grattan, L. M., Holobaugh, S., & Morris, J. G. (2016). Harmful algal blooms and public health. Harmful Algae, 57, 2-8. https://doi.org/10.1016/j.hal.2016.05.003

Greer, B., McNamee, S. E., Boots, B., Cimarelli, L., Guillebault, D., Helmi, K.,... Campbell, K. (2016). A validated UPLC-MS/MS method for the surveillance of ten aquatic biotoxins in European brackish and freshwater systems. Harmful Algae, 55, 31-40. https://doi.org/10.1016/j.hal.2016.01.006

Hartshorn, N., Marimon, Z., Xuan, Z., Cormier, J., Chang, N. B., & Wanielista, M. (2016). Complex interactions among nutrients, chlorophyll-a, and microcystins in three stormwater wet detention basins with floating treatment wetlands. Chemosphere, 144, 408-419. https://doi.org/10.1016/j.chemosphere.2015.08.023

Hewamalage, H., Ackermann, K., & Bergmeir, C. (2023). Forecast evaluation for data scientists: Common pitfalls and best practices. Data Min. Knowl. Discov., 37(2), 788-832. https://doi.org/10.1007/s10618-022-00894-5

Hollister, J. W., & Kreakie, B. J. (2016). Associations between chlorophyll a and various microcystin health advisory concentrations. F1000Research, 5(151). https://doi.org/10.12688/f1000research.7955.2

Huang, H., Wang, W., Lv, J., Liu, Q., Liu, X., Xie, S.,... Feng, J. (2022). Relationship between chlorophyll a and environmental factors in lakes based on the random forest algorithm. Water, 14(19), 3128. https://doi.org/10.3390/w14193128

Humbert, J.-F., & Fastner, J. (2016). Ecology of cyanobacteria. In J. Meriluoto, L. Spoof, & G. A. Codd (Eds.), Handbook of cyanobacterial monitoring and cyanotoxin analysis (pp. 9-18). Wiley. https://doi.org/10.1002/9781119068761.ch2

Janssen, A. B., Janse, J. H., Beusen, A. H., Chang, M., Harrison, J. A., Huttunen, I.,... Mooij, W. M. (2019). How to model algal blooms in any lake on earth. Curr. Opin. Environ. Sustain., 36, 1-10. https://doi.org/10.1016/j.cosust.2018.09.001

Johnson, J. E., Laparra, V., Pérez-Suay, A., Mahecha, M. D., & Camps-Valls, G. (2020). Kernel methods and their derivatives: Concept and perspectives for the earth system sciences. PLoS One, 15(10), e0235885. https://doi.org/10.1371/journal.pone.0235885

Katin, A., Giudice, D. D., Hall, N. S., Paerl, H. W., & Obenour, D. R. (2021). Simulating algal dynamics within a Bayesian framework to evaluate controls on estuary productivity. Ecol. Modell., 447, 109497. https://doi.org/10.1016/j.ecolmodel.2021.109497

Kim, K.-M., & Ahn, J.-H. (2022). Machine learning predictions of chlorophyll-a in the Han River basin, Korea. J. Environ. Manag., 318, 115636. https://doi.org/10.1016/j.jenvman.2022.115636

Lekki, J., Deutsch, E., Sayers, M., Bosse, K., Anderson, R., Tokars, R., & Sawtell, R. (2019). Determining remote sensing spatial resolution requirements for the monitoring of harmful algal blooms in the Great Lakes. J. Great Lakes Res., 45(3), 434-443. https://doi.org/10.1016/j.jglr.2019.03.014

Li, X., Sha, J., & Wang, Z.-L. (2018). Application of feature selection and regression models for chlorophyll-a prediction in a shallow lake. Environ. Sci. Pollut. Res., 25(20), 19488-19498. https://doi.org/10.1007/s11356-018-2147-3

Lombard, F., Boss, E., Waite, A. M., Vogt, M., Uitz, J., Stemmann, L.,... Appeltans, W. (2019). Globally consistent quantitative observations of planktonic ecosystems. Front. Mar. Sci., 6. https://doi.org/10.3389/fmars.2019.00196

Mamun, M., Kim, J.-J., Alam, M. A., & An, K.-G. (2020). Prediction of algal chlorophyll-a and water clarity in monsoon-region reservoir using machine learning approaches. Water, 12(1), 30. https://doi.org/10.3390/w12010030

Marlian, N., Damar, A., & Effendi, H. (2015). The horizontal distribution clorophyll-a fitoplankton as indicator of the tropic state in waters of Meulaboh Bay, West Aceh. Jurnal Ilmu Pertanian Indonesia, 20(3), 272-279. https://doi.org/10.18343/jipi.20.3.272

Martínez-Comesaña, M., Febrero-Garrido, L., Granada-Álvarez, E., Martínez-Torres, J., & Martínez-Mariño, S. (2020). Heat loss coefficient estimation applied to existing buildings through machine learning models. Appl. Sci., 10(24), 8968. https://doi.org/10.3390/app10248968

Maslukah, L., Setiawan, R. Y., Nurdin, N., Helmi, M., & Widiaratih, R. (2022). Phytoplankton chlorophyll-a biomass and the relationship with water quality in Barrang Caddi, Spermonde, Indonesia. Ecol. Eng. Environ. Technol., 23(1), 25-33. https://doi.org/10.12912/27197050/143064

Mbuh, M. J., Mbih, R., & Wendi, C. (2019). Water quality modeling and sensitivity analysis using Water Quality Analysis Simulation Program (WASP) in the Shenandoah River watershed. Phys. Geogr., 40(2), 127-148. https://doi.org/10.1080/02723646.2018.1507339

Molnar, C., Casalicchio, G., & Bischl, B. (2020). Interpretable machine learning – a brief history, state-of-the-art and challenges. In I. Koprinska, M. Kamp, A. Appice, C. Loglisci, L. Antonie, A. Zimmermann,... J. A. Gulla (Eds.), ECML PKDD 2020 Workshops. ECML PKDD 2020. Communications in Computer and Information Science (Vol. 1323, pp. 417-431). Cham: Springer. https://doi.org/10.1007/978-3-030-65965-3_28

Nguyen, Q. H., Ly, H.-B., Ho, L. S., Al-Ansari, N., Le, H. V., Tran, V. Q.,... Pham, B. T. (2021). Influence of data splitting on performance of machine learning models in prediction of shear strength of soil. Math. Probl. Eng., 2021, 4832864. https://doi.org/10.1155/2021/4832864

Oswalt Manoj, S., Ananth, J. P., Rohini, M., Dhanka, B., Pooranam, N., & Ram Arumugam, S. (2022). 17 - FWS-DL: Forecasting wind speed based on deep learning algorithms. In A. K. Dubey, S. K. Narang, A. L. Srivastav, A. Kumar, & V. García-Díaz (Eds.), Artificial intelligence for renewable energy systems (pp. 353-374). Woodhead Publ. https://doi.org/10.1016/B978-0-323-90396-7.00007-9

Ponjavic, A. B., Kostic, D., Marjanovic, P., Trbojevic, I., Popovic, S., Predojevic, D., & Simic, G. S. (2019). Bloom of the potentially toxic cyanobacterium P. rubescens: Seasonal distribution and possible drivers of its proliferation in the Vrutci reservoir (Serbia). Oceanol. Hydrobiol. Stud., 48(4), 316-327. https://doi.org/10.2478/ohs-2019-0029

Privette, C. V., & Smink, J. (2017). Assessing the potential impacts of WWTP effluent reductions within the Reedy River watershed. Ecol. Eng., 98, 11-16. https://doi.org/10.1016/j.ecoleng.2016.10.058

Reichwaldt, E. S., & Ghadouani, A. (2012). Effects of rainfall patterns on toxic cyanobacterial blooms in a changing climate: Between simplistic scenarios and complex dynamics. Water Res., 46(5), 1372-1393. https://doi.org/10.1016/j.watres.2011.11.052

Rousso, B. Z., Bertone, E., Stewart, R., & Hamilton, D. P. (2020). A systematic literature review of forecasting and predictive models for cyanobacteria blooms in freshwater lakes. Water Res., 182, 115959. https://doi.org/10.1016/j.watres.2020.115959

Shin, Y., Kim, T., Hong, S., Lee, S., Lee, E., Hong, S.,... Heo, T.-Y. (2020). Prediction of chlorophyll-a concentrations in the Nakdong River using machine learning methods. Water, 12(6), 1822. https://doi.org/10.3390/w12061822

Stollfuss, B., & Bacher, M. (2022). MLP-supported mathematical optimization of simulation models: Investigation into the approximation of black box functions of any simulation model with MLPs with the aim of functional analysis. Proc. 3rd Int. Conf. on Innovative Intelligent Industrial Production and Logistics IN4PL, 1, pp. 107-114.

Tharwat, A., & Gabel, T. (2020). Parameters optimization of support vector machines for imbalanced data using social ski driver algorithm. Neural Comput. Appl., 32(11), 6925-6938. https://doi.org/10.1007/s00521-019-04159-z

Xiao, S., Jian-feng, L., Fang-fang, W. A., Xuan, Y. U., Shi, X., Lu-yao, H. A.,... Idris, I. (2022). Research on red tide short-time prediction using GRU network model based on multi-feature factors — A case in Xiamen sea area. Mar. Environ. Res., 182, 105727. https://doi.org/10.1016/j.marenvres.2022.105727

Yajima, H., & Derot, J. (2017). Application of the Random Forest model for chlorophyll-a forecasts in fresh and brackish water bodies in Japan, using multivariate long-term databases. J. Hydroinf., 20(1), 206-220. https://doi.org/10.2166/hydro.2017.010

Yang, S., Yu, X., & Zhou, Y. (2020). LSTM and GRU neural network performance comparison study: Taking Yelp review dataset as an example. Proc. 2020 International Workshop on Electronic Communication and Artificial Intelligence (IWECAI) (pp. 98-101). https://doi.org/10.1109/IWECAI50956.2020.00027

Zhang, W., & Rao, Y. R. (2012). Application of a eutrophication model for assessing water quality in Lake Winnipeg. J. Great Lakes Res., 38, 158-173. https://doi.org/10.1016/j.jglr.2011.01.003

Zhu, J.-J., Yang, M., & Ren, Z. J. (2023). Machine learning in environmental research: Common pitfalls and best practices. Environ. Sci. Technol., 57(46), 17671-17689. https://doi.org/10.1021/acs.est.3c00026