A Review of Machine Learning Models for Harmful Algal Bloom Monitoring in Freshwater Systems

Ibrahim Busari1, Debabrata Sahoo1,*, R. Daren Harmel2, Brian E. Haggard3


Published in Journal of Natural Resources and Agricultural Ecosystems 1(2): 63-76 (doi: 10.13031/jnrae.15647). 2023 American Society of Agricultural and Biological Engineers.


1College of Agricultural, Forest, and Life Sciences, Clemson University, Pendleton, South Carolina, USA.

2USDA ARS, Fort Collins, Colorado, USA.

3Arkansas Water Resources Center, University of Arkansas, Fayetteville, Arkansas, USA.

*Correspondence: dsahoo@clemson.edu

The authors have paid for open access for this article. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License https://creativecommons.org/licenses/by-nc-nd/4.0/

Submitted for review on 24 April 2023 as manuscript number NRES 15647; approved for publication as a Review Article and as part of the “Digital Water: Computing Tools, Technologies, and Trends” Collection by Associate Editor Dr. Sushant Mehan and Community Editor Dr. Kati Migliaccio of the Natural Resources & Environmental Systems Community of ASABE on 9 August 2023.

Mention of company or trade names is for description only and does not imply endorsement by the USDA. The USDA-ARS is an equal opportunity provider and employer.

Highlights

Abstract. Harmful algal blooms (HABs) are detrimental to livestock, humans, pets, the environment, and the global economy, which calls for a robust approach to their management. While process-based models can inform practitioners about HAB enabling conditions, they have inherent limitations in accurately predicting harmful algal blooms. To address these limitations, Machine Learning (ML) models can potentially leverage large volumes of IoT data to aid in near real-time predictions. ML models have evolved as efficient tools for understanding patterns and relationships between water quality parameters and HAB expansion. This review describes ML models currently used for predicting and forecasting HABs in freshwater ecosystems and presents model structures and their application for predicting algal parameters and related toxins. The review revealed that regression trees, random forest, Artificial Neural Network (ANN), Support Vector Regression (SVR), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) are the most frequently used models for HABs monitoring. This review shows ML models' prowess in identifying significant variables influencing algal growth, HAB drivers, and multistep HAB prediction. Hybrid models also improve the prediction of algal-related parameters through improved optimization techniques and variable selection algorithms. While ML models often focus on algal biomass prediction, few studies apply ML models for toxin monitoring and prediction. This limitation can be associated with a lack of high-frequency toxin datasets for model development, and exploring this domain is encouraged. This review serves as a guide for policymakers and researchers to implement ML models for HAB prediction and reveals the potential of ML models for decision support and early prediction for HAB management.

Keywords. Cyanobacteria, Freshwater, Harmful algal blooms, Machine learning, Water quality.

Freshwater bodies such as lakes and ponds (e.g., irrigation ponds, aquaculture ponds, livestock ponds, agriculture ponds, and stormwater ponds) are essential resources worldwide due to the variety of ecosystem services they deliver (Brönmark and Hansson, 2002). However, freshwater quality is threatened by climate change and anthropogenic activities, such as changes in land use and land cover, often associated with the global increase in population (McGrane, 2016). Increased nutrient loadings from agricultural and industrial activities are major threats to freshwater quality (Haggard et al., 2023a; Pastor and Hernández, 2012) and can lead to detrimental effects on human health, livestock, ecosystems, and the economy and threaten their sustainability (Tirgar et al., 2020). Harmful algal blooms (HABs) are significant issues affecting lakes and ponds across the globe owing to their large biomass and tendency to produce toxins such as microcystins and cylindrospermopsin (Carmichael and Boyer, 2016; Grattan et al., 2016). HAB impacts are detrimental to the ecology of lakes and ponds worldwide, with direct effects on the economy through increased cost of drinking water treatment, loss of lakefront property, loss of recreational value, and biodiversity loss (Christa et al., 2021; Dodds et al., 2009) and significant impacts on human health (Kouakou and Poder, 2019), which necessitates their effective monitoring and prediction. HABs result from an exponential increase in algal biomass under enabling conditions such as warm temperatures, light availability, and nutrient influx, especially nitrogen and phosphorus, which are essential for algal growth (Anderson et al., 2002). HABs are characterized by foul odor and taste and poor aesthetics, but HAB-related toxins can also cause skin, respiratory, and gastrointestinal issues in exposed animals and humans, which could lead to mortality during high exposure rates (Osswald et al., 2007; van der Merwe, 2015).

Monitoring of HABs entails water quality sampling and laboratory analysis to quantify water quality parameters and algal characteristics, including associated toxins, and continuous sampling is often needed to identify rapid changes in algal-related parameters and to capture complex variabilities in HAB dynamics. The laboratory analysis requires financial support, time, and analytical expertise, and few taxonomists are available for accurate algal and cyanobacterial species identification. Remote sensing methods can also monitor HAB expansion by interpreting and analyzing satellite images using various algorithms (Copado-Rivera et al., 2020; Oyama et al., 2015; Van der Wal et al., 2014). Satellite images can, however, be subject to cloud interference, affecting the accuracy of HAB interpretation. Process-based models are also essential for understanding the underlying physical and biogeochemical conditions affecting algae and their expansion (Scavia et al., 2021). These models use mathematical expressions to quantify ecosystem processes and understand the relationship between algal biomass and environmental variables such as pH, temperature, turbidity, conductivity, dissolved oxygen, and chlorophyll-a (Kim et al., 2019). Different scenarios relating to HABs can be predicted and simulated using process-based models. The growth and persistence of toxic algae can be studied by changing the model inputs, such as nutrient inputs or environmental variables. However, process-based models require intense calibration to obtain accurate results, can carry high uncertainty in model outputs because of rigorous parameter estimation, and are often unable to make real-time forecasts of algal dynamics (Lee et al., 2003; Zhang and Rao, 2012).

The imminent threats of climate change through irregular precipitation patterns and increasing temperatures could increase the occurrence of HABs (Gobler, 2020). While runoff from precipitation events carries nutrients from inland anthropogenic processes, increased temperature triggers biogeochemical reactions in waterbodies and could favor the rapid proliferation of algae. Effectively managing these waterbodies entails understanding how changing climatic patterns will impact water quality and HAB expansion.

Technological breakthroughs in sensors and the adoption of the Internet of Things (IoT) favor the swift acquisition of high-frequency data that may prove helpful in understanding complex interactions between water quality parameters and HAB expansion (Kwon et al., 2022). High-volume data from IoT-enabled sensors favor the development of Artificial Intelligence (A.I.) models to explore complex relationships between various water quality parameters. Machine learning (ML) is an A.I. field that has rapidly gained popularity due to improved computing and statistical prowess. ML models involve learning patterns in data that can inform decision-making based on predictions when new data are introduced (Wang et al., 2009). ML models have proved useful in various fields, such as medicine (Peng et al., 2021; Richens et al., 2020), engineering (Bevilacqua et al., 2010; Curiel-Ramirez et al., 2019; Elelu et al., 2023), finance (Emmanoulopoulos and Dimoska, 2022; Kumar et al., 2021), hydrology (Kratzert et al., 2018), and environment (Hubert and Padovese, 2019; Radford et al., 2016). ML models are vital for learning patterns in ecosystem and climatic data and for using these patterns in the early detection of HABs. Applying ML models to HAB monitoring could trigger the swift implementation of management decisions. A range of ML models has been applied to understand HAB dynamics through the prediction of algal-related parameters such as chlorophyll-a, phycocyanin, algal cells, dissolved oxygen, nutrient concentrations, and toxin levels (Cao et al., 2022; Giere et al., 2020; Haggard et al., 2023b; Millie et al., 2014). These models, upon validation, could be incorporated into real-time HAB observation and early warning systems for the early dissemination of regulatory advisories for the protection of freshwater bodies and the prevention of public health issues.

However, limited information exists on the criteria for selecting ML models for HAB prediction. There is insufficient information on data processing strategies for different sampling approaches when developing ML models. The study was conducted in two parts. In Part I, the current study extensively reviews the literature that has adopted ML models for HAB predictions and provides a perspective on the status of ML in HAB prediction and the prospects of the techniques in managing HABs. It provides insight into limitations in the field that future applications can focus on to enable the protection of freshwater systems from all HAB-related impacts. The review is structured such that Section I introduces the concept, Section II focuses on article keyword analysis, Section III focuses on ML models and their features, and Section IV focuses on some applications of ML for HAB monitoring, such as finding drivers and environmental stresses of HABs, model optimization, and time series prediction. Section V discusses some measures to consider when selecting ML models for HAB prediction, Section VI reveals research needs and future prospects of HAB detection with ML models, and Section VII summarizes the findings from this study. In Part II, an example dataset was used to assess and quantify the efficacy of various ML models in predicting HABs based on relationships between chlorophyll-a concentration and various water quality parameters using high-frequency data.

Article Keyword Analysis

The articles included in this review were selected from three databases (Web of Science, Google Scholar, and Scopus) using the following keywords: harmful algal blooms, cyanotoxins, machine learning models, cyanobacteria, deep learning models, algal blooms, and HABs. The Boolean operators "OR" and "AND" were used to form various combinations in advanced searches. Articles applying ML models for algal bloom prediction that were included tended to focus on the last five years; select earlier publications were also included to describe inherent features of ML models.

Machine Learning Models and Their Features

Machine learning (ML) models have found practical applications in detecting and predicting HABs. These models include simple and ensemble regression trees, neural networks, clustering techniques, and hybrid models. Typical ML model structures are shown in table 1. Simple regression trees such as Classification and Regression Trees (CARTs) involve splitting large water quality datasets (the whole training set) into subgroups by applying binary division defined by one of the independent variables, which are often ecosystem and environmental variables influencing HABs in various aquatic systems. The selected variable forms the tree's root node, and the data are split recursively based on different values of the variables. The trees are constructed until a particular stopping criterion is reached, such as a maximum tree depth or when a certain level of the algal parameter used as an index is attained homogeneously within the subsets. The maximum tree depth refers to the maximum number of splits between the root and leaf nodes in the constructed tree, often determined by estimating the node impurity measure (Jena and Dehuri, 2020).
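The binary division step can be illustrated with a minimal sketch. It assumes a single predictor, uses the sum of squared errors as the node impurity measure, and runs on made-up temperature and chlorophyll-a values; the function names are hypothetical and are not taken from any of the reviewed studies.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors around the mean (a regression node impurity)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, impurity) for the split x <= t with the lowest total SSE."""
    best_t, best_imp = None, float("inf")
    xs = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        imp = sse(left) + sse(right)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp

# Hypothetical data: water temperature (deg C) and chlorophyll-a (ug/L).
temperature = [12.0, 14.0, 15.0, 22.0, 24.0, 26.0]
chlorophyll = [3.0, 4.0, 3.5, 18.0, 20.0, 19.0]
threshold, impurity = best_split(temperature, chlorophyll)
print(threshold, impurity)
```

A full tree would apply `best_split` recursively to each subgroup until a stopping criterion such as maximum depth is met.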

The main advantage of decision trees is the clarity in their structure and ease of understanding, which can be beneficial for observing management thresholds (Haggard et al., 2023b). These models are easy to use and do not require initialization, such as predefining the water quality data distribution(s). Regression trees can handle large, high-frequency datasets and are often scalable (Talekar and Agrawal, 2020). However, their major disadvantage is the reduction in performance when complex interactions occur between the features. Regression tree performance is also affected by data fragmentation during the training process (Talekar and Agrawal, 2020), and the trees are often prone to overfitting, although effective pruning can minimize overfitting.

The random forest model is an ensemble tree-based algorithm developed by Breiman (2001). The model combines several regression trees, each developed from a randomly selected sample of the entire water quality dataset. The final output of the random forest model is obtained by averaging the predictions of the individual trees that make up the model (Biau and Scornet, 2016). The number of trees that comprise the forest and the maximum depth of each tree are essential parameters to estimate when applying the random forest model. The main advantage of this model is the reduced risk of overfitting, because averaging many trees makes the ensemble more robust than any single tree. The random forest technique makes it easy to determine the importance of individual environmental variables to the model output. This algorithm has a feature selection module that could enable the selection of ecosystem and environmental variables that highly influence algal growth and the increase in algal biomass.
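The averaging idea behind the random forest can be sketched as follows. This is a simplified illustration, not Breiman's full algorithm: each "tree" is only a depth-1 stump, a single predictor is used, and random feature selection at each split is omitted. The data and function names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(x, y):
    """Fit a depth-1 regression tree and return (threshold, left_mean, right_mean)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        lm, rm = mean(left), mean(right)
        err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:
        # The bootstrap sample held one distinct x value: predict its mean everywhere.
        m = mean(y)
        return (float("inf"), m, m)
    return best[1:]

def predict_stump(stump, xi):
    t, left_mean, right_mean = stump
    return left_mean if xi <= t else right_mean

def fit_forest(x, y, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict_forest(forest, xi):
    """The ensemble output is the average of the individual tree predictions."""
    return mean(predict_stump(s, xi) for s in forest)

# Hypothetical data: water temperature (deg C) and chlorophyll-a (ug/L).
temperature = [12.0, 14.0, 15.0, 22.0, 24.0, 26.0]
chlorophyll = [3.0, 4.0, 3.5, 18.0, 20.0, 19.0]
forest = fit_forest(temperature, chlorophyll)
low = predict_forest(forest, 13.0)
high = predict_forest(forest, 25.0)
```

In practice, the number of trees and the maximum tree depth would be tuned, as noted above.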

Another approach, Support Vector Regression (SVR), which is an extension of Support Vector Machine (SVM), creates a hyperplane with a decision surface that maximizes the separation between positive and negative regions, offering the potential for accurate prediction of HABs (Marcello et al., 2022). SVR focuses on determining the hyperplane with optimal separation that minimizes the distance between training samples and the decision surface. SVR uses the optimal hyperplane to accurately classify and predict HAB occurrences based on the input environmental and water quality variables.

Consider a training dataset $\{x_i, y_i\}_{i=1}^{N}$ in which $x_i \in \mathbb{R}^{n}$ is an $n$-dimensional input vector, which could be environmental and water quality parameters, and $y_i \in \mathbb{R}$ is a scalar representing the output variable, which could be chlorophyll-a, phycocyanin, algal cells, or any algal-related parameter. The objective of SVR is to find a function that maps the input vector to its corresponding output. While both random forest and SVR models can be effectively used for HAB predictions, the SVR model often performs efficiently when a clear pattern exists in the data that can be captured by the hyperplane, which makes it less effective in handling extensive complex interactions arising from complex biogeochemical processes related to HAB expansion in the water bodies. In such circumstances, the random forest model, with its capacity to capture subtle interactions and non-linear correlations, may be more appropriate.
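For reference, the mapping function is commonly obtained from the standard $\varepsilon$-insensitive SVR formulation shown below. This is the textbook form and not necessarily the exact variant used in the reviewed studies; the feature map $\phi$, weight vector $w$, bias $b$, tolerance $\varepsilon$, penalty $C$, and slack variables $\xi_i, \xi_i^{*}$ are symbols introduced here for illustration.

```latex
f(x) = w^{\top}\phi(x) + b

\min_{w,\,b,\,\xi,\,\xi^{*}} \;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^{*} \right)

\text{subject to} \quad
y_i - f(x_i) \le \varepsilon + \xi_i, \qquad
f(x_i) - y_i \le \varepsilon + \xi_i^{*}, \qquad
\xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, \dots, N
```

Errors smaller than $\varepsilon$ are not penalized, and $C$ controls the trade-off between the flatness of $f$ and the tolerance for larger deviations.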

Another type of ML model is the Artificial Neural Network (ANN), which is loosely modeled on the way biological neurons operate. The most common type of ANN is the Multilayer Perceptron (MLP), which has a feedforward structure and is often trained using the backpropagation algorithm (Cabaneros et al., 2019). In the context of HAB prediction, MLP networks are made up of one input layer with as many nodes as there are input water quality variables, one output layer representing the algal biomass, and one or more hidden layers. The number of hidden layers and nodes in each layer is often determined by trial and error, although iterative methods were suggested by Gandomi and Roke (2015). Each node in the hidden layer receives input signals that are multiplied by their corresponding connection weights. The weighted signals are summed with a bias parameter, and the output is activated based on a chosen activation function. Compared with most mechanistic models, ANN models require less computational time and learn patterns in data more effectively. However, ANN models are ineffective for predicting sequential data, which makes them unreliable for real-time HAB detection.
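The weighted-sum, bias, and activation steps can be sketched as a forward pass through a small MLP. The network size, weights, sigmoid hidden activation, and linear output below are arbitrary assumptions chosen for illustration, not values from any reviewed study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(sum_j w[k][j] * x[j] + b[k]) per node k."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: one sigmoid hidden layer followed by a single linear output node."""
    hidden = dense(inputs, hidden_w, hidden_b, sigmoid)
    return dense(hidden, out_w, out_b, lambda z: z)[0]

# Hypothetical scaled inputs (e.g., water temperature and pH) and arbitrary weights.
inputs = [0.5, -1.0]
hidden_w = [[1.0, 0.5], [-0.5, 1.0]]   # two hidden nodes, two inputs each
hidden_b = [0.0, 0.0]
out_w = [[2.0, 0.0]]                   # one output node (e.g., algal biomass index)
out_b = [1.0]
prediction = mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b)
```

Training with backpropagation would adjust `hidden_w`, `hidden_b`, `out_w`, and `out_b` to reduce the prediction error; that step is omitted here.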

Recurrent Neural Networks (RNNs) were introduced to address the inefficiency of ANNs with sequential data. RNNs are deep learning models structured to increase the depth of networks that enable various levels of data representation (Salehinejad et al., 2017). RNNs are structured to deal with the temporal dependency of data and are often used for time series forecasting and text completion (Hewamalage et al., 2021). The availability of data collection systems involving multiparameter sensors, Application Programming Interfaces (APIs), and GIS software enables the collection of high-frequency water quality data that can be used for real-time HAB monitoring. These high-frequency time series data are essential for forecasting HAB occurrence using the relationship between algal-related parameters and the observed meteorological, environmental, ecosystem, and nutrient data. The structure of RNNs enables them to store, remember, and process past data for extended periods, making them vital for observing the effects of lagged parameters and legacy nutrients on algal blooms. RNNs can map the input sequence with the output sequence at a given timestep and predict the output for the next sequence (Salehinejad et al., 2017).

Table 1. ML model structures from literature.
Model | Figure | Description
Random Forest | [figure] | A random forest model adopted by Liu et al. (2019) and developed using 500 regression trees. Each regression tree was trained with a random subset of the whole dataset consisting of various water quality parameters to make chlorophyll-a predictions.
Artificial Neural Network | [figure] | A fully connected multilayer perceptron adopted by Zhang et al. (2015) for predicting chlorophyll-a using water temperature (WT), chemical oxygen demand (COD), suspended solids (SS), Secchi disc depth (SD), total phosphorus (TP), nitrate nitrogen (NO3-N), pH, and dissolved oxygen (DO).
Support Vector Regression | [figure] | Geometric visualization of SVR adapted from Toledo-Pérez et al. (2019), indicating a hyperplane that best fits the data points while also controlling the margin around the hyperplane within which a certain fraction of the data points lie.

The most basic form of RNN is the Elman Recurrent Neural Network (ERNN), also referred to as Simple RNN or Vanilla RNN. These RNNs are trained with input temporal data $x_t$ to produce output $y_t$, which could be time-dependent or used to predict a temporal shift in $x$. The training objective is to minimize the error between the prediction and observed value by optimizing the network weights.
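A minimal sketch of the Elman recurrence follows, assuming a single hidden unit, scalar inputs, and a tanh activation. The weights are arbitrary, the function name is hypothetical, and training (weight optimization) is not shown.

```python
import math

def elman_forward(xs, w_xh, w_hh, b_h, w_hy, b_y):
    """Run a single-unit Elman RNN over a sequence and return the output at each step."""
    h = 0.0  # initial hidden state
    ys = []
    for x in xs:
        # The hidden state mixes the current input with the previous hidden state.
        h = math.tanh(w_xh * x + w_hh * h + b_h)
        ys.append(w_hy * h + b_y)
    return ys

# Arbitrary weights; the second input is zero, so the second output
# depends only on what the hidden state remembers from the first step.
outputs = elman_forward([1.0, 0.0], w_xh=1.0, w_hh=0.5, b_h=0.0, w_hy=1.0, b_y=0.0)
```

The nonzero second output shows how the recurrent connection carries information from earlier timesteps forward.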

However, basic RNNs are limited by the vanishing gradient problem, which refers to the model's inability to retain information for extended periods. This occurs when basic RNNs are developed with many layers, leading to a decrease in gradient, which prevents efficient model learning. In simple terms, the basic RNNs are characterized by short-term memory, making them unable to learn long sequences and often inefficient in capturing long-term dependencies in data, which led to the development of models such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) (Bianchi et al., 2017; Salehinejad et al., 2017; Torres et al., 2021).

Table 1 (continued). ML model structures from literature.
Model | Figure | Description
Long Short-Term Memory model | [figure] | A general structure of LSTM showing the various components regulating the flow of information in the cell. $x_t$ refers to input datasets (e.g., water quality variables at a given timestep), and $y_t$ refers to the model output at the same time (e.g., chlorophyll-a, cyanobacteria cells). The input and output variables could also be the same by using the predictive outputs of the previous steps to compose the input sequence to the next step, as Yussof et al. (2021) did in the multistep prediction of chlorophyll-a using the previous timestep of observation.
Hybrid Model | [figure] | An integrated model adopted by Cao et al. (2022), consisting of convolutional and max pooling layers for extraction of features and LSTM layers for prediction of cyanobacteria blooms. The convolutional part can be considered a special data-preprocessing technique that provides input datasets for the LSTM part.

LSTM was developed by Hochreiter and Schmidhuber (1997) to solve the vanishing gradient problem encountered by standard RNNs by introducing more elaborate internal processing units called cells. LSTM addresses the issue by keeping a constant error flowing back through time with no imposition of bias toward recent observations (Bianchi et al., 2017). The user often determines the number of cells/nodes to include in an LSTM model. An LSTM cell is made up of different nonlinear components that interact. The propagation of information across time in individual cell states is enabled through the linear interaction of the components. However, LSTM is computationally slow and requires higher memory, given the multiple memory cells in its structure (Salehinejad et al., 2017).

The limitations identified in LSTM led to the development of Gated Recurrent Units (GRUs) by Cho et al. (2014).

GRUs are like LSTM but with only two gates to modulate the flow of information in the cells. Unlike the LSTM, the input and forget gates are merged into the update gate, which determines the extent to which the cell content is updated with new information. The other gate in GRUs is the reset gate, which controls how much of the previous cell state is used to compute the next candidate state. These complex connections between the cells and layers enable LSTM and GRUs to capture relevant patterns and dynamics related to algal growth and expansion.
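The two gates can be sketched for a single scalar GRU unit as shown below. The parameter names are hypothetical, and the sketch assumes the convention in which the update gate weights the new candidate state; some formulations swap the roles of z and 1 - z.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One timestep of a scalar GRU unit with parameters stored in dict p."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])  # reset gate
    # The reset gate scales how much of the previous state enters the candidate.
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # The update gate blends the previous state with the candidate state.
    return (1.0 - z) * h_prev + z * h_cand

zeros = dict.fromkeys(
    ["w_z", "u_z", "b_z", "w_r", "u_r", "b_r", "w_h", "u_h", "b_h"], 0.0)
keep = dict(zeros, b_z=-50.0)     # update gate ~ 0: previous state is kept
replace = dict(zeros, b_z=50.0)   # update gate ~ 1: state becomes the candidate
h_keep = gru_step(1.0, 0.3, keep)
h_replace = gru_step(1.0, 0.3, replace)
```

With the update gate near zero the unit retains its previous state, which is how gated units preserve information over long sequences.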

More recently, the combination of different ML algorithms has improved predictive performance. ML models could be considered a unit in a workflow for finding patterns in data or for predictive purposes. Effective model training is highly reliant on data treatment and preprocessing options. Data transformation techniques such as Principal Component Analysis (PCA) and simple correlation analysis before model training are great examples of hybrid models researchers often use unknowingly (Didona et al., 2015). These hybrid models help reduce the dimensions of input parameters related to HABs, which could reduce the runtime of HAB model development while maintaining model integrity. Hybridization of machine learning models could enable better prediction and monitoring through effective feature extraction and pattern observation (Abdelrahim et al., 2016). Hybrid Machine Learning (HML) combines algorithms, processes, and procedures from similar or different fields of study to complement one another. Real-time monitoring of HABs will benefit from using optimization algorithms like particle swarm optimization and genetic algorithms to obtain optimum model parameters and ensemble models for accurate HAB prediction.

Application of ML Models for HAB Monitoring

Drivers and Environmental Stresses

The potential of ML models to monitor HABs has been shown by various studies, ranging from exploring HAB drivers, to predicting bloom dynamics using different algal parameters as indexes, to evaluating climatic and anthropogenic effects on HAB expansion. Nutrient influx is a critical enabling condition for HAB expansion and toxin secretion. Macronutrients such as nitrogen and phosphorus are vital for algal growth, and their dynamics in freshwater systems are a major factor in determining algal expansion (Bae and Seo, 2021). Regulating inland activities such as fertilizer application and waste disposal is one strategy to monitor and reduce nutrient runoff into water bodies. However, internal nutrient loadings from sediment in freshwater can release phosphorus and increase its availability for algal utilization. These considerations motivated Haggard et al. (2023a) to use regression trees and non-parametric changepoint analysis to assess the contribution of nutrients to the expansion of cyanobacteria and toxin secretion. The study further determined thresholds and structures in physicochemical data with microcystin concentrations in Lake Fayetteville, Arkansas. While nitrogen and phosphorus are likely drivers of toxin production in the lake, most physicochemical variables show a changepoint with microcystin, which underscores the importance of monitoring internal and external nutrient loadings for proper HAB management.

Controlling the expansion of HABs entails understanding the relationship between environmental stresses and algal-related responses. This will help facilitate Best Management Practices (BMPs) and other strategies that reduce eutrophication and HAB expansion. In a study by Park et al. (2015), a 2-dimensional hydrodynamic and water quality model (CE-QUAL-W2) was linked with regression trees to predict chlorophyll-a concentration as an indicator of HABs using predictors such as water temperature, pH, dissolved oxygen, and nutrient contents. The regression trees proved effective in identifying the sensitivity of chlorophyll-a to different stressor variables. Diurnal patterns in environmental variables and nutrient availability linked with planting seasons affect algal population dynamics. Light intensity varying over a short period also affects algal biomass due to an uneven rate of photosynthesis. These complex interactions in freshwater systems complicate the monitoring process and necessitate studying the impacts of environmental variables and nutrient contents on algal productivity at different timescales. Liu et al. (2019) used a random forest model to predict chlorophyll-a concentration in a freshwater lake at different timescales. While chlorophyll-a predictability increased with daily datasets compared to their monthly counterparts, the drivers of algal dynamics varied with timescales. This emphasizes the need to consider temporal patterns for understanding algal dynamics when making decisions about HAB management.

Because toxin-producing algae, such as Microcystis aeruginosa complex (MAC) species, can produce cyanotoxins detrimental to human and cattle lives, Segura et al. (2017) adopted the random forest model to predict MAC events in the Salto Grande reservoir on the Uruguay River using in situ measured environmental variables such as temperature, salinity, turbidity, and wind speed. The model was developed using 1000 individual trees, and approximately 70% of the entire dataset was used to train the model. The study revealed the random forest model's potential to correctly predict the occurrence of MAC blooms using easy-to-measure environmental variables with an adequate sample size. The study further showed the relationship between the expansion of the algal species and environmental variables. Effective management of HABs often entails identifying factors contributing to bloom development and the risk of bloom associated with each factor. HAB enabling conditions are diverse and often vary with the algal species involved. Peretyatko et al. (2012) used regression trees to assess the risk of cyanobacterial bloom occurrence in 48 ponds from the Brussels Capital Region. The study identified environmental constraints that threaten eukaryotic phytoplankton and thereby favor cyanobacterial blooms, given the ability of cyanobacteria to develop adaptive strategies. This shows cyanobacteria's resilience and tendency to withstand stringent environmental conditions to perpetuate their detrimental effects.

HABs ML Model Optimization

The use of ML models to combat HABs entails having optimized model hyperparameters for accurate predictions and effective decision-making (Zhu et al., 2023); thus, data preprocessing is essential for model development. For example, random forest performance highly depends on hyperparameters such as leaf numbers. The maximum leaf number determines the leaf nodes for each tree in the model. This was explored by Yajima and Derot (2018) when forecasting chlorophyll-a to monitor phytoplankton biomass in two waterbodies using a random forest model. The study revealed that model performance varies with leaf numbers and water quality parameters. The random forest model was further used to rank the importance of each water quality parameter, and the most influential parameters were derived for the two studied stations. In a different study by Lou et al. (2017), the SVR model performance was optimized using the particle swarm optimization (PSO) algorithm, which enabled the development of an intelligent model to monitor algal growth dynamics in the Macau Main Storage Reservoir (MSR). PSO enables the identification of important variables and their utilization for model development.
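As an illustration of how PSO searches for hyperparameters, the sketch below runs a basic global-best PSO on a stand-in objective. The quadratic `validation_error` surface, the two tuned quantities, the bounds, and the swarm settings are all assumptions made for this example and do not reproduce the setup of Lou et al. (2017); in a real application the objective would be the cross-validated error of the trained model.

```python
import random

def pso(objective, bounds, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best particle swarm optimization (minimization)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia plus attraction to personal and global best positions.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)  # stay in bounds
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def validation_error(params):
    """Hypothetical stand-in for model validation error; minimum at (1.0, 0.1)."""
    log_c, epsilon = params
    return (log_c - 1.0) ** 2 + (epsilon - 0.1) ** 2

best_params, best_error = pso(validation_error, bounds=[(-2.0, 3.0), (0.0, 1.0)])
```

Each particle is a candidate hyperparameter setting, and the swarm converges toward the setting with the lowest error found so far.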

The training of ML models could benefit from the extraction of features and the utilization of fewer variables to develop a model. This enhances model performance with reduced runtime for the model training process. The genetic algorithm (GA) and cross-validation approach are other techniques often adopted for optimizing model parameters. These methods have proved useful for improving the performance of ML models for HAB monitoring (Busari et al., 2023; Wang et al., 2017). Wang et al. (2017) developed an SVR model optimized with GA to predict phytoplankton abundances associated with algal blooms in a Macau freshwater reservoir. These models could predict and forecast the monthly abundance of phytoplankton, which is vital in developing a water quality monitoring framework. Given SVR's strong dependence on the kernel function, García-Nieto et al. (2020) introduced a novel multiscale wavelet as a kernel function for an SVR to improve the ability of the model to forecast abnormal proliferation in La Barca reservoir, Mexico, using a differential evolution algorithm for model optimization. In addition to improved performance compared to the Radial Basis Function (RBF) adopted by Giere et al. (2020), the developed model echoes the significant relationship between cyanobacteria cells and chlorophyll-a concentration.

Several ANN models have also been used to understand chlorophyll-a concentration as an index of HABs and its relationship with other water quality parameters. Coad et al. (2014) leveraged the high temporal resolution datasets obtained from an autonomous buoy deployed in the Hawkesbury River, New South Wales, Australia. Using optimized ANN, the authors predicted the daily mean chlorophyll-a concentration as an index of algal blooms. The model made a multistep prediction of chlorophyll-a concentration that could be incorporated with in situ environmental monitoring to provide proactive solutions to estuarine algal blooms. Similarly, Zhang et al. (2015) also adopted the ANN model to forecast three water quality variables to indicate eutrophication in Yuqiao reservoir. The model was used to forecast water temperature, total phosphorus, and chlorophyll-a two weeks in advance. The study demonstrated ANN's prowess in providing valuable information for proper reservoir management. Understanding the process that leads to algal blooms is essential for efficient management. Furthermore, effective prediction of cyanobacteria blooms, the most common HABs in freshwater systems, entails judicious selection of training datasets and optimum parameters. Kim et al. (2023) used ANNs to diagnose a novel algorithm of self-generating training datasets from available meteorological and water quality data in the Nakdong River, South Korea. The study identified training datasets that provided an optimum multistep prediction of cyanobacteria using an optimized model hyperparameter.

Real-Time and Time Series Prediction

Real-time prediction is crucial for creating early warning systems for HAB occurrences. This will provide swift initiation of management actions and prevention of public outbreaks. RNN models like LSTM and GRU can efficiently consider the data's time dependency when making predictions. This is particularly useful for continuously predicting HAB indicators using high-frequency water quality datasets provided by multiparameter sondes. Lee and Lee (2018) leveraged the time series prediction ability of LSTM to predict algal blooms in four major rivers in South Korea. The model was used to make a one-week ahead prediction of algal blooms using chlorophyll-a as an index. Satellite imagery is also a good source of time series data for developing time-dependent models. A study by Yussof et al. (2021) trained an LSTM model using satellite time series data to predict chlorophyll-a concentration as an index of HAB events on the West Coast of Sabah, Malaysia. These time-dependent models are useful for observing trends in HAB events and understanding spatiotemporal patterns in water quality parameters. With the many water quality variables available, time-dependent models could be complex and memory-intensive during training and require some dimension reduction. The use of principal component analysis (PCA) to characterize the main environmental effect triggering algal growth was explored by Wen et al. (2022). Their study used an LSTM model to forecast HABs using the time series of the leading environmental factors. The study emphasized the LSTM model's potential to improve performance when predicting HABs, provided the main factors triggering algal growth are included in the model. This could create spatiotemporal multiple warning levels of HABs based on algal growth rate.
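
Time-dependent models such as LSTM and GRU are typically trained on fixed-length windows of antecedent observations paired with the values to be forecast. The sketch below shows one way to build such windows for a multistep forecast. The function name `make_windows`, the lag and horizon lengths, and the chlorophyll-a values are illustrative assumptions, not taken from the reviewed studies.

```python
def make_windows(series, n_lags, horizon):
    """Pair each run of `n_lags` past values with the next `horizon` values."""
    inputs, targets = [], []
    for start in range(len(series) - n_lags - horizon + 1):
        split = start + n_lags
        inputs.append(series[start:split])            # antecedent observations
        targets.append(series[split:split + horizon])  # values to forecast
    return inputs, targets


# Made-up daily chlorophyll-a readings (ug/L), for illustration only.
chl_a = [4.1, 4.8, 5.6, 7.9, 11.2, 9.4, 8.0, 6.3]
X, y = make_windows(chl_a, n_lags=3, horizon=2)
for past, future in zip(X, y):
    print(past, "->", future)
```

Each input row can then be fed to a sequence model, with the matching target row as its multistep output.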

Recently, researchers have combined different algorithms to improve HAB monitoring with ML models. These hybrid models are useful for variable selection (Shan et al., 2022), optimization of model parameters (Xia and Zeng, 2021), and decision support tools based on ensemble model predictions (Cao et al., 2022; Qin et al., 2017). Liu et al. (2022) combined the discrete wavelet transformation of the time series of algal dynamics with the LSTM model to forecast HABs at different time scales. This approach allows the identification of time series patterns and their dynamics across different seasons. Hybrid models could also be structured by combining ML models with PCA for extracting the main environmental factors (MEFs) influencing algae growth. These MEFs can be used as inputs for time series models like ARIMA and LSTM to develop a spatiotemporal HAB forecasting model (Wen et al., 2022). While algal biomass is detrimental to the environment, the toxins produced are likewise dangerous to the environment and the ecosystem. Current ML applications for toxin monitoring include capturing the non-linear relationships between various toxins and environmental variables. These relationships are vital for developing models for toxin predictions and understanding the environmental conditions that trigger their release.

The potential of ML models to perform real-time prediction of microcystin concentration was explored by Shan et al. (2022). The model consists of one XGBoost module and two parallel LSTM models for multistep predictions using antecedent observations as inputs. The developed model showed improved performance compared to models without the XGBoost unique feature extraction technique and suggested the potential to make early toxicity warnings. Similarly, Haggard et al. (2023b) explored the use of regression trees to identify the threshold values of raw fluorescence units (RFUs) of phycocyanin and chlorophyll-a that could influence the concentration of total microcystin in Lake Fayetteville, Arkansas. The release of microcystin in freshwater systems can be related to a certain level of chlorophyll-a and phycocyanin concentration, which suggests advisories for microcystins can be made based on the threshold of these easy-to-measure water quality parameters.
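
The way a regression tree identifies a threshold can be sketched with a single split: the tree searches for the predictor value that best separates low from high responses. The example below is a minimal illustration of that search and not the analysis of Haggard et al. (2023b). The function name `best_split` and the fluorescence and microcystin values are made up.

```python
def best_split(x, y):
    """Return (threshold, sse) for the single split on x that minimizes
    the summed squared error of y around the mean of each side."""
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(x, y))
    best_threshold, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_threshold, best_sse = threshold, total
    return best_threshold, best_sse


# Hypothetical phycocyanin RFUs and total microcystin (ug/L).
rfu = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
microcystin = [0.2, 0.3, 0.2, 1.9, 2.1, 2.0]
threshold, error = best_split(rfu, microcystin)
print("threshold RFU:", threshold)
```

A full regression tree repeats this search recursively on each side of the split, which yields the hierarchical thresholds reported in such studies.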

In summary, algal dynamics in freshwater systems are complex due to the system's various enabling conditions and biogeochemical processes. These complex dynamics are exacerbated by different algal species causing harm through biomass or toxin release. High-frequency datasets provided through water quality monitoring make it possible to develop ML models to capture patterns and predict HAB-related parameters. These models proved useful for identifying significant parameters triggering blooms, observing the effects of nutrient contents, and making time series predictions of HABs, which is essential for real-time application and the creation of early warning systems. Table 2 shows studies using ML models for HAB prediction.

Table 2. Past research on algal bloom prediction using ML models.

| Model | Study Area | Predicted Parameter | Input Data | Remarks | Reference |
|---|---|---|---|---|---|
| Random forest | Yuqiao reservoir, China | Chlorophyll-a | Total nitrogen, total phosphorus, ammonia, water temperature, and wind speed. | The study showed the drivers of phytoplankton at different timescales and the relative roles of nitrogen and phosphorus limitation in lakes. | (Liu et al., 2019) |
| Random forest | Urayama reservoir, Lake Shinji, Japan | Chlorophyll-a | Biochemical oxygen demand (BOD), chemical oxygen demand (COD), pH, and total nitrogen/total phosphorus (TN/TP). | The best model parameters varied at different stations in the same reservoir. The model performance and prediction accuracy varied between the two sites. | (Yajima & Derot, 2018) |
| Random forest | Lake Mjosa, Norway | Phytoplankton communities | Total nitrogen, total phosphorus, water transparency, water temperature, daily mean air temperature, daily total precipitation, daily sunshine duration, daily mean wind speed, and daily inflow and outflow discharge. | The study disclosed the influence of environmental drivers on shifts in the algal community. | (Liu et al., 2023) |
| Linear regression, Random forest, Support vector machine | Billings reservoir, Sao Paulo, Brazil | Chlorophyll-a, cyanobacteria, microcystin | Water temperature (WT), dissolved oxygen, pH, biochemical oxygen demand, chemical oxygen demand, ammoniacal nitrogen, total nitrogen, total phosphorus, nitrite, nitrate, manganese, magnesium, potassium, sodium, and turbidity. | The study revealed that at high water temperatures and high cyanobacteria density, microcystin concentration can increase. | (Godoy et al., 2023) |
| Random forest | Salto Grande reservoir, Uruguay | MAC | Water temperature, turbidity, wind intensity, and salinity. | The developed model was able to predict MAC organisms using easy-to-measure environmental variables. | (Segura et al., 2017) |
| Support vector regression | Macau Main Storage Reservoir (MSR), China | Phytoplankton abundance | Alkalinity, bicarbonate (HCO3-), dissolved oxygen (D.O.), total nitrogen (T.N.), turbidity, conductivity, nitrate, suspended solid (S.S.), and total organic carbon (TOC). | The model predicted phytoplankton abundance with high accuracy. | (Lou et al., 2017) |
| Support vector regression | Lake Utah, USA | Phycocyanin | pH, temperature, turbidity, D.O., conductivity, precipitation, wind direction, wind speed, minimum and maximum air temperature. | The SVR model was developed using easy-to-obtain input parameters and requires minimal data when retraining to accommodate different lakes. | (Giere et al., 2020) |
| Support vector regression, Random forest | La Barca reservoir, Spain | Chlorophyll-a, Total phosphorus | Water temperature (ºC), turbidity, nitrate, pH, conductivity, and D.O. | The developed models can establish the significance of each parameter for improved algal growth. | (García-Nieto et al., 2020) |
| Genetic algorithm-Support vector machine, Genetic algorithm-relevance vector machine | Macau main storage reservoir, China | Phytoplankton abundance | pH, SiO2, alkalinity, bicarbonate, dissolved oxygen (D.O.), total nitrogen (T.N.), UV254, and total organic carbon (TOC). | The study revealed improved prediction performance when SVR was optimized using the genetic algorithm. | (Wang et al., 2017) |
| Support vector regression, Random forest, Artificial neural network, Decision trees, K-neural network | Western Lake Erie, USA, and Canada | Chlorophyll-a | Soluble reactive phosphorus, particulate organic carbon, total inorganic nitrogen, sea surface temperature. | The study was able to identify key variables affecting HABs using ensemble ML models, and model structure is often site dependent. | (Jimmy et al., 2021) |
| Artificial neural network | Yuqiao Reservoir, China | Water temperature, total phosphorus, and chlorophyll-a | Water temperature, pH, electrical conductivity, T.N., ammonium, nitrate, nitrite, D.O., chemical oxygen demand (COD), BOD, total phosphate (T.P.), phosphate, suspended solids (S.S.), total dissolved solids (TDS), Secchi disk depth (S.D.), air temperature, sunshine hours, precipitation. | The study demonstrated the potential of ANN to forecast eutrophication up to two weeks in advance and provide valuable information for nutrient management. | (Zhang et al., 2015) |
| Artificial neural network | Berowra Estuary, a tributary of the Hawkesbury River, NSW, Australia | Chlorophyll-a | Chlorophyll-a, temperature, salinity, freshwater inflow, salinity recovery time, tidal velocity, and specific growth rate. | The prediction accuracy of the developed ANNs decreased from one to seven days in advance. | (Coad et al., 2014) |
| Artificial neural network | Western Lake Erie, USA, and Canada | Chlorophyll-a and Microcystis | Water temperature, water clarity, total phosphorus, total dissolved phosphorus, soluble reactive phosphorus, soluble silica, chloride, wind direction, wind speed, ambient temperature, and total daily irradiance. | The developed ANN models could predict phytoplankton and Microcystis biomass and identify non-linear environmental interactions affecting harmful cyanobacteria. | (Millie et al., 2014) |
| Artificial neural network | Alfacs Bay, Spain | Karlodinium and Pseudo-nitzschia | Pseudo-nitzschia abundance, water temperature, river flow, wind speed, atmospheric pressure, and mean irradiance. | The study showed the complex interactions between anthropogenic, climatic, and hydrologic factors influencing phytoplankton dynamics. | (Guallar et al., 2016) |
| Artificial neural network, Gated recurrent unit, Long short-term memory, Support vector regression, Decision tree regression, Seasonal autoregressive integrated moving average, Linear regression | Han River, South Korea | Chlorophyll-derived trophic state index | Chemical oxygen demand, biological oxygen demand, total organic carbon, total suspended solids, total phosphorus, dissolved total phosphorus, phosphate (PO4-P), total nitrogen (TN), dissolved total nitrogen, nitrate (NO3-N), ammonia (NH3-N), Chl-a, temperature, precipitation, flowrate, DO, pH, electroconductivity, total coliform, and fecal coliform. | The study revealed the potential of ML models to accurately model water quality and decrease the labor intensity attributed to laboratory experiments often engaged in water quality monitoring. | (Ly et al., 2021) |
| Extra trees regression, Support vector regression, Gradient boosting regression tree, Multiple linear regression, Deep learning-based transformer model | South-North Water Diversion Project, Tianjin, China | Chlorophyll-a | Total phosphorus, phosphorous-phosphate, total nitrogen, nitrogen-nitrate, nitrogen-ammonia, potassium permanganate index, and total organic carbon. | The study revealed total phosphorus as the major driver of algal growth in the studied site and suggested phosphorus loading control as a strategy to manage algal growth. | (Qian et al., 2023) |
| Long short-term memory model, Convolutional neural network | West Coast of Sabah, Malaysia | Chlorophyll-a | Chlorophyll-a | The LSTM model outperformed the CNN model in terms of accuracy using RMSE and the correlation coefficient r as performance metrics. | (Yussof et al., 2021) |
| Extreme gradient boosting-Long short-term memory model (Hybrid) | Three Gorges Dam on the Yangtze River, China | Algal cell density and microcystin concentration | Chlorophyll-a, cell counting, microcystin, conductivity, pH, D.O., COD, ammonia, total nitrogen, total phosphorus, turbidity, water temperature, water level, air temperature, atmospheric pressure, wind speed, and direction. | The proposed model showed potential for real-time prediction of algal parameters and swift detection of HABs in aquatic ecosystems. | (Shan et al., 2022) |
| Long short-term memory model-Convolutional neural network (Hybrid) | Lake Taihu, China | FAI index for identifying cyanobacteria area | Cyanobacteria area and meteorological time series such as temperature, relative humidity, wind speed, sunshine hours, and precipitation. | The developed CNN-LSTM model predicted temporal changes in CyanoHAB area better than ordinary CNN and LSTM models. | (Cao et al., 2022) |
| Gated recurrent unit | Mopanshan Reservoir, Harbin, China | Chlorophyll-a | Water temperature, air temperature, total nitrogen, dissolved oxygen, chemical oxygen demand, permanganate index, ammonia nitrogen, electrical conductivity, total hardness, sulfate, and transparency. | The study showed that the GRU model based on particle swarm optimization can effectively predict chlorophyll-a concentration in reservoirs. | (Zhang et al., 2023) |
| Wavelet analysis-Long short-term memory model | Lake Mendota, Wisconsin, USA; Lake Tuesday, Michigan, USA | Chlorophyll-a | Chlorophyll-a; cyanobacterial cell biomass | The hybrid model forecasted high peaks of algal fluctuations and extreme values, reducing forecasting inaccuracies when making multistep predictions. | (Liu et al., 2022) |

Measures to Consider when Selecting ML Models for HABs Prediction

As data-driven tools, ML models depend on sufficient and reliable data with limited noise. Good data refers to datasets that are large enough and of good quality to train the model efficiently (Zhu et al., 2023). The models are often specific to the nature of the tasks, which determine the type, features, or data required. Depending on the target objective, the datasets could be continuous, high-frequency water quality data from multiparameter sensors. These continuous datasets are specifically helpful in developing a time series model of HABs and forecasting HAB-related parameters at different timescales. Data preprocessing is a critical step in developing data-driven models, and selecting appropriate techniques is often based on user expertise and the nature of the problem. Data preprocessing includes filling gaps in data, scaling, feature engineering, and splitting the data into training, validation, and testing sets. Further processing could also be required when making multistep predictions of HABs based on a chosen window of interest. Data scaling techniques include Z-score normalization, Min-max scaling, Max-abs scaling, Robust scaling, and Log transformation (Obaid et al., 2019). These techniques transform and normalize data to ensure the features (water quality parameters) are on a similar scale. This approach is instrumental in addressing variations in units of measurement and an uneven range of water quality parameters.
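
The scaling techniques listed above can be written out in a few lines each. The sketch below uses only the Python standard library. The function names, the use of the population standard deviation for the Z-score, the "inclusive" quartile convention for robust scaling, and the sample temperatures are assumptions made for illustration. Each function assumes the feature is not constant, and the log transform assumes non-negative values.

```python
import math
import statistics


def z_score(values):
    # Center on the mean and divide by the (population) standard deviation.
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def min_max(values):
    # Rescale linearly so the smallest value maps to 0 and the largest to 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def max_abs(values):
    # Divide by the largest absolute value, keeping the sign of each value.
    peak = max(abs(v) for v in values)
    return [v / peak for v in values]


def robust(values):
    # Center on the median and divide by the interquartile range,
    # which limits the influence of outliers.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def log_transform(values):
    # log(1 + v) compresses right-skewed, non-negative measurements.
    return [math.log1p(v) for v in values]


water_temp = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up values, degrees C
print(min_max(water_temp))
print(robust(water_temp))
```

To avoid information leaking from the test period into training, the scaling statistics (mean, minimum, quartiles, and so on) would normally be computed on the training set only and then applied to the validation and testing sets.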

Like other processes in the modeling framework, model selection depends greatly on the target objectives (Pratim et al., 2023). Candidate models should be assessed for their scalability, complexity, interpretability, robustness, and demand for computational resources such as memory, processing power, and runtime. Large datasets may not suit every ML model. It is crucial to consider RNN models such as LSTM that can handle high data volumes, especially when dealing with time series HAB predictions (Hewamalage et al., 2021). Models like regression trees and random forests perform better in understanding the impact of each predictor on the output variable. While regression tree models can be easily interpreted, they often cannot capture non-linear relationships in complex systems. Neural networks, however, have poor interpretability but solve non-linear problems more efficiently, although feature importance strategies can address low interpretability (Zhu et al., 2023). These trade-offs often exist among various ML models; final decisions are determined based on study design and objectives. Model explainability is another factor to consider when selecting models for HAB monitoring (Lipton, 2018). It could be beneficial to know how the model's inner mechanism works to better understand the importance of each parameter to the predicted HAB parameter. However, a model's explanation could be inconsistent with reality despite good accuracy, and expert knowledge could be required to align the explanation with the causal relationship between the predictors and the HAB output parameter. This is because most explainability methods are designed to better understand the model's working principle, rather than the waterbody's biogeochemical, meteorological, or ecosystem mechanisms (Burkart and Huber, 2021).

Upon model selection and training, model performance is evaluated. Commonly used performance indicators for assessing ML models have been reported by Naser and Alavi (2021). Major metrics used for HAB-related predictions are root mean square error (RMSE), mean absolute error (MAE), mean squared error (MSE), and coefficient of determination (R²). The closer the RMSE, MAE, and MSE are to 0, the better the model performance, while the closer the R² is to 1, the better the model performance. Improving model performance entails reviewing the size of training datasets, using different optimization packages, and manually tuning individual hyperparameters. In summary, selecting the appropriate ML model for HAB prediction is intuitive and evidence-based. Diverse ML models make it possible to address distinct HAB-related challenges, and data preprocessing is important throughout because it greatly affects the performance of the developed models.
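
The four metrics named above follow directly from their standard definitions and can be computed as in the sketch below. The function name `regression_metrics` and the observed and predicted chlorophyll-a values are illustrative assumptions.

```python
import math


def regression_metrics(observed, predicted):
    """Return RMSE, MAE, MSE, and R2 for paired observations and predictions."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    mse = ss_res / n
    return {
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "R2": 1.0 - ss_res / ss_tot,
    }


# Made-up observed and predicted chlorophyll-a values (ug/L).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
scores = regression_metrics(observed, predicted)
print(scores)
```

Because RMSE and MSE square the errors, a single large miss (as in the last value here) penalizes them more heavily than MAE, which is one reason studies often report several metrics together.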

Research Needs and Future Direction

HABs are detrimental due to their biomass growth and decomposition and their toxicity, necessitating proactive monitoring actions. The review of existing literature shows the potential of ML models to capture the complex nature of HABs and offer solutions to their menace. The review shows that algal biomass prediction and the identification of significant variables related to HAB expansion have been widely explored. However, fewer studies have used ML models to predict toxicity, which could be due to a lack of high-frequency toxicity data. Data with high spatial and temporal resolution are an essential requirement for ML models. Simulation of HAB-related toxins like cyanotoxins in lakes and ponds requires numerous data points to capture the varying characteristics of different toxin-causing algal species.

Unlike the availability of multiparameter sensors that measure high-frequency physicochemical parameters, measuring cyanotoxins at high frequencies is difficult. One way to deal with this limited data issue is an intensive sampling of the waterbody for cyanotoxin concentrations and incorporating data assimilation techniques to populate the data into the required resolution for modeling purposes. Furthermore, the design of biosensors for rapid monitoring of HAB-related toxin levels in freshwater systems is highly needed to facilitate continuous datasets usable by ML models. These sensors will provide datasets for monitoring toxin levels and be a decision-support tool for making regulatory advisories. Investigating the impacts of model and measurement uncertainty on HAB predictions is essential. Model structures contain influential parameters that affect prediction accuracy, while observed data from sensors may suffer from errors such as calibration issues, sensor drift, and maintenance problems. These errors can compromise the integrity of acquired data and the accuracy of model predictions. Future research should focus on incorporating these uncertainties into HAB ML models to ensure better anomaly detection and more reliable predictions. Finally, future research can also explore the deployment of these ML models into real-time HABs and toxin monitoring systems. The real-time prediction should follow standard validation techniques and take an interdisciplinary approach to enable collaboration with environmental, computing, and ecological professionals. Upon being incorporated into real-time monitoring systems, these predictive models will aid in swiftly identifying HAB events and developing early warning systems.

Conclusion

Machine Learning shows excellent potential for HAB monitoring with state-of-the-art models to simulate discrete and time series of various indices of algal proliferation. Researchers have adopted these models to understand relationships between aquatic ecosystems and meteorological and hydrological factors. Complex models are also adopted by combining various ML algorithms to dig deeper into the relationship between water quality and algal-related parameters. The challenge is finding optimum model hyperparameters for efficient modeling, which requires different optimization techniques. ML models are vital for identifying significant parameters influencing algal growth, upon which domain knowledge can be used to explain causality. Different regulations and standards for different parameters across regions could affect the identification of algal blooms. For example, a region could set the chlorophyll-a standard as 40 µg/L, meaning observations above this value are treated as anomalies, whereas other regions might set a different standard. ML applications for HAB monitoring will benefit from real-time toxin-detecting sensors, which would provide an avenue to understand environmental and ecosystem conditions triggering toxin release by various algal species. An accurate understanding of these dynamics will aid in the swift implementation of management actions and the protection of public health.

Acknowledgments

The authors sincerely thank the Greenville County, Department of Land Development, Stormwater Division, for providing the dataset utilized for this study. The authors also thank Clemson Computing and Information Technology (CCIT) for the cyberinfrastructure resources and advanced research computing capabilities provided through its Cyber Infrastructure Technology Integration (CITI) group. The authors acknowledge the Southern Sustainable Agriculture Research and Education (S-SARE) program under subaward number LS21-2595 and the S.C. Sea Grant, NOAA Federal Award No. NA22OAR4170114 for supporting parts of this project.

References

Abdelrahim, M., Merlos, C., & Wang, T. G. (2016). Hybrid machine learning approaches: A method to improve expected output of semi-structured sequential data. Proc. 2016 IEEE 10th Int. Conf. on Semantic Computing (ICSC) (pp. 342-345). IEEE. https://doi.org/10.1109/ICSC.2016.72

Anderson, D. M., Glibert, P. M., & Burkholder, J. M. (2002). Harmful algal blooms and eutrophication: Nutrient sources, composition, and consequences. Estuaries, 25(4), 704-726. https://doi.org/10.1007/BF02804901

Bae, S., & Seo, D. (2021). Changes in algal bloom dynamics in a regulated large river in response to eutrophic status. Ecol. Model., 454, 109590. https://doi.org/10.1016/j.ecolmodel.2021.109590

Bevilacqua, M., Ciarapica, F. E., & Giacchetta, G. (2010). Data mining for occupational injury risk: A case study. Int. J. Reliab. Qual. Saf. Eng., 17(4), 351-380. https://doi.org/10.1142/s021853931000386x

Bianchi, F. M., Maiorino, E., Kampffmeyer, M. C., Rizzi, A., & Jenssen, R. (2017). Recurrent neural networks for short-term load forecasting: An overview and comparative analysis. Cham, Switzerland: Springer. https://doi.org/10.1007/978-3-319-70338-1

Biau, G., & Scornet, E. (2016). A random forest guided tour. TEST, 25(2), 197-227. https://doi.org/10.1007/s11749-016-0481-7

Breiman, L. (2001). Random forests. Mach. Learn., 45(1), 5-32. https://doi.org/10.1023/A:1010933404324

Brönmark, C., & Hansson, L.-A. (2002). Environmental issues in lakes and ponds: Current state and perspectives. Environ. Conserv., 29(3), 290-307. https://doi.org/10.1017/S0376892902000218

Burkart, N., & Huber, M. F. (2021). A survey on the explainability of supervised machine learning. J. Artif. Intell. Res., 70, 245-317. https://doi.org/10.1613/jair.1.12228

Busari, I. O., Sahoo, D., Jana, R., & Privette, C. (2023). Chlorophyll a Predictions in a Piedmont Lake in Upstate South Carolina Using Machine-Learning Approaches. Journal of South Carolina Water Resources, 9(1), 9.

Cabaneros, S. M., Calautit, J. K., & Hughes, B. R. (2019). A review of artificial neural network models for ambient air pollution prediction. Environ. Model. Softw., 119, 285-304. https://doi.org/10.1016/j.envsoft.2019.06.014

Cao, H., Han, L., & Li, L. (2022). A deep learning method for cyanobacterial harmful algae blooms prediction in Taihu Lake, China. Harmful Algae, 113, 102189. https://doi.org/10.1016/j.hal.2022.102189

Carmichael, W. W., & Boyer, G. L. (2016). Health impacts from cyanobacteria harmful algae blooms: Implications for the North American Great Lakes. Harmful Algae, 54, 194-212. https://doi.org/10.1016/j.hal.2016.02.002

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proc 2014 Conf. on Empirical Methods in Natural Language Processing, (pp. 1724-1734). https://doi.org/10.3115/v1/d14-1179

Christa, C., Ferreira, J., Andrew, R., Xiaohui, Q., & Bijeta, S. (2021). Quantifying the socio-economic impacts of harmful algal blooms in Southwest Florida in 2018. SSRN Electron. J., 2.

Coad, P., Cathers, B., Ball, J. E., & Kadluczka, R. (2014). Proactive management of estuarine algal blooms using an automated monitoring buoy coupled with an artificial neural network. Environ. Model. Softw., 61, 393-409. https://doi.org/10.1016/j.envsoft.2014.07.011

Copado-Rivera, A. G., Bello-Pineda, J., Aké-Castillo, J. A., & Arceo, P. (2020). Spatial modeling to detect potential incidence zones of harmful algae blooms in Veracruz, Mexico. Estuar. Coast. Shelf Sci., 243, 106908. https://doi.org/10.1016/j.ecss.2020.106908

Curiel-Ramirez, L. A., Ramirez-Mendoza, R. A., Carrera, G., Izquierdo-Reyes, J., & Bustamante-Bello, M. R. (2019). Towards of a modular framework for semi-autonomous driving assistance systems. Int. J. Interact. Des. Manuf. (IJIDeM), 13(1), 111-120. https://doi.org/10.1007/s12008-018-0465-9

Didona, D., & Romano, P. (2015). Hybrid machine learning/analytical models for performance prediction: A tutorial. Proc. 6th ACM/SPEC Int. Conf. on Performance Engineering (pp. 341–344). Association for Computing Machinery. https://doi.org/10.1145/2668930.2688823

Dodds, W. K., Bouska, W. W., Eitzmann, J. L., Pilger, T. J., Pitts, K. L., Riley, A. J.,... Thornbrugh, D. J. (2009). Eutrophication of U.S. freshwaters: Analysis of potential economic damages. Environ. Sci. Technol., 43(1), 12-19. https://doi.org/10.1021/es801217q

Elelu, K., Le,.. K., & Le, C. (2023). Collision hazard detection for construction worker safety using audio surveillance. J. Constr. Eng. Manag., 149(1), 04022159. https://doi.org/10.1061/JCEMD4.COENG-12561

Emmanoulopoulos, D., & Dimoska, S. (2022). Quantum machine learning in finance: Time series forecasting. arXiv:2202.00599. https://doi.org/10.48550/arXiv.2202.00599

Gandomi, A. H., & Roke, D. A. (2015). Assessment of artificial neural network and genetic programming as predictive tools. Adv. Eng. Softw., 88, 63-72. https://doi.org/10.1016/j.advengsoft.2015.05.007

García-Nieto, P. J., García-Gonzalo, E., Sánchez Lasheras, F., Alonso Fernández, J. R., & Díaz Muñiz, C. (2020). A hybrid DE optimized wavelet kernel SVR-based technique for algal atypical proliferation forecast in La Barca reservoir: A case study. J. Comput. Appl. Math., 366, 112417. https://doi.org/10.1016/j.cam.2019.112417

Giere, J., Riley, D., Nowling, R. J., McComack, J., & Sander, H. (2020). An investigation on machine-learning models for the prediction of cyanobacteria growth. Fundam. Appl. Limnol., 194(2), 85-94. https://doi.org/10.1127/fal/2020/1306

Gobler, C. J. (2020). Climate change and harmful algal blooms: Insights and perspective. Harmful Algae, 91, 101731. https://doi.org/10.1016/j.hal.2019.101731

Godoy, R. F., Trevisan, E., Battistelli, A. A., Crisigiovanni, E. L., do Nascimento, E. A., & da Fonseca Machado, A. L. (2023). Does water temperature influence in microcystin production? A case study of Billings Reservoir, São Paulo, Brazil. J. Contam. Hydrol., 255, 104164. https://doi.org/10.1016/j.jconhyd.2023.104164

Grattan, L. M., Holobaugh, S., & Morris, J. G. (2016). Harmful algal blooms and public health. Harmful Algae, 57, 2-8. https://doi.org/10.1016/j.hal.2016.05.003

Guallar, C., Delgado, M., Diogène, J., & Fernández-Tejedor, M. (2016). Artificial neural network approach to population dynamics of harmful algal blooms in Alfacs Bay (NW Mediterranean): Case studies of Karlodinium and Pseudo-nitzschia. Ecol. Model., 338, 37-50. https://doi.org/10.1016/j.ecolmodel.2016.07.009

Haggard, B. E., Grantz, E., Austin, B. J., Lasater, A. L., Haddock, L., Ferri, A.,... Scott, J. T. (2023a). Microcystin shows thresholds and hierarchical structure with physicochemical properties at Lake Fayetteville, Arkansas, May Through September 2020. J. ASABE, 66(2), 307-317. https://doi.org/10.13031/ja.15273

Haggard, B. E., Grantz, E., Austin, B. J., Wagner, N. D., & Scott, J. T. (2023b). Chlorophyll and Phycocyanin raw fluorescence may inform recreational lake managers on cyanobacterial HABs and toxins: Lake Fayetteville case study. J. Contemp. Water Res. Educ., 177(1), 63-71. https://doi.org/10.1111/j.1936-704X.2022.3381.x

Hewamalage, H., Bergmeir, C., & Bandara, K. (2021). Recurrent neural networks for time series forecasting: Current status and future directions. Int. J. Forecast., 37(1), 388-427. https://doi.org/10.1016/j.ijforecast.2020.06.008

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Comput., 9(8), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735

Hubert, P., & Padovese, L. (2019). A machine learning approach for underwater gas leakage detection. arXiv:1904.05661. https://doi.org/10.48550/arXiv.1904.05661

Jena, M., & Dehuri, S. (2020). Decision tree for classification and regression: A state-of-the art review. Informatica, 44(4), 405-420.

Kim, J., Jung, W., An, J., Oh, H. J., & Park, J. (2023). Self-optimization of training dataset improves forecasting of cyanobacterial bloom by machine learning. Sci. Total Environ., 866, 161398. https://doi.org/10.1016/j.scitotenv.2023.161398

Kim, S., Chung, S., Park, H., Cho, Y., & Lee, H. (2019). Analysis of environmental factors associated with cyanobacterial dominance after river weir installation. Water, 11(6), 1163. https://doi.org/10.3390/w11061163

Kouakou, C. R., & Poder, T. G. (2019). Economic impact of harmful algal blooms on human health: A systematic review. J. Water Health, 17(4), 499-516. https://doi.org/10.2166/wh.2019.064

Kratzert, F., Klotz, D., Brenner, C., Schulz, K., & Herrnegger, M. (2018). Rainfall-runoff modelling using Long Short-Term Memory (LSTM) networks. Hydrol. Earth Syst. Sci., 22(11), 6005-6022. https://doi.org/10.5194/hess-22-6005-2018

Kumar, A., Sharma, S., & Mahdavi, M. (2021). Machine Learning (ML) technologies for digital credit scoring in rural finance: A literature review. Risks, 9(11), 192. https://doi.org/10.3390/risks9110192

Kwon, D. H., Hong, S. M., Abbas, A., Pyo, J., Lee, H.-K., Baek, S.-S., & Cho, K. H. (2023). Inland harmful algal blooms (HABs) modeling using internet of things (IoT) system and deep learning. Environ. Eng. Res., 28(1), 210280-0. https://doi.org/10.4491/eer.2021.280

Lee, J. H., Huang, Y., Dickman, M., & Jayawardena, A. W. (2003). Neural network modelling of coastal algal blooms. Ecol. Model., 159(2), 179-201. https://doi.org/10.1016/S0304-3800(02)00281-8

Lee, S., & Lee, D. (2018). Improved prediction of harmful algal blooms in four major South Korea’s rivers using deep learning models. Int. J. Environ. Res. Public. Health, 15(7), 1322. https://doi.org/10.3390/ijerph15071322

Lima, M. A., Fernández Ramírez, L. M., Carvalho, P. C., Batista, J. G., & Freitas, D. M. (2021). A comparison between deep learning and support vector regression techniques applied to solar forecast in Spain. J. Sol. Energy Eng., Trans. ASME, 144(1). https://doi.org/10.1115/1.4051949

Lipton, Z. C. (2018). The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3), 31-57.

Liu, M., He, J., Huang, Y., Tang, T., Hu, J., & Xiao, X. (2022). Algal bloom forecasting with time-frequency analysis: A hybrid deep learning approach. Water Res., 219, 118591. https://doi.org/10.1016/j.watres.2022.118591

Liu, M., Huang, Y., Hu, J., He, J., & Xiao, X. (2023). Algal community structure prediction by machine learning. Environ. Sci. Ecotechnol., 14, 100233. https://doi.org/10.1016/j.ese.2022.100233

Liu, X., Feng, J., & Wang, Y. (2019). Chlorophyll a predictability and relative importance of factors governing lake phytoplankton at different timescales. Sci. Total Environ., 648, 472-480. https://doi.org/10.1016/j.scitotenv.2018.08.146

Lou, I., Xie, Z., Ung, W. K., & Mok, K. M. (2017). Integrating support vector regression with particle swarm optimization for numerical modeling for algal blooms of freshwater. In I. Lou, B. Han, & W. Zhang (Eds.), Advances in monitoring and modelling algal blooms in freshwater reservoirs: General principles and a case study of Macau (pp. 125-141). Dordrecht: Springer. https://doi.org/10.1007/978-94-024-0933-8_8

Ly, Q. V., Nguyen, X. C., Lê, N. C., Truong, T.-D., Hoang, T.-H. T., Park, T. J.,... Hur, J. (2021). Application of Machine Learning for eutrophication analysis and algal bloom prediction in an urban river: A 10-year study of the Han River, South Korea. Sci. Total Environ., 797, 149040. https://doi.org/10.1016/j.scitotenv.2021.149040

McGrane, S. J. (2016). Impacts of urbanisation on hydrological and water quality dynamics, and urban water management: A review. Hydrol. Sci. J., 61(13), 2295-2311. https://doi.org/10.1080/02626667.2015.1128084

Millie, D. F., Weckman, G. R., Fahnenstiel, G. L., Carrick, H. J., Ardjmand, E., Young, W. A.,... Shuchman, R. A. (2014). Using artificial intelligence for CyanoHAB niche modeling: discovery and visualization of Microcystis-environmental associations within western Lake Erie. Can. J. Fish. Aquat. Sci., 71(11), 1642-1654. https://doi.org/10.1139/cjfas-2013-0654

Mondal, P. P., Galodha, A., Verma, V. K., Singh, V., Show, P. L., Awasthi, M. K.,... Jain, R. (2023). Review on machine learning-based bioprocess optimization, monitoring, and control systems. Bioresour. Technol., 370, 128523. https://doi.org/10.1016/j.biortech.2022.128523

Naser, M. Z., & Alavi, A. H. (2021). Error metrics and performance fitness indicators for artificial intelligence and machine learning in engineering and sciences. Archit. Struct. Constr. https://doi.org/10.1007/s44150-021-00015-8

Nguyen, J., Chen, Z., Meyer, V., & Chen, D. (2021). Identifying key factors affecting harmful algal blooms in western Lake Erie from the perspective of machine learning. In L. A. Baldwin, & V. G. Gude (Eds.), World Environmental and Water Resources Congress 2021 (pp. 682-693). ASCE. https://doi.org/10.1061/9780784483466.062

Obaid, H. S., Dheyab, S. A., & Sabry, S. S. (2019). The impact of data pre-processing techniques and dimensionality reduction on the accuracy of machine learning. Proc. 2019 9th Annual Information Technology, Electromechanical Engineering and Microelectronics Conf. (IEMECON), (pp. 279-283). https://doi.org/10.1109/IEMECONX.2019.8877011

Osswald, J., Rellán, S., Gago, A., & Vasconcelos, V. (2007). Toxicology and detection methods of the alkaloid neurotoxin produced by cyanobacteria, anatoxin-a. Environ. Int., 33(8), 1070-1089. https://doi.org/10.1016/j.envint.2007.06.003

Oyama, Y., Fukushima, T., Matsushita, B., Matsuzaki, H., Kamiya, K., & Kobinata, H. (2015). Monitoring levels of cyanobacterial blooms using the visual cyanobacteria index (VCI) and floating algae index (FAI). Int. J. Appl. Earth Obs. Geoinf., 38, 335-348. https://doi.org/10.1016/j.jag.2015.02.002

Park, Y., Pachepsky, Y. A., Cho, K. H., Jeon, D. J., & Kim, J. H. (2015). Stressor-response modeling using the 2D water quality model and regression trees to predict chlorophyll-a in a reservoir system. J. Hydrol., 529, 805-815. https://doi.org/10.1016/j.jhydrol.2015.09.002

Pastor, J., & Hernández, A. J. (2012). Heavy metals, salts and organic residues in old solid urban waste landfills and surface waters in their discharge areas: Determinants for restoring their impact. J. Environ. Manag., 95, S42-S49. https://doi.org/10.1016/j.jenvman.2011.06.048

Peng, J., Jury, E. C., Dönnes, P., & Ciurtin, C. (2021). Machine learning techniques for personalised medicine approaches in immune-mediated chronic inflammatory diseases: Applications and challenges. Front. Pharmacol., 12. https://doi.org/10.3389/fphar.2021.720694

Peretyatko, A., Teissier, S., De Backer, S., & Triest, L. (2012). Classification trees as a tool for predicting cyanobacterial blooms. Hydrobiologia, 689(1), 131-146. https://doi.org/10.1007/s10750-011-0803-4

Qian, J., Pu, N., Qian, L., Xue, X., Bi, Y., & Norra, S. (2023). Identification of driving factors of algal growth in the South-to-North Water Diversion Project by Transformer-based deep learning. Water Biol. Secur., 2(3), 100184. https://doi.org/10.1016/j.watbs.2023.100184

Qin, M., Li, Z., & Du, Z. (2017). Red tide time series forecasting by combining ARIMA and deep belief network. Knowl.-Based Syst., 125, 39-52. https://doi.org/10.1016/j.knosys.2017.03.027

Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. Proc. 4th Int. Conf. on Learning Representations, ICLR 2016 - Conference Track Proceedings.

Richens, J. G., Lee, C. M., & Johri, S. (2020). Improving the accuracy of medical diagnosis with causal machine learning. Nat. Commun., 11(1), 3923. https://doi.org/10.1038/s41467-020-17419-7

Salehinejad, H., Sankar, S., Barfett, J., Colak, E., & Valaee, S. (2017). Recent advances in recurrent neural networks. arXiv:1801.01078. https://doi.org/10.48550/arXiv.1801.01078

Scavia, D., Wang, Y.-C., Obenour, D. R., Apostel, A., Basile, S. J., Kalcic, M. M.,... Steiner, A. L. (2021). Quantifying uncertainty cascading from climate, watershed, and lake models in harmful algal bloom predictions. Sci. Total Environ., 759, 143487. https://doi.org/10.1016/j.scitotenv.2020.143487

Segura, A. M., Piccini, C., Nogueira, L., Alcántara, I., Calliari, D., & Kruk, C. (2017). Increased sampled volume improves Microcystis aeruginosa complex (MAC) colonies detection and prediction using Random Forests. Ecol. Indic., 79, 347-354. https://doi.org/10.1016/j.ecolind.2017.04.047

Shan, K., Ouyang, T., Wang, X., Yang, H., Zhou, B., Wu, Z., & Shang, M. (2022). Temporal prediction of algal parameters in Three Gorges Reservoir based on highly time-resolved monitoring and long short-term memory network. J. Hydrol., 605, 127304. https://doi.org/10.1016/j.jhydrol.2021.127304

Talekar, B., & Agrawal, S. (2020). A detailed review on decision tree and random forest. Biosci. Biotechnol. Res. Commun., 13(14), 245-248. https://doi.org/10.21786/bbrc/13.14/57

Tirgar, A., Aghalari, Z., Sillanpää, M., & Dahms, H.-U. (2020). A glance at one decade of water pollution research in Iranian environmental health journals. Int. J. Food Contam., 7(1), 2. https://doi.org/10.1186/s40550-020-00080-9

Toledo-Pérez, D. C., Rodríguez-Reséndiz, J., Gómez-Loenzo, R. A., & Jauregui-Correa, J. C. (2019). Support vector machine-based EMG signal classification techniques: A review. Appl. Sci., 9(20), 4402. https://doi.org/10.3390/app9204402

Torres, J. F., Hadjout, D., Sebaa, A., Martínez-Álvarez, F., & Troncoso, A. (2021). Deep learning for time series forecasting: A survey. Big Data, 9(1), 3-21. https://doi.org/10.1089/big.2020.0159

van der Merwe, D. (2015). Chapter 31 - Cyanobacterial (Blue-Green Algae) Toxins. In R. C. Gupta (Ed.), Handbook of toxicology of chemical warfare agents (2nd edition) (pp. 421-429). Boston: Academic Press. https://doi.org/10.1016/B978-0-12-800159-2.00031-2

van der Wal, D., van Dalen, J., Wielemaker-van den Dool, A., Dijkstra, J. T., & Ysebaert, T. (2014). Biophysical control of intertidal benthic macroalgae revealed by high-frequency multispectral camera images. J. Sea Res., 90, 111-120. https://doi.org/10.1016/j.seares.2014.03.009

Wang, H., Ma, C., & Zhou, L. (2009). A brief review of machine learning and its application. Proc. 2009 Int. Conf. on Information Engineering and Computer Science, (pp. 1-4). https://doi.org/10.1109/ICIECS.2009.5362936

Wang, Y., Xie, Z., Lou, I., Ung, W. K., & Mok, K. M. (2017). Algal bloom prediction by support vector machine and relevance vector machine with genetic algorithm optimization in freshwater reservoirs. Eng. Comput., 34(2), 664-679. https://doi.org/10.1108/EC-11-2015-0356

Wen, J., Yang, J., Li, Y., & Gao, L. (2022). Harmful algal bloom warning based on machine learning in maritime site monitoring. Knowl.-Based Syst., 245, 108569. https://doi.org/10.1016/j.knosys.2022.108569

Xia, J., & Zeng, J. (2021). Environmental factor assisted chlorophyll-a prediction and water quality eutrophication grade classification: A comparative analysis of multiple hybrid models based on a SVM. Environ. Sci. Water Res. Technol., 7(6), 1040-1049. https://doi.org/10.1039/D0EW01110J

Yajima, H., & Derot, J. (2017). Application of the Random Forest model for chlorophyll-a forecasts in fresh and brackish water bodies in Japan, using multivariate long-term databases. J. Hydroinf., 20(1), 206-220. https://doi.org/10.2166/hydro.2017.010

Yussof, F. N., Maan, N., & Md Reba, M. N. (2021). LSTM networks to improve the prediction of harmful algal blooms in the west coast of Sabah. Int. J. Environ. Res. Public. Health, 18(14), 7650. https://doi.org/10.3390/ijerph18147650

Zhang, W., & Rao, Y. R. (2012). Application of a eutrophication model for assessing water quality in Lake Winnipeg. J. Great Lakes Res., 38, 158-173. https://doi.org/10.1016/j.jglr.2011.01.003

Zhang, X., Chen, X., Zheng, G., & Cao, G. (2023). Improved prediction of chlorophyll-a concentrations in reservoirs by GRU neural network based on particle swarm algorithm optimized variational modal decomposition. Environ. Res., 221, 115259. https://doi.org/10.1016/j.envres.2023.115259

Zhang, Y., Huang, J. J., Chen, L., & Qi, L. (2015). Eutrophication forecasting and management by artificial neural network: a case study at Yuqiao Reservoir in North China. J. Hydroinf., 17(4), 679-695. https://doi.org/10.2166/hydro.2015.115

Zhu, J.-J., Yang, M., & Ren, Z. J. (2023). Machine learning in environmental research: Common pitfalls and best practices. Environ. Sci. Technol. https://doi.org/10.1021/acs.est.3c00026