Estimating Chlorophyll a Concentrations of Several Inland Waters with Hyperspectral Data and Machine Learning Models

Water is a key component of life, the natural environment and human health. For monitoring the condition of a water body, the chlorophyll a concentration can serve as a proxy for nutrients and oxygen supply. In situ measurements of water quality parameters are often time-consuming, expensive and limited in areal validity. Therefore, we apply remote sensing techniques. During field campaigns, we collected hyperspectral data with a spectrometer and measured chlorophyll a concentrations in situ at 13 inland water bodies with different spectral characteristics. One objective of this study is to estimate chlorophyll a concentrations of these inland waters by applying three machine learning regression models: Random Forest, Support Vector Machine and an Artificial Neural Network. Additionally, we simulate four different hyperspectral resolutions of the spectrometer data to investigate their effects on the estimation performance. Furthermore, we evaluate how applying first order derivatives of the spectra affects the regression performance. This study reveals the potential of combining machine learning approaches and remote sensing data for inland waters. Each machine learning model achieves an R² score between 80 % and 90 % for the regression on chlorophyll a concentrations. The Random Forest model clearly benefits from the applied derivatives of the spectra. In further studies, we will focus on applying machine learning models to spectral satellite data to enhance the area-wide estimation of chlorophyll a concentration for inland waters.


INTRODUCTION
Clean and fresh water is a key resource for the environment and human health. Water quality, however, is threatened by overfertilization, which leads to algal blooms, oxygen deficiency and hence mass death of fish. Furthermore, some phytoplankton species, especially blue-green algae, can be harmful to humans. For example, they might spread into drinking water reservoirs and release toxic substances. To keep track of and assess the harmful effects of such algal blooms, continuous monitoring of algae growth is advisable.
The feasibility of conventional in situ monitoring approaches is restricted by their limited spatial coverage and temporal frequency. As a consequence, we follow the remote sensing approach, which has been used for monitoring chlorophyll a concentration in water since the 1990s (Gitelson, 1992). Data recorded by remote sensing platforms such as satellites has been used successfully to estimate chlorophyll a concentration in ocean waters, e.g. by O'Reilly et al. (1998). For inland waters, this estimation appears to be more challenging (Palmer et al., 2015). The available multispectral satellite data is primarily useful for observing ocean water or land surfaces. For monitoring small water bodies, the spatial resolution of satellite images is often too low. The same applies to the spectral resolution, which is often insufficient for e.g. estimating phytoplankton pigments based on satellite data (Palmer et al., 2015). In addition, inland waters are optically more complex than ocean waters because of various suspended particles (Hunter et al., 2008). Thus, the transferability of an estimation model between water bodies is not ensured.
The relation between algae existence and remote sensing is primarily linked to the absorption of light at a wavelength of 665 nm by the pigment chlorophyll a (Morel and Prieur, 1977). This reflectance minimum and the subsequent reflectance maximum around 700 nm serve as the basic wavelengths of band ratio approaches for the identification of chlorophyll a in inland waters. Ratio approaches are used by e.g. Gitelson (1992); Schalles et al. (1998); Gons (1999); Dall'Olmo et al. (2003); Gitelson et al. (2007); Zhou et al. (2013). Schalles et al. (1998) rely on the area under the peak around 700 nm or just the respective amplitude for the estimation of chlorophyll a concentration. Alternatively, derivatives of the spectra are applied in this context as well (Rundquist, 1996).
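To make the band ratio idea concrete, the following minimal Python sketch computes the ratio of the reflectance near the 700 nm maximum to the reflectance near the 665 nm minimum. The function name, the linear interpolation and the toy spectrum are assumptions for illustration only; the cited studies differ in the exact wavelengths they use and in how the ratio is calibrated against chlorophyll a concentrations.

```python
import numpy as np

def band_ratio(wavelengths_nm, reflectance, red_nm=665.0, nir_nm=700.0):
    """Ratio of reflectance near the ~700 nm peak to reflectance near 665 nm.

    Hypothetical helper: the default wavelengths follow the text above,
    not a specific published algorithm.
    """
    wavelengths_nm = np.asarray(wavelengths_nm, dtype=float)
    reflectance = np.asarray(reflectance, dtype=float)
    # Linearly interpolate the spectrum at the two wavelengths of interest.
    r_red = np.interp(red_nm, wavelengths_nm, reflectance)
    r_nir = np.interp(nir_nm, wavelengths_nm, reflectance)
    return r_nir / r_red

# Toy spectrum (made-up values, reflectance in percent).
wl = np.array([650.0, 665.0, 680.0, 700.0, 720.0])
refl = np.array([3.0, 2.0, 2.5, 4.0, 3.0])
ratio = band_ratio(wl, refl)
```

A higher ratio indicates a more pronounced peak relative to the absorption minimum; turning it into a concentration would require a regression against reference measurements.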
Another approach for the estimation of water parameters from hyperspectral data is the use of data-driven machine learning models, primarily based on supervised learning. For supervised learning, a dataset is divided into subsets, where e.g. one subset is used to train the model and the other is used to evaluate it. The potential of these machine learning models for the estimation of several water parameters is shown in the context of coastal waters by e.g. Keiner and Yan (1998); González Vilas et al. (2011); Kim et al. (2014), in the context of rivers by Keller et al. (2018) as well as for big lakes by Odermatt et al. (2010).
For recording remote sensing data, a multitude of sensor systems are applied: spectrometers, hyperspectral cameras and satellite sensors. They vary widely with respect to their spatial and spectral resolution. Satellite data offers many advantages: the temporal repetition rate is constant, it can cover huge areas and, in the long run, it is a cost-effective solution. However, the large pixel size might be unfavorable for small inland waters and the spectral bandwidth is often too coarse. Decker et al. (1992) analyzed the compatibility of the Landsat TM sensor and the SPOT sensor with the chlorophyll a absorption. Neither of the sensors covers the range between 690 nm and 760 nm, so the peak around 700 nm related to chlorophyll a cannot be detected. In contrast, push broom sensors with narrow bandwidths of 10 nm to 20 nm between 600 nm and 720 nm seem to be practicable for the estimation of phytoplankton substances (Decker et al., 1992). With the launch of the DESIS (DLR Earth Sensing Imaging Spectrometer) mission in 2018 and the upcoming launch of EnMAP (Environmental Mapping and Analysis Program), monitoring of inland waters with reasonable spatial extent and hyperspectral resolution should be feasible. So far, unmanned aerial vehicles (UAVs) provide an appropriate spectral and spatial resolution for any remote sensing based monitoring approach over limited areas.
In this study, we assess the transferability of applied machine learning models for the estimation of chlorophyll a concentrations with hyperspectral data across several inland water bodies. To this end, we recorded our own hyperspectral dataset with a spectrometer at 13 different inland waters in several field campaigns. As reference data, we collected water samples, whose chlorophyll a concentration was determined with a photometer. In total, the dataset for this study consists of 422 datapoints including hyperspectral data and reference data. For the estimation of chlorophyll a concentration, we apply three different regression models: Random Forest (RF), Support Vector Machine (SVM) and an Artificial Neural Network (ANN). An important factor for the estimation performance of a model is the spectral resolution of the sensor. As we rely on a dataset collected with a spectrometer, we aggregate the spectra to several bands with different resolutions to simulate hyperspectral cameras or satellite sensors. After various pre-tests, we decided to apply four different resolutions with a continuous interval of 4 nm, 8 nm, 12 nm and 20 nm. Additionally, we calculated derivatives of the different aggregated data and applied the same regression models.
The objectives of this study are:
• to describe the recorded dataset including the measurement setup of our field campaigns,
• to demonstrate the potential of supervised learning models for the estimation of chlorophyll a concentration of different inland waters,
• to assess the spectral resolution in the context of estimating chlorophyll a concentration as well as to determine which bandwidths are suitable to achieve a sufficient regression performance,
• to measure the effects of using derivatives of a spectrum to estimate chlorophyll a concentrations and
• finally to evaluate the different machine learning models.
We describe the applied sensor systems of the field campaigns and the measured dataset in Section 2. In Section 3, the presentation of the machine learning models follows. Section 4 contains the evaluation of the measured dataset and the assessment of the different approaches. Finally, we conclude our study in Section 5 and give an overview of future research applications based on the presented dataset.

SENSORS AND DATASET
To reveal the potential of supervised learning models for the estimation of chlorophyll a concentrations in different inland waters, a large amount of data from such waters with varying chlorophyll a concentrations is needed. The presented dataset consists of hyperspectral data and chlorophyll a concentrations. The two types of data are measured with two different sensor systems. The recordings were challenging, since we needed to produce comparable hyperspectral data at varying times of day over a measurement period of four months. To record hyperspectral data, we used a so-called Reflectance Box (RoX) spectrometer. This specific spectrometer covers a spectral range of 341 nm to 1015 nm with a sampling interval of about 0.65 nm. The sensor includes two fiber optic cables, which are oriented in different directions (Figure 1). One fiber optic cable is directed upwards and has a cosine receptor at its end. This receptor measures the incoming radiation from the sky. Additionally, it regulates the integration time of the sensor, which is necessary when measuring under different atmospheric conditions such as cloud cover or varying sun angle. The other fiber optic cable is directed downwards to measure the radiation reflected by the water body and the water surface. It has a field of view of 25°. The ratio of these two values yields a reflectance value in percent, which we use to fit our regression models. The spectrometer was calibrated in the laboratory with an Ulbricht sphere.
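The reflectance computation described above can be sketched in a few lines of Python. This is a simplified illustration under the assumption that both channels are already calibrated to comparable radiometric units; the function name, the handling of zero signal and the sample values are hypothetical and do not reproduce the processing chain of the RoX instrument.

```python
import numpy as np

def reflectance_percent(upwelling, downwelling):
    """Reflectance in percent as the ratio of the downward-looking to the
    upward-looking channel, per wavelength (hypothetical helper)."""
    up = np.asarray(upwelling, dtype=float)
    down = np.asarray(downwelling, dtype=float)
    # Wavelengths without incoming signal are marked as NaN instead of
    # dividing by zero.
    out = np.full(up.shape, np.nan)
    valid = down > 0
    out[valid] = 100.0 * up[valid] / down[valid]
    return out

# Made-up signal values for three wavelengths.
refl = reflectance_percent([1.0, 2.0, 0.5], [50.0, 40.0, 0.0])
```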

Sensors and Data Acquisition
During the field campaigns, the spectrometer was mounted on a tripod to ensure that the cosine receptor was oriented perpendicular to the sky and the other receptor pointed towards the lake surface (see Figure 1). The tripod with the spectrometer was placed as far into the water as possible (see Figure 2). During most of the measurements at natural waters, the lake bed under the spectrometer was not visible. However, where the water was very clear, it might have been visible. When measuring at artificial ponds, the tripod was placed outside the water body (see Figure 1). The data acquisition took place under conditions ranging from clear sky to the occurrence of cirrus clouds. The spectrometer's sampling interval during the measurements was set to 15 s.
The water samples for the evaluation with the photometer were collected every five minutes. They were taken at a depth of 10 cm below the water surface and in an area close to the spectrometer. The depth of 10 cm was chosen for two reasons: Firstly, this depth ensured that data could be collected with the photometer as well as with the spectrometer. Secondly, the 10 cm depth had to be applicable at all water bodies, and in some of them high chlorophyll a concentrations were often accompanied by high turbidity. That effect led to a visibility depth of 20 cm or less. Additionally, we took some reference samples at greater depths, but the differences in the chlorophyll a concentration compared to the water samples at 10 cm depth were negligible. Until the water samples were measured in the laboratory, they were protected from sunlight. The samples were evaluated promptly in a 25 ml cuvette by the photometer, which is able to measure chlorophyll a concentrations in the range from 0 µg L⁻¹ to 200 µg L⁻¹. These water samples represent the reference data for the chlorophyll a concentrations.
In total, we collected hyperspectral and reference data from 13 inland water bodies between June and October 2018 in the surrounding area of Karlsruhe, Southwest Germany. Three of them were artificial ponds and the other ten were natural waters, e.g. a branch of a river or flooded gravel pits. Table 1 provides an overview of the investigated water bodies. The water bodies were selected by accessibility and proximity to each other so that several locations could be covered within one day.

Pre-processing
We applied several pre-processing steps to prepare the hyperspectral data for the regression on the chlorophyll a concentration.
1. The measured hyperspectral data was constrained to the wavelength range of 400 nm to 900 nm in order to avoid sensor noise.
2. Outliers in the hyperspectral dataset were investigated within the sampling period during which we measured at a single point. Such outliers occur e.g. due to sun glint, waves or shadows. When datapoints exceeded a certain distance to the median for any wavelength within the sampling period, we excluded them.
3. We generated different spectral resolutions by aggregating the spectral bands of the spectrometer to bands with a spectral resolution of 4 nm, 8 nm, 12 nm and 20 nm. The obtained spectrum was generated by linear weighting of the spectrometer data.
4. Additionally, we calculated first order derivatives (from here on: derivatives) of the different aggregated data to generate a further dataset.
5. Finally, for each reference sample we selected only one of the recorded hyperspectral datapoints, which was within a time span of one minute of the sampled reference data. Eventually, one datapoint was defined by the generated hyperspectral bands (see pre-processing step 3) and a chlorophyll a value as reference value.
In total, we obtained 422 datapoints including hyperspectral and reference data. Furthermore, we generated two distinct datasets according to pre-processing step 4: a raw dataset and a dataset with derivatives, both with different spectral resolutions (see pre-processing step 3). The distribution of the chlorophyll a concentration, i.e. the reference data, and the distribution of the samples per water body are shown in Figure 4 and Figure 3, respectively.
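The band aggregation and the derivative calculation from the pre-processing steps can be sketched as follows. This is a minimal Python illustration and not the implementation used in the study: the exact linear weighting is not specified here, so the sketch swaps in a plain unweighted mean of all spectrometer samples falling into each band, drops an incomplete trailing band, and uses finite differences between band centers as the derivative. All function names and the synthetic spectrum are assumptions.

```python
import numpy as np

def aggregate_bands(wavelengths_nm, spectrum, width_nm,
                    start_nm=400.0, stop_nm=900.0):
    """Aggregate a finely sampled spectrum to bands of `width_nm`.

    Assumption: unweighted mean per band as a stand-in for the
    linear weighting mentioned in the text.
    """
    wl = np.asarray(wavelengths_nm, dtype=float)
    sp = np.asarray(spectrum, dtype=float)
    # Band edges; only complete bands inside [start_nm, stop_nm] are kept.
    edges = np.arange(start_nm, stop_nm + 1e-9, width_nm)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Band index of every sample; samples outside the range get an index
    # that matches no band and are therefore ignored.
    idx = np.digitize(wl, edges) - 1
    bands = np.full(len(centers), np.nan)
    for b in range(len(centers)):
        in_band = idx == b
        if in_band.any():
            bands[b] = sp[in_band].mean()
    return centers, bands

def first_derivative(centers_nm, bands):
    """First order derivative via finite differences between adjacent bands."""
    return np.diff(bands) / np.diff(centers_nm)

# Synthetic spectrum: reflectance rising linearly with wavelength.
wl = np.arange(400.0, 900.0, 0.5)
spectrum = 0.01 * wl
centers, bands = aggregate_bands(wl, spectrum, width_nm=20.0)
deriv = first_derivative(centers, bands)
```

Note that the derivative has one value fewer than the aggregated spectrum and no longer carries the absolute reflectance level.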
For the supervised machine learning approaches, the pre-processed data was split randomly into two disjoint subsets: the training subset and the test subset. The split was applied to each of the distinct datasets (see Section 2.2). Maier and Keller (2018) applied a random splitting of 30 : 70 on a similar dataset. In contrast, we chose a splitting with 50 % of the data for the training subset and 50 % of the data for the test subset. We chose this ratio because the whole dataset contains comparatively few datapoints, while a sufficient number of datapoints is still needed for training the models.
The machine learning models were trained on the training subset by linking the hyperspectral data to the chlorophyll a concentration values. Hyperparameters and model parameters are characteristic for each regression model. The former were chosen before the training phase with a grid search approach, while the latter were adapted during the training phase. For the RF model, the extratrees split rule was selected due to its best performance in previous studies. The other hyperparameters of the three different models are summarized and described in Table 2. The SVM was applied with a radial kernel.
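A random 50 : 50 split into disjoint subsets, as described above, might look like the following Python sketch. The function name and the fixed seed are hypothetical; the seed is only there to make the illustration reproducible.

```python
import numpy as np

def random_split(n_samples, train_fraction=0.5, seed=0):
    """Return disjoint, randomly drawn index arrays for training and test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return np.sort(order[:n_train]), np.sort(order[n_train:])

# 422 datapoints as in the presented dataset.
train_idx, test_idx = random_split(422, train_fraction=0.5)
```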
During the training phase, every combination of the grid search was carried out with a 5-fold cross validation and five repetitions on the training subset for each regression model. The combination of the hyperparameters with the best average RMSE performance on the five repetitions was the setup for the final model.
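The training procedure above (grid search, 5-fold cross validation with five repetitions, selection by RMSE) can be sketched with scikit-learn. The library choice is an assumption made purely for illustration, since the software used in the study is not named here. The grid values and the synthetic training data are likewise hypothetical and do not correspond to Table 2.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVR

# Synthetic stand-in for the training subset (not the study's data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
y_train = 3.0 * X_train[:, 0] + rng.normal(scale=0.1, size=100)

# 5-fold cross validation repeated five times: 25 fits per combination.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)

# Hypothetical hyperparameter grid for an SVM with a radial (RBF) kernel.
param_grid = {"C": [1.0, 10.0, 100.0], "gamma": [0.01, 0.1]}

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    scoring="neg_root_mean_squared_error",  # best average RMSE wins
    cv=cv,
)
search.fit(X_train, y_train)

best_params = search.best_params_
best_rmse = -search.best_score_  # scikit-learn reports the negated RMSE
```

With the default `refit=True`, the best combination is refitted on the whole training subset, which corresponds to the final model that is later evaluated on the test subset.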
During the test phase, each regression model estimated the chlorophyll a concentration based on the hyperspectral data of the test subset. The estimated chlorophyll a values were compared to the reference chlorophyll a values. The coefficient of determination (R²), the root mean squared error (RMSE) and the mean absolute error (MAE) express the estimation performance.
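For reference, the three scores can be computed as in the following Python sketch. It assumes the common definition R² = 1 − SS_res / SS_tot; some toolkits report the squared correlation between estimates and references instead, and which variant was used in the study is not stated here. The function name and the sample values are made up.

```python
import numpy as np

def regression_scores(y_true, y_pred):
    """R² (as 1 - SS_res / SS_tot), RMSE and MAE for estimated values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(residuals ** 2)),
        "mae": np.mean(np.abs(residuals)),
    }

# Made-up reference and estimated concentrations in µg/L.
scores = regression_scores([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
```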

RESULTS AND DISCUSSION
Looking at the distribution of the measured datapoints per location (see Figure 3) in detail, it stands out that the number of reference datapoints is not uniformly distributed across locations. Artificial ponds were examined more frequently than natural water bodies. This is because the natural water bodies in this region are very clean, so their chlorophyll a concentration varied only slightly during the measurement period. The chlorophyll a concentrations of the artificial ponds, however, varied considerably within the measurement period. The chlorophyll a concentrations of the investigated water bodies are visualized in Figure 4. The first three bars show the number of datapoints up to 30 µg L⁻¹ and represent only the natural water bodies, while the water samples from 30 µg L⁻¹ upwards belong to artificial ponds. A uniform distribution of the chlorophyll a concentrations would be optimal for the machine learning models. This would mean that the same number of datapoints exists in the training and test subsets for every concentration range. Since we measured under real-world conditions, a uniform distribution is not feasible without removing too many datapoints.
Tables 3 to 6 present the estimation results of the regression models for the different spectral resolutions on both datasets, the raw dataset (raw) and the dataset with the derivatives (der). In general, the best performance is achieved with the highest resolution on both datasets. However, this does not support the generalization that a lower resolution always leads to a lower regression performance. For the best performing regression model on the 20 nm-dataset, the ANN model, the R² score of 87.1 % is only 2 % lower than on the 4 nm-dataset. The regression results on the 4 nm-dataset and the 8 nm-dataset are rather similar. Comparing the 20 nm-dataset and the 12 nm-dataset, the regression performance on the former exceeds that on the latter. A possible reason for this finding could be a better positioning of the bands in terms of chlorophyll a sensitivity for the 20 nm-dataset compared to the 12 nm-dataset.
Regarding the effects of derivatives, a comparison of the upper and the lower half of Tables 3 to 6 shows that the RF model is influenced most strongly by the derivatives. Its R² score improves by between 5 % and 10 % for all resolutions. For the SVM, we notice slight improvements with the derivatives for three of the resolutions, but the effect is not clear. When the ANN model is applied to the dataset with the derivatives, the effect is reversed compared to the SVM due to the specific characteristics of an ANN. In summary, calculating derivatives of the spectra results in a loss of the absolute reflectance value. It seems that this circumstance does not strongly influence the estimation performance of the models in general.
Comparing the estimation results of the three machine learning models, the ANN shows the best performance for the raw bands and overall. The best performance is achieved for the 8 nm-dataset with an R² of 89.2 %. RF and SVM demonstrate a slightly worse estimation performance. The R² scores of both models lie within 1 % of each other for every spectral resolution. Figure 5 visualizes the estimation result of the ANN model for every datapoint of the test dataset with the 4 nm spectral resolution. In general, we can recognize the regression line distinctly. However, some points are estimated poorly. The green points with high chlorophyll a concentrations represented by the third bar in Figure 4 are estimated lower than they are measured. A reason for this could be the small number of datapoints in this specific concentration range combined with the small number of samples from the respective water body. The chlorophyll a concentration range around 200 µg L⁻¹ contains few datapoints as well.
In summary, the machine learning models show satisfying results in estimating chlorophyll a concentrations from hyperspectral input data for different water bodies. Compared to Keller et al. (2018), we obtain a slightly worse estimation performance. However, the underlying task of this study is more challenging, since the data was measured in several different water bodies with widely varying visible depth and chlorophyll a concentrations.

CONCLUSION AND OUTLOOK
In this study, we present a dataset consisting of hyperspectral data and chlorophyll a values, which we measured at 13 different inland waters. One main objective is to link the hyperspectral data and the chlorophyll a target values with machine learning models. The three applied supervised models RF, SVM and ANN show satisfying estimation performance for all the datasets. With respect to the different spectral resolutions of 4 nm, 8 nm, 12 nm and 20 nm, which we created from the spectrometer data, the machine learning models are even able to estimate the chlorophyll a concentration on the dataset with the lowest spectral resolution of 20 nm. In view of the upcoming EnMAP mission, which has a similar spectral resolution, this demonstrates the opportunity to monitor inland water bodies based on a combination of hyperspectral data and machine learning techniques. The regression results of the RF model improve noticeably when a dataset with derivatives is used as input data. In general, the ANN models show the best estimation performance on the chlorophyll a concentration.
The study confirms the promising results from previous studies (Odermatt et al., 2010; Keller et al., 2018), which combine machine learning models with hyperspectral data to estimate chlorophyll a concentrations of inland waters. Additionally, we have satisfactorily met the challenge of estimating chlorophyll a concentrations of several inland water bodies with varying spectral characteristics.
In future studies, we are planning to conduct further field campaigns to expand the presented dataset. Additionally, we will focus on the estimation of further water constituents such as colored dissolved organic matter or cyanobacteria with hyperspectral data. Another challenging task could be the up-scaling of the presented methodologies by using data provided by the DESIS sensor or Sentinel-2 multispectral data.