A MACHINE LEARNING APPROACH TO MULTISPECTRAL SATELLITE DERIVED BATHYMETRY

Bathymetry in coastal environments plays a key role in understanding erosion dynamics and coastal evolution. In this investigation, depth along the shoreline was estimated using different multispectral satellite data. Training and validation data were derived from a traditional bathymetric survey carried out along transects in Cesenatico; measurements were collected with a single-beam sonar with centimetric precision. To limit spatial autocorrelation, the training and validation datasets were built by assigning transects alternately to training and to validation. Each set comprised ~6000 points. To estimate water depth, two methods were tested: Support Vector Machine (SVM) and Random Forest (RF). The RF method provided the higher accuracy, with a root mean square error of 0.228 m and a mean absolute error of 0.158 m, against values of 0.409 m and 0.226 m respectively for SVM. The results show that machine learning methods can predict nearshore depth with an accuracy that is useful for practical applications.


INTRODUCTION
The study of bathymetry in coastal environments is becoming increasingly important because of the strategic importance of these areas and their vulnerability to different pressure factors. The fragility of coastal areas derives in particular from the constant pressure of intense anthropogenic activities such as urbanisation and exploitation of natural resources, and from climate change-induced natural hazards (Paterson et al., 2011). Given these aspects, bathymetry is particularly important for obtaining a direct measure of the magnitude of these phenomena.
Traditionally, bathymetric surveys are conducted using high-precision tools, such as single-beam or multi-beam echo sounders. This way of collecting information produces high-precision measurements. On the other hand, high survey costs and the difficulties of measuring large areas (time required, dependency on weather, sea conditions, etc.) are the main limitations of this approach to bathymetry.
In recent years the use of satellites to estimate water depth has become an important support tool to compensate for the limitations of traditional surveys. Satellite Derived Bathymetry (SDB) provides an estimate of water depth that can easily cover large areas. Moreover, SDB analysis can be repeated over time. There are of course limitations related to water turbidity, weather and satellite data availability.
Regarding the kind of satellite products to be used, the literature reports the use of different products, such as multispectral data and SAR. Spectral methods use light absorption models to relate reflectance to depth. Lidar technology also plays a key role, with bathymetric lidar being a technology already available on the market (Pirotti, 2019). SAR applications use the linear dispersion relation of wave properties to estimate water depth (Pereira et al., 2019; Wiehle and Pleskachevsky, 2018). The literature also shows different approaches to data processing and analysis. Some articles estimate water depth using empirical models, such as Stumpf's or Jupp's models (Danilo and Melgani, 2019; Deidda et al., 2016; Stumpf et al., 2003; Tang et al., 2019). This approach generally leads to high uncertainty in the estimates. Another approach is the use of physics-based models, which consider parameters such as sediment size and wave dynamics (Brando et al., 2009; Lyzenga et al., 2006). These solutions lead to relatively low uncertainty in the predictions.
More recently, the use of Machine Learning (ML) to estimate SDB has provided promising results. Some investigations showed good results in terms of errors in water depth estimation, i.e. a Root Mean Square Error (RMSE) generally lower than 1 m (Manessa et al., 2016; Mavraeidopoulos et al., 2019; Pike et al., 2019; Sagawa et al., 2019). The analysis of satellite data with an ML approach can rely on a large number of algorithms, which makes it possible to refine estimates depending on particular aims or conditions. For satellite data analysis, two well-established ML algorithms are SVM and RF.
SVM is a supervised non-parametric statistical learning technique (Mountrakis et al., 2010). This method was originally introduced to separate two classes by defining a hyperplane, that is, the linear decision function with maximal margin between the vectors of the two classes (Cortes and Vapnik, 1995). The margin is defined by the so-called "support vectors". SVM can be used both for regression and for classification problems. The particular efficiency of this approach is due to the possibility of transforming data with different kernel functions (linear, sigmoid, radial and polynomial).
The RF algorithm was introduced in 2001 (Breiman, 2001); this model is based on an ensemble of decision trees (a forest). An ensemble consists of a set of individually trained classifiers (decision trees), whose predictions are combined to classify new instances (Kulkarni, 2013).
Considering the promising results obtained with ML, this study aims to evaluate the performance of different ML techniques in SDB analysis. In particular, this study aims to (i) identify the best-performing ML algorithm among those considered (SVM and RF), and (ii) evaluate the potential of fusing different multispectral satellite data in SDB analysis.

Study Area
The study area is situated in Cesenatico (FC), Italy. Cesenatico is a small town on the Adriatic coast that in past years was particularly affected by sand deposition and erosion phenomena. These were particularly problematic because of the touristic vocation of the Cesenatico beaches. As evidence of the susceptibility to this kind of problem, there are many artificial reefs along the Cesenatico coast, situated approximately 300-400 m from the shoreline; these structures were built in past years as a defence against sand erosion. Their effectiveness is limited to a small area, and they cause further erosion/deposition dynamics in other parts of the coast.
The study area is situated on the northern side of Cesenatico port and extends over approximately 1.78 km² (Figure 1). The area includes the final part of the Cesenatico artificial reefs.

Data collection
Water depth data were collected on 26 April 2018 with a traditional bathymetric survey (depth range 0-5 m). The survey was carried out along transects perpendicular to the beach axis (Figure 2). The transect length was approximately 500-700 m. Along the transects, measurement points were spaced approximately 1 m apart (500-700 measurement points per transect).

Figure 2. Water depth measurement points
Water depth measurements were taken from a low-draught boat with a single-beam echo sounder. The instrument precision on water depth measurements was centimetric. The collected data were managed by dedicated software, which provided georeferenced output.
Data from the three satellite platforms (Planetscope, Sentinel 2 and Landsat 8) were acquired on 26 April 2018, the same day as the survey, so that reflectance values could be compared with measured water depth.

Data Processing
In the processing phase, satellite data were used as the vector of predictor variables for training the ML models. Measured data from the bathymetric survey were used, without any further processing, for training and validating the ML models. Satellite data were processed with a data fusion approach: all data from the three satellite platforms were considered together. Because the products have different spatial resolutions and grids, the first operation was to resample all data to the grid with the highest spatial resolution, i.e. 3 m from the Planetscope images. Since the spectral signature of shallow water shows its greatest reflectance in the green part of the visible spectrum and decreases drastically in the red and infrared parts, all visible bands of the three satellite platforms were used as predictors.
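The resampling step can be illustrated with a minimal sketch. The paper does not state which resampling method was used; nearest-neighbour lookup is assumed here purely for illustration, and the function name `resample_nearest` and the toy grid are hypothetical.

```python
def resample_nearest(grid, src_res, dst_res):
    """Resample a 2-D grid (list of rows) from src_res to dst_res metres
    per pixel, using nearest-neighbour lookup (an assumed method)."""
    src_rows, src_cols = len(grid), len(grid[0])
    # Number of destination pixels that fit in the source extent.
    dst_rows = int((src_rows * src_res) // dst_res)
    dst_cols = int((src_cols * src_res) // dst_res)
    out = []
    for r in range(dst_rows):
        # Map the centre of the destination pixel back to a source row.
        src_r = min(int((r + 0.5) * dst_res / src_res), src_rows - 1)
        row = []
        for c in range(dst_cols):
            src_c = min(int((c + 0.5) * dst_res / src_res), src_cols - 1)
            row.append(grid[src_r][src_c])
        out.append(row)
    return out


# Toy 2 x 2 grid at 10 m resolution, resampled to a 3 m grid.
coarse = [[1, 2],
          [3, 4]]
fine = resample_nearest(coarse, src_res=10, dst_res=3)
```

With this scheme each 3 m cell simply inherits the value of the coarser pixel that contains its centre, so no new spectral information is created by the resampling.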
Reflectance values of some bands were then combined in order to increase the number of predictors. In particular, some predictors were created using well-established remote sensing indices, based on both Sentinel 2 and Landsat 8 data. Further predictors were then created by combining mainly visible-band reflectances (e.g. as normalized difference ratios) of all the satellite data (Planetscope, Sentinel 2, Landsat 8). This process generated vectors of 53 variables, i.e. the corresponding raster maps of predictors that were used for training the ML methods. The dataset was created by extracting the predictor values at each measured water depth point.
The whole dataset was then divided into a training set and a validation set. Given the spatial distribution of the measured data points, transects were assigned alternately to training and to validation (Figure 3). It was noticed that the classic resampling techniques used to divide a dataset into training and validation sets (e.g. cross validation, random split) lead to a high probability of choosing neighbouring points. This led to an initial overestimation of the models' accuracy, due to the lack of spatial independence between training and validation points. The split produced a training set of 6727 records and a validation set of 6585 records.
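As an illustration of how a normalized difference ratio predictor is formed from two bands, a minimal sketch follows. The band names and reflectance values are hypothetical; the paper does not list which band pairs were combined.

```python
def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b) for each pair of reflectance values.
    Pixels where a + b is zero are returned as None."""
    result = []
    for a, b in zip(band_a, band_b):
        total = a + b
        result.append((a - b) / total if total != 0 else None)
    return result


# Hypothetical reflectance values for two visible bands at three points.
green = [0.75, 0.50, 0.0]
blue = [0.25, 0.50, 0.0]
ratio = normalized_difference(green, blue)
```

The ratio is bounded between -1 and 1, which reduces the effect of overall brightness differences between platforms.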
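The alternating-transect split can be sketched as follows. This is a minimal illustration, assuming each record carries a transect identifier whose sorted order follows the position of the transects along the shore; the record layout and the function name are hypothetical.

```python
def split_by_alternating_transects(records):
    """Assign whole transects alternately to training and validation.

    records: list of (transect_id, predictors, depth) tuples.
    Transect ids are assumed to sort in along-shore order.
    """
    transect_ids = sorted({rec[0] for rec in records})
    # Every other transect (1st, 3rd, ...) goes to training.
    train_ids = set(transect_ids[0::2])
    train = [rec for rec in records if rec[0] in train_ids]
    valid = [rec for rec in records if rec[0] not in train_ids]
    return train, valid


# Toy records: four transects with one or two points each.
records = [
    (1, [0.10, 0.20], 0.5), (1, [0.11, 0.19], 0.7),
    (2, [0.12, 0.18], 0.6),
    (3, [0.09, 0.21], 1.2),
    (4, [0.08, 0.22], 1.5),
]
train, valid = split_by_alternating_transects(records)
```

Because all points of a transect end up on the same side of the split, a validation point never has a training neighbour about 1 m away on the same transect, which is the situation a random split would create.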

Figure 3. Training (yellow) and validation (red) transects
To estimate the performance of the SDB methods, two algorithms were compared: Support Vector Machine (SVM) and Random Forest (RF). The best configuration of the algorithm hyper-parameters was chosen using an iterative procedure that tested several combinations.
At each iteration, the algorithms were trained on the training dataset using a different configuration, and water depth was then estimated on the validation dataset. At each iteration the Root Mean Square Error (RMSE) was calculated using equation (1):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{pred,i} - y_{obs,i}\right)^{2}} \qquad (1)$$
where $y_{pred,i}$ is the predicted water depth, $y_{obs,i}$ is the measured (observed) water depth and $N$ is the number of records in the validation dataset. Finally, for each algorithm the parameter configuration that minimized the RMSE was selected. For the selected configurations the Mean Absolute Error (MAE) was also computed, using equation (2):
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_{pred,i} - y_{obs,i}\right| \qquad (2)$$
where $y_{pred,i}$ is the predicted water depth, $y_{obs,i}$ is the observed water depth and $N$ is the number of points in the validation dataset.
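The two error measures of equations (1) and (2) can be computed as in the following minimal sketch. The function names and the depth values are illustrative only and are not taken from the study's data.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed depths."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def mae(predicted, observed):
    """Mean absolute error between predicted and observed depths."""
    n = len(observed)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / n


# Illustrative depths in metres (not data from the survey).
y_pred = [1.0, 2.0, 3.0]
y_obs = [1.0, 2.0, 5.0]
error_rmse = rmse(y_pred, y_obs)
error_mae = mae(y_pred, y_obs)
```

Since the RMSE squares each residual before averaging, a few large errors raise it more than they raise the MAE, which is why the two measures are usually reported together.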
The following sections describe the parameter settings of each of the algorithms compared.

Support Vector Machine (SVM)
SVM tuning dealt mainly with the type of kernel and the scale factor. The "kernel" parameter represents the transformation function that replaces the scalar product of the predictors. The main kernel types are linear, radial, polynomial and sigmoid.
During model tuning it was observed that the radial, polynomial and sigmoid kernels greatly reduced the size of the dataset: when dealing with negative values, these kernel functions produced non-real numbers, which the algorithm omitted as NaN. For this reason the linear kernel function was used. The scale factor is a parameter that defines which variables are scaled. Changing this parameter did not improve the predictions, so the default value of 1 was kept.

Random Forest (RF)
RF tuning considered the parameters "ntree" and "mtry". The "ntree" parameter represents the number of decision trees built by the model. The "ntree" range considered was from 100 to 2000 trees. The "mtry" parameter represents the number of variables considered as candidates at each split when building each tree. The "mtry" range considered was from 2 to 30.
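The iterative search over "ntree" and "mtry" can be sketched as a plain grid search that keeps the configuration with the lowest validation RMSE. This is a minimal sketch under stated assumptions: the step sizes (100 for "ntree", 2 for "mtry") are assumed, since the paper gives only the ranges, and `toy_evaluate` is a hypothetical stand-in with a synthetic error surface. In practice it would train an RF model on the training set and return its RMSE on the validation set.

```python
from itertools import product


def grid_search(evaluate, ntree_values, mtry_values):
    """Return (rmse, ntree, mtry) for the configuration with lowest RMSE.

    evaluate: callable taking (ntree, mtry) and returning a validation RMSE.
    """
    best = None
    for ntree, mtry in product(ntree_values, mtry_values):
        score = evaluate(ntree, mtry)
        if best is None or score < best[0]:
            best = (score, ntree, mtry)
    return best


def toy_evaluate(ntree, mtry):
    # Hypothetical placeholder: a synthetic error surface, NOT study results.
    return 0.2 + abs(ntree - 500) / 10000 + abs(mtry - 8) / 100


best_rmse, best_ntree, best_mtry = grid_search(
    toy_evaluate,
    ntree_values=range(100, 2001, 100),  # 100 to 2000 trees, assumed step
    mtry_values=range(2, 31, 2),         # 2 to 30 variables, assumed step
)
```

Passing the evaluation as a callable keeps the search loop independent of the ML library, so the same loop could be reused for the SVM settings.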

RESULTS
The iterative procedure to identify the best configuration led to the following parameters for RF: "ntree" was set to 600 and "mtry" to 10. The next plot presents the RMSE variation corresponding to the different parameter configurations. The results of the tuning phase showed that RF predictions were characterized by lower RMSE values. The next figures show, for each algorithm, the scatter plot of predicted against observed water depth for all the points in the validation dataset. The following plots (Figure 9) compare three representative transect sections using the measured (observed) depth and the depth predicted by the two ML algorithms. The three transects are taken from those used for validation and therefore represent independent measurements. The top plot in Figure 9 shows that depth is overestimated where the water is deeper than 4 m, plausibly because the water composition at that location differed from that in the training data.

DISCUSSION
The two tested methods, RF and SVM, provided acceptable results, with RF providing higher accuracy in this study. The SVM algorithm had an RMSE of 0.409 m and an MAE of 0.23 m. The coefficient of determination, R², calculated between predicted and observed depth, showed a good correlation (R² = 0.89). Because of the negative values of some variables, it was not possible to adopt a different kernel function to transform the data and improve the algorithm's performance; the use of other kernel functions (polynomial, sigmoid and radial) led to a strong reduction of the training and validation datasets (due to the omission of NaN data). In this study the best results were obtained with RF. With the adopted configuration, this algorithm achieved very low errors (RMSE = 0.23 m and MAE = 0.16 m).
Considering the spatial resolution of the satellite products adopted (highest spatial resolution 3 m), the result is particularly good. It must be considered that the reflectance value of each pixel represents the reflectance of the average depth within that pixel, possibly disturbed by turbidity, algae, etc. For this reason, when comparing a measured point depth with the predicted pixel depth, it is reasonable that some differences remain. A point depth can, for example, reveal local underwater pits.
However, examination of the longitudinal profiles obtained from measured and predicted data (Figure 9) shows that the results are very close, especially for RF.
In general, it is worth noting that the ML approach requires training to obtain such good predictions; care must therefore be taken if the trained model is to be applied to imagery from different sensors or dates than those used in training. Water turbidity and other factors can differ, making the new predictions unreliable. Nevertheless, a well-trained RF model can provide spatially distributed information from a few surveyed points. One possible scenario is a fast survey with only a few transects, which reduces time in the field and therefore cost.
Another point to consider is that any departure from clear-water conditions will decrease the accuracy and can also cause spurious errors. For example, an algal bloom will produce a very different reflectance value and plausibly lead to underestimated depth (i.e. deep water might be estimated as shallow). The opposite can happen with dark materials (e.g. an oil spill) that absorb light and thus make shallow water appear deep. To avoid these errors, conditions have to be verified before results are taken as final and assumed to have the accuracy values reported in this paper.
Future investigation in this direction will include an assessment of variable importance, and further analysis of the implications of using fewer variables and fewer training points. This solution can be particularly suitable if the aim of the analysis is the estimation of sand volume movements.

CONCLUSIONS
The study compared the performance of two ML algorithms (SVM and RF) in predicting sea water depth. The analysis followed a data fusion approach, since data from different platforms were considered together in the training and validation phases. The results show that RF, with the data fusion approach, gave particularly good results with very low errors. In conclusion, this study shows that ML can be an efficient tool to predict shallow water depth. In this way SDB can support traditional bathymetric surveys and make monitoring more efficient.