A COMPARISON OF MACHINE LEARNING MODELS FOR SOIL SALINITY ESTIMATION USING MULTI-SPECTRAL EARTH OBSERVATION DATA

Soil salinity, a significant environmental indicator, is considered one of the leading causes of land degradation, especially in arid and semi-arid regions. In many cases, this major threat leads to loss of arable land, reduces crop productivity, causes the loss of groundwater resources, increases the economic costs of soil management, and ultimately increases the probability of soil erosion. Monitoring the distribution and degree of soil salinity and mapping the electrical conductivity (EC) using remote sensing techniques are crucial for land use management. Salt-affected soil is a predominant phenomenon around the Eshtehard Salt Lake located in Alborz, Iran. In this study, the potential of Sentinel-2 imagery was investigated for mapping and monitoring soil salinity. Coinciding with the satellite's pass, different salt properties were measured for 197 soil samples in the field study. Several spectral features, such as satellite band reflectance, salinity indices, and vegetation indices, were then extracted from Sentinel-2 imagery. To build an optimum machine learning regression model for soil salinity estimation, three regression models, Gradient Boost Machine (GBM), Extreme Gradient Boost (XGBoost), and Random Forest (RF), were used. The XGBoost method outperformed GBM and RF, with a coefficient of determination (R²) of more than 76%, a Root Mean Square Error (RMSE) of about 0.84 dS m⁻¹, and a Normalized Root Mean Square Error (NRMSE) of about 0.33 dS m⁻¹. The results demonstrated that integrating remote sensing data and field data with an appropriate machine learning model could provide high-precision salinity maps to monitor soil salinity as an environmental problem.


INTRODUCTION
Soil salinization due to natural processes and human factors is a significant environmental hazard in arid and semi-arid regions (Metternicht and Zinck, 2003; Ren et al., 2019). Salt-affected soil reduces agricultural productivity, ecosystem health, and water quality. It can also lead to soil erosion and land degradation (Khan et al., 2005; Wicke et al., 2011). According to estimates by the Food and Agriculture Organization of the United Nations (FAO), 397 million hectares of land worldwide are covered by saline soils, and the affected area is estimated to be expanding at a rate of two million hectares per year (Koohafkan and Stewart, 2012; Peng et al., 2019). In addition to primary salinity, soil resources in many areas are at risk of secondary salinization, mainly due to low precipitation and high evaporation, shallow groundwater levels, and irrational farming practices (Nicolas and Walter, 2006). Therefore, careful monitoring, evaluation, and mapping of soil salinity can provide sufficient understanding of this threat's temporal and spatial distribution to enable effective soil restoration and land management (Bannari et al., 2018; Davis et al., 2019). Electrical conductivity (EC) is commonly used to investigate soil salinity dynamics due to its high correlation with soil salinity (Richards, 1954). Traditional methods of analyzing EC are very accurate yet time-consuming, discontinuous, and costly. In the last two decades, remote sensing has been widely used to determine and monitor soil salinity characteristics at different scales (Taghadosi and Hasanlou, 2021; Dale et al., 1986; Dwivedi, 2001; Santra et al., 2015). Multi-spectral data, such as QuickBird, IKONOS, SPOT, Landsat, and Sentinel, are useful in identifying and monitoring soil salinity and environmental hazards (Farifteh, 2007; Koshal, 2012; Ranjbar et al., 2021; Teggi et al., 2012).
The Sentinel-2 satellite, launched in 2015 with a multi-spectral instrument (MSI), is an essential part of global environmental monitoring. Owing to its high spatial and spectral resolution, it has been used continuously over the past few years to identify areas affected by salinity (Malenovský et al., 2012; Taghadosi et al., 2019; J. Wang et al., 2019). The spectral reflectance of soil surface salts has been widely used in several studies as a direct indicator to detect and monitor soil salinity. In a case study in Malheur County, Landsat TM satellite images were used to map soil salinity; the results showed that a large amount of salt in barren soils could be detected in bands 1 to 4 of the Landsat satellite due to the high spectral reflectance of salt in this range (Elnaggar and Noller, 2010). In another study, Sentinel-2 satellite data were used to investigate soil salinity with different spectral combinations; the resulting salinity indices, which combine two or three spectral bands, achieved the highest correlation with ground-truth salinity measurements (J. . In addition to the spectral indices obtained from combinations of satellite image bands, various transformation-based methods have been used to extract appropriate features for assessing soil salinity. For example, principal component analysis (PCA) and spectral indices were used to monitor soil salinity, and the results showed that they are suitable for predicting salinity from satellite images with high accuracy (Khan and Abbas, 2007).
In recent years, a wide range of regression methods have been employed to model soil salinity and estimate EC values. The performance of these methods varies with the study area, the in-situ data collected, and the regression method applied (Eldeiry and Garcia, 2010; Farifteh et al., 2007; Gorji et al., 2017). For instance, Wang et al. (2007) used correlation analysis, the Ordinary Least Squares (OLS) method, and a spatial regression method to study the spatial variation of soil salinity in the Yellow River Delta. Their study showed that the spatial regression model improved the accuracy of soil salinity estimation (Wang et al., 2007). Qu et al. (2008) used the Partial Least Squares Regression (PLSR) method to assess soil salinity using hyperspectral data. The results showed that the calibrated PLSR method could predict soil salinity accurately (Qu et al., 2008). Recently, soil salinity mapping has been performed accurately using machine learning regression methods, such as Support Vector Regression (SVR) and Random Forest (RF). Wu et al. (2018) produced a soil salinity map using Landsat satellite data and machine learning regression methods, such as SVR and RF, as well as Multiple Linear Regression (MLR). The comparison showed that the RF method, with the lowest NRMSE, outperformed the other models (Wu et al., 2018). Another study examined different machine learning regression methods for modeling soil salinity at various study sites. The results showed that the Stochastic Gradient Treeboost (SGT) method is the most reliable algorithm for predicting soil salinity in arid regions (F. .
In this study, we investigate soil salinity monitoring and soil EC mapping using Sentinel-2 satellite images. The main objectives of this study are: (i) to understand the spectral reflectance characteristics of saline soil in the Eshtehard Salt River area, (ii) to identify suitable variables for soil salinity prediction over the study area, (iii) to evaluate and compare three machine learning regression algorithms, namely GBM, XGBoost, and RF, in predicting soil salinity using in-situ data collected in the study area, and (iv) to produce a soil salinity map showing high, moderate, and low saline content.

Study area
Eshtehard is located in the southwest of Alborz County in the Salt River basin (Figure 1). This region has a diverse structure in terms of hydrology and geohydrology. The study area in this research is located in the Eshtehard Salt River with an area of about ).

In-situ data
To study the soil in the affected areas, measure salinity, and prepare ground-truth data, in-situ data were collected near the Eshtehard Salt River using the TDR-350 device. Following a ground-control-point design, 197 samples were taken at random from different parts of the area around the Eshtehard Salt River in August and September 2020.
For each sample, we conducted five measurements: one at the center of the site and four at the corners of a 10 m × 10 m square (Figure 2). The average of these five points represents the salinity within the 10 m × 10 m square. This procedure was applied to all samples so that each sample value is representative of a pixel in the satellite images (Wang et al., 2020). Based on the classes defined by Durand (1983), five salinity classes were considered (Table 1). Figure 1 shows the in-situ operations and the collection and distribution of sample points in the Eshtehard Salt River area. Field data were randomly divided into 70% for the training set and 30% for the testing set, balancing computational cost and representativeness.
Figure 2. In-situ data collection tool and a sample photo from the landscape of the study area.

METHOD
In this study, the total data set (n = 197) was divided into a training set (144 soil samples, about 70% of the total) and a testing set (53 soil samples, about 30% of the total). Following the sampling order, one sample in every four was selected as a verification sample. As discussed, the main objective of this study was to evaluate the suitability of the GBM, XGBoost, and RF algorithms for modeling the relationship between the spectral characteristics of the Sentinel-2 satellite data and the soil salinity parameter over the Eshtehard Salt River. The flowchart of the proposed method is illustrated in Figure 3 and is summarized in the following five steps. The main steps of the proposed method are discussed in more detail in the following sections.
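As an illustration of the split described above, the following minimal sketch draws a random partition with the reported sizes (144 training and 53 testing samples). The function name `split_indices` and the seed are hypothetical, and the sketch assumes a purely random draw; it does not reproduce the study's actual partition.

```python
import numpy as np

def split_indices(n_samples, n_test, seed=0):
    """Randomly partition sample indices into train and test sets.

    The seed is an arbitrary choice for reproducibility, not a value
    taken from the study.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    test_idx = np.sort(order[:n_test])
    train_idx = np.sort(order[n_test:])
    return train_idx, test_idx

# Sizes reported in the text: 197 samples, 53 of them held out for testing.
train_idx, test_idx = split_indices(n_samples=197, n_test=53)
print(len(train_idx), len(test_idx))
```

Passing the test size as an absolute count rather than a fraction keeps the partition at exactly 144 and 53 samples.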

Figure 3. Flowchart of the proposed method (recoverable labels: in-situ data (soil EC measurement); soil salinity geodatabase; training set (70% of soil salinity samples)).

Feature extraction
To identify the affected areas and determine the relationship between the EC values of the soil samples and the corresponding pixel values in the satellite images, 10 spectral bands that respond to soil salinity were used (Taghadosi et al., 2019). Since the study area is mostly barren and the vegetation is very sparse, salinity indices that highlight the reflectance of salt-affected surfaces are particularly useful. Vegetation indices, which are commonly used to identify vegetated areas, can also help analyze soil salinity trends because the negative effects of salinity on plant growth indicate changes in salinity in vegetated areas (Moreira et al., 2015). Therefore, salinity and vegetation indices were selected as predictor variables in our regression analysis to evaluate the relationship between the EC values of the soil samples and the corresponding index values. These indices have shown the best performance for salinity detection in previous studies (Allbed and Kumar, 2013; Scudiero et al., 2014; J. Wang et al., 2019). In image processing and machine learning, feature transformation converts measured datasets into new datasets that carry more useful and informative content (Richards, 2013). This study used PCA and independent component analysis (ICA) to derive transformation-based features (Lee and Batzoglou, 2003; Ranchordas et al., 2010). A total of 46 features were extracted from the satellite data (Tables 3 and 4). In this step, the correlation between the electrical conductivity of the collected in-situ samples and the feature values extracted from the satellite images was computed to determine the relationship between these variables and their efficiency in predicting soil salinity using linear regression. To that end, correlation matrices were generated to capture the correlation between soil salinity and each feature, as well as among the features (Figure 4).
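The index features described above can be sketched as follows. The specific indices used in the study are listed in Tables 3 and 4 and are not reproduced here; NDVI and its sign-flipped counterpart, the normalized difference salinity index (NDSI), are shown only as common examples. The reflectance values are made up, and the band assignment (Sentinel-2 B4 for red, B8 for near-infrared) is an assumption of this sketch.

```python
import numpy as np

def ndvi(red, nir):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)

def ndsi(red, nir):
    """Normalized Difference Salinity Index (NDVI with the sign flipped)."""
    return (red - nir) / (red + nir)

# Hypothetical surface reflectance for three pixels
# (assumed bands: Sentinel-2 B4 = red, B8 = near-infrared).
red = np.array([0.30, 0.25, 0.10])
nir = np.array([0.20, 0.25, 0.40])

# One row per pixel: raw bands followed by the derived indices.
features = np.column_stack([red, nir, ndvi(red, nir), ndsi(red, nir)])
print(features.shape)  # (3, 4)
```

Bright, barren salt crusts tend to give positive NDSI values, whereas vegetated pixels give positive NDVI values, which is why both index families can carry complementary information.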

Figure 4. Correlation matrices showing the correlation between electrical conductivity and the multi-spectral features, and the correlation among the multi-spectral features.
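A correlation analysis of this kind can be sketched with NumPy as follows. The data are synthetic and the function name `correlation_with_target` is hypothetical; the sketch assumes the Pearson correlation coefficient, which the text does not state explicitly.

```python
import numpy as np

def correlation_with_target(features, target):
    """Return (r between target and each feature, feature-feature matrix).

    Pearson correlation is assumed here.
    """
    stacked = np.column_stack([target, features])
    corr = np.corrcoef(stacked, rowvar=False)
    return corr[0, 1:], corr[1:, 1:]

# Synthetic stand-in for EC measurements and three extracted features.
rng = np.random.default_rng(0)
ec = rng.uniform(0.5, 16.0, size=197)
features = np.column_stack([
    2.0 * ec + 1.0,         # perfectly positively correlated with EC
    -ec,                    # perfectly negatively correlated with EC
    rng.normal(size=197),   # unrelated noise
])

ec_corr, feature_corr = correlation_with_target(features, ec)
print(ec_corr.round(2))
```

Features with a high absolute correlation to EC are candidate predictors, while pairs of features that are highly correlated with each other carry largely redundant information.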

Regression analysis
Regression analysis is generally used to model the relationship between a dependent variable and one or more independent variables. In recent years, various regression techniques have been developed for a wide range of applications, and they can be used for prediction and model construction (Fan et al., 2015). In this study, the GBM, XGBoost, and RF methods were selected for remote sensing inversion of soil salinity.

Gradient Boost Machine
The gradient boost machine algorithm employs an ensemble to reduce the bias, noise, and variance that limit a prediction model's effectiveness. The ensemble uses boosting, which builds many weak models sequentially so that each new model learns from the errors of the earlier ones. Intuitively, GBM fits each new model to the residuals of the current ensemble, so that the algorithm's cost function is progressively optimized. The user-centric parameters optimized for this algorithm using the grid search method are the following: 1) n estimators: the number of sequential trees to be modeled; although GBM is fairly robust with a large number of trees, it can still overfit beyond a certain point, and 2) max features: the number of features to consider while searching for the best split. As a rule of thumb, the square root of the total number of features works well (Friedman, 2001; Ying and Sayed, 2017).
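The grid search over these two parameters can be sketched with scikit-learn as follows. The study does not state which software or candidate values were used, so the library choice, the grid values, the three-fold cross-validation, and the synthetic data are all assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training matrix (144 samples x 46 features).
rng = np.random.default_rng(0)
X = rng.random((144, 46))
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.1, size=144)

# Hypothetical candidate values for the two tuned parameters.
param_grid = {
    "n_estimators": [50, 100],       # number of sequential trees
    "max_features": ["sqrt", None],  # sqrt(n_features) vs. all features
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

scikit-learn maximizes the score, so RMSE is supplied in negated form; the best configuration is the one whose negated RMSE is closest to zero.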

XGBoost
XGBoost is one of the fastest implementations of gradient boosted trees. It achieves this by tackling a significant inefficiency of gradient boosted trees: evaluating the potential loss for all possible splits to create a new member. XGBoost addresses this inefficiency by looking at the distribution of features across all data points in a leaf and using this information to reduce the search space of possible feature splits. Although XGBoost implements several regularization methods, this speed is its most useful feature, since it allows many hyperparameter settings to be investigated quickly. The user-centric parameters optimized for this algorithm using the grid search method are the following: 1) learning rate: the step size shrinkage used in each update to prevent overfitting, 2) alpha and lambda: the L1 and L2 regularization terms on weights; increasing these values makes the model more conservative, and 3) column sample by tree: one of a family of parameters for the subsampling of columns (Chen and Guestrin, 2016).
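A grid over the parameters listed above could be set up as in the following sketch, which assumes the scikit-learn wrapper of the `xgboost` Python package. The candidate values, the number of trees, and the helper names `n_candidates` and `build_search` are hypothetical; the study does not report its grid. The estimator is imported lazily so that the grid itself can be inspected without `xgboost` installed.

```python
from itertools import product

# Hypothetical candidate values; the keys are XGBRegressor argument names.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.3],     # step size shrinkage
    "reg_alpha": [0.0, 0.1, 1.0],          # L1 regularization on weights
    "reg_lambda": [0.1, 1.0, 10.0],        # L2 regularization on weights
    "colsample_bytree": [0.5, 0.8, 1.0],   # column subsampling per tree
}

def n_candidates(grid):
    """Number of parameter combinations a full grid search evaluates."""
    return len(list(product(*grid.values())))

def build_search(cv=5):
    """Create an unfitted grid search (requires xgboost and scikit-learn)."""
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBRegressor

    model = XGBRegressor(objective="reg:squarederror", n_estimators=200)
    return GridSearchCV(model, param_grid, cv=cv,
                        scoring="neg_root_mean_squared_error")

print(n_candidates(param_grid))  # 81
```

Because a full grid grows multiplicatively with each added parameter, the training speed of XGBoost noted above is what makes such a search practical.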

Random Forest
Random forest is a supervised machine learning algorithm used effectively to solve regression and classification problems and to capture nonlinear relationships between target and input (Breiman, 2001). The algorithm builds the forest from a set of individual decision trees, each of which uses a random subset of the features (Belgiu and Drăguţ, 2016). Each tree is trained on a random subset of the training samples and predicts target values. For regression problems, each tree contributes one estimate, and the predicted value is the average of the estimates of all decision trees. The algorithm can also determine the relative importance of each input feature, which helps in understanding each feature's contribution to the RF output. The user-centric parameters optimized for this algorithm using the grid search method are the following: 1) n estimators: the number of trees in the forest, and 2) max features: the number of features to consider when looking for the best split (Heung et al., 2016).
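The tuning and feature-importance steps described above can be sketched with scikit-learn as follows. The data are synthetic, with the target driven by the first feature only, and the grid values and three-fold cross-validation are assumptions rather than the study's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training matrix; only feature 0 drives the target.
rng = np.random.default_rng(0)
X = rng.random((144, 46))
y = 5.0 * X[:, 0] + rng.normal(0.0, 0.1, size=144)

# Hypothetical candidate values for the two tuned parameters.
param_grid = {
    "n_estimators": [50, 100],      # number of trees in the forest
    "max_features": ["sqrt", 1.0],  # sqrt(n_features) vs. all features
}

search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)

# Impurity-based importance of each input feature in the refitted best model.
importances = search.best_estimator_.feature_importances_
top_feature = int(np.argmax(importances))
print(search.best_params_, top_feature)
```

The importances sum to one, so each value can be read as the share of the total impurity reduction attributed to that feature.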

Accuracy assessment
To assess the performance of the regression models, three criteria were chosen: R², RMSE, and NRMSE. The accuracy of the created models was then analyzed based on these criteria (Equations (1-3)), where ŷ is the vector of predicted values of the dependent variable with n data points, y is the vector of observed values of the variable being predicted, and ȳ is the mean of the observed values.
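The three criteria can be computed as in the sketch below, which uses the standard definitions of R² and RMSE. The normalization used for NRMSE is not recoverable from the text; dividing RMSE by the mean of the observations is an assumption of this sketch (normalizing by the range or the standard deviation is also common).

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def nrmse(y_true, y_pred):
    """RMSE divided by the mean observation (assumed normalization)."""
    return rmse(y_true, y_pred) / np.mean(y_true)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])
print(r2(y_true, y_pred), rmse(y_true, y_pred), nrmse(y_true, y_pred))
```

For this toy input, the residual sum of squares is 1 and the total sum of squares is 5, so R² is 0.8, RMSE is 0.5, and the mean-normalized NRMSE is 0.2.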

Prediction of soil EC maps
Considering the location of the ground-truth data at the study site, the corresponding pixel values of the satellite data were extracted for analysis. According to Tables 3 and 4, 46 features were considered for each soil sample, and a matrix containing all in-situ measurements and satellite features was created to train the models. The data matrix was then divided into two sections, 70% for training and 30% for testing, and prediction models were developed using GBM, XGBoost, and RF to estimate soil salinity from the independent variables. The constructed models were then used to predict the EC value for each pixel in the satellite image. The models were compared based on their R², RMSE, and NRMSE values to select the best regression method for mapping soil salinity over the whole image, which is discussed in the next section. Figure 5 shows the predicted EC map for each regression method.
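Applying a fitted model to every pixel amounts to flattening the feature image into a table, predicting, and reshaping the result, as in the sketch below. The function name `predict_ec_map` and the stand-in model `MeanFeatureModel` are hypothetical; any regressor exposing a scikit-learn style `predict` method could be passed instead.

```python
import numpy as np

def predict_ec_map(model, feature_image):
    """Predict one EC value per pixel.

    feature_image has shape (rows, cols, n_features); the result has
    shape (rows, cols).
    """
    rows, cols, n_features = feature_image.shape
    flat = feature_image.reshape(-1, n_features)  # one row per pixel
    return model.predict(flat).reshape(rows, cols)

class MeanFeatureModel:
    """Hypothetical stand-in for a trained regressor."""

    def predict(self, X):
        return X.mean(axis=1)

# Toy 2 x 3 image with 4 features per pixel.
image = np.arange(24, dtype=float).reshape(2, 3, 4)
ec_map = predict_ec_map(MeanFeatureModel(), image)
print(ec_map.shape)  # (2, 3)
```

The feature order along the last axis must match the column order of the training matrix, otherwise the model receives mismatched inputs.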

Evaluation of the Accuracy of Estimations
The soil salt content estimation models were constructed using the GBM, XGBoost, and RF methods (Table 5). After parameter optimization on each model's training dataset, the average R² was 0.72 for the GBM model, 0.76 for the XGBoost model, and 0.69 for the RF model. According to the results in Table 5, the XGBoost model shows the highest precision and accuracy in predicting soil salinity, while the RF model presents the weakest results. For all algorithms, our proposed method yields models with higher accuracy than the recent study conducted by . Two main reasons explain why the XGBoost model outperformed the other models:
(1) It computes second-order gradients, i.e., second partial derivatives of the loss function, which provide more information about the direction of the gradients and how to reach the minimum of the loss function. While regular gradient boosting uses the base model's loss function as a substitute for minimizing the overall model's error, XGBoost uses the second-order derivative as an approximation.
(2) It applies advanced regularization (L1 and L2), which improves model generalization.
XGBoost has further benefits: training is very fast and can be parallelized.
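The second-order approximation mentioned above can be written out as follows, using the notation of Chen and Guestrin (2016): f_t is the tree added at boosting round t, T is its number of leaves, w_j is the weight of leaf j, and I_j is the set of samples falling in leaf j. Only the L2 term is shown in the penalty; the L1 term (alpha) adds a further penalty on |w_j| and is omitted here for brevity.

```latex
% Objective at round t, expanded to second order around the previous prediction
\begin{align*}
\mathcal{L}^{(t)} &\simeq \sum_{i=1}^{n} \Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big)
    + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \Big] + \Omega(f_t), \\
g_i &= \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^{2}}, \\
\Omega(f) &= \gamma T + \tfrac{1}{2}\, \lambda \sum_{j=1}^{T} w_j^{2}, \\
w_j^{*} &= -\,\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.
\end{align*}
```

For the squared-error loss l = ½(y_i − ŷ_i)², the gradient is g_i = ŷ_i − y_i and the Hessian is h_i = 1, so the optimal leaf weight reduces to the sum of the residuals in the leaf divided by (number of samples in the leaf + λ). This makes explicit how a larger λ shrinks the leaf weights and yields a more conservative model.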

CONCLUSION
In this study, 46 variables, in four main categories (i.e., remote sensing data, terrain characteristics, salinity spectral indices, and vegetation spectral indices) and three models (i.e., GBM, XGBoost, and RF) were selected to estimate soil salinity in the Eshtehard Salt River. The main results are as follows: (1) Overall, the 46 factors considered were significant and contributed to the estimation of soil salinity.
(3) According to the EC map obtained from the XGBoost model, the north and the west of the study area showed relatively low soil salinity. Saline soil was mainly found from the southeast to the center of the study area and along the Salt River.