A COMPARISON OF TREE-BASED REGRESSION MODELS FOR SOIL MOISTURE ESTIMATION USING SAR DATA

ABSTRACT: Soil moisture content plays a pivotal role in the biomass development of vegetation cover at various growth stages. It is considered a crucial parameter in agricultural studies because it directly influences fertility. Remote sensing techniques, specifically Synthetic Aperture Radar (SAR) sensors, provide a suitable opportunity for continuous soil moisture monitoring at various spatial and temporal resolutions. In this study, field campaigns were conducted to measure soil surface parameters, including soil moisture and roughness, synchronized with Sentinel-1 passes over an agricultural region near Mohammadshahr, Iran. Soil moisture sampling was carried out during the growth stages of the crops (canola and winter wheat). The Gradient Boosted Regression Tree (GBRT), eXtreme Gradient Boosted (XGB), and Random Forest (RF) machine learning algorithms were employed to model the relationship between the ground-measured soil moisture and polarimetric SAR-derived features from Sentinel-1 imagery. The results were promising for soil moisture estimation using the dual-polarized SAR dataset over crop-covered agricultural fields, with R² = 0.95 and RMSE = 0.023 m³ m⁻³ using the GBRT regression model.


INTRODUCTION
The moisture content of the soil plays a crucial role in a broad range of contexts, such as irrigation management, crop growth studies, and climate change analysis (Thi et al. 2019; Ranjbar, Zarei, et al. 2021). In the last few decades, the agricultural industry has widely adopted new technologies, including remote sensing tools and techniques, to boost agricultural productivity in response to growing demand (Akhavan, Hasanlou, Hosseini, and Becker-Reshef 2021; Reisi-Gahrouei et al. 2019). Among remote sensing sensors, active microwave sensors, specifically Synthetic Aperture Radar (SAR), are strongly affected by the characteristics of the target, including surface roughness and vegetation cover, which makes SAR-based soil moisture retrieval a challenging process (Hajj et al. 2017; Paloscia et al. 2013; Ranjbar, Akhoondzadeh, et al. 2021). Soil moisture estimation over vegetated agricultural regions has been performed in X-, C-, and L-bands using airborne and spaceborne fully polarimetric and dual-polarized remotely sensed datasets, including the Airborne Synthetic Aperture Radar (AIRSAR), Uninhabited Aerial Vehicle Synthetic Aperture Radar (UAVSAR), Advanced Land Observing Satellite/Phased Array L-band SAR (ALOS/PALSAR), RADARSAT, and Sentinel-1. Recently, the Copernicus Sentinel-1 dataset has sparked interest in using free C-band SAR data with acceptable spatial and temporal resolution for soil moisture studies.

Several backscattering models, including physical, empirical, and semi-empirical models, have been developed and widely employed to reduce the associated uncertainties by taking some of these factors into consideration and studying their effects on backscattered signals (Baghdadi et al. 2016; Ma, Li, and McCabe 2020; Oh 2004; Dubois and Engman 1995). To overcome these limitations, some researchers developed theoretical, empirical, and semi-empirical models for soil moisture retrieval, including the Integral Equation Model (IEM) and the Oh and Dubois models (Dubois and Engman 1995; Ezzahar et al. 2020; Fung, Li, and Chen 1992; Oh 2004; Sekertekin, Marangoz, and Abdikan 2020). Despite the good performance of these models over bare and sparsely vegetated regions, they showed limitations in areas with well-developed vegetation; hence, for soil moisture retrieval over vegetated areas, Attema and Ulaby proposed the Water Cloud Model (WCM), which is capable of modelling backscattered signals from the vegetation canopy (Attema and Ulaby 2016). The aforementioned models have been selected and tested in several studies according to their research objectives (Zarei, Hasanlou, and Mahdianpari 2021). Furthermore, over the last few years, several regression-based models, especially machine learning algorithms, have been widely used for remote sensing-based environmental studies (M. Ansari and Akhoondzadeh 2019; Mohsen Ansari and Akhoondzadeh 2020; Dev et al. 2016).

Field measurements for gathering soil samples are not feasible over large scales because of factors such as the rapid variability of moisture content over time and space and the time-consuming process involved. Also, soil moisture retrieval using remote sensing technologies is a demanding task in the absence of an adequate number of soil moisture samples; accordingly, the use of tools and skills for sampling soil parameters (moisture and roughness) provides a suitable context for more accurate surface soil moisture monitoring using remote sensing (El Hajj, Baghdadi, and Zribi 2019). Numerous field campaigns have been conducted to assess the potential of remote sensing technologies for continuous soil water content monitoring (Jackson et al. 2005; McNairn et al. 2017; McNairn et al. 2015). Among soil parameters, moisture content and roughness are regarded as substantial variables for agriculture-related studies (Akhavan, Hasanlou, Hosseini, and McNairn 2021). The in-situ data collection procedure, the applied regression methods, and the type of study region together affect the final result.

In this study, ground measurements of soil parameters were conducted over a permanent agricultural field during the plants' growth stages, from September 2020 to February 2021 (see Figure 1). The main purpose of this study is to evaluate the potential of the boosting and bagging tree-based GBRT, XGB, and RF machine learning algorithms for soil moisture monitoring over the vegetated region. To this end, the SLC and GRD formats of the Sentinel-1 dataset were used to extract dual-polarized SAR features (Table 1).

Table 1. Sentinel-1 GRD- and SLC-derived features

Study Area
The study region is located in Mohammadshahr (central district of Karaj County), Alborz Province, Iran (Figure 1). The agricultural research farm of the University of Tehran, with an area of approximately 206 and an altitude of 1160 meters above sea level, is centered at latitude 35°48′25″ and longitude 50°57′11″.

In-situ data:
Fieldwork was conducted over canola and winter wheat fields during the plants' growth stages using a TDR-350 probe and a hand-made surface roughness profilometer to gather soil information. Figure 2 shows the landscape of the study area and the TDR-350 probe. During field data collection, soil moisture samples were collected three times in each field during the crops' growth stages at predesigned control points laid out in Google Earth Pro. For accurate soil moisture retrieval, sampling points were located within a 30-meter distance. Measurements at each sample location were replicated four times, and the average of those values was taken as the final value. Based on this sampling strategy, a total of 305 and 195 samples were measured over the winter wheat and canola fields, respectively. The collected dataset was divided into 75% training and 25% test subsets for the training and evaluation phases. Figure 3 shows the range of soil moisture content during the ground measurements.

To verify the accuracy of the TDR-350, gravimetric soil sampling was also conducted (Figure 4). In this procedure, a specific amount of each moist sample was weighed separately. The samples were then dried in an oven at 105°C for 24 hours. In the next step, the dried samples were weighed, and the moisture percentage was calculated using Equation (1). The gravimetrically measured soil moisture values are presented in Table 2.

Table 2. Gravimetric vs. TDR-measured soil water content

A portable profilometer with a length of 150 centimeters was used to measure surface roughness parameters. The profilometer was aligned in two directions, and photos were taken with a digital camera to record the roughness status. Two measurements were taken at two different locations in each field at the beginning and end of the season. Manual alignment was used to ensure that the roughness profilometer was vertical, and the captured photos were post-processed using the WebPlotDigitizer software and the Python programming language (Figure 5).
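
The gravimetric check described in this subsection can be sketched in a few lines. Equation (1) is not reproduced in the text, so the sketch assumes the standard definition of gravimetric water content (mass of water divided by mass of oven-dry soil). Because the retrieval errors in this paper are reported in volumetric units (m³ m⁻³), a conversion step is included; the bulk density value and the sample masses are hypothetical and are not taken from the study.

```python
def gravimetric_water_content(wet_mass_g, dry_mass_g):
    """Gravimetric water content in percent of oven-dry soil mass.

    Assumed standard definition: (wet - dry) / dry * 100.
    """
    if dry_mass_g <= 0:
        raise ValueError("dry mass must be positive")
    if wet_mass_g < dry_mass_g:
        raise ValueError("wet mass cannot be smaller than dry mass")
    return (wet_mass_g - dry_mass_g) / dry_mass_g * 100.0


def volumetric_water_content(gravimetric_percent, bulk_density_g_cm3,
                             water_density_g_cm3=1.0):
    """Convert gravimetric percent to volumetric water content (m3 m-3)."""
    return gravimetric_percent / 100.0 * bulk_density_g_cm3 / water_density_g_cm3


# Hypothetical sample: 120 g moist, 100 g after oven drying,
# assumed bulk density of 1.3 g/cm3 (not reported in the paper).
gwc = gravimetric_water_content(120.0, 100.0)
vwc = volumetric_water_content(gwc, 1.3)
print(f"gravimetric = {gwc:.1f} %, volumetric = {vwc:.3f} m3 m-3")
```

The converted value is the quantity that would be compared against the probe reading when a table such as Table 2 is compiled.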

Sentinel-1:
The European Space Agency Copernicus program provides Sentinel-1 SAR imagery with a six-day revisit cycle in dual- and single-polarization modes (Bahrami et al. 2022). In the present study, the backscatter coefficients and the dual-polarized Entropy and Alpha parameters were extracted from the Single Look Complex (SLC) format of the Sentinel-1 C-band dataset. To extract this information, the polarimetric matrix was generated after the radiometric calibration and TOPSAR-Deburst steps. The backscatter coefficients and the Entropy-Alpha parameters were then extracted after multilooking and applying a polarimetric speckle filter. In addition, the Ground Range Detected (GRD) format of the Sentinel-1 dataset was used for Gray Level Co-occurrence Matrix (GLCM) feature extraction.
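
To make the Entropy and Alpha parameters concrete, the sketch below computes them for a single pixel from a 2x2 Hermitian covariance matrix using the usual eigenvalue-based definitions. The paper does not state which matrix or software routine was used, so the matrix layout, the function name `dual_pol_entropy_alpha`, and the convention that the first channel defines the alpha angle are all assumptions made for illustration.

```python
import math


def dual_pol_entropy_alpha(c11, c22, c12):
    """Entropy (0..1) and mean alpha angle (degrees) for one pixel.

    Assumed input: the Hermitian matrix [[c11, c12], [conj(c12), c22]],
    with real non-negative c11, c22 and complex c12.
    """
    off = abs(c12)
    if off == 0.0:
        # Diagonal matrix: eigenvectors are the two channel axes.
        pairs = [(c11, 0.0), (c22, math.pi / 2)]
    else:
        mean = 0.5 * (c11 + c22)
        radius = math.hypot(0.5 * (c11 - c22), off)
        pairs = []
        for lam in (mean + radius, mean - radius):
            # An eigenvector for eigenvalue lam is (c12, lam - c11);
            # alpha is the angle of its normalized first component.
            cos_alpha = min(1.0, off / math.hypot(off, lam - c11))
            pairs.append((max(lam, 0.0), math.acos(cos_alpha)))

    total = sum(lam for lam, _ in pairs)
    if total <= 0.0:
        raise ValueError("matrix carries no power")

    entropy = 0.0
    alpha = 0.0
    for lam, alpha_i in pairs:
        p = lam / total  # pseudo-probability of this scattering mechanism
        if p > 0.0:
            entropy -= p * math.log2(p)  # base 2 because there are two eigenvalues
        alpha += p * alpha_i
    return entropy, math.degrees(alpha)


# Hypothetical pixel values, for illustration only.
h, a = dual_pol_entropy_alpha(0.08, 0.02, 0.01 + 0.005j)
print(f"H = {h:.3f}, alpha = {a:.1f} deg")
```

Entropy close to 0 indicates one dominant scattering mechanism, and entropy close to 1 indicates two equally strong mechanisms. In practice the matrix would be averaged over a window (multilooking and speckle filtering) before this step.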

METHODS
In the present investigation, three tree-based ensemble machine learning algorithms, GBRT, XGB, and RF, were deployed. To evaluate their performance, the models were trained and tested using a 75/25 split. The tree-based algorithms used in this study are described briefly in the following subsections.
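
The evaluation protocol can be illustrated with a minimal sketch: a shuffled 75/25 split followed by the two metrics reported in this paper, RMSE and R². The random seed, the helper names, and the stand-in sample list are assumptions; the paper does not state how the split was randomized or whether it was applied per crop or to the pooled data.

```python
import math
import random


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples and return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target (m3 m-3 here)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Stand-in for the 305 winter wheat + 195 canola samples (indices only).
samples = list(range(500))
train, test = train_test_split(samples)
print(len(train), len(test))
```

Both metrics are computed on the held-out test subset only, so that they reflect performance on samples the models did not see during training.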

Gradient Boosted Regression Trees
Gradient Boosted Regression Trees (GBRT) uses boosting as a statistical approach to aggregate weak learners into a single strong learner. The model reduces the residuals, and thereby the loss function, by generating new decision trees and adding them in sequential steps (Bahrami et al. 2021). The GBRT algorithm is sensitive to its parameters; hence, appropriate selection of parameters such as the number of estimators, maximum depth, learning rate, and loss function is needed to obtain the best implementation and final result.
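
The sequential idea behind GBRT can be shown with a deliberately small sketch. It assumes squared-error loss, a single input feature, and one-split trees (stumps) as weak learners; the synthetic data and the function names are hypothetical, and this is not the configuration used in the study.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on one feature (squared-error criterion).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_gbrt(xs, ys, n_estimators=50, learning_rate=0.1):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared loss the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_val, right_val = fit_stump(xs, residuals)
        stumps.append((threshold, left_val, right_val))
        preds = [p + learning_rate * (left_val if x <= threshold else right_val)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict_gbrt(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps)


# Synthetic example: one feature, moisture jumps from 0.1 to 0.3 m3 m-3.
xs = [i / 10 for i in range(20)]
ys = [0.1 if x < 1.0 else 0.3 for x in xs]
model = fit_gbrt(xs, ys)
print(round(predict_gbrt(model, 0.5), 3), round(predict_gbrt(model, 1.5), 3))
```

The number of estimators and the learning rate trade off against each other: a smaller learning rate shrinks each tree's contribution and generally requires more trees to reach the same training error.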

eXtreme Gradient Boosted
The eXtreme Gradient Boosted (XGB) algorithm uses an additive training process to build a strong learner. This two-phase process compensates for the drawbacks of other methods by fitting a learning phase to all input data, followed by an adjustment phase on the residuals until the stopping criterion is reached (Pham et al. 2020). The algorithm controls bias as well as under- and over-fitting through several hyperparameters shared with GBRT, along with the gamma parameter.
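
The role of the gamma parameter can be illustrated with the regularized split gain from the XGBoost formulation, in which a candidate split is kept only if its gain stays positive after subtracting gamma. The sketch below is a stand-alone illustration and does not call the XGBoost library; the gradient and Hessian sums are hypothetical and assume the loss 0.5 * (prediction - y)^2, for which each sample contributes a gradient of (prediction - y) and a Hessian of 1.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf value for summed gradients G and Hessians H: -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split, penalized by gamma for the extra leaf."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# Hypothetical node: three samples with gradient -0.1 go left,
# three samples with gradient +0.1 go right.
gain_plain = split_gain(-0.3, 3.0, 0.3, 3.0, reg_lambda=1.0, gamma=0.0)
gain_penalized = split_gain(-0.3, 3.0, 0.3, 3.0, reg_lambda=1.0, gamma=0.05)
print(gain_plain, gain_penalized, leaf_weight(-0.3, 3.0))
```

With gamma set to 0.05 the same split has a negative gain and would be pruned, which is how a larger gamma yields more conservative trees.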

Random Forest
Random Forest (RF), a robust ensemble learning model, combines several decision trees in a parallel structure to determine the optimum nonlinear relationship between the target and the input features. The model employs bagging: each decision tree is built from a bootstrap sample with a random subset of the input variables (Akhavan, Hasanlou, Hosseini, and McNairn 2021; Mao et al. 2019). This approach trains the trees independently, which reduces the variance and limits overfitting. The final prediction is the average of the outputs of all decision trees in the ensemble.
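
Bagging with random feature selection can be sketched as follows. For brevity each tree is a single-split stump, so choosing a feature subset per tree coincides with choosing it per split; the two-feature synthetic dataset, the seed, and the function names are assumptions made for illustration.

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Best one-split tree over the candidate features (squared-error criterion)."""
    best = None
    for f in feature_ids:
        for threshold in sorted(set(row[f] for row in rows))[:-1]:
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:
        # Every candidate feature is constant in this bootstrap sample.
        mean = sum(targets) / len(targets)
        return feature_ids[0], float("inf"), mean, mean
    return best[1:]


def fit_forest(rows, targets, n_trees=50, max_features=1, seed=0):
    """Train each tree on a bootstrap sample with a random feature subset."""
    rng = random.Random(seed)
    n_samples, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]  # with replacement
        feats = rng.sample(range(n_features), max_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    votes = [left if row[f] <= threshold else right
             for f, threshold, left, right in forest]
    return sum(votes) / len(votes)


# Synthetic data: feature 0 drives the target, feature 1 is uninformative.
rows = [(i / 10, (i * 7) % 5) for i in range(20)]
targets = [0.1 if i < 10 else 0.3 for i in range(20)]
forest = fit_forest(rows, targets)
print(predict_forest(forest, (0.2, 4)), predict_forest(forest, (1.5, 4)))
```

Because the trees do not depend on one another, they can be trained in parallel, in contrast to the sequential fitting used by the boosting methods.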

RESULTS AND DISCUSSION
This section presents the results of the GBRT, XGB, and RF algorithms using SAR-derived features extracted from the SLC and GRD formats of the Sentinel-1 dataset. For each in-situ sampling point, the features listed in Table 1 were extracted to train the GBRT, XGB, and RF regression models and to evaluate their performance on the test dataset. Table 3 and Figure 6 summarize the performance of these algorithms in terms of RMSE and R². Additionally, the soil moisture maps produced by these regression models are shown in Figure 7. These maps were produced from the combined soil moisture datasets over the canola and winter wheat fields.

Table 3. Results of soil moisture retrieval over the canola and winter wheat agricultural region using the GBRT, XGB, and RF algorithms

As illustrated in Table 3, the results confirm the relatively close performance of GBRT, XGB, and RF for soil moisture retrieval over the canola and winter wheat fields. All algorithms performed better over the canola fields. The best result was obtained using the GBRT model, with R² = 0.95 and the lowest error of RMSE = 0.023 m³ m⁻³. The better performance over the canola fields, and the relatively poor performance over the winter wheat fields, is probably due to winter wheat's biomass development and plant structure, both of which affect the backscattered SAR signal.

CONCLUSION
The principal purpose of the present study was reliable soil moisture investigation over an agricultural region by conducting fieldwork (sampling of soil moisture and roughness) coincident with ESA Sentinel-1 satellite passes. For this purpose, several SAR features were extracted from the SLC and GRD formats of Sentinel-1 imagery alongside the field-collected samples. Tree-based machine learning algorithms were then employed for soil moisture retrieval. The algorithms achieved comparable accuracies, and the most accurate result, with the lowest RMSE, was obtained by the GBRT model over the canola field (R² = 0.95 and RMSE = 0.023 m³ m⁻³). This study showed that dual-polarized SAR data and machine learning approaches were effective for soil moisture estimation over winter wheat and canola fields. However, a solid and comprehensive conclusion requires more tests and experiments over other areas. It is recommended to use SAR-derived vegetation features such as the Radar Vegetation Index (RVI), together with features extracted from optical sensors such as Sentinel-2, to evaluate multispectral-derived indices for soil moisture retrieval.

Figure 1. Location of sample points and the study region over canola and winter wheat fields

Figure 6. Results obtained using (a) GBRT, (b) XGB, and (c) RF regression models over canola (left column) and winter wheat (right column) fields