LAND SUBSIDENCE SUSCEPTIBILITY MAPPING USING MACHINE LEARNING ALGORITHMS

: Land subsidence (LS) is one of the most challenging natural disasters that has potential consequences such as damage to infrastructures and buildings, creating sinkholes, and leading to soil destruction. To mitigate the damages caused by LS, it is necessary to determine the LS-prone areas. In this paper, LS susceptibility was assessed for Kashan Plain in Iran using Random Forest (RF) and XGBoost machine learning algorithms. For the susceptibility analysis, twelve influential factors including elevation, slope, aspect, curvature, topographic wetness index (TWI), groundwater drawdown (GWD), normalized difference vegetation index (NDVI), distance to stream (DtS), distance to road (DtR), distance to fault (DtF), lithology, and land use were taken into account. 291 LS points were used in this study which was divided into two parts of 70% and 30% for training and testing the models, respectively. The prediction power of the models and their produced LS susceptibility maps (LSSMs) were validated using the Root Mean Square Error (RMSE), R-Squared (R²), and Mean Absolute Error (MAE) values. The results showed that the XGBoost had a higher R² equal to 0.9032 compared to that of the RF which was equal to 0.8355. XGBoost model had an RMSE equal to 0.3764 cm compared to that of the RF model which was equal to 0.4906 cm. MAE for the XGBoost model was 0.1217 cm and for the RF model was 0.3050 cm. Therefore, the achieved results proved that XGBoost had better performance in this research for predicting LS values based on the measured ones.


INTRODUCTION
Land subsidence (LS) is one of the most challenging natural disasters that has potential consequences of damage to infrastructures, creating sinkholes, leading to soil destruction and so on (Raspini et al., 2016;Shi et al., 2020).LS is an apparent and slow deformation or collapse of the earth surface which is caused by a number of natural and human factors (Ng et al., 2015;Zhou et al., 2017).LS can be a natural result of a number of natural and man made disasters such as earthquakes, dissolution of carbonate rocks, movement of faults, or an increase in the depth of groundwater (Arabameri et al., 2021a;Yang et al., 2012).LS has become a global threat, which has occurred in various countries such as China, Mexico, Italy, the United States of America, Spain, and Iran (Brown and Nicholls, 2015;Chaussard et al., 2014;Corbau et al., 2019;Galloway and Burbey, 2011;Tung and Hu, 2012).In recent decades, the rate of LS in Iran has increased widely (Motagh et al., 2008;Tarighat et al., 2021).One of the greatest causes of LS in Iran is the indiscriminate exploitation of groundwater for agricultural purposes (Foroughnia et al., 2019;Mohammady et al., 2019).Water is essential to sustain life on earth.However, its availability is not the same in space and time.Groundwater meets a major part of water demand, and the increasing dependence on this source has led to the depletion of groundwater in different * Corresponding author parts of the world.For decades, groundwater has been widely exploited for domestic, agricultural, and industrial purposes (Foroughnia et al., 2019;Mohammady et al., 2019).This requires artificial recharge to balance groundwater depletion and control LS.Water resources are a critical requirement for a sustainable food supply.The increase in temperature and the change in precipitation patterns in space and time due to climate change lead to frequent and severe droughts and floods, which reduce the ability to absorb and store water (Mirza, 2003).Population growth and the industrialization of societies have increased the demand for water resources and this trend will continue, as increasing amounts of water are needed to sustain societies (Dalin et al., 2017;Wada et al., 2010).Faced with the growing demand for access to fresh water, the use of groundwater is used as a vital resource to meet agricultural, industrial, and drinking water demands.The increase in water demand has led to a decrease in groundwater in many parts of the world including Iran (Foroughnia et al., 2019;Mohammady et al., 2019).In many areas, this issue has led to LS, which causes permanent loss of underground water storage (Smith et al., 2017), damage to infrastructure and arsenic contamination (Erban et al., 2013;Smith et al., 2017).Advances in remote sensing, geospatial information system (GIS) spatial analyses and artificial intelligence (AI), have helped in the modeling of several natural hazards such as LS to determine the LS prone areas.Machine Learning (ML) methods are particularly important in natural hazard modeling due to their capacity to handle complex real world problems, as well as due to their high accuracy and efficiency.(Arabameri et al., 2021a;Chen et al., 2019;Ebrahimy et al., 2020;Feng et al., 2020;Lee et al., 2012;Yang et al., 2012).Previous research has verified that twelve factors that have the most important influence on LS are elevation, slope, aspect, distance to road (DtR), distance to fault (DtF), distance to stream (DtS), groundwater drawdown (GWD), normalized differentiate vegetation index (NDVI), topographic wetness index (TWI), curvature, lithology and land use (Arabameri et al., 2021b;Ebrahimy et al., 2020;Ranjgar et al., 2021).ML is a branch of AI (Zhou, 2021).ML has been used in various fields such as landslide (Arabameri et al., 2020), effects of climate change on the environment (Seyed Mousavi et al., 2022) and so on.In the previous study of LS, ML has been widely used to determine the relationship between the influencing factors and subsidence of the area (Ranjgar et al., 2021;Shi et al., 2020) and to estimate and predict LS (Mohammady et al., 2019;Ranjgar et al., 2021;Shi et al., 2020).In previous studies, the XGBoost model was once used to model the LS of the Beijing Plain, China (Shi et al., 2020), however, the influencing factors used in this study except groundwater level had not been taken into consideration.In this research, we have considered these influential factors as well in the modeling phase.Ensemble learning algorithms such as XGBoost and Random Forest (RF) improve the performance of the model by reducing the overall error rate (Zhou, 2012).Compared with traditional ML models such as SVM and ANN, XGBoost and RF have faster calculation speed and compared with deep learning algorithms, these models are good for tabular

METHODOLOGY
The research methodology consists of four steps as follow which are illustrated in Figure ( 2): • Selecting 291 LS points.data with fewer features (Shi et al., 2020).In this study, We have considered the twelve influencing factors in LS, and also a number of the sample points of subsidence were collected using radar interferometry.The data was divided into two parts, including training and test to train XGBoost and RF models and evalaute their results.The final evaluation of the model was undertaken using Root Mean Square Error (RMSE), R-Squared (R²), and Mean Absolute Error (MAE) values.Finally, the LS susceptibility map of the study area was produced from the two models and a comparison was made between them.The remaining parts of the paper is as follows.Section 2 has concentrated on the discription of study area.Section 3, presents the research methodology.Section 4 elaborates the research results.Finally, section 5 concludes the paper and suggests some directions for future research.
(2) (Beven and Kirkby, 1979) The TWI map was produced using the DEM on the ArcGIS 10.3 -ing data (Breiman, 2001).Grid Search was used to adjust the parameters.n_estimator is the number of trees in the RF, and max_features is the number of features to consider when looking for the best split (Huang et al., 2016).

XGBoost regression
XGBoost is one of the quickest implementations of gradient boosted trees (Lu and Ma, 2020).XGBoost is an iterative decision tree algorithm, which uses residuals to improve the model.First, XGBoost supports parallel computing, second, it also supports regularization, which prevents model overfitting (Huang et al., 2022).Although the model is highly accurate, it platform whose value ranges from 2.33 to 18.07.The Lithology map of the study area was produced by GSI (Figure 3i).There are ten different classes of Geo Unit in this area (Table 2).The curvature map was shown in Figure (3e) whose values range from −12.16 to 10.24.Finally, the map layers had created in grids of 20 Meter * 20 Meter size in order to harmonize the data.

RF regression
RF is developed by Breiman (Breiman, 2001;Wang et al., 2020).RF Regression is a supervised ML decision tree-based algorithm, where the decision trees form with random samples from the train can be easily overfit.For this purpose, n_estimators must be Controlled.The eta is used to control the rate of iterations and prevent overfitting.Subsample controls the proportion of the extraction of example.xgb_model parameter is used for selecting a weak evaluator.The objective, max_depth, the alpha and lambda are used to select the loss function, specifies the maximum depth of each tree, control L1 and L2 regularization terms, respectively.We used Grid Search method to adjust the parameters (Table 3).Finally, to find the most important influential factors of the LS susceptibility, mean decrease in impurity (MDI) was calculated for RF (Figure 4) and XGboost models (Figure 5).The results showed that DtF, elevation and GWD hold the greatest impact on the LS occurrence. (2)

Validation
To assess the efficiency of the models, R², RMSE, and MAE were employed (Equations (3-5)).R-squared will give an estimate of the relationship between movements of a dependent variable based on an independent variable's movements.It represents the possible bias in the data and predictions.It does not mean whether the selected model is good or bad.The closer the R-squared to one, the better (Cameron and Windmeijer, 1997).
Root Mean Square Error (RMSE) is the standard deviation of the residual (prediction errors) which is one of the most commonly used measures for evaluating the quality of predictions.It tells how concentrated the data is around the line of best fit.Naturally lower values indicate a better fit for the model (Barnston, 1992).
MAE is the mean of the absolute errors which is the absolute value of the difference between the predicted and the measured values (Eq.5) (Willmott and Matsuura, 2005).

MAE =
∑ �y � i -y i � n i=1 n where y � i = vector of predicted dependent variables with n data points y i = vector of observed values of the variable being predicted y � i = mean of the observed dependent variables

RESULTS AND DISCUSSION
By utilizing the twelve influential factors maps produced, the selected LS points and the methods mentioned in Section 3.3, the mapping and assessment of LS susceptibility for Kashan plain were undertaken.RMSE, R², and MAE were calculated for RF and XGboost models.Table (4) demonstrates the comparison of RMSE, R², and MAE for each models used.The results showed that the XGBoost had higher R² (0.9032) compared with that of the RF ( 0.8355).XGBoost model had less RMSE (0.3764 cm) than that of the RF model (0.4906 cm).MAE for the XGBoost model was equal to 0.1217 cm and for the RF model was equal to 0.3050 cm. Figure ( 6) demonstrates the compatibility between the measured data and the predicted data.As a result, the XGBoost model has a higher prediction accuracy than that of the RF model.

Results of the LS susceptibility Maps (LSSMs)
After applying the models and evaluating the accuracies, the maps of the twelve influencing factors, in the form of a stack on QGIS 3.16 platform was produced to be used as the model input.Then the LSSMs were produced.The values of the LSSMs prepared from the models were in the range of 1.441 cm to -7.497

CONCLUTION
Iran Natioal Cartographic Center (NCC) reports show that the annual average subsidence rate has increased in Iran (https://www.ncc.gov.ir/en/).therefore, LS is an important issue in Iran.LS was affected by a number of factors.We selected the most important influencing factors to predict the LS rate in the Kashan Plain, and investigate the relationship of parameters in LS modeling.The conclusions are as follows: • Phenomenon of LS is one of the most threatening natural hazards of the earth, which has great losses on the economy.
For proper assessments of this issue, it is necessary to develop a suitable model of LS that can be used in any In previous studies, the XGBoost model was once used to model the LS of the Beijing Plain, China (Shi et al., 2020), however, the influencing factors used in this study except groundwater level had not been taken into consideration.In this research, we have considered these influential factors as well in modeling.The achieved results prove that the model is well established.• The results have indicated that the XGBoost model had less RMSE (0.3764 cm) than that of the RF model (0.4906 cm).MAE for the XGBoost model was equal to 0.1217 cm and for the RF model was equal to 0.3050 cm.XGBoost had higher R² which was equal to 0.9032 compared to that of the RF which was equal to 0.8355 indicating better compatability between the predicted and measured LS. • As can be seen in both of the models used in this study, the highest rate of LS is in the northwest and west of Kashan Plain and the lowest rate of LS is observed in the south of Kashan Plain.In addition, in places where the DtR, the DtF and DtS are less, a higher LS rate has been observed.According to the GWD map, in the southwestern and the northwestern parts of the study area, the maximum GWD can be observed.Furthermore, in the southwest of Kashan plain, a high subsidence rate has been occured.• The major strength of this study was quality of the ensemble ML algorithms and their optimum prediction results in the LS mapping.There were some limitations in this research such as lack of implementing hydrological modelings which will be considered in our future research.

Figure 1 .Figure 1 .
Figure 1.Map of the study area and location of the employed LS sample points AREA Kashan plain is a part of Kashan city, which ends at Karkas mountain from the south and is located about 240 kilometers south of Tehran between the Longitudes of 51.05 and 51.54 degrees and the Latitudes of 33.45 and 34.23 degrees (Figure 1).Kashan plain with an area of 1570 Square Kilometers includes the city of Kashan, its central part, the cities of Aran and Bidgol and agricultural lands located in the plain.Kashan Plain is one of the least rainfall regions of Iran.The climate of the study area has two classes of arid and semi-arid.The temperature in this area ranges from 16 •C to 22 •C .Moreover, Elevation is between 799 m and 1336 m above mean sea level.• Data preprocessing and production of maps of the twelve influencing factors in LS. • Employing RF and XGBoost ML algorithms to map the LS susceptibility.• Validation of the performance of each model using RMSE, R² and MAE.3.1 Selecting LS points the spatial distribution of several LS regions was shown in Figure (1).The interferometric sysnthethic aperture radar (InSAR) map produced by the Geological Survey and Mineral Exploration of Iran (GSI) with centimeter accuracy in the first half of 2016 was used to prepare sample points of subsidence in the study area.
local upslope area draining through a certain point per unit contour length b = the local slope in radians

Figure 6 .
Figure 6.Compatibility between the measured data and the predicted data (a) XGBoost, (b) RF.

Table 2 .
Lithology of the study area.

Table 2 .
Lithology of the study area.