A COMPARISON BETWEEN THREE CONDITIONING FACTORS DATASET FOR LANDSLIDE PREDICTION IN THE SAJADROOD CATCHMENT OF IRAN

This study investigates the effectiveness of three datasets for predicting landslides in the Sajadrood catchment (Babol County, Mazandaran Province, Iran). The three datasets (D1, D2 and D3) are constructed from fourteen conditioning factors (CFs) obtained from Digital Elevation Model (DEM) derivatives, topographic maps, land use maps and geological maps. Specifically, D1 consists of all 14 CFs, namely altitude, slope, aspect, topographic wetness index (TWI), terrain roughness index (TRI), distance to fault, distance to stream, distance to road, total curvature, profile curvature, plan curvature, land use, stream power index (SPI) and geology. D2 is a subset of D1 consisting of eight CFs, obtained using the Variance Inflation Factor, Gini importance indices and Chi-square factor optimization methods. D3 includes only selected factors derived from the DEM. Three supervised classification algorithms were trained for landslide prediction, namely the Support Vector Machine (SVM), Logistic Regression (LR) and Artificial Neural Network (ANN). Experimental results indicate that D2 performed best for landslide prediction, with the SVM producing the best overall accuracy at 82.81%, followed by LR (81.71%) and ANN (80.18%). Analysis of the factor optimization results indicates that distance to road, altitude and geology were significant contributors to the prediction results, whereas land use, slope, total, plan and profile curvature, and TRI were deemed redundant. The analysis also revealed that sole reliance on Gini indices could lead to inefficient optimization.


INTRODUCTION
Landslides are a type of natural disaster that can have detrimental effects on human livelihood, including the destruction of property, undesirable changes to the environment and human casualties. The damage incurred interferes with many economic and social activities. Various factors can be linked to the cause of landslides, many of which are beyond human control. These include melting of glaciers, excessive rainfall, mining activities, volcanic eruptions and earthquakes (Mousavi et al. 2011; Dou et al. 2015). The ability to predict landslide occurrences is therefore vital, especially for disaster mitigation and management. It would also be beneficial if the contributing factors could be ranked by importance, which would make landslide prediction considerably more useful.
Topographical, geological, and hydrological datasets are active conditioning factors for landslide prediction, but each carries a different level of importance (Mahalingam et al. 2016; Afungang et al. 2017). The prioritization of such factors depends on the characteristics of the study area; hence there is no guideline for selecting any particular factors. [...] one by one. Their results indicate that distance to fault was the least influential factor for prediction and curvature the most influential. The drawback of this work, however, is that the importance of each factor was not ranked before modelling. In a similar work, Sezer et al. (2011) evaluated ANFIS by beginning with three factors and working up to seven. The Receiver Operating Characteristic (ROC) value increased from 67.38% in the first model to 98.52% in the fifth. Again, the significance of each factor was not clearly elaborated. The authors did, however, highlight that plan curvature was the least important factor, while the lithology factor increased the ROC by up to 10%.
It is apparent from the literature that researchers strive to choose the best factors along with a suitable modelling technique (Kalantar et al. 2019; Nguyen et al. 2019). Finding the optimal factor combination and an appropriate modelling approach is crucial, as different factor combinations and model selections can lead to different results. For instance, adding or removing conditioning factors can raise or lower the prediction accuracy of the selected model.
This study investigates the same theme by examining 14 landslide conditioning factors to determine the combination that yields the best prediction. In particular, the Variance Inflation Factor (VIF), Gini importance, and Chi-square were used to evaluate the effectiveness of the factors under consideration. Three datasets were then created: D1, which includes all 14 conditioning factors; D2, which is a reduced version of D1 obtained using factor analysis and importance; and D3, which contains only DEM derivatives (morphometric factors). Three supervised machine learning techniques were used for modelling, namely Support Vector Machine (SVM), Artificial Neural Network (ANN), and Logistic Regression (LR).

STUDY AREA AND DATA USED
The study area chosen for this work is the Sajadrood catchment, located in Babol County within the Mazandaran Province of Iran (Figure 1a). The catchment lies approximately between north latitudes 36°9′ and 36°10′ and east longitudes 52°30′ and 52°40′, and covers an area of approximately 118.8 km². The population is estimated at around 26,809 people (2006 census). The study area consists of dense forests, agricultural areas and paddy fields (Figure 1b). According to the Iranian Meteorological Organization, Sajadrood's temperature ranges from -3°C (February) to 38°C (August), with a long-term average of 17.1°C. The climate of the catchment is cold and mild mountainous, with heavy rainfall throughout the year and an average annual precipitation of 680 mm.
The study area has various types of geological formations, as shown in Figure 1c. Using a 1:25,000-scale topographic map of Sajadrood, we generated a 10 m Digital Elevation Model (DEM) as the primary data source for landslide susceptibility mapping. In this study, 227 landslide inventory points were collected from satellite imagery and field surveys by the Geological Survey of Iran. A randomly selected 70% of the landslide inventory points were used to train three supervised machine learning models, namely the SVM, ANN, and LR. The remaining 30% were reserved for testing the models.
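The random 70/30 partition of the inventory can be illustrated with a short sketch. The seed, the rounding rule and the placeholder identifiers are assumptions made for illustration only; the study does not state how the split was implemented or the exact counts in each subset.

```python
import random


def split_inventory(points, train_fraction=0.7, seed=42):
    """Shuffle the inventory points and split them into training and testing subsets."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumed) so the split is reproducible
    n_train = round(train_fraction * len(shuffled))  # rounding rule is an assumption
    return shuffled[:n_train], shuffled[n_train:]


# 227 placeholder identifiers standing in for the landslide inventory points.
inventory_ids = list(range(227))
train_ids, test_ids = split_inventory(inventory_ids)
```

Fixing the seed keeps the training and testing subsets identical across the three models, so that differences in accuracy are not an artefact of different splits.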
In this work, 14 conditioning factors (Figure 1) derived from the DEM and topographic databases (using ArcMap 10.3) are considered, namely altitude, slope, aspect, topographic wetness index (TWI), terrain roughness index (TRI), stream power index (SPI), distance to fault, distance to stream, distance to road, land use, total curvature, profile curvature, plan curvature and geology. These factors were chosen because they were available and because they were used in relevant works such as Nguyen et al. (2019).

Landslide Conditioning Factors Preparation
The selection and preparation of the conditioning factors follow previous works and are briefly explained in this section. A region's altitude variation has considerable influence on landslide susceptibility. We therefore classified altitude into five classes using the natural breaks scheme. The altitude factor ranges from a minimum height of 74 meters to a maximum of 1500 meters (Figure 1d). Slope is a crucial factor that triggers landslides, as it is a source of stress and instability in steep areas. The slope angle map is hence separated into five classes: (i) 0°-8.4°, (ii) 8.5°-13°, (iii) 14°-17°, (iv) 18°-23°, and (v) 24°-48° (Figure 1e). Slope aspect influences vegetation growth, soil moisture (due to rainfall), wind exposure, and solar radiation. We categorized aspect into nine classes: (i) flat, (ii) north, (iii) northeast, (iv) east, (v) southeast, (vi) south, (vii) southwest, (viii) west, and (ix) northwest (Figure 1f).
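The slope and aspect classes listed above can be sketched as simple lookup functions. This is an illustration only: the upper bound of each slope interval is treated as the class break (an assumption that closes the small gaps between the printed intervals), and the aspect sectors follow the common GIS convention of 45° sectors centred on the compass directions with a negative value meaning flat, which the text does not state explicitly.

```python
from bisect import bisect_left

# Upper bounds (degrees) of the five slope classes; treating them as breaks is an assumption.
SLOPE_UPPER_BOUNDS = [8.4, 13.0, 17.0, 23.0, 48.0]

ASPECT_NAMES = ["north", "northeast", "east", "southeast",
                "south", "southwest", "west", "northwest"]


def slope_class(slope_deg):
    """Return the slope class number (1 to 5) for a slope angle in degrees."""
    if not 0.0 <= slope_deg <= SLOPE_UPPER_BOUNDS[-1]:
        raise ValueError("slope outside the mapped range")
    return bisect_left(SLOPE_UPPER_BOUNDS, slope_deg) + 1


def aspect_class(aspect_deg):
    """Return the aspect class name; negative values are treated as flat (assumed convention)."""
    if aspect_deg < 0:
        return "flat"
    sector = int(((aspect_deg % 360.0) + 22.5) // 45.0) % 8
    return ASPECT_NAMES[sector]
```

For example, `slope_class(20.0)` falls in class (iv) and `aspect_class(350.0)` is assigned to north under these assumptions.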
The topographic wetness index (TWI) measures the tendency of runoff to accumulate and the position where water converges. The terrain roughness index (TRI), on the other hand, indicates slopes that are concave and convex upward, while the stream power index (SPI) measures the intensity and erosive power of surface runoff on a slope. The calculations for these three indices are as follows:
$\mathrm{TWI} = \ln\left(\frac{A_s}{\tan\beta}\right)$
$\mathrm{SPI} = A_s \tan\beta$
$\mathrm{TRI} = \sqrt{\left|\max^2 - \min^2\right|}$
where $A_s$ = area of catchment in m², $\beta$ = gradient of the slope in radians, and max, min = the largest and smallest values of a pixel in a nine-cell rectangular altitude neighbourhood. SPI, TWI, and TRI are then classified into five classes (Figure 1o, g, h).
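The three indices can be illustrated numerically. The sketch below assumes the commonly used forms TWI = ln(A_s / tan β), SPI = A_s · tan β and TRI = sqrt(|max² − min²|), with A_s the catchment area, β the slope gradient in radians, and max and min taken over a nine-cell altitude neighbourhood. The function names are hypothetical, and this is not the ArcMap workflow used in the study.

```python
import math


def twi(catchment_area, slope_rad):
    """Topographic wetness index; undefined for perfectly flat cells (tan(0) = 0)."""
    return math.log(catchment_area / math.tan(slope_rad))


def spi(catchment_area, slope_rad):
    """Stream power index."""
    return catchment_area * math.tan(slope_rad)


def tri(window):
    """Terrain roughness index from the altitudes of a nine-cell neighbourhood."""
    highest, lowest = max(window), min(window)
    return math.sqrt(abs(highest ** 2 - lowest ** 2))
```

In practice, flat cells need special handling before the TWI is computed, for example by adding a small constant to the slope; that choice is left open here.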
Landslides commonly occur along faults, rivers, and roads, mainly as a result of soil erosion and human activities. In this work, we follow the classification of Hong et al. (2018) and Golkarian et al. (2018). The distances to faults, streams, and roads were calculated using the Euclidean distance function in ArcGIS and separated into five classes (Figure 1i, j, k). Different land use types can be a sign of human activities and/or environmental changes, which can influence ground shape and stability. The land use map of the study area contains six categories, namely (i) agriculture, (ii) paddy field, (iii) residential land, (iv) orchards, (v) dense forest, and (vi) harvested forest. The map was produced by supervised classification of a Landsat Enhanced Thematic Mapper (2017) image, with an accuracy of 90%.
While surface curvature reflects the shape of the ground surface, which affects runoff, profile curvature affects the velocity of the water draining the surface, which in turn influences erosion and deposition. Plan curvature reflects slope steepness in the horizontal plane, which influences surface runoff characteristics. Total curvature is the curvature of the surface itself, which by definition equals the sum of the profile and plan curvatures. Further details regarding curvatures, including equations, can be found in the literature (Alkhasawneh et al. 2013). In this work, the total, profile, and plan curvature maps were classified into three categories: (i) concave, (ii) flat, and (iii) convex (Figure 1l, m, n).

METHODOLOGY
The datasets used in this research are shown in Table 1. Dataset D1 includes all 14 conditioning factors. D2 is a reduced version of D1 containing eight conditioning factors, which were selected by applying three factor optimization techniques, namely Variance Inflation Factor (VIF), Gini importance indices, and Chi-square. Lastly, the third dataset, D3, includes only DEM-derived factors. Figure 2 summarizes the methodology of this research as a flowchart.

The importance of Factor Analysis
Selecting suitable conditioning factors is essential for producing accurate landslide susceptibility maps. Multicollinearity, outliers, and spatial variations of conditioning factors are issues that necessitate factor analysis in susceptibility assessment. This type of analysis enables the removal of redundant factors, which makes constructing and training any model simpler (Kalantar et al. 2017). In this work, the approach of discarding highly correlated features was adopted, using an estimate of the variance inflation factor (VIF): $\mathrm{VIF} = \frac{1}{1 - R'^2}$ where $R'$ = the multiple correlation coefficient between features. VIF values of 5 or 10 and higher suggest highly correlated factors. Such features are deemed unsuitable and are consequently removed from consideration (O'Brien 2007). In addition to factor analysis, other techniques for handling data redundancy are the Chi-square factor optimization and Gini importance methods. A higher Chi-square value indicates a more important factor for predicting landslides. In this work, the p-value was evaluated against a 0.05 level of significance, which establishes whether there is a significant relationship between landslide occurrence and a particular conditioning factor. In addition, the Gini coefficient and Cramér's V statistics (both ranging from 0 to 1) are computed for each factor. For the Gini coefficient, a value of 0 indicates that all the variables are equal, whereas a value of 1 denotes inequality among the variables. In contrast, Cramér's V measures the correlation between landslide conditioning factors: 0 implies no correlation, whereas 1 indicates perfect correlation. Therefore, the highest value of Cramér's V reveals the highest correlation between the factors, while the highest value of the Gini coefficient represents a lower correlation.
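The Chi-square statistic and Cramér's V can be sketched for a single conditioning factor as follows. The contingency table is hypothetical (rows are classes of one factor, columns are landslide and non-landslide counts) and does not come from the study. The code uses the standard Pearson Chi-square statistic and the usual definition V = sqrt(χ² / (n · (min(rows, columns) − 1))), and compares the statistic against the tabulated 0.05 critical value instead of computing a p-value.

```python
import math

# Chi-square critical values at the 0.05 significance level, keyed by degrees of freedom.
CRITICAL_0_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square(table):
    """Pearson Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # must be non-zero
            stat += (observed - expected) ** 2 / expected
    return stat


def degrees_of_freedom(table):
    return (len(table) - 1) * (len(table[0]) - 1)


def cramers_v(table):
    """Cramér's V, ranging from 0 (no association) to 1 (perfect association)."""
    total = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square(table) / (total * k))


# Hypothetical counts for a three-class factor: (landslide, non-landslide) per class.
toy_table = [[30, 10], [20, 20], [10, 30]]
stat = chi_square(toy_table)
significant = stat > CRITICAL_0_05[degrees_of_freedom(toy_table)]
```

A statistic above the critical value corresponds to a p-value below 0.05, which is the criterion described above for a significant relationship between a factor and landslide occurrence.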

Support Vector Machine (SVM)
The SVM is a machine learning algorithm based on statistical learning theory. It was initially meant for binary classification problems but can be extended to multi-class classification as well.
The SVM operates in a higher dimensional feature space, which is obtained by using a specific kernel function. The intuition behind the algorithm is to discover an optimal separating hyperplane between the positive and negative classes by calculating the maximum margin to the nearest training examples (Cortes and Vapnik 1995). The positive class is annotated as +1, whereas the negative class as -1.
In this work, the positive class refers to landslide and the negative class to non-landslide. Specifically, the algorithm is given a set of n labelled training examples $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{+1, -1\}$, where $x_i$ is the vector of the abovementioned conditioning factors for the i-th example. Depending on the type of data, the SVM's performance is determined by the choice of the kernel function. Commonly used kernels are the RBF (radial basis function), polynomial, sigmoid, and linear kernels. In this work, we opt for the linear kernel due to its simplicity. Overall, the linear separating hyperplane of the SVM can be written as follows: $y_i (w \cdot x_i + b) \geq 1 - \delta_i$ where $w$ = the coefficient vector, which decides the final orientation of the separating hyperplane, $b$ is the hyperplane's offset from the origin, and the slack variable $\delta_i$ penalizes any constraint violation (Cortes and Vapnik 1995).
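The role of the slack variable in the soft-margin constraint y_i (w · x_i + b) ≥ 1 − δ_i can be illustrated with a small sketch. The weights, offset and sample points are hypothetical, and the code only evaluates a given hyperplane; it does not train an SVM.

```python
def decision_value(w, b, x):
    """Signed value w . x + b of a sample relative to the hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def slack(w, b, x, y):
    """Smallest slack that satisfies y * (w . x + b) >= 1 - slack; 0 outside the margin."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


def predict(w, b, x):
    """Return +1 (landslide) or -1 (non-landslide)."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical hyperplane and samples with two features each.
w, b = [1.0, -1.0], 0.0
outside_margin = slack(w, b, [2.0, 0.0], +1)   # correctly classified, no penalty
inside_margin = slack(w, b, [0.5, 0.0], +1)    # correct side but within the margin
misclassified = slack(w, b, [1.0, 0.0], -1)    # wrong side of the hyperplane
```

A slack of zero means the sample lies on or beyond the margin, a value between 0 and 1 means it lies inside the margin on the correct side, and a value above 1 means it is misclassified.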

Logistic regression (LR)
Similar to the SVM, LR is a binary classifier. In the context of this research, LR's main objective is to identify the optimal coefficients associated with each independent variable (i.e., conditioning factor) by discovering relationships with the dependent variable (Ozdemir and Altural 2013), which in this work is landslide vs. no-landslide.
LR does not assume a normal distribution (Pradhan and Lee 2010), and the dependent variable is coded as 1 and 0 to reflect landslide and no-landslide, respectively. Since LR calculates its output using the sigmoid (or logistic) function, the output is a probability value. Specifically, LR determines the probability of a class as follows: $P(y = 1 \mid x; \theta) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}$ where $\theta$ denotes the linear model parameters, which are the coefficients representing the weight contribution of each conditioning factor in $x$, and $g(z)$ is the logistic function that calculates the probability that the input values correspond to the positive class $y = 1$, indicating a landslide. In this work, an input with $g(z) > 0.5$ is assigned to the positive class.
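The logistic function and the 0.5 decision threshold can be sketched as follows. The coefficient values used in any call are hypothetical, and the convention of placing a constant 1 as the first input so that the first coefficient acts as an intercept is an assumption for illustration.

```python
import math


def logistic(z):
    """Numerically stable logistic function g(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def landslide_probability(theta, x):
    """Probability of the positive class (landslide) for one feature vector."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return logistic(z)


def classify(theta, x, threshold=0.5):
    """Return 1 (landslide) when the probability exceeds the threshold, else 0."""
    return 1 if landslide_probability(theta, x) > threshold else 0
```

Because g(z) > 0.5 exactly when z > 0, the threshold rule is equivalent to checking the sign of the linear combination of the conditioning factors.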

Artificial neural network (ANN)
In contrast to statistical models, the ANN is independent of the statistical distribution of the data and hence does not require the calculation of any statistical variable (Pradhan and Lee 2010). ANNs can also generalize when dealing with imperfect or incomplete data in nonlinear problems (Tian et al. 2019). In this study, a Multi-Layer Perceptron (MLP) ANN was trained, with the weights learned using the backpropagation algorithm. The MLP is a widely used architecture that consists of three main components, namely the input layer (input data), the output layer (which provides the prediction results), and one or more hidden layers that connect the input and output (Aditian et al. 2018). Training the MLP-ANN begins with random weight assignments for each neuron. Learning occurs through continuous updates of the weights and stops upon reaching an acceptable training accuracy. The weights are updated by minimizing an error function that calculates the difference between the predicted and actual output values. Readers seeking more insight into the algorithm are referred to Kim et al. (2014).
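The weight update described above can be sketched for a minimal MLP with one hidden layer, sigmoid activations and a squared-error function. The network size, initial weights, learning rate, stopping tolerance and input vector are all assumptions chosen for illustration; bias terms are omitted, and the architecture used in the study is not specified in the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return the hidden activations and the network output for one input vector."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def squared_error(x, target, w_hidden, w_out):
    return 0.5 * (forward(x, w_hidden, w_out)[1] - target) ** 2


def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One backpropagation update that reduces 0.5 * (output - target)^2."""
    hidden, output = forward(x, w_hidden, w_out)
    delta_out = (output - target) * output * (1.0 - output)
    new_w_out = [w - lr * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = []
    for row, w_o, h in zip(w_hidden, w_out, hidden):
        delta_h = delta_out * w_o * h * (1.0 - h)  # error propagated back to the hidden unit
        new_w_hidden.append([w - lr * delta_h * xi for w, xi in zip(row, x)])
    return new_w_hidden, new_w_out


# Hypothetical sample: three scaled factor values and a landslide label of 1.
x, target = [1.0, 0.5, -0.3], 1.0
w_hidden = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.1]]
w_out = [0.2, -0.3]

error_before = squared_error(x, target, w_hidden, w_out)
w_hidden_1, w_out_1 = train_step(x, target, w_hidden, w_out)
error_after = squared_error(x, target, w_hidden_1, w_out_1)

# Repeat the update until the error is acceptably small or an iteration cap is reached.
wh, wo = w_hidden_1, w_out_1
for _ in range(1000):
    if squared_error(x, target, wh, wo) < 1e-3:
        break
    wh, wo = train_step(x, target, wh, wo)
```

The loop mirrors the stopping rule in the text: weights are updated repeatedly and training ends once the error falls below a chosen tolerance.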

Accuracy Assessment
The metrics used for classifier evaluation are Overall Accuracy (OA), the Kappa statistic, the Receiver Operating Characteristic (ROC), and the Prediction Rate Curve (PRC) area. OA determines the proportion of sites that have been correctly mapped. It is obtained by dividing the number of correctly classified pixels by the total number of pixels, and is expressed as a percentage. According to Shafii and Price (2001) and Viera and Garrett (2005), Cohen's Kappa expresses the degree of agreement between observed and predicted values. A Kappa of 1 indicates the best agreement in the model. Based on Tsangaratos and Ilia (2016), the ROC curve plots the true positive rate (i.e. the rate at which the model correctly predicts landslide) against the false positive rate (i.e. the rate at which the model predicts non-landslide as landslide). The area under the ROC curve (AUC) indicates a classifier's overall accuracy: an area of 0.5 indicates a weak classifier, whereas an area of 1 indicates a flawless one (Beguería 2006). The Prediction Rate Curve is a plot in which the vertical y-axis is the success rate (i.e. truly detected landslides) and the horizontal x-axis is the total positive landslide-prone area. It is also used to determine the predictive power of a classifier (Beguería 2006; Pourghasemi and Rossi 2019); as with the ROC, the area varies from 0 to 1.
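Three of these metrics can be sketched in a few lines. The code uses the standard definitions: OA as the percentage of matching labels, Cohen's Kappa as (p_o − p_e) / (1 − p_e) with p_e the agreement expected by chance, and AUC computed as the fraction of landslide/non-landslide pairs in which the landslide sample receives the higher score. The function names are hypothetical, and labels are assumed to be coded 1 (landslide) and 0 (non-landslide).

```python
def overall_accuracy(actual, predicted):
    """Percentage of samples whose predicted label matches the observed label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


def cohen_kappa(actual, predicted):
    """Agreement between observed and predicted labels, corrected for chance."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    labels = set(actual) | set(predicted)
    expected = sum((actual.count(c) / n) * (predicted.count(c) / n) for c in labels)
    return (observed - expected) / (1.0 - expected)  # undefined when expected == 1


def roc_auc(actual, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))
```

The pairwise form of the AUC is quadratic in the number of samples, which is acceptable for an inventory of a few hundred points but would be replaced by a rank-based computation for larger rasters.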

RESULTS
First, the importance of each conditioning factor was investigated by analysing the VIF and Gini importance indices. The former is shown in Table 2 and the latter in Table 3. In Table 2, VIF values below 10 indicate low correlation, whereas VIF values above 10 suggest higher correlation. It can also be seen that most of the Tolerance values are higher than 0.1, indicating less correlation between the factors (the exceptions being land use and aspect).
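The screening thresholds mentioned above can be sketched as follows, assuming the standard definitions VIF = 1 / (1 − R²) and Tolerance = 1 − R² = 1 / VIF, where R² is obtained by regressing one factor on all the others. The R² values passed to the functions are illustrative inputs and are not taken from Table 2.

```python
def vif(r_squared):
    """Variance inflation factor for a factor regressed on the other factors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def tolerance(r_squared):
    """Tolerance, the reciprocal of the VIF under the standard definition."""
    return 1.0 - r_squared


def is_collinear(r_squared, vif_limit=10.0, tolerance_limit=0.1):
    """Flag a factor when its VIF exceeds 10 or its Tolerance falls below 0.1."""
    return vif(r_squared) > vif_limit or tolerance(r_squared) < tolerance_limit
```

For instance, an R² of 0.5 gives a VIF of 2 and is retained, whereas an R² of 0.95 gives a VIF of about 20 and would be flagged as redundant under these thresholds.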
In Table 3, higher Chi-square values with a p-value less than 0.05 indicate that a factor is significant for landslide prediction. Specifically, the Chi-square analysis ranks distance to road, altitude, and geology above all other factors, and land use as the least important factor. The results of the Gini importance indices include the information value (IV), Cramér's V and Gini coefficient values. Higher IV values can be seen for distance to road, geology and TWI, which marks them as "strong" predictors for mapping landslide-prone zones. Cramér's V, on the other hand, is below 0.3 for all factors except distance to road (0.69), which indicates an insignificant correlation for the remaining factors. The Gini coefficient values indicate a slight correlation between all factors (all values ~ 0.5). The degree of correlation, however, was higher for distance to road (0.26), a value closer to zero.
As previously mentioned, three datasets (D1, D2, and D3) are considered in order to see which one provides the best representation for landslide susceptibility mapping. D1 consists of all 14 conditioning factors. The intuition behind D2 and D3 is to see whether a reduced set of CFs can also achieve good accuracy. Hence, for the purpose of optimization, redundant factors are removed prior to modelling, which is consistent with the work of Mousavi et al. (2017). Because the VIF, Gini, and Chi-square results differed, we retained the factors on which the methods most commonly agreed. Consequently, slope, total curvature, plan curvature, profile curvature, TRI, and land use were removed to create D2. D3 included only DEM-derived factors.
The three classification models (SVM, ANN, and LR) were constructed for each of the three datasets. Table 4 shows that all three algorithms performed comparably well in mapping the spatial distribution of landslide-prone areas. The SVM performed best, with an overall accuracy (OA) of 82.81% using dataset D2, compared with ANN (OA = 80.18%) and LR (OA = 81.71%). Dataset D1, which contains all 14 factors, showed inferior performance in OA. Additionally, using D3, the accuracy dropped to 62.55%, 66.96%, and 60.35% for the SVM, ANN, and LR models, respectively, which indicates that the DEM-derived dataset is not a suitable representation.
Generally, the Kappa statistics in Table 4 showed "substantial agreement" between the observations (ground truth or inventory map) and predictions (landslide susceptibility map) for the three models using the D1 and D2 datasets, while the degree of agreement declined abruptly to "fair agreement" for all models using D3. This is in line with the interpretation provided by Viera and Garrett (2005). It again shows that the DEM-derived data in D3 alone are insufficient for training the classifiers. The AUC (Table 4) shows promising levels for all three models when considering D1 and D2.
For instance, using D1, a maximum AUC of 0.88 was obtained by ANN and LR, whereas the LR model performed better using D2, at 0.89. In contrast, all three models reported only "moderate accuracy" for D3. Likewise, the prediction power and success rate for true positives of the three models were evaluated by PRC (Table 4), and the results indicated that the best performance was obtained by ANN (0.88) using D1. Prediction performance, however, was lowest for the SVM (0.58) when using D3. Overall, all accuracy assessments and validation techniques agreed that SVM, ANN and LR were unreliable for accurate landslide susceptibility mapping when using D3, while almost all three models performed better when using D1 and especially D2. In terms of processing time, the LR model using D1 performed ~2.67 times faster than when it was applied to the D2 dataset. The LR technique ran in 0.03 seconds using the optimized factors (D2) and ranked as the fastest algorithm compared with SVM (0.14 s) and ANN (0.27 s). This again confirms the importance of factor optimization for landslide-prone zones, where large sets of data, variables and conditioning factors are involved.
For a better understanding of each conditioning factor and of the reliability of our predictions, we omitted each factor in turn from the models. Table 5 indicates that removing distance to road alone had a significant effect on the level of agreement between the observed and predicted landslide areas, so the final map may be unreliable without this particular factor. This also confirms the uncertainty associated with the Cramér's V and Gini coefficient results.
Table 3. Accuracy assessment and validation of SVM, ANN, and LR.
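The factor-omission experiment described above can be outlined as a generic loop. Here `evaluate` is a hypothetical callable that trains a model on the given factor list and returns an agreement score such as Kappa; the toy scorer and its weights are placeholders used only to make the sketch runnable and are not results from the study.

```python
def leave_one_factor_out(factors, evaluate):
    """Return the drop in score caused by omitting each factor in turn."""
    baseline = evaluate(factors)
    drops = {}
    for factor in factors:
        reduced = [f for f in factors if f != factor]
        drops[factor] = baseline - evaluate(reduced)
    return drops


# Placeholder scorer: sums arbitrary weights; a real run would train and validate a classifier.
TOY_WEIGHTS = {"factor_a": 0.5, "factor_b": 0.25, "factor_c": 0.125}


def toy_evaluate(factors):
    return sum(TOY_WEIGHTS[f] for f in factors)


drops = leave_one_factor_out(list(TOY_WEIGHTS), toy_evaluate)
most_influential = max(drops, key=drops.get)
```

A large drop for a single factor signals that the model depends heavily on it, which is the pattern reported in the text for distance to road.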

DISCUSSION
The elevated VIF values for slope, total curvature, plan and profile curvature, and TRI indicate collinearity and redundancy in the datasets. Tolerance values less than 0.1 also indicated the presence of multicollinearity in land use and aspect. The Chi-square method, on the other hand, categorized distance to road and land use as the best and worst factors, respectively. In contrast, the Gini indices produced contradictory results: Cramér's V and the Gini coefficient concluded that distance to road was a redundant variable, whereas IV, reflecting a higher degree of inequality, rated distance to road as an influential factor. Bergsma (2013) noted that Cramér's V can be biased when Chi-square increases, so the result may overestimate the degree of association. To verify that distance to road is essential, we removed it from the SVM, ANN and LR models (Table 5) and computed the Kappa index. Kappa decreased dramatically when distance to road was removed. This indicates that distance to road was indeed a critical factor (in line with Mousavi et al. 2011).
The accuracies of the models were evaluated using datasets D1, D2, and D3. All three models performed well using the D1 and D2 datasets. The SVM using the optimized factors (i.e. D2) outperformed the others in terms of overall accuracy and Kappa. This implies that removing redundancy through factor optimization leads to better classification performance. The LR algorithm shows identical accuracy and Kappa using D1 and D2, which is attributed to the coefficient matrix that evaluates and excludes data during the logistic regression process (Mousavi et al. 2017). For the same reason, the evaluation results for VIF and Chi-square agreed with the LR coefficient matrix in eliminating data redundancy. Validation using PRC shows that the ANN achieved the highest prediction accuracy and performed well compared to SVM and LR. We therefore expect ANN to be a reliable alternative when dealing with uncertain, noisy and insufficient conditioning factors. Finally, the AUC confirmed that all three algorithms performed well, with LR showing the best overall performance using the D2 dataset.
Two experiments by Pradhan and Lee (2010) and Sezer et al. (2011), which were discussed earlier in this article, applied the ANFIS algorithm with almost the same conditioning factors for susceptibility mapping in different study areas. Across these works, the significance of the conditioning factors varied, and the factor considered most important in one study was labelled the least important in the other. Although our study had only four practical factors in common with these studies, we obtained a good level of accuracy using other conditioning factors as well. Thus, prior factor optimization in our research helped to avoid overtraining the algorithms and heavy computation and modelling, especially when dealing with a large area and several conditioning factors.

CONCLUSION
Three supervised learning models (SVM, ANN, and LR) were constructed for each dataset. The primary objective was to determine which dataset was most representative for landslide prediction. The first dataset, D1, considered 14 conditioning factors; the second, D2, had a reduced set of 8 factors; and the third, D3, included only DEM-derived factors. VIF, Chi-square, and the Gini importance indices (IV) were used to prioritize the conditioning factors, since there is no standard guideline for ranking them and their importance depends heavily on the characteristics of the study area. Factor optimization ultimately highlighted distance to road, altitude and geology as significant contributing factors. Slope, plan curvature and profile curvature, which seemingly affect the erosion process more than other factors in many similar studies (Pradhan and Lee 2010), were found to be insignificant for this case study. For this particular area, the importance of distance to road indicates that most of the predicted and observed landslides were located in areas close to roads, so road construction may trigger the hazard more than other factors. The SVM model obtained the best accuracy and Kappa, 82.81% and 0.65, followed by LR (81.71%) and ANN (80.18%) using D2. The same holds for D1, where SVM (82.15%) achieved the best result even though LR had a hidden factor optimization layer. For this study, SVM was therefore confirmed as the best classifier for mapping landslide susceptibility. Again, none of the algorithms reached an acceptable level of accuracy using D3, although ANN coped better with this incomplete dataset.
In brief, the availability of data from different remote sensing sources means that landslide hazard prediction must deal with massive data and many conditioning factors; the quality and speed of modelling therefore necessitate factor optimization in advance. The outcome of this research emphasizes that the importance of landslide causative factors differs from one site to another and can be changed remarkably by human activities; the choice of optimizer can also directly affect the optimization results. The site dependency of landslide conditioning factors and the choice of optimizers mean that even a group of conditioning factors previously used for a particular zone might not be successfully applied to another region. Therefore, for a reliable result, the use of all available datasets in a study area is highly beneficial. Moreover, without proper optimization algorithms, a factor should not be omitted even if it was tagged insignificant by other researchers. In this study specifically, road construction was the main source of improper human activity in residential areas at lower altitude. It is thus recommended to use more than one optimizer prior to classification. For governmental organizations and private companies involved in road construction, more attention is needed during transport network construction and maintenance in Sajadrood owing to its geology and unstable soil type. Lastly, these findings underline the importance of landslide mitigation and early warning systems for reducing casualties and losses where construction is inevitable.
Table 4. The importance of factors using Gini importance and chi-square techniques.