FLOOD SUSCEPTIBILITY MAPPING AND ASSESSMENT USING REGULARIZED RANDOM FOREST AND NAÏVE BAYES ALGORITHMS

Floods cause significant socio-economic damage and are extremely dangerous for human lives as well as infrastructure. The aim of this study is to use two machine learning models, regularized random forest (RRF) and Naïve Bayes (NB), to predict flood-susceptible areas using 410 sample points (205 flood points and 205 non-flood points). Ten flood-influencing factors, namely elevation, topographic wetness index, rainfall, normalized difference vegetation index, curvature, land use, distance to river, slope, lithology, and aspect, were used in the modelling process. For this purpose, 70% of the data was used for training and the rest for testing the models. Accuracy (ACC), sensitivity, specificity, negative predictive value (NPV), and the area under the curve (AUC) of the receiver operating characteristic (ROC) were used to validate and compare the performance of the models. The results showed that the RRF model had the higher overall performance on the testing dataset (AUC = 0.94, ACC = 90%, Sensitivity = 0.89, Specificity = 0.92, NPV = 0.89) compared to the NB model (AUC = 0.93, ACC = 89%, Sensitivity = 0.84, Specificity = 0.96, NPV = 0.81). The employed models can be used as an efficient tool for flood susceptibility mapping to support planning that reduces flood damage.


INTRODUCTION
Floods have always been part of the natural hydrological cycle. However, a trend of increasing recurrence and magnitude of floods has been recorded in recent decades (Kundzewicz et al., 2013). Global warming, i.e. rising temperatures, contributes to a number of natural disasters such as hurricanes (Wolf et al., 2020) and floods (Trenberth, 2008). Among natural disasters, floods are the most devastating events, affecting millions of people around the world (Shen and Hwang, 2019). Floods result in nearly 20,000 fatalities and leave 75 million people homeless each year (Khosravi et al., 2016). However, floods are inevitable, and managing and forecasting future floods can be an important step to reduce the damage (Cloke and Pappenberger, 2009; Tehrany et al., 2015). Therefore, producing a flood susceptibility map (FSM) is an important step for flood damage assessment and supports informed decision making in flood disaster management.
In recent years, several approaches have been developed for flood susceptibility modelling (Bahremand et al., 2007). Hydrological methods are unable to model flood-susceptible areas accurately because they use linear hypotheses and require long-term hydrological data for monitoring. Therefore, these methods are unsuitable for the study of large catchments (Liu and Smedt, 2004; Tehrany et al., 2013). Thus, researchers have pursued flood modelling with statistical models such as frequency ratio (Cao et al., 2016; Rahmati et al., 2016). Recently, machine learning (ML) approaches have been used to produce flood susceptibility maps because they perform better than other models (Bui et al., 2018; Chen et al., 2021; Panahi et al., 2021). These models capture the relationship between the target variable and the flood factors more reliably. Among ML algorithms, the adaptive neuro-fuzzy inference system (ANFIS), artificial neural network (ANN), support vector machine (SVM) and random forest (RF) have been frequently used for flood modelling in different basins of the world.
A number of studies in this field have been presented in recent years. For example, Khosravi et al. (2018) evaluated the efficiency of decision tree models, including logistic model trees (LMT) and alternating decision trees (ADT), for mapping flash-flood-susceptible areas in the Haraz Watershed in the northern part of Iran. The results showed that the ADT method, one of the advanced decision tree (DT) methods, had superior performance and reached an AUC value of 0.976. Esfandiari et al. (2020) developed the Pseudo Supervised Random Forest (PS-RF) model for flood hazard risk mapping in Fredericton, Canada. The accuracy of this model was evaluated with the true positive rate, true negative rate, false positive rate, accuracy, Cohen's Kappa coefficient and Matthews Correlation Coefficient. Lei et al. (2021) scrutinized the applicability of the recurrent neural network (RNN) and convolutional neural network (CNN) models for urban flood susceptibility mapping. The findings showed better prediction performance for the CNN method. Furthermore, the terrain ruggedness index was the most influential factor in flood inundation in Seoul, South Korea. Ahmadlou et al. (2022) used the classification and regression tree (CART) model with Genetic Algorithm and Grid Search to optimize the parameters for flood modelling, and the achieved AUC (CART-GS) was equal to 0.927. Habibi et al. (2022) evaluated the efficiency of the Chi-square automatic interaction detection (CHAID) algorithm for flood susceptibility assessment in the Sardabroud watershed.
The main objective of this study is flood susceptibility mapping and its assessment using regularized random forest (RRF) and Naïve Bayes (NB) as two machine learning algorithms in the Sardabroud watershed, Mazandaran Province, Iran.

Study area
Sardabroud watershed includes the city of Kelardasht, which has a population of over 50,000 people. Kelardasht is located in the west of Mazandaran Province, Iran. A number of flash floods have occurred in this area, damaging roads, houses and other infrastructure and causing loss of life and property. The watershed has an area of 460 km², an altitude ranging from -31 m to 4816 m, and slopes reaching above 66 degrees (Figure 1). The watershed originates from the Alam-Kuh mountain. The average annual rainfall of the watershed is around 840 mm, and its average temperature varies between -5 ºC and 30 ºC.

Method
The process for flood susceptibility modelling includes (a) producing a flood inventory map using historical data of 205 flood and 205 non-flood sample points, (b) determining flood-influencing factors and investigating their correlations, (c) employing the ML models, (d) producing the flood susceptibility map, and (e) evaluating the accuracy and reliability of the models and selecting the best model for flood susceptibility mapping as presented in

Data used
For production of the FSM, producing the flood inventory map (FIM) and determining the spatial relationships among flood occurrences and the spatial correlations among their influencing factors are very important. The FIM was produced using historical data of 410 flood and non-flood points. Based on previous research (Ahmadlou et al., 2022; Lei et al., 2021), data availability and the geo-environmental characteristics of the study area, 10 influencing factors including elevation, normalized difference vegetation index (NDVI), rainfall, topographic wetness index (TWI), land use, curvature, distance to river, slope, lithology and aspect were considered for the modelling (Figures 4 and 5). The lithology map was prepared at 1:50000 scale. The land use and NDVI maps were obtained from Landsat 8 at 1:250000 scale. The employed data are as follows:
-Elevation: Elevation is generally inversely related to floods (Fernandez and Lutz, 2010). The ground generally becomes flatter as elevation decreases and the amount of water carried by rivers increases. Therefore, the occurrence of floods increases with decreasing elevation.
-Aspect: Aspect is a morphometric factor (Pham et al., 2021; Costache et al., 2021) derived in this study from the ASTER Digital Elevation Model (DEM) of 30 m resolution. The aspect layer can indirectly affect the occurrence of floods.
-Rainfall: Rainfall, as the main source of runoff at ground level, results in flooding in low-lying areas. Both short-term torrential rains and long-term low-intensity rains can cause floods (Rodda, 2011; Cao et al., 2016). Many researchers have considered rainfall an influential factor in flooding (Prasad et al., 2021; Saha et al., 2021). In this study, the proximity of the study area to the Caspian Sea is the main reason for considering rainfall in flood susceptibility modelling.
-TWI: TWI is a hydrological factor defined from the ratio of the specific catchment area to the slope angle (Wilson and Gallant, 2000). The TWI provides a measure of water accumulation and flood potential for each pixel.
-Slope: The slope factor is calculated from the DEM based on the grid resolution and height range. As the slope increases, floods gain destructive power, whereas flat ground has a constant risk of flooding due to its zero slope (Torabi Haghighi et al., 2018).
-Distance to river: Areas close to rivers are more prone to flooding (Eini et al., 2020). Proximity to rivers is one of the influential factors of floods due to the dependence of floods on groundwater reserves (Janizadeh et al., 2019).
-Curvature: A number of researchers have considered curvature an important conditioning factor for floods (Ahmadlou et al., 2021). Runoff is accelerated or reduced depending on the shape of the slope. Convex slopes accelerate ground flow and may also affect soil infiltration and saturation.
-NDVI: NDVI is a measure of the vegetation characteristics of an area, and there is a negative correlation between flood occurrence and vegetation density (Prasad et al., 2021; Saha et al., 2021).
-Lithology: Floods can be affected by lithology and geological structures due to their impact on soil porosity and permeability. Permeability is very low on very resistant rocks, and as a result there is a higher potential for flooding.
-Land use: The land use factor can affect runoff. In areas with high vegetation density, surface runoff decreases compared to that of residential areas (Dodangeh et al., 2020; Saha et al., 2021).
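
The TWI definition in the list above can be illustrated with a minimal sketch. It assumes the common form TWI = ln(a / tan β), where a is the specific catchment area and β the local slope angle. The function name, the parameter names and the small minimum slope used to avoid division by zero on flat cells are hypothetical choices for illustration, not taken from this study.

```python
import math


def topographic_wetness_index(specific_catchment_area, slope_deg, min_slope_deg=0.1):
    """Return TWI = ln(a / tan(beta)) for one raster cell.

    specific_catchment_area: upslope area per unit contour width (assumed in m).
    slope_deg: local slope angle in degrees.
    min_slope_deg: hypothetical lower bound that keeps tan(beta) above zero
                   on flat cells (an assumption, not part of the study).
    """
    slope = max(slope_deg, min_slope_deg)
    return math.log(specific_catchment_area / math.tan(math.radians(slope)))


if __name__ == "__main__":
    # Illustrative values only: the same catchment area on a gentle and a steep slope.
    gentle = topographic_wetness_index(100.0, 5.0)
    steep = topographic_wetness_index(100.0, 30.0)
    print(f"gentle slope TWI: {gentle:.3f}, steep slope TWI: {steep:.3f}")
```

For a fixed catchment area, gentler slopes give larger TWI values, which matches the reading of TWI as a measure of water accumulation.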

Machine learning models
In this study, two ML models, RRF and NB, were used for flood susceptibility modelling in the Sardabroud watershed, Mazandaran Province, Iran.

NB:
The Bayesian method classifies phenomena based on their probability of occurrence or non-occurrence. It offers good results after initialization based on the intrinsic properties of probability, especially conditional probability (Rish, 2001). Bayes' theorem offers a way to calculate the posterior probability P(c | x) from P(c), P(x) and P(x | c). Naïve Bayes classifiers assume that the effect of the value of a particular attribute on a class is independent of the values of the other attributes.
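
The relationship described above can be written compactly. In this sketch, c denotes the class (flood or non-flood) and x = (x_1, ..., x_n) the values of the n influencing factors at a location; the symbols n and x_i are notation introduced here for illustration.

```latex
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)},
\qquad
P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
```

The second expression is the independence assumption: the class-conditional likelihood factorizes over the individual attributes.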

RRF:
The RRF algorithm was developed for feature selection. In this model, each feature of the training data set is evaluated at each tree node, and the feature selection process of the RRF model is greedy. Applied to the random forest model, the tree regularization framework underlying RRF can select a compact feature subset (Deng and Runger, 2013). The RRF algorithm is built on the random forest method; the main difference is that RRF uses the regularized information gain, GainR(Xi, v) (Band et al., 2020) (Eq. 1).
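
The regularized information gain is usually written, following the formulation of Deng and Runger (2013), in the form sketched below. Here F is the set of features already used for splitting in previous nodes and λ is the penalty coefficient; both symbols are assumed from that formulation rather than defined in this text.

```latex
\mathrm{Gain}_R(X_i, v) =
\begin{cases}
\lambda \cdot \mathrm{Gain}(X_i, v), & X_i \notin F \\
\mathrm{Gain}(X_i, v), & X_i \in F
\end{cases}
\qquad \lambda \in (0, 1]
```

A feature not yet in F must therefore offer a clearly higher gain than the features already selected before it is added, which is what keeps the selected subset compact.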

Validation methods
Variance Inflation Factor (VIF) and Tolerance (TOL) are statistical measures that detect strong linear relationships among more than two factors (Hong et al., 2020). The VIF and TOL equations are given in Eq. 2. Evaluating the predictive performance of ML models is an essential step in generating an FSM; without it, the modelling process is unreliable (Panahi et al., 2021). In this research, negative predictive value (NPV), accuracy (ACC), sensitivity, specificity and the AUC were used to compare the prediction ability of both models (Eqs. 3-7) (Rahmati et al., 2019). These assessment techniques have been widely described in previous research (Lei et al., 2021; Pourghasemi et al., 2021). The confusion matrix is presented in Figure 3.
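
The validation measures named above can be sketched from the entries of a confusion matrix. The snippet assumes the standard definitions, with flood points as the positive class, and computes the AUC as the fraction of flood/non-flood pairs ranked correctly. The function names and the counts in the usage example are hypothetical and are not the results of this study.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return ACC, sensitivity, specificity and NPV from confusion-matrix counts.

    tp / fn: flood points predicted as flood / non-flood.
    tn / fp: non-flood points predicted as non-flood / flood.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
    }


def rank_auc(flood_scores, non_flood_scores):
    """Return the AUC as the share of (flood, non-flood) pairs in which the
    flood point receives the higher susceptibility score; ties count as 0.5."""
    wins = 0.0
    for p in flood_scores:
        for n in non_flood_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(flood_scores) * len(non_flood_scores))


if __name__ == "__main__":
    # Illustrative counts and scores only.
    print(confusion_metrics(tp=40, tn=45, fp=5, fn=10))
    print(rank_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

The pairwise AUC is quadratic in the number of points, which is acceptable for a few hundred samples; a sorted-rank formulation would be preferable for large datasets.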

Multicollinearity diagnostics
The results in Table 1 for the 10 factors considered for flood modelling show that land use has the lowest and elevation the highest VIF. Since no factor reaches the critical thresholds (VIF > 10 or TOL < 0.1), there is no multicollinearity among the considered variables (Band et al., 2020), and all the factors were used for the modelling.
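
The threshold check above can be sketched as follows, assuming the standard definitions VIF = 1 / (1 - R²) and TOL = 1 - R² = 1 / VIF, where R² is the coefficient of determination obtained by regressing one factor on all the others. The function names and the R² values in the usage example are hypothetical, not the values of Table 1.

```python
def vif_and_tol(r_squared):
    """Return (VIF, TOL) for a factor whose regression on the remaining
    factors has coefficient of determination r_squared."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    tol = 1.0 - r_squared
    return 1.0 / tol, tol


def has_multicollinearity(vif, tol, vif_limit=10.0, tol_limit=0.1):
    """Flag a factor when it crosses either critical threshold."""
    return vif > vif_limit or tol < tol_limit


if __name__ == "__main__":
    # Illustrative R-squared values only.
    for r2 in (0.20, 0.50, 0.95):
        vif, tol = vif_and_tol(r2)
        print(f"R2={r2:.2f}  VIF={vif:.2f}  TOL={tol:.2f}  "
              f"collinear={has_multicollinearity(vif, tol)}")
```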

Flood Susceptibility Map
The FSMs produced by the NB and RRF models were classified into 5 categories, namely very low, low, moderate, high, and very high (Figures 6 and 8). The results showed that the low susceptibility category was the largest for both models, covering 24% of the total area for the NB model and 22% for the RRF model. According to the NB map, 18% of the total area was placed in the very high susceptibility category, 19% in the high category, 19% in the moderate category and 20% in the very low susceptibility category (Figure 6). For the RRF model, the very high, high, moderate, low, and very low categories covered 19%, 20%, 20%, 22%, and 19% of the total area, respectively.
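
The reclassification into five categories and the resulting area shares can be sketched as below. The text does not state which classification scheme was used, so quantile (quintile) breaks are assumed here purely for illustration; the function names and input scores are hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

CLASS_NAMES = ["very low", "low", "moderate", "high", "very high"]


def classify_quantile(scores):
    """Assign each susceptibility score to one of five classes.

    Assumption: class breaks are the quintiles of the scores; the study
    may have used a different scheme (e.g. natural breaks).
    """
    breaks = quantiles(scores, n=5, method="inclusive")  # four cut points
    return [CLASS_NAMES[bisect_right(breaks, s)] for s in scores]


def area_share(labels):
    """Return the percentage of cells in each class, assuming equal cell area."""
    counts = Counter(labels)
    return {name: 100.0 * counts[name] / len(labels) for name in CLASS_NAMES}


if __name__ == "__main__":
    # Illustrative scores only: 100 evenly spaced susceptibility values.
    demo_scores = [i / 100 for i in range(100)]
    print(area_share(classify_quantile(demo_scores)))
```

With quantile breaks the classes cover roughly equal areas by construction, so other schemes would be needed if the class sizes are meant to reflect the score distribution.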

Comparison of Validation Methods
Table 2 presents the predictive ability of the NB and RRF models based on the test datasets. The RRF model had the higher AUC, NPV, specificity, ACC and sensitivity values during the training and testing process. According to the AUC values, the RRF model had better predictive performance than the NB model (Figure 7).

CONCLUSION
Global warming is inevitable, and natural disasters such as floods can be observed in many places and cause extensive damage worldwide. Artificial intelligence techniques, including ML methods, have been recognized as efficient, high-performance methods for modelling natural disasters and predicting flood-prone areas.
The purpose of this study was to implement the RRF and NB algorithms to produce flood susceptibility maps in the Sardabroud watershed in northern Iran. The predictive performance of the models was evaluated using AUC, ACC, sensitivity, specificity and NPV. Both models had high accuracy, while RRF, with ACC, sensitivity, specificity and NPV values of 0.90, 0.89, 0.92 and 0.89, respectively, presented better results. The AUC of the RRF model was 0.945. According to the RRF map, 19% of the total area was placed in the very high flood susceptibility category, while for the NB model the very high category was estimated at 18% of the total area. The RRF model in this research had higher accuracy for flood modelling than the models studied by Ahmadlou et al. (2022) and Habibi et al. (2022) in this study area.
The findings showed that the RRF model outperformed the NB model in mapping flood-prone areas. Band et al. (2020) also used the RRF algorithm for flash flood modelling and concluded that this model had a high accuracy for susceptibility modelling. Therefore, the results of this research confirm the findings of the previous research.
The accuracy of these models can be compared and evaluated against other basic machine learning models such as random forest (RF) and support vector machine (SVM). The results of this study can be used by urban policy makers to consider flood susceptibility in detailed urban plans and land use planning initiatives to reduce damage to lives and infrastructure. It is suggested that further studies employ more flood and non-flood points, together with maps of flood-influencing factors at larger scales and higher spatial resolutions, which may affect the accuracy of flood susceptibility mapping.