GIS-BASED GROUNDWATER POTENTIAL MAPPING USING MACHINE LEARNING MODELS, A CASE STUDY: QOM PROVINCE, IRAN

Given the rising trend in water consumption and the significant decline of water resources in most countries, groundwater resources have become very important. The aim of this study is to implement machine learning models to produce a groundwater potential map (GWPM), identify areas with higher water potential, and identify the influencing factors. Two algorithms, random forest (RF) and support vector regression (SVR), were applied because, according to the literature, they are well suited to this type of problem compared with other models. Of the 351 well points available throughout the study area, 70% (245 well points) were selected for training the models and the remaining 30% (106 well points) were used for evaluating them. In addition, 20 effective information layers were used for modeling. This study placed particular emphasis on data preparation, which is one of the most important parts of model development. The variance inflation factor (VIF) and correlation coefficient were applied to identify mutually dependent (collinear) variables. Feature selection was also performed to identify the most influential factors. Finally, two GWPMs were created based on these two models. The prediction accuracy of the two models was assessed by calculating the area under the curve (AUC) of the receiver operating characteristic (ROC). The AUC values of the two maps produced by the RF and SVR algorithms were 93.4% and 89.7%, respectively. This study improves the knowledge of groundwater potential in the study area, which is one of the cities with water scarcity in the country.


STUDY AREA
The study area is Qom province, located in the central part of Iran. The area lies between 34°9'N and 35°11'N latitude and 50°6'E and 51°58'E longitude (Figure 2). Qom covers a total area of 11,237 km² and has a population of over 1,151,672 inhabitants. Groundwater in this area is accessed through wells, qanats, and springs. The climate of Qom province varies between semi-desert and desert, and the province includes mountainous regions, foothills, and plains. Owing to its location close to an arid region and far inland, the climate is dry with low humidity and sparse rainfall. The elevation of the study area varies between 800 m and 3200 m, and the average height of the area is 930 m. The average annual rainfall in Qom is 618.8 mm, which also varies with altitude across the province. Due to its natural conditions, the province faces a shortage of surface and underground water resources. The Qamroud and Qarachay rivers form the permanent surface streams.

DATA
Groundwater potential mapping is done by modelling well or spring locations as the target layer. Therefore, to create the groundwater potential map in this study, 351 well sites throughout the study area were selected and then divided into a training dataset (70% = 245 wells) and a test dataset (30% = 106 wells). However, these raw data are not yet suitable for modelling, so an upstream step is required to prepare the collected data. This step is called data preparation and is explained in Section 2.2.
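The 70/30 split described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `train_test_split`; the random arrays are hypothetical stand-ins for the well records, and `random_state=42` is an arbitrary choice, not a value reported in the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 351 well records with 20 factor columns each.
rng = np.random.default_rng(0)
X = rng.random((351, 20))
y = rng.random(351)  # placeholder target values, not the study's data

# 70% training, 30% testing; random_state is an arbitrary assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(X_train.shape[0], X_test.shape[0])
```

With 351 samples, this split yields the same 245/106 partition reported in the text.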

Groundwater-Related Factors:
For modelling groundwater potential, selecting effective and relevant factors is essential. Accordingly, 20 factors related to the GWPM were selected, which can be classified as topographical, hydrological, and geological factors. These factors are DEM, slope, aspect, TWI, SPI, land cover, land use, climate type, village density, fault density, qanat density, spring density, river density, Euclidean distance (ED) of villages, ED of roads, ED of rivers, ED of creeks, ED of qanats, ED of springs, and ED of faults (Figure 3). TWI, which is a secondary topographic factor, is calculated based on Equation (1) (Moore, Grayson, & Ladson, 1991):

$$\mathrm{TWI} = \ln\left(\frac{A_s}{\tan\beta}\right) \qquad (1)$$

where $A_s$ is the specific catchment area (m²/m) and $\beta$ is the slope at the point. SPI can be defined as Equation (2) (Moore et al., 1991):

$$\mathrm{SPI} = A_s \tan\beta \qquad (2)$$

where $A_s$ is the specific catchment area and $\beta$ is the local slope angle.
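As a rough illustration of the two secondary topographic indices, the sketch below assumes the commonly used Moore et al. forms, TWI = ln(A_s / tan β) and SPI = A_s · tan β, with the slope given in radians. The function names, the sample values, and the `eps` guard for flat cells are assumptions made for this example and are not taken from the study.

```python
import numpy as np

def twi(specific_catchment_area, slope_rad, eps=1e-6):
    """Topographic wetness index: ln(A_s / tan(beta)).

    eps is a hypothetical guard against division by zero on flat cells.
    """
    return np.log(specific_catchment_area / np.maximum(np.tan(slope_rad), eps))

def spi(specific_catchment_area, slope_rad):
    """Stream power index: A_s * tan(beta)."""
    return specific_catchment_area * np.tan(slope_rad)

# Illustrative cells: specific catchment area (m^2/m) and slope (degrees).
a_s = np.array([10.0, 100.0, 1000.0])
beta = np.deg2rad(np.array([45.0, 10.0, 1.0]))

twi_values = twi(a_s, beta)
spi_values = spi(a_s, beta)
```

In a GIS workflow, both inputs would typically be rasters derived from the DEM, and the same element-wise expressions would apply to whole arrays.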

Data Preparation:
An essential and critical step in the machine learning process is data preparation. Because the learning algorithms themselves are largely routine, most of the effort in a project usually goes into preparing the data (Brownlee, 2020). Therefore, before training the models and creating the GWPM, it is necessary to clean and validate the data. As shown in Figure 1, a systematic data engineering workflow was planned to check the data.

Check Duplicate:
In the raw data collected, some rows may be duplicates. A duplicate row is one that has the same value in every column as another row. Duplicate rows are not only useless for the training step, but can also be misleading during model evaluation (Brownlee, 2020). These redundant rows should be identified and deleted.
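A minimal sketch of the duplicate check, assuming the samples are held in a pandas DataFrame; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical sample table; the first two rows are identical in every column.
df = pd.DataFrame({
    "dem": [950.0, 950.0, 1200.0],
    "slope": [3.1, 3.1, 7.4],
    "potential": [1, 1, 0],
})

# Count rows that repeat an earlier row, then keep only the first occurrence.
n_duplicates = int(df.duplicated().sum())
df_unique = df.drop_duplicates().reset_index(drop=True)
```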

Delete Zero and Near Zero columns:
Columns that have only a single value, or very low variance across observations, are probably useless for modelling and should be examined. Single-valued predictors are known as zero-variance predictors and should be deleted. However, columns with very few unique numerical values (near-zero-variance predictors) may or may not be useless for modelling, and the decision whether to remove them depends on the situation.
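One possible way to flag such columns is sketched below with pandas. The column names and the 0.5 cut-off on the share of unique values are hypothetical; the study does not report a specific threshold. Near-zero-variance columns are only flagged for review, not removed automatically.

```python
import pandas as pd

# Hypothetical sample table.
df = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1],
    "nearly_constant": [0, 0, 0, 0, 0, 1],
    "slope": [3.1, 7.4, 2.2, 9.8, 5.0, 4.3],
})

NEAR_ZERO_RATIO = 0.5  # assumed cut-off on unique values / number of rows

# Zero-variance predictors: a single unique value in the whole column.
zero_var = [c for c in df.columns if df[c].nunique() == 1]

# Near-zero-variance candidates: few unique values relative to the row count.
unique_ratio = df.nunique() / len(df)
near_zero = [
    c for c in df.columns
    if c not in zero_var and unique_ratio[c] <= NEAR_ZERO_RATIO
]

# Drop only the zero-variance columns; review near_zero case by case.
df_clean = df.drop(columns=zero_var)
```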

Check Sample Column Data Types and Handle Categorical Data:
Machine learning models take only numbers as input and produce numbers as output. Therefore, it is important to check the input data types, especially when some columns are categorical. In this study, dummy encoding was used for variables that are naturally non-numeric (categorical data).
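Dummy encoding can be sketched with `pandas.get_dummies`, where `drop_first=True` keeps k-1 indicator columns for a variable with k categories. The `land_use` categories below are hypothetical examples, not the classes used in the study.

```python
import pandas as pd

# Hypothetical table with one categorical and one numeric column.
df = pd.DataFrame({
    "land_use": ["farmland", "urban", "rangeland", "urban"],
    "slope": [3.1, 7.4, 2.2, 9.8],
})

# Inspect column types first: 'object' columns need encoding.
categorical_cols = df.select_dtypes(include="object").columns.tolist()

# Dummy encoding: the first category (alphabetically) becomes the baseline.
encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=int)
```

Here "farmland" is the dropped baseline, so a row with zeros in both indicator columns represents farmland.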

Delete Missing Values:
This step involves identifying rows that have no value in one or more columns and deleting them. For modelling, all observations should have the same length and a value for every variable.
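A minimal pandas sketch of this step is given below; the table is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample table with gaps in two rows.
df = pd.DataFrame({
    "dem": [950.0, np.nan, 1200.0],
    "slope": [3.1, 7.4, np.nan],
    "twi": [8.2, 6.5, 7.1],
})

# Report how many values are missing per column before deleting anything.
missing_per_column = df.isna().sum()

# Drop every row that has at least one missing value.
df_complete = df.dropna(axis=0, how="any").reset_index(drop=True)
```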

Normalize data:
The last and most important part of data preparation is data transformation. Scaling, like every other procedure applied to the data, should first be fitted on the training data set and then applied to both the training and test data sets. Otherwise, fitting the transformation on the entire data set leads to an issue known as data leakage, in which information from the test data set leaks into the data used to train the model. For this reason, the data set is first split and then normalized (Brownlee, 2020).

Random Forest
Random forest is one of the best-known machine learning models, trademarked by Leo Breiman and Adele Cutler, which combines the output of several decision trees to reach a single result. It is capable of solving both classification and regression problems and, owing to its ease of use and flexibility, it is widely used. Classification and Regression Trees represent a class of decision trees presented by Breiman, Friedman, Olshen, and Stone (1984). Random forest algorithms have three main hyperparameters, which should be tuned before training: the number of features sampled, the node size, and the number of trees (Figure 4).
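The three random forest hyperparameters named above map naturally onto scikit-learn arguments. The sketch below assumes `RandomForestRegressor` and synthetic data; the specific values are illustrative and are not the tuned values from the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the prepared factor table.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(200)

rf = RandomForestRegressor(
    n_estimators=100,      # number of trees
    max_features="sqrt",   # number of features sampled at each split
    min_samples_leaf=2,    # node size (minimum samples in a leaf)
    random_state=0,        # arbitrary seed for reproducibility
)
rf.fit(X, y)

n_trees = len(rf.estimators_)
prediction = rf.predict(X[:3])  # averaged output of all trees
```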

Support Vector Machine
Support vector machines (SVMs) are a set of supervised learning methods used for outlier detection, classification, and regression. When the SVM method is extended to solve regression problems, it is known as support vector regression (SVR).
Since the GWPM in this study is a regression problem, the SVR implementation in the SVM module of Scikit-learn was used.
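The sketch below shows one way to combine split-then-normalize preprocessing with SVR in scikit-learn. Wrapping `MinMaxScaler` and `SVR` in a pipeline means the scaler is fitted on the training data only, which avoids data leakage. The data are synthetic, and the kernel, `C`, and `epsilon` values are placeholder assumptions rather than the study's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Synthetic data with features on a large, unnormalized scale.
rng = np.random.default_rng(0)
X = rng.random((200, 5)) * 100.0
y = X[:, 0] / 100.0 + 0.05 * rng.standard_normal(200)

# Split first, then let the pipeline fit the scaler on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

model = make_pipeline(MinMaxScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X_train, y_train)

# The fitted scaler is reused, unchanged, when predicting on the test set.
scores = model.predict(X_test)
scaler = model.named_steps["minmaxscaler"]
```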

RESULTS
This section presents the results of the models, the GWPMs of the two models, and their evaluation. To make the models comparable, the data preparation steps are the same for both models; after the dataset is divided into a training dataset and a test dataset, the models are trained separately and their results are reported. The AUC, which is a very efficient indicator of prediction accuracy, was used to evaluate the models.
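The AUC computation can be sketched with scikit-learn's `roc_auc_score` and `roc_curve`. The example assumes binary reference labels (for instance, well versus non-well locations) and continuous model scores; the four values below are toy data, and how the study derived its labels is not specified here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy reference labels and predicted potential scores (illustrative only).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is random.
auc = roc_auc_score(y_true, y_score)

# Points of the ROC curve, e.g. for plotting TPR against FPR.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(f"AUC = {auc:.2f}")  # AUC = 0.75
```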

Random Forest
The random forest algorithm was implemented in Python, using the sklearn.ensemble package and GridSearchCV from the sklearn.model_selection package to find the best hyperparameters. The results of the RF model are shown in Table 1. AUC results are also shown in Figure 5.
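A hedged sketch of a grid search over random forest hyperparameters is shown below. The parameter grid, the 3-fold cross-validation, the scoring metric, and the synthetic data are all assumptions for illustration; the study's actual grid is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic training data standing in for the prepared training set.
rng = np.random.default_rng(0)
X_train = rng.random((120, 5))
y_train = (X_train[:, 0] > 0.5).astype(float)

# Hypothetical search grid over the three main hyperparameters.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", 1.0],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                              # assumed number of folds
    scoring="neg_mean_squared_error",  # assumed selection metric
)
search.fit(X_train, y_train)

best_params = search.best_params_
best_model = search.best_estimator_  # refitted on the full training set
```

Because the search runs only on the training data, the test set remains untouched for the final AUC evaluation.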

SVR
The SVR algorithm was also implemented in Python, using the sklearn.svm package and GridSearchCV from the sklearn.model_selection package to find the best hyperparameters. The results of the SVR model are shown in Table 2. AUC results are also shown in Figure 6. The importance of the influencing factors can be obtained while training the models, as shown in Figures 7 and 8. These measures help to identify the importance of each feature and the most relevant variables for map making. They can therefore be used for dimensionality reduction, whose goal is to obtain maps with high accuracy at lower data dimensionality.
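Importance-based dimensionality reduction can be sketched as follows, assuming the impurity-based `feature_importances_` of a fitted scikit-learn random forest. The feature names, the synthetic data, and the choice of keeping the top two features are hypothetical. An SVR with a non-linear kernel exposes no such attribute, so a model-agnostic measure such as `sklearn.inspection.permutation_importance` would be one option there; which measure the study used is not stated in this section.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on 'dem' and weakly on 'twi'.
rng = np.random.default_rng(0)
names = ["dem", "slope", "twi", "noise_a", "noise_b"]
X = rng.random((300, 5))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + 0.05 * rng.standard_normal(300)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Rank features from most to least important (importances sum to 1).
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]
ranked = [names[i] for i in order]

# Keep only the top-k features for retraining (k is an assumed choice).
top_k = 2
selected = ranked[:top_k]
X_reduced = X[:, order[:top_k]]
```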

CONCLUSIONS
The main aim of this study was to create a GWPM based on two machine learning models for the study area, Qom province in Iran, which is one of the regions facing water scarcity. An attempt was also made to identify the most important factors affecting groundwater potential in both methods. The results show that the RF method performs better than SVR, although both achieve reasonable accuracy as measured by AUC. With the data cleaning and dimensionality reduction implemented in this research (using feature importance to select the more influential factors for training the models), the prediction accuracy of RF and SVR improved from 91.3% and 86% to 93.4% and 89.7%, respectively. In prior studies, these techniques were neglected or not mentioned as a means of improving the results and the quality of the maps. This study improves the knowledge of groundwater potential in the study area and shows that machine learning methods are operational and can be used instead of the older, more expensive methods.

Figure 4. The Random Forest framework

Table 2. Results of SVR