FRACTIONAL SNOW COVER MAPPING BY ARTIFICIAL NEURAL NETWORKS AND SUPPORT VECTOR MACHINES

Snow is an important land cover whose distribution over space and time plays a significant role in various environmental processes. Hence, snow cover mapping with high accuracy is necessary to have a real understanding for present and future climate, water cycle, and ecological changes. This study aims to investigate and compare the design and use of artificial neural networks (ANNs) and support vector machines (SVMs) algorithms for fractional snow cover (FSC) mapping from satellite data. ANN and SVM models with different model building settings are trained by using Moderate Resolution Imaging Spectroradiometer surface reflectance values of bands 1-7, normalized difference snow index and normalized difference vegetation index as predictor variables. Reference FSC maps are generated from higher spatial resolution Landsat ETM+ binary snow cover maps. Results on the independent test data set indicate that the developed ANN model with hyperbolic tangent transfer function in the output layer and the SVM model with radial basis function kernel produce high FSC mapping accuracies with the corresponding values of R = 0.93 and R = 0.92, respectively.


INTRODUCTION
One challenging issue in snow mapping is the trade-off between the temporal and spatial resolution of satellite imageries.Since high spatial resolution reduces the temporal resolution, it eventually limits timely detection of the changes in snow cover.Whereas, high temporal resolution data reduces the precision of snow cover maps due to low spatial resolution.
In order to deal with this problem, various fractional snow cover (FSC) mapping approaches have been proposed and applied to low or moderate resolution images such as spectral unmixing (Painter et al., 2003) and empirical normalized difference snow index (NDSI) (Salomonson and Appel, 2004) methods.In contrast to binary classification approach where a pixel is labeled as either snow-covered or snow-free, the true class distribution can be well estimated in FSC mapping even though the precise location of class fractions within each coarse resolution pixel still remains unknown (Verbeiren et al., 2008).
Artificial neural networks (ANNs) are a machine learning algorithm that can generate an information processing model by resembling the knowledge acquisition mechanism of the brain from the environment (Haykin, 2009), and have been extensively used in various disciplines (Dongale et al., 2015;Kar, 2015;Raji and Chandra, 2016).ANNs have also gained popularity and been frequently employed in various RS applications, like change detection (Dai and Khorram, 1999), land cover classification (Paola and Schowengerdt, 1995) and identification of clouds (Lee et al., 1990).Recently, there has also been an increasing trend in the use of ANNs for FSC mapping (Czyzowska-Wisniewski et al., 2015;Dobreva and Klein, 2011;Moosavi et al., 2014).
Many types of ANNs have been developed and multilayer perceptron (MLP), a feed-forward neural network model, is the most frequently employed one in RS applications (Kavzoglu and Mather, 2003;Mas and Flores, 2008).
Originated from Data Mining and Statistical Learning Theories, support vector machines (SVMs) (Vapnik, 1995) are a state-ofthe-art supervised nonparametric classification and regression tool which has been successfully used in various applications such as seismology (Fisher et al.), energy and finance (Kumar et al., 2016;Mustaffa et al., 2014;Yuan and Lee, 2015), face identification, text categorization, bioinformatics and database marketing (Campbell and Ying, 2011)..Although SVMs were initially proposed for pattern recognition tasks, successful results were also obtained in regression and time series prediction applications (Drucker et al., 1997b;Mattera and Haykin, 1999;Müller et al., 1997).SVMs have also quickly drawn the attention of RS community, and they have many successful applications in image classification (Foody and Mathur, 2004;Huang et al., 2002) and regression (Bruzzone and Melgani, 2005;Kaheil et al., 2008;Zheng et al., 2008) tasks in RS; however, they have not yet been employed for FSC estimation.
The primary focus of this research is to perform a comparison between ANNs and SVMs for FSC mapping applications in RS.As in all nonparametric classification and regression tools, ANNs and SVMs have also "model tuning" parameters that directly affect their generalization ability on unseen data.Thus, the secondary aim is to investigate the optimal combination of basic model building parameters of these methods for FSC mapping applications in RS.The rest of this paper is organized as follows: The study area, satellite image data set with preprocessing stages, brief overviews of both MLP networks and SVMs, and the basic experimental design including sampling strategy, training and testing of ANN and SVM models are described in Section 2. The results are represented and discussed in Section 3. Finally, conclusion and outlook for future studies are given in Section 4.

Study Area
The study area is located on the Ilgaz Mountains lying within the borders of Cankiri and Kastamonu provinces of Turkey, which also covers Kursunlu forest sub-district region (cf. Figure 1).It encompasses an area of ~8500 km 2 with altitude values ranging from 354 to 2326 m above mean sea level.
The vegetation of the study area mainly consists of forest, underwood and alpine flora.The low altitude parts of the northern slopes are generally covered with Oak (Quercus spp.),European Black Pine (Pinus nigra ssp.pallasiana) and Uludag Fir (Abies nordmanniana spp.bornmulleriana) forests.At altitudes from 1000-1300 m, Hornbeam (Carpinus spp.) and Beech (Fagus orientalis L.) are dominantly observed with some other firewood and deciduous plants.At 1500 m and above, pure or mixed forests are established by Black Pine (Pinus nigra ssp.pallasiana) and Scots Pine (Pinus sylvestris L.) (Kuter, 2008).Average annual temperature at the region is about 9.8 °C.The coldest month is January with an average temperature of -0.8 °C, whereas the warmest month is July with 20 °C.Average annual rainfall is 486 mm according to the recordings of meteorological station in Kastamonu.At the mountain peaks, the rainfall is about 1200 mm.Snow cover remains for about six months with thickness reaching about 1 m on slopes due to the effect of Central Anatolian climate (Aydinozu et al., 2011;Kuter and Kuter, 2016).

Satellite Imagery
Moderate resolution imaging spectroradiometer (MODIS) on the Terra satellite has 36 spectral bands ranging in wavelength from 0.4 to 14.4 µm at varying spatial resolutions (Salomonson et al., 2006) and has been extensively used for mapping global snow cover since its launch in 1999.The MODIS snow mapping algorithm is mainly based on NDSI (cf.Eqn.(1)), where the MODIS bands 4 and 6 are used, along with a series of threshold tests and the MODIS cloud mask, and each MODIS pixel with 500 m ground resolution is classified as either snow or non-snow (Hall et al., 2002) 1.

Study Area Image Pre-processing and Reference FSC Maps
All MODIS surface reflectance images are re-projected to a common UTM projection with WGS84 datum in order to match the projection of the corresponding Landsat scenes.In addition to the surface reflectance values of MODIS bands 1-7, NDSI and normalized difference vegetation index (NDVI) (cf.Eqn.
(2)) are used as input variables (i.e., predictors Reference FSC maps are obtained by binary classification of higher spatial resolution Landsat 7 images.In this binary classification scheme, which is the equivalent of the original MODIS binary snow mapping algorithm, a pixel is labeled as snow if its NDSI ≥ 0.4 and Landsat ETM+ band 2 reflectance ≥ 10% and band 4 reflectance > 11% (Klein et al., 1998).By calculating the percentage of snow-covered Landsat pixels within a circular area of 500 m radius centered at a MODIS pixel, FSC value for each MODIS pixel is obtained (cf. Figure 2).Table 2 gives detailed information on the predictor variables and the response used in the analysis.
An MLP is made of a set of input nodes (i.e., the input layer), one or more sets of computation neurons (i.e., the hidden layers), and one set of computation/output nodes (i.e., the output layer) (cf. Figure 3).Connections are always made forward, on a layer-by-layer basis.A neuron on layer i on a strict feedforward MLP always connects to a neuron on layer i + 1.Neither recurrent connections nor skipping layers are allowed.The training of an MLP by backpropagation algorithms involves two stages.In the first stage, also known as the forward phase, random numbers are assigned to synaptic weights of the network using a uniform random distribution and the input signal is propagated through the network, layer by layer, until it reaches the output.As s result, changes are confined to the activation potentials and outputs of the neurons in the network.Then in the second stage, namely, backpropagation, an error signal is obtained by comparing the output of the network with a desired response.The resulting error signal is transmitted through the network layer by layer, yet the propagation is performed in the backward direction this time.In this stage, successive adjustments are made to the synaptic weights of the network.The output of an MLP network is given by Eqn.
(3) as follows: where indicate the number of nodes in the input and the output layers, respectively, and finally, S is the number of neurons in the hidden layer: The most widely employed nonlinear transfer functions are hyperbolic tangent (tansig) and logarithmic sigmoid (logsig) and functions: tansig( ) , Tansig and logsig are differentiable and bounded within the range from -1 to 1, and from 0 to 1, respectively (cf. Figure 4).As soon as the normalization of the inputs and the outputs is completed, the learning process starts.This stage can be considered as an optimization process, in which the error function E has to be minimized: where   12 , , , T L t t t is the target vector of the MLP.In order to prevent the MLP from overtraining, the early stopping approach is often employed, where the available data is divided into three subsets as the training, the validation and the test set.To update the network weights and biases, the training set is used.During the training, the validation set is used to guarantee the generalization capability of the model; therefore, training should stop before the error on the validation set begins to rise.Finally, the test set is used to check the generalization ability of the MLP.
When the most basic case is considered, SVM is a linear binary classifier.If a test sample is given subsequent to training, it assigns a class from one of the two possible labels.For a classification in RS, the test data to be labeled is generally an individual pixel with a set of numerical measurements (i.e., reflectance, radiance or raw DN values) for each band of the multi-or hyper-spectral image.
Let us assume that a set of input vectors for the training data is given by xi with a number of relevant features (i.e., predictors).Each vector is associated with the corresponding label (i.e., class) denoted by yi such that 11 ( , ), ,( , ) ii yy xx , where N  x R is an N-dimensional space and i is the number of available samples.For the simplest binary case, yi can take only two possible labels, +1 and 1  .The ultimate aim of the learning task in binary SVM is to decide such an hyperplane that (1) the data points on one side would be labeled as 1 i y  and those on the other side as 1 i y  , and (2) the distance of the hyperplane from the labeled points of two classes would be maximized.The position of the optimal separation hyperplane with maximum margin that minimizes misclassifications is defined by two hyperplanes (i.e., support vectors) parallel to the separating hyperplane bordering the nearest training points from the two classes as illustrated in Figure 5. Two hyperplanes parallel to the optimal hyperplane are expressed as: where "  " is dot or scalar product, b denotes the offset of the hyperplane from the origin, i x is the vector of predictors, and is the weight vector of N elements that determines the direction of the optimal separating hyperplane.
Two inequalities in Eqn. ( 9) can be written in the following single inequality: The optimal separating hyperplane with maximum margin can be obtained by minimizing the norm of w as shown in the following equation with the inequality constrained given in Eqn. (10): Figure 5: Optimal separating hyperplane and support vectors (Richards and Jia, 2006).This is a constrained optimization problem in which the objective function in Eqn. ( 11) is minimized subject to the constrained in Eqn.(10) by using Lagrange multipliers.After some rearrangements and substitutions, the decision rule that defines the optimal separating hyperplane is given as: where for each of r training cases there is a vector Linear SVM implementation assumes that the feature data is linearly separable which is a hardly met condition in multi-or hyper-spectral RS images.Additionally, spatial and spectral overlapping of classes in RS images make it difficult for linear decision boundaries to separate classes with high accuracy.Soft margin and kernel methods have been introduced to overcome these issues.Radial basis function, linear, quadratic and polynomial kernels are frequently used to transform the raw data into a high dimensional feature space (Mountrakis et al., 2011;Pal and Mather, 2005b).
SVMs have also exhibited quite satisfying performance in regression even though they were initially developed for pattern recognition tasks.In SVM regression, the main goal is to find a function ()  f x that has at most  deviation from the actually obtained targets i y for all the training data, and at the same time is as flat as possible.Errors less than  are considered as negligible, yet any deviation larger than this is not acceptable.Then, ()  f x takes the following form: where X denotes the space of input patterns.According to Eqn. ( 13), "flatness" can be expressed in terms of small w , and one way to ensure this is to minimize

 
2  w w w , which can be re-written as a convex optimization problem: Eqn. ( 14) tacitly assumes that such a function ()  f x exists and it approximates all pairs ( , ) ii y x with  precision, i.e., the convex optimization problem is "feasible".In the case of infeasible scenario where some errors may be allowed, slack variables * , ii  can be used to deal with otherwise infeasible constraints of the optimization problem in Eqn. ( 14), which is also known as "soft margin" method.Consequently, the following formulation given in Vapnik (1995) is obtained: The constant 0 C  controls the trade-off between the flatness of () f x and the amount up to which deviations larger than  are tolerated.This leads to a so called  -insensitive loss function   which is graphically illustrated in Figure 6 and can be expressed as: , otherwise.
Figure 6: A basic illustration of the soft margin loss setting for a linear SVM (Haykin, 2009).

Experimental Design
After the removal of water, cloud cover and bad-quality pixels from the MODIS images, 8872 and 27338 pixels are available in the training and testing data sets, respectively.In order to train ANN and SVM regression models, 3% of 8872 (i.e., 254) pixels in the training data set is selected by stratified random sampling (cf.Table 1).Stratification is carried out with respect to snow cover fraction from 0.0 to 1.0 with 0.1 intervals in order to prevent ANN and SVM regression models from being biased towards a certain snow cover fraction.Surface reflectance values of MODIS bands 1-7, NDSI and NDVI are used as predictors, and associated FSC values derived from higher spatial resolution ETM+ images are used as response variable.
The chosen neural network structure for FSC mapping is a feedforward network with one hidden layer trained via backpropagation learning rule with 9 nodes in the input layer and 1 node in the output layer.Since there is no unique theory to determine the optimal values of an ANN's internal variables (Moosavi et al., 2014), the number of nodes in the input and output layers are set equal to the number of predictor (i.e., input) and response (i.e., output or target) variables.The gradient-based Levenberg-Marquardt backpropagation is used during ANN training.The tansig function is employed in the hidden layer.For the output layer, logsig and tansig functions are alternatively assigned in order to investigate their effect on the estimated FSC values.
Various methods have been proposed to determine the optimum number of neurons in the hidden layer such as 2n + 1, 2n and n, where n is the number of nodes in the input layer (Moosavi et al., 2014); however, trial-and-error approach is an appropriate way as indicated by Mishra and Desai (2006), and Shirmohammadi et al. (2013).Therefore, 4-22 nodes with increment of 3 are tested for the hidden layer.The ANN training data is split into three subsets by random sampling: 70% for training, 15% for validation, and 15% for testing.
A successful SVM application requires the optimal selection of two SVM key parameters; kernel function and regularization parameter C.However, there is no unique and universally accepted heuristics for the selection of these parameters which often leads to a basic trial-and-error approach (Mountrakis et al., 2011;Yuan and Lee, 2015).
Kernel functions employed in SVM applications can be grouped under four main categories; linear, polynomial, radial basis function and sigmoid kernels.In RS literature, radial basis function (RBF) (i.e., ) and polynomial kernel (i.e.,

 
  ) are frequently preferred (Huang et al., 2002;Pal and Mather, 2005a); therefore, these two kernels are used in the study.For RBF kernel, the width of kernel γ and C should be defined.On the other hand, the degree of polynomial d and again C should be determined for polynomial kernel.In order to select optimal values for these parameters, grid search method with cross validation is used as proposed by Kavzoglu and Colkesen (2009).In this approach, a coarser grid with exponentially growing sequence of (C, γ) and (C, d) is applied, and once the optimal region on the grid is identified, a finer grid search is performed.Finally, the pair of parameters that gives the highest cross validation accuracy is used for final training process.For RBF kernel C takes values in   1 2 15 2 , 2 , , 2 , whereas γ is varied from 2 -5 to 2 3 .For polynomial kernel, the degree d takes values from the set   2,3, 4 .
The performances of ANN and SVM regression models during both training and testing stages are assessed by root mean square error (RMSE) and correlation coefficient (R) values: where N is the total number of observations, i y is the ith value, and ˆi y is the ith predicted value.

CONCLUSIONS AND OUTLOOK
Continuous monitoring of snow cover is often considered as an essential task in order to observe the changes in hydrology, meteorology and climatology of earth at both local and global scales.In this study, the suitability of nonparametric ANN and SVM methods for FSC mapping is investigated.Surface reflectance values of MODIS bands 1-7, NDSI and NDVI are used as predictor variables; whereas, the reference FSC values are generated from higher resolution Landsat ETM+ data, and they are used as response variable.
The best performance is achieved by the MLP-tansig model over the independent test area.The performance of SVM-RBF model is slightly less than that of MLP-tansig.Advantage of ANNs and SVMs over traditional statistical methods lies in the fact that they make no assumptions about underlying relationship between a set of predictor variables and a response.Additionally, once the models are trained they can efficiently and quickly calculate FSC values.
The training process of SVMs depends only on two parameters; the choice of kernel function and C, therefore it can be considered as much simpler.However, the same conclusion cannot be made for ANNs, since the number of "model tuning" parameters to be optimally decided is large.Parameters such as the number and size of the hidden layers, the type of transfer function in the hidden and output layers, the range of the initial weights, the learning rate and the value of the momentum term, the number of samples employed during the training of a network have a significant impact on the predictive performance and the generalization ability of the network and should be considered during the design and implementation of an ANN.Consequently, a potential future extension of this study is to investigate the effect of these model building parameters of ANNs in a more detailed fashion by using different and larger satellite image data sets.

Figure 2 :
Figure 2: Derivation of reference MODIS FSC maps from binary-classified higher resolution Landsat ETM+ images.
 h and o denotes the elements of the hidden and the output layers, respectively, h ji w is the weight that connects the jth node of the input layer with the ith neuron of the hidden layer, o ik w is the weight connecting the ith neuron of the hidden layer with the kth node of the output layer, h i f is the transfer function of the ith neuron of the hidden layer, o k f is the transfer function of the kth node of the output layer, h i b and o k b are the biases of the ith neuron in the hidden layer and the kth node in the output layer, respectively.Furthermore, h i n is the excitation level of ith neuron of the hidden layer (cf.Eqn.(4)), o k n is the excitation level of kth node of the output layer (cf.Eqn.(5)), h i y and o k y are the outputs of the ith neuron of the hidden layer and the kth node of the output layer, respectively, jx is the jth node of the input layer, N and L

K
xx is the kernel function which enables the data points to spread in such a way that a linear hyper plane can be fitted.

Figure 7 :
Figure 7: Landsat-derived reference FSC values vs. estimated FSC values by ANN and SVM models.

Figure 8 :
Figure 8: (a) Reference binary snow map of the test area obtained from Landsat ETM+ data, and (b) FSC map of the test area generated by the MLP-tansig model.
. Two MODIS and Landsat 7 ETM+ image pairs taken on December 26, 2002 and February 28, 2003 are used in the study.Spatial subset of MODIS and ETM+ pair taken in 2002 is used for training, and that of 2003 image is used for testing of ANN and SVM models.The details of the satellite image data can be found in Table

water and bad quality pixels are identified by spatial masks obtained from the associated MODIS ancillary data and excluded from further analysis.
). Cloud-covered, cloud shadow,

Table 1 :
Training and test data used in the analysis.

Table 2 :
MODIS and Landsat ETM+ data used in ANN and SVM models.

Table 2 :
R and RMSE values of ANN and SVM models on training and test data.Table4shows RMSE values obtained during the training of ANN models with logsig and tansig transfer functions in the output layer for different number of neurons in the hidden layer.The optimal number of neurons in the hidden layer that gives the lowest RMSE values are 10 and 7 for MLP-logsig and MLP-tansig networks, respectively.During the training of the MLP networks, MLP-tansig gives the best training performance with R = 0.98 and RMSE = 0.0772, whereas R and RMSE of MLP-logsig are 0.91 and 0.4144, respectively.

Table 4 .
Results of MLP networks with two different transfer functions in the output layer with respect to different number of neurons in the hidden layer during training.The boldface figures indicate the minimum RMSE values obtained on the test subset of the training data.For each kernel function, different SVM regression models are trained with various settings as explained in the previous subsection.Next, the performance of each model is evaluated on the training data by using R.It can be seen from Table5, all of the kernel functions gives quite satisfactory results on the training data set; however, RBF kernel with C = 270 and γ = 3 provides the highest R. The second best performance belongs to both 2nd and 3rd order polynomial kernel functions with the same R value of 0.94, and with C values of 245 and 280, respectively.SVM with 4th order polynomial kernel exhibits relatively poor performance on the training data with R = 0.92 and C = 310.

Table 5 .
Optimal training settings for SVM kernel functions.The independent test data provides a more rigorous and realistic assessment of ANN and SVM regression models since it is not included in the training.Not surprisingly, accuracy as measured by R and RMSE values are lower for the test than training.According to the results given in Table3, MLP-tansig network gives the best performance on the test data set with R = 0.93.