BENCHMARKING DEEP LEARNING FRAMEWORKS FOR THE CLASSIFICATION OF VERY HIGH RESOLUTION SATELLITE MULTISPECTRAL DATA

In this paper we evaluate deep-learning frameworks based on Convolutional Neural Networks for the accurate classification of multispectral remote sensing data. Certain state-of-the-art models have been tested on the publicly available SAT-4 and SAT-6 high resolution satellite multispectral datasets. In particular, the benchmark included the AlexNet, AlexNet-small and VGG models, which were trained and applied to both datasets exploiting all the available spectral information. Deep Belief Networks, Autoencoders and other semi-supervised frameworks have also been compared. The high level features calculated by the tested models managed to classify the different land cover classes with very high accuracy rates, i.e., above 99.9%. The experimental results demonstrate the great potential of advanced deep-learning frameworks for the supervised classification of high resolution multispectral remote sensing data.


INTRODUCTION
The detection and recognition of different objects and land cover classes from satellite imagery is a well studied problem in the remote sensing community. The numerous national and commercial earth observation programs continuously provide data with different spatial, spectral and temporal characteristics. In order to operationally exploit these massive streams of imagery, advanced processing, mining and recognition tools are required. These tools should be able to extract, in a timely manner, valuable information regarding the various terrain objects and the land cover and land use status.
How automated and how accurate these recognition tools are is the critical aspect for their applicability and operational use. Regarding automation, although unsupervised and semi-supervised approaches possess a native advantage, when it comes to big data from space with important spatial, spectral and temporal variability, efficient generic tools may be based on supervised approaches which have been trained to handle and classify such datasets [Karantzalos et al., 2015, Cavallaro et al., 2015].
Figure 1 (panels: Class 1, Class 2, Class 3): Certain state-of-the-art models have been tested on the publicly available SAT-4 and SAT-6 high resolution satellite multispectral datasets. All the models have been designed and trained based on the DeepSat dataset.
Recently, deep learning architectures have gained attention in computer vision and remote sensing by delivering state-of-the-art results on image classification [Sermanet et al., 2013], [Krizhevsky et al., 2012], object detection [LeCun et al., 2004] and speech recognition [Xue et al., 2014]. Several deep architectures [Deng, 2014, Schmidhuber, 2015] have been employed, with Deep Belief Networks, Autoencoders, Convolutional Neural Networks and Deep Boltzmann Machines being some of the most commonly used in the literature for a variety of problems. In particular, for the classification of remote sensing data certain deep architectures have provided highly accurate results [Mnih and Hinton, 2010, Chen et al., 2014, Vakalopoulou et al., 2015, Basu et al., 2015, Makantasis et al., 2015, Marmanis et al., 2016].
Deep architectures require a significant amount of training data, while labelled remote sensing data are not broadly available. Recently, a new publicly available dataset with a large number of training data was released [Basu et al., 2015]. In this paper, motivated by the recent advances in deep convolutional neural networks, we benchmark the performance of certain models for classifying multispectral remote sensing data (Figure 1). Based on the DeepSat dataset, we have trained different models and report on their performance. In particular, the AlexNet, AlexNet-small and VGG models have been implemented and trained on the DeepSat dataset [Krizhevsky et al., 2012, Jaderberg et al., 2015, Ioffe and Szegedy, 2015, Vakalopoulou et al., 2015]. The high accuracy rates demonstrate the potential of advanced deep-learning frameworks for the supervised classification of high resolution multispectral remote sensing imagery.
Compared with Deep Belief Networks, Autoencoders and semi-supervised frameworks [Basu et al., 2015], the AlexNet and VGG deep architectures proposed here outperform the state-of-the-art, delivering classification accuracy rates above 99.9%.
The remainder of the paper is organized as follows. In Section 2, we briefly describe the different deep learning models, while in Section 3 we present and discuss their results. The last section provides a short summary of the contributions and examines potential future directions.

DEEP-LEARNING FRAMEWORKS
In this section all the tested models and their parameters are presented. Both the training and the testing datasets were normalised before being fed into the networks. All the compared deep learning frameworks were implemented in the open source Torch deep learning library [Collobert et al., 2011], and the specific implementation is available in the authors' GitHub accounts.

AlexNet-Pretrained Network
Similar to [Vakalopoulou et al., 2015, Marmanis et al., 2016], the already pretrained AlexNet network [Krizhevsky et al., 2012] has been employed here for feature extraction. In particular, features from the last layer (FC7) were extracted using two spectral band combinations (red-green-blue and NIR-red-green). In this way, a vector of high level features of size 2x4096 has been created for each patch. Using the training dataset, an SVM classifier has been trained (Figure 2) and the resulting model was then used for the classification of the testing patches.
The main drawback of this specific setup is the high dimensionality of the employed feature vector (i.e., 2x4096). It arises because the pretrained model cannot handle more than three spectral bands per patch, so two different band combinations (red-green-blue and NIR-red-green) had to be formulated in order to exploit all the available spectral information.
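The construction of the 2x4096 descriptor can be sketched in a few lines of Python. This is only an illustration and not the Torch implementation used in the experiments: the band order inside a patch (red, green, blue, NIR) is an assumption, `extract_fc7` is a hypothetical callable standing for a forward pass through a pretrained AlexNet up to FC7, and `dummy_fc7` is a stand-in that merely returns a vector of the right length so that the sketch runs.

```python
FC7_SIZE = 4096

# Assumed band order inside a 4-band patch (hypothetical indexing).
RED, GREEN, BLUE, NIR = 0, 1, 2, 3


def band_combinations(patch):
    """Split a 4-band patch into the two 3-band inputs used for feature extraction."""
    rgb = [patch[RED], patch[GREEN], patch[BLUE]]
    nir_rg = [patch[NIR], patch[RED], patch[GREEN]]
    return rgb, nir_rg


def patch_descriptor(patch, extract_fc7):
    """Concatenate the FC7 features of both band combinations (2 x 4096 values)."""
    rgb, nir_rg = band_combinations(patch)
    return list(extract_fc7(rgb)) + list(extract_fc7(nir_rg))


def dummy_fc7(three_band_patch):
    """Stand-in extractor, NOT a real network: repeats the mean pixel value."""
    values = [v for band in three_band_patch for row in band for v in row]
    mean = sum(values) / len(values)
    return [mean] * FC7_SIZE


# A synthetic 4x28x28 patch in which band b is filled with the value b.
patch = [[[float(b)] * 28 for _ in range(28)] for b in range(4)]
descriptor = patch_descriptor(patch, dummy_fc7)
print(len(descriptor))
```

The resulting list is what would be passed, one per patch, to the SVM classifier; its length makes the dimensionality drawback discussed above explicit.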

AlexNet Network
In order to overcome the previous problem, we trained an AlexNet network using the DeepSat dataset. The model consists of 22 layers: 5 convolutional, 3 pooling, 6 transfer function, 3 fully connected, and 5 dropout and threshold layers. The model follows the pattern depicted in Figure 3. More specifically, the first convolutional layer receives the raw input patch, which consists of 4 channels (or input planes) and is of size 28x28. The image is filtered with kernels of size 4x3x3 and a stride of 1 pixel, producing an output volume of size 16x26x26. The second layer is a transfer function layer which applies the rectified linear unit (ReLU) function element-wise to the input tensor, so the dimensions of the volume remain unchanged. Next comes a max pooling layer, which progressively reduces the spatial size of the volume in order to restrict the amount of network computation and parameters and to protect against overfitting. This pooling layer uses kernels of size 2 and a stride of 2, producing an output volume of size 16x13x13.
The next 3 layers follow the same pattern (Convolutional-ReLU-MaxPooling). The Convolutional layer accepts the 16x13x13 volume and produces an output of size 48x13x13 by using 3x3 kernels, with a stride of 1 and a zero padding of 1 pixel. It is followed by a ReLU and a MaxPooling layer. The latter, with kernels of size 3 and a stride of 2, delivers output volumes of size 48x6x6. The seventh layer is also a Convolutional layer, which delivers an output of size 96x6x6 by applying 3x3 kernels with a stride and a zero padding of 1. The eighth layer is a ReLU one. Layers 9, 10 and 11, 12 follow the same pattern (Convolutional-ReLU). The ninth layer filters the input volume with kernels of size 3x3, with a stride of 1 and a zero padding of 1, delivering an output of size 64x6x6. The eleventh convolutional layer uses the same hyperparameters. The twelfth layer is a max pooling one, which uses kernels of size 2 and a stride of 2 to produce an output of size 64x3x3.
After that, we use some fully-connected (FC) layers. The thirteenth layer is a simple View one, which reshapes the given volume of size 64x3x3 into an output volume of size 64*3*3x1 = 576x1.
Next comes a Dropout layer with a probability of 0.5, which masks part of the current input using binary samples from a Bernoulli distribution, i.e., it sets the output of each hidden neuron to zero with a probability of 0.5. The fifteenth layer is a simple linear one, which converts the input volume of size 576x1 to an output volume of size 200x1. The linear layer is followed by a Threshold (TH) one. The next 3 layers follow the same logic (Dropout-Linear-Threshold) and have the same hyperparameters. The 21st layer is a Linear one, whose output volume feeds the final 4- or 6-way soft-max layer, depending on the dataset (Figure 3).
Regarding the implementation, we trained the model with a learning rate of 1 for 36 epochs, halving the learning rate every 3 epochs. We set the momentum to 0.9, the weight decay parameter to 0.0005 and the limit for the Threshold layer to 0.000001.
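The volume sizes quoted for the convolutional part of this network can be checked with the usual output-size formula, out = floor((in + 2*pad - kernel) / stride) + 1. The Python sketch below traces only the layers that change the shape (ReLU layers are omitted); the layer names are labels introduced here for readability, and floor rounding in the pooling layers is an assumption.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer (floor rounding)."""
    return (size + 2 * pad - kernel) // stride + 1


# (label, output channels, kernel, stride, zero padding), as described in the text.
SHAPE_CHANGING_LAYERS = [
    ("conv1", 16, 3, 1, 0),
    ("pool1", 16, 2, 2, 0),
    ("conv2", 48, 3, 1, 1),
    ("pool2", 48, 3, 2, 0),
    ("conv3", 96, 3, 1, 1),
    ("conv4", 64, 3, 1, 1),
    ("conv5", 64, 3, 1, 1),
    ("pool3", 64, 2, 2, 0),
]


def trace_shapes(input_size=28):
    """Return (label, channels, height, width) after every shape-changing layer."""
    shapes = []
    size = input_size
    for label, channels, kernel, stride, pad in SHAPE_CHANGING_LAYERS:
        size = out_size(size, kernel, stride, pad)
        shapes.append((label, channels, size, size))
    return shapes


shapes = trace_shapes()
for label, channels, height, width in shapes:
    print(f"{label}: {channels}x{height}x{width}")

# Length of the vector produced by the View layer.
_, channels, height, width = shapes[-1]
flattened = channels * height * width
print("flattened:", flattened)
```

Under these assumptions the trace reproduces the 16x26x26, 16x13x13, 48x6x6 and 64x3x3 volumes and the 576-element vector passed to the first linear layer.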

AlexNet-small Network
Another model that we tested was a simpler, smaller AlexNet network. The model consists of 10 layers. The first layer is a convolutional one, which is fed with the original input image of size 4x28x28 and produces a volume of size 32x24x24. In this setup, we did not use the ReLU function but the hyperbolic tangent (Tanh) one. Consequently, a Tanh layer comes after the first convolutional one, applying the function element-wise to the input tensor. The third layer is a max pooling one, which reduces the size of the volume from 32x24x24 to 32x8x8. The next 3 layers follow the same pattern and result in an output volume of size 64x2x2. Finally, 4 fully-connected layers follow. At this point we should mention that, contrary to the previous full AlexNet model, this model does not contain Dropout layers. Nonetheless, the results were satisfactory, mostly because of the 2 max pooling layers, which reduced the spatial size of the volumes and controlled overfitting.
Finally, the parametrization of AlexNet-small was chosen to be similar to that of the full AlexNet. Again, we trained the model with a learning rate of 1 for 36 epochs, halving the learning rate every 3 epochs. We set the momentum to 0.9 and the weight decay parameter to 0.0005.
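The step-wise learning rate decay used for both AlexNet variants can be written as a one-line function. The sketch below is an illustration in Python rather than the Torch training script; it assumes 0-based epoch numbering and that the first halving takes effect once three epochs have been completed, neither of which is specified in the text.

```python
def learning_rate(epoch, base_lr=1.0, step=3, factor=0.5):
    """Learning rate for a 0-based epoch index under step-wise decay."""
    return base_lr * factor ** (epoch // step)


# Schedule for the 36 training epochs described above.
schedule = [learning_rate(epoch) for epoch in range(36)]

for epoch in (0, 3, 6, 35):
    print(epoch, schedule[epoch])
```

With these assumptions the learning rate is halved 11 times over the 36 epochs, so the last three epochs run at 1/2048 of the initial value.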

VGG Network
Moreover, we experimented with the recent VGG model [Jaderberg et al., 2015], which was initially proposed for text recognition. The model consists of 59 layers and repeatedly makes use of specific layers that include dropout and batch normalization [Ioffe and Szegedy, 2015]. It should be noted that these two kinds of layers are very important in order to accelerate the entire process and avoid overfitting, as the parameters of each layer continuously change during training. The first group consists of 4 layers with the following pattern: the model starts with a convolutional layer that takes the input image of size 4x28x28 and produces an output of size 64x28x28. Then, a batch normalization layer is applied, with a value of 0.001 added to the standard deviation of the input maps. After that comes a ReLU layer and, lastly, a dropout layer with a probability of 0.3. The next group also consists of 4 layers and follows the same logic, except for its last layer, which is a max pooling one with a kernel size of 2 and a stride of 2 that converts the input from 64x28x28 to 64x14x14. These two groups of layers are used repeatedly with the same hyperparameters, except for dropout, which sometimes has a probability of 0.4. The last 7 layers of the model include some fully connected layers.
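The normalisation step inside each Convolutional-BatchNorm-ReLU-Dropout group can be illustrated on a single feature value observed across a mini-batch. This Python sketch is not the Torch layer itself: it omits the learned scale and shift, and it assumes the common implementation in which the constant 0.001 is added to the variance inside the square root.

```python
import math


def batch_norm(values, eps=1e-3):
    """Normalise one feature across a mini-batch to zero mean and near-unit variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # Assumption: eps enters under the square root, which keeps the division safe
    # even when all values in the batch are identical.
    scale = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * scale for v in values]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
normalised = batch_norm(activations)
print(normalised)
print(relu(normalised))
```

Keeping the activations of every group on a comparable scale is what allows deeper stacks such as this one to be trained with relatively large learning rates.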
The deeper architecture of the VGG model requires a larger amount of data as well as more training time. For that reason, data augmentation with horizontal and vertical flips was performed. Finally, the network was trained for 100 epochs, halving the learning rate every 10 epochs.
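Flip-based augmentation is simple enough to sketch directly. The following Python illustration treats a patch as a nested list of bands, rows and pixels; whether the flipped copies were added to the original patches or sampled at random during training is not stated, so returning the original together with both flips is an assumption.

```python
def hflip(patch):
    """Mirror every band left-right by reversing each row."""
    return [[row[::-1] for row in band] for band in patch]


def vflip(patch):
    """Mirror every band top-bottom by reversing the order of the rows."""
    return [[list(row) for row in band[::-1]] for band in patch]


def augment(patch):
    """Return the original patch together with its two flipped copies."""
    return [patch, hflip(patch), vflip(patch)]


# A tiny one-band 2x2 example instead of a 4x28x28 patch.
patch = [[[1, 2],
          [3, 4]]]
for variant in augment(patch):
    print(variant)
```

Flips are a natural choice for overhead imagery, since a mirrored land cover patch is as plausible as the original one.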

Deep Belief Networks, Autoencoders, Semi-supervised frameworks
Last but not least, the aforementioned CNN-based networks were compared with the classification frameworks which have been recently evaluated in [Basu et al., 2015] on the DeepSat dataset. In particular, approaches based on Deep Belief Networks, a Stacked Denoising Autoencoder and a semi-supervised framework were employed.
The training and testing included both datasets, using different parameters, feature maps and configurations for each technique. [Basu et al., 2015] concluded that the semi-supervised approach was the most suitable for the DeepSat dataset, achieving accuracies of 97.95% and 93.92% for the SAT-4 and SAT-6 datasets, respectively.

EXPERIMENTAL RESULTS AND EVALUATION
In this section the experimental results are presented along with the comparative study. Both training and testing have been performed separately for the two datasets, SAT-4 and SAT-6. For the quantitative evaluation, the accuracy and precision measures have been calculated:
\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
where TP is the number of patches of the specific class that were correctly classified, and TN is the number of patches that do not belong to the specific class and were not assigned to it. FN is the number of patches that belong to the specific class but were not correctly classified, and FP is the number of patches that do not belong to the specific class but were wrongly assigned to it.
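The per-class measures can be computed from a confusion matrix as sketched below in Python. The sketch assumes the standard definitions, precision = TP / (TP + FP) and accuracy = (TP + TN) / (TP + TN + FP + FN), and the small three-class matrix is invented purely for illustration; it does not correspond to any result reported here.

```python
def per_class_metrics(confusion, cls):
    """Precision and accuracy of one class, treated as one-vs-rest.

    confusion[i][j] is the number of patches of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fp = sum(confusion[i][cls] for i in range(n_classes)) - tp
    fn = sum(confusion[cls]) - tp
    tn = total - tp - fp - fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total
    return precision, accuracy


# Hypothetical confusion matrix for three classes (illustrative numbers only).
confusion = [
    [50, 0, 0],
    [2, 45, 3],
    [0, 5, 45],
]
for cls in range(3):
    precision, accuracy = per_class_metrics(confusion, cls)
    print(cls, round(precision, 4), round(accuracy, 4))
```

Because TN counts every patch of the other classes that was not assigned to the class in question, the per-class accuracy is typically higher than the per-class precision when many classes are present.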
Regarding the experiments performed on the SAT-4 dataset, the precision and accuracy calculated after the application of the different deep learning frameworks are presented in Table 1. One can observe that all the employed models obtained highly accurate results. More specifically, the overall accuracy rates were in all cases above 99.4%, while the estimated precision was above 98.7%. Therefore, the employed deep architectures can address the classification task on this particular dataset quite successfully. As expected, the estimated accuracy and precision rates for the pretrained AlexNet were lower than those of the other models, since it was trained on the ImageNet dataset. Moreover, the use of the pretrained AlexNet network makes the computational complexity much higher than in the other models.
The other three models successfully classified all the corresponding classes, achieving accuracy rates above 99.90%. Additionally, as expected, the AlexNet network performs slightly better than AlexNet-small, which suggests that the ReLU layer and deeper architectures are more suitable for this specific dataset. In particular, the class with the lowest accuracy and precision was grassland, as some patches were misclassified as barren land or trees. The VGG and AlexNet models resulted in the highest accuracy and precision rates (i.e., above 99.94%).

Method                                            Overall Accuracy (%)
                                                  SAT-4     SAT-6
DBN [Basu et al., 2015]                           81.78     76.41
CNN [Basu et al., 2015]                           86.83     79.06
SDAE [Basu et al., 2015]                          79.98     78.43
Semi-supervised [Basu et al., 2015]               97.95     93.92
Pretrained-AlexNet [Vakalopoulou et al., 2015]    99

Regarding the experiments performed with the SAT-6 dataset, the results of the quantitative evaluation are presented in Table 2. As in SAT-4, the pretrained AlexNet model resulted in the lowest accuracy and precision rates compared to the other ones. The VGG model resulted in slightly higher accuracy rates than the AlexNet and AlexNet-small models. In all cases the overall accuracy rates were higher than 99.6%. Moreover, one can observe that the water bodies class was easily discriminated from the other ones, as in all cases the calculated accuracy was above 99.99%. On the other hand, the Roads class resulted in the lowest accuracy rates, as certain misclassification cases occurred with the Trees and Grassland classes.
In addition, for the qualitative evaluation of the performed experiments the t-SNE technique [Van Der Maaten, 2014] was employed. t-SNE has been tested on different computer vision datasets, e.g., MNIST and the NIPS dataset, and it is suitable for the visualization of large, high-dimensional real-world datasets. In Figure 4 the visualization of the last layer features of the AlexNet-small model is shown for both SAT-4 and SAT-6. In particular, different classes are represented with different colours in the borders of the patches. One can observe that the different classes are well separated in space, which is consistent with the high accuracy rates reported quantitatively in Table 1 and Table 2.
Last but not least, the deep architectures of AlexNet and VGG were compared with the ones recently proposed and evaluated in [Basu et al., 2015]. In Table 3, the overall accuracy rates for both the SAT-4 and SAT-6 datasets are presented. This benchmark included results after the application of a Deep Belief Network (DBN), a Convolutional Neural Network (CNN), a Stacked Denoising Autoencoder (SDAE), a semi-supervised learning framework, AlexNet (pre-trained and small) and VGG. In [Basu et al., 2015], the highest accuracy rates were obtained by the semi-supervised classification framework and were 97.95% and 93.92% for SAT-4 and SAT-6, respectively. As one can observe, all the models proposed and applied in this paper (including the pre-trained AlexNet, which scored lower than the other ones) outperformed the semi-supervised framework of [Basu et al., 2015]. The proposed deep models efficiently exploited the available spectral information (all available spectral bands) and created deep features that could accurately discriminate the different classes.
In order to evaluate the contribution of the NIR band, we also experimented with training the models with and without it. In particular, the experimental results after applying the VGG model to both the SAT-4 and SAT-6 datasets are presented in Table 4.
As expected, the Trees and Grassland classes present lower precision and accuracy rates when the NIR band is excluded from the training process. However, the overall accuracy and precision rates were only slightly different, especially on the SAT-6 dataset.

CONCLUSIONS
In this paper, experimental results from benchmarking different deep-learning frameworks for the classification of high resolution multispectral data were presented. A Deep Belief Network (DBN), a Convolutional Neural Network (CNN), a Stacked Denoising Autoencoder (SDAE), a semi-supervised learning framework, AlexNet (pre-trained and small) and VGG were among the frameworks that were evaluated. The evaluation was based on the publicly available DeepSat dataset, including both SAT-4 and SAT-6. Compared with Deep Belief Networks, Autoencoders and semi-supervised frameworks [Basu et al., 2015], the AlexNet and VGG deep architectures proposed here outperform the state-of-the-art, delivering classification accuracy rates above 99.9%. The quite promising quantitative evaluation indicates the high potential of deep architectures for the design of operational remote sensing classification tools.

Figure 2: The FC7 layer of the pretrained AlexNet network has been employed for extracting features and training an SVM classifier. Two different band combinations (red-green-blue and NIR-red-green) have been formulated in order to exploit all the available spectral information.

Figure 3: A brief illustration of the AlexNet model which was employed here. The network takes as input patches of dimensions 4x28x28. The network consists of 5 convolutional layers, 3 fully connected layers and 6 transfer functions. A 4- or 6-way soft-max layer is applied depending on the dataset.
Figure 4: Qualitative evaluation based on the t-SNE technique. Results from the SAT-4 (top) and the SAT-6 (bottom) datasets are presented. Different classes are defined with different colours in the borders of the patches. This visualisation corresponds to the last layer of the AlexNet-small model.
DeepSat contains patches extracted from the National Agriculture Imagery Program (NAIP) dataset, which comprises about 330,000 scenes spanning the entire continental United States and amounts to approximately 65 terabytes. The images consist of 4 bands, red, green, blue and Near Infrared (NIR), and were acquired at a ground sample distance (GSD) of 1 meter, with a horizontal accuracy of up to 6 meters. The dataset is composed of two discrete units, each of them having two different groups of patches: one for training and one for testing. SAT-4 is the first unit and it has four different classes: barren land, trees, grassland and a class that consists of all land cover classes other than the above three. It contains 400,000 training and 100,000 testing patches. The second unit, SAT-6, contains six different classes: barren land, trees, grassland, roads, buildings and water bodies. It contains 324,000 training and 81,000 testing patches. The large number of labelled data that DeepSat provides makes it ideal for benchmarking different deep architectures.

Table 1: Resulting classification accuracy and precision rates after the application of different deep architectures to the SAT-4 dataset.

Table 2: Resulting classification accuracy and precision rates after the application of different deep architectures to the SAT-6 dataset.

Table 3: The resulting overall accuracy rates after the application of different learning frameworks to both datasets.

Table 4: Resulting classification accuracy and precision rates after the application of the VGG model to both datasets without the NIR band.