Remote Sensing Image Classification with the SEN12MS Dataset

Image classification is one of the main drivers of the rapid developments in deep learning with convolutional neural networks for computer vision, and the same holds for the analogous task of scene classification in remote sensing. However, in contrast to the computer vision community, which has long been using well-established, large-scale standard datasets to train and benchmark high-capacity models, the remote sensing community still largely relies on relatively small and often application-dependent datasets, thus lacking comparability. With this paper, we present a classification-oriented conversion of the SEN12MS dataset. Based on it, we provide results for several baseline models built on two standard CNN architectures and different input data configurations. Our results support the benchmarking of remote sensing image classification and provide insights into the benefit of multi-spectral data and multi-sensor data fusion over conventional RGB imagery.


INTRODUCTION
One of the most crucial preconditions for the development of machine learning models for the interpretation of remote sensing data is the availability of annotated datasets. While well-established shallow learning approaches were usually trained on small datasets, modern deep learning requires large-scale data to reach the desired generalization performance. In computer vision, the great success of deep learning was largely driven by the desire to solve the image classification problem, i.e. assigning one or more labels to a given photograph. For this purpose, many researchers have relied on the ImageNet database (Deng et al., 2009), which contains millions of annotated images. In remote sensing, the same task is often called scene classification and similarly aims at assigning one or more labels to a remote sensing image, i.e., a scene. As (Cheng et al., 2020) summarizes, there has also been a lot of progress in this field in recent years, with a growing number of dedicated datasets (cf. Tab. 1). As can be seen from this incomplete selection, most datasets built for remote sensing image classification deal with high-resolution aerial imagery, usually providing three or four spectral channels (RGB, or RGB plus near-infrared). Only EuroSAT and BigEarthNet provide spaceborne multi-spectral imagery, with So2Sat LCZ42 being the only existing scene classification dataset covering the other major data modality, synthetic aperture radar (SAR) data¹. Considering all these aspects together, i.e. dataset size, availability of more than a single sensor modality, and versatility, it becomes obvious that most existing datasets lack the power to train generic, region-agnostic models exploiting multi-sensor information.
With this paper, we present the conversion of the SEN12MS dataset to the image classification task as well as several baseline models including their evaluation. Since SEN12MS is, in terms of spatial coverage and sensor modalities, significantly larger than all other available datasets, and sampled in a more versatile manner, this will enhance the possibility to benchmark future model developments in a transparent way, and to pre-train remote sensing-specific models that can later be fine-tuned to individual problems and user needs.

¹ After this paper was accepted for publication at the ISPRS Congress 2021, it came to the authors' attention that BigEarthNet has in the meantime been extended by BigEarthNet-S1, a collection of Sentinel-1 images corresponding to the Sentinel-2 images contained in the original dataset. Thus, another multi-modal scene classification dataset now exists.

SEN12MS FOR IMAGE CLASSIFICATION
In this section, the SEN12MS dataset in its new image classification variant is described. All resources, most notably the labels and pre-trained baseline models, can be downloaded from https://github.com/schmitt-muc/SEN12MS in an open-access manner. The goal of both the repository and this paper is to support the establishment of standardized benchmarks for better comparability in the field.

The original SEN12MS Dataset
The SEN12MS dataset (Schmitt et al., 2019) was published in 2019 and contains 180,662 so-called patches, which are distributed across the world and all seasons. For each of those patches, the dataset provides, at a pixel sampling of 10 m and a size of 256 × 256 pixels,
• a Sentinel-1 SAR image with two polarimetric channels (VV, VH),
• a Sentinel-2 optical image with 13 multi-spectral channels,
• four different land cover maps following different classification schemes.
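For orientation, the sketch below shows how such a patch triplet could be read with rasterio. It is a minimal example: the helper name load_patch and the assumption that each patch is stored as three separate GeoTIFF files are ours, and the exact file layout should be taken from the repository documentation.

```python
# Minimal sketch for reading one SEN12MS patch triplet with rasterio.
# Assumes each patch is stored as three separate GeoTIFF files
# (Sentinel-1, Sentinel-2, land cover); see the repository for the layout.
import rasterio

def load_patch(s1_path, s2_path, lc_path):
    with rasterio.open(s1_path) as src:
        s1 = src.read()            # shape (2, 256, 256): VV, VH backscatter
    with rasterio.open(s2_path) as src:
        s2 = src.read()            # shape (13, 256, 256): multi-spectral bands
        transform = src.transform  # geolocation from the GeoTIFF header
    with rasterio.open(lc_path) as src:
        lc = src.read()            # shape (4, 256, 256): four label schemes
    return s1, s2, lc, transform
```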
The SEN12MS dataset was designed with the following key features in mind:
• Its main distinction from other deep learning-oriented datasets was (and is) its focus on multi-sensor data fusion. Instead of containing only optical imagery, SEN12MS provides both SAR and multi-spectral optical data to cover the most relevant modalities in remote sensing.
• Instead of being sampled over a single study area or a geographical region of limited extent (e.g. individual countries or continents), the data of SEN12MS is sampled from all inhabited continents. This makes the dataset unique with regard to the possibility to train generalizing models that are comparably agnostic with respect to target scenes.
• In contrast to its predecessor, the SEN1-2 dataset, all images contained in SEN12MS come as GeoTIFFs, i.e. they include geolocation information. On the one hand, this information can be used as an additional input feature (cf. Uber's CoordConv solution (Liu et al., 2018)). On the other hand, it can be used to pair the SEN12MS data with external geodata.
• By providing dense, albeit coarse, land cover labels for each patch, semantic segmentation for land cover classification was intended to be one of the main application areas of the dataset.
Since its publication, SEN12MS has been used in many studies on deep learning applied to multi-sensor remote sensing imagery. Examples include image-to-image translation (Abady et al., 2020, Yuan et al., 2020) and land cover mapping, with focuses on weakly supervised learning (Yu et al., 2020), model generalization, and meta-learning (Rußwurm et al., 2020). This shows the dataset's potential for both application-oriented and methodological research.

Creation of Scene Labels from Dense Labels
The original SEN12MS dataset contains four different schemes of MODIS-derived land cover labels. From those schemes, the IGBP scheme (see, e.g., (Sulla-Menashe et al., 2019)) was chosen as the basis for the conversion into a classification dataset. This was done because the IGBP scheme features rather generic classes, covering both natural and urban environments at a moderate level of semantic granularity. The other, LCCS-based classification schemes, in contrast, are less generic and over-focus on specific topics of interest, e.g. land use or surface hydrology. As already proposed by (Yokoya et al., 2020), the 17 original IGBP classes were converted to the simplified IGBP scheme (cf. Table 2) to ensure comparability with other land cover schemes such as FROM-GLC10 (Gong et al., 2019), and to mitigate the class imbalance of SEN12MS to some extent.
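To make this conversion concrete, a possible implementation of the 17-to-10 class mapping is sketched below. The grouping shown follows the commonly used simplification of the IGBP scheme; it is our reading of the scheme and should be verified against Table 2 and the repository before use.

```python
# Sketch of the 17-class IGBP -> 10-class simplified IGBP conversion.
# The grouping is an assumption based on the commonly used simplification;
# verify it against Table 2 / the repository.
import numpy as np

# entry i-1 holds the simplified class for original IGBP class i
IGBP_TO_SIMPLIFIED = np.array([
    1, 1, 1, 1, 1,  # 1-5: forest types            -> Forest
    2, 2,           # 6-7: shrublands              -> Shrubland
    3, 3,           # 8-9: savannas                -> Savanna
    4,              # 10: grasslands               -> Grassland
    5,              # 11: permanent wetlands       -> Wetlands
    6,              # 12: croplands                -> Croplands
    7,              # 13: urban and built-up       -> Urban/Built-Up
    6,              # 14: cropland/natural mosaics -> Croplands
    8,              # 15: snow and ice             -> Snow/Ice
    9,              # 16: barren                   -> Barren
    10,             # 17: water bodies             -> Water
])

def simplify_igbp(lc):
    """Map a dense IGBP raster (values 1..17) to the simplified scheme."""
    return IGBP_TO_SIMPLIFIED[lc - 1]
```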
For the generation of single-label scene annotations, the land cover class corresponding to the mode of the pixel-wise land cover distribution within a scene was simply used. For the generation of multi-label scene annotations, the histogram of land cover occurrences within a scene was converted to a probability distribution. Then, to remove visually underrepresented classes, only classes with a probability larger than 10% were kept. An illustration of some example images including their single-label and multi-label scene annotations is provided in Fig. 1.
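This procedure translates directly into a few lines of NumPy; the function below is a minimal sketch of the mode-based single label and the 10% threshold for multi-labels.

```python
# Minimal sketch of the scene-label generation described above.
import numpy as np

def scene_labels(lc, threshold=0.1):
    """Derive single- and multi-label annotations from a dense label map."""
    classes, counts = np.unique(lc, return_counts=True)
    probs = counts / counts.sum()              # histogram -> probabilities
    single_label = classes[np.argmax(probs)]   # mode of the distribution
    multi_labels = classes[probs > threshold]  # keep classes covering > 10%
    return single_label, multi_labels
```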

Dataset Statistics
The class distribution of the different label representations is summarized in Fig. 2. It can be seen that the dataset is fairly imbalanced, with the classes Savanna, Croplands, and Grassland being very frequent and classes such as Wetlands, Barren, and Water being comparably underrepresented. The class Snow/Ice can basically be considered non-existent in SEN12MS. Due to the generally uniform spatial sampling of the SEN12MS data, this imbalanced class distribution reflects the natural imbalance of the real world, but it should be taken into account when the data is used for the training and evaluation of machine learning models.
While (Sulla-Menashe et al., 2019) provides a rough estimate of the accuracy of the original, global IGBP land cover map at 500 m GSD, namely 67%, there is no accuracy assessment of the upsampled IGBP maps provided as dense annotations in the original SEN12MS dataset. To provide a better intuition of the accuracy of the single-label and multi-label scene annotations presented with this paper, we therefore randomly selected 600 patches for human annotation. This annotation was conducted by visual inspection of the corresponding high-resolution aerial imagery in Google Earth. While this form of evaluation suffers from a certain subjectivity and from the difficulty of distinguishing some land cover classes visually, it still serves the purpose of gaining a better feeling for the label quality, provided the numbers in Tab. 3 are taken with a grain of salt. In this evaluation, a single human label was given to each patch; Top-1 accuracy refers to the case in which the MODIS-derived single-label scene annotation matches this human label, whereas Top-3 accuracy refers to the case in which one of the top-3 multi-label scene annotations matches this human label. As can be seen, the average accuracy is around 80%, which is significantly better than the original accuracy of the global IGBP land cover map. This is, of course, caused by the simplification of the IGBP scheme, as well as by the reduction to just a few scene labels instead of coarse-resolution pixel-wise annotations.

BASELINE MODELS FOR SINGLE-LABEL AND MULTI-LABEL SCENE CLASSIFICATION
To provide the community with both baseline models and results, we have selected two well-established convolutional neural network (CNN) architectures for image classification: ResNet and DenseNet. The architectures and the necessary adaptations and settings are briefly described in the following.

ResNet
Figure 1: Some randomly selected samples from the SEN12MS dataset, including simplified IGBP land cover annotations on scene level. First row: Sentinel-1 (VV backscatter). Second row: Sentinel-2 (RGB). The main class of the scene is set in bold; additional classes in the multi-label case are set in regular font.

The ResNet architecture (He et al., 2016) was designed to mitigate the problem of vanishing gradients, which tended to appear in very deep CNNs. This is realized by the introduction of so-called shortcut connections: instead of learning a direct mapping from input to output layer, the shortcut connection skips one or more layers by passing the original input through the network without modification. The network then learns residual mappings with respect to this input. In this work, we used a ResNet50, i.e. a variant with a depth of 50 layers.
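As an illustration of this principle, a minimal residual block in PyTorch could look as follows. This is a simplified sketch of the shortcut idea, not the exact bottleneck block used in ResNet50.

```python
# Simplified residual block: the input bypasses the convolutions via a
# shortcut connection and is added back to the learned residual mapping.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(x + residual)  # shortcut: add the input back in
```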

DenseNet
The DenseNet architecture (Huang et al., 2017) is based on the finding that CNNs can be deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and layers close to the output. Thus, DenseNets directly connect each layer to every other layer in order to ensure maximum information flow between layers in the network. In contrast to ResNets, the features are not combined through summation; instead, they are concatenated. In this work, we employ a DenseNet121 with a depth of 121 layers.
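Analogously, the dense connectivity pattern can be sketched as follows; again, this is a simplified illustration rather than the exact composite layers of DenseNet121.

```python
# Simplified dense block: each layer consumes the concatenation of all
# preceding feature maps instead of a summed shortcut.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          3, padding=1, bias=False),
            )
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # concatenate, not sum
            features.append(out)
        return torch.cat(features, dim=1)
```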

Training Details
Both models were trained using a binary cross entropy with logits loss for multi-label classification. To keep everything simple, optimization was performed with an Adam optimizer, a learning rate of 0.001, a decay rate of 10⁻⁵, and a batch size of 64.
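A minimal sketch of this setup with torchvision is given below. Two assumptions are made: the decay rate of 10⁻⁵ is interpreted as Adam weight decay, and the first convolution of the network is replaced so that it accepts more than three input channels (e.g. 12 for the SAR plus multi-spectral configuration described next); this adaptation is illustrative, not necessarily the exact implementation used here.

```python
# Sketch of the training setup: BCE-with-logits loss, Adam (lr 0.001,
# weight decay 1e-5), and an input layer widened for multi-sensor data.
import torch
import torch.nn as nn
from torchvision import models

num_channels, num_classes = 12, 10  # S1 (2) + S2 (10) bands; simplified IGBP
model = models.resnet50(num_classes=num_classes)
# replace the RGB stem so the network accepts 12-channel input (assumption)
model.conv1 = nn.Conv2d(num_channels, 64, kernel_size=7,
                        stride=2, padding=3, bias=False)

criterion = nn.BCEWithLogitsLoss()  # binary cross entropy with logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
# train with batches of size 64 and early stopping on a validation split
```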
In order to evaluate the usefulness of different input data configurations, separate models were trained for the following cases:
• S2 RGB: Sentinel-2 RGB data only (as a computer vision-like baseline)
• S2 MS: all 10 surface-related spectral channels of Sentinel-2
• S1+S2: Sentinel-1 dual-polarimetric data plus the 10 surface-related channels of Sentinel-2
Each model was trained from scratch on the official SEN12MS training split, with early stopping based on a validation set randomly selected from the training set.

BENCHMARK RESULTS

Figure 3 illustrates multi-label prediction results for three example patches. Table 4 contains a summary of accuracy metrics for the multi-label classification results. The F1 score is the harmonic mean of the precision and recall metrics, i.e.

F1 = 2 · p · r / (p + r),

where the precision p is the fraction of correct predictions among all predictions of a certain class, and the recall r is the fraction of correct predictions among all appearances of a class in the reference annotations.
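A minimal NumPy sketch of these per-class metrics, assuming predictions and references are given as binary indicator arrays of shape (num_samples, num_classes):

```python
# Per-class precision, recall, and F1 for multi-label results.
import numpy as np

def f1_per_class(y_true, y_pred):
    tp = np.sum(y_true * y_pred, axis=0)                 # correct predictions
    precision = tp / np.maximum(y_pred.sum(axis=0), 1)   # per all predictions
    recall = tp / np.maximum(y_true.sum(axis=0), 1)      # per all references
    return 2 * precision * recall / np.maximum(precision + recall, 1e-12)
```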
From the results, different insights can be drawn:
• There is generally not a huge difference between the two baseline CNN architectures. This is particularly interesting because the two models are of significantly different depths, and it hints that the achievable predictive power is limited more by the training data than by the model capacity.
• Overall, the weakest performances are achieved by those models that take only optical RGB imagery as input. However, a measurable improvement is observed when multi-spectral data is used, with even more improvement provided by the data fusion-based model that exploits both optical and SAR data. This suggests that spectral diversity is of high importance in remote sensing-based land cover mapping and confirms once more that there is a significant difference between the analysis of conventional photographs and remote sensing imagery.
• While SAR-optical data fusion is helpful on average, for some classes, e.g. Shrubland, Wetlands, and Croplands, it does not seem to help much. This indicates that observation-level fusion based on a simple channel concatenation is not enough and that more sophisticated fusion strategies, e.g. with sensor-dependent CNN streams, are needed.
• Due to the reduction of dense labels to scene labels, the Barren class is underrepresented: there are only 29 patches carrying a Barren label in the multi-label test dataset. This renders the results for the Barren class unreliable. It is still interesting to note that for both CNN architectures, data fusion provides the best result for this class, while multi-spectral imagery yields the worst result, even worse than RGB only.
• As has been discussed before, the IGBP-based Savanna label can be problematic: the MODIS-derived reference considers the scene of the second example in Fig. 3 as Savanna, although it visually appears to be rather a mixture of built-up structures and croplands. Interestingly, both baseline models are able to identify those classes (i.e. Urban/Built-Up for ResNet50 and Croplands for DenseNet121), albeit not both at the same time, and without removing the Savanna class.

Figure 3: Multi-label predictions for three example patches using the S1+S2 input data configuration. The second example is particularly noteworthy: while the reference denotes the patch as pure Savanna, the models add finer information by recognizing urban or cropland structures, respectively.
All in all, the achieved accuracies are of the same order of magnitude as the accuracies reported by (Sumbul et al., 2019) for the BigEarthNet dataset. While SEN12MS and BigEarthNet are not directly comparable (as SEN12MS is more versatile and contains Sentinel-1 SAR imagery, but also uses a simpler class scheme), this indicates the usability of SEN12MS for the training and evaluation of scene classification models. Besides, the fact that similar accuracies are achievable on both datasets suggests a certain saturation of off-the-shelf image classification models for remote sensing scene classification. Investigating this further would be an interesting future research direction.

SUMMARY & CONCLUSION
With this paper, we have presented the SEN12MS dataset repurposed for remote sensing image classification. To achieve this, the original land cover annotations, which are provided at a resolution of 500 m and a pixel spacing of 10 m, were converted to both single-label and multi-label annotations. Based on a randomized validation of 600 patches by a human expert, an average label accuracy of about 80% was estimated. Using the dataset and two standard CNN image classification architectures, we have trained and evaluated several baseline models, which can serve as reference points for future developments and provide insight into the benefit of using multi-sensor and multi-spectral data over plain RGB imagery.