CONTRASTIVE SELF-SUPERVISED DATA FUSION FOR SATELLITE IMAGERY

ABSTRACT: Self-supervised learning has great potential for the remote sensing domain, where unlabelled observations are abundant, but labels are hard to obtain. This work leverages unlabelled multi-modal remote sensing data for augmentation-free contrastive self-supervised learning. Deep neural network models are trained to maximize the similarity of latent representations obtained with different sensing techniques from the same location, while distinguishing them from other locations. We showcase this idea with two self-supervised data fusion methods and compare against standard supervised and self-supervised learning approaches on a land-cover classification task. Our results show that contrastive data fusion is a powerful self-supervised technique to train image encoders that are capable of producing meaningful representations: Simple linear probing performs on par with fully supervised approaches and fine-tuning with as little as 10% of the labelled data results in higher accuracy than supervised training on the entire dataset.


INTRODUCTION
Increasing numbers of Earth-orbiting satellites produce large quantities of remote sensing data every day. The analysis of this data with machine-learning (ML) techniques is of great interest for many applications in Earth Observation, such as land use monitoring or change detection (Zhu et al., 2017). However, many of the most frequently used ML algorithms for these tasks are supervised, and thus depend on the availability of high-quality labels for each observation (Scheibenreif et al., 2021). Obtaining the labels is typically a laborious process, involving expensive human expert annotators. This leaves the vast majority of available remote sensing data unlabelled, and therefore out of reach for supervised ML algorithms. Our work targets all applications of ML in the remote sensing domain where it is possible to obtain small amounts of labelled data at reasonable expense, but not in quantities that are sufficient to train large neural network models. In such scenarios, a combination of self-supervised pre-training and subsequent supervised fine-tuning makes it possible to leverage large unlabelled datasets in conjunction with a small amount of labelled observations. In particular, contrastive self-supervised learning (SSL) recently emerged as a powerful way of fitting deep neural network models on unlabelled datasets and obtaining strong performance on related down-stream tasks (Jaiswal et al., 2021). The central idea of contrastive SSL is to compare and distinguish samples from one instance with samples from other instances (Wu et al., 2018). This incentivizes the model to learn meaningful features that are constant across different observations of the same instance, but vary across instances. In existing literature, multiple observations of one instance (e.g., an image) are typically obtained by applying strong random augmentations to the original sample.
The models are trained to match augmented versions of the same image and thus learn to become invariant to the applied augmentations (Chen et al., 2020a). The direct application of standard contrastive SSL methods for natural images to remote sensing data is not straightforward given the different data characteristics (Ayush et al., 2021). Commonly used augmentation techniques (e.g., changing image hue or saturation) might not be well-defined for non-RGB remote sensing modalities, or introduce undesirable invariances in the resulting model.
The contributions of our work are as follows: • We propose an augmentation-free variant of contrastive SSL. Same-instance (i.e., positive) samples are obtained from near-in-time imagery of the same scene by satellites with different sensing techniques (see Fig. 1).
• Our approach exploits the geo-location information of remote sensing data to match observations between sensors. This enables the model to jointly learn representations of data from multiple sources, thus performing data fusion without supervision.
• We show that this approach yields significant improvements on down-stream classification tasks, particularly when only small amounts of labelled data are available.

Figure 2. Sample of Sentinel-1/2 image pairs (RGB bands of Sentinel-2 and VV polarization of Sentinel-1) with multi-labels obtained from the DFC2020 dataset. The multi-label classes are given with the majority class in bold font.

RELATED WORK
The idea of contrastive learning (Hadsell et al., 2006) has recently received a lot of attention in the computer vision literature (Wu et al., 2018, Oord et al., 2018, Chen et al., 2020a, He et al., 2020, Tian et al., 2020) and subsequently produced methods that achieved stronger ImageNet classification results than supervised training (Chen et al., 2020b). Following this success, contrastive SSL was also adopted in the remote sensing domain.
Recently, the geo-information of remote sensing data was exploited to collect images of the same scene at different points in time as temporal positives and contrasted against images from other locations (Ayush et al., 2021). In combination with an auxiliary geo-location classification task, this approach results in improved performance across a number of downstream classification and segmentation problems. The Contrastive Predictive Coding method (Oord et al., 2018) has been adapted to remote sensing data by drawing positive pairs as different crops from a satellite image, resulting in improved downstream classification accuracy over ImageNet pre-training, particularly in the multi-spectral case (Stojnic and Risojevic, 2021). In Seasonal Contrast, a two-step procedure for contrastive SSL on remote sensing data has been proposed (Mañas et al., 2021). First, a representative, unlabelled dataset is purpose-built and then a SSL model based on MoCo-v2 (Chen et al., 2020c) with multiple embedding subspaces is trained with augmented and temporal positives. Besides classification and segmentation, contrastive SSL has also recently been applied to tackle problems that are more specific to the remote sensing domain like change detection (Chen and Bruzzone, 2021a, Saha et al., 2021) or data fusion (Chen and Bruzzone, 2021b). Most similar to our work, self-supervised learning has been shown to enable change detection with pre- and post-change satellite images of different modalities (Saha et al., 2021). This technique differs from our approach in the loss function, which combines deep clustering, temporal consistency and contrastive losses, the backbone architecture, which starts to share latent representations of different image modalities before a common convolutional layer, and the change detection down-stream task.
Another recent related work on self-supervised data fusion with multi-modal satellite data (Chen and Bruzzone, 2021b) utilizes model architectures based on ResUnet blocks and operates on pixel-level representations, distinguishing it from our work.

DATA
This work uses multi-modal satellite data from the Sentinel-1 and Sentinel-2 satellites of the European Space Agency's Copernicus Program. Spatially aligned image pairs are obtained from the SEN12MS dataset (Schmitt et al., 2019). See Fig. 2 for representative samples.

Sentinel-1
The Sentinel-1 mission consists of two polar-orbiting satellites with C-band synthetic aperture radar (SAR) devices (Torres et al., 2012). Sentinel-1 provides SAR imaging at up to 5m resolution with dual polarization and revisit times of about 1 week, even in cloudy conditions. We use the VV and VH polarizations of the ground-range-detected Sentinel-1 products in interferometric wide swath mode (10m resolution).
Sentinel-2 Sentinel-2 consists of two polar-orbiting satellites that provide multi-spectral imagery covering the visible, near infrared, and short-wave infrared wavelengths with a ∼5 day revisit rate and up to 10m resolution (Drusch et al., 2012).
SEN12MS SEN12MS is a large-scale dataset of spatially aligned Sentinel-1/2 images (180,662 paired observations obtained in the same season), which we use in this work (Schmitt et al., 2019). The resolution of all bands for both modalities is pre-processed to 10m. SEN12MS also contains MODIS land-cover information, which is not utilized here due to its low resolution (500m). Instead, we use a dataset of Sentinel-1/2 observations published by the IEEE GRSS for the Data Fusion Contest 2020 (DFC2020) with high-fidelity dense land-cover annotations for model evaluation in a classification task (Yokoya et al., 2020). DFC2020 provides a split into test and validation sets of 5,128 and 986 observations, respectively. The land-cover labels are available on a pixel basis and cover 8 classes: Forest, Shrubland, Grassland, Wetland, Cropland, Urban/Built-up, Barren and Water. To create classification targets, we aggregate the dense labels of each scene by selecting the majority class. For multi-label classification, each scene is labelled with all classes that cover more than 10% of the image, following the approach of (Schmitt and Wu, 2021) (see Fig. 2). We utilize the VV and VH polarizations of Sentinel-1, and all 13 spectral bands of Sentinel-2.

Figure 3. Overview of the model architectures for supervised (left) and self-supervised (middle) land-cover classification approaches including fully-connected classification heads.
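The label-aggregation rule described above (majority class for single-label targets, all classes covering more than 10% of the pixels for multi-label targets) can be sketched as follows. This is our own minimal NumPy illustration, not the dataset's official tooling:

```python
import numpy as np

CLASSES = ["Forest", "Shrubland", "Grassland", "Wetland",
           "Cropland", "Urban/Built-up", "Barren", "Water"]

def aggregate_labels(dense, n_classes=8, threshold=0.10):
    """Turn a dense per-pixel label map into scene-level targets.

    dense: 2-D integer array of class indices (0..n_classes-1).
    Returns the majority class index and a boolean multi-label
    vector marking every class covering more than `threshold`
    of the pixels.
    """
    counts = np.bincount(dense.ravel(), minlength=n_classes)
    fractions = counts / counts.sum()
    majority = int(fractions.argmax())
    multi = fractions > threshold
    return majority, multi

# Toy 4x4 scene: mostly Forest (class 0), some Water (class 7).
scene = np.array([[0, 0, 0, 7],
                  [0, 0, 0, 7],
                  [0, 0, 0, 7],
                  [0, 0, 0, 0]])
majority, multi = aggregate_labels(scene)
# majority class is Forest; the multi-label set is {Forest, Water},
# since Water covers 4/16 = 25% of the pixels.
```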

METHODS
We address the problem of learning meaningful data representations from spatially aligned multi-modal satellite imagery in a self-supervised fashion. This paper presents two approaches that extend recent advances in SSL of natural (Chen et al., 2020a) and medical (Windsor et al., 2021) image representations to the remote sensing domain. These approaches use contrastive SSL, which tasks image encoders to map multiple views of an instance close together in latent space, while maintaining distance to other instances (see Fig. 1). Multiple views are typically not available for natural images, thus necessitating the use of random augmentations to simulate them. In medical imaging, multiple views are available when multiple imaging techniques are applied to the same subject. Similarly, in the remote sensing domain, geo-location information facilitates the collection of multiple views per scene from different sensors, which we leverage in this work. The resulting models are tested with single- and multi-label land cover classification as down-stream tasks.

Supervised Baselines
As a point of comparison for the self-supervised methods presented in this paper, we provide the results of 4 supervised learning strategies. The two single-source methods OnlySen-1 and OnlySen-2 are based on either Sentinel-1 or Sentinel-2 data alone and consist of a ResNet18 network (He et al., 2016) with an adapted number of input channels (2 for OnlySen-1, 13 for OnlySen-2). For the EarlyFusion data fusion approach, the Sentinel-1/2 inputs are concatenated across the channel dimension and processed by a ResNet18 to estimate the land-cover class for the given scene. Finally, the LateFusion model consists of two ResNet18 encoders with adapted input layers for the Sentinel-1 and Sentinel-2 inputs. The resulting embeddings are concatenated before the ResNets' fully connected layers and then processed by a single linear classification layer (see Fig. 3).
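The difference between the two fusion strategies can be illustrated schematically. The snippet below uses simple global-average-pooling stand-ins for the ResNet18 encoders (the real models are convolutional), so only the tensor shapes, not the learned features, are meaningful:

```python
import numpy as np

rng = np.random.default_rng(1)
x_s1 = rng.normal(size=(4, 2, 8, 8))    # toy Sentinel-1 batch: VV, VH
x_s2 = rng.normal(size=(4, 13, 8, 8))   # toy Sentinel-2 batch: 13 bands

# EarlyFusion: concatenate along the channel dimension, then feed
# the 15-channel input to a single backbone (not shown here).
early_in = np.concatenate([x_s1, x_s2], axis=1)   # shape (4, 15, 8, 8)

# LateFusion: encode each modality separately, then concatenate the
# embeddings before a single linear classification layer. The
# "encoder" here is global average pooling plus a random linear map.
def encode(x, W):
    return x.mean(axis=(2, 3)) @ W                # (N, 512)

W1 = rng.normal(size=(2, 512))
W2 = rng.normal(size=(13, 512))
late_in = np.concatenate([encode(x_s1, W1), encode(x_s2, W2)], axis=1)
# late_in has shape (4, 1024): the input to the linear classifier
```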

SSL Approach 1: D-SimCLR
The Simple framework for Contrastive Learning of visual Representations (SimCLR) is a commonly used approach for contrastive SSL with image data (Chen et al., 2020a). SimCLR defines a contrastive loss in the latent space to maximize the similarity of augmented versions of the same data sample (see Eqn. 1). In the classical setup, the model is a siamese neural network (Bromley et al., 1993) with weight sharing that consists of a convolutional encoder f(·) followed by a non-linear multilayer perceptron (MLP) g(·). During training, each sample xi of the mini-batch is randomly augmented twice to create two visually different views of the same data point. The loss for the positive pair i, j (augmented versions of the same image) over a batch of 2N augmented samples is computed as:

ℓ(i, j) = −log [ exp(sim(zi, zj)/τ) / Σ_{k=1}^{2N} 1[k≠i] exp(sim(zi, zk)/τ) ]   (1)

where sim(·, ·) is the dot product, 1[k≠i] is the indicator function, and τ is a so-called temperature parameter. The latent vectors zi result from a pass of sample xi through the encoder and the subsequent projection MLP, i.e., zi = g(f(xi)). We implement this standard formulation of SimCLR on the RGB bands of Sentinel-2 images, following the original method as closely as possible. The positive pairs are created by randomly sampling strong augmentations such as ColorJitter, Flipping, Grayscaling and GaussianBlur for each view.
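For illustration, the NT-Xent loss of Eqn. 1 can be computed as in the following NumPy sketch. We normalise the embeddings before the dot product (making it the cosine similarity, as in the SimCLR paper) and assume that rows 2k and 2k+1 of the batch form a positive pair; the function name is ours:

```python
import numpy as np

def nt_xent_loss(z, tau=0.07):
    """NT-Xent loss for a batch of 2N projected embeddings `z`,
    arranged so that rows (2k, 2k+1) form a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                  # pairwise scaled similarities
    np.fill_diagonal(sim, -np.inf)       # enforce k != i in the sum
    pos = np.arange(len(z)) ^ 1          # partner index: (0,1), (2,3), ...
    # log softmax probability of the positive among all other samples
    log_prob = sim[np.arange(len(z)), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 128))            # 2N = 8 toy embeddings
loss = nt_xent_loss(z)                   # scalar, non-negative
```

With identical positive pairs the loss approaches zero, while random embeddings yield a loss of roughly log(2N − 1).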
Extension: D-SimCLR To adjust the SimCLR approach for data fusion with satellite imagery, we omit the random augmentations in favor of spatially aligned images from different sensors. To account for the potentially large difference between the two data modalities (e.g., SAR for Sentinel-1 and multi-spectral imagery for Sentinel-2), we replace the weight sharing with dedicated encoders fs1(·), fs2(·) and projection MLPs gs1(·), gs2(·) for the different sources (called Dual-SimCLR, or D-SimCLR). The latent vectors of different views therefore depend on distinct model components. To calculate the contrastive loss (Eqn. 1), latent vectors of the positive pair i, j (Sentinel-1/2 images of the same scene) and of the other elements k of the mini-batch are computed as:

zi = gs1(fs1(xi^s1)),   zj = gs2(fs2(xj^s2))

where the superscripts denote the sensing modality, xi^s1 and xj^s2 are observations of the same scene, and zk is computed analogously with the encoder of the respective modality. In our experiments, both encoders are identical ResNet18 networks with adjusted input layers for the Sentinel-1/2 bands. The projection heads are MLPs with two fully connected layers and ReLU activation functions that map to a latent dimensionality of 128. For downstream land-cover classification, the vectors fs1(xi^s1) and fs2(xi^s2) are concatenated and processed by a linear layer to obtain classification scores (see Fig. 3).
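Schematically, the D-SimCLR forward pass computes per-modality embeddings with separate weights. The sketch below substitutes random linear maps for the ResNet18 encoders and projection MLPs, so it illustrates the data flow and shapes only; all names and dimensions are ours:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the two encoders f_s1, f_s2 and projection heads
# g_s1, g_s2: random linear maps with per-modality weights (no
# weight sharing). Real inputs are (2, H, W) SAR and (13, H, W)
# multi-spectral images; here H = W = 4 for brevity.
W_f_s1 = rng.normal(size=(2 * 16, 512));  W_g_s1 = rng.normal(size=(512, 128))
W_f_s2 = rng.normal(size=(13 * 16, 512)); W_g_s2 = rng.normal(size=(512, 128))

def embed(x, W_f, W_g):
    """z = g(f(x)): flatten, encode with ReLU, project to 128-d."""
    h = np.maximum(x.reshape(len(x), -1) @ W_f, 0)
    return h @ W_g

x_s1 = rng.normal(size=(4, 2, 4, 4))    # toy Sentinel-1 batch (VV, VH)
x_s2 = rng.normal(size=(4, 13, 4, 4))   # toy Sentinel-2 batch (13 bands)

z_i = embed(x_s1, W_f_s1, W_g_s1)       # z_i = g_s1(f_s1(x_i^s1))
z_j = embed(x_s2, W_f_s2, W_g_s2)       # z_j = g_s2(f_s2(x_j^s2))

# Interleave so rows (2k, 2k+1) are the Sentinel-1/2 views of scene k,
# ready for a contrastive loss in the form of Eqn. 1.
z = np.empty((8, 128))
z[0::2] = z_i
z[1::2] = z_j
```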

SSL Approach 2: Multi Modal Alignment
In medical imaging, contrastive SSL has been used to align whole body scans of a subject obtained with different scan modalities for the purpose of unsupervised cross-modal scan registration (Windsor et al., 2021). The contrastive learning procedure is defined as a matching problem where the model tries to maximize the similarity of latent representations derived from scans of one subject, while distinguishing it from those of other subjects. Multi Modal Alignment (MMA) uses two spatial encoders fvgg(·) with identical architecture inspired by the VGG network to compute spatial feature maps (see Fig. 3 and (Windsor et al., 2021)). A correlation map for the scans is computed as the 2D convolution of the two feature maps over each other: Ci,j = zi * zj, with zi = fvgg(xi). The contrastive loss then follows Eqn. 1 with sim(zi, zj) defined as the maximum value of the correlation map Ci,j. Unlike SimCLR, this method computes the similarity at the level of 2D feature maps rather than between vectors, which retains spatial information at the embedding level. Additionally, this method omits the projection heads. We adapt this approach by replacing medical data (i.e., whole body scans with different modalities) with remote sensing data from different sensing techniques. The matching problem thus tasks the model to match scenes rather than individuals across modalities. For evaluation with land-cover classification, the encoders' feature maps are average-pooled, concatenated and then passed to a linear classification layer.
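A simplified version of the MMA similarity can be written down directly. The sketch below is our own reduction to a single feature channel with non-negative offsets only; the original method correlates full multi-channel VGG feature maps:

```python
import numpy as np

def correlation_map(a, b):
    """Cross-correlation of feature map `b` slid over `a` with
    non-negative offsets, zero-padding `a` beyond its border."""
    H, W = a.shape
    h, w = b.shape
    pad = np.zeros((H + h - 1, W + w - 1))
    pad[:H, :W] = a
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            out[y, x] = (pad[y:y + h, x:x + w] * b).sum()
    return out

def mma_similarity(z_i, z_j):
    """MMA-style similarity: the maximum of the correlation map C_ij."""
    return correlation_map(z_i, z_j).max()

# For a non-negative map correlated with itself, the maximum is
# attained at zero offset and equals the sum of squared entries.
a = np.arange(9, dtype=float).reshape(3, 3)
```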

EXPERIMENTS
The supervised baseline models are trained on the test split of the DFC2020 dataset (see Section 3) with the Adam optimizer and cross entropy loss function. Similarly, the SSL models with classification head are fine-tuned on the DFC2020 dataset after self-supervised training on SEN12MS. The targets are single- and multi-label land cover classes at the scene level. We use an image size of 128×128 pixels, cut at random locations from the native 256×256 pixel images. To mitigate the unbalanced class distribution (e.g., 1,600 instances of Forest, but only 99 of Barren), we oversample rare classes during training by drawing multiple 128×128 pixel crops at random locations from the original images. This results in a dataset of 10,393 observations with approximately uniform class distribution. This dataset is randomly divided into training and validation splits which contain 80% and 20% of the data, respectively (resulting in 4,102 unique training samples). We tune hyperparameters (batch size, learning rate, number of training epochs) with random search based on the performance on the validation split. Model performance is evaluated on the validation split of the DFC2020 dataset (i.e., the test set in our work). We again use 128×128 pixel crops but evaluate the entire images by drawing 4 non-overlapping 128×128 pixel crops in a sliding-window fashion from the original data, resulting in 3,944 images.
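The evaluation-time cropping reduces to a simple sliding window over the native image; a minimal sketch (the function name is ours):

```python
import numpy as np

def four_crops(img, size=128):
    """Split a (C, 256, 256) image into 4 non-overlapping
    (C, 128, 128) crops in a sliding-window fashion."""
    c, H, W = img.shape
    return [img[:, y:y + size, x:x + size]
            for y in range(0, H, size)
            for x in range(0, W, size)]

img = np.zeros((13, 256, 256))   # toy 13-band Sentinel-2 scene
crops = four_crops(img)          # 4 crops covering the full scene
```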

Evaluation Metrics
The classification models are evaluated based on accuracy (see Eqn. 5) for single-label classification, and F1-Score (see Eqn. 8) for multi-label classification. We provide class-wise metrics for the 8 land-cover classes, the average of class-wise values, and the overall average across all samples (i.e., without first aggregating by class). This ensures fair evaluation despite the unbalanced class distribution in the DFC2020 validation set. We report the arithmetic mean and standard deviation over 5 runs with different random seeds for each metric.
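The distinction between the overall average and the average of class-wise values matters under class imbalance. The following sketch (our own helper, not part of any evaluation library) makes it concrete for accuracy:

```python
import numpy as np

def accuracies(y_true, y_pred, n_classes=8):
    """Overall accuracy (across all samples) and average accuracy
    (mean of class-wise accuracies); the two differ whenever the
    class distribution is unbalanced."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    overall = (y_true == y_pred).mean()
    per_class = [(y_pred[y_true == c] == c).mean()
                 for c in range(n_classes) if (y_true == c).any()]
    return overall, float(np.mean(per_class))

# Imbalanced toy case: 4 samples of class 0, 1 sample of class 1,
# and a classifier that always predicts class 0.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
overall, average = accuracies(y_true, y_pred)
# overall = 0.8, but average = (1.0 + 0.0) / 2 = 0.5
```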
Label fraction We also evaluate the influence of dataset size (i.e., number of labelled observations) on model performance.
To that end, the EarlyFusion and LateFusion models are trained for single-label classification on random subsets of the DFC2020 test split comprising 1%, 10% or 50% of the original dataset (corresponding to about 80, 800 and 4,000 samples). We find moderate performance differences between training on 50% or 100% of the data; however, accuracy is greatly reduced when using only 10% or 1% of the labelled samples (−12 and −22 average accuracy points, respectively, for LateFusion; see Fig. 4).

Self-Supervised Setup
The self-supervised models are trained on the SEN12MS dataset without access to land-cover labels. Standard SimCLR utilizes the RGB channels of Sentinel-2 images with a batch size of 768 and a learning rate of 3·10^−5 for 100 epochs. For D-SimCLR, we use a batch size of 128, a learning rate of 3·10^−5 and a temperature value of 0.07 for 50 epochs. MMA is trained with a learning rate of 10^−5 for 100 epochs, while the batch size and temperature are 128 and 0.005, respectively. To evaluate the quality of the resulting image encoder models, we add a classification head consisting of a single linear layer to the pre-trained models. The performance is then evaluated by fine-tuning them for single-label and multi-label land-cover classification with labelled samples of the DFC2020 dataset.
Label fraction We investigate the degree to which a lack of labelled samples can be offset by self-supervised pre-training of the image encoders. To that end, the SSL models are fine-tuned with varying amounts of labelled data (see Section 5.1). This reveals strong performance of the SSL approaches at any label fraction, but particularly when little labelled data is available (see Fig. 4). Using only 10% of labels, D-SimCLR still outperforms the strongest supervised approach (LateFusion) trained on 100% of the data by +3 average accuracy points.

Table 3. Average accuracy and F1-Score (%) of linear probe on single- and multi-label classification on the test data.

            Single-label    Multi-label
D-SimCLR    59 ± 6          60 ± 0
MMA         57 ± 1          56 ± 1

Linear probe We evaluate the quality of the image representations obtained from the self-supervised encoders by linear probing on the DFC2020 land-cover classification problem. To that end, the self-supervised embeddings are fixed and we train only the parameters of a linear layer for single-label classification. This setting allows us to assess how well the SSL methods encode the samples into linearly separable land-cover groups in the latent space. In this simple linear probing setup, D-SimCLR still performs on par with the supervised methods (trained from scratch) and achieves a single-label accuracy of 59±6% and a multi-label F1-Score of 60±0% (see Table 3). Qualitative visual inspection of the latent spaces of our supervised baselines and the SSL methods with t-SNE (Van der Maaten and Hinton, 2008) reveals that MMA and D-SimCLR structure the latent space by land-cover classes to a similar degree as the supervised approaches (see Fig. 5).
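Linear probing itself is straightforward: the encoder outputs are frozen and only a linear classifier is fit on top. The sketch below uses synthetic "frozen embeddings" and a closed-form least-squares fit as a stand-in for the SGD-trained linear layer used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen self-supervised embeddings:
# two linearly separable clusters in a 64-d feature space.
n, d = 200, 64
labels = rng.integers(0, 2, size=n)
emb = rng.normal(size=(n, d)) + 3.0 * labels[:, None]   # frozen features

# Linear probe: only this weight matrix is trained. Least squares on
# one-hot targets is a simple closed-form substitute for training a
# softmax layer with gradient descent.
onehot = np.eye(2)[labels]
W, *_ = np.linalg.lstsq(emb, onehot, rcond=None)
pred = (emb @ W).argmax(axis=1)
probe_acc = (pred == labels).mean()   # high iff classes are linearly separable
```

The probe's accuracy thus directly measures how linearly separable the frozen embedding space is, which is exactly what Table 3 reports for the real encoders.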

Cross-dataset Evaluation
To assess if the power of self-supervised pre-training transfers across datasets, we fine-tune MMA and D-SimCLR models which were trained on SEN12MS for land-cover classification on the EuroSAT dataset (Helber et al., 2019). EuroSAT consists of 27,000 Sentinel-2 images with land-cover labels from 10 classes. We randomly split the dataset into train (60%), validation (20%), and test (20%) sets. The OnlySen-2 model is used as supervised baseline. After 20 training epochs, this yields an average classification accuracy of 91±1% on the test set. After self-supervised pre-training on SEN12MS, we fine-tune MMA and D-SimCLR for the EuroSAT classification task. To that end, the Sentinel-1 backbones are dropped from the models. We find that the pre-trained models converge quickly to strong land-cover classifiers and achieve average accuracies of 97±1% (MMA) and 95±0% (D-SimCLR) after 20 training epochs, outperforming the supervised baseline.

DISCUSSION
Our experiments with contrastive self-supervised data fusion focus on two aspects: (1) We fine-tune image encoders that were pre-trained on an unlabelled dataset in a self-supervised fashion. This reveals strong performance on the land-cover classification down-stream task in both the single- and multi-label setting. Fine-tuning the D-SimCLR method results in better classification accuracy than any of the supervised baseline approaches. This illustrates that contrasting multi-modal satellite imagery is a useful training target in SSL that effectively learns Sentinel-1/2 data characteristics from unlabelled datasets. This insight is particularly valuable when training labels are scarce.
As our experiments with varying label fractions reveal, the self-supervised pre-training strategy consistently outperforms supervised training, even when few labelled observations are available.
(2) We use linear probing to evaluate the self-supervised image embeddings. Here we find that linear classification of frozen D-SimCLR embeddings can provide better accuracy than standard supervised training. This result further establishes the utility of SSL for data fusion. Across our experiments, D-SimCLR consistently outperforms the MMA method. MMA was initially designed to preserve spatial information from the input image in the embedding space. This property is useful for tasks like scan registration or dense prediction, but is not properly utilized in our single- and multi-label classification problems, which might explain the performance difference to D-SimCLR. The results of the original SimCLR approach were also not competitive with our D-SimCLR. Putatively, this is due to the reduced amount of spectral information in the RGB input data and the problem that strong augmentations as suggested in the original paper result in hardly recognizable scenes when applied to remote sensing data. The D-SimCLR method, on the other hand, is tailored to remote sensing data and bypasses these challenges of the original approach by leveraging multi-modal satellite data.

CONCLUSION
This work investigated the idea of leveraging the geo-location information of remote sensing data for contrastive self-supervised data fusion. We present two techniques that utilize multi-modal remote sensing data of the same location (but from different satellites) as positive pairs in contrastive SSL. Both techniques produce meaningful representations of Sentinel-1 and Sentinel-2 images, as illustrated by linear probing. Fine-tuning the contrastive SSL models strongly outperforms standard supervised data fusion approaches for both single-label and multi-label classification. With only 10% of labels, D-SimCLR performs better than any of the supervised approaches trained on the entire dataset. These results demonstrate the potential of SSL for data fusion. Future work could extend the presented methods to dense prediction tasks, or investigate the utility of incorporating additional satellite data modalities.