SEMI-SUPERVISED SEMANTIC SEGMENTATION NETWORK VIA LEARNING CONSISTENCY FOR REMOTE SENSING LAND-COVER CLASSIFICATION

Current popular deep neural networks for semantic segmentation are almost all supervised and rely heavily on large amounts of labeled data. However, obtaining large amounts of pixel-level labeled data is time-consuming and laborious, and in remote sensing this problem is even more pressing. To alleviate it, we propose a novel semantic segmentation neural network (S4Net) based on semi-supervised learning that exploits unlabeled data. Our model learns from unlabeled data through consistency regularization, which enforces the consistency of outputs under different random transforms and perturbations, such as random affine transforms. The network is thus trained by the weighted sum of a supervised loss on labeled data and a consistency regularization loss on unlabeled data. Experiments conducted on the DeepGlobe land cover classification challenge dataset verify that our network can make use of unlabeled data to obtain precise semantic segmentation results and achieves competitive performance compared to other methods.


INTRODUCTION
In remote sensing science and technology, the classification of remote sensing images is one of the most basic research issues, and it is the basis of other remote sensing research and applications. In the past, traditional machine learning methods, such as support vector machines, were generally used for the classification and recognition of remote sensing images. These methods generally combine human prior knowledge and intuitive experience to design and select several characteristics and features that are strongly related to the task (LeCun et al., 2015).
In recent years, deep learning has become mainstream in image processing, and convolutional neural networks (CNNs) have achieved great success (LeCun et al., 2015). With large datasets, CNN models can be trained end-to-end to obtain more robust feature representations and higher accuracy. Although the currently popular methods can obtain good results, most current models are trained in a supervised fashion, which needs a large amount of labeled data to cooperate with deep networks for learning parameters (Zhu et al., 2017, Ball et al., 2017). However, collecting accurately labeled data is extremely time-consuming and laborious, especially accurate pixel-level labeled data, because labeling requires a certain amount of expert knowledge and data can be difficult to obtain for security or privacy considerations (Castrejon et al., 2017). For example, in the field of remote sensing, it is difficult to obtain high-precision, high-quality surface cover data. Therefore, for many practical problems and applications, the lack of resources to create sufficiently large labeled datasets has limited the widespread application of deep learning technologies.
A potentially promising approach to this problem is semi-supervised learning (SSL), a type of machine learning that lies between supervised learning and unsupervised learning. It usually uses a small number of labeled data and a large number of unlabeled data to train a neural network (Chapelle et al., 2009). It has been found that combining unlabeled data with a small number of labeled data can significantly improve learning performance. For example, as shown in Figure 1, more accurate decision boundaries can be found by using more unlabeled samples. For supervised learning, obtaining data annotations is costly and time-consuming, and it is difficult to obtain a large amount of labeled data, while the acquisition of unlabeled data is relatively cheap, so the application of semi-supervised learning is more extensive.

Figure 1. (a) labeled data; (b) labeled data and unlabeled data; (c) supervised learning; (d) semi-supervised learning.

Figure 2. The proposed semi-supervised semantic segmentation framework (S4Net) based on consistency regularization. The UNet is used here with a shared-weights strategy. For the labeled data part, the input images are fed into the network to get predicted outputs, from which a supervised loss (such as the cross entropy loss) is computed. For the unlabeled data part, the input images are augmented and then fed into the network to get two outputs, and their consistency loss is calculated.
To overcome the large amount of data required for supervised learning, we propose a semantic segmentation network based on semi-supervised learning, named S4Net in this paper. Specifically, consistency regularization is introduced to exploit the unlabeled data, encouraging pixel-level consistency of the output under different random transforms and perturbations. The network is then trained by the weighted sum of a supervised loss from labeled data and a consistency regularization loss from unlabeled data. We performed experiments on the public DeepGlobe land cover classification challenge dataset and verified that this method can take advantage of unlabeled data and achieve improvements when only a small amount of labeled data is available.

RELATED WORK
In this section, previously proposed semi-supervised learning methods for image classification are reviewed, followed by a discussion of related semi-supervised semantic segmentation works.
Semi-supervised learning (SSL) lies between supervised and unsupervised learning (Chapelle et al., 2009). It can be divided into two categories: transductive learning and inductive learning. It should be noted that semi-supervised learning has to rely on some assumptions; for details, please refer to the book review (Chapelle et al., 2009). Next, we review semi-supervised learning methods based on deep learning.

Semi-supervised learning for image classification
One of the simplest methods is pseudo-labeling (Lee, 2013), which is widely used in practice, likely because of its simplicity and generality: the class with the maximum predicted probability is used as the label of a sample. The π-model and temporal ensembling (Laine, Aila, 2016) are based on consistency regularization, which takes advantage of stochasticity and minimizes the difference between predictions under different random transforms and perturbations of the input samples (Sajjadi et al., 2016). Different from the π-model, Mean Teacher (Tarvainen, Valpola, 2017) used a more stable predicted output obtained from an exponential moving average of the network parameters. Instead of using the randomness of the network, Virtual Adversarial Training (VAT) (Miyato et al., 2018), inspired by adversarial training, directly used as the target the small input perturbation that would most significantly affect the output of the prediction function. Instead of adding perturbations to each single training sample, Smooth Neighbors on Teacher Graphs (SNTG) (Luo et al., 2018) encouraged neighbors to get similar predictions while non-neighbors are pushed apart from each other. The Co-Training method (Qiao et al., 2018) can learn multiple neural networks from different views and use adversarial examples to force differences between the views. Inspired by the mixup method, Interpolation Consistency Training (ICT) (Verma et al., 2019) proposed a semi-supervised learning method that enforces the output at an interpolation of unlabeled samples to be consistent with the interpolation of the outputs at those samples. Instead of using the class with the maximum predicted probability as the label, Deep Label Propagation (Iscen et al., 2019) used transductive label propagation to obtain pseudo labels according to the manifold assumption.
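The pseudo-labeling idea mentioned above can be sketched in a few lines: the class with the maximum predicted probability becomes the label, often gated by a confidence threshold. This is a minimal NumPy illustration; the threshold value is an illustrative choice, not taken from the cited work:

```python
import numpy as np

def pseudo_labels(probs, threshold=0.9):
    """Turn per-pixel class probabilities into pseudo labels.

    probs: (H, W, C) predicted probabilities for an unlabeled image
    Returns (labels, mask): argmax labels and a mask of confident pixels.
    """
    labels = probs.argmax(axis=-1)
    confidence = probs.max(axis=-1)
    mask = confidence >= threshold  # only confident pixels would contribute to the loss
    return labels, mask

probs = np.array([[[0.95, 0.03, 0.02],
                   [0.40, 0.35, 0.25]]])  # one confident, one uncertain pixel
labels, mask = pseudo_labels(probs)
# labels -> [[0, 0]], mask -> [[True, False]]
```

In a training loop, only the pixels selected by `mask` would be included in the cross entropy loss on unlabeled images.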
MixMatch (Berthelot et al., 2019) combined ideas and components from the current dominant paradigms for semi-supervised learning.

Semi-supervised learning for semantic segmentation
Though substantial recent progress has been made in developing semi-supervised algorithms for the image classification task on comparatively small datasets, many of these methods do not scale readily to the semantic segmentation task of real-world applications. Some works have been proposed for the semi-supervised semantic segmentation task in recent years. Hong et al. (Hong et al., 2015) proposed a decoupled network to learn classification and segmentation networks separately by exploiting unlabelled samples with image-level labels and pixel-wise annotations. Souly et al. (Souly et al., 2017) proposed to use a GAN architecture for semi-supervised semantic segmentation. In this architecture, generated data, unlabeled data, and labeled data were fed to a discriminator to get class confidences and generate confidence maps for each class as well as a label for fake data. Hung et al. (Hung et al., 2018) also proposed an adversarial network for semi-supervised semantic segmentation. The difference is that they designed a fully convolutional discriminator to discover trustworthy regions of unlabeled samples that facilitate the training process for segmentation. Kalluri et al. (Kalluri et al., 2019) devised a universal segmentation model, which can be jointly trained across different datasets with different categories.

METHOD
In this section, we first formulate the semi-supervised learning problem, and then we present our semi-supervised semantic segmentation framework, denoted as S4Net.

Overview
In the context of supervised learning, all the input data is labeled and the neural network is usually trained by minimizing a supervised loss term:

$$L_s = \frac{1}{N_L} \sum_{i=1}^{N_L} \ell_s\big(f_\theta(x_i^L),\, y_i^L\big), \tag{1}$$

where the supervised loss $\ell_s$ is usually formulated as the cross entropy loss and $f_\theta(\cdot)$ denotes the neural network with parameters θ.
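As a sketch of the supervised term, the pixel-wise cross entropy over a softmax output can be written as follows (a minimal NumPy illustration, not the paper's actual implementation):

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    """Mean pixel-wise cross entropy.

    logits: (H, W, C) raw network outputs for one image
    labels: (H, W) integer class indices
    """
    # softmax over the class dimension (numerically stabilized)
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    h, w = labels.shape
    # pick the predicted probability of the true class at each pixel
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(-np.log(p_true + 1e-12).mean())

logits = np.zeros((2, 2, 3))           # uniform logits -> p = 1/3 everywhere
labels = np.array([[0, 1], [2, 0]])
loss = pixel_cross_entropy(logits, labels)  # -log(1/3) = log(3)
```

In practice a framework routine such as PyTorch's built-in cross entropy loss would be used; the sketch only makes the per-pixel averaging in Equation (1) explicit.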
However, in the context of semi-supervised learning, one can access a number of labeled samples $\{(x_i^L, y_i^L)\}_{i=1}^{N_L}$ and unlabeled samples $\{x_j^U\}_{j=1}^{N_U}$, where $C$ is the number of classes and $N_L$, $N_U$ are the numbers of labeled and unlabeled samples, with $N_L \ll N_U$. The goal of semi-supervised learning is to obtain a better model by using all labeled and unlabeled data than supervised learning alone. Thus, the loss function is formulated as the weighted sum of a supervised loss $L_s$ from labeled data and a regularization loss $L_u$ from unlabeled data or from both labeled and unlabeled data:

$$L = L_s + \lambda L_u, \tag{2}$$

where λ is a hyperparameter that quantifies the importance of the regularization loss.
To make use of unlabeled data, consistency regularization (Sajjadi et al., 2016, Laine, Aila, 2016, Tarvainen, Valpola, 2017) is usually introduced to exploit the potential data manifolds:

$$L_u = \frac{1}{N_U} \sum_{i=1}^{N_U} d\big(f_\theta(\tilde{x}_i),\, f_{\tilde{\theta}}(\tilde{x}_i)\big), \tag{3}$$

where $\tilde{x}_i$ refers to an example $x_i$ to which a random perturbation has been applied. In image classification, random flips, random crops and random noise are usually used as the random perturbation. The network parameter $\tilde{\theta}$ is either equal to the original parameter θ or some transformation of it, such as an exponential moving average over the updates of the network. The consistency regularization term $L_u$ often uses the mean squared error (squared L2 norm) or the Kullback-Leibler divergence, which encourages the pixel-level consistency of the output under different random transforms and perturbations.

Algorithm 1: Mini-batch training for semi-supervised semantic segmentation
Require: neural network with parameters θ
Require: random perturbation function ϕ
for t in [1, number of epochs] do
    for each minibatch B do
        get labeled samples x^L and unlabeled samples x^U from B
        compute supervised loss L_s using Equation (1)
        get two random perturbations ϕ_1, ϕ_2
        perform random perturbations on unlabeled samples: x̃^U_1 = ϕ_1(x^U), x̃^U_2 = ϕ_2(x^U)
        compute semi-supervised consistency regularization loss L_u using Equation (4)
        compute total loss L = L_s + λ L_u
        update θ using an optimizer, e.g., SGD
    end
end

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume V-2-2020, 2020 XXIV ISPRS Congress (2020 edition)

Semi-supervised semantic segmentation framework (S4Net)

Figure 2 shows the proposed semi-supervised semantic segmentation framework. We adopt the UNet (Ronneberger et al., 2015) with a ResNet-50 (He et al., 2016) model pre-trained on the ImageNet dataset as our segmentation baseline network. The decoder network uses 3×3 convolutions and strided 4×4 transposed convolutions to recover the original input size. The detailed network structure is shown in Table 1.
For labeled samples, we feed them to the network to compute the supervised loss $L_s$, such as the cross entropy loss. For unlabeled samples $x^U$, one can get two different transformed samples $\tilde{x}^U_1, \tilde{x}^U_2$ by performing two random perturbations $\phi_1, \phi_2$, namely $\tilde{x}^U_1 = \phi_1(x^U)$ and $\tilde{x}^U_2 = \phi_2(x^U)$. Here we use random affine transformations as the random perturbation. Feeding them to the network gives two outputs $f_\theta(\tilde{x}^U_1), f_\theta(\tilde{x}^U_2)$. Different from the classification task, to compute the pixel-level consistency of the two outputs, we have to perform the inverse transforms to put every pixel back to its original location. We denote the two inverse transforms of the random perturbations as $\phi_1^{-1}, \phi_2^{-1}$. Thus, we get the inverse-transformed outputs $\phi_1^{-1}(f_\theta(\tilde{x}^U_1))$ and $\phi_2^{-1}(f_\theta(\tilde{x}^U_2))$, and the semi-supervised consistency regularization loss can be computed as follows:

$$L_u = \frac{1}{N_U} \sum_{i=1}^{N_U} \big\| \phi_1^{-1}\big(f_\theta(\tilde{x}^U_{1,i})\big) - \phi_2^{-1}\big(f_\theta(\tilde{x}^U_{2,i})\big) \big\|^2. \tag{4}$$

Here we used the mean squared error as the consistency regularization loss. For the affine transformation, we used translation in the range [-0.2, 0.2] as a factor of both height and width, scaling in the range [0.75, 1.25] and rotation in the range [−15°, 15°]. The algorithm flow of the proposed semi-supervised semantic segmentation framework is shown in Algorithm 1.
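The pixel-level consistency term can be sketched as follows. For simplicity, this illustration uses a horizontal flip as the invertible perturbation (the paper uses random affine transforms, whose inverses play the same role), and a toy pixel-wise function stands in for the network f_θ:

```python
import numpy as np

def consistency_loss(f, x):
    """Mean squared error between two perturbed views of an unlabeled image,
    with each output mapped back to the original pixel grid before comparison.

    f: model mapping (H, W) image -> (H, W, C) per-pixel scores
    x: (H, W) unlabeled image
    """
    phi1 = lambda img: img           # identity view; its inverse is also identity
    phi2 = lambda img: img[:, ::-1]  # horizontal flip view; a flip is its own inverse
    out1 = f(phi1(x))
    out2 = f(phi2(x))
    inv1 = out1                      # phi1^{-1}(f(phi1(x)))
    inv2 = out2[:, ::-1]             # phi2^{-1}(f(phi2(x))): undo the flip on the output
    return float(((inv1 - inv2) ** 2).mean())

# a toy pixel-wise "network": two channels derived from the input value
toy_net = lambda img: np.stack([img, 1.0 - img], axis=-1)
x = np.array([[0.0, 1.0], [0.5, 0.5]])
loss = consistency_loss(toy_net, x)  # 0.0: a pixel-wise map is equivariant to the flip
```

A real network is not exactly equivariant to its perturbations, so the loss is nonzero and its gradient pushes the two inverse-transformed predictions toward agreement, which is precisely the signal Equation (4) extracts from unlabeled data.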

Dataset
To verify our method, we use the land cover classification dataset from the DeepGlobe Challenge (Demir et al., 2018). This dataset offers 1,146 high-resolution sub-meter satellite images, each with a size of 2448×2448 pixels. The whole dataset is split into training, validation and test sets with 803, 171 and 172 images, respectively. The mask images are RGB images with 7 classes, see Figure 3. The unknown class is ignored in the evaluation stage.
It is worth noting that we only use the training set as the experimental data, and randomly divide it into 100, 503 and 200 images as labeled data, unlabeled data and a validation dataset, respectively.
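The split described above can be reproduced with a simple seeded shuffle; the seed value here is an arbitrary illustrative choice, not the one used in the paper:

```python
import random

def split_dataset(image_ids, n_labeled=100, n_unlabeled=503, n_val=200, seed=0):
    """Randomly partition image ids into labeled / unlabeled / validation sets."""
    ids = list(image_ids)
    assert n_labeled + n_unlabeled + n_val <= len(ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    labeled = ids[:n_labeled]
    unlabeled = ids[n_labeled:n_labeled + n_unlabeled]
    val = ids[n_labeled + n_unlabeled:n_labeled + n_unlabeled + n_val]
    return labeled, unlabeled, val

# 803 training images -> 100 labeled, 503 unlabeled, 200 validation
labeled, unlabeled, val = split_dataset(range(803))
```

Fixing the seed matters for the robustness experiments later in the paper, where different random splits are averaged over multiple runs.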

Implementation Details
Our implementation used the PyTorch framework, and an NVIDIA Titan X GPU was used to accelerate training. We used stochastic gradient descent (SGD) with a mini-batch size of 6 to train our model, including 4 labeled samples and 2 unlabeled samples. The weight decay was set to 0.0001 and the momentum to 0.9. A cosine annealing strategy was used as the learning rate policy. The initial learning rate started from 0.01 and the models were trained for a total of 100,000 steps. For the weight λ of the semi-supervised consistency regularization loss, we used a sigmoid-shaped ramp-up curve $e^{-5(1-x)^2}$ over the first 80,000 steps, with a maximum λ of 2.0. For the data augmentation strategy, we used random horizontal and vertical flips, and the crop size is 512×512.
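The sigmoid-shaped ramp-up for λ described above can be written as follows, using the λ_max = 2.0 and 80,000 ramp-up steps stated in the text:

```python
import math

def consistency_weight(step, ramp_up_steps=80_000, lam_max=2.0):
    """Sigmoid-shaped ramp-up lam_max * e^{-5(1-x)^2} for the consistency loss weight."""
    x = min(step / ramp_up_steps, 1.0)  # training progress clipped to [0, 1]
    return lam_max * math.exp(-5.0 * (1.0 - x) ** 2)

# the weight grows from lam_max * e^{-5} at step 0 to the full lam_max after 80,000 steps
w_start, w_end = consistency_weight(0), consistency_weight(80_000)
```

Ramping λ up slowly is the common practice in consistency-based methods: early in training the network's predictions on unlabeled data are unreliable, so the consistency term is kept small until the supervised signal has shaped the model.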
For evaluation, the mean intersection over union (mIoU) is calculated as the evaluation metric. The IoU is defined as the size of the intersection divided by the size of the union of two sets:

$$\mathrm{IoU} = \frac{|R_g \cap R_p|}{|R_g \cup R_p|},$$

where $R_g$ and $R_p$ are the set of label pixels and the set of prediction pixels. ∩ and ∪ denote the intersection and union operations, respectively, and |·| denotes the number of pixels in a set. The mIoU is obtained by averaging the per-class IoU.
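The metric above can be computed per class directly from the label and prediction maps; a compact NumPy sketch (the `ignore_index` parameter is our illustrative way of skipping the unknown class):

```python
import numpy as np

def mean_iou(label, pred, num_classes, ignore_index=None):
    """Mean intersection-over-union over classes present in label or prediction.

    label, pred: (H, W) integer class maps
    """
    ious = []
    for c in range(num_classes):
        if c == ignore_index:
            continue  # e.g. the 'unknown' class ignored during evaluation
        gt, pr = (label == c), (pred == c)
        union = np.logical_or(gt, pr).sum()
        if union == 0:
            continue  # class absent from both maps; excluded from the average
        inter = np.logical_and(gt, pr).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

label = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
# class 0: |∩|=1, |∪|=2 -> 0.5; class 1: |∩|=2, |∪|=3 -> 2/3; mIoU = 7/12
miou = mean_iou(label, pred, num_classes=2)
```

When evaluating over a whole dataset, the per-class intersections and unions are typically accumulated across all images before dividing, rather than averaging per-image IoUs.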

Experimental Results
To evaluate our method, we trained UNet in a supervised way as the baseline. We also compared with Hung et al.'s (Hung et al., 2018) method, which uses a generative adversarial network to determine the confidence maps of the unlabeled data output. The experimental results are shown in Table 2. As we can see, the baseline method achieves 62.1 mIoU, whereas Hung et al.'s method got a worse result. We suspect the reason is that the training process of the generative adversarial network is unstable when the number of unlabeled data is much larger than that of labeled data. We also trained UNet on both the labeled and the unlabeled data (with their labels) to get the upper bound of semi-supervised learning and achieved 66.8 mIoU. The experimental results show that our method achieves a 4.7 mIoU improvement over the baseline method.
The detailed per-class performance of our method and other methods on the validation dataset is presented in Table 3. Similarly, Hung et al.'s method got worse results, especially for forest land and rangeland. Our semi-supervised method outperforms the supervised baseline by a significant margin, for example with 4.22% and 23.92% improvements for forest land and rangeland. For the barren land and water classes, our method gets 57.97 and 81.10 IoU and achieves better performance than the fully supervised results. Thus, we believe that our proposed method takes advantage of unlabeled data.
We also visualized some results on the validation dataset for qualitative comparison, as illustrated in Figure 4. As we can see, our semi-supervised method handles details better than the baseline method, and achieves better integrity and correctness.

Table 3. The detailed per-class performance of our method and other methods (agriculture land, forest land, rangeland, urban land, barren land, water). Bold font means the performance of our method is larger than that of the baseline method and Hung et al.'s method.

Hyperparameter analysis
In this section, the weight λ of the semi-supervised consistency regularization loss is analyzed. Since it is not possible to try all possible values, based on previous literature (Laine, Aila, 2016, Verma et al., 2019), we used five different choices of λ here: 0.2, 2, 10, 20 and 100. The implementation details and settings are the same as in the previous experiments. We evaluated the different choices of λ by mIoU on the validation dataset; the experimental results are shown in Figure 5. As shown in Figure 5, the result with a weight of 2.0 is significantly better than the other four values. Thus, we use 2.0 as the default weight of the semi-supervised consistency regularization loss. To evaluate the robustness of the proposed method, we considered three different numbers of labeled samples. In detail, as in the previous experiments, 200 of the 803 images were selected as the validation set, and the labeled and unlabeled data were divided from the remaining 603 images. We then ran each setting three times with different random seeds to calculate the mean and standard deviation.

Robustness evaluation of the proposed method
The results obtained are recorded in Table 4. Under all three settings, the proposed method is superior to the baseline method. Even when the number of labeled data is very small, that is, when the training dataset contains only 20 labeled images (the rest being unlabeled), the proposed method can still obtain considerable results. We also observed that when the number of labeled images was reduced from 50 to 20, there was a larger decrease in accuracy. We attribute this to the fact that the network cannot get enough valid and correct signals from the training data when there are only a few labeled samples. However, this problem can be mitigated by utilizing large amounts of unlabeled data through semi-supervised learning.

CONCLUSION
In this work, a novel semi-supervised semantic segmentation framework (S4Net) was proposed via enforcing consistency regularization for remote sensing images. The proposed method can make use of unlabeled data to improve performance by encouraging the pixel-level consistency of output under different random transforms and perturbations. The experiments show that this method is promising and can bring higher accuracy when there are fewer labeled samples. Especially in remote sensing application scenarios, such as pan-sharpening (Zhang et al., 2019a) and super-resolution (Zhang et al., 2019b), where accurately labeled data is difficult to obtain, semi-supervised learning can play a greater role.
In the future, we will continue working on semi-supervised learning to make better use of unlabeled data, and future research should consider more diverse transformations or perturbations. For example, adversarial perturbations could be introduced to augment the training samples.