AUTOMATIC CLOUD DETECTION METHOD BASED ON GENERATIVE ADVERSARIAL NETWORKS IN REMOTE SENSING IMAGES

ABSTRACT: Clouds in optical remote sensing images seriously affect the visibility of background pixels and greatly reduce the availability of images, so clouds must be detected before the images are processed. In this paper, a novel cloud detection method based on an attentive generative adversarial network (Auto-GAN) is proposed. Our main idea is to inject visual attention into the domain transformation to detect clouds automatically. First, we use a discriminator (D) to distinguish between cloudy and cloud-free images. Then, a segmentation network is used to detect the difference between cloudy and cloud-free images (i.e. the clouds). Last, a generator (G) is used to fill in the differing regions of the cloudy image in order to confuse the discriminator. In the training phase, Auto-GAN only requires images and their image-level labels (1 for a cloud-free image, 0 for a cloudy image), which are far less time-consuming to acquire than the pixel-level labels required by existing CNN-based methods. Auto-GAN is applied to cloud detection in Sentinel-2A Level 1C imagery. The results indicate that Auto-GAN performs well in cloud detection over different land surfaces.


INTRODUCTION
Remote sensing images have been applied in many fields, such as change detection, land cover and land use classification, and environmental monitoring (Novo-Fernández et al., 2018). However, 66% of the Earth's surface is covered by clouds most of the time (Zhang et al., 2004). Clouds block the signal from the land surface and alter the reflectance of ground objects, reducing the applicability of optical images (Fisher, 2014). Since the transmittance of thick clouds in optical images is 0, the signals of ground objects are completely blocked, and highlights such as bare land and buildings are easily confused with clouds. Clouds not only block ground objects but also affect subsequent processing steps such as image fusion and registration. Although humans can label cloud masks very accurately, this process is time-consuming, expensive and difficult. Thus, automatic cloud detection in optical remote sensing images is very important.
Over the past decades, many methods have been proposed for cloud detection. Generally, the traditional methods can be divided into two types: threshold-based methods and multitemporal-based methods. Threshold-based methods are widely used to generate basic masks and can distinguish cloudy from clear-sky pixels by exploiting the spectral differences between dark land and clouds (Sun et al., 2018; Wang et al., 2016; Zhu et al., 2015; Parmes et al., 2017), but they often fail to separate clouds from highlights. Multitemporal-based methods use clean images from different time periods within a certain area to produce clean synthetic images, then calculate the difference between cloudy images and clean images. Wei et al. (2016) select visible-to-NIR bands to separate land surfaces from clouds, and use the short-wave infrared bands to distinguish clouds from snow/ice in MODIS and Landsat 8 data. Zhu et al. (2012) propose Fmask, a rule-based automated cloud detection method for Landsat images. Frantz et al. (2015) use the spectrally correlated NIR bands, which are additionally affected by a view angle parallax, to separate clouds from land surfaces, further improving the separation of the potential cloud pixels (PCPs) produced by Fmask. Han et al. (2014) first detect thick clouds using a threshold method, then use a modified scale-invariant feature transform (SIFT) method to transform cloud-free reference images (acquired over the same region at a different time) to the coordinates of the cloudy image.
The traditional methods have several disadvantages: 1) threshold-based methods rely on human experience and professional knowledge, and it is very difficult to set a threshold that transfers to other satellite images; 2) methods that apply multiple thresholds to multi-spectral bands are limited by the bands available on the satellite sensor (Parmes et al., 2017); 3) multitemporal-based methods require time series images, which are not always available or practical to process. They may be suitable for specific regions but do not work well in others.
Recently, many methods based on machine learning, and especially deep learning (DL), have been proposed for cloud detection in remote sensing. Mateo et al. (2017) propose a deep learning-based method for cloud detection in Proba-V multispectral images. A supervised CNN architecture for cloud detection in SPOT6 images is proposed in (Goff et al., 2017). Xie et al. (2017) first transform the image from RGB to HIS color space, then cluster it into super-pixels using the saliency detection method GS04 (Zhao et al., 2015), and finally train a CNN model to detect clouds. The U-Net architecture has proved effective for on-board cloud detection in small satellite images (Zhang et al., 2018). Multi-scale convolutional features are used to detect clouds in medium- and high-resolution remote sensing images from different sensors (Li et al., 2019). Although these DL-based methods have achieved very high accuracy for cloud detection in remote sensing images, they require pixel-level ground truths labeled by humans, which are very time-consuming to obtain. Thus, an unsupervised feature extraction method that does not require pixel-level ground truths is more desirable.
Generative adversarial networks (GANs) have been proposed as an unsupervised deep learning model (Goodfellow et al., 2014). A GAN is a generative architecture in which two networks play a minimax game: a generative network translates a random input into a realistic sample, and a discriminative network distinguishes the generated sample from the true sample. Isola et al. (2017) propose the pix2pix GAN framework for image-to-image translation with paired images; this method can, for example, translate between an aerial photo and a map. Zhu et al. (2017) propose a method for image-to-image translation with unpaired images, using a cycle consistency loss to train G and D to be consistent with each other. Due to their effectiveness, GANs are among the most promising methods for unsupervised learning on complex distributions.
In addition, the human attention mechanism has been widely studied in recent years. By introducing an attention mechanism, a neural network can focus more on an area of interest (Vaswani et al., 2017). The method proposed in (Qian et al., 2018) combines visual attention and GANs to remove raindrops from images; it adopts an LSTM network to locate the raindrop regions of the input image, which guides the generative and discriminative networks to pay more attention to raindrop regions and to ignore regions without raindrops.
In this paper, a GAN-based method for automatic cloud detection is proposed, in which the GAN architecture is redesigned and injected with an attention mechanism to detect clouds. The GAN is used to translate the detected regions between cloud and background, which helps the attention concentrate the detected regions on clouds. Experimental results on Sentinel-2A images over China show that the proposed method performs well under different background conditions, both visually and quantitatively.
The rest of this paper is organized as follows. Section 2 introduces the framework of the proposed Auto-GAN method and the details. Section 3 describes the experimental data and setup, and discusses the comparative results with the official Sentinel-2 cloud masks (sen2cor) and baseline deep learning-based methods. Finally, the conclusions are drawn in Section 4.

METHODOLOGY
The inputs for the Auto-GAN method are an image and a single corresponding image-level label (indicating whether there are clouds in the input image). Auto-GAN aims to detect cloud regions automatically by extracting the feature difference between unpaired cloudy and cloud-free images. The proposed method consists of four networks: an attentive network, two generative networks and a discriminative network. The discriminative network is used to distinguish whether there are clouds in the input image. The attentive network is used to detect and delineate cloud regions in the input image. One of the generative networks is used to translate the cloud regions into cloud-free regions. The other generative network is used to restore the translated cloud-free regions back into cloud regions.

Overview of the Proposed Method
GANs were originally designed to produce samples. The basic GAN architecture contains two networks: a generative network (G) and a discriminative network (D). G tries to generate samples in order to confuse D, while D tries to distinguish real samples from generated samples (Goodfellow et al., 2014). This game can be represented by the following formulation:
$$\min_G \max_D \; \mathbb{E}_{x \sim d_t}\big[\log D(x)\big] + \mathbb{E}_{z \sim d_s}\big[\log\big(1 - D(G(z))\big)\big] \tag{1}$$
where d_t = distribution of target samples
d_s = distribution of source samples
As shown in Figure 1, Auto-GAN consists of three tunnels (attention tunnel, translation tunnel and restoration tunnel) and a discriminative network.
[Figure 1: Auto-GAN architecture, training stage. Recoverable labels: cloud-free image; cloud image (c); F_1(c); attentive network A; binarization.]
The attention tunnel is used to detect cloud regions; its output is called the attention map in this work. We use a segmentation network to produce the attention map of the clouds, which guides the translation and restoration processes to pay more attention to cloud regions. The attention map is represented by a matrix P ∈ [0,1] of grayscale values ranging from 0 to 1, with higher values in P indicating more attention to the corresponding region. The translation tunnel aims to translate a cloudy image into a cloud-free image, which is then fed to the discriminative network to decide whether it contains clouds or not. In order to confuse the discriminative network, the cloud regions to be translated should be as large as possible. The restoration tunnel aims to restore the translated cloud-free image back to a cloudy image, which is then compared with the original input cloud image. By minimizing the difference in global consistency between the restored image and the original image, the translated/restored regions should be as small as possible. This trade-off between large translated regions and small restored regions guides the attentive network to concentrate its attention on cloud regions automatically.

Expansion with translation network
In this phase, we put a cloud image c into the attention network A to detect cloud regions and get a rough attention map A(c). The generative network T is utilized to translate the cloud regions into cloud-free regions (T(c)), which are used to replace the regions where clouds were detected in A(c). As shown in Figure 3, the input image and T(c) are fused through a mask operation with A(c). The mask operation is as follows:
$$F_1(c) = A(c) \odot T(c) + \big(\mathbf{1} - A(c)\big) \odot c \tag{2}$$
where c = input cloud image
T(c) = background image generated by T
F_1(c) = fused cloud-free image
1 = a matrix of ones with the same size as A(c)
Then, the discriminative network (D) is used to assess the quality of the fused cloud-free image F_1(c) by measuring the feature difference between F_1(c) and a real cloud-free image (x). The real cloud-free image (x) is not necessarily from the same location, but should contain similar land covers. The loss function of D is:
$$L_D = \mathbb{E}_{x \sim d_n}\big[(D(x) - 1)^2\big] + \mathbb{E}_{c \sim d_c}\big[D(F_1(c))^2\big] \tag{3}$$
where d_c = distribution of cloud images
d_n = distribution of cloud-free images
The loss function of F_1, which concerns T and A, is:
$$L_{F_1} = \mathbb{E}_{c \sim d_c}\big[(D(F_1(c)) - 1)^2\big] \tag{4}$$
We adopt the least squares function as the loss function for D. With a cross-entropy loss, the generator would not further optimize generated images that D already recognizes as real, even if these images are still far away from the decision boundary of D, which means that the images translated by F_1 would not be of high quality. The least squares loss behaves differently: in order to minimize it while still confusing D, the generative network T pulls the generated images closer to the decision boundary, where confusion is more likely.
We assume that T(c) is a high-quality cloud-free image which can confuse D. So, if F_1(c) is to confuse D, the attention map A(c) produced by the attentive network A needs to be as large as possible.
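The mask operation and the least-squares losses of the translation tunnel can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fuse_cloud_free`, `lsgan_d_loss`, `lsgan_f1_loss`) are hypothetical, the networks A, T and D are replaced by plain arrays holding their outputs, the fusion is assumed to be element-wise, and the targets follow the label convention of the abstract (1 for cloud-free, 0 for cloudy).

```python
import numpy as np


def fuse_cloud_free(c, t_c, a_c):
    """Assumed mask operation: F_1(c) = A(c) * T(c) + (1 - A(c)) * c.

    c, t_c : arrays of shape (H, W, bands), the input image and T(c).
    a_c    : attention map A(c) of shape (H, W) with values in [0, 1].
    """
    a = a_c[..., np.newaxis]  # broadcast the attention map over the bands
    return a * t_c + (1.0 - a) * c


def lsgan_d_loss(d_real, d_fused):
    """Least-squares loss for D: real cloud-free images -> 1, fused -> 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fused ** 2)


def lsgan_f1_loss(d_fused):
    """Least-squares loss for F_1 (T and A): fused images should score 1."""
    return np.mean((d_fused - 1.0) ** 2)


rng = np.random.default_rng(0)
c = rng.random((4, 4, 3))      # stand-in cloudy patch
t_c = rng.random((4, 4, 3))    # stand-in output of T
a_c = np.zeros((4, 4))
a_c[:2, :2] = 1.0              # attention covers the top-left corner only

f1 = fuse_cloud_free(c, t_c, a_c)
# Attended pixels are taken from T(c); all other pixels are copied from c.
print(np.allclose(f1[:2, :2], t_c[:2, :2]), np.allclose(f1[2:], c[2:]))
```

With an all-zero attention map the fused image equals the input, so D can still recognize the clouds; this is why the adversarial loss pushes A(c) to grow.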

Reduction with restoration network
The adversarial loss alone can only guide A to expand the detected regions A(c), and T to generate a high-quality background image. In order to constrain A to focus only on cloud regions (which means that the translation tunnel only translates cloud regions and keeps background regions unchanged), another generative network R is used to restore the translated background into cloud (R(F_1(c))). As shown in Figure 3, the restoration process is also a mask operation:
$$F_2(c) = A(c) \odot \big(R(F_1(c)) + T(c)\big) + \big(\mathbf{1} - A(c)\big) \odot c \tag{5}$$
It can be seen that the restored image F_2(c) is fused from two components, R(F_1(c)) + T(c) and c, with A(c) as the fusion factor. We compare F_2(c) with the original input cloud image c to assess the effect of the restoration process as follows:
$$L_{F_2} = \mathbb{E}_{c \sim d_c}\big[\lVert F_2(c) - c \rVert_1\big] \tag{6}$$
We adopt the absolute loss as the loss function of F_2, which involves A and R, and call it the global cycle consistency loss. The absolute loss measures the absolute difference between F_2(c) and c and helps avoid image blur.
As shown in equations (5) and (6), to minimize L_{F_2}, A(c) should be as small as possible. For example, when A(c) = 0, F_2(c) = c and L_{F_2} = 0. Thus, the restoration tunnel constrains A to reduce the detected regions and guides R to restore the translated background of F_1(c) back to a cloud region.
The translation tunnel tries to expand the translated regions to produce high-quality cloud-free images that confuse D, while the restoration tunnel tries to reduce the restored regions to keep as much information of the original image as possible, so that the restored image F_2(c) and the original image c are globally consistent. We therefore train them together, and the loss functions of A, T and R are combined into the Auto-GAN loss as follows:
$$L_{AG} = L_{F_1} + \lambda L_{F_2} \tag{7}$$
with λ a weight parameter between the T and R loss functions. The value of L_{AG} is fed to A, T and R, which optimize their parameters in order to minimize it. Once A, T and R are well trained, A can detect clouds accurately, T can translate cloudy images into cloud-free images, and R can restore the cloud-free images back into cloudy images.
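The restoration step and the combined loss can be illustrated in the same spirit. The sketch below reads the text literally: F_2(c) is taken to be a fusion of R(F_1(c)) + T(c) and c weighted by A(c), and the cycle loss is taken as a mean absolute difference (a sum would differ only by a constant factor). The function names are hypothetical, the network outputs are stand-in arrays, and the default λ = 10 is simply the best-performing value reported in the results.

```python
import numpy as np


def restore_fuse(c, r_f1, t_c, a_c):
    """Assumed restoration: F_2(c) = A(c) * (R(F_1(c)) + T(c)) + (1 - A(c)) * c."""
    a = a_c[..., np.newaxis]  # broadcast the attention map over the bands
    return a * (r_f1 + t_c) + (1.0 - a) * c


def cycle_loss(f2, c):
    """Global cycle consistency loss as a mean absolute difference."""
    return np.mean(np.abs(f2 - c))


def auto_gan_loss(l_f1, l_f2, lam=10.0):
    """Combined loss: L_AG = L_F1 + lambda * L_F2."""
    return l_f1 + lam * l_f2


rng = np.random.default_rng(1)
c = rng.random((4, 4, 3))       # stand-in cloudy patch
t_c = rng.random((4, 4, 3))     # stand-in output of T
r_f1 = rng.random((4, 4, 3))    # stand-in output of R

# An empty attention map restores the input exactly, so the cycle loss is 0.
empty = np.zeros((4, 4))
print(cycle_loss(restore_fuse(c, r_f1, t_c, empty), c))

# A full attention map generally gives a larger loss, which pushes A(c) to shrink.
full = np.ones((4, 4))
print(cycle_loss(restore_fuse(c, r_f1, t_c, full), c))
```

The two printed values show the opposing pressure on A: the cycle loss favors small attention maps, whereas the adversarial term favors large ones.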

Optimization
In order to improve the detection accuracy of A, we consider both cloud-free images and images full of clouds. As shown in Figure 4, we put them into A to produce cloud attention maps. The proposed method thus introduces an additional optimization of A to make the best use of spectral information. According to the spectral information of the input images, the attention maps of cloud-free and fully cloudy images should be matrices of 0 and 1, respectively. The optimization function is as follows:
$$L_{opt} = \mathbb{E}_{x \sim d_n}\big[\lVert A(x) - \mathbf{0} \rVert_2^2\big] + \mathbb{E}_{f \sim d_f}\big[\lVert A(f) - \mathbf{1} \rVert_2^2\big] \tag{8}$$
where d_f = distribution of images full of cloud
0 = a matrix of zeros with the same size as A(x)
We adopt the least squares function, which can speed up network convergence, as the loss function of the optimization process. This loss function takes two extreme cases into consideration: cloud-free and full of cloud. By minimizing this loss, A learns the spectral information of clouds and of various background land covers to produce more accurate attention maps.
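As a small illustration of this optimization term, the sketch below penalizes the squared distance of the attention map from an all-zero matrix for cloud-free inputs and from an all-one matrix for fully cloudy inputs. The function name `attention_prior_loss` is hypothetical, the attention maps are stand-in arrays rather than outputs of a trained network, and the mean is used in place of the expectation.

```python
import numpy as np


def attention_prior_loss(a_cloud_free, a_full_cloud):
    """Least-squares penalty: A(x) -> 0 for cloud-free x, A(f) -> 1 for fully cloudy f."""
    return np.mean(a_cloud_free ** 2) + np.mean((a_full_cloud - 1.0) ** 2)


# Ideal attention maps give zero loss.
ideal = attention_prior_loss(np.zeros((256, 256)), np.ones((256, 256)))

# An undecided network (0.5 everywhere) is penalized on both terms.
undecided = attention_prior_loss(np.full((256, 256), 0.5), np.full((256, 256), 0.5))

print(ideal, undecided)
```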
In the proposed Auto-GAN method, T, R and A cooperate with each other and their parameters are updated together. We give the final loss function of T, R and A as follows:

Data Description
To demonstrate the effectiveness of Auto-GAN, Sentinel-2A imagery was selected as training and testing data. Sentinel-2A is a high-resolution multi-spectral imaging satellite that carries a multi-spectral imager (MSI) for land monitoring, covering 13 spectral bands in the visible, near infrared and shortwave infrared at spatial resolutions of 10 m, 20 m and 60 m. The true color composite of bands 2/3/4, at a spatial resolution of 10 m, was adopted as our experimental data.
The southeast of China was selected as our study area. This region contains many mountains, rivers and vegetated areas. Due to geographical factors, the economy there is more developed than in other areas of China, so there are also more buildings and concrete roads. Many of these land surfaces are difficult to separate from clouds, which makes it hard to detect clouds accurately. The land surface features in the selected images are very representative of southeast China. Twelve Sentinel-2A Level 1C images acquired from 1 May 2018 to 30 September 2018 were downloaded from the Copernicus Data Hub and cropped into 48 non-overlapping patches (40 for training and 8 for testing).

Experimental Setup
To build the training dataset, each training image patch was clipped using a sliding window of size 256 × 256. The resulting patches were manually classified into two subsets: a cloudy image set and a cloud-free image set. We also extracted patches from images full of clouds to optimize A. To make full use of the image information, we rotated these patches by 90°, 180° and 270° to augment the training samples. In this way, 122176 patches were obtained, of which 69424 were cloud-free, 44690 cloudy and 8062 fully cloudy. To avoid overfitting during training, the dropout rates of the attentive network, generative networks and discriminative network were set to 0.75, 1.0 and 1.0, respectively. The batch size was set to 1 and the number of epochs to 4 (330000 iterations) for the training of all methods.
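The clipping and rotation steps can be sketched as follows. This is only an illustration: the function names are hypothetical, and the window stride used for training is not stated in the text, so a non-overlapping stride of 256 pixels is assumed here.

```python
import numpy as np


def clip_patches(image, size=256, stride=256):
    """Clip an (H, W, bands) image into size x size patches.

    The stride is an assumption; the text only gives the window size.
    """
    h, w = image.shape[:2]
    return [
        image[top:top + size, left:left + size]
        for top in range(0, h - size + 1, stride)
        for left in range(0, w - size + 1, stride)
    ]


def augment_with_rotations(patch):
    """Return the patch together with its 90, 180 and 270 degree rotations."""
    return [np.rot90(patch, k=k, axes=(0, 1)) for k in range(4)]


# Small synthetic image standing in for a cropped Sentinel-2A training patch.
image = np.arange(512 * 768 * 3, dtype=np.float32).reshape(512, 768, 3)
patches = clip_patches(image)
augmented = [rot for p in patches for rot in augment_with_rotations(p)]
print(len(patches), len(augmented))
```

Each clipped patch yields four training samples (the original plus three rotations), which is consistent with the total patch count reported above being a multiple of four.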
In the experiments, we compared Auto-GAN with three baseline methods: Sen2cor, U-net and Deeplab-v3. Sen2cor is the processor that generates the Sentinel-2 Level 2A product (Main-Knorn et al., 2017) and provides the default Sentinel-2 cloud mask. U-net connects down-sampled feature maps with up-sampled feature maps to make full use of the image features in image segmentation (Ronneberger et al., 2017). Deeplab-v3 employs atrous spatial pyramid pooling (ASPP) and has shown state-of-the-art performance in image segmentation (Chen et al., 2017).
For all methods, we adopted the Adam optimizer to train the networks, with fixed parameters Beta1 = 0.9 and Beta2 = 0.999, an initial learning rate of 0.0002, and exponential learning rate decay with a decay rate of 0.96. Our training and validation experiments were both conducted with the TensorFlow platform on a Windows 7 operating system with 16 Intel(R) Xeon CPU E5-2620 v4 @ 2.10 GHz cores and an NVIDIA GeForce GTX 1080Ti with 11 GB of memory. For model training, the inputs of the proposed Auto-GAN method are images and the corresponding image-level labels (a single value indicating whether there are clouds in the image). The inputs of the baseline DL-based methods are images and the corresponding pixel-level labels (binary cloud masks).
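The exponential decay policy can be written out explicitly. The sketch below follows the usual form of exponential learning rate decay, lr = initial_lr * decay_rate ** (step / decay_steps). The initial learning rate and decay rate are taken from the text, while `decay_steps` and the `staircase` option are hypothetical parameters, since the decay interval is not reported.

```python
def exponential_decay(step, initial_lr=2e-4, decay_rate=0.96,
                      decay_steps=10_000, staircase=False):
    """Exponentially decayed learning rate at a given training step.

    decay_steps is an assumed value; the text only gives the decay rate.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent


# Learning rate at the start and at the reported final iteration.
for step in (0, 10_000, 330_000):
    print(step, exponential_decay(step))
```

With these assumed settings the learning rate shrinks by 4% every 10 000 iterations; a different `decay_steps` would change the schedule without changing its shape.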
To predict pixel-level labels for the testing images, we apply the well-trained attentive network (A) to the testing images with a sliding window of size 256 × 256. The window is moved across the testing image with a stride of 128 pixels, so that adjacent windows overlap and boundary effects are avoided. Since the outputs of Auto-GAN are attention values for clouds, whereas the outputs of the baseline methods are cloud probabilities, the thresholds used to obtain binary masks were set to 0.3 for Auto-GAN and 0.5 for the baseline methods.
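The prediction procedure can be sketched as below. The window size, stride and threshold come from the text; everything else is an assumption for illustration: `predict_fn` is a hypothetical stand-in for the trained attentive network A, overlapping predictions are merged by averaging (the text does not say how they are combined), and an extra window is placed at the image border when the image size is not a multiple of the stride.

```python
import numpy as np


def window_starts(length, size=256, stride=128):
    """Start offsets of sliding windows covering [0, length)."""
    if length <= size:
        return [0]
    starts = list(range(0, length - size + 1, stride))
    if starts[-1] != length - size:
        starts.append(length - size)  # extra window flush with the border
    return starts


def predict_mask(image, predict_fn, size=256, stride=128, threshold=0.3):
    """Average overlapping window predictions, then binarize."""
    h, w = image.shape[:2]
    acc = np.zeros((h, w), dtype=np.float64)   # summed attention values
    cnt = np.zeros((h, w), dtype=np.float64)   # number of windows per pixel
    for top in window_starts(h, size, stride):
        for left in window_starts(w, size, stride):
            tile = image[top:top + size, left:left + size]
            acc[top:top + size, left:left + size] += predict_fn(tile)
            cnt[top:top + size, left:left + size] += 1.0
    return (acc / cnt) >= threshold


# Toy example: a bright rectangle plays the role of a cloud, and the
# stand-in "network" simply returns the first band as the attention map.
image = np.zeros((600, 512, 1))
image[100:300, 50:400, 0] = 1.0
mask = predict_mask(image, lambda tile: tile[..., 0])
print(mask.shape, int(mask.sum()))
```

Averaging is one simple way to merge overlapping windows; taking the maximum, or keeping only the central part of each window, would be equally compatible with the description.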
To quantitatively assess the performance of the proposed Auto-GAN and the baseline methods, the binary masks are compared with manually labeled ground truths, using the following measures:
$$OA = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative pixels, with cloud as the positive class. OA (overall accuracy) is the proportion of all pixels, cloudy and cloud-free, that are classified correctly. Precision is the proportion of pixels detected as cloud by the method that are truly cloud. Recall is the proportion of true cloud pixels that are detected by the method. The F1-score is the harmonic mean of Precision and Recall. The higher the values of these indices, the better the performance of the cloud detection method.
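These measures can be computed directly from a predicted mask and a reference mask. The sketch below assumes the standard confusion-matrix definitions with cloud as the positive class; the function name `cloud_metrics` is hypothetical, and the code assumes that both masks contain at least one cloud pixel so that no denominator is zero.

```python
import numpy as np


def cloud_metrics(pred, truth):
    """OA, Precision, Recall and F1 for binary cloud masks (cloud = True)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))      # cloud detected as cloud
    tn = int(np.sum(~pred & ~truth))    # background detected as background
    fp = int(np.sum(pred & ~truth))     # background detected as cloud
    fn = int(np.sum(~pred & truth))     # cloud missed by the method
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "OA": (tp + tn) / (tp + tn + fp + fn),
        "Precision": precision,
        "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
    }


pred = [[1, 1], [0, 0]]
truth = [[1, 0], [1, 0]]
print(cloud_metrics(pred, truth))
```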

Results Analysis
The details of visual results are shown in Figure 5. The cloud detection performance of Auto-GAN is tested over different underlying surfaces of buildings, bare land, vegetation and water. In the visual results, white pixels represent cloud and black pixels represent background.
It can be seen that thick clouds are easily detected by all methods. However, the results of Sen2cor most of the time include highlight areas and miss part of the clouds. The reason is that Sen2cor mainly uses the spectral information of the image, so it is not sensitive to the shape of objects or to the boundaries of clouds. U-Net and Deeplab-v3 can detect most of the clouds, but their results still contain a few highlights, especially those of U-Net. None of the baseline methods distinguishes well between highlights and clouds. In contrast, the proposed Auto-GAN method consistently distinguishes clouds from highlights, and its cloud detection results are very similar to the reference.
In the proposed Auto-GAN method, the weight λ of the global cycle consistency loss is the only hyper-parameter in the L_{AG} loss function. Fig. 6 compares the overall accuracy of Auto-GAN on the test data for three values of the hyper-parameter (λ = 1, 5, 10) and different numbers of training iterations. It can be seen that in the first 120000 iterations, the greater the value of λ, the greater the value of OA. OA is higher with λ = 10 than with λ = 1 or λ = 5. The reason is that when λ increases, the generative networks are better trained, which benefits the training of the attentive network. The values of OA, Precision, Recall and F1-score under different values of λ are shown in Table 2. It can be seen that Auto-GAN performs best with λ = 10.
For training, the baseline methods require images and corresponding pixel-level labels (binary masks), while Auto-GAN only uses images and corresponding image-level labels (whether the image contains clouds or not). On average, it takes more than 20 hours (1200 minutes) for a human operator to annotate pixel-level labels for a Sentinel-2A image with size 10980 × 10980 pixels. However, it only takes about 12 minutes (100 times faster) on average to annotate image-level labels for all patches cropped from the same Sentinel-2A image by the same person. So, the proposed Auto-GAN can be applied to other satellite images very quickly and efficiently.
For the translation from a cloudy image to cloud-free image, the translation results of the images containing small and large clouds are shown in Figs. 7 and 8 (in Appendix), respectively. It can be seen that the translation tunnel performs better on small clouds than large clouds. This is because with smaller cloud regions, the translation tunnel can get more information from background regions to translate the clouds into background. On the contrary, there is not enough background information in images with large cloud regions for translating the clouds into background.

CONCLUSION
In this work, a novel GAN-based method was proposed for cloud detection in high-resolution remote sensing images. In particular, the proposed Auto-GAN framework outputs pixel-level labels while using only image-level labels in training, which reduces the time needed to annotate training samples by a factor of 100 compared to manually annotating pixel-level labels. Furthermore, Auto-GAN uses three complementary tunnels to simultaneously detect cloud regions and translate images. Injecting the attention mechanism is beneficial for generating high-quality images, which in turn improves the performance of the attention tunnel. Auto-GAN was trained and tested on Sentinel-2A images over China. Only the well-trained attentive network is used to predict cloud regions in the testing images. Auto-GAN outperforms all baseline methods on overall accuracy, precision and F1-score, and all but one method on recall (where it marginally under-performs). Both visual and quantitative analyses of the experimental results demonstrate that the Auto-GAN framework is very effective for cloud detection in remote sensing images.
In the future, we will focus on the following works: 1) reducing the computing time by designing simpler and more efficient networks; 2) making a dataset of bright buildings and testing the performance of our method on cloud detection on these surfaces; 3) further reducing human labor in labelling training data by training a binary classifier to annotate image-level labels of input images automatically.