LEARNING FROM NOISY SAMPLES FOR MAN-MADE IMPERVIOUS SURFACE MAPPING

Man-made impervious surfaces, which indicate the human footprint on Earth, are an environmental concern because they trigger a chain of events that modifies urban air and water resources. To better map man-made impervious surfaces in any region of interest (ROI), we propose a framework for learning to map impervious areas from Sentinel-2 images with noisy reference data, using a pre-trained fully convolutional network (FCN). The FCN is first trained with reference data available only in Europe, yet it provides reasonable mapping results even in areas outside Europe. The proposed framework, which aims to improve on these preliminary predictions for a specific ROI, consists of two steps: pre-processing of the noisy training data and model fine-tuning with robust loss functions. The framework is validated over four test areas located on different continents, with a measurable improvement over several baseline results. The results show that better impervious surface mapping can be achieved through simple fine-tuning with noisy training data, and that label updating through robust loss functions further enhances performance. In addition, a comparison of the mapping results with the baselines shows that the improvement mainly comes from a decreased omission error. This study can also provide insights for similar tasks, such as large-scale land cover/land use classification when accurate reference data are not available for training.


INTRODUCTION
The global man-made impervious surface, consisting of buildings, roads, and other man-made structures, is a crucial indicator of the human footprint on Earth. It enables evidence-based decision making for various applications and challenges such as climate change, disaster management, and sustainable development (Pesaresi et al., 2016). Like any other classification task, impervious area mapping can be performed using deep learning (DL) approaches, which have proved to be very powerful tools (Zhu et al., 2017, Lang et al., 2019). Despite the success of deep neural networks in the remote sensing (RS) field for various supervised learning tasks such as land cover classification and change detection, their superior performance depends heavily on the availability of massive training data with accurate annotations. Without such data, the performance of DL inevitably suffers because deep neural networks can overfit to the noise in the training data (Zhang et al., 2016). Even though DL has been shown to be robust to non-adversarial label noise, the required amount of clean data increases with the number of noisy labels (Rolnick et al., 2017). This aspect is crucial for RS, because collecting reliable training labels for extended areas or even on a global scale is a costly and error-prone task. On the other hand, a large number of geospatial data products are available from previous efforts. In the case of built-up area/human settlement/impervious area mapping, examples include the Global Urban Footprint (GUF) (Esch et al., 2012, Esch et al., 2013), the Global Human Settlement Layer (GHSL) (Corbane et al., 2017), the GlobeLand30 land cover map (Chen et al., 2017), and the Global Human Built-up And Settlement Extent (HBASE) Dataset (Wang et al., 2017).
To fully exploit such datasets for training, however, one has to take into account the errors they, as predictions of machine learning approaches, may contain, due to temporal gaps or inaccuracy in the original processing chain. In addition, the classes in the reference data might have a slightly different definition from those to be considered for the task at hand. Therefore, how to robustly learn a superior model from potentially noisy reference data is a problem of great importance, especially in deep learning applied to remote sensing.
The idea of learning from noisy samples is based on the desire to better exploit the reliable samples of the training set, while being less impacted by the unreliable ones. For this purpose, it is critical to distinguish the reliable or clean samples from the others (Chen et al., 2019, Han et al., 2018, Northcutt et al., 2017).
A first way to deal with this problem is to resort to different forms of regularization that avoid overfitting to noisy labels (Damodaran et al., 2019, Han et al., 2018, Ma et al., 2018). Another approach is to explicitly or implicitly model the noise using a noise transition matrix that is either based on prior knowledge or learned using, e.g., directed graphical models or conditional random fields (Patrini et al., 2017, Sukhbaatar, Fergus, 2014). A third idea is to use noise-tolerant loss functions such as mean square error and mean absolute error, which theoretically guarantee good results according to statistical learning principles (Zhang, Sabuncu, 2018, Reed et al., 2014). Alternatively, inaccurate labels can be corrected and updated during the iterative training process with a bootstrapping scheme. This can be carried out either by alternately updating network parameters and correcting labels during the training process (Reed et al., 2014, Tanaka et al., 2018) or by means of an independent prediction step based on selected reliable samples.
Inspired by this related work on learning from noisy samples, we propose a framework that exploits noisy training samples from existing maps (called reference data hereafter, to distinguish them from actual ground truth) to improve baseline mapping results achieved by means of attention-based FCNs. In this framework, robust loss functions are used for fine-tuning the pre-trained models, so that the original reference labels can be modified during fine-tuning. In this way, the adverse effect of incorrect labels can be alleviated.
The remainder of this study is organized as follows. Section 2 describes the proposed framework, including the general idea and the details of the specific implementation used in this study. Section 3 introduces the experimental design as well as the data and experimental setup. The results of the different approaches are compared with each other and with the state of the art in Sections 3.2 and 3.3, where an interpretation and discussion of the outcome of the experiments is also presented. Finally, Section 4 summarizes the main findings and contributions of this study.

A FRAMEWORK FOR LEARNING FROM NOISY REFERENCE DATA
The proposed framework is illustrated in Fig. 1, where the focus of this study is highlighted in color. The main part is the improvement of the preliminary mapping results for specific areas via fine-tuning of the pre-trained model. For this step we propose robust loss functions, because the employed reference data are not fully accurate with respect to our target task. Instead of the default loss function for classification problems, i.e. binary cross-entropy, we propose to use two kinds of robust loss functions, which are able to account for noise in the reference labels during training. To compare the performance of different approaches, we chose to map test areas that are far from the training sites and, as a result, might be subject to domain shift.
The network architecture for the preliminary prediction and the subsequent fine-tuning, our choice and preparation of the reference data used for fine-tuning, and the employed robust loss functions are presented in detail in the following sections.

Attention-based FCN for MIS mapping
The adapted attention-based FCN-ResNet architecture is illustrated in Fig. 2. The main part is a ResNet-based FCN-8s, which is chosen to capture spectral and spatial local features by multi-scale feature fusion (Long et al., 2015). In addition to this ResNet-based FCN-8s, we employ two attention modules: a position attention module (PAM) and a channel attention module (CAM). PAM and CAM, applied in parallel, together form the so-called dual attention module. On the one hand, these attention modules act like a "feature selection" process in which salient features are assigned larger weights. On the other hand, they are assumed to learn additional features by taking into account long-range contextual information over both the channel dimension and the spatial dimension (Fu et al., 2018). In this way, the feature representation can be further improved and enhanced. The sum of the output features of the two modules is subsequently exploited for the prediction of the human settlement extent (HSE). Additionally, this architecture is adapted for HSE mapping by outputting a final prediction at half the size of the input images, instead of the same size, as illustrated in Fig. 2. This is realized by removing one up-sampling layer from the original FCN-8s. The architecture thus produces an HSE classification map with a GSD of 20 m, which matches the GSD of the reference data used during the preliminary training.

Reference data preparation for fine-tuning
When collecting ground truth data for model training is not feasible or not desired, data from existing maps (the so-called reference data) can be used instead. To this end, this study uses a combination of points extracted from the GUF map and from the map obtained by a pre-trained model. Specifically, if the label at a location is non-built-up according to GUF and non-urban according to the pre-trained DL model, that location is added to the non-impervious training set. In all other cases, it is added to the impervious training set. This decision is based on the observation that GUF is strong at detecting sparsely built-up areas in villages or suburban areas, while the predictions from the DL model include roads and other impervious surfaces (Esch et al., 2017, Qiu et al., 2020).
Considering the errors in the original data, the resulting labels certainly contain some noise. Accordingly, the use of these data to refine DL predictions for a specific test area should be performed carefully. This is the reason why robust loss functions have been applied to the fine-tuning of the pre-trained models.
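The combination rule described above amounts to a pixel-wise logical OR of the two binary maps. The following is a minimal sketch under stated assumptions: both maps are already co-registered on the same grid and encoded as 0/1 arrays, and the function name `build_noisy_reference` as well as the toy values are hypothetical, not taken from the study.

```python
import numpy as np


def build_noisy_reference(guf_built_up, dl_impervious):
    """Combine two binary maps into noisy training labels.

    A pixel is labeled non-impervious (0) only if BOTH sources agree that it
    is not built-up / not urban; every other pixel is labeled impervious (1).
    """
    guf = np.asarray(guf_built_up, dtype=bool)
    dl = np.asarray(dl_impervious, dtype=bool)
    if guf.shape != dl.shape:
        raise ValueError("both maps must be co-registered on the same grid")
    return np.logical_or(guf, dl).astype(np.uint8)


# Toy 2x3 example with hypothetical values (not real GUF or model output).
guf_map = np.array([[1, 0, 0],
                    [0, 0, 1]])
dl_map = np.array([[0, 0, 1],
                   [0, 1, 1]])
reference = build_noisy_reference(guf_map, dl_map)
print(reference)
```

Because a pixel becomes impervious as soon as either source says so, this rule favors low omission at the cost of possible commission errors in the reference labels.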

Model Fine-tuning via Robust Loss Functions
CNN training is based on updating the network weights to minimize a loss function that expresses the divergence between the model predictions and the reference labels. If the labels are noisy, the weight updates can be sub-optimal, hindering model convergence or, even worse, leading to overfitting to the noisy input data. Loss functions that are robust against label noise are helpful because they rely less on the labels in the reference data. In this study, we propose to use two kinds of robust loss functions that are both modifications of the categorical cross-entropy (CCE) loss commonly used for classification. The original CCE loss is defined as
$$L_{CCE} = -\sum_{k=1}^{K} y_k \log(\hat{y}_k),$$
where $y_k$ is the k-th element of the target label represented by a one-hot encoded vector, $\hat{y}_k$ is the k-th element of the predicted class probabilities, and $K$ is the number of classes. The case of impervious area mapping is a binary classification problem ($K = 2$, $y_2 = 1 - y_1$, $\hat{y}_2 = 1 - \hat{y}_1$), so the loss is simplified into binary cross-entropy (BCE):
$$L_{BCE} = -\left[ y_1 \log(\hat{y}_1) + (1 - y_1) \log(1 - \hat{y}_1) \right].$$
Instead of directly using the original labels, the $L_{soft}$ loss function dynamically changes the target labels based on the current state of the model:
$$L_{soft} = -\sum_{k=1}^{K} \left[ \beta y_k + (1 - \beta) \hat{y}_k \right] \log(\hat{y}_k),$$
where $\beta$ is a parameter to be selected according to the confidence in the reference data. Specifically, instead of directly using the label in the reference data for loss calculation, $L_{soft}$ updates the label by combining it with the current prediction from the model. In this way, both the predictions and the reference data are used for loss calculation, a procedure referred to as soft bootstrapping in (Reed et al., 2014). As a result, the potentially noisy samples are relied on less heavily during fine-tuning.
Figure 2. Attention-based FCN-ResNet architecture. "H", "W", and "C" denote height, width, and the channel number of the feature maps, respectively. The size of the final prediction is half of the input patch, which is decided based on the GSD of the used reference data during the preliminary training.
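To make the effect of soft bootstrapping concrete, the sketch below evaluates BCE and the soft bootstrapping loss per pixel in the binary case. It is an illustrative NumPy version, not the training code used in the study: the function names, the clipping constant `EPS`, and the example probabilities are assumptions introduced here.

```python
import numpy as np

EPS = 1e-7  # clipping constant to avoid log(0); an implementation choice


def bce_loss(y, y_hat):
    """Per-pixel binary cross-entropy for label y and predicted probability y_hat."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(y_hat, dtype=float), EPS, 1.0 - EPS)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def soft_bootstrap_loss(y, y_hat, beta):
    """Per-pixel soft bootstrapping loss in the binary case.

    The target is a blend of the reference label (weight beta) and the
    current prediction (weight 1 - beta); beta = 1 recovers plain BCE.
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(y_hat, dtype=float), EPS, 1.0 - EPS)
    target = beta * y + (1.0 - beta) * p
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))


# Hypothetical pixel: the reference says "pervious" (0) while the model is
# fairly confident that it is impervious (0.9), e.g. a possibly wrong label.
plain = bce_loss(0, 0.9)
soft = soft_bootstrap_loss(0, 0.9, beta=0.7)
print(float(plain), float(soft))
```

For such a disagreeing pixel the soft loss is smaller than BCE, which is how a possibly incorrect reference label is down-weighted during fine-tuning.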
A second option is the $L_q$ loss function, a generalization of CCE and mean absolute error (MAE) proposed in (Zhang, Sabuncu, 2018):
$$L_q = \frac{1 - \left( \sum_{k=1}^{K} y_k \hat{y}_k \right)^{q}}{q},$$
where $q \in (0, 1]$ is a hyper-parameter that controls how CCE and MAE are combined: $L_q$ is equivalent to CCE when $q \to 0$, and becomes MAE (up to a constant factor) when $q = 1$. The idea is to take advantage of the benefits of both CCE and MAE, as successfully shown in (Fonseca et al., 2019).
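The two limiting cases of the generalized loss can be checked numerically. The sketch below is a NumPy illustration under stated assumptions: labels are one-hot vectors, the function names and the example probabilities are hypothetical, and the clipping constant is an implementation choice.

```python
import numpy as np


def cce_loss(y_onehot, y_prob):
    """Categorical cross-entropy for one-hot labels."""
    y = np.asarray(y_onehot, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), 1e-7, 1.0)
    return -np.sum(y * np.log(p), axis=-1)


def lq_loss(y_onehot, y_prob, q):
    """Generalized cross-entropy (L_q) loss for one-hot labels, 0 < q <= 1."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    y = np.asarray(y_onehot, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), 1e-7, 1.0)
    p_true = np.sum(y * p, axis=-1)  # probability given to the labeled class
    return (1.0 - p_true ** q) / q


label = np.array([0.0, 1.0])   # hypothetical "impervious" label
probs = np.array([0.2, 0.8])   # hypothetical model output

near_cce = lq_loss(label, probs, q=1e-6)  # approaches CCE as q -> 0
mae_like = lq_loss(label, probs, q=1.0)   # equals 1 - p_true at q = 1
print(float(near_cce), float(cce_loss(label, probs)), float(mae_like))
```

Intermediate values of q trade the fast convergence of CCE against the noise tolerance of MAE.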

EXPERIMENTAL RESULTS AND DISCUSSION
To validate the proposed framework, a number of experiments with Sentinel-2 images have been performed.

Experimental Setup
The original reference data for the preliminary training, with a GSD of 20 m (Langanke, 2016), come from five European scenes: Berlin, Lisbon, Madrid, Milan, and Paris, as indicated in Fig. 1. The test areas have been selected across the world to better assess the potential of the framework. Specifically, four sites were selected for testing: Beijing, Jakarta, Nairobi, and Tehran. For each test scene, checking points for accuracy assessment (manually labeled grid-based checking points, MLGCPs) were prepared, as presented in Fig. 3. Only the Sentinel-2 bands with 10 and 20 meter spatial resolution were considered. More details on the pre-processing of the reference and Sentinel-2 data, and on the MLGCPs, can be found in the previous work (Qiu et al., 2020).
In the first stage, the input images and their corresponding reference labels (from the five European scenes) are used to train the network with the Nesterov Adam optimizer implementation of Keras (Chollet et al., 2015). We use a mini-batch size of 8 images and a learning rate of $2 \times 10^{-3}$. To control the training time and avoid overfitting, early stopping is used; the monitored metric is the validation loss, with a patience of 10 epochs. After the pre-trained model is obtained from the first stage, a preliminary prediction can be obtained by feeding the images to the model. Subsequently, fine-tuning is carried out for each ROI for 10 epochs. With the fine-tuned models, the final predictions can be obtained.
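The early-stopping rule mentioned above can be illustrated with a few lines of plain Python. This is a simplified sketch of the rule itself (stop once the validation loss has not improved for `patience` consecutive epochs), not the Keras implementation; in Keras the same behavior is typically obtained with the `EarlyStopping` callback. The function name and the loss values are hypothetical.

```python
def epochs_until_stop(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop.

    Training stops as soon as the validation loss has failed to improve on
    its best value for `patience` consecutive epochs. If that never happens,
    the full number of epochs is returned.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1              # no improvement in this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation curve: improves for 3 epochs, then plateaus.
history = [1.0, 0.8, 0.7] + [0.75] * 20
stop_epoch = epochs_until_stop(history, patience=10)
print(stop_epoch)
```

With a patience of 10, the plateau starting after epoch 3 ends training ten epochs later, which bounds the training time without fixing the number of epochs in advance.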

Accuracy Assessment
The accuracy of the mapping results is presented in Tab. 1 and 2, and is based on the independent MLGCPs. To show the improvement achieved in this study, we also report some baseline maps, including the GUF layer, the maps obtained by DL without fine-tuning, as well as the results after fine-tuning without considering the sample noise. Four parameter values are tested for each of the $L_{soft}$ loss ($\beta$) and the $L_q$ loss ($q$). Tab. 1 and 2 show an improvement obtained by the default fine-tuning (using BCE), with the average Kappa value increasing from 0.73 to 0.75, and the average omission error decreasing from 14.3% to 8.8%. Additionally, there is a further improvement from default fine-tuning to fine-tuning with robust loss functions, with the average Kappa value increasing to 0.78, and the average omission error decreasing to 6.7%. Finally, among the employed robust loss functions, $L_{soft}$ is generally better than $L_q$, and $L_{soft}$ with $\beta = 0.7$ provides the best results. Accordingly, this best setup has been used in all the tests reported in the following.
Please note that all these improvements are consistent among the four test areas. By comparing OA and Kappa values, as well as omission and commission errors, it is clear that the improvement mainly comes from a decreased omission error, while the commission error may increase. One possible explanation is that the above-mentioned pre-processing procedure is prone to introducing commission errors. As a result, areas containing even a small proportion of buildings or man-made structures tend to be recognized as impervious areas after the fine-tuning process.
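For readers who want to reproduce such figures from their own checking points, the sketch below computes OA, Kappa, and the omission and commission errors from binary labels. It assumes the standard definitions for the impervious class (omission = missed impervious points over all true impervious points, commission = falsely mapped points over all points mapped as impervious); the function name and the toy labels are hypothetical and are not the MLGCPs of this study.

```python
import numpy as np


def binary_map_accuracy(y_true, y_pred):
    """OA, Cohen's Kappa, omission and commission error for the positive class.

    Assumes both classes occur in y_true and that at least one point is
    predicted positive; otherwise a division by zero is raised.
    """
    t = np.asarray(y_true, dtype=bool).ravel()
    p = np.asarray(y_pred, dtype=bool).ravel()
    tp = int(np.sum(t & p))
    fn = int(np.sum(t & ~p))
    fp = int(np.sum(~t & p))
    tn = int(np.sum(~t & ~p))
    n = tp + fn + fp + tn
    oa = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return {
        "oa": oa,
        "kappa": (oa - pe) / (1.0 - pe),
        "omission": fn / (tp + fn),
        "commission": fp / (tp + fp),
    }


# Toy checking points (1 = impervious, 0 = pervious), hypothetical values.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_map_accuracy(truth, pred)
print(metrics)
```

A drop in omission error accompanied by a rise in commission error, as reported above, corresponds to false negatives being traded for false positives in this confusion matrix.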

Discussion
To better understand the results, in this section we visualize the achieved improvements by means of colored maps. We first compare the results to baseline maps by looking at the classification of the MLGCPs, and then present a detailed comparison of mapping results on a city scale as well as in zoomed-in areas. Figure 4 presents the comparisons with test areas in Nairobi as an example, representing pervious points as "*" and impervious points as "+". The mapping result after fine-tuning overlaid with MLGCPs is presented on the top, followed by Sentinel-2 images overlaid with MLGCPs, where false positives and false negatives, corresponding to commission and omission errors of the different approaches, are colored in green and red, respectively.

Analysis of the causes for omission and commission errors
It is clear from Fig. 4 that the main urban areas as well as some big roads are correctly mapped. Compared with the results from the baseline approaches, the decreased omission error is visible as a decreased number of red crosses. Additionally, it can be seen that the remaining mistakes of the fine-tuning approach, both false positives and false negatives, occur along roads, in sparsely built-up areas such as small villages, and at the urban-rural fringe. In all these areas it is difficult to discriminate between built-up and non-built-up areas using images with 10 meter spatial resolution. This can also be partly explained by the fact that the remaining false negatives (omission errors, red crosses) are missing in the GUF layer, too (Esch et al., 2017). Note that roads and other non-vertical impervious surfaces are not included in the GUF layer.
Spatial analysis of the improvements
Figure 5 visualizes the difference between the final result of the procedure presented in this work and some baseline maps, i.e. the results before fine-tuning (Pre), the GUF layer, and the used reference (Ref), using the whole city of Nairobi as an example. The figure shows additionally mapped areas in green as well as those areas that are eventually removed in red. While most of the map remains the same, there is a clear difference between the mapping results after robust-loss-based fine-tuning and those from other approaches. Compared to Pre, there are clearly more impervious areas, which is consistent with the decreased omission error observed in Tab. 2. Compared to GUF, some roads are added as impervious areas, which again is expected according to the experimental setup as well as the definitions of the GUF layer and the mapped impervious surface in this study. Compared to the chosen reference, there is mainly a removal of impervious areas, which is expected as pervious areas might be mistaken as impervious ones in the reference preparation procedure, and these mistakes are corrected during fine-tuning, even using robust loss functions.
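A difference map of this kind can be derived from two binary maps with two logical operations. The sketch below is an illustration only, assuming co-registered 0/1 arrays; the function name and the toy values are hypothetical.

```python
import numpy as np


def map_difference(final_map, baseline_map):
    """Return boolean masks of added and removed impervious pixels.

    added:   impervious in the final map but not in the baseline (green)
    removed: impervious in the baseline but not in the final map (red)
    """
    final = np.asarray(final_map, dtype=bool)
    base = np.asarray(baseline_map, dtype=bool)
    if final.shape != base.shape:
        raise ValueError("both maps must share the same grid")
    added = final & ~base
    removed = ~final & base
    return added, removed


# Toy 2x3 maps with hypothetical values.
final_result = np.array([[1, 1, 0],
                         [0, 1, 0]])
baseline = np.array([[1, 0, 0],
                     [1, 1, 0]])
added, removed = map_difference(final_result, baseline)
print(int(added.sum()), int(removed.sum()))
```

Pixels that are in neither mask are unchanged between the two maps, which is the case for most of the scene in such comparisons.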
A more detailed analysis is provided in Fig. 6, where three zoomed-in areas for Beijing, Jakarta, and Nairobi, respectively, are shown. Some correctly removed and added areas in the mapping results with respect to Pre, GUF, as well as Ref are clearly visible.
Finally, Figure 7 compares the impervious areas mapped by the different approaches with those in the original baseline maps for subset areas of Nairobi. In line with previous observations, the improvement mainly results from a decreased omission error. Possible explanations for the remaining mapping mistakes, which are mainly commission errors, include the small amount of data available for model fine-tuning, as well as the geolocation accuracy of the Sentinel-2 images, which is on the order of one pixel (Drusch et al., 2012).

CONCLUSIONS AND OUTLOOK
Man-made impervious surface (MIS) mapping provides key information in support of the development of local governments and world-wide collaborations to address issues such as climate change and air pollution. To better map man-made impervious surfaces on a large scale without relying on massive amounts of training data, and more specifically to decrease the omission errors for impervious area mapping, this paper proposes a framework to exploit possibly noisy reference data that are already globally available. The main idea is to fine-tune pre-trained DL models using available reference data in a specific ROI. To this end, robust loss functions are used to mitigate the effect of potential errors in the reference data. The framework is validated using four test areas across the world, and improvements over baseline results have been obtained. Future research will aim at an improved version of the proposed framework, for better understanding and practical application, e.g., by further investigating the influence of different FCN architectures and by comprehensively testing the potential of this approach at more test sites.
Figure 6. Comparison of the produced result to three baseline maps in three subsets of Beijing, Nairobi, and Jakarta, respectively. The color legend is the same as in Fig. 5.