URBAN CHANGE DETECTION BASED ON SEMANTIC SEGMENTATION AND FULLY CONVOLUTIONAL LSTM NETWORKS

ABSTRACT: Change detection is a very important problem for the remote sensing community. Among the several approaches proposed in recent years, deep learning provides methods and tools that achieve state-of-the-art performance. In this paper, we tackle the problem of urban change detection by constructing a fully convolutional multi-task deep architecture. We present a framework based on the UNet model, with fully convolutional LSTM blocks integrated on top of every encoding level, capturing in this way the temporal dynamics of spatial feature representations at different resolution levels. The proposed network is modular due to shared weights, which allow the exploitation of multiple (more than two) dates simultaneously. Moreover, our framework provides building segmentation maps by employing a multi-task scheme which extracts additional feature attributes that can reduce the number of false positive pixels. We performed extensive experiments comparing our method with other state-of-the-art approaches using very high resolution images of urban areas. Quantitative and qualitative results reveal the great potential of the proposed scheme, with the F1 score outperforming the other compared methods by almost 2.2%.


INTRODUCTION
Urban change detection is one of the most studied topics in remote sensing, since it provides useful insights concerning cities' growth patterns and future tendencies. Low air quality, water contamination and limited green spaces are only some of the environmental issues that arise from continuous urban growth. Moreover, many other social problems, like poverty and increased crime rates, can arise from expanding urban areas. It is therefore reasonable to understand and study such expansion trends thoroughly at different spatial scales, so as to create better city infrastructure and prevent situations that can be extremely dangerous both for the environment and for human populations.
During the last decades, the high availability of earth observation data has enabled the remote sensing community to collect multimodal, multitemporal satellite images, laying in this way the foundation for constructive research studies. To this day, manual change detection approaches have been replaced with automatic supervised and unsupervised algorithms such as graphical models and Markov Random Fields (Singh et al., 2014, Benedek et al., 2015, Vakalopoulou et al., 2016, Vakalopoulou et al., 2015, Karantzalos, 2015), kernels (Volpi et al., 2012), as well as Principal Component Analysis (Li, Yeh, 1998, Deng et al., 2008). With the advent of neural networks, recent works are more and more oriented towards deep learning approaches, producing state-of-the-art results and setting promising prospects for the urban change detection task. In (Caye Daudt et al., 2018b), a patch-based framework is suggested where two different architectures (Siamese and Early Fusion) are examined using the bi-temporal Onera Satellite Change Detection dataset. (Caye Daudt et al., 2018a) then transforms these approaches into fully convolutional versions based on a UNet-like scheme. Multi-task learning methods involving supplementary tasks, mainly semantic segmentation (Liu et al., 2019, Daudt et al., 2018), have also been employed, since they can greatly benefit the training procedure by providing additional fruitful feature representations. Furthermore, since the problem of change detection involves sequential data, the need to calculate temporal dynamics emerges, leading to the employment of Recurrent Neural Networks (Hopfield, 1982, Rumelhart et al., 1986). Such models have been largely employed by the computer vision community on a wide variety of applications like tracking (Milan et al., 2016), action recognition (Singh et al., 2016), etc.
Long Short Term Memory Networks (LSTMs) (Hochreiter, Schmidhuber, 1997) are also widely adopted for such tasks (Byeon et al., 2015, Ehsani et al., 2018), since they moderate the vanishing gradient problem (Hochreiter, 1998) when dealing with long-term dependencies. As far as remote sensing is concerned, (Mou et al., 2019) incorporates a recurrent network on top of a convolutional architecture, combining in this way spectral, spatial and temporal information in an end-to-end framework. Moreover, fully convolutional LSTMs have been utilized in (Papadomanolaki et al., 2019), where recurrent blocks are integrated into every encoding level of a UNet-like architecture (Ronneberger et al., 2015), thus capturing temporal relationships at different resolutions in a fully convolutional manner. In this way, pixel-level maps of changed areas can be provided and analysed.
In this paper, we tackle the problem of building change detection for very high resolution satellite images by further evolving the already existing framework in (Papadomanolaki et al., 2019). More specifically, the proposed architecture is enriched with an additional decoding branch that performs building semantic segmentation, providing the network with ancillary feature attributes during the training process. The attained quantitative and qualitative results indicate the great potential of the suggested scheme which is also compared with state of the art fully convolutional approaches, namely fully convolutional Early Fusion (FC-EF), Siamese Concatenation (FC-Siam-Conc) and Siamese Difference (FC-Siam-Diff) (Caye Daudt et al., 2018a).
The remainder of the paper is organized as follows. In Section 2, we describe the proposed fully convolutional, multi-task framework as well as the employed dataset. In Section 3 we present and discuss the experimental results, while in Section 4 we draw conclusions and examine potential future directions.

Multi-task L-UNet
The proposed scheme is based on the widely known UNet architecture (Ronneberger et al., 2015), consisting of one encoding branch with $n$ levels and two decoding branches. Firstly, $D$ input image volumes in the form of (Batchsize x Channels x Height x Width) are passed to the model, where $D$ stands for the employed number of dates. Each of the $D$ images is processed separately by the encoding layers using shared convolutional weights. More specifically, every encoding level $E_i$ with $i \in \{1, 2, .., n\}$ produces spatial feature vectors $X_i^t$ for $t \in \{1, 2, .., D\}$. These feature vectors are then fed to a Long Short Term Memory (LSTM) block which is added as a skip connection on top of every encoding level, determining the temporal attributes using a gating mechanism (Hochreiter, Schmidhuber, 1997). Here, instead of multiplying the spatial feature vectors $X_i^t$ with high dimensional weight matrices, we perform convolution operations in order to calculate the hidden and cell states. In this way, any gating process is configured as

$$G_i^t = \Phi(W_{G_i} * [X_i^t, H_i^{t-1}])$$

where $G_i^t$ is the general form for each of the forget ($f_i^t$), input ($in_i^t$), output ($o_i^t$) or cell ($c_i^t$) gates at time step $t$ of encoding level $i$, $\Phi$ is the activation function and $W_{G_i}$ is a single-strided convolutional layer with 3 × 3 kernels and padding equal to 1. Next, cell state $C_i^t$ is obtained as

$$C_i^t = f_i^t \odot C_i^{t-1} + in_i^t \odot c_i^t$$

where $f_i^t$ is the forget gate, $in_i^t$ is the input gate and $c_i^t$ is the cell gate at time step $t$ of encoding level $i$. Finally, hidden state $H_i^t$ is calculated as

$$H_i^t = o_i^t \odot \tanh(C_i^t)$$

where $o_i^t$ is the output gate.
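The gating scheme described above can be illustrated with a toy, single-channel convolutional LSTM step in plain NumPy. This is a minimal sketch, not the paper's PyTorch implementation: the per-gate kernel pairs, the helper names and the single-channel simplification are assumptions made for clarity.

```python
import numpy as np

def conv2d_same(x, k):
    """'Same' 2-D convolution of an H x W map with a 3 x 3 kernel
    (stride 1, padding 1), matching the layers described in the text."""
    h, w = x.shape
    xp = np.pad(x, 1)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h_prev, c_prev, kernels):
    """One ConvLSTM time step: every gate is a convolution over the input
    and the previous hidden state instead of a dense matrix multiplication.
    kernels maps each gate name to a (kernel_for_x, kernel_for_h) pair."""
    pre = {}
    for name in ("f", "in", "o", "c"):
        kx, kh = kernels[name]
        pre[name] = conv2d_same(x, kx) + conv2d_same(h_prev, kh)
    f = sigmoid(pre["f"])            # forget gate
    i = sigmoid(pre["in"])           # input gate
    o = sigmoid(pre["o"])            # output gate
    g = np.tanh(pre["c"])            # cell gate (candidate values)
    c = f * c_prev + i * g           # new cell state
    h = o * np.tanh(c)               # new hidden state
    return h, c
```

In the proposed network one such block sits on top of each encoding level $i$, consuming the feature maps $X_i^t$ date by date for $t = 1, \dots, D$.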
After the last encoding level $E_n$, hidden state $H_n^D$ is upsampled by the corresponding decoding level and concatenated with the information stored in $H_{n-1}^D$. This upsampling procedure continues in the same way until the last decoding level, where the feature vectors are back to their original dimensions.
For a better comprehension, the left part of Figure 1 illustrates the previously described approach for the case of D = 5. In every encoding level there are two successive sets of convolution, batch normalization and rectified linear unit layers (Conv-BN-ReLU), with a convolutional LSTM block on top of them. At the first encoding level the input image planes are increased to 16, while in the following ones the depth planes are doubled, reaching in this way 256 at the end of the last encoding level. After that, the decoding levels decrease the planes from 256 to 128, 64, 32 and 16, and finally the probabilities are produced after a log-softmax layer. All the convolutional layers adopt 3 × 3 filter kernels with stride and padding equal to 1, while the pooling layers reduce the spatial resolution of the images by 2.
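As a quick sanity check of the layer arithmetic just described, the channel and spatial dimensions per encoding level can be tabulated. This small helper was written for this illustration (the function name is not from the paper); the 64 × 64 patch size and five levels follow the text.

```python
def unet_level_shapes(in_size=64, first_planes=16, levels=5):
    """Channel depth and spatial size at the output of each encoding level:
    planes double at every level, while pooling halves the spatial size."""
    shapes = []
    planes, size = first_planes, in_size
    for level in range(1, levels + 1):
        shapes.append((level, planes, size))
        planes *= 2      # depth planes rise to twice their size
        size //= 2       # 2x2 pooling between levels
    return shapes

for level, planes, size in unet_level_shapes():
    print(f"E{level}: {planes} planes, {size}x{size}")
```

For a 64 × 64 input patch this yields 16 planes at full resolution for $E_1$ and 256 planes at 4 × 4 for $E_5$, consistent with the progression described above.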
The proposed scheme is further enriched with an additional decoding branch, depicted on the right part of Figure 1, which performs building detection for the input dates. This time the skip connections involve concatenations not with temporal, but with spatial feature vectors that have resulted from the different encoding levels. The building segmentation maps can be produced for all the input dates or only for some of them, depending on the application and the computational budget. For this implementation, we decided to train the network using only the first and last of the multiple input dates.
During the training process of the previously described multi-task learning framework we define different loss quantities, one for each task, using the cross entropy

$$CE = -\sum_{s}\sum_{l} y_{s,l} \log(p_{s,l})$$

where $y_{s,l}$ is a binary indicator that shows if class $l$ is the correct answer for observation $s$, while $p_{s,l}$ holds the probability that observation $s$ belongs to class $l$.
Five different loss values are involved in our training scheme, which are also combined together in a circular way so as to achieve better performance. In particular, we use cross entropy loss $Loss_{ch}$ for the building change detection map, as well as $Loss_{seg}^1$ and $Loss_{seg}^D$ for the building semantic maps; for this study, we provide the building segmentation maps for the first and last date only. Additionally, cross entropy is used to provide one more loss, $Loss_{ch2}$, that focuses on the building change detection by mixing together the final feature outputs resulting from the building segmentation branches. If $s$ denotes the final network output, then $Loss_{ch2}$ is defined by calculating the cross entropy for the feature vector $s_{ch} = s_{seg}^D - s_{seg}^1$, where $s_{seg}^D$ and $s_{seg}^1$ are the output feature vectors resulting from the building segmentation branch for the last and first date respectively. Moreover, we also establish $Loss_{seg2}^D$ for the building segmentation map of the last date, which is computed using the feature vector $s_{seg}^1 + s_{ch}$, namely the addition of the feature outputs resulting from the building map of the first date and the building change map. These two additional loss functions are integrated to reduce the number of false positive values in the building change detection map, combining the predicted probabilities of both of the network's decoding branches in a circular way.
For the final optimization of the network we use the weighted sum of all the previous losses choosing a weight equal to 0.6 for the building change detection branch due to the limited number of changed pixels in the dataset. In all the rest employed losses we assign a weight equal to 0.1.
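The five-term objective and its circular combinations can be sketched as follows. This is a simplified NumPy illustration, not the paper's exact implementation: the cross-entropy helper, the flattened (pixels, classes) score shapes, and the exact tensors entering the two circular terms are assumptions interpreted from the text.

```python
import numpy as np

def cross_entropy(probs, targets, eps=1e-12):
    """Mean cross entropy; probs: (N, classes), targets: (N,) integer labels."""
    n = targets.shape[0]
    return -np.mean(np.log(probs[np.arange(n), targets] + eps))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def total_loss(s_ch, s_seg1, s_segD, y_ch, y_seg1, y_segD):
    """Weighted sum of the five losses. s_ch is the change-branch output,
    s_seg1 / s_segD the segmentation-branch outputs for first / last date."""
    loss_ch   = cross_entropy(softmax(s_ch), y_ch)        # change map
    loss_seg1 = cross_entropy(softmax(s_seg1), y_seg1)    # buildings, first date
    loss_segD = cross_entropy(softmax(s_segD), y_segD)    # buildings, last date
    # circular combinations across the two decoding branches
    s_ch2   = s_segD - s_seg1   # change derived from the segmentation outputs
    s_segD2 = s_seg1 + s_ch     # last-date buildings from first date + change
    loss_ch2   = cross_entropy(softmax(s_ch2), y_ch)
    loss_segD2 = cross_entropy(softmax(s_segD2), y_segD)
    # 0.6 for the change detection branch, 0.1 for every other loss
    return 0.6 * loss_ch + 0.1 * (loss_seg1 + loss_segD + loss_ch2 + loss_segD2)
```

The heavier 0.6 weight on $Loss_{ch}$ compensates for the scarcity of changed pixels in the dataset, as noted above.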

Dataset and Implementation Details
All the experiments were based on the Attica VHR dataset, which involves 5 multispectral very high resolution images depicting a 9 km² region in the East Prefecture of Attica, Greece, for five different years. In particular, there are images acquired in 2006, 2007 and 2009 which were captured by the QuickBird satellite, as well as images for the years 2010 and 2011 which were captured by WorldView-2. Every sample is pansharpened and atmospherically corrected, with the available ground truth of both buildings and building changes having been manually annotated by remote sensing experts after an attentive and time-demanding photo-interpretation. Also, the QuickBird images were resized to the WorldView-2 resolution, which is approximately 8000 by 7000 pixels. It should be mentioned here that all experiments were conducted using the four spectral bands shared by both sensors: Red, Green, Blue and Near-Infrared.
The whole region was divided into 36 equal subregions of approximate size 1100 by 1300 pixels; 28 of them were used for training, 4 for validation and 4 for testing. We tried to split the dataset parts as wisely as possible, so that there is sufficient information during the training process as well as adequate testing features in order to draw reliable conclusions. For the training process, patches of size 64 × 64 were produced with a stride of either 32, in case building change pixels were included, or 64, in case the patch did not include any building change pixels at all. This strategy was applied as a data augmentation approach to enrich the building change semantic category, since it is extremely scarce compared to the no-change one. In addition, patches whose number of building change pixels exceeded the threshold of 3% were randomly rotated by multiples of 90 degrees, while their brightness, contrast and saturation levels were also randomly altered. Lastly, each class was associated with a weight inversely proportional to the total number of pixels included in it.
As far as hyperparameters are concerned, the Adam optimizer was picked with a learning rate of 10⁻⁴ and a batch size of 10. The dataset split seemed to work properly, since training was conducted successfully without signs of overfitting. Early stopping criteria were employed for every adopted approach in order to cease the training process and pick the optimal network weights. The applied methods needed less than 50 epochs to converge, while all experiments were implemented using the PyTorch deep learning library (Paszke et al., 2017) on a single NVIDIA GeForce GTX TITAN with 12 GB of GPU memory.

EXPERIMENTAL RESULTS AND DISCUSSION
In this section we present the experimental results along with a comparative study.

Quantitative Evaluation
For the quantitative evaluation, precision, recall, F1 score and balanced accuracy metrics have been employed. For a given class $l$, TP is the number of pixels that have been correctly classified as $l$, while FP represents the pixels that have been wrongly classified as $l$. TN is the number of pixels that have been rightly recognized as not belonging to $l$. Finally, FN represents the pixels that belong to $l$ but the model has associated them with some other class. The metrics are then defined as

$$precision = \frac{TP}{TP + FP}, \qquad recall = \frac{TP}{TP + FN}$$

$$F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}, \qquad balanced\ accuracy = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
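These definitions translate directly into code. The following helper was written for this illustration and is not taken from the paper's implementation; zero-division guards are added for degenerate counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F1 and balanced accuracy from per-class pixel counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # true positive rate
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0     # true negative rate
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, f1, balanced_accuracy
```

Balanced accuracy averages the per-class true positive and true negative rates, which matters here because the no-change class vastly outnumbers the change class.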
In Table 1 we provide the quantitative outcomes of the proposed method with and without semantic segmentation of buildings. We also compare them with all the methods described in (Caye Daudt et al., 2018a). The estimation of the accuracy metrics was carried out on the testing part of the Attica VHR dataset after a post processing phase where objects with areas smaller than 150 pixels were discarded.
As one can observe, the integration of the building semantic segmentation decoding branch boosts the F1 score in all approaches. Starting with the FC-Siam-Conc method, the F1 rate has increased by 1.6% in the case of 5 dates, while precision has also risen by 3.6%, which indicates that the multi-task learning process contributes significantly to reducing the number of false positive pixels. On the contrary, for the FC-Siam-Diff method the precision rates remain very low, not only in the 2-dates case but also in the 5-dates case, with or without multi-task learning.
Regarding FC-EF, the numerical results are slightly better in the 5-dates case, although the F1 score does not exceed the level of 48% in either of the two experiments. It should be mentioned here that for the FC-EF method we could not perform the task of building semantic segmentation simultaneously, since the different dates are fused along the channel dimension before being passed through the model, preventing in this way the construction of separate spatial feature vectors for each individual date.

As far as the proposed framework is concerned, it appears to yield the most successful results regarding false positive pixels, since its precision rate is 52.42% in the case of 5 dates, exceeding the next best precision score of multi-task FC-Siam-Conc by approximately 2.2%. In addition, the F1 score reaches the value of 55.82%, which is also 2.2% higher than the corresponding F1 rate in the multi-task FC-Siam-Conc case. For the L-UNet approach, we notice that the F1 rate is always above 50%, which means that when temporal dynamics are calculated, the attained total numbers of false positive and false negative pixels seem to be more balanced. Finally, the highest balanced accuracy score was established by the multi-task FC-Siam-Diff method with 2 dates, where the highest recall rate has also been achieved. Nevertheless, its precision rate is particularly low, which means that even though false negative pixels are more limited, false positive pixels remain numerous. As a whole, the FC-Siam-Conc and L-UNet approaches achieve almost the same balanced accuracy rates, with precision and F1 scores outperforming the FC-Siam-Diff cases. Regarding building semantic segmentation, the provided accuracy metrics are related to the year 2006, with multi-task L-UNet with 2 dates reporting the best performance. Lastly, in the last column of Table 1 we provide the approximate computational time needed by the different employed methods to complete one training epoch.
L-UNet requires twice the time needed by the methods presented in (Caye Daudt et al., 2018a), while time demands increase further when building semantic segmentation is integrated.
In general, one can notice that in all applied methods the accuracy results demonstrate that the networks are prone to many errors, especially if we consider that precision rates for the building change category never exceed 53%. This is probably caused by two main problems that exist in change detection datasets. The first one is related to registration and parallax errors that disorientate the network's learning process. Secondly, the appearance of certain building roofs may change over time, resulting in variations of the provided spectral values for certain areas that do not really undergo any semantic change on the buildings. One critical issue also lies in the fact that the building change semantic category is greatly scarce compared to the no-change one. In the case of the Attica VHR dataset, the total number of no-change pixels in the training set is almost 84 times larger than the number of change pixels. Nevertheless, despite these obstacles the extraction of temporal features seems to handle the available information in a better and more constructive way, especially when it is coupled with simultaneous building semantic segmentation.

Qualitative Evaluation
In Figure 2, we provide predictions that resulted from some of the investigated methods on the Attica VHR testing regions for the building change detection task. Green colour stands for true positive pixels, black for true negatives, red for false positives and yellow for false negatives. In the first row, the additional building of 2011 is detected successfully only by the multi-task L-UNet with 5 dates. Multi-task FC-Siam-Conc with 5 dates has only partly detected the building change, whereas multi-task FC-Siam-Diff with 2 dates and FC-EF with 5 dates have failed completely to recognize it. In the second row, all methods seem to have identified the building changes, with multi-task L-UNet with 5 dates having attained the lowest number of false positive pixels, and FC-EF with 5 dates including the largest number of false positive pixels.
In Figure 3 there are also several illustrations of building predictions from the Attica VHR testing regions of 2006. With a closer look, one can observe that all multi-task methods appear to handle the building semantic segmentation problem in a similar manner. In every approach, however, it is visible that the building boundaries are not very precise most of the time. This problem can also be observed in Figure 4, where all network outcomes resulting from the proposed multi-task scheme are demonstrated for a region of the Attica VHR testing part.

CONCLUSION
Throughout this work, we have evaluated and compared a fully convolutional multi-task deep architecture which takes advantage of temporal dynamics as well as building footprint features among the different dates in order to deal with the building change detection problem. Results show that the exploitation of temporal dynamics alone can boost the model's performance compared to other state-of-the-art architectures which are based exclusively on spatial feature representations. Accuracy metrics are further improved when the task of building semantic segmentation is performed simultaneously for the first and last date of the dataset. Still, urban change detection remains a remarkably challenging problem, since the building change semantic category is extremely scarce compared to the no-change one. Apart from that, spectral rooftop alterations and registration errors tend to greatly disorientate the networks during the training process, resulting in a large number of false positive pixels. In the future, we aim to investigate further the possible combinations of multi-task deep frameworks, as well as tackle the issue of imprecise boundaries in an attempt to produce even more accurate segmentation maps. In addition, we plan to explore whether the trained models can generalize well when tested on other very high resolution datasets depicting cities with different infrastructure.