BUILDING DETECTION FROM AERIAL IMAGERY USING INCEPTION RESNET UNET AND UNET ARCHITECTURES

ABSTRACT: Buildings are one of the key components in change detection, urban planning, and monitoring. The automatic extraction of buildings from high-resolution aerial imagery is still challenging due to the variations in their shapes, structures, textures, and colours. Recently, convolutional neural networks (CNNs) have shown significant improvements in object detection and extraction, surpassing other methods. To extract buildings, two segmentation architectures, UNet and Inception ResNet UNet, are implemented in this paper and tested on the Inria aerial image dataset. The Inception ResNet UNet utilizes the Inception architecture and residual blocks, which makes the model both wide and deep, although the numbers of parameters in UNet and Inception ResNet UNet differ only slightly. The analyses show that UNet reaches higher metric values during training. However, on the unseen dataset, Inception ResNet UNet extracts buildings more accurately (97.95% accuracy and 0.96 Dice) than UNet (94.30% accuracy and 0.55 Dice).


INTRODUCTION
The development of remote sensing earth observation systems has made aerial images available at almost all times and locations. This has opened up numerous applications in computer vision and photogrammetry, e.g., change detection (Gomroki et al., 2022; Isaienkov et al., 2021; Zhang et al., 2020), long-term large-scale monitoring (Immerzeel et al., 2009; Lehmann et al., 2015), and urban management (Mignard and Nicolle, 2014). One of the vital elements that can be extracted from such aerial images is buildings. For this task, several datasets and benchmarks have been developed, such as the Inria aerial image dataset (Maggiori et al., 2017) and the Massachusetts buildings dataset (Mnih, 2013). The aim of these efforts is to detect buildings or other urban elements (binary or multiple classes) in aerial images by semantic segmentation (Huang et al., 2018; Ji et al., 2018; Li et al., 2021; Pan et al., 2019). Semantic segmentation is a crucial task in the computer vision and remote sensing communities, which deals with assigning a label to each pixel in an image (Yuan et al., 2021). Different machine learning algorithms, including artificial neural networks (ANN), have been used to perform this task in recent years (Mas and Flores, 2008). Researchers have proposed many methods, including spatial dependency algorithms (Tarabalka et al., 2009), geographical object-based image analysis (Blaschke, 2010), feature extraction algorithms (Yang et al., 2010), and super-pixel algorithms (Hadavand et al., 2019); these methods can be considered preprocessing steps for the task of building extraction. Convolutional neural networks (CNNs) opened a new way to deal with this problem by combining mathematical convolution with the traditional ANN. Mathematical convolution in image processing is a matrix operation that applies a kernel to each pixel and its neighbours to produce a new value for the centre pixel (Gonzalez, 2009); a minimal sketch of this operation is given at the end of this section. CNNs attracted broad attention after the introduction of AlexNet (Krizhevsky et al., 2012) and its strong performance on the ImageNet dataset (Deng et al., 2009). CNN algorithms surpass earlier approaches because they provide an end-to-end solution and object-based classification (Diakogiannis et al., 2020). In a CNN architecture, each convolutional layer generates new features from the original image data and uses them as additional information to obtain a better result. Because these algorithms stack many convolutional layers, they are usually referred to as "deep CNNs," "deep networks," or "deep learning algorithms". Deep learning models have been successfully applied in different computer vision and remote sensing tasks such as object detection (Wu et al., 2020; Zhao et al., 2019), image segmentation (Ghosh et al., 2019; Wang et al., 2019), human activity monitoring (Toshev and Szegedy, 2014; Zheng et al., 2019), object tracking (Ciaparrone et al., 2020; Zhai et al., 2018), and semantic segmentation. Semantic segmentation is an essential input for many applications in computer vision and remote sensing, including scene understanding for autonomous driving (Siam et al., 2018), augmented reality (Ko and Lee, 2020), and environmental monitoring applications such as precision agriculture (Anand et al., 2021), change detection (Venugopal, 2020), and urban mapping and monitoring (Du et al., 2021). In urban remote sensing, discriminating different elements of a city, including different kinds of buildings, paved areas, water bodies, trees and
grasslands, cars, and clutter, is challenging due to variations in shapes, structures, textures, and colours (Diakogiannis et al., 2020). In object-based image analysis, this problem is addressed by defining several subclasses for a specific class such as building; in a postprocessing step, these subclasses are merged to obtain a map of buildings (Benz et al., 2004). However, an algorithm able to detect a class of objects with different characteristics is still a difficult goal in remote sensing image analysis. Among the various existing architectures, UNet (Ronneberger et al., 2015) is a well-known and powerful architecture that shows prominent results in labelling remote sensing imagery in different applications (Feng et al., 2018; Freudenberg et al., 2019; Yang et al., 2019). The UNet structure, consisting of an encoder-decoder block that labels the pixels of the input image, was originally developed by Ronneberger et al. (2015) to segment biomedical images. The model aims to distinguish the disease location from the surrounding area in biomedical images in order to obtain the size and location of the disease in the body. Chhor et al. (2017) applied a UNet-based model to building detection, removing the down-sampling layers for ease of optimization and to tackle vanishing gradients; with the loss set to the negative of the Dice value, they reached a Dice coefficient of 0.75 and an IoU of 0.60. Emek and Demir (2020) used Sentinel-1 SAR and Sentinel multispectral images covering 120 km²; their model is a CNN based on the UNet architecture, and they achieved an accuracy of 81%. The output mask of their model also marks some other elements as buildings; e.g., in wooded areas, some trees are classified as buildings because of their high reflectance values. In addition, UNet is powerful enough to deal with building extraction problems in complex urban landscapes (Pan et al., 2020). Wang and Miao (2022) developed RS-UNet, an architecture that incorporates residual learning into UNet and combines the Focal Loss (FL) with Atrous Spatial Pyramid Pooling (ASPP); ASPP is used as the connection between the encoder and decoder, and FL serves as the loss function to balance the training. To extract more features from the images, larger training images of 512×512 px were used, although this increases the training time. In the architecture, the encoder and decoder parts each use five layers. Among the tested image sizes, the largest (512×512 px) achieved 97.66% precision in 200 epochs, which dropped to 97.41% for 128×128 px; 256×256 px was selected as the best training size considering both time and precision. In this paper, the UNet and Inception ResNet UNet architectures are trained and analysed on the Inria aerial image dataset, with all buildings categorized in one class. Our analyses show that UNet, which uses a single kernel size in its convolutions, is unable to detect very large and very small buildings in an image; the detection is limited to a narrow range of building sizes, and the network is not deep enough to detect all kinds of buildings. In some cases, this architecture also detects shadows as part of buildings. The Inception ResNet UNet is both deep and wide; by using various kernel sizes, this architecture can detect buildings with various shapes, structures, textures, and colours. More details of the architectures are presented in the following sections. This paper is organized as
follows: Section 2 presents the methodology. In Section 3, the experimental results are discussed and interpreted. The summary and conclusions are presented in Section 4.
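As a simple illustration of the convolution operation described above, the following minimal sketch (the image and kernel values are arbitrary and only meant to show the mechanics) applies a 3×3 averaging kernel to a small grayscale image with NumPy and SciPy; each output pixel is a weighted sum of the corresponding input pixel and its neighbours.

```python
import numpy as np
from scipy.ndimage import convolve

# A toy 6x6 grayscale "image" and a 3x3 averaging kernel (illustrative values only).
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.full((3, 3), 1.0 / 9.0)

# Each output pixel is the weighted sum of the input pixel and its neighbours.
smoothed = convolve(image, kernel, mode="nearest")
print(smoothed.shape)  # (6, 6): output has the same size as the input
```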

METHODOLOGY
Our proposed deep learning structure is based on the UNet and Inception ResNet UNet architectures. Inception ResNet UNet is an improvement on UNet that solves the convergence problem for deeper encoder-decoder layers. Being deeper and wider, Inception ResNet UNet allows for more precise detection of objects in the image, which is why it was selected. In the following, the UNet and Inception ResNet UNet architectures are explained.

UNet architecture
UNet is a convolutional network proposed in 2015 for medical image segmentation to obtain the precise location and area of objects of a class (Ronneberger et al., 2015). The architecture has a U-shaped structure consisting of two main paths, called the contracting and expansive paths by the authors and commonly known as the encoder and decoder. The contracting or encoder path uses repeated convolutions with a 3×3 kernel size, same padding, and stride one with a Rectified Linear Unit (ReLU), followed by batch normalization and max pooling, which increases the number of feature layers and decreases the size of the image simultaneously. There are no fully connected layers in this model. This part of the architecture is a typical CNN that can be replaced with any pretrained model. In every down-sampling step, the number of features is doubled. Conversely, in the expansive or decoder path, up-convolution is used to decrease the number of features and bring the image size back to that of the original input; every up-convolution step halves the number of features. To prevent loss of detail, features from the contracting path are concatenated during up-sampling, and typical convolution layers are then applied to the concatenated features. This procedure continues until the mask image is created and the result obtained. Figure 1 depicts the UNet network used in this paper, with the encoder on the left and the decoder on the right side. The input size of the model is 512×512 px. In the preprocessing step, each 5000×5000 px image of the Inria aerial image dataset is cropped to this size, yielding 100 sub-images per image with a 12-px overlap between neighbouring crops. The output image size is also 512×512 px.
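A minimal sketch of such an encoder-decoder, written in TensorFlow/Keras purely for illustration (the paper does not specify its implementation framework, and the depth and filter counts here are reduced for brevity), shows the doubling and halving of feature maps and the skip connections described above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions with 'same' padding, batch normalization, and ReLU.
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def build_unet(input_shape=(512, 512, 3), base_filters=32):
    inputs = layers.Input(shape=input_shape)

    # Encoder: each step doubles the number of features and halves the image size.
    c1 = conv_block(inputs, base_filters)
    p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, base_filters * 2)
    p2 = layers.MaxPooling2D()(c2)

    # Bottleneck.
    b = conv_block(p2, base_filters * 4)

    # Decoder: up-convolutions halve the features; skip connections concatenate
    # encoder features to recover spatial detail.
    u2 = layers.Conv2DTranspose(base_filters * 2, 2, strides=2, padding="same")(b)
    u2 = conv_block(layers.Concatenate()([u2, c2]), base_filters * 2)
    u1 = layers.Conv2DTranspose(base_filters, 2, strides=2, padding="same")(u2)
    u1 = conv_block(layers.Concatenate()([u1, c1]), base_filters)

    # Single-channel sigmoid output for the binary building mask.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(u1)
    return Model(inputs, outputs)

model = build_unet()
```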

Inception ResNet UNet architecture
The main structure of the network architecture is represented in Figure 3. The Inception ResNet UNet architecture is a modification of the UNet and the Inception ResNet v2 (Szegedy et al., 2016b). Inception ResNet is a combination of the Inception architecture (Szegedy et al., 2015) and residual blocks (He et al., 2016). The Inception architecture applies convolutions with multiple kernel sizes at the same level (Szegedy et al., 2015). In other words, instead of a single convolution, the architecture uses multiple convolutions with different kernel sizes in parallel at every block (Figure 5), and at the end of every block they are concatenated to form a single layer of features (Szegedy et al., 2016a). Due to the use of multiple kernel sizes, objects of different sizes can be detected in the image; in other words, the model becomes wider and deeper (Szegedy et al., 2016a). The sub-blocks depicted in Figure 5 contain the parallel convolutions. To improve accuracy and reduce computational complexity, the modifications in Inception ResNet v2 can be summarised as follows. First, the 5×5 kernel is factorized into two 3×3 kernels (and likewise a 7×7 kernel into three 3×3 kernels): instead of a single 5×5 convolution, two 3×3 convolutions are used, which yield the same results while the number of parameters is reduced (Szegedy et al., 2016b). Second, n×n convolutions are factorized into n×1 and 1×n convolutions: every square kernel is divided into two linear kernels in the horizontal and vertical directions, which again reduces the number of parameters without losing accuracy (Szegedy et al., 2016b). Typically, accuracy is increased by making the network deeper by adding more layers, which may help the network learn both the basic and the complex details of an image. However, simply adding more layers eventually causes the accuracy to degrade, and choosing the number of layers and designing deeper architectures is a challenge when aiming for optimum results, especially in building detection and segmentation. An aerial image contains various building types, and the model should be able to detect all kinds of buildings (varying in shapes, structures, textures, and colours) that are all considered in one class. The network can be made deeper by using residual blocks: in a residual block, the input of the block is added, via a skip connection, directly to its output, so each layer feeds into the next layer and also into layers further ahead. As previously stated, Inception ResNet is a combination of the Inception architecture and residual blocks. It consists of 164 layers, forming a very deep and wide CNN, and its performance is optimised by balancing the number of filters at every stage. There are 37 blocks in the encoder and six blocks in the decoder part of the Inception ResNet UNet network (Figure 3). The results of convolutions in the encoder are concatenated into three different parts of the decoder, as in the original UNet algorithm (Figure 1), to form the final result. The numbers of inputs and outputs of each block, as well as the output feature size of every block, are given in Figure 3 within each block unit. Block 3 is repeated 10 times; its output feature size of 61×61×320 is the input of the next block. The same holds for block 5: after 20 iterations of this block, the output feature size is 30×30×1088, which feeds into the next section. The details of the sub-blocks of Figure 3 are displayed in Figure 5. In Figure 5, block 1 simply shows a typical convolution, batch normalization, and an activation function. Blocks 3 and 5 contain a skip connection and, like the other blocks, include block 1 in their structure. In the Inception ResNet UNet architecture, there are one to four parallel convolutions in every block (Diakogiannis et al., 2020). The common property of all blocks is that the results of all internal operations of the block are concatenated, similar to residual connections, to produce the output, which is usually the input of another block. Block 4 provides two outputs, one for concatenation and a second, zero-padded one reserved for use in the decoder part. Block 6 forms the output of the network, using the concatenation of the input image and the output of the UNet encoder-decoder block to produce the result. The proposed structure has 36 million parameters that need to be trained. In the 2014 ILSVRC classification challenge (Russakovsky et al., 2015), VGGNet (Simonyan and Zisserman, 2015) and GoogLeNet (Szegedy et al., 2015) produced comparably high performance, but VGGNet, with 138 million parameters, requires far more computational resources than GoogLeNet, with 5 million parameters. The observations of Szegedy et al. (2015) show that the quality of Inception ResNet v2 is higher than that of GoogLeNet, while it does not require the high resources of VGGNet, and the computational cost of Inception is lower than that of VGGNet. These are the main reasons for selecting Inception ResNet v2 over other networks.
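To make the factorized parallel convolutions and the residual addition concrete, the following sketch builds one simplified Inception-ResNet-style block. It is written in TensorFlow/Keras as an assumed framework, and the branch layout, filter counts, and scaling factor are illustrative rather than the exact values of Inception ResNet v2: parallel branches with 1×1 and factorized 1×7/7×1 convolutions are concatenated and added back to the block input through a skip connection.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn(x, filters, kernel_size):
    # Corresponds to block 1 in Figure 5: convolution, batch normalization, activation.
    x = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def inception_resnet_block(x, scale=0.1):
    channels = x.shape[-1]

    # Parallel branches at the same level, using factorized kernels:
    # a 7x7 convolution is replaced by a 1x7 followed by a 7x1 convolution.
    branch_a = conv_bn(x, 32, 1)
    branch_b = conv_bn(x, 32, 1)
    branch_b = conv_bn(branch_b, 32, (1, 7))
    branch_b = conv_bn(branch_b, 32, (7, 1))

    # Concatenate the branches and project back to the input channel count.
    mixed = layers.Concatenate()([branch_a, branch_b])
    up = layers.Conv2D(channels, 1, padding="same")(mixed)

    # Residual connection: the block input is added to its (scaled) output,
    # which eases gradient flow in very deep networks.
    out = layers.Add()([x, layers.Rescaling(scale)(up)])
    return layers.ReLU()(out)

# Example: apply one block to a feature map of size 30x30x1088.
inputs = layers.Input(shape=(30, 30, 1088))
outputs = inception_resnet_block(inputs)
block_model = tf.keras.Model(inputs, outputs)
```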

Dataset
The Inria aerial image labelling dataset (Maggiori et al., 2017) was used in our experiments. The dataset covers 405 square kilometres with a 0.30 m ground sampling distance (GSD) and consists of 180 images, each with a corresponding mask, at 5000×5000 px dimensions. The masks divide the image area into two semantic classes: building and non-building. The images are captured across various urban landscapes and illumination conditions. The dataset is gathered from areas in the US and Austria, including Bellingham, Innsbruck, San Francisco, Tyrol, and Chicago (Figure 4). These regions contain both high- and low-density urban features: density is higher in Chicago, San Francisco, Vienna, and Innsbruck, and lower in Kitsap, Bloomington, and West and East Tyrol. Every image is divided into 512×512-px sub-images for training the algorithm, leading to a total of 30,000 training and validation images and masks.
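The tiling described above can be reproduced with a few lines of NumPy. The sketch below is only an assumed reconstruction based on the figures quoted in the text (512-px tiles, roughly 12-px overlap, 100 crops per image); the exact cropping scheme of the authors is not specified, so the last row and column of tiles are clamped to the image border here.

```python
import numpy as np

def tile_image(image, tile=512, stride=500):
    """Cut an (H, W, C) image into overlapping tile x tile crops.

    A stride of 500 px gives a 12-px overlap between neighbouring 512-px tiles.
    The last row/column of tiles is clamped to the image border, so a
    5000 x 5000 image yields a 10 x 10 grid of 100 crops (edge tiles overlap
    their neighbours slightly more than 12 px).
    """
    h, w = image.shape[:2]
    starts_y = sorted(set(list(range(0, h - tile, stride)) + [h - tile]))
    starts_x = sorted(set(list(range(0, w - tile, stride)) + [w - tile]))
    tiles = [image[y:y + tile, x:x + tile] for y in starts_y for x in starts_x]
    return np.stack(tiles)

# Toy example standing in for a 5000 x 5000 px Inria image.
image = np.random.randint(0, 255, size=(5000, 5000, 3), dtype=np.uint8)
crops = tile_image(image)
print(crops.shape)  # (100, 512, 512, 3)
```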

Implementation details
Inception ResNet UNet and UNet have 36 and 34 million parameters, respectively. The UNet is a standard CNN, whereas the Inception ResNet UNet is built from the Inception architecture and residual blocks. Considering the number of layers in Inception ResNet UNet, the difference between their parameter counts is small. The reason, as mentioned in the methodology section, lies in the updates to GoogLeNet (Szegedy et al., 2015) that led to a deeper and wider model with only small variations in the number of parameters. In the architecture, there are one to four parallel convolutions in every block. We use binary classification in this paper, but the approach can also be extended to multi-label classes.
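The paper does not state which framework or library was used to build the networks. As one possible reproduction path, the open-source segmentation_models Keras package provides a UNet decoder on top of an Inception ResNet v2 encoder; the sketch below is therefore only an assumed illustration, not the authors' implementation, and the plain UNet can be built directly as in the earlier sketch.

```python
import os
os.environ["SM_FRAMEWORK"] = "tf.keras"  # use the tf.keras backend
import segmentation_models as sm

# Hypothetical reproduction path: a UNet decoder on an Inception ResNet v2 encoder,
# with a single sigmoid output channel for the binary building / non-building mask.
model = sm.Unet(
    backbone_name="inceptionresnetv2",
    input_shape=(512, 512, 3),
    classes=1,
    activation="sigmoid",
    encoder_weights=None,  # train from scratch; "imagenet" would load pretrained weights
)

print(model.count_params())  # on the order of tens of millions of parameters
```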
The networks are analysed using training and validation data during the training process in every epoch. The results demonstrate that neither overfitting nor underfitting occurs during the training of the models. After training is complete, the unseen dataset (from the same distribution as the models' training input) is segmented by the trained models, and the metric results (accuracy and Dice) are compared. The main aim of the trained models is to perform well on unseen datasets; therefore, our main focus is to analyse the performance of the models on the unseen datasets. Figures 6-9 show the results of applying the models to these datasets. The loss function during training is the cross-entropy, and the metrics are the Dice coefficient and accuracy; they are explained in the following. The cross-entropy, which measures the difference between two probability distributions, is used as the loss function of the models (De Boer et al., 2005). Its mathematical equation is as follows:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_{true,i}\,\log\!\left(y_{pred,i}\right) + \left(1 - y_{true,i}\right)\log\!\left(1 - y_{pred,i}\right)\right]$$

where:
ytrue: the mask image
ypred: the predicted image

The Dice and accuracy metrics are computed and used to evaluate the results of the experiments. The Dice coefficient measures the overlap between the model's prediction and the corresponding mask (Milletari et al., 2016); it returns a value between 0 and 1, where the maximum value coincides with the ideal prediction result. The Dice metric is computed using the following equation:

$$Dice = \frac{2\left|y_{true} \cap y_{pred}\right|}{\left|y_{true}\right| + \left|y_{pred}\right|} = \frac{2\sum_{i} y_{true,i}\, y_{pred,i}}{\sum_{i} y_{true,i} + \sum_{i} y_{pred,i}}$$

The results of the computed metrics for the training phase and for testing on the unseen datasets are presented in Table 1 for both networks. They show that UNet reaches higher accuracy on the training dataset, with an accuracy of 99.77% and a Dice of 0.98. However, its performance drops on the unseen dataset, where the accuracy and Dice are reduced to 94.30% and 0.55. Inception ResNet UNet performs significantly better and is successful in detecting the details of building borders. The Inception ResNet UNet architecture consists of various kernel sizes and residual blocks (Figure 5); by utilizing them, it detects objects with varying shapes, structures, textures, and colours. As illustrated in Figure 8, the Inception ResNet UNet can detect very small and very large buildings, whereas the UNet cannot detect those buildings accurately. Especially for very large buildings in the Vienna region, the UNet (Figure 6, 6th and 7th columns, and Figure 8) fails to detect the footprints; for medium-sized buildings in the Austin region, the two models perform at almost the same level. Another issue when working with buildings is the relief displacement that affects elevated objects, such as tall buildings, in aerial and satellite imagery. To remove this effect, the image should be processed into a true orthophoto, which is difficult because it requires precise 3D models of the buildings; therefore, relief displacement usually remains in orthorectified images, especially in areas with tall buildings. The results in Figure 9 show that the Inception ResNet UNet extracts tall building footprints more precisely, while in the UNet results the shadows are detected as part of the building.
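A minimal implementation of the binary cross-entropy loss and Dice metric defined above, sketched here with NumPy under the assumption that masks and predictions are arrays of values in [0, 1]; the function names and the small example are illustrative only.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Mean per-pixel cross-entropy between the mask and the predicted probabilities.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

def dice_coefficient(y_true, y_pred, eps=1e-7):
    # Overlap between prediction and mask: 1.0 corresponds to a perfect prediction.
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)

# Toy example: a 4-pixel mask and a reasonably confident prediction.
mask = np.array([1.0, 1.0, 0.0, 0.0])
pred = np.array([0.9, 0.8, 0.2, 0.1])
print(binary_cross_entropy(mask, pred), dice_coefficient(mask, pred))
```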

DISCUSSION AND CONCLUSIONS
In this research, two deep network architectures, UNet and Inception ResNet UNet, are implemented and evaluated for automatic building detection from aerial imagery. The Inception ResNet UNet could detect buildings of different shapes, structures, textures, and colours in almost all regions, whereas UNet could not detect very large buildings, e.g., in the Vienna region. That is because the Inception ResNet UNet model is wide and deep, with only a small difference in the number of parameters compared to UNet. Our experimental results demonstrate that Inception ResNet UNet, with 97.95% accuracy on unseen data, is preferable to UNet, with 94.30%. Our future work includes improving the architectures for multiclass classification of aerial images and for precise boundary detection towards the vectorization of objects.

Figure 2
Figure 2 depicts the residual block, which helps to design a deeper model without the accuracy degradation that otherwise appears in very deep models. By designing the model with this concept, it can learn both the simple and the complex elements in the image. Residual blocks in deep architectures help to avoid vanishing gradients during backpropagation, especially in deeper architectures with many layers (He et al., 2016).

Figure 3 :
Figure 3: An overview of the Inception ResNet UNet architecture. The left part is a regular convolutional neural network, called the encoder. The right side is called the decoder, which consists of convolution blocks, transposed convolutions, and concatenations. Zero padding is used to obtain matching sizes for concatenating features. Details of blocks 1 to 6 are shown in Figure 5. The output of every block is given at the end of every section.

Figure 4 .
Figure 4. Images and their related masks from the Inria aerial image dataset, belonging to the Vienna, Kitsap, and Chicago regions. The images have dimensions of 5000×5000 px.

Figure 5 .
Figure 5. Details of every block used in the Inception ResNet UNet network of Figure 3. Blocks 3 and 5 contain a residual connection and block 1 in their structure. In the architecture, there are one to four parallel convolutions in every block. We use binary classification in this paper, but the architecture can also be used for multi-label classes.

Figure 6 .
Figure 6. From (a) to (d), some small buildings are detected by Inception ResNet UNet, but UNet could not detect them precisely. In (f), large buildings are not detected properly by UNet, though Inception ResNet UNet detects them more accurately. In (h), some shadows were detected as part of the buildings by UNet, whereas Inception ResNet UNet could handle the relief displacement.

Figure 7 .
Figure 7. The edges are detected more precisely by Inception ResNet UNet in comparison with UNet.

Figure 8 .
Figure 8. From top to bottom: the aerial images (first row), the related masks (second row), and the predicted masks from Inception ResNet UNet (third row) and UNet (last row). Inception ResNet UNet detects both large and small buildings more precisely.

Figure 9 .
Figure 9. Tall buildings are detected more precisely by Inception ResNet UNet in comparison with UNet, whereas UNet detects some shadows as part of the buildings.

Table 1 .
Results of training for both network architectures. Contrarily, the training results of Inception ResNet UNet are lower than those of UNet (98.17% accuracy with 0.