IMPROVING SEMANTIC SEGMENTATION PERFORMANCE BY JOINTLY USING HIGH RESOLUTION REMOTE SENSING IMAGE AND NDSM

Semantic segmentation algorithms based on fully convolutional neural networks have greatly improved the segmentation accuracy of high-resolution remote sensing (RS) images. However, the interpretation of RS images from a single sensor is still challenging due to the variety and complexity of land objects and the extreme imbalance in their sizes and numbers. In contrast, multiple sensors can provide complementary information on the land classes and thus benefit the interpretation. In this context, this research explores the joint use of RGB optical bands and a normalized DSM (nDSM) to analyze an urban scene. The method first concatenates the three-channel RGB image and the one-channel nDSM band into a four-channel image. Thereafter, a slightly adjusted ResNet-101 network is used as the backbone to retain multiple levels of feature information through its residual blocks. The augmented RGB and nDSM images are then used to train the network. The established model was evaluated on the Potsdam test set. Results show that the proposed method achieves 86.85% Overall Accuracy (OA) and 77.42% Mean Intersection over Union (MIOU), which are 6.88% and 11.39% higher than the results achieved with RGB images alone. The improvement is largest for small targets such as car and tree. The experimental results show that a simple structural adjustment of the ResNet-101 network can achieve good segmentation performance on RS images (especially for small targets) when twice-augmented RGB channels are combined with nDSM channels.


INTRODUCTION
Remote sensing (RS) technology uses sensors to observe and detect target objects from a long distance. High-resolution RS imagery is an important window for earth observation (Zheng, 2017). At present, semantic segmentation of high-resolution RS images has become a hot issue in RS image interpretation and is widely used in environmental monitoring (Blaschke et al., 2000), crop cover and type analysis (Yang, 2016), forest tree species analysis (Dechesne, 2017), and architectural classification of urban space and land use analysis (Rottensteiner, 2014). However, RS images still contain many complicating factors, such as feature diversity within a class, uneven amounts of data per class, spatially dispersed targets, variable scale, complex backgrounds and shadows, which lead to poor segmentation performance and frequent missed segmentation. High-resolution RS images are characterized by rich geometric shape and texture features, clear topological relationships between ground objects in space and a huge amount of data (Tang et al., 2013). Traditional processing techniques cannot make full use of these rich details and background information, which results in a phenomenon called "rich data and poor information". As a result, segmentation with both high precision and high efficiency remains a challenging problem.
With the development of deep learning, semantic segmentation technology has made great progress. Since deep learning methods can automatically extract features tailored to a specific classification task, they offer a better choice for processing RS images of complex scenes (Yuan, 2021). The biggest difference between semantic segmentation methods based on convolutional neural networks and traditional methods is that the network can automatically learn image features and carry out end-to-end classification learning, which greatly improves the accuracy of semantic segmentation. Standard semantic segmentation is the process of classifying each pixel into an object class, with a deep neural network extracting semantic information and image features from a large amount of labeled data. Pixel-based methods, such as (Zheng, 2022), are usually effective in extracting details and edges. The quality of image semantic segmentation directly determines the quality of subsequent classification or recognition. Therefore, the realization and application of an effective image semantic segmentation algorithm is of great practical significance.
With the continuing progress of remote sensing technologies, the resolution of RS images is getting higher and higher, and the ground object information is getting richer (Zheng, 2021). Despite the continuous improvement of semantic segmentation, many problems remain to be solved, mainly in the following aspects.
Firstly, the segmentation results of RS images are often inconsistent with the semantic information. Due to the rich information of ground objects in high-resolution RS images, general segmentation methods designed for other kinds of images may perform poorly on RS images. How to make the segmentation results consistent with the ground truth of semantic image objects, so as to improve the segmentation accuracy and the mean intersection over union, has become an urgent problem. Secondly, the phenomena of "same object with different spectrum" and "different objects with same spectrum" occur in high spatial resolution RS images. Such images have vivid geometric and attribute details, which make small targets, the texture and shadow of ground objects and other interference factors detectable. Meanwhile, the spectral response variation of similar objects, or even of the same ground object, becomes more obvious as spatial resolution improves (Liu et al., 2011). Therefore, "same object with different spectrum" and "different objects with same spectrum" are common in high spatial resolution RS images, which brings great difficulties to the segmentation of the relevant ground objects.
Thirdly, the segmentation performance on small targets is poor. In a general network model, the backbone neural network applies several down-sampling operations. Because small targets occupy few pixels in the feature map, often only a single-digit number of pixels after down-sampling, the designed classifier performs poorly on them (Nogueira, 2019).
In this context, the main purpose of the study is to establish a deep learning network for semantic segmentation of high-resolution RS images. In this method, the three-channel RGB images and one-channel nDSM images in the ISPRS Potsdam dataset are stacked into four-channel images. The four-channel images are taken as input and fed into an adjusted ResNet-101 network for training.

RELATED WORK
Over the past few decades, research on RS has placed considerable emphasis on the application of machine learning, and many deep learning methods have been applied to semantic segmentation of RS images.
Traditional image semantic segmentation techniques mainly include threshold-based, edge-based and region-based segmentation, as well as segmentation based on specific theories. Traditional image segmentation methods not only struggle to meet the requirements of practical applications in real-time scene understanding and image information processing, but also struggle to achieve adequate classification accuracy and image interpretation efficiency (Liang, 2020). Semantic segmentation based on deep learning can address these problems.
In 2015, FCN (Fully Convolutional Networks) (Long et al., 2015) generalized the original Convolutional Neural Network (CNN) structure to dense, per-pixel prediction. This end-to-end method can process images of arbitrary size, which improves processing speed compared with the traditional image patch classification method. ResNet (Residual Network) (He et al., 2016) was proposed in 2016. The residual blocks of the network have two structures, "building blocks" and "bottleneck building blocks". Compared with VGGNet and GoogLeNet, this network adds identity mappings alongside residual mappings, so that each block learns a residual function relative to its input rather than the full mapping. This solves the problem that accuracy decreases as the network deepens. SENet (Squeeze-and-Excitation Networks) (Jie et al., 2017) presents a new structural unit called the "Squeeze-and-Excitation" (SE) block. It adaptively recalibrates channel-wise feature responses by modeling interdependencies between the channels. These SE blocks are stacked together to form an SENet. A semantic segmentation method that uses a multi-context paradigm to obtain the optimal patch size is proposed in (Nogueira et al., 2019). This method can capture ground and context features at the same time, which is of great help in improving the overall classification accuracy and the classification accuracy of small targets (such as vehicles). ResUNet-a (Fid, 2020) was proposed for remote sensing image segmentation in 2020. The work consists of a new deep learning architecture, ResUNet-a, and a new loss function based on the Dice loss. ResUNet-a uses the UNet encoder-decoder structure as its backbone and combines residual connections, atrous convolution, pyramid scene parsing pooling and multi-task inference.

METHODS
In this paper, the three-channel RGB images, their labels and the corresponding one-channel nDSM are augmented twice during image preprocessing. The augmented images are divided into a training set and a validation set. When an image is read, the three-channel RGB image and the one-channel nDSM image are stacked. The four-channel images are input into the adjusted ResNet-101 network, and the predictions are output as TIF images and compared with the test set for evaluation.

Augmentation
In the first augmentation, five methods were applied randomly and jointly: random image cropping, Gaussian blur, a special affine transform (referred to as Rotation), noise enhancement and color enhancement. The second augmented images were obtained by horizontal, vertical and mirror flipping of the first augmented images, so that each image was expanded into three images. The twice-augmented images are fed into the network together. The three-channel labels and single-channel nDSM corresponding to the high-resolution RS images were expanded to the same number. The augmentation process is shown in Fig. 1.
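The second augmentation stage can be illustrated with a minimal NumPy sketch. It assumes that "mirror" flipping means flipping along both axes, which the text does not state explicitly, and the function name `flip_variants` is hypothetical. The key point it shows is that the RGB image, the nDSM and the label must receive the same geometric transform so that they stay pixel-aligned.

```python
import numpy as np

def flip_variants(rgb, ndsm, label):
    """Return three flipped copies of an aligned (rgb, ndsm, label) triple.

    Assumption: the three variants are a horizontal flip, a vertical flip
    and a flip along both axes (our reading of "mirror" flipping).
    """
    variants = []
    for axes in ((1,), (0,), (0, 1)):  # horizontal, vertical, both
        # Apply the identical flip to all three arrays to keep them aligned.
        variants.append(tuple(np.flip(a, axis=axes) for a in (rgb, ndsm, label)))
    return variants
```

Each input triple yields three output triples, which matches the statement that every image is expanded into three images.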

The Channel Stacking
In this paper, the RGB and nDSM images are read with tifffile, and their arrays have shapes (H, W, 3) and (H, W) respectively. They are then stacked along the channel axis, so that the output array has shape (H, W, 4). Fig. 2 illustrates this, where (a) is the 3-channel RGB image, (b) is the 1-channel nDSM image, and (c) is a visualization of the combined RGB and nDSM image.
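The stacking step can be sketched as follows. This is a minimal illustration that uses synthetic arrays in place of files read from disk. The function name `stack_rgb_ndsm` and the cast to float32 are assumptions, not details taken from the paper.

```python
import numpy as np

def stack_rgb_ndsm(rgb, ndsm):
    """Stack an (H, W, 3) RGB array and an (H, W) nDSM array into (H, W, 4)."""
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("rgb must have shape (H, W, 3)")
    if ndsm.shape != rgb.shape[:2]:
        raise ValueError("ndsm must have shape (H, W) matching rgb")
    # Give the nDSM a trailing channel axis, then join along the channel axis.
    # Casting to float32 is an assumption made so both inputs share one dtype.
    return np.concatenate(
        [rgb.astype(np.float32), ndsm[..., np.newaxis].astype(np.float32)],
        axis=-1,
    )

# Synthetic stand-ins for arrays that would normally be read from TIF files.
rgb = np.zeros((256, 256, 3), dtype=np.uint8)
ndsm = np.ones((256, 256), dtype=np.float32)
stacked = stack_rgb_ndsm(rgb, ndsm)
print(stacked.shape)  # (256, 256, 4)
```

The nDSM ends up as the fourth channel, so the first three channels can still be sliced out as the original RGB image.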

ResNet-101
The characteristic of this network is its use of "shortcut connections", which effectively alleviate the exploding and vanishing gradient problems caused by deepening the network. When using this network to extract features, our method changes the stride so that the feature map is retained to a greater extent and the loss of small-target information is reduced.

Model Structure
The previous three sections describe the preprocessing methods and channel merging. This section introduces the overall architecture in detail.
The model takes the ResNet-101 network as its backbone, and the last two layers of the model, i.e., the GlobalAvgPool2D and Flatten layers, are discarded. The backbone consists of five convolution layers, as shown in Fig. 3. The first convolution layer is a convolution with a kernel size of 7×7 followed by max pooling with a kernel size of 3×3 and a stride of 2. The second and third layers consist of three "bottleneck" building blocks with a stride of 2. The fourth and fifth convolution layers consist of 23 and 3 "bottleneck" building blocks respectively, both with a stride of 1. Compared with the original network, the stride of the last two layers changes from 2 to 1, and the output shape of the feature map changes from (8, 8, 2048) to (32, 32, 2048).
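The effect of the stride change on the feature map size can be checked with a short calculation. The sketch below assumes a 256×256 input and the standard ResNet-101 down-sampling schedule (stride-2 stem convolution, stride-2 max pooling, stride 1 in the first block group and stride 2 in each of the remaining three). The stride lists and the helper name `feature_map_size` are illustrative assumptions.

```python
def feature_map_size(input_size, strides):
    """Spatial size after applying a sequence of strided layers ('same' padding)."""
    size = input_size
    for stride in strides:
        size = -(-size // stride)  # ceiling division
    return size

# Assumed schedule: stem conv, max pool, then the four block groups.
ORIGINAL_STRIDES = [2, 2, 1, 2, 2, 2]
ADJUSTED_STRIDES = [2, 2, 1, 2, 1, 1]  # last two groups changed from 2 to 1

print(feature_map_size(256, ORIGINAL_STRIDES))  # 8
print(feature_map_size(256, ADJUSTED_STRIDES))  # 32
```

Under these assumptions the total down-sampling factor drops from 32 to 8, which is consistent with the reported change from (8, 8, 2048) to (32, 32, 2048).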
The whole architecture is presented in Fig. 4. The first step of the method is augmentation. The RS images are augmented twice: the first augmentation consists of cropping, random Gaussian blur, random special affine transform, random noise enhancement and color enhancement, and the second augmentation consists of horizontal, vertical and mirror flipping. The nDSM images are also augmented twice. Their first augmentation is cropping, and their second augmentation is the same as for the RS images. The second step is to concatenate the three-channel RGB images and the one-channel nDSM images. Thirdly, the feature map is restored to the original image size by a transposed convolution, whose kernel size is set to 64×64 and stride to 8, and which is initialized by bilinear interpolation. Finally, a 1×1 convolution layer is used. The dataset has 6 categories (including background), so the output shape is (256, 256, 6).
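The up-sampling step can be illustrated with a small sketch of bilinear kernel initialization. It follows the weight formula commonly used to initialize transposed convolutions in FCN-style networks, which is an assumption about how the bilinear initialization was done here. The helper names and the 'same' padding mode are also assumptions.

```python
import numpy as np

def bilinear_kernel(size):
    """2-D bilinear interpolation weights for a square kernel of a given size."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    # Weights fall off linearly with distance from the kernel center.
    return (1 - np.abs(rows - center) / factor) * (1 - np.abs(cols - center) / factor)

def transposed_conv_output(size, stride):
    """Output size of a transposed convolution, assuming 'same' padding."""
    return size * stride

kernel = bilinear_kernel(64)
print(kernel.shape)                   # (64, 64)
print(transposed_conv_output(32, 8))  # 256
```

With a stride of 8, the 32×32 feature map is restored to 256×256, which matches the stated output shape of (256, 256, 6) after the 1×1 convolution.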

EXPERIMENTAL RESULTS AND DISCUSSION
In this section, the experimental settings are introduced in more detail. Section 4.1 describes the dataset used in this experiment. Section 4.2 describes the implementation details of the experiment. The evaluation functions are provided in Section 4.3. Finally, the experimental results are analyzed in Section 4.4.

Dataset
The ISPRS 2D Semantic Labeling Contest Potsdam dataset (https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx) is a high-resolution aerial image dataset. It has 38 patches of the same size (6000 × 6000 pixels) with a spatial resolution of 5 cm. Each patch was extracted from an orthophoto mosaic, and 24 of the patches were released with corresponding semantic labels. The image files are provided in different channel compositions, including IRRG (3 channels, IR-R-G), RGB (3 channels, R-G-B), nDSM (1 channel) and DSM (1 channel). In this experiment, RGB images are combined with nDSM images. The dataset labels are divided into six categories: Background, Building, Impervious surfaces, Tree, Low Vegetation and Car.
Therefore, these 24 images were used for the training and validation sets. The 24 RS images were augmented into 24,000 images, which were randomly divided into a training set and a validation set such that one of every 15 images was used for validation. This gives 22,400 training images and 1,600 validation images.
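The split sizes can be reproduced with a short sketch. It assumes that one in every 15 augmented images is held out for validation, which is the reading consistent with the reported 22,400 and 1,600 images. The function name, the seed and the use of integer identifiers in place of image files are hypothetical.

```python
import random

def split_train_val(items, val_every=15, seed=0):
    """Randomly hold out one in every `val_every` items for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_val = len(items) // val_every
    return items[n_val:], items[:n_val]

# Integer ids stand in for the 24,000 augmented images.
train_ids, val_ids = split_train_val(range(24000))
print(len(train_ids), len(val_ids))  # 22400 1600
```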
Since the Benchmark Challenge ended in the summer of 2018, all reference data for all benchmarks are available for download, so the 14 images originally released without semantic labels served as the test set. 4,000 images randomly cropped from these 14 images were fed into the trained model for prediction. The 4,000 resulting TIF images were compared with the reference labels provided by the official benchmark.

Implementation Details
In this experiment, the deep learning network was trained on a machine with an 8-core, 16-thread Intel i9-9900K CPU and an NVIDIA RTX 3090 graphics card with 24 GB of memory, using CUDA 11.2.
The software environment is the 64-bit Microsoft Windows 10 operating system, and the development platform is Anaconda 5.2.0. The built-in Python version is 3.8.8. The deep learning framework is TensorFlow 2.5.0. The Adam optimizer was used with a learning rate of 1×10⁻³. The network was trained for a total of 100 epochs with a batch size of 32.

Model Evaluation Function
In order to comprehensively evaluate the performance of the proposed model, Overall Accuracy (OA), Precision, Recall, F1 and Mean Intersection Over Union (MIOU) were used to evaluate the experimental results. These evaluation indexes are widely used in previous papers and are recognized evaluation indexes for semantic segmentation. The calculation formula of each evaluation index is as follows:
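These indexes can be computed from a confusion matrix as sketched below. The sketch assumes the standard definitions: OA is the fraction of correctly classified pixels, Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 · Precision · Recall / (Precision + Recall), IoU = TP / (TP + FP + FN), and MIOU is the mean IoU over classes. Whether these match the paper's exact formulas is an assumption, and the function name is hypothetical.

```python
def segmentation_metrics(cm):
    """Compute OA, MIOU and per-class scores from a confusion matrix.

    cm[i][j] is the number of pixels of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    oa = sum(cm[i][i] for i in range(n)) / total
    per_class = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        fn = sum(cm[c]) - tp                       # true c, predicted as another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        per_class.append(
            {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
        )
    miou = sum(p["iou"] for p in per_class) / n
    return oa, miou, per_class

# Toy two-class example.
oa, miou, per_class = segmentation_metrics([[3, 1], [1, 5]])
print(oa)  # 0.8
```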

Evaluation and Discussion
This part provides a comparison of three methods. The first method uses the plain ResNet-101 model, without our proposed adjustments and without nDSM ("None" in Table 1 and Table 2). The second method uses our adjustments without nDSM ("Ours" in Table 1 and Table 2). The third method uses our adjustments together with the one-channel nDSM ("Ours+nDSM" in Table 1 and Table 2). Experimental results show that the third method achieves the best results. The output shows competitive performance in all classes (Table 2). The F1 and MIOU of each category are improved by our method compared with the first method. The segmentation performance is best for buildings and lowest for trees. Compared with the method "None", the segmentation results (MIOU) of building, road, tree, vegetation and car are improved by 10.6%, 10.16%, 12.35%, 9.11% and 14.29%, respectively. In particular, small targets such as car and tree improve more than the other classes in F1 and MIOU.

Convergence Analysis:
The convergence of our method is analysed in this part (Fig. 5). Panels (a) and (c) show the training and validation accuracy and loss without nDSM, while panels (b) and (d) show the training and validation accuracy and loss with nDSM.

As can be seen from the results, both methods converge gradually, and the accuracy and loss are more stable after nDSM is added. In our method, the training accuracy reaches 98.57% and the validation accuracy reaches 95.48%. The lowest training loss is 0.0348 and the validation loss reaches 0.1516.

Comparison of Experiments on Potsdam Datasets:
This part compares the segmentation results of different methods (Table 3). The method of (Wenkai Zhang et al., 2018) also used nDSM, and (Wang Y et al., 2019) reported segmentation results for a similar ResNet-101 network on the same Potsdam dataset. We therefore compare our per-class segmentation results with those of these two methods (Table 4).
The overall results are given in Table 3. Table 4 shows that our results for buildings, trees, vegetation and cars are better than the segmentation results of (Wenkai Zhang et al., 2018). For car segmentation, the proposed method is 15.5% higher than the first method. Our proposed method also performs better on buildings, which carry height information.
When compared with the method proposed by (Wang Y et al., 2019), the accuracy on small-target vehicles reaches 89.5% even without adding ASPP, which would increase the complexity of the network, and without using Superpixel-CRF for prediction.

Comparison of the Three Methods:
The results in Fig. 6 compare the predictions of the different strategies on the test set, and the prediction performance of our method is better. The segmentation using only ResNet-101 without nDSM lost more tree information, and the classification boundaries of cars were blurred (Fig. 6, None). Compared with these results, the small targets (cars) predicted by our method have clearer boundaries, and less tree information is lost (Fig. 6, Ours). When nDSM is added to our method, the segmentation results for buildings and trees, which have height information, are better than those of the other two methods, and there is less misclassification (Fig. 6, Ours+nDSM).

CONCLUSION
In the current study, the three-channel RGB images and one-channel nDSM images in the Potsdam dataset provided by ISPRS are used for semantic segmentation. The method is based on an adjusted ResNet-101 network, and this simple adjustment quickly improves the OA and MIOU.
Experimental results show that the proposed method achieves good results on five evaluation metrics through simple adjustment. Compared with the results without nDSM, the boundaries obtained with nDSM are clearer. Finally, compared with the original method, the accuracy on trees and small targets (cars), which are prone to misclassification, is greatly improved.