Bidirectional Multi-scale Attention Networks for Semantic Segmentation of Oblique UAV Imagery

Semantic segmentation for aerial platforms has been one of the fundamental scene understanding tasks for earth observation. Most semantic segmentation research has focused on scenes captured in nadir view, in which objects have relatively small scale variation compared with scenes captured in oblique view. The huge scale variation of objects in oblique images limits the performance of deep neural networks (DNNs) that process images in a single-scale fashion. To tackle the scale variation issue, in this paper we propose the novel bidirectional multi-scale attention networks, which fuse features from multiple scales bidirectionally for more adaptive and effective feature extraction. Experiments conducted on the UAVid2020 dataset show the effectiveness of our method. Our model achieves the state-of-the-art (SOTA) result with a mean intersection over union (mIoU) score of 70.80%.


INTRODUCTION
Semantic segmentation has been one of the most fundamental research tasks for scene understanding. It assigns each pixel within an image the class label it belongs to. There have been many works on semantic segmentation of remote sensing images and aerial images (Demir et al., 2018, Rottensteiner et al., 2014), which are captured in nadir view. The spatial resolutions in such images are approximately the same for all pixels. Oblique views have a much larger land coverage if the platforms are at the same flight height. For example, the unmanned aerial vehicle (UAV) platform has been used for urban scene observation (Lyu et al., 2020, Nigam et al., 2018). The images of different viewing directions are shown in Figure 1. The left image of nadir view is from the Vaihingen dataset (Rottensteiner et al., 2014), while the right image of oblique view is from the UAVid2020 dataset (Lyu et al., 2020). Compared with images in nadir view, images in oblique view have very large spatial resolution variation across the entire image.
The state-of-the-art methods for semantic segmentation all rely on powerful deep neural networks, which can effectively extract high-level semantic information to determine the class types for all pixels. Deep neural networks serve as non-linear functions, which map an image input to a label output. Due to its nonlinear property, the label output will not scale linearly as the image input scales. When designing the deep neural networks, there is usually a performance trade-off for objects in different scales. For example, the semantic segmentation of a small car in a remote sensing image is better handled in higher resolution where finer details can be observed, such as wheels. For larger objects like roads and buildings, it is better to have more global context to recognize the objects since their whole shapes can be observed for semantic segmentation.
When objects in an image dataset have very large scale variation, the semantic segmentation performance of deep neural networks will drop if this multi-scale problem is not considered in the network design. A simple strategy is to apply multi-scale inference (Zhao et al., 2017), i.e., a well-trained deep neural network predicts the score maps of the same image at multiple different scales, and the score maps are averaged to determine the final label prediction. Such a strategy generally provides better performance. However, a good prediction from a proper scale could be undermined by worse predictions from other scales, which limits the model performance. Max-pooling selects one score map prediction among the multiple scales for each pixel, but the optimal output could be an interpolation of the predictions of multiple scales. A smarter way of fusing the output score maps is to leverage an attention model (Chen et al., 2016), which determines the weights for fusing the score maps of different scales. The strategy has been extended to a hierarchical structure for better performance (Tao et al., 2020).
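As a concrete illustration, the difference between average fusion and attention-guided fusion of multi-scale score maps can be sketched as follows (a minimal NumPy sketch with illustrative shapes, not the paper's implementation):

```python
import numpy as np

def average_fusion(score_maps):
    """Plain multi-scale inference: average the per-scale score maps."""
    return np.mean(score_maps, axis=0)                 # (C, H, W)

def attention_fusion(score_maps, attn):
    """Attention fusion: per-pixel weights, one map per scale, summing to 1."""
    attn = attn / attn.sum(axis=0, keepdims=True)      # normalize over scales
    return np.sum(score_maps * attn[:, None], axis=0)  # (C, H, W)

rng = np.random.default_rng(0)
scores = rng.random((3, 8, 16, 16))   # 3 scales, 8 classes (as in UAVid2020)
attn = rng.random((3, 16, 16))        # a weight map per scale
fused_avg = average_fusion(scores)
fused_att = attention_fusion(scores, attn)
```

With uniform attention weights, attention fusion reduces to plain averaging; non-uniform weights let each pixel favor the scale best suited to its object.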
With respect to the design of deep neural networks, there are several strategies to relieve the multi-scale problem. The first strategy is to gradually refine features from coarse scales to fine scales (Long et al., 2015, Ronneberger et al., 2015, Lyu et al., 2020). The second strategy is to design a multi-scale feature extractor module in the middle of the deep neural networks (Zhao et al., 2017, Chen et al., 2017, Chen et al., 2018, Yuan and Wang, 2018). Self-attention (Fu et al., 2019, Huang et al., 2019, Yuan et al., 2020) and graph networks (Liang et al., 2018, Li and Gupta, 2018) have also been applied to aggregate information globally to reinforce the features for each pixel.
In this paper, we propose the bidirectional multi-scale attention networks (BiMSANet) to address the multi-scale problem in the semantic segmentation task. Our method is inspired by the multi-scale attention strategy (Chen et al., 2016, Tao et al., 2020) and the feature level fusion strategy (Chen et al., 2017, Zhao et al., 2017), and jointly fuses the features guided by the attention of different scales in bidirectional pathways, i.e., coarse-to-fine and fine-to-coarse. Our method is tested on the new UAVid2020 dataset (Lyu et al., 2020). One of its challenges is the huge inter-class and intra-class scale variation of different objects due to its oblique viewing style. Our method achieves a new state-of-the-art result with a mIoU score of 70.8%. Compared with the currently top ranked method (Tao et al., 2020), which also targets the multi-scale problem, our method improves the mIoU by almost 0.8%.
The contributions of this paper are summarized as follows:
• We propose a novel bidirectional multi-scale attention network (BiMSANet) to handle the multi-scale problem in the semantic segmentation task.
• We visualize the bidirectional multi-scale attentions from multiple perspectives and analyze them in detail.
• We achieve the state-of-the-art result on the UAVid2020 benchmark, and the code will be made public.

RELATED WORK
In this section, we will discuss other works that are related to our paper. In order to handle the multi-scale problem for the semantic segmentation, a number of deep neural networks have been designed.
Multi-scale feature fusion. The first basic type of method aggregates features of multiple scales from deep neural networks. FCN (Long et al., 2015) and U-Net (Ronneberger et al., 2015) adopt skip connections between encoder and decoder to gradually fuse the information from multiple scales. MSDNet (Lyu et al., 2020) extends the connections across scales to further increase the performance. ZigZagNet (Di Lin, 2019) uses a more complex zig-zag architecture between the backbone and the decoder for intermediate multi-scale feature aggregation. HRNet proposes a multi-scale backbone to exchange information between branches of coarse scale and fine scale. BiSeNet (Yu et al., 2018) proposes a dual-branch structure for better performance, with one branch for higher spatial resolution and the other for richer semantic features.
Multi-scale context extraction. Another method is to aggregate multi-scale context from the same feature maps with a dedicated module. PSPNet (Zhao et al., 2017) adopts the pyramid pooling module, which pools context features at multiple scales for object recognition. DeepLabv3 (Chen et al., 2017, Chen et al., 2018) utilizes the atrous spatial pyramid pooling module, which assembles multi-scale features with convolutions of multiple atrous rates. OCNet (Yuan and Wang, 2018) proposes the pyramid object context (Pyramid-OC) module and the atrous spatial pyramid object context (ASP-OC) module to extract object context at multiple scales.
Context by relations. With the introduction of the self-attention mechanism (Vaswani et al., 2017) for natural language processing, better semantic segmentation results have also been achieved by applying self-attention to reason about the relations between pixels. Self-attention refines the features in a non-local style, which aggregates information for each pixel globally. DANet (Fu et al., 2019) utilizes a dual attention module, with position attention and channel attention, to extract information globally. CCNet (Huang et al., 2019) applies the criss-cross attention module to reduce the computational complexity of the self-attention. OCRNet (Yuan et al., 2020) uses explicit class attention to reinforce the features. However, these types of methods are normally intensive in memory and computation, as there are many pixels, resulting in very dense connections between them. Graph reasoning is another way to include relations among objects. Instead of adopting dense pixel relations, a sparse graph structure makes context relation reasoning less intensive in memory and computation. (Liang et al., 2018) propose the symbolic graph reasoning (SGR) layer for context information aggregation through a knowledge graph. (Li and Gupta, 2018) transform a 2D image into a graph structure, whose vertices are clusters of pixels, and context information is propagated across all vertices on the graph.
Inference in multi-scale. Multi-scale inference is widely used to provide more robust prediction, which is orthogonal to previously discussed methods as those networks can be regarded as a trunk for multi-scale inference. Average pooling and max pooling on score maps are mostly used, but they limit the performance. (Chen et al., 2016) propose to apply attentions for fusing score maps across multiple scales. The method is more adaptive to objects in different scales as the weights for fusing score maps across multiple scales can vary. (Tao et al., 2020) further improve the multi-scale attention method by introducing a hierarchical structure, which allows different network structures during training and testing to improve the model design.
Our paper also focuses on the multi-scale inference. We have further improved the multi-scale attention mechanism by introducing feature level bidirectional fusion.

PRELIMINARY
In this section, we first go through some network architecture designs that help in understanding the newly proposed bidirectional multi-scale attention networks.

Multi-Scale-Dilation Net
The multi-scale-dilation net (Lyu et al., 2020) was proposed as the first attempt to tackle the multi-scale problem on the UAVid dataset. The basic idea shares the philosophy of multi-scale image inputs, where the input images are scaled by the scale-to-batch and batch-to-scale operations. The intermediate features are concatenated from coarse to fine scales and used to produce the final semantic segmentation output. The structure is shown in Figure 2. The feature extraction part is referred to as the trunk in the following figures.
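The scale-to-batch and batch-to-scale operations can be sketched as tile splitting and re-assembly (an illustrative NumPy sketch; the actual operations in (Lyu et al., 2020) may differ in detail):

```python
import numpy as np

def scale_to_batch(img, s):
    """Split a (C, H, W) image into an s x s grid of tiles, stacked as a batch
    of shape (s*s, C, H//s, W//s)."""
    c, h, w = img.shape
    tiles = img.reshape(c, s, h // s, s, w // s)          # split both axes
    return tiles.transpose(1, 3, 0, 2, 4).reshape(s * s, c, h // s, w // s)

def batch_to_scale(batch, s):
    """Inverse operation: (s*s, C, h, w) -> (C, s*h, s*w)."""
    n, c, h, w = batch.shape
    tiles = batch.reshape(s, s, c, h, w).transpose(2, 0, 3, 1, 4)
    return tiles.reshape(c, s * h, s * w)

img = np.arange(2 * 4 * 4).reshape(2, 4, 4)   # a tiny (C, H, W) "image"
batch = scale_to_batch(img, 2)                # (4, 2, 2, 2)
restored = batch_to_scale(batch, 2)           # round-trips back to img
```

The round trip is lossless, which lets a single trunk process all tiles of a finer scale in one batched forward pass.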

Hierarchical Multi-Scale Attention Net
The hierarchical multi-scale attention net (Tao et al., 2020) is proposed to learn to fuse the semantic segmentation outputs of adjacent scales with a hierarchical attention mechanism. The deep neural networks learn to segment the images while predicting the weighting masks for fusing the score maps. This method, which focuses on the multi-scale problem, ranks as the top method on the Cityscapes pixel-level semantic labeling task (Cordts et al., 2016). The hierarchical mechanism allows different network structures during training and inference, e.g., the networks have only two branches of two adjacent scales during training, while they could have three branches of three adjacent scales during testing, as shown in Figure 3. In addition to the predicted score maps, extra weighting masks are predicted by the attention sub-networks for fusing the score maps of adjacent scales. ⊕ and ⊗ stand for element-wise addition and multiplication, respectively.
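The two-scale attention fusion can be sketched as a per-pixel convex combination of the score maps of adjacent scales (illustrative NumPy; nearest-neighbour up-sampling is used here for brevity, whereas bilinear interpolation is the usual choice):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) map.
    (A real implementation would use bilinear interpolation.)"""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_two_scales(score_fine, score_coarse, alpha):
    """Fuse adjacent scales: alpha weights the (up-sampled) coarse scale and
    1 - alpha weights the fine scale, per pixel (the ⊗/⊕ ops in Figure 3)."""
    return alpha * upsample2x(score_coarse) + (1.0 - alpha) * score_fine

rng = np.random.default_rng(0)
s_f = rng.random((8, 4, 4))     # fine-scale score maps, 8 classes
s_c = rng.random((8, 2, 2))     # coarse-scale score maps
alpha = rng.random((4, 4))      # single-channel attention, as in HMSANet
fused = fuse_two_scales(s_f, s_c, alpha)
```

Setting alpha to all ones selects the coarse scale everywhere; all zeros selects the fine scale; learned intermediate values interpolate per pixel.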

Feature Level Hierarchical Multi-Scale Attention Net
One limitation of the hierarchical multi-scale attention networks is that the fused score maps are linear interpolations of the score maps at adjacent scales, whereas better score maps could be acquired from the interpolated features instead. A simple solution that we propose is to move the segmentation head to after the feature fusion, as shown in Figure 4.

BIDIRECTIONAL MULTI-SCALE ATTENTION NETWORKS
In this section, the structure of the proposed bidirectional multi-scale attention networks is introduced.

Overall Architecture
Our design takes both the hierarchical attention mechanism and the feature level fusion into account. The overall architecture is shown in Figure 5. For the input image I of size H × W, the image pyramid is built by adding two extra images I2× and I0.5×, which are acquired by bi-linearly up-sampling I to size 2H × 2W and bi-linearly down-sampling I to size H/2 × W/2. The bidirectional multi-scale attention networks have two pathways for feature fusion in a hierarchical manner. For each pathway, the structure is the same as in the feature level hierarchical multi-scale attention net. The two pathways allow feature fusion from both directions, so the fusion weights can be determined at a more suitable scale. The reason to use feature level fusion is that we need distinct features for the two pathways.
If the score maps were used for fusion, Feat1 and Feat2 in the two pathways would be the same, which would limit the representation power of the two pathways. The two pathways take advantage of their own attention branches and features. The Attn1 branch and Feat1 are for the coarse-to-fine pathway, while the Attn2 branch and Feat2 are for the fine-to-coarse pathway. Feat1 and Feat2 from the two pathways are fused hierarchically across scales, and the final feature is the concatenation of the features from the two pathways.
Feat1 and Feat2 are reduced to half the number of channels of the Feat in the feature level hierarchical multi-scale attention net. This setting provides a fair comparison between these two types of networks, since it leads to features with the same number of channels before the segmentation head.
Parameter sharing is also applied in the design. The three branches corresponding to the three scales share the same network parameters for the Trunk, Attn1, and Attn2. Feat1 and Feat2 differ across the three branches, as they are the outputs of different image inputs passed through the same trunk.
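As a rough sketch of the bidirectional fusion at one level, assume for simplicity that adjacent scales have already been resampled to a common resolution; all names and shapes below are illustrative, not the actual implementation:

```python
import numpy as np

def fuse_pathway(feat_a, feat_b, attn):
    """Element-wise attention fusion of two feature maps from one pathway."""
    return attn * feat_a + (1.0 - attn) * feat_b

rng = np.random.default_rng(1)
half = 16                                        # n_c / 2 channels per pathway
feat1_fine, feat1_coarse = rng.random((2, half, 8, 8))  # coarse-to-fine pathway
feat2_fine, feat2_coarse = rng.random((2, half, 8, 8))  # fine-to-coarse pathway
alpha = rng.random((half, 8, 8))                 # from the Attn1 branch
beta = rng.random((half, 8, 8))                  # from the Attn2 branch

c2f = fuse_pathway(feat1_coarse, feat1_fine, alpha)   # coarse-to-fine fusion
f2c = fuse_pathway(feat2_fine, feat2_coarse, beta)    # fine-to-coarse fusion
final = np.concatenate([c2f, f2c], axis=0)            # (n_c, 8, 8) final feature
```

The key point is that each pathway fuses its own half-width features with its own attention, and only the concatenation feeds the segmentation head.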

Module Details
In this section, we describe the details of each component we apply.
Trunk. In order to effectively extract information from each single scale, we adopt DeepLabv3+ (Chen et al., 2018) as the trunk. We apply the wide residual network (Zagoruyko and Komodakis, 2016) as the backbone, namely WRN-38, which has been pre-trained on the ImageNet dataset (Deng et al., 2009). The ASPP module in DeepLabv3+ has convolutions with atrous rates of 1, 6, 12, and 18. The features f_b from DeepLabv3+ are further refined with a sequence of modules. As shown in Figure 5, the structure is the combination of two feature level hierarchical multi-scale attention nets corresponding to the two pathways, which share the same trunks. The coarse-to-fine pathway and the fine-to-coarse pathway are marked with the yellow and the blue arrows, respectively. ⊕ and ⊗ stand for element-wise addition and multiplication, respectively, and concatenation in the channel dimension is marked separately in the figure.
The trunk T transforms an image input I into feature maps f with n_c channels, i.e., f = T(I). n_c = n_class × d, where n_class is the total number of classes for the semantic segmentation task and d is the expansion rate for the channels; d is set to 4 in our case. The first n_c/2 channels are for Feat1, while the second n_c/2 channels are for Feat2.
Attention head. Attn1 and Attn2 share the same structure, but with different parameters. The attention heads map the features f_b from DeepLabv3+ to the attention weights α and β (ranging from 0.0 to 1.0, with n_c/2 channels) for the two pathways. Each attention head is comprised of the following sequence of modules: Conv3×3(256) → BN → ReLU → Conv3×3(256) → BN → ReLU → Conv1×1(n_c/2) → Sigmoid (numbers in brackets are the output channels).
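The attention head sequence above can be written down directly in PyTorch; the input channel count below is an assumption for illustration, since the exact width of the trunk features is not restated here:

```python
import torch
import torch.nn as nn

def make_attention_head(in_ch, nc):
    """Conv3x3(256) -> BN -> ReLU -> Conv3x3(256) -> BN -> ReLU ->
    Conv1x1(nc // 2) -> Sigmoid, as described in the text.
    `in_ch` (trunk feature width) is an illustrative assumption."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 256, kernel_size=3, padding=1),
        nn.BatchNorm2d(256),
        nn.ReLU(inplace=True),
        nn.Conv2d(256, 256, kernel_size=3, padding=1),
        nn.BatchNorm2d(256),
        nn.ReLU(inplace=True),
        nn.Conv2d(256, nc // 2, kernel_size=1),
        nn.Sigmoid(),                       # attention weights in [0, 1]
    )

head = make_attention_head(in_ch=256, nc=32)   # n_class=8, d=4 -> n_c=32
head.eval()                                    # BN in eval mode for one sample
x = torch.randn(1, 256, 64, 64)
with torch.no_grad():
    w = head(x)                                # (1, 16, 64, 64) weight maps
```

The Sigmoid bounds each of the n_c/2 channels to [0, 1], so the weights can be used directly in the convex per-pixel fusion.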
Segmentation head. The segmentation head Seg converts the fused input feature maps f_fused into score maps l (8 channels for the UAVid2020 dataset), which correspond to the class probabilities for all pixels, i.e., l = Seg(f_fused). The segmentation head is simply a 1 × 1 convolution, Conv1×1(n_class). An argmax operation along the channel dimension outputs the final class labels for all pixels.
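The final label decision after the segmentation head can be sketched in a few lines (illustrative NumPy with toy scores):

```python
import numpy as np

def labels_from_scores(scores):
    """scores: (n_class, H, W) score maps -> (H, W) integer label map,
    via argmax along the channel (class) dimension."""
    return np.argmax(scores, axis=0)

scores = np.zeros((8, 2, 2))      # 8 classes, as for UAVid2020
scores[3, 0, 0] = 1.0             # class 3 has the highest score at (0, 0)
labels = labels_from_scores(scores)
```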
Auxiliary semantic head. As in (Tao et al., 2020), we apply an auxiliary semantic segmentation head for each branch during training, which consists of only a 1 × 1 convolution, Conv1×1(n_class).

Training and inference
As our model follows the hierarchical inference mechanism, it can be trained with only 2 scales while inferring with 3 scales (0.5×, 1×, 2×). This design makes it possible for our network to adopt a large trunk, such as DeepLabv3+ with a WRN-38 backbone, for better performance. We use the RMI loss for the main semantic segmentation head and the cross entropy loss for the auxiliary semantic heads.

EXPERIMENTS
In this section, we describe the implementation details of the experiments and compare the performance of different models on the UAVid2020 dataset.

Dataset and Metric
Our experiments are conducted on the public UAVid2020 dataset (Lyu et al., 2020). The UAVid2020 dataset focuses on the semantic segmentation of complex urban scenes with 8 classes. The images are captured in oblique views with large spatial resolution variation. There are 420 high quality images of 4K resolution (4096 × 2160 or 3840 × 2160) in total, split into training, validation, and testing sets with 200, 70, and 150 images, respectively. The performance of different models is evaluated on the test set of the UAVid2020 benchmark. The performance on the semantic segmentation task is assessed with the standard mean intersection-over-union (mIoU) metric (Everingham et al., 2015).
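The mIoU metric can be sketched from a confusion matrix as follows (a standard formulation, not code from the benchmark itself):

```python
import numpy as np

def mean_iou(pred, gt, n_class):
    """Mean intersection-over-union from flattened integer label arrays."""
    k = (gt >= 0) & (gt < n_class)                 # keep valid labels only
    conf = np.bincount(n_class * gt[k] + pred[k],
                       minlength=n_class ** 2).reshape(n_class, n_class)
    tp = np.diag(conf)                             # true positives per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1)                # guard against empty classes
    return iou.mean()

gt = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
miou = mean_iou(pred, gt, n_class=2)               # (0.5 + 2/3) / 2
```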

Implementation
Training. All models in the experiments are implemented with PyTorch (Paszke et al., 2019) and trained on a single Tesla V100 GPU with 16 GB memory and a batch of 2 images. Mixed precision and synchronous batch normalization are applied. Stochastic gradient descent with a momentum of 0.9 and a weight decay of 5e-4 is used as the optimizer. The "polynomial" learning rate policy (Liu et al., 2015) is adopted with a poly exponent of 2.0. The initial learning rate is set to 5e-3. The model is trained for 175 epochs with random image selection. We apply random scaling of the images from 0.5× to 2.0×. Random cropping is applied to acquire image patches of size 896 × 896.
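The "polynomial" learning rate policy with the stated hyper-parameters can be sketched as:

```python
def poly_lr(base_lr, epoch, max_epoch, power=2.0):
    """Polynomial decay: lr = base_lr * (1 - epoch / max_epoch) ** power."""
    return base_lr * (1.0 - epoch / max_epoch) ** power

# Base learning rate 5e-3, poly exponent 2.0, 175 epochs, as in the text.
lrs = [poly_lr(5e-3, e, 175) for e in range(175)]
```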
Testing. As a 4K image is too large to fit into the GPU, we apply cropping during testing as well. The image is partitioned into overlapped patches for evaluation as in (Lyu et al., 2020), and the average of the score maps is used as the final output in the overlapped regions. The crop size is set to 896 × 896 with an overlap of 512 pixels in both horizontal and vertical directions.
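The overlapped-patch inference can be sketched as follows (illustrative NumPy; `predict` stands in for the trained network, and the border handling is an assumption):

```python
import numpy as np

def tiled_inference(image, predict, n_class, crop=896, overlap=512):
    """Slide overlapping crops over a (C, H, W) image, accumulate score maps,
    and divide by the per-pixel overlap count to average the overlaps."""
    h, w = image.shape[-2:]
    stride = crop - overlap
    scores = np.zeros((n_class, h, w))
    counts = np.zeros((h, w))
    ys = list(range(0, max(h - crop, 0) + 1, stride))
    xs = list(range(0, max(w - crop, 0) + 1, stride))
    if ys[-1] + crop < h:                 # make sure borders are covered
        ys.append(h - crop)
    if xs[-1] + crop < w:
        xs.append(w - crop)
    for y in ys:
        for x in xs:
            patch = image[..., y:y + crop, x:x + crop]
            scores[:, y:y + crop, x:x + crop] += predict(patch)
            counts[y:y + crop, x:x + crop] += 1
    return scores / counts

# Tiny demo with a dummy predictor that returns uniform scores.
img = np.zeros((3, 6, 6))
dummy_predict = lambda p: np.ones((2,) + p.shape[-2:])
out = tiled_inference(img, dummy_predict, n_class=2, crop=4, overlap=2)
```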
The mIoU scores and the IoU scores for each individual class are shown in Table 1. Among all the compared models, the BiMSANet performs the best regarding the mIoU metric. Our BiMSANet has a more balanced prediction ability for both large and small objects.
For the evaluation of each individual class, BiMSANet ranks first for the clutter, building, tree, static car, and human classes. The most distinct improvement is for the static car class, which is 2.72% higher than the second best score. With only the context information, our method achieves decent scores for both the moving car and static car classes.
For human class, the scores of HMSANet, FHMSANet and BiMSANet are all significantly higher than the DeepLabv3+, which shows the superiority of multi-scale attention mechanism in handling the small objects. Thanks to the bidirectional multi-scale attention design, BiMSANet achieves the best performance for the human class.
Qualitative comparisons are shown in Figure 6. The example image is selected from the test set (seq30, 000400). As the ground truth label is reserved for benchmark evaluation, the overlapped output is shown instead in Figure 6. Three example regions are marked in red, orange, and white boxes.
In the red box region, it can be seen that DeepLabv3+ struggles to give coherent predictions for the cars in the middle of the road, while the other three models produce better results due to the multi-scale attention. The HMSANet and the FHMSANet wrongly classify part of the sidewalks, which are outside the road, as the road class. BiMSANet handles this area better. However, part of the road near the lane-marks is wrongly classified as clutter by BiMSANet. In the orange box region, the parking lot, which belongs to the clutter class, is predicted as road by all four models, with BiMSANet making the least error. In the white box region, the ground in front of the entrance door is wrongly classified as building by all models except BiMSANet. This benefit comes from the bidirectional multi-scale attention design.
We also show the performance of human class segmentation in Figure 7. The example image is from the test set (seq22, 000900). The zoomed-in images in the middle and right columns correspond to the patches in the white boxes of the overlapped output. The four patches come from different contexts, which are very complex in some local regions. Even though the humans in the image are quite small and in many different poses, such as standing, sitting, and riding, our model can still effectively detect and segment most of the humans in the image.

Ablation Study
In this section, we compare the performance gains obtained by gradually adding the components. The corresponding results are shown in Table 2. It is easy to see that multi-scale processing is useful for the oblique view UAV images: the mIoU score increases by 2.67% when the multi-scale attention is included in the networks. The feature level fusion also proves useful, as it helps the networks improve the mIoU score by 0.3%. By further adding the bidirectional attention mechanism, the networks improve the mIoU score by another 0.47%.

Analysis of Learned Multi-Scale Attentions
In this section, we will analyze the learned multi-scale attentions from the BiMSANet to better understand how the attentions work. We explore from mainly three perspectives: attentions of different channels, different scales, and different directions. The example image is from the test set (seq25,000400). Attentions from both Attn1 branch and Attn2 branch are used, noted as α and β in Figure 5. α is for the fine to coarse pathway, while β is for the coarse to fine pathway.

Attention of different channels
The multi-scale attentions in our BiMSANet have n_c/2 channels (16 in our case), which is different from the HMSANet (Tao et al., 2020), whose attention has only a single channel for all classes. The attentions guide the fusion of features across scales. Example attentions of different channels in the 1× scale branch are shown in Figure 8. Different channels have different attentions focusing on different parts of the image, e.g., the 1st channel focuses more on trees, the 3rd channel focuses less on roads, and the 7th channel focuses most on moving cars.

Attention of different scales
In order to analyze the difference of attentions at different scales, we select 4 attentions from each of the Attn1 branch and the Attn2 branch, as shown in Figure 9. The superscripts are the channel indices of the attentions. By comparing α1 with α2, which are predicted at the 1× and 0.5× scales, we can see that attentions at different scales have different focuses. Comparing the same channel between α1 and α2 is the most meaningful; the same applies to β1 and β2.
From α¹₁ and α¹₂, it can be noted that the recognition of cars at closer distance is based more on context, since the values of α¹₂ are larger than those of α¹₁. The recognition of road regions closer to the camera also relies more on the coarser level features, which is reasonable as the road area is large and requires more context for recognition. It is also interesting to note that the middle lane-marks are even brighter than other parts of the road in α¹₂, which means their recognition requires more context. This is reasonable, as the color and texture of the lane-marks are quite different from other parts of the road. The distant buildings near the horizon rely more on the coarser level features as well.
We also notice that α2 (0.5× scale) and β2 (2× scale) have larger values on average compared with α1 and β1 (1× scale), which means that features with context information and features with fine details are both valuable for object recognition.

Attention of different directions
In our bidirectional design, both the coarse-to-fine pathway and the fine-to-coarse pathway fuse the features from three scales (0.5×, 1×, 2×). In this section, we analyze whether the feature fusion in the two pathways follows the same attention pattern. Attention examples are shown in Figure 10. The attentions α2 and 1 − β1 from the two pathways are both for the feature fusion across scales 0.5× and 1×. Although the attention values of the same pixels cannot be directly compared, as the feature sources are different (Feat1 and Feat2), it is still evident that the attention densities on average are quite different. There are more activations in α2 than in 1 − β1, showing that the two pathways play different roles for feature fusion across the same scales.

CONCLUSION
In this paper, we have proposed the bidirectional multi-scale attention networks (BiMSANet) for the semantic segmentation task. The hierarchical design adopted from (Tao et al., 2020) allows the usage of a larger trunk for better performance. The feature level fusion and the bidirectional design allow the model to more effectively fuse the features from both the adjacent coarser scale and the finer scale. We have conducted experiments on the UAVid2020 dataset (Lyu et al., 2020), which has large variation in spatial resolution. The comparisons among different models show that our BiMSANet achieves better results by balancing the performance on small objects and large objects. Our BiMSANet achieves the state-of-the-art result with a mIoU score of 70.80% on the UAVid2020 benchmark.