GLOBAL MESSAGE PASSING IN NETWORKS VIA TASK-DRIVEN RANDOM WALKS FOR SEMANTIC SEGMENTATION OF REMOTE SENSING IMAGES

The capability of globally modeling and reasoning about relations between image regions is crucial for complex scene understanding tasks such as semantic segmentation. Most current semantic segmentation methods rely on deep convolutional neural networks (CNNs), whose convolutions with local receptive fields are typically inefficient at capturing long-range dependencies. Recent works on self-attention mechanisms and relational reasoning networks seek to address this issue by learning pairwise relations between every pair of entities and have shown promising results. However, such approaches incur heavy computational and memory overheads, which makes them infeasible for dense prediction tasks, particularly on large images such as aerial imagery. In this work, we propose an efficient method for global context modeling in which, at each position, a sparse set of features, instead of all features, over the spatial domain is adaptively sampled and aggregated. We further devise a highly efficient instantiation of the proposed method, namely learning RANdom walK samplIng aNd feature aGgregation (RANKING). The proposed module is lightweight and general, and can be used in a plug-and-play fashion with the existing fully convolutional neural network (FCN) framework. To evaluate RANKING-equipped networks, we conduct experiments on two aerial scene parsing datasets, and the networks achieve competitive results at significantly low computational and memory costs.


INTRODUCTION
Capturing and modeling both short- and long-range relations is of paramount importance for many vision tasks, such as semantic segmentation (Fu et al., 2019, Liu et al., 2017, Bertasius et al., 2017), object detection (Shvets et al., 2019, Hu et al., 2018), action recognition, and visual question answering (VQA) (Santoro et al., 2017, Lobry et al., 2019). Being able to reason about such relations among different regions in an image or video is inherent to humans, but is not easy for convolutional neural networks (CNNs), because an individual convolution layer can only learn features locally, and deep CNNs with large receptive fields have proven inefficient at modeling long-range dependencies (Luo et al., 2016, Zhou et al., 2015).
To address this issue, many efforts have been made to enhance the capacity of CNNs to capture long-range relations, such as dilated convolutions (Chen et al., 2015, Chen et al., 2018a, Chen et al., 2018b), introducing graphical models into networks (Chen et al., 2018a, Liu et al., 2015, Zheng et al., 2015), and constructing spatial propagation network modules (Bell et al., 2016, Liu et al., 2017). These approaches attempt to capture global relations through chain propagation, which is only implicitly global and whose effectiveness depends heavily on how well long-term memorization is learned. Recent advances in self-attention mechanisms (Vaswani et al., 2017, Hu et al., 2018) and relational reasoning networks (Santoro et al., 2017) have shown promising results in explicitly modeling global context. In essence, these methods learn pairwise relations between every pair of entities (i.e., feature-map vectors and pixels) and then make use of them for feature aggregation or augmentation. By doing so, a fully connected relationship graph is learned to explicitly represent global context, which, however, leads to a quadratic inference complexity with respect to the number of entities and a high GPU memory overhead. This is computationally infeasible for dense prediction tasks, particularly on large images.
The aforementioned methods imply that even for entity pairs whose relations do not matter, the models still have to learn to infer their relationships, which is usually unnecessary. Hence, exploiting only the relations that matter for global reasoning is conceptually interesting and helps to significantly reduce computational and memory costs, but remains underexplored.
In this work, our goal is to explicitly model global context with low computational and memory overheads in a fully convolutional network (FCN) for aerial scene parsing by considering a sparse set of important short- and long-range relations instead of all of them. More specifically, a plug-and-play network module, RANKING (learning RANdom walK samplIng aNd feature aGgregation), is devised and appended on top of an FCN to adaptively sample informative feature-map vectors and then aggregate them in order to produce better segmentation results. This work is inspired by GraphSAGE (Hamilton et al., 2017), an inductive representation learning framework for graph-structured data. But unlike the latter, which assumes a static graph where the neighbors of each node are fixed, our module is capable of dynamically learning a global sampling.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume V- 2020 XXIV ISPRS Congress (2020 edition)
Figure 1. Illustration of different distributions of the same object (taking vehicle as an example) in street-view (top) and nadir-view (bottom) images. It can be seen that in overhead images there are more long-range relations (see appearance similarities among white cars). [Statistics based on Cityscapes and ISPRS Vaihingen datasets.]
Contributions. This work's contributions are threefold.
• We propose a simple yet efficient approach for global context modeling in networks with low computational and memory costs by adaptively sampling feature-map vectors and then aggregating them at each position.
• We devise RANKING, a highly efficient instantiation of the proposed method, that implements the sampling by learnable random walks and the feature aggregation via an averaging operation at zero parameters.
• We validate the effectiveness of our network module through extensive ablation studies.

RELATED WORK
Self-attention mechanism. The self-attention mechanism has by now been successfully applied in a wide range of NLP tasks, e.g., machine translation (Vaswani et al., 2017), due to its superior ability in modeling long-range dependencies. A recent trend in NLP is replacing recurrent neural networks (RNNs) by self-attention models, thereby allowing more efficient learning and parallelized implementations. Moving from NLP to computer vision, self-attention has been extended to a more general form of non-local operations, which compute the response at a position by attending to all positions and taking their weighted sum in an embedding space based on a learned affinity matrix. In (Hu et al., 2018), the authors exploit the self-attention mechanism to model relations among sets of objects in object detection. There are also works that make use of the self-attention mechanism for semantic segmentation tasks (Fu et al., 2019, Huang et al., 2019, Yuan, Wang, arXiv:1809). In addition, several variants of the original non-local module can be found in (Yue et al., 2018, Chen et al., 2019, Zhang et al., arXiv:1908).
Relational reasoning networks. Recently, (Santoro et al., 2017) proposed a relation network to solve problems that involve spatial relational reasoning by learning the potential relations between all feature-map vector pairs, and this network achieves super-human performance in VQA tasks. Later, a temporal relation network module was introduced that explicitly learns multi-scale temporal dependencies among video frames for video classification problems. Besides spatial and temporal relations in images and videos, the authors of (Duan et al., 2019) present a structural relation network to reason about structural dependencies of local regions in 3D point clouds. In (Cadène et al., 2019), a multimodal relational network is proposed to represent interactions between a question and image regions and to model region relations with pairwise combinations for VQA.
Aerial scene parsing. There is a long tradition of leveraging computer vision techniques for aerial scene parsing. Earlier works (Liu, Liu, 2014, Blaschke et al., 2004, Predoehl et al., 2013

Problem Formulation
Let F = {f(p)} denote the set of feature-map vectors of an input feature map, where each vector is identified by a spatial position index p = (x, y). Our goal is to learn a set of refined feature-map vectors Z = {z(p)} by adaptively aggregating, over the whole spatial domain, a sparse set of vectors at different locations. To this end, in this work, we propose a network module, RANKING, to learn random walk sampling and aggregate sampled features (cf. Figure 2).

Learning Random Walk Sampling
We consider learning a random walk with $t$ steps operating across the grid. The position $p_\tau = (x_\tau, y_\tau)$ at step $\tau$ ($0 \le \tau < t$) is traced to the position $p_{\tau+1} = (x_{\tau+1}, y_{\tau+1})$ at the next step $(\tau + 1)$ with a motion vector $\vec{\omega} = (u_\tau, v_\tau)$ using the following equation:
$$p_{\tau+1} = p_\tau + \vec{\omega}, \quad \text{i.e.,} \quad (x_{\tau+1}, y_{\tau+1}) = (x_\tau + u_\tau,\; y_\tau + v_\tau). \tag{1}$$
The final sampling position $p_t$ can be calculated by iteratively applying Eq. (1). In a traditional random walk, the motion vector $\vec{\omega}$ can be arbitrarily long and in any direction. Here we would like to learn a data- and task-driven random walk sampling and define a learnable $\vec{\omega}$ as follows:
$$\vec{\omega} = \big(u_\tau(f(p_\tau)),\; v_\tau(f(p_\tau))\big). \tag{2}$$
With $u_\tau(\cdot)$ and $v_\tau(\cdot)$, the next position can be predicted, conditioned on the feature of the current position. For simplicity and more efficient computation, we consider them in the form of a linear embedding, i.e.,
$$u_\tau(f(p_\tau)) = (w^u_\tau)^\top f(p_\tau), \tag{3}$$
$$v_\tau(f(p_\tau)) = (w^v_\tau)^\top f(p_\tau), \tag{4}$$
where $w^u_\tau$ and $w^v_\tau$ are learnable weight vectors and can be implemented as 1 × 1 convolutions.
Figure 2. An overview of a RANKING-equipped fully convolutional network for aerial scene parsing tasks. The number of steps in the sampling procedure is 2 in this case.
Since the outputs of $u_\tau(\cdot)$ and $v_\tau(\cdot)$ are typically real values, the sampling position $p_\tau$ becomes fractional. It can be seen from Eq. (2) that the estimation of $\vec{\omega}$ depends on the feature-map vector at the position $p_\tau$. Hence we make use of an interpolation algorithm with a sampling kernel $K$ to generate $f(p_\tau)$ as follows:
$$f(p_\tau) = \sum_{q \in N(p_\tau)} K(q, p_\tau)\, f(q), \tag{5}$$
where $N(p_\tau)$ indicates the four nearest neighbors of the position $p_\tau$ on the grid. We do not dive deeper into various choices of $K$ and utilize bilinear interpolation by default.
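The sampling procedure described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`bilinear`, `random_walk`), the nested-list feature layout, and the clamping of coordinates to the grid border are all assumptions made for illustration, and the per-step weight vectors are passed in as fixed numbers rather than learned.

```python
import math


def bilinear(feat, x, y):
    """Bilinearly interpolate feat (H x W x C nested lists) at fractional (x, y).

    Coordinates are clamped to the grid; this border handling is an
    assumption of the sketch, not something stated in the text.
    """
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    channels = len(feat[0][0])
    # Weighted sum over the four nearest grid neighbours.
    return [
        (1 - ax) * (1 - ay) * feat[y0][x0][c]
        + ax * (1 - ay) * feat[y0][x1][c]
        + (1 - ax) * ay * feat[y1][x0][c]
        + ax * ay * feat[y1][x1][c]
        for c in range(channels)
    ]


def dot(w, f):
    """Inner product of a weight vector and a feature vector."""
    return sum(wi * fi for wi, fi in zip(w, f))


def random_walk(feat, start, weights_u, weights_v):
    """Follow len(weights_u) steps from start=(x, y); return visited positions.

    weights_u[tau] and weights_v[tau] stand in for the per-step weight
    vectors: the motion at step tau is a linear function of the feature
    interpolated at the current (possibly fractional) position.
    """
    x, y = float(start[0]), float(start[1])
    path = [(x, y)]
    for w_u, w_v in zip(weights_u, weights_v):
        f = bilinear(feat, x, y)          # feature at the current position
        x, y = x + dot(w_u, f), y + dot(w_v, f)  # move by the predicted motion
        path.append((x, y))
    return path
```

As a usage example, take a 3 × 3 grid whose feature at (x, y) is simply [x, y]. Starting at (1, 1) with w_u = [0.5, 0] and w_v = [0, 0.25] for both steps, the walk visits (1.5, 1.25) and then (2.25, 1.5625). In a real network the motion predictors would be 1 × 1 convolutions trained end to end.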

Feature Aggregation
The goal of this stage is to aggregate sampled features and generate new feature representations that can facilitate the subsequent classification/segmentation tasks.
We first revisit the feature aggregation in self-attention models, which can be written as
$$z_p = \frac{1}{C} \sum_{\forall q} w_{pq}\, f(q). \tag{6}$$
Here $p$ is a query position whose response $z_p$ is to be calculated and $q$ indicates all possible positions. $w_{pq}$ represents the relationship between $p$ and $q$. Moreover, $C$ is a normalization constant. Eq. (6) is a weighted sum of all feature-map vectors, but learning pairwise relations $W = \{w_{pq}\}$ is computationally expensive.
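In order to reduce the computational overhead, in this work, we perform the feature aggregation at zero parameters as follows:
$$z_p = \frac{1}{|V(p)|} \sum_{q \in V(p)} f(q), \tag{7}$$
where $V(p)$ is a set of sampled positions, conditioned on the position $p$. As compared to Eq. (6), Eq. (7) has two changes: 1) the sum runs over the sparse set of sampled positions $V(p)$ instead of all positions, and 2) the learned pairwise weights $w_{pq}$ are replaced by a parameter-free average. By doing so, we achieve the feature aggregation in an efficient way.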
This stage is actually flexible, and we believe that alternative versions, e.g., LSTM (Hochreiter, Schmidhuber, 1997), are possible and may improve results.
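To make the contrast between the dense weighted sum of Eq. (6) and the parameter-free averaging of Eq. (7) concrete, the following sketch implements both on plain Python lists. The function names are hypothetical, the normalization constant in the dense variant is assumed to be the sum of the weights, and `relation_count` only counts relations rather than measuring real FLOPs or memory.

```python
def aggregate_dense(features, weights, p):
    """Eq. (6)-style weighted sum over all positions for query index p.

    features: list of N feature vectors; weights: N x N pairwise relations.
    The normalization constant is assumed here to be the sum of weights[p].
    """
    norm = sum(weights[p])
    dim = len(features[0])
    return [
        sum(weights[p][q] * features[q][k] for q in range(len(features))) / norm
        for k in range(dim)
    ]


def aggregate_sparse(sampled):
    """Eq. (7)-style mean over the sampled feature vectors; no parameters."""
    n = len(sampled)
    dim = len(sampled[0])
    return [sum(f[k] for f in sampled) / n for k in range(dim)]


def relation_count(n_positions, n_samples=None):
    """Number of relations used: N * N when dense, N * n_samples when sparse."""
    if n_samples is None:
        return n_positions * n_positions
    return n_positions * n_samples
```

For a feature map with 256 × 256 = 65,536 positions, the dense form considers 65,536² (about 4.3 × 10⁹) relations, whereas sampling three positions per query considers 196,608.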

Implementation
The proposed network module can be easily incorporated into a large variety of existing backbone CNN architectures in a plug-and-play fashion. To make the proposed method fully comparable with others, we choose VGG-16 as the backbone for aerial scene parsing tasks. Outputs of conv3, conv4, and conv5 are fed into respective 1×1 convolutional layers to reduce the number of channels to the number of categories, and then the convolved feature maps are upsampled to the desired full resolution and added element-wise to generate initial segmentation maps. These seed predictions are subsequently refined by the proposed RANKING module.
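The fusion of multi-stage outputs into an initial segmentation map can be sketched as follows. This is a toy illustration under stated assumptions, not the actual network: feature maps are small square nested lists, the 1×1 convolution is written as a per-pixel linear map without bias, and nearest-neighbour upsampling by an integer factor is assumed because the text does not name the upsampling method. All function names are hypothetical.

```python
def conv1x1(feat, weight):
    """Per-pixel linear map: feat is H x W x C_in, weight is C_out x C_in."""
    return [
        [[sum(w * v for w, v in zip(row_w, px)) for row_w in weight] for px in row]
        for row in feat
    ]


def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling by an integer factor (assumed method)."""
    out = []
    for row in feat:
        wide = [px for px in row for _ in range(factor)]
        out.extend([[list(px) for px in wide] for _ in range(factor)])
    return out


def fuse(score_maps):
    """Element-wise sum of equally sized H x W x K score maps."""
    h, w = len(score_maps[0]), len(score_maps[0][0])
    k = len(score_maps[0][0][0])
    return [
        [[sum(m[y][x][c] for m in score_maps) for c in range(k)] for x in range(w)]
        for y in range(h)
    ]


def initial_segmentation(stages, out_size):
    """stages: list of (feature_map, weight) pairs, one per backbone stage.

    Each square feature map is reduced to K class scores, upsampled to
    out_size x out_size, and the results are summed.
    """
    maps = []
    for feat, weight in stages:
        scores = conv1x1(feat, weight)
        maps.append(upsample_nearest(scores, out_size // len(scores)))
    return fuse(maps)
```

In the paper's setting the three stages would correspond to conv3, conv4, and conv5 of VGG-16, and the fused map would then be passed to the RANKING module for refinement.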

EXPERIMENTS
To demonstrate the effectiveness of our RANKING module, we conduct experiments on two aerial scene parsing datasets, i.e., the ISPRS Vaihingen and Potsdam datasets. In our experiments, we perform ablation studies on the Vaihingen dataset and compare our network with existing methods on both datasets. Moreover, we visualize trajectories of the adaptive random walk sampling to provide insight into our RANKING module. Notably, image samples in the two datasets are collected from a nadir view, and thus the spatial distribution of objects in these images is diverse and complex (see Figure 1).

Experimental Setup
Datasets. The Vaihingen dataset is an aerial image semantic segmentation dataset, which consists of 33 aerial images covering a 1.38 km² area of the city of Vaihingen. The spatial resolution of each image is 9 cm, and their average size is 2494×2064 pixels. Three bands, including near infrared (NIR), red (R), and green (G) wavelengths, and digital surface models (DSMs) are available for each aerial image. Pixel-wise annotations are provided for only 16 images, and most existing works (Maggiori et al., 2017, Volpi, Tuia, 2017, Sherrah, 2016, Marcos et al., 2018b) select 11 images to train their models. The remaining five images (image IDs: 11, 15, 28, 30, 34) are used to test their models. In this work, we follow this train-test split in our experiments.
The Potsdam dataset is more challenging owing to its larger number of samples, larger image size, and finer spatial resolution. Specifically, 38 images with a size of 6000 × 6000 pixels are gathered, and their spatial resolution is 5 cm. In addition, the images together cover an area of 3.42 km², and four bands (NIR, R, G, and blue (B)) are collected for these images. DSMs with the same spatial resolution are provided as well. In our experiments, we follow the setup in (Maggiori et al., 2017) and train our network with 17 images. The remaining samples (image IDs: 02_11, 02_12, 04_10, 05_11, 06_07, 07_08, 07_10) are used to test our model.
Initialization and training strategies. We adopt different initialization strategies for each component of our network: the backbone is initialized with the corresponding pre-trained CNN, and convolutional filters in the RANKING module are initialized using a normal distribution with zero mean and a small standard deviation of 0.01. This is based on the assumption that, for each pixel, its original neighbours should be the most relevant and could provide sufficient semantics for predicting it.
We implement our network in TensorFlow and select Nesterov Adam (Dozat, 2015) as the optimizer. Parameters of the optimizer are set as recommended: β1 = 0.9, β2 = 0.999, and ε = 1e−08. The initial learning rate is 2e−04 and is decayed by a factor of 0.1 once the validation loss saturates. The batch size is 5, and we define the loss function of the network as categorical cross-entropy. During the training phase, all weights are learnable, and each model is trained on one NVIDIA Tesla P100 16GB GPU.
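The learning-rate schedule described above (start at 2e−04, multiply by 0.1 when the validation loss stops improving) can be sketched as a small plateau detector. The class name `PlateauDecay` and the `patience` and `min_delta` parameters are assumptions introduced for this illustration; the text does not state how saturation is detected.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` when validation loss plateaus.

    `patience` (epochs without improvement before decaying) and `min_delta`
    (minimum decrease that counts as an improvement) are hypothetical knobs;
    the source only says the rate is decayed once the loss saturates.
    """

    def __init__(self, lr=2e-4, factor=0.1, patience=3, min_delta=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr
```

Calling `step` once per epoch with the validation loss returns the rate to use for the next epoch. Deep learning frameworks typically ship an equivalent callback, which would be preferable in practice.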
Evaluation metrics. To measure the performance of networks for aerial scene parsing comprehensively, we first calculate per-class F1 scores and then average them to obtain the mean F1 score. Here, a larger F1 score indicates a better result. Besides, mean IoU (mIoU) and overall accuracy (OA) are calculated as well.
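The three metrics can be derived from a single confusion matrix, as in the sketch below. It uses the standard definitions of F1, IoU, and overall accuracy; the function name and the convention of scoring a class with no pixels as 0 are assumptions of this illustration, and any benchmark-specific details (for example, ignoring boundary pixels) are not modeled.

```python
def segmentation_metrics(conf):
    """Compute per-class F1, mean F1, mIoU, and OA from a confusion matrix.

    conf[i][j] is the number of pixels with true class i predicted as class j.
    A class with no true or predicted pixels gets a score of 0.0, which is a
    convention chosen for this sketch.
    """
    k = len(conf)
    total = sum(sum(row) for row in conf)
    f1s, ious = [], []
    for c in range(k):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                        # missed pixels of class c
        fp = sum(conf[r][c] for r in range(k)) - tp   # pixels wrongly labeled c
        denom_f1 = 2 * tp + fp + fn
        denom_iou = tp + fp + fn
        f1s.append(2 * tp / denom_f1 if denom_f1 else 0.0)
        ious.append(tp / denom_iou if denom_iou else 0.0)
    return {
        "per_class_f1": f1s,
        "mean_f1": sum(f1s) / k,
        "miou": sum(ious) / k,
        "oa": sum(conf[c][c] for c in range(k)) / total,
    }
```

For the two-class matrix [[8, 2], [1, 9]], the per-class F1 scores are 16/19 and 18/21, the per-class IoUs are 8/11 and 9/12, and the overall accuracy is 17/20.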

Ablation Studies
Effectiveness of RANKING module. In the ablation study, we first evaluate our RANKING module by comparing FCN+RANKING with a vanilla FCN. As can be seen in Table 1, by using the proposed RANKING module, our network achieves improvements of at least 4.37% and 2.23% (see FCN+RANKING-3-1) in the mean F1 score and OA, respectively, as compared to the baseline FCN. Furthermore, the maximum increment in the mean F1 score reaches 4.76% (cf. FCN+RANKING-3-2). In general, introducing the RANKING module into the baseline model brings a significant improvement.
Effect of the aggregation operator size. To explore the effect of the aggregation operator size, denoted as p, we evaluate our FCN+RANKING with various p and report results in Table 1. As shown there, when the number of steps is 1 or 3, FCN+RANKING with a large p, i.e., 7 × 7, achieves the best F1 scores. For 2 and 4 steps, FCN+RANKING with a 3 × 3 aggregation operator performs best. To conclude, varying the aggregation operator size has only a modest effect on our module.
Effect of the number of steps. The number of steps is an essential property of our module as it determines how far the sampler goes. Hence, it is useful to analyze how different numbers of steps influence performance. In Table 1, we can observe that with two steps, FCN+RANKING achieves the highest mean F1 score with a small p. One possible explanation is that the correlation between the center pixel and the sampled pixels might become weak once the number of steps exceeds 2.
Computational and memory overheads. To measure the computational complexity, we report FLOPs (floating-point multiply-adds, ×10⁹) and the number of parameters in Table 3. As shown in this table, our model requires comparable FLOPs but achieves an increment of 4.76% in the mean F1 score compared with the baseline FCN. To calculate the memory consumption, we take only the forward pass of one patch into consideration. As shown in Table 3, our network requires only 14% and 2% of the memory consumption required by FCN+GloRe and FCN+non-local, respectively. Besides, comparisons between FCN+RANKING-3-2 and the baseline FCN also demonstrate that our module is very lightweight and memory efficient. To conclude, the integration of the RANKING module can reinforce the performance of a network for aerial scene parsing at a very low computational cost.
Quantitative results of all models on the Vaihingen dataset are reported in Table 2. It can be seen that our FCN+RANKING-3-2 achieves the highest mean F1 score and mean IoU among all competitors. To be more specific, our model surpasses FCN-dCRF and SCNN by 4.98% and 3.69% in the mean F1 score, respectively. By comparing FCN+RANKING with FCN+non-local and FCN+GloRe, we observe that although our method requires relatively few computational resources, it still achieves marginal increments in both the mean F1 score and OA. Besides, we note that our method surpasses other competitors in recognizing cars owing to its capacity for modeling global context.
[...] identify clutter in this complex scene. From the second row, it can be found that networks relying on either spatial propagation modules or memory-consuming reasoning modules could fail to recognize impervious surfaces in the shadow, while our model predicts them more accurately.

Results on the Potsdam Dataset
In addition to comparisons on the Vaihingen dataset, we also compare our model with existing methods on the Potsdam dataset. Numerical results are shown in Table 4, and qualitative results are presented in Figure 4. We can observe that the integration of the RANKING module contributes to increments of 2.17% and 2.01% in the mean F1 score and overall accuracy, respectively, compared to FCN-dCRF. Besides, the mean F1 score of FCN+RANKING is 0.57% higher than that of FCN+non-local and 0.53% higher than that of FCN+GloRe.
Some qualitative results are presented in Figure 4 (bottom two rows). As we can see in the third row, FCN, FCN-dCRF, and FCN+non-local tend to misclassify impervious surfaces as clutter, while our FCN+RANKING makes accurate predictions. Moreover, by referring to long-range dependencies, the RANKING module is robust to visual ambiguities (e.g., a low-vegetation-like roof in the fourth row) and capable of correctly perceiving objects with complex appearances (e.g., a wave-shaped roof in the fourth row).

CONCLUSION
In this paper, a computation- and memory-efficient network module, RANKING, is proposed for global context modeling by adaptively sampling and aggregating a sparse set of features. As compared to existing self-attention modules, which are known for heavy computational and GPU memory overheads due to the calculation of dense pairwise relations, our module is lightweight yet achieves high performance. Ablation studies have been carried out on two aerial scene parsing datasets to demonstrate the effectiveness of our module. The visualization of the adaptive random walk sampling illustrates how the proposed RANKING module works. Furthermore, we evaluate our module by comparing a RANKING-equipped FCN with existing methods, and results suggest that our network can obtain competitive results at low computational and memory overheads.
Figure 4. Examples of segmentation results on the Vaihingen (top two rows) and Potsdam (bottom two rows) datasets. Legend: gray, impervious surfaces; blue, buildings; cyan, low vegetation; green, trees; brown, cars; red, clutter/background.