AUTOMATIC EXTRACTION OF ROAD CENTERLINES AND EDGE LINES FROM AERIAL IMAGES VIA CNN-BASED REGRESSION

Extracting roads from aerial images is a challenging task in remote sensing. Most approaches formulate road extraction as a segmentation problem and then apply thinning and edge detection to obtain road centerlines and edge lines, which can produce spurs around the extracted centerlines and edge lines. In this study, a novel regression-based method is proposed to extract road centerlines and edge lines directly from aerial images. The method consists of three major steps. First, an end-to-end CNN-based regression network is trained to predict confidence maps for road centerlines and to estimate road width. Second, non-maximum suppression and road tracking are applied to the predicted confidence map to extract accurate road centerlines and construct the road topology, and road edge lines are generated from the road width estimated by the CNN. Finally, to improve the connectivity of the extracted road network, tensor voting is applied to detect road intersections, and the detected intersections are used as guidance for bridging discontinuities. Experiments on the SpaceNet and DeepGlobe datasets show that our approach achieves better performance than competing methods.


INTRODUCTION
Road extraction from high-resolution remote sensing images is an essential task in the field of remote sensing. It has a wide range of applications, such as vehicle navigation, urban planning, autonomous driving and automatic digital line graph production. Although various methods have been proposed in recent years, road extraction remains challenging because of the considerable variation in road shape, color and context. Besides, roads are often occluded by objects such as shadows and trees, which further increases the difficulty of extraction.
To deal with the road extraction task, many CNN-based methods (Panboonyuen et al., 2017) have been proposed. However, most of these approaches formulate road extraction as a segmentation problem, whereas automatic map construction requires road centerlines and edge lines. Thinning and edge detection are therefore applied afterwards to obtain them, and such segmentation-based pipelines have several defects: (1) the thinning operation easily produces spurs around the extracted centerlines; (2) the road topology is not taken into account. It is therefore of great practical significance to have a framework that extracts road centerlines and edge lines directly from satellite images. In human pose estimation, researchers aim to locate anatomical keypoints. The basic idea of pose estimation methods is to produce 2D belief maps for the location of each part; the belief maps encode the spatial uncertainty of each keypoint's location.
Inspired by the belief maps used in human pose estimation, and to overcome the above-mentioned shortcomings of existing methods, this study proposes an end-to-end regression network that learns confidence maps for road centerlines and a road width map. A confidence map represents the probability that each pixel lies on a road centerline, and a road width map indicates the width of the road where each pixel is located. After the regression network predicts the confidence map and road width map, a simple non-maximum suppression (NMS) method and road tracking are applied to obtain accurate road centerlines. Road edge lines are then extracted based on the tracked centerlines and the estimated road width. Finally, to improve the connectivity of the extracted road network, tensor voting is applied to detect road intersections, and the detected intersections are used as guidance for bridging discontinuities. The final output of our algorithm is a shapefile (.shp) of road centerlines and edge lines, which can be used directly in automatic road map generation.
In this study, we use a multi-task learning strategy to jointly learn the confidence map and road width map, which not only improves computational efficiency but also enhances the generalization ability of the network. Since roads have a long, continuous shape, spatial relationships are essential for road extraction. Atrous convolutions (Chen et al., 2018) with different rates have been proven to capture context information effectively and are also adopted in our network.
This study conducts experiments on the DeepGlobe (Demir et al., 2018) and SpaceNet (Etten et al., 2018) datasets and compares our approach with other road extraction methods. Our method achieves performance on par with the state of the art.
The remainder of this paper is organized as follows. Related work on road extraction is reviewed in Section 2. The methods for road centerline extraction, edge line generation and discontinuity bridging are explained in Section 3. The experimental procedure and results are presented in Section 4, and conclusions in Section 5.

RELATED WORK
In recent years, many works have attempted to extract roads from aerial images. Most approaches fall into two classes: methods based on heuristic knowledge and methods based on machine learning. Heuristic methods generally exploit prior knowledge about roads, such as edges, radiometry, texture and geometry. For instance, Hu et al. (2005) proposed a model describing the radiometric and geometric characteristics of roads; they analysed profiles perpendicular to the road direction and extracted ribbon roads from satellite images. Mohammadzadeh et al. (2006) extracted main road networks from IKONOS imagery using an approach based on mathematical morphology and fuzzy logic. Movaghati et al. (2010) combined an extended Kalman filter with a special particle filter (PF) to keep road tracing robust in regions where roads are occluded by obstacles; their road tracer finds and follows the different road directions after reaching a junction. Shao and Guo (2011) presented an effective and fast approach to detect ribbon-like curvilinear structures in remote sensing images. The core of the algorithm is a simple assumption: the grey values of a center pixel and its near neighbours in the road region are brighter than those of pixels in the surrounding region. In contrast, machine learning methods exploit large amounts of data to train models for road extraction. Maurya et al. (2011) used K-means clustering to classify each pixel of an aerial image as road or non-road, then removed non-road areas based on morphological features. Huang et al. (2009) used an object-oriented algorithm to extract structural features such as shape index and density, then adopted support vector machines (SVM) to classify regions as road or non-road based on multiscale spectral-structural features.
Wegner et al. (2013) proposed a higher-order CRF for road labeling, in which the spatial properties of the road network are represented by higher-order cliques that serve as a prior for road extraction. Máttyus et al. (2015) used a Markov random field to infer the location of road centerlines and road width based on OpenStreetMap (OSM). The algorithm is very efficient: the OSM roads of the world can be segmented in one day.
Recently, convolutional neural networks (CNNs) have achieved huge success in computer vision and remote sensing image processing, including image classification, object detection and semantic segmentation, and they have also been applied to road extraction from aerial images. Mnih and Hinton (2010) made the first attempt to apply deep learning to road extraction, adopting restricted Boltzmann machines (RBMs) for urban network extraction. To fully use context information, the network is trained on large patches to predict the road map in their central area, and PCA is adopted to reduce the dimensionality of the input image. To obtain better results, Saito et al. (2016) proposed a CNN model for semantic segmentation of aerial images into three classes (road, building, background) and designed a new channel-wise inhibited softmax (CIS) loss function to improve the segmentation. Máttyus et al. (2017) used CNN models to obtain an initial road segmentation of aerial imagery and then designed an algorithm to recover the missing connections in the extracted road topology. Cheng et al. (2017) proposed CasNet, a cascaded convolutional neural network (CNN) that simultaneously performs road segmentation and road centerline extraction from aerial imagery. However, the distribution between centerlines and background is heavily biased, and their method still needs a thinning operation, so it cannot extract accurate road centerlines or infer the road topology. Inspired by deep residual learning and U-Net, Zhang et al. (2017) proposed the deep residual U-Net, which combines deep residual learning with the U-Net architecture and is more powerful in road segmentation. Wei et al. (2017) proposed RSRCNN, a CNN model for refined segmentation of road structures from aerial images, trained with a loss function that considers the spatial correlation and geometric information of road structures.
Zhou et al. (2018) proposed D-LinkNet for road extraction; its backbone is LinkNet, with several dilated convolution layers added in the center part. D-LinkNet won first place in the DeepGlobe 2018 Road Extraction Challenge.
Several works represent the road network as an undirected graph. Bastani et al. (2018) proposed RoadTracer, which uses a CNN-based decision function to guide an iterative search process that generates a road graph. Ventura et al. (2018) designed a CNN that predicts the connectivity between the current road node and other nodes in its neighbourhood. In PolyMapper, the authors define the road network as a closed graph and use Polygon-RNN to detect the positions of the graph nodes.

METHODOLOGY
The workflow of the proposed method is shown in Figure 1. Aerial images are the input, and a regression network is trained to predict the confidence map for road centerlines and the road width map. At the inference stage, after the network predicts the confidence map and width map, NMS and road tracking are applied to obtain road centerlines. Road edge lines are then generated from the extracted centerlines and the predicted road width map. Finally, to improve the completeness of the extracted road network, tensor voting is applied to detect road intersections, and the detected intersections are used to guide the bridging of discontinuities.
Confidence map for road centerlines

The value of a pixel in the confidence map represents its probability of lying on a road centerline, as shown in Figure 2. The confidence map has the following properties: y_j^(n) is a local maximum when x_j^(n) is on a centerline, and y_j^(n) gradually decreases as the distance between x_j^(n) and the centerline grows. The confidence map Y_n is defined as

y_j^(n) = exp(-D_C(x_j^(n))^2 / σ^2)    (1)

where D_C(x_j^(n)) denotes the minimum distance from the j-th pixel x_j^(n) to a pixel x_k on the road centerlines, and σ controls the spread of the peak. In our approach, we set σ = 5. In most road extraction datasets, the road label is a binary mask in which pixels in road areas have value 1 and background pixels have value 0. We therefore convert the road mask into the confidence map for centerlines: a thinning algorithm is applied to obtain road centerlines from the road mask, and the confidence map is then generated according to Eq. (1).

Figure 1. Workflow of the proposed method
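As an illustration, the mapping from a thinned centerline mask to the confidence map of Eq. (1) can be sketched as follows (a brute-force NumPy version for clarity; the function name and the tiny test pattern are ours, and a distance transform would be used at scale):

```python
import numpy as np

def confidence_map(centerline_mask: np.ndarray, sigma: float = 5.0) -> np.ndarray:
    """Build a centerline confidence map y_j = exp(-D_C(x_j)^2 / sigma^2),
    where D_C is the distance to the nearest centerline pixel.
    Brute-force version for illustration; a distance transform
    (e.g. scipy.ndimage.distance_transform_edt) is preferable at scale."""
    ys, xs = np.nonzero(centerline_mask)          # centerline pixel coordinates
    points = np.stack([ys, xs], axis=1)           # (K, 2)
    h, w = centerline_mask.shape
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    # distance from every pixel to the nearest centerline pixel
    d = np.sqrt(((grid[:, :, None, :] - points[None, None, :, :]) ** 2).sum(-1)).min(-1)
    return np.exp(-(d ** 2) / sigma ** 2)

# a single horizontal centerline through row 2 of a 5x5 patch
mask = np.zeros((5, 5), dtype=np.uint8)
mask[2, :] = 1
conf = confidence_map(mask, sigma=5.0)
```

The map equals 1 on the centerline and decays smoothly with distance from it, which is what lets the later NMS step recover the centerline as a ridge of local maxima.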

Road width map
In order to estimate the road width at each pixel on the extracted centerlines, this study proposes a road width map, which indicates the width of the road where each pixel is located. A regression network is trained to predict the road width map for input images. The ground truth for the road width map is also generated from the binary road mask. For each pixel on a road, the width is measured along the direction perpendicular to the road direction, as shown in Figure 3. The road width map Z_n is calculated as

z_j^(n) = w(x_j^(n))  if P(x_j^(n)) = 1,    z_j^(n) = 0  if P(x_j^(n)) = 0    (2)

where w(x_j^(n)) is the road width measured at x_j^(n), P(x_j^(n)) = 1 denotes that x_j^(n) is in the road region and P(x_j^(n)) = 0 denotes that x_j^(n) is in the background.
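A minimal sketch of ground-truth width-map generation is given below. It assumes the common approximation that the width at a road pixel is twice its distance to the nearest background pixel; the paper measures width along the perpendicular profile, which this distance-transform proxy only approximates:

```python
import numpy as np

def width_map(road_mask: np.ndarray) -> np.ndarray:
    """Approximate ground-truth road width map: for each road pixel, twice the
    distance to the nearest background pixel (a proxy for the width measured
    perpendicular to the road); background pixels get 0. Brute force for clarity."""
    h, w = road_mask.shape
    bg = np.stack(np.nonzero(road_mask == 0), axis=1)        # background coords (K, 2)
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    d = np.sqrt(((grid[:, :, None, :] - bg[None, None, :, :]) ** 2).sum(-1)).min(-1)
    return np.where(road_mask > 0, 2.0 * d, 0.0)

# a vertical road occupying columns 2-4 of a 7x7 patch
mask = np.zeros((7, 7), dtype=np.uint8)
mask[:, 2:5] = 1
wm = width_map(mask)
```

The proxy peaks on the road axis and falls toward the edges; an exact implementation would instead scan the perpendicular profile at each pixel.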

Network architecture
The whole network is shown in Figure 4. We adopt an encoder-decoder architecture to simultaneously predict the confidence map and the width map. We choose a pretrained ResNet (He et al., 2016) as the encoder. The network is split into two branches: the first decoder predicts the confidence map while the second predicts the width map. The two decoders share the same architecture. Each decoder has five upsampling layers that gradually upsample the feature map by a factor of 2, and each upsampling layer is followed by two convolution layers to generate dense predictions.
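A toy stand-in for this two-branch design is sketched below. The paper uses a pretrained ResNet encoder; here a small strided-convolution stack plays that role, and all layer and channel sizes are illustrative choices of ours, not the paper's:

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One decoder stage: upsample by 2, then two 3x3 convolutions."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.conv(self.up(x))

class TwoBranchNet(nn.Module):
    """Toy version of the two-branch regression network: a shared encoder
    (a pretrained ResNet in the paper) followed by two identical decoders
    that regress the centerline confidence map and the road width map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # downsamples by 32, like ResNet
            *[nn.Sequential(nn.Conv2d(ci, co, 3, stride=2, padding=1), nn.ReLU(inplace=True))
              for ci, co in [(3, 16), (16, 32), (32, 64), (64, 64), (64, 64)]]
        )
        def decoder():  # five upsampling stages back to full resolution
            return nn.Sequential(UpBlock(64, 64), UpBlock(64, 32), UpBlock(32, 16),
                                 UpBlock(16, 16), UpBlock(16, 8), nn.Conv2d(8, 1, 1))
        self.conf_head, self.width_head = decoder(), decoder()
    def forward(self, x):
        f = self.encoder(x)
        return self.conf_head(f), self.width_head(f)

net = TwoBranchNet()
conf, width = net(torch.randn(1, 3, 64, 64))
```

Both heads return full-resolution single-channel maps, matching the dense per-pixel regression targets defined above.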
Roads generally have a long, continuous shape, so context information and spatial relationships are important for road recognition. Dilated convolutions with different dilation rates can effectively enlarge the receptive field of the network while preserving detail, so this study adds additional dilated convolution layers in the center part of the network.
The confidence map and width map predicted by our network are denoted as Ŷ and Ẑ, and their ground truths as Y and Z. In this study, mean-squared error is adopted as the loss function of our network. The loss for the predicted confidence map is denoted L_Y and the loss for the predicted width map L_Z:

L_Y = (1/N) Σ_p (Ŷ(p) − Y(p))^2    (4)

L_Z = (1/N) Σ_p (Ẑ(p) − Z(p))^2    (5)

where p denotes a pixel position in the confidence and width maps and N denotes the number of pixels in a map. The total loss of the network is the sum of L_Y and L_Z:

Loss = L_Y + L_Z    (6)

Figure 4. Our architecture is split into two branches, top and bottom, which simultaneously predict the confidence and road width maps.
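For reference, the two mean-squared terms and their sum (Eq. 6) amount to only a few lines; NumPy is used here for illustration, whereas the actual training code would use the framework's loss primitives:

```python
import numpy as np

def total_loss(conf_pred, conf_gt, width_pred, width_gt):
    """Total loss = L_Y + L_Z, each a mean-squared error over all N pixels."""
    l_y = np.mean((conf_pred - conf_gt) ** 2)    # confidence-map loss L_Y
    l_z = np.mean((width_pred - width_gt) ** 2)  # width-map loss L_Z
    return l_y + l_z

loss = total_loss(np.ones((4, 4)), np.zeros((4, 4)),
                  np.zeros((4, 4)), np.zeros((4, 4)))
# → 1.0 (L_Y = 1, L_Z = 0)
```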

Extraction of centerlines and edge lines
The values of pixels on centerlines in the confidence map are local maxima along the direction perpendicular to the road direction. Therefore, after the network predicts the confidence map for centerlines and the road width map, a Canny-like non-maximum suppression (NMS) is applied to the confidence map to obtain accurate centerlines. Given the confidence map M predicted by the CNN, the direction θ perpendicular to the road direction at position p(x, y) is calculated as

θ = arctan(D_y / D_x)    (7)

where D_x and D_y denote the horizontal and vertical gradients of M at p. If the value of p(x, y) on M is a local maximum along direction θ, p is on a centerline.
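The NMS step can be sketched as follows. This is a Canny-style suppression with the gradient direction quantized to four orientations; the threshold and the synthetic test ridge are illustrative choices of ours:

```python
import numpy as np

def nms_centerline(conf: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Keep a pixel if it is a local maximum of the confidence map along the
    gradient direction (perpendicular to the road), Canny-style. The direction
    is quantized to the 4 principal orientations for simplicity."""
    dy, dx = np.gradient(conf)
    theta = np.arctan2(dy, dx)               # direction perpendicular to the road
    out = np.zeros_like(conf, dtype=np.uint8)
    h, w = conf.shape
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            if conf[i, j] < threshold:
                continue
            a = (np.degrees(theta[i, j]) + 180.0) % 180.0
            if a < 22.5 or a >= 157.5:   n1, n2 = conf[i, j-1], conf[i, j+1]
            elif a < 67.5:               n1, n2 = conf[i-1, j+1], conf[i+1, j-1]
            elif a < 112.5:              n1, n2 = conf[i-1, j], conf[i+1, j]
            else:                        n1, n2 = conf[i-1, j-1], conf[i+1, j+1]
            if conf[i, j] >= n1 and conf[i, j] >= n2:
                out[i, j] = 1
    return out

# a horizontal ridge: confidence peaks on row 3 and decays away from it
rows = np.exp(-((np.arange(7) - 3.0) ** 2) / 4.0)
conf = np.tile(rows[:, None], (1, 7))
center = nms_centerline(conf)
```

Only the ridge crest survives, giving a one-pixel-wide centerline without the thinning step that segmentation pipelines require.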
Although the centerlines are extracted after NMS, the results are simply binary images (as shown in Figure 5), which lack road topology. Thus, in order to construct the road topology, we apply road tracking to the extracted centerlines.
To track road centerlines, a point on a centerline is first selected as the start point, and the road direction at that point is calculated. The binary centerline image is denoted C = {c_ij | i = 1, ..., H, j = 1, ..., W}, where c_ij = 1 if pixel (i, j) is on a centerline and c_ij = 0 if it belongs to the background. Given the direction θ_current of the current trace point (x_current, y_current), the positions of the candidates for the next trace point are calculated as

(x_{s,t}, y_{s,t}) = (x_current + s·cos(θ_current + t), y_current + s·sin(θ_current + t))    (8)

where s is the tracking step length and t is an angular offset drawn from a small set of candidate turns.
The next trace point (x_next, y_next) is then selected as

(x_next, y_next) = (x_{s,t_min}, y_{s,t_min})    (9)

t_min = argmin_t |t|  subject to  C(x_{s,t}, y_{s,t}) = 1    (10)

i.e., among the candidates lying on the centerline, the one requiring the smallest turn is chosen. The road direction for the next trace point is updated as

θ_current = θ_current + t_min    (11)

After the road centerlines are tracked, road edge lines are generated from the tracked centerlines and the road width map predicted by the CNN model. Let W denote the predicted road width map and p a pixel on a tracked centerline. The positions of the corresponding pixels on the two road edge lines are calculated as

(x_e, y_e) = (x_p ∓ (W(p)/2)·sin θ_p, y_p ± (W(p)/2)·cos θ_p)    (12)

where (x_p, y_p) denotes the position of p and θ_p the road direction at p; that is, the centerline point is offset by half the road width along the direction perpendicular to θ_p. The extracted road centerlines and edge lines are shown in Figure 5.
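One tracking step and the edge-point computation might look like this; the step length, the candidate turn set and the toy centerline are our assumptions, not values from the paper:

```python
import numpy as np

def next_trace_point(C, x, y, theta, step=2.0,
                     offsets=np.radians(np.arange(-60, 61, 15))):
    """One step of centerline tracking: among candidate points at distance
    `step` in directions theta + t, pick the one on the centerline with the
    smallest turn |t|. Returns (x, y, theta) or None if the trace ends."""
    for t in sorted(offsets, key=abs):               # prefer the smallest turn
        xn = int(round(x + step * np.cos(theta + t)))
        yn = int(round(y + step * np.sin(theta + t)))
        if 0 <= yn < C.shape[0] and 0 <= xn < C.shape[1] and C[yn, xn] == 1:
            return xn, yn, theta + t
    return None

def edge_points(x, y, theta, width):
    """The two road edge points: offset the centerline point by half the
    predicted width along the direction perpendicular to the road."""
    ox, oy = -np.sin(theta) * width / 2.0, np.cos(theta) * width / 2.0
    return (x + ox, y + oy), (x - ox, y - oy)

# track along a horizontal centerline on row 4, heading in the +x direction
C = np.zeros((9, 9), dtype=np.uint8)
C[4, :] = 1
x, y, theta = next_trace_point(C, 0, 4, 0.0)
left, right = edge_points(x, y, theta, width=4.0)
```

Repeating `next_trace_point` until it returns `None` yields an ordered polyline, which is what supplies the topology missing from the raw NMS output.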

Refining road topology
After road tracking, the main road network has been extracted. However, some gaps and isolated road fragments remain in the extracted network. Most isolated segments should be connected to other roads to form intersections, as shown in Figure 6. Therefore, this study uses the tensor voting algorithm (Maggiori et al., 2015) to bridge the discontinuities. Although intersections could be predicted directly by a CNN, tensor voting is simpler and requires no training, making it a more generic tool for road network refinement.
Tensor voting is a robust method for perceptual grouping. It first encodes input points as stick-shaped or ball-shaped tensors. After the input points are encoded as perfect tensors (Maggiori et al., 2015), the information they encode is propagated to their neighbourhood in a voting procedure. After the first voting pass, the tensors of the input points are refined and a second voting pass is carried out. In the encoding procedure, each pixel p in the road region with normal vector n = (n_x, n_y) is encoded as the tensor

T = [ n_x n_x   n_x n_y ]
    [ n_x n_y   n_y n_y ]

T can be decomposed as

T = (λ_1 − λ_2) e_1 e_1ᵀ + λ_2 (e_1 e_1ᵀ + e_2 e_2ᵀ)

where λ_1 and λ_2 (λ_1 ≥ λ_2) are the eigenvalues of T and e_1, e_2 its eigenvectors. λ_1 − λ_2 is the saliency of the stick component and λ_2 the saliency of the ball component.
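The encoding and decomposition can be checked numerically. Note how two stick tensors with orthogonal normals sum to a pure ball tensor, which is why intersections, where several road orientations meet, stand out in the ball-saliency map:

```python
import numpy as np

def encode(normal):
    """Encode a point with normal vector n as the stick tensor T = n n^T."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    return np.outer(n, n)

def saliencies(T):
    """Decompose T into its eigensystem and return the stick saliency
    (lambda1 - lambda2) and the ball saliency (lambda2)."""
    lam2, lam1 = np.linalg.eigvalsh(T)               # ascending order
    return lam1 - lam2, lam2

stick, ball = saliencies(encode((0.0, 1.0)))                     # pure stick
stick2, ball2 = saliencies(encode((1.0, 0.0)) + encode((0.0, 1.0)))  # pure ball
```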
Voting is carried out after the points in the road region have been encoded: tensors propagate their information to other points in their neighbourhood, as shown in Figure 6(a). Assume P is the voting point and O the receiver. The saliency of the vote cast by P and received by O decays as

DF(s, κ) = exp(−(s² + cκ²) / ε²)

where s denotes the arc length along the osculating circle from P to O, κ the curvature of that arc, ε the scale parameter, and c the weight controlling how strongly curvature attenuates the saliency. The vote of P received by O is then calculated as

V = DF(s, κ) · R_{2θ} T R_{2θ}ᵀ

where θ is the angle subtended by the arc of the osculating circle from P to O, R_{2θ} is the rotation matrix for angle 2θ, and T is the tensor of P.
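A sketch of the vote computation under this classic decay function follows; the scale and curvature-weight values are illustrative, not tuned parameters from the paper:

```python
import numpy as np

def decay(s, kappa, eps, c):
    """Saliency decay DF = exp(-(s^2 + c*kappa^2) / eps^2): votes weaken with
    arc length s and curvature kappa; eps is the scale, c weights curvature."""
    return np.exp(-(s ** 2 + c * kappa ** 2) / eps ** 2)

def vote(T, s, kappa, theta, eps=10.0, c=1.0):
    """Vote cast by stick tensor T: the decayed tensor rotated by 2*theta so
    the received normal agrees with the osculating circle at the receiver."""
    a = 2.0 * theta
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return decay(s, kappa, eps, c) * (R @ T @ R.T)

T = np.outer([0.0, 1.0], [0.0, 1.0])        # stick tensor with normal along +y
v = vote(T, s=0.0, kappa=0.0, theta=0.0)    # receiver on the tangent line
```

At zero distance and curvature the vote is the tensor itself; accumulating such votes at every pixel and reading off the ball saliency yields the intersection map of Figure 6(c).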
After tensor voting, the ball saliency at intersections is higher than at other points in their neighbourhood, as shown in Figure 6(c). Intersections are extracted by applying NMS to the ball-saliency map; the detected intersections are shown in Figure 6(d). The detected intersections are then used to guide the bridging of discontinuities: if an intersection lies on the extension of a road line segment, the corresponding road fragment is connected to that intersection. Figure 7 shows the refined road topology.

EXPERIMENTS

Dataset
This study conducts experiments on the DeepGlobe and SpaceNet datasets. The DeepGlobe dataset consists of 6226 aerial images; the satellite imagery is sampled from the DigitalGlobe +Vivid image dataset, with a spatial resolution of 1 m²/pixel. We randomly select 4626 images for training and 1600 for testing. The SpaceNet dataset consists of 3347 images with a ground resolution of 30 cm/pixel, covering four areas: Las Vegas, Paris, Shanghai, and Khartoum. We split this dataset into 2780 images for training and 567 for testing.

Implementation details
We implement the proposed network using the PyTorch framework. The encoder is initialized with a model pretrained on the ImageNet dataset. The network is optimized using RMSprop with the poly learning-rate policy. The hyperparameters of our model are an initial learning rate of 2e-4, a mini-batch size of 2 and a maximum of 300 epochs.

Evaluation Metrics
Two different measures are used to evaluate the quality of the extracted road networks: a classical buffer-based measure and a connectivity measure that evaluates the topology.
The classical measure (Heipke et al., 1997) consists of recall, precision and F1-score, defined as

recall = n*_m / n*_t

precision = n_m / n_t

F1 = 2 · precision · recall / (precision + recall)

where n*_m denotes the length of the reference road paths lying within the buffer of the extracted road network, n*_t the total length of the reference road network, n_m the length of the extracted road paths lying within the buffer of the reference road network, and n_t the total length of the extracted road network. The buffer width is set to 3 pixels in the experiments.
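A simplified pixel-level sketch of these buffer-based metrics is given below. Paths are represented as point sets, path length is approximated by pixel count, and the toy reference/extraction pair is our own construction:

```python
import numpy as np

def buffer_match(path_a, path_b, buffer_width=3.0):
    """Length (pixel count) of path_a lying within `buffer_width` of path_b.
    Paths are (N, 2) arrays of pixel coordinates."""
    d = np.sqrt(((path_a[:, None, :] - path_b[None, :, :]) ** 2).sum(-1)).min(1)
    return int((d <= buffer_width).sum())

def road_metrics(extracted, reference, buffer_width=3.0):
    """Heipke-style measures: recall = matched reference length / total
    reference length, precision = matched extracted length / total extracted
    length, F1 = their harmonic mean."""
    recall = buffer_match(reference, extracted, buffer_width) / len(reference)
    precision = buffer_match(extracted, reference, buffer_width) / len(extracted)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

ref = np.array([[i, 10] for i in range(20)], dtype=float)  # reference: row y=10
ext = np.array([[i, 12] for i in range(10)], dtype=float)  # half of it, offset by 2
r, p, f1 = road_metrics(ext, ref)
```

Here the extraction covers only part of the reference (low recall) but stays inside the buffer everywhere (perfect precision), illustrating why both measures are reported.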
The connectivity C (Ventura et al., 2018) is defined as the ratio of continuous road segments:

C = N_con / N_gt

where N_con denotes the number of continuous road segments and N_gt the number of ground-truth segments.

Quantitative Results
Our approach has been compared with two deep learning-based road extraction methods, U-Net and D-LinkNet, on the test sets of the DeepGlobe and SpaceNet datasets. The U-Net architecture is widely used in biomedical image segmentation and has shown strong performance in road segmentation. D-LinkNet won first place in the DeepGlobe Road Extraction Challenge. We report performance in terms of mean recall, precision, F1-score and connectivity on the two datasets.
On the DeepGlobe dataset, our method outperforms the other methods in terms of mean recall, precision and F1-score (as shown in Table I). Specifically, compared with D-LinkNet, our method obtains increments of 0.69% in mean recall and 0.35% in mean precision, indicating that it extracts more complete road networks while maintaining a lower error rate than the other methods. We further use the F1-score to assess the overall quality of the extracted road topology: our approach obtains an increment of 0.54% in mean F1-score, as well as an increment of 0.52% in mean connectivity. These quantitative results show that our method surpasses the baselines in extracting high-quality road networks. This study then conducts experiments on the SpaceNet dataset to further evaluate the proposed method, with results shown in Table II. Relative to D-LinkNet, the proposed method obtains increments of 1.48% and 0.53% in mean recall and precision, and increments of 1.05% and 2% in F1-score and connectivity. These results indicate that the proposed method also achieves higher performance on the SpaceNet dataset.

Qualitative Results
Figure 8 shows predicted results of the above methods on the test sets of the SpaceNet and DeepGlobe datasets. The proposed method extracts more complete road networks with fewer discontinuities, especially in urban areas where roads are often occluded by buildings and shadows. At the same time, it maintains a low error rate and does not produce more incorrect road fragments. The extracted road edge lines shown in Figure 8 indicate that the road width estimated by the CNN is relatively accurate compared with the ground truth.
Overall, on the DeepGlobe and SpaceNet datasets the proposed method achieves higher performance in road topology extraction than the baselines (U-Net, D-LinkNet). However, some false road fragments and discontinuities remain in the extracted road networks, especially in dense urban areas where roads are frequently occluded and visually similar to some buildings. Extracting road networks in dense urban areas thus remains a great challenge.

CONCLUSIONS
This study proposes a regression-based method for the automatic extraction of road centerlines and edge lines from aerial images. First, a regression network is trained to predict confidence maps for road centerlines and a road width map. Second, after the CNN predicts the confidence map, NMS and road tracking are applied to obtain accurate road centerlines, and road edge lines are generated from the extracted centerlines and the road width estimated by the network. Finally, to improve the connectivity of the extracted road network, tensor voting is applied to detect road intersections, and the detected intersections are used as guidance for bridging discontinuities.
The major contribution of this study is a strategy that departs from image segmentation to solve the road extraction problem. We have conducted experiments on the DeepGlobe and SpaceNet datasets, and the results indicate that our approach achieves better performance than other road extraction methods. In future work, we plan to address road network extraction in dense urban areas.