Deep cross-domain building extraction for selective depth estimation from oblique aerial imagery

With the technological advancements of aerial imagery and accurate 3d reconstruction of urban environments, more and more attention has been paid to the automated analysis of urban areas. In our work, we examine two important aspects that allow a live analysis of building structures in city models given oblique aerial imagery, namely automatic building extraction with convolutional neural networks (CNNs) and selective real-time depth estimation from aerial imagery. We use transfer learning to train the Faster R-CNN method for real-time deep object detection by combining a large ground-based dataset for urban scene understanding with a smaller number of images from an aerial dataset. We achieve an average precision (AP) of about 80% for the task of building extraction on a selected evaluation dataset. Our evaluation focuses on both dataset-specific learning and transfer learning. Furthermore, we present an algorithm that allows for multi-view depth estimation from aerial imagery in real-time. We adopt the semi-global matching (SGM) optimization strategy to preserve sharp edges at object boundaries. In combination with the Faster R-CNN, it allows a selective reconstruction of buildings, identified by regions of interest (RoIs), from oblique aerial imagery.


INTRODUCTION
In recent years, more and more attention has been paid to the automated analysis of urban areas due to an increase in urbanization and the need for more efficient urban planning and sustainable development. While the consideration of urban trees in planning processes can provide measurable economic, environmental, social and health benefits (Kelly, 2011), the monitoring and analysis of existing buildings allows, e.g., creating and updating city models (Kolbe, 2009), finding suitable roof planes for solar energy installations (Schuffert et al., 2015) and reasoning about a diversity of processes.
The basis for such automated analyses is typically given with data acquired from aerial platforms. The automated extraction of objects from such data has been a topic of great interest for decades (Mayer, 2008). To reason about possible updates or extract objects of interest, the 2d outlines of buildings as well as a reconstruction of the 3d shape of buildings may be compared to the information given in model data. To foster research on both of these issues, the ISPRS benchmark on urban object classification and 3d building reconstruction (Rottensteiner et al., 2012) has been initiated and addressed by a diversity of approaches.
Due to technological advancements, aerial oblique imagery nowadays provides a ubiquitous tool for an accurate 3d reconstruction of urban environments (Cavegn et al., 2014). Based on the 3d reconstruction, city models with different levels of detail (LOD) can be created. Particularly LOD2 building models with distinctive roof structures and larger building installations like balconies and stairs, and LOD3 building models with detailed wall and roof structures, doors and windows (Kolbe, 2009) are desirable for a variety of applications. Recent work in the field of 3d change detection (Palazzolo and Stachniss, 2017; Ruf and Schuchert, 2016; Taneja et al., 2015) suggests using images for an efficient verification or improvement of such models.
In our work, we aim at methodologies that allow online change detection and analysis of building structures in city models given oblique aerial image sequences captured from commercial off-the-shelf (COTS) drones. Hereby, we use "online" to indicate that our algorithms should allow a direct annotation and processing of the input sequence. A key aspect in this process is the real-time image-based depth estimation, giving us information on the current structure of the scene. However, the model data estimated from imagery typically differs greatly from city models in its detail and the objects it holds. While city models only depict buildings, image-based models also contain vegetation and dynamic objects such as cars and pedestrians. Apart from being potential sources for errors, the reconstruction of such objects is not necessary for finding changes in building structures.
In this paper, we examine two aspects to remedy this, namely the extraction of buildings from aerial imagery and selective real-time image-based depth estimation for the identified objects. As no dataset exists that allows training and testing a method for building extraction from imagery captured by COTS drones, we use transfer learning by combining a large ground-based dataset intended for semantic urban scene understanding with a small aerial dataset adopted to our needs. The assumption is that oblique imagery captured from low altitudes shows building parts that are also depicted in ground-based imagery, which holds, e.g., for façades.
In summary, our main contribution is a methodology for building extraction and selective 3d reconstruction from oblique aerial imagery which
• relies on cross-domain training of a convolutional neural network (CNN) to deliver object proposals corresponding to buildings,
• allows real-time image-based depth estimation that preserves strong discontinuities at object boundaries, giving buildings sharp edges, and
• is evaluated on different datasets, whereby the focus is put on both dataset-specific learning and transfer learning to obtain appropriate object proposals as the basis for 3d reconstruction.
This paper is structured as follows: We briefly summarize related work in Section 2 giving an overview of the recent advances in the fields of object detection and image-based 3d reconstruction.
In Section 3, we give a detailed description of our methodology for deep building extraction and depth estimation. The training process together with the datasets used, as well as the experimental results are described in Section 4. We give a short discussion of the achieved results in Section 5 before we conclude our paper in Section 6.

RELATED WORK
In the following, we provide an overview on the advances in the field of object detection, followed by a short outline on the related work of image-based 3d reconstruction.

Object Detection
In the past, a rich variety of approaches has been presented for object detection in given imagery. Thereby, several approaches focus on instance search, i.e. the detection of specific types of objects by generating candidate windows around regions of interest (RoIs) and thus delivering object proposals. In general, this can be achieved with window scoring methods and grouping methods. For a more detailed discussion of such methods, we refer to (Sommer et al., 2016), and we instead only briefly summarize the main ideas.
The window scoring methods typically generate candidate windows by applying either a sliding window approach or a random sampling technique. For each candidate window, a score is calculated which allows to rank or discard these windows w.r.t. their score. To define the scoring of candidate windows, different cues may be taken into account such as a generic objectness measure quantifying how likely it is for an image window to contain an object of any class (Alexe et al., 2012). This measure combines several image cues such as multiscale saliency, color contrast, edge density or superpixels straddling. A different objectness measure relies on the number of edges that exist in the window and those edges that are members of contours overlapping the window's boundary (Zitnick and Dollár, 2014).
The grouping methods typically perform an image segmentation followed by a grouping of segments to generate multiple (possibly overlapping) segments that are likely to correspond to objects. Thereby, the grouping is typically based on a diverse set of cues including superpixel shape, appearance cues and boundary estimates (Hosang et al., 2016;Sommer et al., 2016). A commonly used approach is known as selective search (Uijlings et al., 2013). This approach relies on a hierarchical grouping to get a set of small starting regions forming the basis of a selective search. Subsequently, a greedy data-driven algorithm is used to iteratively merge regions based on a variety of complementary grouping criteria and a variety of complementary color spaces with different invariance properties. This yields a small set of high-quality object locations.
With the great success of modern deep learning in a variety of research domains, such techniques have also been introduced in recent years for object detection in the form of deriving object proposals. The combination of a region proposal method with deep CNNs has been proposed with the R-CNN (Girshick et al., 2014). This approach makes use of selective search (Uijlings et al., 2013) to generate category-independent region proposals, but is generally agnostic to the particular region proposal method.
The derived proposals are provided as input to a CNN that extracts a feature vector of fixed length for each region. Finally, the feature vectors characterizing the derived RoIs are used to classify the objects located within these regions using a support vector machine (SVM). The R-CNN approach outperformed classical methods on public benchmarks for object detection. To improve efficiency, the Fast R-CNN (Girshick, 2015) has been proposed. Amongst others, the introduced innovations also allow using the very deep VGG-16 network (Simonyan and Zisserman, 2014) which performed quite well in the ImageNet Localization+Classification Challenge 2014 (Russakovsky et al., 2015). At this point, the computation of region proposals was identified as a bottleneck in terms of runtime.
Instead of using conventional approaches to derive region proposals (Felzenszwalb et al., 2010; Uijlings et al., 2013), the use of a region proposal network (RPN) has been proposed (Ren et al., 2017) which takes an image (of any size) as input and delivers a set of candidate windows, i.e. rectangular object proposals, each with an objectness score. These region proposals in turn are used by a Fast R-CNN for object detection and, due to the increase in computational efficiency, the resulting approach was dubbed Faster R-CNN (Ren et al., 2017). This improvement allows real-time object detections on a GPU achieving state-of-the-art results of over 70% mean average precision (mAP) on public benchmarks.

Image-based 3d Reconstruction
In recent years, a lot of attention has been paid to dense image-based depth estimation and model reconstruction. With the ever increasing computational power of modern hardware, it is possible to perform an online reconstruction from imagery acquired with a single moving camera (Newcombe and Davison, 2010; Newcombe et al., 2011; Stühmer et al., 2012). The presented 3d models are of great detail, but typically depict only a small-scale scenery.
With semi-global matching (SGM) (Hirschmueller, 2008), an optimization strategy was introduced which can be used for depth estimation from a stereo setup or a multi-image matching (Rothermel et al., 2012). Since its introduction, SGM has been used and adopted for a wide variety of applications in the field of dense depth estimation and 3d reconstruction as it provides a good trade-off between accuracy and computational effort. As a result, state-of-the-art methods for aerial image-based 3d reconstruction also rely on the SGM optimization (d'Angelo and Kuschk, 2012; Haala et al., 2015; Rothermel et al., 2012).

METHODOLOGY
The processing pipeline of our approach is composed of two parts: building extraction and selective depth estimation, as outlined in Figure 1. For each iteration of the presented pipeline, we choose a bundle of five consecutive images of an input sequence which depict the scene of interest from five slightly different viewpoints. We select the center image of the input bundle as our reference image and pass it on to the first processing step in which we use an object detection algorithm to identify buildings and mark these with axis-aligned RoIs. The annotated reference image is then passed on to the second step in which it is combined with the remaining four images of the input bundle to estimate the depth for the previously identified RoIs. In the following, both parts of the pipeline will be discussed in more detail.
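The bundle selection described above can be sketched in a few lines; note that the stride between consecutive iterations is not specified in the text, so the sliding-window variant below is an assumption:

```python
def make_bundles(frames, size=5):
    """Split an image sequence into bundles of `size` consecutive frames.

    Returns (reference, matching) pairs: the centre frame of each bundle
    is the reference image, the remaining frames are the matching images.
    """
    bundles = []
    for i in range(len(frames) - size + 1):
        bundle = frames[i:i + size]
        ref = bundle[size // 2]                       # centre image
        matching = bundle[:size // 2] + bundle[size // 2 + 1:]
        bundles.append((ref, matching))
    return bundles
```

For a sequence of seven frames this yields three overlapping bundles, each with the centre frame as reference.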

Building Extraction from Aerial Imagery
For the task of building extraction from aerial imagery, we have adopted and trained the Faster R-CNN (Ren et al., 2017) based on the VGG-16 network of Simonyan and Zisserman (2014). As depicted in Figure 2, the Faster R-CNN first passes the input image to a set of convolutional layers creating a feature map. The VGG-16 based Faster R-CNN holds 13 convolutional layers and produces a feature map of size 14 × 14 × 512. The second processing step is made up of the RPN which computes numerous proposals by sliding differently sized windows over the input. As the RPN is sharing its convolutional layers with the rest of the network, it takes the previously computed feature map as input.
The resulting proposals are recombined with the initial feature map in the third step and fed into a RoI pooling layer which uses max-pooling to convert the feature map within each proposal into a spatially confined feature map of size 7 × 7 × 512. Two fully connected layers then produce feature vectors of size 1×1×4096 which are passed to the bounding box regression to compute four coordinates for each of the bounding boxes and to the softmax classifier to compute the corresponding class scores. We have adjusted the output size of the classifier to two, as we only want to distinguish if an identified object is a building or not. As the regression layer computes the four coordinates of each bounding box, its output size is set to eight accordingly.
Furthermore, as we aim to detect buildings in oblique aerial imagery, which might result in a partial occlusion of buildings by other buildings, we have substituted the conventional non-maximum suppression (NMS) by the Soft-NMS as presented by Bodla et al. (2017). Instead of removing all bounding boxes of an object that lie up to a certain threshold within the bounding box with the highest score M, the Soft-NMS decays the score of all non-maximum boxes according to the amount of overlap w.r.t. M. This results in better detections of objects which are partially occluded by other objects of their kind, as the proposals are not eliminated immediately.
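The linear-decay variant of the Soft-NMS can be sketched as follows; the minimum score below which boxes are finally discarded is an illustrative assumption, not a parameter stated in the paper:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, nt=0.3, score_min=0.001):
    """Soft-NMS with linear decay (Bodla et al., 2017): instead of removing
    boxes overlapping the current maximum M, their scores are decayed by
    (1 - IoU) whenever the overlap exceeds the threshold nt."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        m = max(range(len(scores)), key=scores.__getitem__)
        mb, ms = boxes.pop(m), scores.pop(m)
        keep.append((mb, ms))
        for i, b in enumerate(boxes):
            o = iou(mb, b)
            if o > nt:                       # decay instead of suppressing
                scores[i] *= (1.0 - o)
        boxes = [b for b, s in zip(boxes, scores) if s >= score_min]
        scores = [s for s in scores if s >= score_min]
    return keep
```

A fully overlapping duplicate box is decayed to a zero score and thus still removed, while partially overlapping detections survive with a reduced score.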

Selective Image-based Depth Estimation
For image-based depth estimation, we employ a multi-view plane-sweep algorithm. As input, our algorithm takes a bundle of five consecutive images, one reference image I_ref and four matching images I_ref{±1,±2}, two to either side of I_ref in terms of image acquisition time. To this end, we assume that the input images are provided together with corresponding camera projection matrices P_i = K [R_i^T | −R_i^T C_i], which consist of the rotation matrices R_i ∈ SO(3) and the positions of the camera centers C_i ∈ R^3, relative to a reference coordinate system. As we typically consider an image sequence, we assume the same intrinsic calibration matrix K for all images.
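The composition of a projection matrix from K, R_i and C_i as defined above can be illustrated with a short NumPy sketch:

```python
import numpy as np

def projection_matrix(K, R, C):
    """P = K [R^T | -R^T C] for a camera with intrinsic matrix K,
    rotation R and centre C in the reference coordinate system.
    A homogeneous world point X maps to image coordinates x ~ P X."""
    Rt = R.T
    return K @ np.hstack([Rt, (-Rt @ C).reshape(3, 1)])
```

For the identity pose, P reduces to K [I | 0] and a world point is simply divided by its depth.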
The algorithm samples the scene by using multiple planes Π_r ⊂ R^3, parameterized by their normal vector n_r and distance d_r relative to C_ref. For each plane, the four matching images I_ref{±1,±2} are warped into I_ref according to the plane-induced homography

H = K (R − (t n_r^T) / d_r) K^{−1},    (1)

with R, t the relative rotation and translation between the cameras, n_r the plane normal vector, and d_r the plane distance from the reference camera.
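As an illustration, the plane-induced homography can be assembled directly from its definition (the sign convention for t and n_r follows the standard formulation and is an assumption of this sketch):

```python
import numpy as np

def plane_homography(K, R, t, n, d):
    """Plane-induced homography H = K (R - t n^T / d) K^{-1} for a plane
    with unit normal n and distance d in the reference camera frame;
    R, t are the relative rotation and translation between the cameras."""
    n = np.asarray(n, dtype=float).reshape(3, 1)
    t = np.asarray(t, dtype=float).reshape(3, 1)
    return K @ (R - (t @ n.T) / d) @ np.linalg.inv(K)
```

A quick sanity check: for an identical pose (R = I, t = 0) the homography degenerates to the identity, i.e. the matching image coincides with the reference image.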
For each set of planes that share the same normal vector, we select the distances dr so that the scene is sampled inversely between two bounding planes Πmin and Πmax (Ruf et al., 2017). As similarity measure in the process of image matching, we choose a 9 × 7 Census Transform (Zabih and Woodfill, 1994). In order to account for occlusions, we accumulate the matching cost within the left and right subset of the matching images and select the minimum of the two as suggested by Kang et al. (2001). This yields a per-pixel matching cost for each plane which is stored in a three-dimensional cost volume C of size W × H × D, where W and H represent the image size and D is the number of planes with which the scene is sampled.
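The inverse sampling of the plane distances between the two bounding planes can be sketched as follows; uniform steps in inverse depth are a common interpretation of this strategy and are assumed here:

```python
def inverse_depth_samples(d_min, d_max, num_planes):
    """Sample `num_planes` plane distances between d_min and d_max so that
    they are uniformly spaced in inverse depth, yielding a denser sampling
    close to the reference camera (ordered from far to near)."""
    inv = [1.0 / d_max + k * (1.0 / d_min - 1.0 / d_max) / (num_planes - 1)
           for k in range(num_planes)]
    return [1.0 / v for v in inv]
```

The resulting distances are monotonically decreasing and the spacing shrinks towards the near bounding plane, which matches the pixel-displacement behaviour of the plane-sweep.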
Given the cost volume, we employ an edge-aware SGM optimization (Hirschmueller, 2008) to extract the per-pixel minimum and, with it, for each pixel the distance d_r of the plane Π_r which best approximates the structure of the scene. The adapted energy function of the SGM optimization is as follows:

E(d) = Σ_p ( C(p, d_r(p)) + Σ_{q ∈ N_p} P1 · T[|d_r(p) − d_r(q)| = 1] + Σ_{q ∈ N_p} P2 · T[|d_r(p) − d_r(q)| > 1] ).    (2)

With the first term, the matching costs of all pixels are accumulated for a given d_r. With the second term, all neighborhood pixels q with a neighboring plane parameterization, i.e. the index of d_r(p) and d_r(q) only changes by a maximum of one, are penalized with P1. Neighboring pixels with other plane parameterizations are penalized with P2. The optimization of Equation 2 is done by dynamic programming along eight aggregation paths as suggested by Hirschmueller (2008).
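The dynamic programming along a single aggregation path can be sketched as follows; this is a simplified scalar version of the SGM recursion, where subtracting the previous minimum keeps the aggregated values bounded, as suggested by Hirschmueller (2008):

```python
def sgm_path(cost, P1, P2):
    """Aggregate matching costs along one scanline as in SGM.

    `cost` is a list of per-pixel cost vectors over D plane hypotheses;
    P1 penalizes index changes of one, P2 penalizes larger jumps."""
    D = len(cost[0])
    L = [list(cost[0])]                                # first pixel: raw cost
    for c in cost[1:]:
        prev = L[-1]
        m = min(prev)
        row = []
        for d in range(D):
            best = min(
                prev[d],                               # same plane index
                (prev[d - 1] + P1) if d > 0 else float('inf'),
                (prev[d + 1] + P1) if d < D - 1 else float('inf'),
                m + P2,                                # any larger jump
            )
            row.append(c[d] + best - m)                # keep values bounded
        L.append(row)
    return L
```

For two pixels whose raw costs prefer neighboring plane indices, the aggregation charges only the small penalty P1 for the transition, so the winning index still follows the data term.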
In order to enforce discontinuities at object boundaries, we use a line segment detector (LSD) to extract a line image I_line_ref from the reference image. Thus, we reduce P2 to P1 if a given pixel p lies on a line segment of I_line_ref, allowing neighboring pixels with strong discontinuities to influence the selection of the optimal d_r. Depending on the parameterization of the LSD, most line segments are detected at object boundaries, which allows us to enforce strong edges in the depth image. Having found the optimal per-pixel plane parameterization Π̂(p), the final depth image is calculated by intersecting a viewing ray through pixel p with Π̂(p).
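The final ray-plane intersection can be illustrated as follows; the plane convention n^T X = d and the use of the z-component of the intersection point as depth are assumptions of this sketch:

```python
import numpy as np

def depth_from_plane(pixel, K, n, d):
    """Depth of a pixel obtained by intersecting its viewing ray with the
    optimal plane (n, d), both expressed in the reference camera frame.
    Plane convention: n^T X = d for points X on the plane."""
    p_h = np.array([pixel[0], pixel[1], 1.0])
    ray = np.linalg.inv(K) @ p_h          # direction of the viewing ray
    lam = d / (n @ ray)                   # solve n^T (lam * ray) = d
    return lam * ray[2]                   # z-component = depth
```

For a fronto-parallel plane (n along the optical axis) every pixel of the plane receives the same depth, as expected.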
For a selective reconstruction that is confined to the identified RoIs, we perform the per-plane image warping on the complete images, yet the matching and the subsequent SGM optimization are only done within the given RoIs, treating each bounding box individually. We have optimized our algorithm with CUDA, achieving real-time performance when aiming to estimate depth maps for every key-frame; key-frames are typically generated at 1 Hz - 2 Hz by state-of-the-art SLAM systems. This enables the online analysis of the depicted scene.

EXPERIMENTS
Before presenting the results achieved by the Faster R-CNN and our algorithm for depth estimation, we introduce the datasets used for training and evaluation.

Datasets
Deep learning based approaches, such as the Faster R-CNN, require a large amount of training data. To the best of our knowledge, there exists no suitable training dataset for object detection from oblique aerial imagery. Assuming that such images depict both the façade and the roof of buildings, we have adopted two different datasets for training the CNN, namely the Cityscapes dataset for semantic urban scene understanding (Cordts et al., 2016) and the dataset given with the ISPRS benchmark for multi-platform photogrammetry (Nex et al., 2015).
The Cityscapes dataset (cf. Figure 3(a)) consists of a large number of images which were captured from a car driving through urban areas. It provides a pixel-wise semantic ground truth. We generated a ground truth for building extraction by converting the semantic labeling of the building class into axis-aligned bounding boxes, needed for the training of the Faster R-CNN. The dataset provides disjoint subsets used for training and evaluation. The training subset consists of approx. 2900 images, whereas the validation subset contains 500 images.
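The conversion from the pixel-wise semantic ground truth to axis-aligned bounding boxes can be sketched as follows; this handles a single building instance, while separating multiple instances would additionally require connected-component labeling:

```python
def mask_to_bbox(mask):
    """Convert a binary per-pixel building mask (nested lists, 1 = building)
    into the axis-aligned bounding box (x_min, y_min, x_max, y_max)
    needed for training the Faster R-CNN; None if the mask is empty."""
    coords = [(x, y) for y, row in enumerate(mask)
                     for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))
```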
The Dortmund-Zeche-Zollern subset of the ISPRS benchmark (cf. Figure 3(b)) consists of a few large images with approx. 48 megapixels captured from nadir and oblique viewpoints. Due to the large image size, we have cropped each input image into 64 equal-sized subimages with a size of 1022 × 766 pixels. Because this benchmark does not contain any annotated ground truth, we have selected 700 oblique images and annotated the depicted buildings with bounding boxes. We have divided the annotated images into a training and a validation subset containing 500 and 200 images, respectively.
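Assuming the 64 subimages form a regular 8 × 8 grid, which is consistent with the stated crop size of 1022 × 766 pixels, the tiling can be sketched as:

```python
def tile_image(width, height, cols=8, rows=8):
    """Split an image of the given size into a regular grid of
    cols x rows crop windows (x, y, w, h); 8 x 8 = 64 tiles."""
    w, h = width // cols, height // rows
    return [(c * w, r * h, w, h) for r in range(rows) for c in range(cols)]
```

For an 8176 × 6128 input this yields exactly 64 tiles of 1022 × 766 pixels.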
To further evaluate the performance of the Faster R-CNN in detecting buildings on a different dataset, we have generated a semi-synthetic dataset from GoogleEarth (cf. Figure 3(c)), for which we have selected seven different scenes with underlying model data. A fourth dataset was used to demonstrate the performance of our approach on a real-world scenario (cf. Figure 3(d)).

In order to increase the amount of training data and to counter overfitting, we have augmented the initial training sets with affinely transformed copies of the input images. First, we horizontally flipped the input data to enlarge the database and generalize the model. Second, due to the perspective of the Cityscapes image data, most bounding boxes surrounding buildings are of a slim horizontal shape. To prevent overfitting to such geometry, we have rotated the images by 90° and added them to the dataset. Third, to accommodate for different object sizes, the input images were randomly downsampled by three different scaling factors. As the image sizes of the Cityscapes and the ISPRS datasets differ, we used different scaling factors for each dataset.
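The image-side part of the described augmentation can be sketched with minimal helpers; the corresponding ground-truth bounding boxes would have to be transformed accordingly, which is omitted here:

```python
def hflip(img):
    """Mirror an image (given as nested lists of pixel rows) horizontally."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate an image by 90 degrees (clockwise)."""
    return [list(row) for row in zip(*img[::-1])]

def downsample(img, factor):
    """Naive downsampling by keeping every `factor`-th pixel in both axes
    (a stand-in for proper interpolation-based rescaling)."""
    return [row[::factor] for row in img[::factor]]
```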

Results
We have evaluated the performance of the Faster R-CNN in detecting buildings on the validation subsets of the Cityscapes and ISPRS datasets as well as on the generated GoogleEarth dataset. We have evaluated models of the Faster R-CNN that were trained with the three previously described training configurations. For the evaluation on the Cityscapes dataset, we have downscaled the test images to 40% of their original size. The test images of the ISPRS dataset were used in their original size, while the GoogleEarth images were scaled to 80%. These values were determined empirically, giving the best results for each dataset. We have set the RPN to generate 12,000 proposals for each image. Table 1 shows the average precision (AP) achieved by the three training configurations in detecting buildings on the three evaluation datasets. For the Soft-NMS, we have used a linear decay function with an NMS threshold of 0.3. Furthermore, in Figure 4(a) we have plotted the precision-recall curves of Conf1-3, evaluated on the validation set of the Cityscapes dataset. Figures 4(b) and 4(c) show the corresponding curves for the ISPRS and GoogleEarth datasets. The Faster R-CNN assigns each detection a score expressing the confidence of classifying the object as a building. These scores are displayed above each RoI together with the corresponding class name of the detected object. In Figure 5, only objects with a score > 0.8 are displayed.
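The average precision values in Table 1 correspond to the area under the precision-recall curve; a common way to compute it from sampled curve points, with the usual monotone interpolation of precision, is:

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve, with precision made
    monotonically non-increasing from right to left before integration.
    `recalls` must be sorted in increasing order."""
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, interp):
        ap += (r - prev_r) * p            # rectangle up to this recall level
        prev_r = r
    return ap
```

A detector that holds precision 1.0 up to recall 0.5 and then drops to 0.5 at full recall, for example, scores an AP of 0.75.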
All training and experiments were performed on a NVIDIA Titan X. The runtime to process one image for the Cityscapes dataset is approx. 140 ms, for the ISPRS dataset approx. 160 ms and for the GoogleEarth dataset approx. 285 ms. The differences arise from the different image sizes which were used for the experiment.

Selective Image-based Depth Estimation
We have evaluated our approach for selective image-based depth estimation on two datasets. First, we have generated an image set, made up of five different viewpoints around a reference image (cf. Figure 6(a)), from one of the scenes of the GoogleEarth dataset. The scene shows a few isolated buildings surrounded by lots of vegetation, which is one of the key motivations to use an object detection algorithm to extract RoIs prior to the depth estimation.
Further experiments are done on imagery taken from our own dataset (cf. Figures 3(d) and 6(d)). Again, it shows an isolated object of interest surrounded by lots of scenery not relevant for the task of building reconstruction. It was captured from 15 m altitude.
We have parameterized our algorithm to use 128 fronto-parallel planes. The penalties of our SGM optimization were set to P1 = 5 and P2 = 50 in order to enforce smooth reconstructions within object boundaries. The LSD was used with the presented default configurations and the input images were downsampled to half of their size. The camera projection matrices were computed using multicore bundle adjustment (Wu et al., 2011). We employ a final 3 × 3 median filter to reduce outliers. Again, all experiments were performed on a NVIDIA Titan X. The depth estimation for the full image took approx. 510 ms. The selective depth estimation reduced the complete runtime to an average of 226 ms.

DISCUSSION
We consider the underlined numbers in Table 1 as baselines, because they represent the performance of the Faster R-CNN in detecting buildings when respectively trained and evaluated on the same kind of data. Compared to these baselines, we achieve equal or better results when training the Faster R-CNN with a combination of the Cityscapes and the ISPRS datasets, i.e. Conf3. Even on a dataset which it has not seen during training, i.e. the GoogleEarth dataset, the Conf3 model achieves results equal to those of the Conf1 model evaluated on the Cityscapes dataset.
The loss in performance w.r.t. the GoogleEarth dataset can be attributed to different factors. First, it is a different dataset that was not used during training. We assume that the model does not generalize well to building structures which are atypical w.r.t. the training data. Figure 5(f) shows that the model does not identify the building blocks on the right, as well as the individual buildings in the center. Furthermore, the field on the left is falsely detected as a building with a significantly high score. A second factor may be the quality of the underlying 3d model of GoogleEarth and the projected textures. A poor quality of the model or the texture may result in false detections. Nonetheless, the results achieved by Conf3 are significantly better than the ones achieved by the other models on this dataset.
The precision-recall curves reveal how the different training datasets influence the results achieved. Figure 4(a) shows that due to the size of the Cityscapes dataset, in comparison to the annotated subset of the ISPRS dataset, the Conf3 model delivers very similar results as the Conf1 model. Figure 4(b) shows that the additional use of the Cityscapes dataset improves the results compared to the baseline of the ISPRS dataset, i.e. Conf2.
Nonetheless, in order to detect buildings from aerial imagery, the use of the ISPRS training data is needed, as Figures 4(b) and 4(c) show that Conf1 does not achieve any reasonable results.
The training process of the Faster R-CNN keeps the first two convolutional layers of the VGG-16 network fixed (Girshick, 2015). Changing this so that the weights of all layers are trained worsens the results by 1-2%. A similar effect was observed when a dense-sparse-dense training (Han et al., 2017) was performed in order to regularize the model. In further experiments, we have evaluated the use of the standard NMS instead of the Soft-NMS. As expected, the results achieved were inferior to those presented.
The results in Figure 6 show that our algorithm for image-based depth estimation allows reconstructing building structures while preserving sharp edges at object boundaries. Our algorithm achieves real-time performance for the full reconstruction. Nonetheless, the runtime can be reduced by 50% if only objects of interest are considered. The results also reveal that there seems to be no significant loss in quality of the selective reconstruction compared to the full depth map. As our main focus in this paper is the use of the Faster R-CNN to extract buildings from aerial imagery, an elaborate performance evaluation of our image-based depth estimation algorithm is done in other work.
As the building detection currently only extracts axis-aligned RoIs, parts of irrelevant objects may be encapsulated in the bounding boxes. To overcome this, one could use a semantic segmentation that attributes a specific object class to each pixel. However, as our depth estimation algorithm is parallelized on the GPU, it needs to be invoked on a regular grid. The recently published Mask R-CNN (He et al., 2017) would allow an extraction of bounding boxes prior to reconstruction and a pixel-wise post-filtering of the resulting depth images.
Altogether, this paper shows that we can use transfer learning of the Faster R-CNN for building extraction from oblique aerial imagery, by using a combination of a large ground-based dataset and a smaller aerial dataset. The results show that the use of the large Cityscapes dataset is necessary to learn building-related features. Nonetheless, a small aerial training dataset is vital to generalize the model to perform detections in aerial imagery. With our approach, we achieve a real-time extraction and selective reconstruction of building structures. However, a registration of the reference frame w.r.t. the city model is needed in order to perform a 3d change detection and analysis.

CONCLUSION & FUTURE WORK
In summary, we present a methodology for cross-domain building extraction from aerial imagery with CNNs. We combine a large ground-based dataset for urban scene understanding with a smaller number of images from an aerial dataset to train the Faster R-CNN method for real-time deep object detection. We achieve an AP of about 80% for the task of building extraction on a selected evaluation dataset. In combination with the presented algorithm for real-time image-based depth estimation, this allows us to selectively reconstruct buildings from aerial imagery while preserving sharp edges. These reconstructions can be used for 3d scene analysis or change detection in city models.
As axis-aligned bounding boxes do not always appropriately approximate the building outlines, we aim to incorporate a semantic segmentation for post-filtering the depth map. So far, in the image-based depth estimation, we perform the same image warpings for all identified RoIs. In future work, different homographies are to be used for each RoI, which allows to further adjust the sampling process to the object of interest. Finally, we plan to combine our approach with an algorithm to register the image data against textureless model data, allowing us to match the depth map against the model data for structural change detection.