Y-SHAPED CONVOLUTIONAL NEURAL NETWORK FOR 3D ROOF ELEMENTS EXTRACTION TO RECONSTRUCT BUILDING MODELS FROM A SINGLE AERIAL IMAGE

Fast and efficient detection and reconstruction of buildings have become essential in real-time applications such as navigation, 3D rendering, augmented reality, and 3D smart cities. In this study, a modern Deep Learning (DL)-based framework is proposed for simultaneous automatic detection, localization, and height estimation of buildings from a single aerial image. The proposed framework is based on a Y-shaped Convolutional Neural Network (Y-Net) which includes one encoder and two decoders. The input of the network is a single RGB image, while the outputs are the predicted height information of buildings as well as the rooflines in three classes of eave, ridge, and hip lines. The knowledge extracted by the Y-Net (i.e. buildings' heights and rooflines) is utilized for 3D reconstruction of buildings based on the third Level of Detail (LoD2). The main steps of the proposed approach are data preparation, CNN training, and 3D reconstruction. For the experimental investigations, airborne data from Potsdam, provided by ISPRS, are used. For the predicted heights, the results show an average Root Mean Square Error (RMSE) and a Normalized Median Absolute Deviation (NMAD) of about 3.8 m and 1.3 m, respectively. Moreover, the overall accuracy of the extracted rooflines is about 86%.


INTRODUCTION
Buildings are the most prominent objects in urban scenes; thus, measuring and analyzing the 3D shapes and positions of buildings is essential for many applications such as 3D map updating, urban management, smart cities, monitoring, navigation and mapping, civil infrastructure inspection, and scene understanding. Hence, a considerable body of research in photogrammetry and remote sensing is dedicated to automatic building detection, localization, and reconstruction. As a general categorization, current algorithms for 3D Building Reconstruction (3DBR) can be divided into three basic methods: data-driven (Awrangjeb et al., 2018; Cheng et al., 2011; Kim and Shan, 2011; Sampath and Shan, 2010; Yan et al., 2017), model-driven (Huang et al., 2011; Partovi et al., 2015; Zhang et al., 2014; Zheng et al., 2017), and hybrid methods (Wang et al., 2016; Xiong et al., 2015). The differences between the data-driven and model-driven methods have been discussed in previous studies (Tarsha-Kurdi et al., 2006; R. Wang et al., 2018). Remotely sensed data such as stereo aerial and satellite images or LiDAR data are the main sources for extracting 3D information of urban objects using photogrammetric techniques. However, these data sources are not available everywhere, and generating updated Digital Surface Models (DSMs) requires considerable effort, time, and cost, especially for large areas. Moreover, it is sometimes not possible to capture images from different views to reconstruct 3D models because of obstacles and occluded areas or limited acquisition time.
To address this issue, many investigations attempt to reconstruct 3D scenes from monocular images, such as single satellite and aerial images, as a low-cost solution for rapid 3D mapping and fast 3D visualization and rendering of urban scenes. As is widely known, 3D reconstruction from a single satellite or aerial image is a difficult ill-posed problem because of inherent ambiguities related to the scale and shape of the object. However, it is possible to extract structural information of the objects or to measure topology and geometry constraints from a single image in order to generate 3D models. One of the state-of-the-art approaches to extracting high-level information from a single remotely sensed image is based on deep learning algorithms and Convolutional Neural Networks (CNNs). The high-level information consists of semantic or geometric features such as the depth or height of objects, land cover labels, textures, or camera exterior parameters that can be integrated to reconstruct the 3D shape of an object. However, buildings' heights or footprints alone are not sufficient for 3D reconstruction of buildings; the geometric structures of building roofs, such as planes and linear elements, are also required for 3DBR. In this paper, the proposed approach for 3DBR is based on extracting high-level knowledge from an RGB image and combining it to generate parametric models. The required knowledge for 3DBR includes the location of buildings, the linear elements of building roofs (i.e. rooflines) such as eave, ridge, and hip lines, as well as the heights of buildings (e.g. normalized DSMs, nDSMs), which are effective in reducing the complexity of reconstruction. However, extracting 3D information from a single 2D image is theoretically an under-constrained problem.
Therefore, a novel method including a Y-shaped Convolutional Neural Network (Y-Net) is employed to extract nDSMs as well as segmented linear elements of building roofs, simultaneously, from single RGB images. This work's contributions are as follows.
• 3D parametric models of buildings (LoD2) can be constructed from a single RGB image, contributing to better understanding and interpretation of 3D scenes in real-time applications;
• Unlike the traditional photogrammetric techniques for 3DBR from a single image, the proposed method can extract the height information from non-oblique and nearly vertical single images using a CNN;
• Since nDSMs and rooflines share high-level features and representations of a building, the geometric structures of buildings can be learned efficiently during a two-stream network training.

RELATED WORK
Several studies address 3D reconstruction of objects from a single image using photogrammetric techniques. These studies mostly rely on detecting vanishing features (e.g. points and lines) and on estimating the camera calibration parameters from oblique images. One of the earliest studies to recover 3D information from a single image is based on deriving geometric constraints such as image lines and object topologies during image interpretation (Van Den Heuvel, 1998). Jizhou et al. (2004) proposed a framework to extract the height of buildings from an oblique UAV-based image. Their framework is based on the extraction of parallel lines and view angles of buildings; however, they also employed digital maps to calculate the scale of the 3D models. González-Aguilera et al. (2005) developed software to extract 3D models based on the vanishing point geometry of an oblique image. Later, they improved the accuracy of extracting vanishing points and lines using the RANSAC algorithm (Gonzalez-Aguilera and Gomez-Lahoz, 2008). More recently, deep learning algorithms have shown remarkable performance in the automatic 3D reconstruction of objects from single RGB images in computer vision applications (Fan et al., 2016; Henderson and Ferrari, 2019; Wu et al., 2017). In photogrammetry and remote sensing, CNNs can be employed to extract height information such as DSMs from single aerial or satellite images (Amini Amirkolaee and Arefi, 2019; Ghamisi and Yokoya, 2018), as well as for building detection and footprint extraction (Aamir et al., 2019; Wu et al., 2018; Xu et al., 2018; Yang et al., 2018). Li et al. (2019) used two independent CNNs for land cover classification and building height estimation from single satellite images. The CNN for the height estimation task is a fully connected network and estimates a fixed height value for each building block for 3D reconstruction in LoD1.
Tripodi et al. (2019) employed a U-Net to extract the building footprints from single satellite images. Since the footprints carry no extra information about the shapes of the building roofs, the final 3D models are in LoD1 only.

PROPOSED METHOD
As shown in Figure 1, the proposed framework for 3DBR based on the Y-Net includes three main steps: data preparation, CNN training, and 3D reconstruction. First, a training dataset is generated for height prediction and roofline extraction. Next, a Y-shaped CNN is designed which includes one encoder block to extract features from input images and two decoder blocks to convert the extracted features to nDSMs and rooflines. After the Y-Net is trained on the generated training dataset, it is applied to a test image to extract the essential knowledge for 3DBR. In the third step of the proposed approach, the predicted rooflines and nDSMs are combined in order to generate parametric models of buildings in LoD2, according to the CityGML Standard. The summary of each step and its main components are given in the following sub-sections.

Data Preparation
The main data used in this study include aerial orthophotos and the corresponding DSMs. The training dataset required for the proposed framework, however, must be composed of RGB images (Figure 2, a) and the corresponding nDSMs (Figure 2, b) as well as rooflines (Figure 2, c). Therefore, the Digital Terrain Models (DTMs) are first generated from the DSMs by employing the progressive TIN densification algorithm (Axelsson, 2000), and then the nDSMs are calculated by subtracting the DTMs from the DSMs. The nDSMs contain the absolute height values of urban objects above the bare Earth. To generate the corresponding rooflines, the linear elements of individual roofs are manually digitized on the aerial orthophotos into three classes of eave, ridge, and hip lines. Next, the vector-based data are converted to raster images with three RGB channels for the three classes of rooflines (i.e. R for eave lines, G for ridge lines, and B for hip lines), as shown in Figure 2, c.
In the pre-processing step, several image tiles are cropped from the generated training dataset and resized to 224×224×n, where n is equal to 3 for orthophotos and rooflines, and 1 for nDSM tiles. Moreover, the number of training samples is increased using different data augmentation techniques such as scaling, rotating, and flipping operations.
Figure 2. A sample of generated training data including a: the RGB image; b: the nDSM; c: rooflines
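The data preparation steps above can be illustrated with a minimal Python sketch. It assumes the DSM and DTM are already co-registered grids of equal size (the progressive TIN densification itself is not shown), and the function names are hypothetical. The key point is that each augmentation must be applied identically to the RGB tile, the nDSM tile, and the roofline tile so that they stay aligned.

```python
def compute_ndsm(dsm, dtm):
    """Cell-wise nDSM = DSM - DTM for two equally sized height grids (metres)."""
    return [[s - t for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(dsm, dtm)]


def flip_horizontal(tile):
    """Mirror a tile left to right."""
    return [row[::-1] for row in tile]


def rotate90(tile):
    """Rotate a tile clockwise by 90 degrees."""
    return [list(col) for col in zip(*tile[::-1])]


def augment_sample(rgb, ndsm, rooflines):
    """Apply the same geometric transform to all three layers of one sample.

    Assumption: only an identity, a flip, and a 90-degree rotation are shown;
    the scaling mentioned in the text is omitted for brevity.
    """
    transforms = [lambda t: t, flip_horizontal, rotate90]
    return [(f(rgb), f(ndsm), f(rooflines)) for f in transforms]


# Toy 2x2 grids: surface heights and a flat terrain at 33 m.
dsm = [[35.0, 42.5], [33.0, 33.5]]
dtm = [[33.0, 33.0], [33.0, 33.0]]
ndsm = compute_ndsm(dsm, dtm)
print(ndsm)
```

Because the tiles are plain nested lists, a pixel may itself be an (R, G, B) tuple, so the same helpers work for the three-channel orthophoto and roofline tiles.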

CNN Training
In this paper, a novel convolutional-deconvolutional network (Y-Net) is proposed to extract the height data and rooflines simultaneously from a single image. The network includes one encoder and two decoders. The structure of the Y-Net is shown in Figure 3. The encoder extracts high-level features from RGB images.
The first part of the encoder is an inception-based module with three different filter sizes (i.e. 1×1, 3×3, and 5×5), which allows the network to take advantage of multi-level feature extraction and improves the generalization capability of the network. For instance, it extracts general (5×5) and local (1×1) features at the same time. Next, there are seven convolutional layers followed by Batch Normalization (BN) and Rectified Linear Unit (ReLU) layers to generate feature maps, as well as three max-pooling layers, each of which reduces the size of the feature maps by a factor of 2. The convolutional layers use 3×3 kernels with a stride of 1 and the max-pooling layers use 2×2 kernels with a stride of 2. In the last part of the encoder, there are three modified residual blocks. The ideas of skip connections and residual blocks were first introduced by He et al. (2016). However, the ReLU layers perturb the data flowing through identity connections. Therefore, compared to the original residual block, the ReLU layers after the addition are removed in the proposed architecture in order to boost the performance of the network. The Y-Net includes two decoders which mirror the encoder. One of the decoders contains the parameters of a regression-based model to convert high-level features into height values of objects (nDSMs), while the other one addresses a segmentation-based problem and converts features into rooflines. Since nDSMs and rooflines share the same high-level features and representations of buildings, we applied a weight-sharing constraint between three convolutional layers of the two decoders. By sharing features between the two decoders, the network is able to estimate more accurate nDSMs along the rooflines, which are important for 3DBR. The size of the input is 224×224×3, while the output sizes are 224×224×1 and 224×224×3 for the predicted nDSMs and rooflines, respectively. To train the Y-Net, the trainable parameters are initialized with random values.
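Two details of the encoder can be made concrete with a small Python sketch: how the three max-pooling layers shrink a 224×224 input, and how the modified residual block differs from the original one. The padding of the 3×3 convolutions is not stated in the text; a padding of 1 is assumed here so that the convolutions preserve the spatial size. The toy branch function and all names are hypothetical, and this is not the full Y-Net.

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Spatial output size of a convolution (pad=1 is an assumption)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial output size of a max-pooling layer."""
    return (size - kernel) // stride + 1


def relu(values):
    return [max(0.0, v) for v in values]


def residual_original(x, branch):
    """Residual block with a ReLU applied after the addition."""
    return relu([a + b for a, b in zip(x, branch(x))])


def residual_modified(x, branch):
    """Modified block: the ReLU after the addition is removed."""
    return [a + b for a, b in zip(x, branch(x))]


# Three conv + pool stages: 224 -> 112 -> 56 -> 28.
size = 224
for _ in range(3):
    size = pool_out(conv_out(size))
print(size)


def toy_branch(values):
    """Stand-in for the convolutional branch of a residual block."""
    return [0.5 * v for v in values]


x = [1.0, -2.0]
print(residual_original(x, toy_branch))  # negative activation is clipped
print(residual_modified(x, toy_branch))  # negative activation is preserved
```

In the modified block the signal on the identity path reaches the next block unchanged in sign, which is the behaviour the text attributes to removing the ReLU after the addition.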
Moreover, the berHu loss function (Laina et al., 2016) is applied for nDSM prediction, given by Equation 1, while the logistic log loss is used for roofline segmentation, given by Equation 2. The combination of the loss functions is utilized as Equation 3 and the network is trained using the Adam optimizer (Kingma and Ba, 2015).
$$B(x) = \begin{cases} |x|, & |x| \le c \\ \dfrac{x^2 + c^2}{2c}, & |x| > c \end{cases} \qquad (1)$$
where x is the difference between the predicted and ground truth values, and c is 20% of the maximal per-batch error.
$$L(\hat{y}, c) = \log\left(1 + e^{-c\,\hat{y}}\right) \qquad (2)$$
where ŷ is the predicted score and c is a binary attribute of the ground truth values in (+1, -1).
Here, +1 denotes the presence of an attribute, and -1 denotes its absence.
In Equation 3, the scale factor for combining the two loss functions is equal to 0.001 in this study.
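The loss terms can be sketched in a few lines of Python. The berHu form follows the definition commonly given for Laina et al. (2016), and the logistic log loss is the standard binary form for labels in {+1, -1}. How the two terms are combined is an assumption in this sketch: the scale factor of 0.001 is applied to the roofline term, and the text available here does not confirm which term it weights. All function names are hypothetical.

```python
import math


def berhu(residuals):
    """Reverse Huber loss per element; c is 20% of the largest absolute residual."""
    c = 0.2 * max(abs(r) for r in residuals)
    if c == 0.0:
        return [0.0 for _ in residuals]
    return [abs(r) if abs(r) <= c else (r * r + c * c) / (2.0 * c)
            for r in residuals]


def logistic_log_loss(score, label):
    """log(1 + exp(-label * score)) with label in {+1, -1}."""
    return math.log1p(math.exp(-label * score))


def combined_loss(height_residuals, scores, labels, scale=0.001):
    """Mean berHu plus scaled mean logistic loss.

    Assumption: the scale factor multiplies the segmentation term.
    """
    height_term = sum(berhu(height_residuals)) / len(height_residuals)
    roofline_term = sum(logistic_log_loss(s, c)
                        for s, c in zip(scores, labels)) / len(scores)
    return height_term + scale * roofline_term


# Residuals in metres: the threshold c is 0.2 * 5.0 = 1.0.
print(berhu([0.5, -1.0, 5.0]))
print(logistic_log_loss(0.0, +1))
```

Small residuals are penalized linearly, like an L1 loss, while residuals above the threshold are penalized quadratically, so large height errors dominate the regression term.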

3D Reconstruction
The proposed approach for 3DBR from a single image relies on extracting the essential geometrical knowledge of buildings, namely nDSMs (Figure 6, b) and rooflines (Figure 6, c), by applying the trained Y-Net to a test image (Figure 6, a). The predicted rooflines in the three classes of eave, ridge, and hip lines are used to define the locations and orientations of individual building parts. In this approach, most of the building blocks are decomposed into individual building parts, including flat, gable, or hip buildings, by analysing the predicted rooflines. In the first step of the proposed approach for 3DBR, the predicted rooflines are pre-processed to remove all small and noisy segments. Next, binary polygons of building blocks (Figure 6, e) are generated using the first channel of the rooflines, which is mostly composed of eave lines (Figure 6, d).
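The text does not specify how small and noisy roofline segments are removed. One simple possibility, assumed here purely for illustration, is to drop 8-connected components of a binarized roofline channel that contain fewer pixels than a threshold. The function name and the threshold value are hypothetical.

```python
from collections import deque


def remove_small_segments(mask, min_pixels):
    """Keep only 8-connected components with at least min_pixels pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    cleaned = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(component) >= min_pixels:
                for y, x in component:
                    cleaned[y][x] = 1
    return cleaned


# A three-pixel eave segment and one isolated noisy pixel.
eave_mask = [
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
cleaned_mask = remove_small_segments(eave_mask, min_pixels=2)
print(cleaned_mask)
```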
The Minimum Bounding Rectangle (MBR)-based technique (Arefi and Reinartz, 2013) is then employed to enhance the binary polygons and convert them to regularized and approximated polygons (Figure 6, f). The approximated binary polygons are the initial primitives for the prismatic models of building blocks (i.e. LoD1). Next, a rule-based search technique (Alidoost et al., 2019) is utilized to decompose the building blocks into individual buildings (Figure 6, g). To this end, the approximated binary polygons and eave lines are rotated based on the main orientation of the building block. Next, for each binary polygon, all vertical or horizontal eave lines that lie inside the polygon and have their endpoints on its boundary are used to split the block into individual buildings (Figure 4). The second channel of the predicted rooflines contains the ridge lines (Figure 6, h), which are utilized to generate parametric models of buildings (i.e. LoD2). The ridge line of each individual building is extracted by analyzing the predicted ridge lines inside each binary polygon (Alidoost et al., 2019). A polyline that is parallel to the main orientation of the individual roof and crosses the center of the polygon is the main ridge line, as shown in Figure 5, b. Then, an optimized line is fitted to the candidate polyline to generate the regularized ridge line for the roof (Figure 5, c). The ridge line is extended if the distances between its endpoints and the eave lines are less than 3 m. Finally, the hip lines are reconstructed by connecting the endpoints of the ridge lines to the vertices of the approximated polygons, and the median height values of the eave, ridge, and hip lines are extracted from the predicted nDSMs to generate the final 3D models (Figure 6, k).
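The ridge line regularization and height assignment described above can be sketched as follows. This is a simplified illustration, not the rule-based procedure of Alidoost et al. (2019): it assumes the building has already been rotated to its main orientation so that the ridge runs roughly along the x-axis, it uses an ordinary least-squares fit as the "optimized line", and it reads the eave positions from the polygon extent along that axis. All names are hypothetical.

```python
from statistics import median


def fit_line(points):
    """Least-squares fit y = a * x + b to (x, y) points of a candidate polyline."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        raise ValueError("points are vertical in this frame; rotate first")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extend_ridge(x_start, x_end, eave_min, eave_max, max_gap=3.0):
    """Snap a ridge endpoint to the eave if it is closer than max_gap metres."""
    if x_start - eave_min < max_gap:
        x_start = eave_min
    if eave_max - x_end < max_gap:
        x_end = eave_max
    return x_start, x_end


def median_height(ndsm, pixels):
    """Median nDSM value over the (row, col) pixels of one roofline."""
    return median(ndsm[r][c] for r, c in pixels)


slope, intercept = fit_line([(0, 1.0), (1, 1.2), (2, 0.8), (3, 1.0)])
ridge = extend_ridge(2.0, 17.0, eave_min=0.0, eave_max=25.0)
# The 40 m outlier does not affect the median.
height = median_height([[3.0, 9.0], [8.5, 40.0]], [(0, 1), (1, 0), (1, 1)])
print(slope, intercept, ridge, height)
```

In this toy case only the start of the ridge is snapped to the eave (gap of 2 m), which corresponds to a gable end, while the far end stays 8 m from the eave and would be connected to the polygon vertices by hip lines.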

EXPERIMENTS AND RESULTS
To evaluate the performance of the proposed approach, an airborne dataset from Potsdam, Germany, provided by ISPRS (ISPRS, 2018), is used. It consists of very high-resolution true orthophoto tiles with a ground sampling distance (GSD) of 5 cm and the corresponding DSMs derived from dense image matching techniques. Two non-overlapping areas of this dataset are selected for training the Y-Net and for 3D reconstruction, as shown in Figure 7. The training dataset includes 4,800 tiles of RGB images, nDSMs, and rooflines, which are increased to 24,000 tiles with a size of 224×224 after data augmentation. The Y-Net was trained using the training dataset on a single NVIDIA GTX 1080 Ti with a batch size of 10 for 100 epochs. The learning rate, beta 1, beta 2, and epsilon parameters of the Adam optimizer are set to 0.01, 0.9, 0.999, and 1 × 10⁻⁸, respectively.
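For orientation, the training setup above can be translated into a number of parameter updates and a single Adam update step (Kingma and Ba, 2015) with the reported hyperparameters. The sketch assumes that all 24,000 augmented tiles are used in every epoch, which the text does not state explicitly, and it updates a single scalar parameter for clarity. The function name is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t,
              lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Assumption: every augmented tile is seen once per epoch.
updates_per_epoch = 24000 // 10
total_updates = updates_per_epoch * 100
print(updates_per_epoch, total_updates)

# On the first step the update size is close to the learning rate.
param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(param)
```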

Figure 7. Overview of training and testing datasets
In addition to the test data from Potsdam (Areas 1-4 in Table 1), a second dataset from Zeebrugge, Belgium (IEEE, 2015), consisting of a true orthophoto with a GSD of 5 cm and LiDAR data with a 10 cm point spacing, is employed to assess the transferability of the trained network (Area 5 in Table 1). The trained Y-Net is applied to the testing RGB images, and the predicted nDSMs and rooflines are shown in Figure 8, compared to the ground truth data. The accuracy of the estimated nDSMs is evaluated based on standard metrics such as Mean Error (ME), Standard Deviation (SD), Root Mean Square Error (RMSE), Relative Error (REL), and Root Mean Squared Logarithmic Error (RMSLE), as well as robust statistical metrics such as Median Error (MeE), Normalized Median Absolute Deviation (NMAD), Quantile 68.3% (Q68.3), and Quantile 95% (Q95), as reported in Table 1. The results for the testing areas are also compared to other studies on the nDSM prediction task in Table 2. Since there are random noise, outliers, and systematic errors in the predicted nDSMs, robust metrics are useful for obtaining accurate and reliable assessments.
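The difference between standard and robust metrics can be shown on a toy error vector. The sketch below uses commonly used definitions, which are assumptions with respect to this paper: NMAD as 1.4826 times the median absolute deviation from the median error, and the quantiles taken over absolute errors with linear interpolation. The function names are hypothetical.

```python
import math
from statistics import median


def mean_error(errors):
    return sum(errors) / len(errors)


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def nmad(errors):
    """1.4826 * median(|e - median(e)|), a robust spread estimate."""
    med = median(errors)
    return 1.4826 * median(abs(e - med) for e in errors)


def abs_quantile(errors, q):
    """Quantile q (0..1) of the absolute errors, linearly interpolated."""
    values = sorted(abs(e) for e in errors)
    pos = q * (len(values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(values) - 1)
    return values[lo] + (pos - lo) * (values[hi] - values[lo])


# Height errors in metres (predicted minus reference) with one 20 m outlier.
errors = [0.5, -0.5, 1.0, -1.0, 20.0]
print(mean_error(errors), rmse(errors), median(errors), nmad(errors))
print(abs_quantile(errors, 0.683), abs_quantile(errors, 0.95))
```

In this example the single outlier pushes the RMSE to about 9 m, while the NMAD stays near 1.5 m. The same kind of gap appears between the reported RMSE (about 3.8 m) and NMAD (about 1.3 m).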

[Table 1: Metric [m] by testing area (Area1, Area2, Area3, Area4, Area5) and average over all areas (Ave., All)]
Table 2. Comparison between the proposed Y-Net and the state-of-the-art methods for nDSM prediction over the Potsdam dataset
As shown in Figure 8, not only are the linear elements of roofs extracted appropriately, but the buildings are also classified and distinguished from non-building objects such as trees and roads.
The accuracy and quality of the predicted rooflines are calculated using the standard quality measures of completeness (or recall), correctness (or precision), quality (McGlone and Shufelt, 1994; McKeown et al., 2000), the F1 score, and Overall Accuracy (OA), given by Equation (4).
$$\mathrm{Comp.} = \frac{TP}{TP + FN}; \quad \mathrm{Corr.} = \frac{TP}{TP + FP}; \quad \mathrm{Quality} = \frac{TP}{TP + FP + FN}; \quad F_1 = \frac{2 \cdot \mathrm{Comp.} \cdot \mathrm{Corr.}}{\mathrm{Comp.} + \mathrm{Corr.}}; \quad OA = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$
where TP is the true positive, FP is the false positive, FN is the false negative, and TN is the true negative. The quality measures of the testing areas for each class of rooflines (i.e. eave, ridge, and hip lines) are presented in Table 3. The results in Table 3 show that the eave lines are estimated with higher precision (about 92.8%) than the ridge and hip lines (about 80.8% and 53.4%, respectively). Accordingly, the trained Y-Net is better at distinguishing building from non-building objects than at separating ridge and hip lines, where some misclassification errors occur. Finally, the extracted nDSMs and rooflines are employed for 3D reconstruction of buildings. Since the predicted nDSM includes some outliers as well as systematic errors, the median of the height values is used for modeling each individual roof. On the other hand, the rooflines, which are extracted completely and correctly, are only utilized to generate approximated binary polygons. Accordingly, the 3D parametric models of buildings can be reconstructed by assigning the height values to the binary polygons and ridge lines, as shown in Figure 9. The geometrical accuracy of the generated 3D models is measured based on the 3D coordinates of the roof planes' vertices, compared to the ground truth, and reported as RMSxy and RMSz measures. Experimentally, we found that the RMSxy value of the 3D models is less than 0.5 m, while the RMSz value is about 3.8 m, which is within the accuracy range of the predicted nDSMs.
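As a worked example of the roofline quality measures, the following Python sketch computes completeness, correctness, quality, and the F1 score from pixel counts using their standard definitions. The counts are made up for illustration and are not taken from Table 3. The overall accuracy is left out because it additionally needs the number of true negatives.

```python
def roofline_quality(tp, fp, fn):
    """Standard detection measures from true/false positive and false negative counts."""
    completeness = tp / (tp + fn)          # recall
    correctness = tp / (tp + fp)           # precision
    quality = tp / (tp + fp + fn)
    f1 = 2.0 * completeness * correctness / (completeness + correctness)
    return {
        "completeness": completeness,
        "correctness": correctness,
        "quality": quality,
        "f1": f1,
    }


# Hypothetical counts for one roofline class.
scores = roofline_quality(tp=80, fp=10, fn=20)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The quality measure is always the strictest of the four, since it penalizes both false positives and false negatives in a single ratio.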
In addition, the quality measures of the building footprints in Figure 9 are reported in Table 4, inspired by the ISPRS guideline for the evaluation of building reconstruction (Rottensteiner, 2013).
Although the accuracy of the nDSMs predicted from single images using deep learning techniques such as the Y-Net is not comparable to that of high-resolution DSMs extracted from LiDAR data or image matching techniques, they provide valuable information for specific applications such as real-time navigation, rapid 3D rendering, land cover classification using RGB-depth fusion techniques, urban growth monitoring, change detection, and so on.

CONCLUSION
In this study, we presented a novel approach based on supervised deep learning techniques to extract nDSMs and rooflines of buildings from a single aerial image and to generate parametric models in LoD2. Unlike existing methods in photogrammetry and remote sensing that require both ortho images and high-resolution DSMs, the proposed method uses single RGB images and the power of CNNs to extract the valuable information which is essential for 3D representations of buildings. Although there are some limitations in producing a proper training dataset for rooflines, the results show reasonable performance of the proposed Y-Net, which predicts rooflines with an overall accuracy of 86% and nDSMs with an RMSE of 3.8 m for different test datasets.
Figure 8. The results of the predicted nDSMs and rooflines: a: the input RGB images from two test datasets; b: the reference nDSMs; c: the predicted nDSMs; d: the reference rooflines; e: the predicted rooflines
Figure 9. The results of 3D building reconstruction: a: the input RGB image; b: predicted nDSMs; c: predicted rooflines; d: binary polygons, extracted from eave lines; e: approximated polygons and best fitted ridge lines; f: differences between the building footprints and the ground truth; g: 3D models of roofs in LoD2
IEEE, 2015. GRSS Data Fusion Contest. Available online: http://www.grss-ieee.org/community/technicalcommittees/data-