Fast and Regularized Reconstruction of Building Fa\c{c}ades from Street-View Images using Binary Integer Programming

Regularized arrangement of primitives on building fa\c{c}ades to aligned locations and consistent sizes is important for structured reconstruction of urban environments. Mixed integer linear programming has been used to solve the problem; however, it is extremely time consuming even for state-of-the-art commercial solvers. To alleviate this issue, we cast the problem into binary integer programming, which removes the need for real-valued parameters and can be solved more efficiently. Firstly, the bounding boxes of the primitives are detected using the YOLOv3 architecture in real-time. Secondly, the coordinates of the upper left corners and the sizes of the bounding boxes are automatically clustered in a binary integer programming optimization, which jointly considers the geometric fitness, regularity and additional constraints; this step does not require \emph{a priori} knowledge, such as the number of clusters or pre-defined grammars. Finally, the regularized bounding boxes can be directly used to guide the fa\c{c}ade reconstruction in an interactive environment. Experimental evaluations show that the accuracies for the extraction of primitives are above 0.85, which is sufficient for the subsequent 3D reconstruction. The proposed approach takes only about $10\%$ to $20\%$ of the runtime of the previous approach and reduces the diversity of the bounding boxes to about $20\%$ to $50\%$.


INTRODUCTION
Reconstruction of building façades is one of the key steps towards complete reconstruction of a LOD-3 (Level of Detail) model in the CityGML protocol (Gröger, Plümer, 2012). Semantic objects such as windows, doors, and balconies are important components of a building façade. Extracting them (Hoegner, Stilla, 2015) and arranging them in a regularized manner (Hensel et al., 2019) are two important steps towards structured LOD-3 reconstruction (Zhu et al., 2020). Street-view imagery, such as Google Street View (Anguelov et al., 2010), is arguably the best data source for these objectives because it is publicly available and efficient to collect.
For the detection of semantic objects in street-view images, classical methods include the use of projected histograms (Lee, Nevatia, 2004, Kostelijk, 2012), gradient projection, K-means clustering (Recky, Leberl, 2010), correlation coefficients (Mayer, Reznik, 2007), and perceptual grouping (Sirmacek et al., 2011). Such methods do not consider the structural and spatial distribution of the semantic objects. Recently, methods based on deep learning (Mathias et al., 2016, Liu et al., 2017) have been widely used to extract the semantic objects on building façades and have achieved impressive results on images with projective distortion and scale differences; however, the regularities of the semantic objects have not yet been considered.
In general, these semantic objects should conform to certain regularities, such as aligned locations and consistent sizes. However, due to projective distortion and complex backgrounds, the geometric attributes of the primitives extracted from images of building façades generally deviate slightly from the expected values. Although the regularization of 2D boundaries, such as the edges of buildings, is widely studied in the community (Xie et al., 2018), these approaches cannot be directly adopted. In addition, the regular arrangements of façades can also be learned for specific scenarios (Dehbi, Plümer, 2011, Dehbi et al., 2017); however, the learned models are tied to the scenarios they were trained on and do not generalize to unseen data.
Recently, a general and promising approach to align different objects of building façades using Mixed Integer Linear Programming (MILP) was proposed (Hensel et al., 2019). However, in our practice the MILP is too complex to solve and requires prohibitively high runtime. Because we aim to integrate the pipeline into an interactive reconstruction environment, at least near real-time response of the solver is required. To solve this issue, we reformulate the problem as a Binary Integer Programming (BIP) problem, in which all the unknowns lie in the binary space {0, 1} and the objective can be expressed explicitly as logical operations on the binary variables. Compared with MILP, BIP is handled more efficiently by state-of-the-art optimization routines (Gleixner et al., 2018, Gurobi, 2014).
In summary, this paper proposes a fast and regularized reconstruction method for the façades of buildings from street-view images. Firstly, we extract typical façade primitives using a real-time object detection pipeline, e.g. the YOLOv3 architecture (Redmon et al., 2016, Redmon, Farhadi, 2018). Secondly, the positions and sizes of the primitives are clustered using BIP by optimizing the two competing goals of retaining the best fitness and the best regularity, for which we require no extra information about the façades. Finally, the clustered primitives are reconstructed in an interactive environment, e.g. SketchUp, by substituting each clustered primitive with a pre-built component model or by interactively sketching the component on the street-view images.

RELATED WORKS
Much work has been devoted to the extraction and segmentation of building façades in the communities of photogrammetry, computer vision and computer graphics. With regard to detecting façade objects from images, various deep learning architectures, such as CNN (Krizhevsky et al., 2012) and RNN (Graves et al., 2008), have in recent years achieved impressive results for computer vision tasks such as image classification (Chan et al., 2015) and object detection (Girshick et al., 2015). Although earlier CNN architectures greatly improve the accuracy of object detection, their detection speed is very slow, because several segregated steps (Girshick et al., 2015) are used, including the generation of proposals and the classification of the regions. For this reason, their usage in applications requiring real-time responses is limited. The YOLO (You Only Look Once) network (Redmon et al., 2016, Redmon, Farhadi, 2018), as the name suggests, only requires a single integrated forward pass in the testing stage and achieves real-time detection rates for off-the-shelf video sensors. The incrementally upgraded YOLOv3 (Redmon, Farhadi, 2018), due to the integration of ResNet (He et al., 2016), FPN (Feature Pyramid Network) (Lin et al., 2017), and binary cross entropy loss, greatly improves both detection speed and detection accuracy. It also improves the performance on small targets, which makes it suitable for detecting semantic objects with complex repeating structures on building façades. Therefore, this paper adopts YOLOv3 as the backbone for the detection of the primitives.
With regard to the regular arrangement of objects, explicit or implicit procedural methods infer the structure of the façade through grammatical rules, including random grammars (Alegre, Dellaert, 2004), syntax trees (Ripperda, Brenner, 2006), and hybrid bottom-up and top-down approaches (Han, Zhu, 2008). They all require the correct parameters of the shape syntax to be set in advance. Although these methods have achieved good results, they assume that the image is composed of a fairly regular grid; in addition, fixed grammar expressions cannot cover the diversity found in real-world applications. Procedural grammars are also quite cumbersome to edit and compile, which requires considerable expert knowledge. Human intervention is also required to select the appropriate grammar for a particular building. Although style classifiers (Mathias et al., 2016), which automatically recognize architectural styles from low-level image features, were developed to alleviate the above issues, the style syntax must still be available in advance, which remains a limitation of this approach.
Recent approaches based on mixed integer programming are arguably the most flexible and powerful tools for the problem of regular arrangement of objects. Mixed integer programming has been used for arrangements of 2D boundaries and 3D planes (Monszpart et al., 2015), reconstruction of surface meshes (Boulch et al., 2014, Nan, Wonka, 2017), modeling of the roof structures of LOD-2 models (Goebbels, Pohle-Fröhlich, 2019) and of façades (Hensel et al., 2019). However, most of them formulate the optimization problem as MILP (Goebbels, Pohle-Fröhlich, 2019, Hensel et al., 2019) or even mixed integer nonlinear programming (Monszpart et al., 2015), which has unknowns in both integer and real-valued spaces. Unfortunately, this kind of problem, which arises in operations research, has no efficient solvers at large scale, even in state-of-the-art commercial libraries (Gurobi, 2014). A practical remedy is to reformulate the problem as BIP (Nan, Wonka, 2017, Kelly et al., 2017, Kelly, Mitra, 2018), which only considers binary variables and linear energies; the regularities can still be explicitly modeled through logical operations on the binary variables, and relatively more efficient solvers exist for these simpler problems. Therefore, we use BIP to model the regularization problem of the façade objects.

DETECTION OF FAÇADE PRIMITIVES USING YOLOV3
We use YOLOv3 (Redmon, Farhadi, 2018) to detect axis-aligned bounding boxes of primitives because of its real-time performance. For completeness, we briefly introduce the architecture and implementation details of YOLOv3 here. Unlike region-based CNN methods (Girshick et al., 2015), YOLO (Redmon et al., 2016) uses regression to directly process the entire image, and obtains the categories and positions of the targets in a single forward propagation. YOLO implements an end-to-end pipeline for detection by dividing the image into an s × s grid of cells. If the center of a semantic component falls in a grid cell, that cell is responsible for predicting the target. Each grid cell generates B bounding boxes, and each bounding box predicts its confidence χ, which is defined as the product of the probability P that the bounding box contains a target and the accuracy Q, as χ = P × Q. If the grid cell contains a semantic object, then P = 1, otherwise P = 0. Q represents the intersection over union (IoU) of the labeled box in the training samples and the predicted box. When Q = 1, the labeled box and the predicted box coincide perfectly.
If a grid cell contains a semantic component, the conditional probability that it belongs to class i, among C classes, is denoted Pi. Therefore, we can obtain the intermediate score of each grid cell and each class as φi = Pi × χ. The scores are truncated at 0 and non-maximum suppression is used to remove bounding boxes that overlap strongly with a higher-scoring box. In the end, each bounding box only retains objects with positive confidence scores and the highest-scoring category. In YOLOv3, in order to improve the accuracy of target detection, the residual network (He et al., 2016) is used as the backbone.
The features before entering a residual block and the features output by the residual block are combined to extract deeper feature information. On a building façade, even semantic objects of the same type differ in size and pose. YOLOv3 uses multi-scale fusion (Lin et al., 2017) to detect objects, and adapts well to changes in object scale.
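The scoring and suppression steps described above can be illustrated with a minimal sketch. It is not the YOLOv3 implementation: the function names, the (x, y, w, h) box layout with (x, y) as the upper left corner, and the IoU threshold of 0.5 are assumptions made for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h),
    where (x, y) is the upper left corner."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def class_score(p_class, p_object, q):
    """phi_i = P_i * chi, with the confidence chi = P * Q."""
    return p_class * (p_object * q)


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by decreasing score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold.
    Boxes with non-positive scores are discarded (scores truncated at 0).
    The threshold value is an assumption, not taken from the paper."""
    order = sorted((i for i, s in enumerate(scores) if s > 0),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, `non_maximum_suppression([(0, 0, 10, 10), (1, 1, 10, 10), (30, 30, 5, 5)], [0.9, 0.8, 0.7])` keeps the first and third boxes, because the second overlaps the first with an IoU of 81/119.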

REGULAR ARRANGEMENTS OF FAÇADE PRIMITIVES USING BINARY INTEGER PROGRAMMING
After the initial extraction of the bounding boxes on the building façade, we use BIP to restore the spatial regularity of the windows, doors and balconies, inspired by previous work (Hensel et al., 2019). Although the MILP method has been successfully used in many studies (Boulch et al., 2014, Hensel et al., 2019), we are aiming at an interactive reconstruction pipeline, so the runtime should be kept reasonably low. In the following, we describe our reformulated problem setup using BIP instead of MILP.

Problem setup using binary integer programming
After extracting the initial primitives, we have $N$ bounding boxes for each image, and each bounding box is uniquely determined by a set of four parameters $(x, y, w, h)$, where $(x, y)$ and $(w, h)$ are the coordinates of the upper left corner and the size of the bounding box, respectively (Figure 1a). Instead of directly optimizing these real-valued parameters using MILP (Hensel et al., 2019), we cast the task into a model selection problem using BIP.
Specifically, we first establish a model space for each attribute of the bounding box, e.g. $X = \{X_1, X_2, \dots, X_{|X|}\}$ for the attribute of the $x$ coordinate. The size $|X|$ could be the number of bounding boxes $N$, but we choose to compress it by pre-clustering the model space using a conservative lower bound $\delta_l$, as described below. We then assign a binary variable $a^x_{i,k} \in \{0, 1\}$ to represent the state of the selection, i.e. whether the model $X_k$ is selected for the attribute $x$ of the $i$-th bounding box. In addition, we use the one-hot vector $\xi^x_i = (a^x_{i,1}, a^x_{i,2}, \dots, a^x_{i,|X|})^T$ to represent the whole selection state of the $i$-th bounding box for this attribute.
In fact, the model spaces of the attributes of the primitives for a single façade are generally quite limited in urban environments. That is, the ratio $N/|X|$ is generally quite large, so using one model per bounding box would introduce unnecessarily many parameters. Therefore, we pre-cluster all the attributes separately using the mean shift approach (Cheng, 1995), with the threshold set to the lower bound $\delta_l$. The values in the model space $X$ are determined by the centers of the clusters, as shown in Figure 1b. To ensure the accuracy of the results, the lower bound $\delta_l$ in mean shift clustering should be as small as possible to avoid aggregating parameters of different properties into the same category. It should be noted that, although the same threshold $\delta_l$ is used for all the attributes, the numbers of clusters $|X|$, $|Y|$, $|W|$ and $|H|$ are generally different.
In summary, the purpose is to optimize all the selection vectors $\xi$ under the energy functions and constraints described below. The total number of explicit unknowns is $N \times (|X| + |Y| + |W| + |H|)$.
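As an illustration of the pre-clustering step, the following is a minimal sketch of a one-dimensional mean shift with a flat kernel, written in plain Python. It assumes that the threshold δ_l plays the role of the kernel bandwidth and that modes closer than half a bandwidth are merged; both choices, and the function name `mean_shift_1d`, are assumptions for illustration and are not taken from the paper.

```python
def mean_shift_1d(values, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift for scalar attributes (e.g. x coordinates).
    Returns the sorted cluster centers, i.e. a candidate model space."""
    modes = []
    for v in values:
        m = float(v)
        for _ in range(max_iter):
            # Average all samples within one bandwidth of the current mode.
            window = [u for u in values if abs(u - m) <= bandwidth]
            new_m = sum(window) / len(window)
            if abs(new_m - m) < tol:
                break
            m = new_m
        modes.append(m)

    # Merge modes that converged to (nearly) the same location.
    groups = []
    for m in sorted(modes):
        if groups and m - groups[-1][-1] <= bandwidth / 2:
            groups[-1].append(m)
        else:
            groups.append([m])
    return [sum(g) / len(g) for g in groups]
```

Calling `mean_shift_1d([10.0, 11.0, 12.0, 50.0, 51.0, 100.0], bandwidth=4.0)` groups the six hypothetical coordinates into three centers, so that |X| = 3 instead of N = 6.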

Energy functions to be optimized
Our loss function consists of a data term and a regularity term. First of all, our goal is to keep the total change of the bounding boxes relative to their initial locations as small as possible after the regularization. Therefore, we first calculate the residual vector for each bounding box, which represents the errors for different selections, as

$$\epsilon^x_i = \left(|x_i - X_1|, |x_i - X_2|, \dots, |x_i - X_{|X|}|\right)^T, \tag{1}$$

where the superscript $x$ denotes the attribute; the residual vectors of the other attributes are defined in the same way.
In this way, the total energy $O^a_d$ for attribute $a$ caused by the selection vectors, e.g. offsets for the coordinates of the upper left corners and differences for the sizes of the rectangles, can be briefly expressed as

$$O^a_d = \sum_{i=1}^{N} (\epsilon^a_i)^T \xi^a_i. \tag{2}$$

Equation 2 means that, for each bounding box, we only account for the error of the selected value in the model space, i.e. when $a_{i,k} = 1$. The final data term of the energy function is therefore the summation over all the attributes,

$$O_d = \sum_{a \in \{x, y, w, h\}} O^a_d. \tag{3}$$

With only the data term, we always have a trivial solution with the best fit, e.g. choosing the nearest center of the mean shift clustering. Therefore, we introduce a regularity term. The intuition behind this term is that higher regularity generally means fewer categories; fortunately, the number of selected categories is easy to model, as illustrated in Figure 2. For each attribute $a$, the total number of selected categories, i.e. the regularity term $O^a_g$, can be explicitly expressed as the following logical expression,

$$O^a_g = \left\| \xi^a_1 \vee \xi^a_2 \vee \dots \vee \xi^a_N \right\|_1, \tag{4}$$

where $\|\cdot\|_1$ is the L1 norm, i.e. the sum of the absolute values of all the elements of a vector, which for binary variables simply counts the number of non-zero variables; the binary operator $\vee$ is the element-wise logical or of the one-hot vectors. Similar to Equation 3, the final regularity term is a weighted summation across all the attributes,

$$O_g = \sum_{a \in \{x, y, w, h\}} \omega^a O^a_g, \tag{5}$$

where $\omega^a$ denotes the weights of the different attributes. The final energy function to be minimized is

$$O = O_d + O_g. \tag{6}$$
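The trade-off between the data term and the regularity term can be made concrete with a small sketch for a single attribute. It assumes absolute residuals and represents each one-hot selection vector by the index of its selected model. The exhaustive enumeration is only a stand-in for a BIP solver on toy inputs and is not the optimization routine used in this work; all function names are hypothetical.

```python
from itertools import product


def data_energy(values, models, assignment):
    """Sum of residuals of the selected models (one attribute).
    assignment[i] is the index k for which a_{i,k} = 1."""
    return sum(abs(v - models[k]) for v, k in zip(values, assignment))


def regularity_energy(assignment):
    """L1 norm of the element-wise OR of all one-hot vectors,
    i.e. the number of distinct models that are selected."""
    return len(set(assignment))


def solve_brute_force(values, models, weight):
    """Enumerate every assignment and return (energy, assignment) with the
    lowest total energy. Exponential in len(values): toy inputs only."""
    best = None
    for assignment in product(range(len(models)), repeat=len(values)):
        energy = (data_energy(values, models, assignment)
                  + weight * regularity_energy(assignment))
        if best is None or energy < best[0]:
            best = (energy, assignment)
    return best
```

With the hypothetical inputs `values = [10, 12, 11, 30]` and `models = [10.0, 12.0, 30.0]`, a weight of 0 reproduces the trivial nearest-center solution with three selected models, whereas a weight of 5 yields a solution that uses only two models at the cost of a larger data term.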

Constraints of the binary integer programming
The variables $\xi_i = (a_{i,1}, a_{i,2}, \dots, a_{i,|X|})^T$ cannot be adjusted freely. Because each bounding box can only choose one state for each attribute, we have the following constraint C1 for each bounding box,

$$\mathrm{C1}: \quad \|\xi_i\|_1 = \sum_{k} a_{i,k} = 1,$$

where the L1 norm is the total number of selected models.
Another practical constraint C2 is that we can very confidently exclude a model for a bounding box if the residual $|\epsilon_{i,k}|$ exceeds an upper bound $\delta_u$, as

$$\mathrm{C2}: \quad a_{i,k} = 0 \quad \text{if } |\epsilon_{i,k}| > \delta_u.$$
It seems the additional constraints may increase the complexity of the problem, but interestingly, in practice, we find that the additional constraints significantly reduce the runtime, with almost no differences in the final results.
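A minimal sketch of how the two constraints act on the selection variables of one attribute is given below. The helper names and the nested-list representation of the binary variables are hypothetical; a real solver would receive C2 as variable bounds or equality constraints instead.

```python
def candidate_mask(values, models, upper_bound):
    """mask[i][k] is True when model k remains selectable for box i,
    i.e. when the residual does not exceed the upper bound (C2)."""
    return [[abs(v - m) <= upper_bound for m in models] for v in values]


def satisfies_constraints(selection, mask):
    """Check C1 (exactly one model per box) and C2 (pruned variables are 0)
    for a 0/1 matrix selection[i][k]."""
    for row, allowed in zip(selection, mask):
        if sum(row) != 1:
            return False
        if any(a == 1 and not ok for a, ok in zip(row, allowed)):
            return False
    return True


def free_variables(mask):
    """Number of binary variables that C2 leaves undetermined."""
    return sum(sum(1 for ok in row if ok) for row in mask)
```

Because every variable fixed to zero by C2 no longer needs to be explored, the mask shrinks the search space; this is one plausible reading of why the additional constraints reduce the runtime in practice.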

Implementation details
The implementation of Equation 4 requires some care, because it involves logical operations. For two binary variables $a$ and $b$, the logical or result $c = a \vee b$ can be modeled by adding the following constraints,

$$c \ge a, \qquad c \ge b, \qquad c \le a + b.$$

In fact, this kind of routine reformulation can be handled efficiently and gracefully by state-of-the-art solvers (Chinneck, 2007, Gurobi, 2014). For the parameters, we set $\delta_l \in [3, 5]$ pixels and $\delta_u = 10\delta_l$; and $\omega^x = \omega^y = 100$ and $\omega^w = \omega^h = 10$ are used empirically. In this way, all the energy functions and constraints are linear functions, which are solved using the Mosek library (Mosek, 2010).
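That linear inequalities can encode a logical or may be checked by enumeration. The sketch below assumes the standard linearization c ≥ a, c ≥ b, c ≤ a + b for binary variables, together with its usual extension to more than two inputs; it only verifies the encoding and does not call any solver.

```python
from itertools import product


def or_constraints_hold(inputs, c):
    """Linear constraints modeling c = OR(inputs) for binary variables:
    c >= a_j for every input a_j, and c <= sum of all inputs."""
    return all(c >= a for a in inputs) and c <= sum(inputs)


def feasible_outputs(inputs):
    """All values of c in {0, 1} that satisfy the linear constraints."""
    return [c for c in (0, 1) if or_constraints_hold(inputs, c)]


def encoding_is_exact(n_inputs):
    """True if, for every binary input combination, the only feasible c
    equals the logical OR of the inputs."""
    return all(feasible_outputs(bits) == [int(any(bits))]
               for bits in product((0, 1), repeat=n_inputs))
```

Here `encoding_is_exact(2)` covers the two-variable case in the text, and `encoding_is_exact(3)` covers the chained case that arises when the or is taken over several bounding boxes.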

Evaluation of detections of façade primitives
This paper uses the CMP façade database (Tylecek, 2012) as the training dataset, which contains a total of 606 building façade images from around the world. These images are manually labeled with 12 semantic object classes on the façade, from which we choose three typical primitives: window, door and balcony. We built the YOLOv3 model on Keras (Gulli, Pal, 2017) and trained it on the above dataset. In addition, we took 30 typical building façade images from Google Street View (Anguelov et al., 2010) for testing, and manually labeled them for evaluation. To verify the effectiveness of this method, we adopted the same evaluation method as Rahmani, Mayer (2018). For windows, doors, and balconies, our average extraction accuracies reached 0.917, 0.856, and 0.852, respectively, all of which are higher than the baseline of 0.84 (Rahmani, Mayer, 2018). Therefore, it is feasible to extract the primitives on building façades based on YOLOv3.

Evaluation and comparisons of the regular arrangements of the primitives
We selected three typical building façade images of three cities in the United States (US), United Kingdom (UK), and Canada (CA) to evaluate the performance of the regularization. Both qualitative and quantitative evaluations are conducted and we also compare the runtime performance against the MILP approach (Hensel et al., 2019).
Qualitative evaluations. Figures 3 through 5 compare the extracted and regularized bounding boxes for the US, UK and CA datasets, respectively. The black frames represent the extracted primitives and the red frames indicate the regularized results. After regularization, the semantic objects on the building façade are arranged more neatly and consistently, and still fit the original bounding boxes well. In addition, Figure 6 demonstrates the reconstructed façades for the three datasets in off-the-shelf modeling solutions.
Quantitative evaluations. We counted the number of models used in the model space before and after the regularization to measure the regularity of the results. Table 1 lists the results; after regularization, the selected models account for only about 50% of the initial model space for the corner coordinates and about 20% for the sizes.
Comparisons of runtime. To verify the efficiency of the proposed method, we tested six building façades with complex structures and numerous parameters, and compared the proposed BIP approach against the MILP approach (Hensel et al., 2019). The results are shown in Table 2: the runtime of the proposed BIP approach is only about 10% to 20% of that of the MILP approach. For the MILP approach (Hensel et al., 2019), the number of explicit unknown parameters is N (2|X| + 2|Y| + |W| + |H|) + 8N, including 8N real-valued parameters. In the proposed approach, the number of explicit unknown parameters is N (|X| + |Y| + |W| + |H|). Although the proposed method has slightly fewer parameters, the numbers are still of the same order of magnitude. Therefore, it is the reformulation of the problem, rather than the reduced number of unknowns, that accounts for the performance difference.
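The two unknown counts quoted above can be compared directly. The sketch below simply evaluates both formulas; the function names are hypothetical and the example sizes are made-up values chosen for illustration, not figures from the experiments.

```python
def milp_unknowns(n, size_x, size_y, size_w, size_h):
    """Explicit unknowns of the MILP formulation:
    N(2|X| + 2|Y| + |W| + |H|) + 8N, of which 8N are real-valued."""
    return n * (2 * size_x + 2 * size_y + size_w + size_h) + 8 * n


def bip_unknowns(n, size_x, size_y, size_w, size_h):
    """Explicit unknowns of the BIP formulation: N(|X| + |Y| + |W| + |H|)."""
    return n * (size_x + size_y + size_w + size_h)


# Hypothetical façade: 50 boxes, 10 and 8 coordinate models, 4 models per size.
ratio = milp_unknowns(50, 10, 8, 4, 4) / bip_unknowns(50, 10, 8, 4, 4)
```

For these illustrative sizes the MILP formulation has 2600 unknowns against 1300 for BIP, a factor of two, which is consistent with the statement that both counts are of the same order of magnitude.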

CONCLUSION
This paper proposed an approach for the regular arrangement of primitives on building façades using BIP. Compared to the MILP approach, BIP is considerably faster and achieves near real-time performance with a similar level of data fitness and regularity. The detected and rearranged bounding boxes of the primitives can be directly used for the modeling of the façade features, which is a key step towards LOD-3 reconstruction. However, the current approach can only detect axis-aligned objects; future work may explore the reconstruction of more complex façade features. Code is available at https://github.com/saedrna/Ranger.