HIGH QUALITY FACADE SEGMENTATION BASED ON STRUCTURED RANDOM FOREST, REGION PROPOSAL NETWORK AND RECTANGULAR FITTING

In this paper we present a pipeline for high quality semantic segmentation of building facades using Structured Random Forest (SRF), Region Proposal Network (RPN) based on a Convolutional Neural Network (CNN) as well as rectangular fitting optimization. Our main contribution is that we employ features created by the RPN as channels in the SRF. We empirically show that this is very effective especially for doors and windows. Our pipeline is evaluated on two datasets where we outperform current state-of-the-art methods. Additionally, we quantify the contribution of the RPN and the rectangular fitting optimization on the accuracy of the result.


INTRODUCTION
Facade segmentation is an important part of urban scene understanding and 3D building reconstruction and of interest for architectural design, movies and video games.
Early work on facade segmentation has focused on window detection and 3D reconstruction from multiple images (Mayer and Reznik, 2006, Reznik and Mayer, 2008). In recent years, most work on facade segmentation has been based on single images (Cohen et al., 2014, Jampani et al., 2015, Mathias et al., 2016, Rahmani et al., 2017, Schmitz and Mayer, 2016). Additionally, the Varcity dataset (Riemenschneider et al., 2014) has been published focusing on facade image and facade point cloud labeling. Like the above papers, we concentrate on single facade image segmentation.
The biggest challenges of basic pixel-wise segmentation are noise and the complex shapes of facade objects such as doors, windows and balconies. The first problem particularly concerns algorithms that classify each pixel or superpixel independently of its neighbors. It can be reduced by incorporating interaction with the neighborhood. The latter problem is dealt with by model fitting and by making use of the global structure of facade objects and architectural constraints. Some approaches encode hard architectural constraints in their algorithms, such as a grid structure for windows or that all balconies have the same dimensions. Others employ soft architectural constraints, for example that the roof is on top of the building or that shops are on the first floor. An additional challenge of facade segmentation is the variety of building types and architectural styles, which leads to different shapes and arrangements of the facade objects.
The main contribution of this paper is that we introduce a Region Proposal Network (RPN) to create proposals for objects such as window, door, balcony, shop and sky together with their corresponding probability. These probabilities are transformed into features which are input to Structured Random Forest (SRF) (Kontschieder et al., 2014) classification. This leads to a segmentation with very little noise. Finally, a deterministic rectangular fitting is used to create rectangularly shaped facade objects and a grid structure.
The pipeline (Fig. 1) presented in this paper outperforms all other state-of-the-art approaches on the current benchmarks without relying on hard architectural constraints. We introduce the RPN to facade segmentation and, to clarify its importance, quantify its contribution to the good overall performance. The high quality results of RPN and SRF are supplemented by a fast and yet accurate model fitting.
The paper is organized as follows: In the next section we give an overview of related work. Sections 3, 4 and 5 describe in depth the components of our pipeline, namely SRF, RPN and rectangular fitting, respectively. Experiments and the technical details are given in Section 6. Finally, we present the evaluation, draw conclusions and point to future work.

RELATED WORK
We distinguish two types of facade segmentation methods: grammar-based (top-down) and classification-based (bottom-up) methods. Top-down methods usually first classify each pixel or generate facade object hypotheses. Then they use shape grammars to parse the facade images. They learn the hierarchy and distribution of facade objects as well as the architectural characteristics of the dataset. Because of this they can predict object positions, particularly for windows, even when they are occluded by vegetation or other objects. From a processing perspective, top-down methods first divide the facade images into larger parts and then recursively split each part into smaller facade objects. The division rules are hand-crafted or learned and integrated into a shape parse tree grammar.
State-of-the-art grammar-based methods usually achieve an accuracy of pixel-wise classification below 85% (Gadde et al., 2016) on the ECP benchmark dataset (Teboul et al., 2011). A problem of grammar-based methods is their time inefficiency during training and inference, where they need several minutes to process a typical image (Koziński et al., 2015, Gadde et al., 2016). Currently, the most efficient but also highest quality methods are bottom-up. They employ a pipeline architecture which contains pixel (or superpixel) labeling, object detectors and optimization. Each part of the pipeline tries to correct wrongly classified pixels or to optimize the segments created by previous parts.
Dynamic programming (DP) is applied in (Cohen et al., 2014) to segment facade objects. Pixel-wise classification is used and hard architectural constraints are encoded as constraints in the DP. At each of multiple steps of DP optimization a combination of facade objects is used. In the DP, e.g., the constraints that windows and balconies are rectangular and that balconies on the same row usually have the same dimensions are employed. In the following steps the shop, door, roof and sky segments are optimized. In (Cohen et al., 2017) the same authors additionally make use of the symmetry of the facade. This reduces problems with occlusions and improves the accuracy.
Auto-context (Tu, 2008) is used in (Jampani et al., 2015). The pipeline consists of boosted random trees, object detectors and conditional random fields (CRF). First, an object detector for doors and windows is trained and the scores of the detectors are used as features in the boosted random trees. Additionally, for each pixel 761 low level features such as TextonBoost filters, Histogram of Oriented Gradients (HOG), Local Binary Pattern features and averages over image rows and columns are computed.
For building an object detector, the Integral Channel Features detector (Dollar et al., 2009) is employed, which outputs bounding boxes. The score of each pixel is the sum of the scores of all bounding boxes that contain the pixel. Additionally, three iterations of the Auto-context algorithm are used. In each iteration the output of the previous step is added as well as features such as the class probability for each pixel, entropy, row and column statistics, distance to the nearest class pixel, color model per class, bounding box features and neighborhood statistics. The iterations improve the accuracy by 1% to 8% on the benchmarked dataset (Jampani et al., 2015). As postprocessing, a CRF is employed to delineate the segments and reduce their noise. This improves the accuracy by no more than another 1% but adds more than 24 seconds. Compared with less than 6 seconds for all previous steps, the CRF takes more than 80% of the time.
(Martinović et al., 2012) presents a three-layered approach. On the first layer the image is segmented with a Recursive Neural Network which outputs the class probability distribution of each pixel. The second layer consists of a door and window detector and a Markov Random Field (MRF). Since the results are still noisy and only moderately accurate, a third layer is introduced. It consists of an energy minimization incorporating architectural constraints as well as characteristics of the datasets, such as that the second and the fifth floor must have a running (long) balcony.
The above work is extended in the ATLAS approach (Mathias et al., 2016). First, the image is segmented into superpixels. A range of segmentation methods such as RNN, Perceptron, Multiclass Support Vector Machine (SVM), Multiclass Logistic Regression and CRF have been tried and it has been shown that the SVM works best. The results show that the first layer leads to an improvement of 2% compared to the previous approach. This improvement is due to a significantly higher accuracy for the bigger classes like shop and roof. On the other hand, the accuracy is lower for doors. In the second layer, the door and window detectors are improved and a detector for cars, which often occlude the lower part of facades, is added. The final postprocessing layer is similar to (Martinović et al., 2012).
(Rahmani et al., 2017) introduces an SRF and demonstrates its good performance for facade segmentation. The method outperforms the current classification methods, but the employed iterative optimization algorithm does not perform well quantitatively compared to other optimization approaches.
In the following we describe our novel approach which can be considered as an extension and improvement of (Rahmani et al., 2017).

BASICS OF STRUCTURED RANDOM FORESTS
In this section we present a short introduction to Decision Trees and Structured Random Forests (Kontschieder et al., 2014) as well as their advantages for facade segmentation. For more details, please refer to (Kontschieder et al., 2011a, Kontschieder et al., 2011b, Kontschieder et al., 2014). The main difference between traditional Random Forests and Structured Random Forests is that the output of the SRF is an image patch, while the traditional output is just a single pixel.

Decision Trees
A Decision Tree (DT) is a classification algorithm that accepts as input an n-dimensional feature vector x from a dataset D and outputs a class label y ∈ Y, where Y is the set of class labels. Formally, we can represent a DT by ft(x) = y. A DT classifies a sample recursively by branching down to a leaf node. At each node of a DT a split function is learned which decides how to traverse down until a leaf node is reached. The leaf node assigns a class label to the sample. In facade segmentation the data samples are the pixels' features and the set of labels is Y = {"window", "door", "wall", "sky", ...}. Each node j decides based on the learned split function, which is a binary function with parameters θj:
$$h(x, \theta_j) \in \{0, 1\} \qquad (1)$$
The sample continues its path to the left child node if the output of the split function is 0, otherwise it continues to the right.
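The traversal described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the `Node` class, the axis-aligned threshold test and the toy tree are hypothetical choices made only to show how a binary split function routes a sample to a leaf.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    # Internal nodes carry split parameters (feature index, threshold);
    # leaf nodes carry a class label. All names here are hypothetical.
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None


def split(x: Sequence[float], node: Node) -> int:
    """Binary split function: 0 sends the sample left, 1 sends it right."""
    return 0 if x[node.feature] < node.threshold else 1


def classify(x: Sequence[float], root: Node) -> str:
    """Branch down from the root until a leaf assigns the label."""
    node = root
    while node.label is None:
        node = node.left if split(x, node) == 0 else node.right
    return node.label


# Toy tree with two split nodes and three leaves (assumed values).
tree = Node(
    feature=0,
    threshold=0.5,
    left=Node(label="wall"),
    right=Node(
        feature=1,
        threshold=0.3,
        left=Node(label="window"),
        right=Node(label="sky"),
    ),
)
```

Calling `classify([0.8, 0.1], tree)` follows the right branch at the root and the left branch at the second node.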

Training of Decision Trees
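During training, DTs try to find the best split function (Equation 1) so that they achieve the highest possible accuracy. Formally, the set Sj ⊆ X × Y which has "arrived" at tree node j should be split into two subsets, a left subset and a right subset, by the split function in a way that maximizes the quality of the split. Often the measure of quality is the information gain, given here in its standard form:
$$I_j = H(S_j) - \sum_{i \in \{L, R\}} \frac{|S_j^i|}{|S_j|} H(S_j^i) \qquad (2)$$
where $H(S) = -\sum_{y \in Y} p(y) \log p(y)$ is the Shannon entropy of the label distribution in S, and $S_j^L$ and $S_j^R$ are the left and right subsets. The DT selects the parameters θj that maximize the information gain.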
The parameters of the split function are usually chosen randomly for a certain number (often up to three) of features and their corresponding thresholds. This process is repeated several times and the combination of features and thresholds that maximizes the information gain Ij (Equation 2) is chosen as parameters of the split function.
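The random search over split parameters can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each candidate uses a single feature and threshold (the text allows up to three), entropy is computed with base-2 logarithms, and the function names and the number of trials are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))


def best_random_split(samples, labels, n_trials=100, seed=0):
    """Try random (feature, threshold) pairs and keep the one with highest gain."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    best = (-1.0, None, None)  # (gain, feature index, threshold)
    for _ in range(n_trials):
        f = rng.randrange(n_features)
        values = [s[f] for s in samples]
        t = rng.uniform(min(values), max(values))
        left = [y for s, y in zip(samples, labels) if s[f] < t]
        right = [y for s, y in zip(samples, labels) if s[f] >= t]
        gain = information_gain(labels, left, right)
        if gain > best[0]:
            best = (gain, f, t)
    return best
```

A split that separates the classes perfectly reaches the maximum gain, which equals the entropy of the parent set.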

Random Forests
A Random Forest is a set of T independent DTs. To classify a sample, a Random Forest collects the predictions of its T trees. From these labels it chooses a single label, usually by majority vote or consensus. Single DTs tend to overfit, which the redundancy of several trees helps to reduce.
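The majority vote can be written in a few lines of Python. In this sketch each tree is assumed to be a callable that maps a feature vector to a label; `forest_predict` is a hypothetical name.

```python
from collections import Counter


def forest_predict(trees, x):
    """Collect one label per tree and return the most frequent one."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]
```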

Structured Random Forests
For facade segmentation it is beneficial to consider the local and global structures of the facade objects. The features and the segmentation algorithm should embody the architectural constraints as well as the object hierarchy. For this, we use SRFs, as they encode the local structure of the objects in their split nodes.
Standard Random Forests label each pixel independently of its neighborhood, leading to labelings with a lot of salt-and-pepper noise. The adaptation of structured learning (Nowozin and Lampert, 2011) to random forests produces a patch as output. This results in almost noise-free segments and highly accurately labeled images.
Additionally, SRFs have the advantages that they output a patch for each pixel and that the labels of the neighboring pixels are also used during training. For each pixel multiple labels are proposed from the neighboring pixels and during training the neighborhood is integrated in the split function.
Unfortunately, the patches lead to an exponential increase of the output space compared to the Standard Random Forests. To overcome this problem, different reduction techniques are employed for the output space such as Principal Component Analysis (PCA) (Dollár and Zitnick, 2013) and probabilistic approaches such as (Kontschieder et al., 2014). In this paper a representative patch is computed for each node as the joint probability distribution of the labels assigned to a leaf node.
We use a training methodology similar to (Kontschieder et al., 2011b). The best split parameters are chosen based on the information gain of up to three joint distributions. The training procedure works as follows: Let St be the subset of sample patches that has reached the node t. Each sample of St has dimension d × d with center (0, 0). We randomly choose up to three positions (i, j) around the patch center with |i| ≤ n and |j| ≤ n (n is a chosen neighborhood) as well as a feature and a threshold for each position. The information gain is evaluated for 400 to 1000 randomly chosen combinations of up to three positions, features and thresholds and the best parameters for St are chosen. This is repeated recursively until a leaf node is reached.
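The random generation of candidate split parameters can be sketched as follows. The sketch covers only the sampling step, not the evaluation of the information gain; the function names, the threshold range of [0, 255] and the default of 400 candidates are assumptions for illustration.

```python
import random


def sample_split_candidate(n, num_features, rng, max_positions=3):
    """Draw up to three (i, j, feature, threshold) tests around the patch center."""
    tests = []
    for _ in range(rng.randint(1, max_positions)):
        i = rng.randint(-n, n)  # row offset with |i| <= n
        j = rng.randint(-n, n)  # column offset with |j| <= n
        feature = rng.randrange(num_features)
        threshold = rng.uniform(0.0, 255.0)  # assumed feature value range
        tests.append((i, j, feature, threshold))
    return tests


def sample_candidates(n, num_features, count=400, seed=0):
    """Generate the pool of random candidates that a node would evaluate."""
    rng = random.Random(seed)
    return [sample_split_candidate(n, num_features, rng) for _ in range(count)]
```

Each candidate would then be scored by the information gain and the best one kept for the node.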

Optimization for Structured Random Forests
Each patch is of dimension d × d and we evaluate every pixel, meaning that each pixel is covered by d² patches (except pixels at the borders of the image). The d² votes are distributed over the classes and the final pixel label is chosen by majority vote.
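A minimal Python sketch of this voting scheme is given below. It assumes a hypothetical callable `predict_patch(r, c)` that returns the d × d label patch predicted for the pixel at row r and column c, with d odd; patches that reach beyond the image border simply contribute fewer votes.

```python
from collections import Counter


def vote_labels(height, width, d, predict_patch):
    """Accumulate the votes of all overlapping patches and take the majority."""
    half = d // 2
    votes = [[Counter() for _ in range(width)] for _ in range(height)]
    for r in range(height):
        for c in range(width):
            patch = predict_patch(r, c)  # d x d list of labels centered on (r, c)
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        votes[rr][cc][patch[dr + half][dc + half]] += 1
    return [[v.most_common(1)[0][0] for v in row] for row in votes]
```

Because every pixel receives votes from all patches that cover it, a single deviating patch is out-voted by its neighbors, which is the mechanism that suppresses isolated noise.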
We use an iterative optimization method (Kontschieder et al., 2011b) which produces sharper edges for the segments, a higher accuracy and removes noisy small segments. It selects the best labeling from the set of patches for each tree of the forest.
Formally, let a training image I with labels l be given, let the SRF F be defined as a set of T structured decision trees and let the tree t ∈ F predict the patch p^(t)_(i,j) for the pixel at position (i, j). We define as optimization function an agreement score which counts the number of correctly predicted pixels of the patch p^(t) on the labeled image l.
When labeling the patch with center at position I(i, j), the patch from the tree that has produced the highest agreement score is selected as representative patch. Other trees with lower agreement scores are ignored. This step is performed for every pixel with d² proposals and the class label for each pixel is chosen through majority vote. The optimization can be performed multiple times until convergence. For the complete proof and more details we refer to (Kontschieder et al., 2011b). This iterative technique tries to shape the object boundaries similar to the objects that the SRF has "seen" during training.
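The patch selection step can be sketched in Python as follows. The sketch assumes that `labeling` is the current label image (a 2D list) against which the agreement is counted, and that `tree_patches` holds one d × d patch per tree for the pixel at (r, c); both names are hypothetical.

```python
def agreement(patch, labeling, r, c):
    """Count the patch pixels that match the labeling around pixel (r, c)."""
    half = len(patch) // 2
    height, width = len(labeling), len(labeling[0])
    score = 0
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < height and 0 <= cc < width
            if inside and patch[dr + half][dc + half] == labeling[rr][cc]:
                score += 1
    return score


def select_patch(tree_patches, labeling, r, c):
    """Keep only the patch of the tree with the highest agreement score."""
    return max(tree_patches, key=lambda p: agreement(p, labeling, r, c))
```

One iteration of the optimization would apply `select_patch` at every pixel and then recompute the labels by majority vote over the selected patches.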

REGION PROPOSAL NETWORK
With the recent advances in Convolutional Neural Networks (CNN) (LeCun et al., 1990), the accuracy of object detection and Region Proposal Network (RPN) algorithms has considerably improved (Ren et al., 2015, Liu et al., 2016). Large datasets such as ImageNet (Krizhevsky et al., 2012) and COCO (Lin et al., 2014) have been made available for training and testing.
We use a pretrained model as basis to train our RPN. We employ the Single Shot Detector-300 (SSD-300) (Liu et al., 2016) as RPN for the classes window, door, balcony, long (running) balcony, shop, roof and sky. For each object class a separate feature (channel) is created (Figs. 1 and 2). The detectors output rectangles with an attached probability for the existence of the object. Each pixel in the box is given the probability value of the object mapped to the [0,255] range with "inverse min-max normalization". In the ECP dataset, the RPN produces six features (Fig. 2). During the experiments we found that the object detector performs better when a small amount of space (padding) is added around an object during the training phase. This is particularly helpful for entrance doors, which are usually surrounded by wall, making it easier for the RPN to identify them. Additionally, this helps to distinguish entrance doors from shop doors, because the latter are mainly surrounded by windows.
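The conversion of box proposals into a feature channel can be sketched as follows. The details are assumptions, since the text does not specify them: probabilities are mapped linearly to [0, 255], overlapping boxes are combined by taking the maximum, and boxes are given as hypothetical tuples (x1, y1, x2, y2, probability) with inclusive pixel coordinates.

```python
def proposals_to_channel(height, width, boxes):
    """Rasterize box proposals of one class into a single-channel image."""
    channel = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2, prob in boxes:
        value = round(255 * prob)  # assumed linear mapping to [0, 255]
        for y in range(max(0, y1), min(height - 1, y2) + 1):
            for x in range(max(0, x1), min(width - 1, x2) + 1):
                # Assumed rule for overlaps: keep the most confident proposal.
                channel[y][x] = max(channel[y][x], value)
    return channel
```

One such channel would be produced per object class and appended to the other image features used by the SRF.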
We have compared the RPN with the results of the Integral Channel Features (Dollar et al., 2009) and Deformable Part-based Model (DPM) (Girshick et al., 2012) detectors used in previous work (Jampani et al., 2015, Mathias et al., 2016, Rahmani et al., 2017) and found that the RPN performs considerably better.

RECTANGULAR FITTING
After labeling by the SRF, the image is post-processed. We employ architectural constraints embodying the following assumptions: The facade objects window, door, balcony and shop have a rectangular shape, roof and wall are divided by a horizontal straight line and windows form a grid structure.
First, we count all vertical and horizontal "changes" between window, door, balcony and shop to wall or other objects and vice versa. This reduces the search space for object boundaries and from the statistics we derive the grid structure. Other methods delineate the objects based on the objects on the same row and column. We abstain from this, because the height and the width, particularly of windows, can change depending on the viewing angle of the image and further image distortions incurred during the rectification of the image. Our fitting is, thus, based only on the local labeling.
We have defined a minimization function to fit the objects to rectangles. Formally, let the rectangle R_{x1,y1,x2,y2} be defined by its upper left corner (x1, y1) and its bottom right corner (x2, y2). Let the object to be fitted have class label oc. The optimization problem is then defined as the minimization of a score over these rectangles, where the score is built from the indicator function I[·] and a parameter k which is empirically determined. Hypotheses are generated from the statistics of "changes" and the rectangle R_{x1,y1,x2,y2} with the minimum score is selected.
Intuitively, this minimization function produces rectangles which contain as many pixels of class oc as possible and try to avoid pixels of other classes (Fig. 3).
To compute the number of pixels in each rectangle, we use an integral image representation (Viola and Jones, 2001). With it, the score of each rectangle can be computed with O(1) time complexity. Compared to other methods such as CRF (Jampani et al., 2015) and DP (Cohen et al., 2014, Cohen et al., 2017), our optimization method is fastest in terms of time complexity and, still, produces the highest quality results (Figs. 4 and 5).
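The use of the integral image for scoring rectangle hypotheses can be sketched in Python. The integral image part follows the standard construction; the `score` function is only an assumed stand-in (number of pixels of other classes minus k times the number of pixels of class oc) and should not be read as the exact minimization function of this paper. All names are hypothetical.

```python
def integral_image(mask):
    """mask[y][x] is 1 where the pixel has the object class, else 0."""
    h, w = len(mask), len(mask[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = mask[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii


def count_in_rect(ii, x1, y1, x2, y2):
    """Number of object pixels in the inclusive rectangle, in O(1)."""
    return ii[y2 + 1][x2 + 1] - ii[y1][x2 + 1] - ii[y2 + 1][x1] + ii[y1][x1]


def score(ii, rect, k):
    """Assumed score: penalize foreign pixels, reward object pixels."""
    x1, y1, x2, y2 = rect
    area = (x2 - x1 + 1) * (y2 - y1 + 1)
    inside = count_in_rect(ii, x1, y1, x2, y2)
    return (area - inside) - k * inside


def fit_rectangle(mask, candidates, k=1.0):
    """Return the candidate rectangle with the minimum score."""
    ii = integral_image(mask)
    return min(candidates, key=lambda rect: score(ii, rect, k))
```

With the integral image built once per class, each hypothesis costs four table lookups regardless of its size, which is what makes an exhaustive evaluation of the hypotheses cheap.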

Datasets
ECP This dataset consists of 104 rectified facade images of buildings from Paris with Haussmannian architecture. All images are labeled according to the seven semantic classes balcony, door, roof, shop, sky, wall and window, with balconies, doors, windows and shops modeled as rectangles. The dataset was first published by (Teboul et al., 2011), but some annotations were not correct. In 2012 (Martinović et al., 2012) corrected these annotations. In our experiments we use the corrected dataset.
Graz The dataset contains 50 rectified images from Graz (Riemenschneider et al., 2012) comprising facades of different architectural styles (Classicism, Biedermeier, Historicism and Art Nouveau). Each image is labeled according to the four semantic classes door, sky, wall and window.
To train the RPN for the ECP dataset we have collected and annotated a larger number of facade images.

Image Features
The SRF has been trained with similar channels as in (Rahmani et al., 2017), such as RGB and CIELab color, HOG, location information, 17 TextonBoost filter responses and the scores of the RPN. We have removed the features that have a low information gain. We use Extremely Randomized Trees (Geurts et al., 2006), which leads to high quality trees as well as fast computation.
Since the facades are rectified, it is meaningful to add as features the row variance and median of the RGB channels as well as the corresponding deviation from the median. These "statistical" features have a high information gain especially for the Graz dataset, for which the content is simpler and where the images do not contain occlusions.
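The row statistics can be illustrated with a short Python sketch for a single color channel. The choices made here are assumptions: the population variance is used, the deviation is taken as the absolute difference of each pixel from the median of its row, and `row_statistics` is a hypothetical name.

```python
from statistics import median, pvariance


def row_statistics(channel):
    """Per row: variance, median and per-pixel absolute deviation from the median."""
    features = []
    for row in channel:
        med = median(row)
        var = pvariance(row)
        deviation = [abs(value - med) for value in row]
        features.append((var, med, deviation))
    return features
```

In a rectified facade, rows that run through a band of windows alternate between wall and window intensities and thus show a higher variance than rows of plain wall.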

EVALUATION
We have evaluated our algorithm on both datasets using 80% of the data for training and 20% for testing. The split of the dataset has been chosen randomly. The SRF for the ECP dataset is trained with patches of size 17 × 17 and for the Graz dataset we have chosen a patch size of 11 × 11. For the ECP dataset we have also evaluated patches of sizes 15 × 15 and 19 × 19 and for the Graz dataset of sizes 13 × 13 and 15 × 15, but the results were not significantly different (the absolute difference in accuracy was 0.1%).
The empirical results have been compared with the results of currently published work on the datasets (Martinović et al., 2012, Mathias et al., 2016, Jampani et al., 2015, Cohen et al., 2014, Cohen et al., 2017, Rahmani et al., 2017). From Tables 1 and 2 one can see that our method outperforms the other methods. Our algorithm is more than 2% better on the Graz dataset and 0.4% better on the ECP dataset than the current state of the art. Additionally, our method is considerably better than other methods in recognizing doors due to the good cooperation of the RPN and SRF.
We have evaluated our algorithm in four stages. The baseline (ours_Base) evaluates the Structured Random Forest performance without the RPN. For the ECP dataset this leads to weak results for windows and, particularly, doors. Since the content of the Graz dataset images is simpler, the statistical features suffice for it to produce good results.
Incorporation of the RPN features into the SRF (ours_RPN) significantly improves the performance. The accuracy for doors on the ECP dataset increases by more than 30% and is better than the current state-of-the-art methods by more than 7%. Additionally, the RPN improves the accuracy for windows by 7%. The windows located on the roof are occluded by balconies due to the viewing angle. This affects our algorithm, which labels the occluded window parts as balcony (Figs. 5 and 6) and, thus, does not achieve an accuracy higher than 80% for windows on the ECP dataset. In the Graz dataset, the RPN also improves the door and window accuracy, but not by the same magnitude as for the ECP dataset.
The optimization step (ours_O) does not significantly improve the overall result quantitatively, but it removes noise, which positively affects the rectangular fitting.
The rectangular fitting (ours_RF) improves the overall accuracy by more than 0.5% for both datasets. It creates high-quality labeled images. The final result is suitable for many applications that need precisely delineated facade objects, such as 3D city models.
Finally, we note that the developed pipeline also has its weaknesses. Particularly during learning, the SRF develops a high confidence in the RPN since its features have a high information gain. Thus, when the RPN is wrong, i.e., proposes with a high probability an object which actually does not exist, the SRF is not able to recover (Fig. 6).

CONCLUSION AND FUTURE WORK
We have presented a method for facade segmentation which outperforms other state-of-the-art methods in terms of accuracy and quality. The Structured Random Forest, the Region Proposal Network based on a Convolutional Neural Network and the rectangular fitting method constitute a very good combination for facade segmentation. Rectangular model fitting is particularly suitable for this task due to the shape of the facade objects. With the assumption of a grid structure for windows we added a global constraint. Finally, we note that our novel approach not only produces a high quality result, but is also very efficient. We plan to implement and improve it in a way that it can process more than one facade image per second on a normal computer without a significant influence on the accuracy.

Figure 1. Architecture of the proposed pipeline for facade parsing (RPN: Region Proposal Network, SRF: Structured Random Forest).

Figure 2. Region Proposal Network (RPN). The output of the RPN for the facade objects in image (a) for (b) balconies, (c) long balconies, (d) windows, (e) shops, (f) doors and (g) roof. The intensity represents the confidence of the network in the existence of an object at that position. The brighter the region, the higher the probability of the existence of the object.

Figure 3. Rectangular fitting for a door. (a) The orange polygon is generated by the SRF and the dashed line represents the rectangle with minimum loss.

Figure 4. Qualitative results on the Graz dataset. The facade segments are homogeneous and nearly without noise. Column (a) input images, (b) results from Structured Random Forest with Region Proposal Network, (c) results after 6 iterations of the optimization, (d) results after rectangular fitting and (e) ground truth. Object classes: window, door, sky, wall.

Figure 5. Qualitative results on the ECP dataset. The facade segments are homogeneous and nearly without noise. Column (a) input images, (b) results from Structured Random Forest with Region Proposal Network, (c) results after 19 iterations of the optimization, (d) results after rectangular fitting and (e) ground truth. Object classes: window, door, balcony, sky, wall, roof.