LEARNING A COMPOSITIONAL REPRESENTATION FOR FACADE OBJECT CATEGORIZATION

Our objective is the categorization of the most dominant objects in facade images, like windows, entrances and balconies. To interpret images of complex scenes we need an interaction between low-level bottom-up feature detection and high-level top-down inference. A top-down approach would use results of a bottom-up detection step as evidence for some high-level inference of scene interpretation. We present a statistically founded object categorization procedure that is suited for bottom-up object detection. Instead of choosing a bag of features in advance and learning models based on these features, it is more natural to learn which features best describe the target object classes. Therefore we learn increasingly complex aggregates of line junctions in image sections from man-made scenes. We present a method for the classification of image sections by using the histogram of diverse types of line aggregates.


INTRODUCTION
Our objective is the interpretation of facade images that leads to a detailed description including dominant objects such as windows, entrances and balconies. Image interpretation of complex scenes in general needs an interplay between some high-level model for the co-occurrence of these objects and some low-level model for the appearance of these objects. This paper focuses on the categorization of objects in a facade, which is meant to serve a top-down module as a link to the image data. The scope of the paper is to classify subsections of images based on the histogram of relevant aggregates of straight line segments in a bag of words approach. To motivate the idea of learning a feature representation for facade objects we will give a brief synopsis of recent work in the field of facade image interpretation.
We divide recent approaches into several fields of interest. The first group deals with the task of window detection from single images. There are two main approaches, either using gradient projection to find aligned edges (Lee and Nevatia, 2004; Recky and Leberl, 2010) or using a classifier that detects regions of interest by searching over the image (Ali et al., 2007; Jahangiri and Petrou, 2009). The first approach is restricted to facade types whose windows fulfil the alignment assumption, while the second approach does not take any alignment or structure assumption into account. The work of Tylecek and Sara (2010) is an exception. Their work can be placed in between window detection and exploiting repetitive structure. They propose a complex generative model in which they include object geometry as well as neighbourhood relations.
The next group of works performs a pixel-wise labelling or facade segmentation. One powerful direction is the combination of a strong pixel-wise classification like Random Forests (RF) with an unsupervised segmentation (Fröhlich et al., 2010). Teboul et al. (2010) formulate a constrained generic shape grammar to express special types of buildings. They train an RF classifier to determine a relationship between semantic elements of the grammar and the observed image support. Thus, a pixel-wise classification is used as low-level input for the grammar. Another interesting approach is the hierarchical segmentation proposed by Berg et al. (2007). They first parse the image into a coarse set of classes which are further parsed into more detailed classes using meta knowledge from a coarse level of detail. Both are handled within an MRF framework. This can be directly transferred to rules of a grammar, although not done in this work.
We believe that explicitly modelling dominant facade objects gives much better evidence to guide a top-down interpretation system. In contrast to these approaches, which rely on pixel-wise evidence to guide top-down methods, we propose to learn generic parts of facade objects to allow object categorization from object-specific image sections, not whole scenes yet; see Figure 1 for some examples of the given data. Object detection is easily realized afterwards by sliding a window over whole facade images.
Widely used object categorization methods either use a number of object-specific features (Fergus et al., 2003) or learn a huge codebook of local image patches (Leibe et al., 2004), which results in a huge search space for matching image features within this codebook. Recently, new approaches have emerged that learn the parts that represent individual object classes (Amit et al., 2004; Fidler et al., 2006, 2009; Gangaputra and Geman, 2006), thus avoiding fixed and pre-selected features.
Inspired by ideas of Fidler et al. (2009) and guided by the special structure of facade objects we propose a bottom-up approach to learn generic parts of object structure from given training image sections of these facade objects. We learn increasingly complex aggregates of lines from image sections showing individual objects of the target classes. Finally we use the learned line aggregates to classify new unseen image segments of the learned target classes using the histogram of diverse types of line aggregates.
The paper is organized as follows. In Section 2 we propose our method for facade object categorization. We explain the definition of aggregates and how to use them for object categorization. Learning of aggregates is shown in Section 2.2 and Section 2.3 shows how to select the relevant aggregates. Section 3 explains our experiments and the data used, which are discussed in Section 3.2. We conclude with Section 4.

APPROACH
The basic idea of our approach is to classify rectified image sections based on the histogram of relevant aggregated straight line segments.

Overview
Straight line segments are largely invariant to shadows and changes in illumination. Windows in particular show a large variety in appearance, mainly due to the mirroring of other objects in the window panes, which makes line segments promising image features. Line aggregates are highly distinctive for certain object categories of facades, provided that not only pairs of lines are taken into consideration. Therefore we use larger aggregates, say up to five lines, in order to arrive at rich image primitives. We allow aggregates to contain smaller ones as parts. Aggregates show typical configurations depending on the angles, see Figure 3. Not all aggregates are relevant for the classification. We therefore select those aggregates that help the classification. These learned aggregates show structures that are typical for certain objects at facades, see Figure 4. We use the histograms of these relevant aggregates as features for classification. The complete approach is sketched in Figure 2. We start from given training image sections, together with their labels, see top row of Figure 2.
From these images we first collect all possible aggregates A = {A_k} of lines. The aggregates are partitioned into subsets A_d of aggregates consisting of d lines. Each aggregate A_k has a certain type t_k ∈ T which is a function of the number d of lines and their directions, rounded to multiples of π/8. The set T of all possible types of aggregates can be seen as the language for describing our objects. Learning which types are relevant for describing the target classes results in the vocabulary V = {v_i} ⊂ T. Classification is done by a simple bag of words (BoW) approach: we interpret identified aggregates of type v_i as words of a vocabulary V, as is usually done in BoW approaches. Thus an image section is represented by the histogram h(v_i), v_i ∈ V, of aggregate types restricted to the learned vocabulary. Taking this as feature vector x = [h(v_i)] we train an import vector machine (IVM), as proposed by Roscher et al. (2012). The IVM was shown to achieve state-of-the-art classification performance. It is a discriminative classifier, which usually ensures better discriminative power than generative models, and it produces class-wise probabilities for test samples.
Having learned the vocabulary V and the IVM model, we classify a new image section (bottom row of Figure 2) by detecting aggregates of the vocabulary, computing their histogram and estimating its most probable class using the IVM model.
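The histogram feature described above can be sketched in a few lines. This is only an illustration: the function name, the integer type codes and the example vocabulary are assumptions made for the sketch and are not taken from the paper.

```python
from collections import Counter


def vocabulary_histogram(aggregate_types, vocabulary):
    """Count detected aggregate types, restricted to the learned vocabulary.

    aggregate_types: iterable of type codes t_k detected in one image section.
    vocabulary: ordered list of relevant type codes v_i (hypothetical values).
    Returns the feature vector x = [h(v_i)] in vocabulary order.
    """
    counts = Counter(aggregate_types)
    return [counts[v] for v in vocabulary]


# Hypothetical example: types 7 and 130 are not in the vocabulary and are ignored.
detected = [12, 12, 7, 301, 12, 130, 301]
vocabulary = [12, 45, 301]
print(vocabulary_histogram(detected, vocabulary))  # [3, 0, 2]
```

Types outside the vocabulary simply do not contribute, so the feature vector always has one entry per vocabulary word regardless of how many aggregates are detected.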
Next we describe how to build the aggregates.

Building the aggregates
We are looking for certain geometries dominated by straight lines, sometimes round arches, but not arbitrary curves. Thus we start from straight line features together with their adjacency graph structure using FEX as described in Förstner (1994) and Fuchs and Förstner (1995). This is different to Fidler et al. (2006) and Fidler et al. (2009), who preferably use Gabor wavelets as they try to model arbitrary object classes. The benefit of using the FEX procedure is the additional information about the neighbourhood relations of pairs of lines without having to depend on their distance and size. Thus we become independent of a certain neighbourhood size. The neighbourhood of two lines is defined by the Voronoi diagram of the extracted lines. Lines that share a Voronoi edge are said to be neighbours.
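All neighbouring lines are combined to A2 aggregates in case they are not parallel, thus building junctions which are the intersection points of the lines. The junction of two neighbouring lines has one of two types τ of relation to each line: either the intersection point is outside of the line, then τ = 1, otherwise τ = 2. An instance of an A2 aggregate is parametrized by its orientation φ ∈ ]−180° ... 180°], thus the direction of the first line, the angle α ∈ [0° ... 180°[ between the two lines and the types (τ1, τ2), with τ1, τ2 ∈ {1, 2}, of their mutual connectivity. Please note that the connectivity (middle-middle) is not allowed, as this would be a crossing, which is no valid outcome of edge extraction. All angles are discretized in π/8 = 22.5° steps, thus we have 16 bins for the orientation and 8 bins for the angle. Together with three possible values for the line connectivity, we have 384 different line junctions. Given these definitions we code the geometry of A2 aggregates by unique numbers between 1 and 384, which define all possible types of A2. Detected instances of A2 aggregates are stored as a list A2 = {a_k^2} with a_k^2 = (t_k; x_k), where t_k ∈ T is a type of the language and x_k the position in the image section.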
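A minimal sketch of such a type coding is given below. The text only fixes the bin counts (16 orientations, 8 angles, 3 connectivity values) and the range 1 to 384. The ordering of the bins inside the code number, the rounding convention and the function name `a2_type` are assumptions made for this illustration.

```python
import math

STEP_DEG = 22.5                          # pi/8 discretization step
CONNECTIVITY = [(1, 1), (1, 2), (2, 1)]  # (2, 2), i.e. middle-middle, is not allowed


def a2_type(phi_deg, alpha_deg, tau1, tau2):
    """Map an A2 junction to a type number in 1..384 (bin ordering is assumed)."""
    if (tau1, tau2) not in CONNECTIVITY:
        raise ValueError("connectivity (2, 2) is not a valid junction")
    # Round to the nearest multiple of 22.5 degrees and wrap around.
    phi_bin = math.floor(phi_deg / STEP_DEG + 0.5) % 16     # 16 orientation bins
    alpha_bin = math.floor(alpha_deg / STEP_DEG + 0.5) % 8  # 8 angle bins
    conn = CONNECTIVITY.index((tau1, tau2))                 # 3 connectivity values
    return 1 + conn + 3 * (alpha_bin + 8 * phi_bin)


# A right-angle junction with horizontal first line, intersection outside both lines.
print(a2_type(0.0, 90.0, 1, 1))  # 13
```

Any other bijection between the 16 x 8 x 3 combinations and the numbers 1 to 384 would serve equally well, since the type number is only used as a label.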
To get aggregates of the next level of complexity we sequentially add neighbouring lines to already existing aggregates. The type t of an A_d part is coded by the type names of the involved A2 parts and the angle ω between the added line and the existing configuration, again discretized into 16 bins, which gives about 2 million possible configurations for A3 parts, more than 14 billion for A4, etc.
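The quoted magnitudes can be checked with a short calculation. The counting rule used here is an assumption, not stated in the text: an A_d type is taken to combine d − 1 A2 type names (384 choices each) with d − 2 discretized angles ω (16 bins each). Under that reading the numbers match the figures given above.

```python
N_A2 = 16 * 8 * 3  # 384 junction types
N_OMEGA = 16       # bins for the angle of each added line


def num_types(d):
    """Number of possible A_d types under the assumed coding (d >= 2)."""
    return N_A2 ** (d - 1) * N_OMEGA ** (d - 2)


for d in (2, 3, 4):
    print(d, num_types(d))
# 2 384
# 3 2359296
# 4 14495514624
```

The rapid growth of the language T is the reason why a feature selection step is needed before classification.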
Please note that neither scale nor any other configuration detail, except for directions, is included. Due to the high variability of facade objects, the clustering of dominant distances between neighbouring line junctions fails. Thus, we ignore distances and just collect co-occurrences of line junctions and cluster the directions between them.
Next we describe how we learn the relevant aggregates.

Feature selection
In the beginning we just know the language T = {t_i}, thus all possible aggregate types. Now we are looking for a subset V ⊂ T of relevant aggregates. The histograms using all types have a very large dimension and usually contain many zeros; furthermore, not all types are relevant for the classification. We therefore identify those bins of the histogram that are informative in terms of classification, which is a typical feature selection problem.
Let X be the set of all available features x_i = h(t_i), i.e. the number of occurring aggregates of type t_i. The task of feature selection is to find a set S ⊂ X of m features x_i ∈ X which have the largest dependency on their individual target class c. As a measure for dependency, correlation and mutual information are widely used. It is known that feature sets chosen this way are likely to have high redundancy, thus the dependency between individual features is high, and they are therefore not informative. Following this argumentation Peng and Ding (2005) proposed a feature selection algorithm called Max-Relevance Min-Redundancy (MRMR). To describe dependency between features or features and labels, they use mutual entropy which is given by the expectation of the mutual information and defined by
$$H(x; y) = \iint p(x, y) \, \log \frac{p(x, y)}{p(x) \, p(y)} \, dx \, dy \qquad (1)$$
for two random variables x and y. The maximal dependency between an individual feature x_i and label c is given by the largest mutual entropy H(x_i; c). To select a set Ŝ_rel of features with the largest dependency on labels c one searches for features satisfying the so-called Max-Relevance condition
$$\hat{S}_{rel} = \arg\max_{S \subset X} \frac{1}{|S|} \sum_{x_i \in S} H(x_i; c) \qquad (2)$$
But for selecting mutually exclusive features Ŝ_red one can use the condition called Min-Redundancy
$$\hat{S}_{red} = \arg\min_{S \subset X} \frac{1}{|S|^2} \sum_{x_i, x_j \in S} H(x_i; x_j) \qquad (3)$$
In both cases, solving the maximization and minimization problem, respectively, is intractable due to the huge number of possible combinations of features.
Therefore Peng and Ding (2005) propose an approximation by a sequential forward selection. Assume we already have a set S_i, initialized by using Equation 1, thus selecting the feature with the highest mutual entropy with its class label
$$S_1 = \{ \arg\max_{x_j \in X} H(x_j; c) \} \qquad (4)$$
In each step that feature is added which maximizes the MRMR criterion
$$S_{i+1} = S_i \cup \{ \arg\max_{x_j \in X \setminus S_i} Q(x_j, S_i, c) \} \qquad (5)$$
for which Peng and Ding (2005) propose to use either the difference
$$Q(x_j, S_i, c) = H(x_j; c) - \frac{1}{|S_i|} \sum_{x_k \in S_i} H(x_j; x_k) \qquad (6)$$
or the quotient
$$Q(x_j, S_i, c) = \frac{H(x_j; c)}{\frac{1}{|S_i|} \sum_{x_k \in S_i} H(x_j; x_k)} \qquad (7)$$
for Q, between relevance and redundancy. In our experiments we tested both criteria and got slightly better results using (7).
We use a Matlab implementation provided by Peng and Ding (2005) to successively select the most relevant and least redundant parts. Thus, in the learning step, after collecting all A_d parts for each training image we perform MRMR feature selection to get those A_d types that fulfil these requirements.
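The sequential forward selection can be illustrated with a small sketch. This is not the Matlab implementation mentioned above. It assumes discrete features (such as histogram counts), estimates the mutual entropy from empirical joint frequencies, and uses the function names `mutual_information` and `mrmr` and a small constant against division by zero, all of which are choices made for this illustration.

```python
import numpy as np


def mutual_information(a, b):
    """Mutual information (in nats) of two discrete 1-D arrays, from joint counts."""
    a_vals, a_idx = np.unique(np.asarray(a).ravel(), return_inverse=True)
    b_vals, b_idx = np.unique(np.asarray(b).ravel(), return_inverse=True)
    joint = np.zeros((a_vals.size, b_vals.size))
    np.add.at(joint, (a_idx, b_idx), 1.0)        # contingency table
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)        # marginal of a, shape (A, 1)
    pb = joint.sum(axis=0, keepdims=True)        # marginal of b, shape (1, B)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))


def mrmr(X, c, m, criterion="quotient"):
    """Greedy MRMR: indices of up to m columns of X, in order of selection."""
    n_features = X.shape[1]
    relevance = np.array([mutual_information(X[:, j], c) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]       # start with the most relevant feature
    redundancy_sum = np.zeros(n_features)        # running sum of H(x_j; x_k), x_k in S
    while len(selected) < min(m, n_features):
        last = selected[-1]
        redundancy_sum += np.array(
            [mutual_information(X[:, j], X[:, last]) for j in range(n_features)])
        redundancy = redundancy_sum / len(selected)
        if criterion == "difference":
            score = relevance - redundancy
        else:
            score = relevance / (redundancy + 1e-12)  # constant avoids division by zero
        score[selected] = -np.inf                # never pick a feature twice
        selected.append(int(np.argmax(score)))
    return selected


# Toy data: columns 0 and 1 are identical, column 2 carries independent information.
a = np.array([0, 0, 1, 1] * 2)
b = np.array([0, 1, 0, 1] * 2)
X = np.column_stack([a, a, b])
c = 2 * a + b
print(mrmr(X, c, 2))
```

In the toy data the duplicated column is as relevant as the first one but fully redundant with it, so the greedy step prefers the independent column.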
Unfortunately one needs to define the number of features to be selected beforehand, which is one of the main unknowns of our procedure, as we know neither the types nor the number of relevant features. We solve this by selecting a sufficiently large number of parts by MRMR, which gives a ranking of the best suited features, and by estimating the classification error depending on the number of features. Usually the classification error decreases while the added features are still informative and then stagnates or even increases. We therefore successively add one feature after the other and estimate the classification error using a simple k-nearest neighbour classifier with a five-fold cross validation on the given training samples. After smoothing we choose the number of features with the lowest estimated classification error. Figure 5 shows the average (red) classification error depending on the number of features for four different levels of complexity of the vocabulary.
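The final step, smoothing the error curve and picking its minimum, might look as follows. The text does not specify the smoothing method. A centred moving average with repeated border values is assumed here, and the error values in the example are invented purely for illustration (in the procedure above they would come from the k-nearest neighbour cross validation).

```python
import numpy as np


def choose_num_features(errors, window=5):
    """Return (number of features with the lowest smoothed error, smoothed curve).

    errors[i] is the estimated classification error when using the first
    i + 1 ranked features; window is an odd moving-average width (assumed).
    """
    errors = np.asarray(errors, dtype=float)
    pad = window // 2
    padded = np.pad(errors, pad, mode="edge")       # repeat border values
    kernel = np.ones(window) / window
    smoothed = np.convolve(padded, kernel, mode="valid")
    return int(np.argmin(smoothed)) + 1, smoothed


# Invented error curve: decreasing first, then slowly increasing again.
errors = [0.5, 0.4, 0.3, 0.25, 0.2, 0.2, 0.21, 0.25, 0.3]
n_best, smoothed = choose_num_features(errors, window=3)
print(n_best)  # 6
```

Smoothing before taking the minimum makes the choice less sensitive to a single lucky cross validation estimate.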

Experimental setup
We choose a challenging dataset with five classes, namely balconies, entrances, arc-type windows and rectangular windows plus background samples, with 400, 76, 198, 400 and 400 samples per class, respectively; see Figure 1 for some examples. For each sample image its target class c is given. These samples are taken from annotated rectified facade images, such that each sample image contains exactly one object of its given target class. Background samples are sampled randomly from rectified facade images. Samples taken this way that accidentally contain too large parts of foreground objects are removed manually. Please note that the samples are not resized to have an equal size.
The classification task is to learn a representation of these target classes, in a way that we are able to classify new and unknown images to one of these classes.
We perform a five-fold cross validation. The dataset is equally split into five groups, such that the samples of each class, despite the different class sizes, are equally distributed over the groups, too. In each cross validation step we choose four of the groups for learning the relevant aggregates by using feature selection and for learning the IVM model. The remaining group is used for testing, thus detecting the proposed aggregates and testing the IVM model.
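Such a class-wise split can be sketched as follows. The round-robin assignment, the random seed and the function name `stratified_folds` are assumptions for this illustration. Only the class sizes are taken from the text.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample index to one of n_folds groups, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)  # deal samples round-robin
    return folds


# Class sizes as reported for the dataset.
sizes = {"balcony": 400, "entrance": 76, "arc window": 198,
         "rect window": 400, "background": 400}
labels = [name for name, n in sizes.items() for _ in range(n)]
folds = stratified_folds(labels)
print([len(f) for f in folds])
```

With this scheme every group receives 15 or 16 of the 76 entrance samples, so the smallest class is represented in each test set.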
For learning we first collect all line pairs (junctions) for all images from the learning set. Performing the feature selection over the histograms of these aggregates defines the vocabulary of aggregates A2. For each following level of complexity d, and again for every image section from the training set, we further combine the already learned aggregates with new neighbouring lines. The feature selection gives the set of learned A_d aggregates and therefore the words of the vocabulary that best describe the target classes. For classification each image is described by one single feature vector which contains the numbers of occurring parts that belong to the vocabulary. Using the feature vectors extracted from the training images we train the IVM classifier. For testing we extract the aggregates belonging to the learned vocabulary and build their histogram of words. Using the learned IVM model we get a prediction of the target class, which we compare to the given label to build the confusion matrix.
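Building the confusion matrix from given and predicted labels can be sketched as below. The convention that rows hold the given class and columns the predicted class is an assumption, and the labels in the example are invented and are not results from the experiments.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count predictions per class; rows: given class, columns: predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Overall accuracy: share of samples on the main diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Invented toy labels for two classes.
classes = ["balcony", "entrance"]
given = ["balcony", "balcony", "entrance", "entrance"]
predicted = ["balcony", "entrance", "entrance", "entrance"]
m = confusion_matrix(given, predicted, classes)
print(m, accuracy(m))  # [[1, 1], [0, 2]] 0.75
```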
Next we show the results of these experiments.

Results and discussion
First we show some learned types of the vocabulary in Figure 6 for A2 to A4.
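Having the geometry of the target classes in mind this is a reasonable collection. For A2 we got rectangular junctions and aggregates that are suited to be part of an arc. Also A3 and A4 aggregates are reasonable parts of the target geometries. In order to test the classification performance we first use each subset A_d separately. Results in terms of confusion matrices are shown in Tables 1 to 4. By just using line junctions, aggregates of A2 give a reasonable classification performance. Balconies are classified with an 88% true positive rate. Note that we are dealing with single images, thus we have no 3D information, which usually guides recognition of balconies. The confusion matrix shows that classification of entrances is a challenging task; they are mostly classified as balconies. The confusion between rectangular and arc-type windows is low, which shows that just using line ... Results for testing the classification performance using several subsets A_d are shown in Tables 5 to 7. We see that using aggregates of different complexity significantly increases the classification performance. Using aggregates from A2 up to A5 gives an overall classification accuracy of almost 80%. Please note that we correctly identify balconies, rectangular and arc-type windows with 89%, 81% and 74%, respectively. Due to noise and occlusions we clearly missed parts of the geometry sometimes. Thus, when ignoring aggregates of lower levels we miss information about parts of the fine geometry. On the other hand, when ignoring aggregates of higher levels, we miss information about the coarse geometry. Therefore we accept redundancy in features to capture both.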

CONCLUSIONS AND FUTURE WORK
We proposed a method for classification using the histogram of types of relevant aggregates of straight line segments. For this we showed how to learn increasingly complex aggregates of line junctions from image sections from man-made scenes. Using these aggregates provided a reasonable classification performance on a challenging dataset. To our knowledge, this is the first approach to facade object categorization including balconies and entrances from single view images. The shown classification performance indicates that the learned set of line aggregates is suited to give good bottom-up evidence for the existence of certain facade objects. This can be done by including the approach into an object detection method like a sliding window or by using it as a region classifier. This will be used in future work to guide a top-down scene interpretation that will not be restricted to pixel-wise evidence. Furthermore we will investigate how to include length information into the definition of aggregate types.

Figure 1: For each row some examples of given image sections for class balcony, entrance, arc-type windows, rectangular windows and background.

Figure 3: (Best viewed in colour) Toy example to visualize the meaning of aggregates of lines. We start from lines collected in the set A1. Aggregates of A2 are given by line junctions; each aggregate is shown in a different colour. Aggregates of A3 are groups of three lines, thus elements of A2 joining one line. They are visualized by two A2 aggregates joining the same colour. Overlapping aggregates are shown behind each other. Aggregates A4 with four lines are built analogously to the previous ones, adding one line to all elements in A3. Thus they are junctions of four lines, visualized by three A2 aggregates joining the same colour.

Figure 2: The scheme: Starting from labelled training images, we derive a set {A_k} of aggregates A_k with general type t_k ∈ T for each image and select those aggregates that belong to relevant types t_k ∈ V. The vocabulary V has I elements. We use the histogram h = [h(v_i)] of the aggregates of each image as a feature vector for the supervised classification. Given a test image we derive the relevant features and use their histogram for deriving its label.

Figure 4: (Best viewed in colour) Detected aggregates from the learned set of relevant aggregate types for one given image section. Left: lines detected by FEX, A1. Middle: relevant line junctions, A2 aggregates. Right: relevant aggregates from five lines, A5 aggregates. As aggregates may mutually overlap, not all junctions are visible.

Figure 5: Feature selection, error curves depending on the number of selected features. Gray: classification error for different test sets, red: mean error, green: 1σ band.

Figure 6: Feature selection, examples for the first 49 selected parts of A2 to A4, extracted from one cross validation step. Please note that these results are similar in all cross validation steps.
180°], thus the direction of the first line, the angle α ∈ [0° ... 180°[ between the two lines and the types (τ1, τ2), with τ1, τ2 ∈ {1, 2}, of their mutual connectivity. Please note that the connectivity (middle-middle) is not allowed, as this would be a crossing, which is no valid outcome of edge extraction. All angles are discretized in π/8 = 22.5° steps, thus we have 16 bins for the orientation and 8 bins for the angle. Together with three possible values for the line connectivity, we have 384 different line junctions. Given these definitions we code the geometry of A2 aggregates by unique numbers between 1 and 384, which define all possible types of A2. Detected instances of A2 aggregates are stored as a list A2 = {a_k^2} with a_k^2 = (t_k;

Table 2: Confusion matrix just using aggregates of A3, accuracy 75.2%, see Table 1
In order to test the classification performance we first use each subset A_d separately. Results in terms of confusion matrices are shown in Tables 1 to 4. By just using line junctions, aggregates of A2 give a reasonable classification performance. Balconies are classified with an 88% true positive rate. Note that we are dealing with single images, thus we have no 3D information, which usually guides recognition of balconies. The confusion matrix shows that classification of entrances is a challenging task; they are mostly classified as balconies. The confusion between rectangular and arc-type windows is low, which shows that just using line

Table 5: Confusion matrix using aggregates of A2 and A3, accuracy 74.6%, see Table

Table 7: Confusion matrix using aggregates of A2 to A5, accuracy 79.8%, see Table 1