SUPERVISED OUTLIER DETECTION IN LARGE-SCALE MVS POINT CLOUDS FOR 3D CITY MODELING APPLICATIONS

ABSTRACT: We propose to use a discriminative classifier for outlier detection in large-scale point clouds of cities generated via multi-view stereo (MVS) from densely acquired images. What makes outlier removal hard are the varying distributions of inliers and outliers across a scene. Heuristic outlier removal using a specific feature that encodes the point distribution often delivers unsatisfying results: although most outliers can be identified correctly (high recall), many inliers are erroneously removed as well (low precision). This aggravates object 3D reconstruction due to missing data. We thus propose to discriminatively learn class-specific distributions directly from the data to achieve high precision. We apply a standard Random Forest classifier that infers a binary label (inlier or outlier) for each 3D point in the raw, unfiltered point cloud and test two approaches for training. In the first, non-semantic approach, features are extracted without considering the semantic interpretation of the 3D points. The trained model approximates the average distribution of inliers and outliers across all semantic classes. Second, semantic interpretation is incorporated into the learning process, i.e. we train separate inlier-outlier classifiers per semantic class (building facades, roofs, ground, vegetation, fields, and water). The performance of learned filtering is evaluated on several large MVS point clouds of cities. Our results confirm the underlying assumption that discriminatively learning inlier-outlier distributions improves precision over global heuristics by up to ≈ 12 percentage points. Moreover, semantically informed filtering that models class-specific distributions further improves precision by up to ≈ 10 percentage points, being able to remove very isolated building, roof, and water points while preserving inliers on building facades and vegetation.


INTRODUCTION
Outlier detection refers to the process of identifying patterns in data that do not comply with the general or expected behavior of the data. Outliers can be very different in nature, and the exact definition depends on the target application and the underlying assumptions regarding the data structure and the data generating process. Since the definition of an outlier depends on both the given data and the task, we propose to learn discriminative classifiers for outlier removal in point clouds and, further, to model class-specific distributions. Our goal is to remove most outliers while retaining the large majority of inliers (high precision). For 3D object reconstruction, missing data (incorrectly removed inliers) is usually more harmful than a few remaining outliers close to the true surface. Parts without sufficient data cannot be reconstructed at all, whereas outliers close to the true object surface are handled by smoothing priors that are built into the 3D reconstruction approach (Häne et al., 2013, Bláha et al., 2016). In this paper, we thus aim for high precision and assume that the few remaining outliers are handled by regularizers of the 3D reconstruction pipeline.
The automatic detection and elimination of noise and outliers in point cloud data sets is a long-standing, active field of research (Cheng and Lau, 2017). Most point cloud filtering techniques are dedicated to applications in industrial metrology and are hence tailored to point clouds with a relatively small proportion of outliers and homogeneous point densities. In contrast, point clouds generated by image-based, multi-view stereo (MVS) 3D reconstruction techniques contain large proportions of outliers and very inhomogeneous point densities across a scene. A common strategy is to avoid outliers already at the depth map estimation stage by enforcing consistency across views (Goesele et al., 2007, Furukawa and Ponce, 2010, Wolff et al., 2016). Still, gross outliers in MVS point clouds pose significant challenges to surface reconstruction algorithms. Conventional meshing techniques fail in their presence and require substantial manual post-processing. Volumetric 3D reconstruction approaches risk losing many details if regularizers or visibility constraints are enforced strongly.
Here, we propose to view MVS point cloud filtering as a preprocessing step that is applied after depth map fusion and before 3D reconstruction. The main idea is to filter outliers in large MVS point clouds by learning class-specific inlier-outlier distributions with supervised classifiers. Our target application is semantically annotated 3D city models generated by MVS using aerial cameras. We build on recent works (Häne et al., 2013, Bláha et al., 2016, Bláha et al., 2017) that exploit the multi-view imaging setup to simultaneously reconstruct and segment 3D models into semantically meaningful 3D entities such as building facades, roofs, streets, and vegetation, where 3D shape and semantic class labels are mutually supportive.
Although supervised approaches have proven to be effective for many classification tasks, supervised outlier detection approaches are difficult to realize in practice. First, it is demanding and often prohibitively expensive to obtain an accurate and representative training data set which comprises both normal and outlier data instances. Second, the outlier distribution of the data is sometimes unknown in advance and, third, it can be dynamic in nature. For MVS point clouds derived from aerial images, the inlier-outlier distributions are assumed to be static. Labeling inliers and outliers for training can be done efficiently by overlaying the raw point cloud with an already existing semantic 3D city model and imposing a fixed threshold on the point-to-mesh distances.
We test two approaches: (i) inlier-outlier distributions are modelled globally regardless of semantic classes and (ii) class-specific distributions are learned. Both approaches are validated on large aerial MVS point clouds and compared to a conventional, unsupervised, heuristic baseline. We find that supervised machine learning achieves much higher precision than a heuristic baseline method. Moreover, class-specific filtering further improves results by retaining more inliers in low-density areas like vertical building facades, which will allow more accurate 3D reconstruction. Additionally, we check the generalization capability of our learned models. We show that once sufficiently trained on a larger scene, models can be applied to unseen aerial MVS scenarios and still achieve reasonably good results.

RELATED WORK
A large variety of point cloud filtering approaches exists. Point cloud denoising approaches are typically used in the context of 3D surface reconstruction and do not detect outliers directly. Instead, these methods aim at reducing the noise inherent in point clouds by adapting the position of raw points. In contrast, unsupervised and supervised point cloud filtering approaches are dedicated to detecting and removing outliers among the data without changing the position of raw points. In the following, we roughly classify related point cloud filtering approaches into these three categories.
Point cloud denoising has been approached in various ways. The seminal moving least squares (MLS) method of (Levin, 2004) reduces noise in point clouds implicitly by projecting the points onto a locally fitted low-degree bivariate polynomial. Several variants of the traditional MLS approach have been developed, mainly to reduce the filtering effect near sharp features and to handle sparse sampling and outliers. The modifications are based on an iterative refitting scheme to model locally piecewise smooth surfaces (Fleishman et al., 2005), adjust the polynomial fitting procedure (Guennebaud and Gross, 2007), introduce a parameterization-free projection operator (Lipman et al., 2007), or express the MLS procedure as a kernel regression process including robust statistics (Öztireli et al., 2009, Öztireli, 2015). Further point cloud denoising approaches are inspired by filtering techniques used in image processing (Deschaud and Goulette, 2010, Digne, 2012) or follow concepts developed in the fields of differential geometry (Ma and Cripps, 2011) and spectral analysis (Öztireli et al., 2010).
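The MLS projection idea can be illustrated with a short sketch: a point is pulled onto a weighted least-squares plane fitted to its neighborhood (a plane fit with Gaussian weights stands in for the low-degree polynomial; the function name and parameter values are ours, not taken from any of the cited works).

```python
import numpy as np

def mls_project(points, i, radius=0.5):
    """Project point i onto a weighted least-squares plane fitted to its
    neighborhood (Gaussian weights). Denoising moves the point; it does
    not remove it."""
    p = points[i]
    d = np.linalg.norm(points - p, axis=1)
    m = d < radius
    w = np.exp(-(d[m] / radius) ** 2)
    centroid = (w[:, None] * points[m]).sum(axis=0) / w.sum()
    # weighted covariance; its smallest-eigenvalue direction is the normal
    c = points[m] - centroid
    cov = (c * w[:, None]).T @ c
    normal = np.linalg.eigh(cov)[1][:, 0]
    return p - np.dot(p - centroid, normal) * normal

# noisy sample near the plane z = 0 gets pulled towards it
rng = np.random.default_rng(7)
pts = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1, 200),
                       rng.normal(0, 0.01, 200)])
pts[0, 2] = 0.05                      # perturb one point off the surface
moved = mls_project(pts, 0, radius=0.3)
```

Note that, unlike the outlier removal discussed next, this changes coordinates rather than discarding points.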
Unsupervised outlier detection constitutes the majority of approaches and can be further subdivided into (i) statistical, (ii) clustering-based, and (iii) distance-based approaches. Statistical outlier detection assumes that the data is generated by a stochastic process. Any data instance that is unlikely to be generated from the estimated stochastic process according to some test statistic is then reported as an outlier (Barnett and Lewis, 1974, Eskin, 2000). The statistical outlier removal tool implemented in the Point Cloud Library (PCL) assumes that the average distance of a point to its nearest neighbors follows a Gaussian distribution and performs statistical hypothesis testing to identify and discard points whose average distance to their neighbors is outside a certain confidence interval (Rusu and Cousins, 2011). Nonparametric outlier detection methods infer the underlying probability distribution of inliers and outliers directly from the data using clustering (He et al., 2003, Yu et al., 2002, Schall et al., 2005, Latecki et al., 2007). Further, early works often applied distance-based methods (Knorr and Ng, 1998) to find global outliers using the k-nearest neighborhood of a data instance to compute its outlier score. Typically, the outlier score of a data instance is given by the distance to its k-th nearest neighbor (Ramaswamy et al., 2000) or by the average distance to all other data instances within the k-nearest neighborhood (Angiulli and Pizzuti, 2002). A strategy to identify local outliers is based on the assumption that local outliers are located in areas of relatively low density compared to their k-nearest neighbors. In (Breunig et al., 2000), the outlier score of a data instance is computed as the ratio of the average local density of the k-nearest neighbors to the local density of the data instance itself. Several extensions of this idea have been proposed, mainly to improve the density estimation procedure for linearly distributed data sets (Jin et al., 2006) and to better handle regions of different densities that are not clearly separated (Tang et al., 2002).
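The density-ratio score of (Breunig et al., 2000) can be sketched in a simplified form (we omit the reachability-distance smoothing of the full LOF formulation; function and variable names are ours):

```python
import numpy as np

def density_outlier_scores(points, k=5):
    """Simplified LOF-style score: ratio of the mean local density of a
    point's k nearest neighbors to the point's own local density.
    Scores well above 1 indicate candidate outliers."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    knn_idx = np.argsort(d, axis=1)[:, :k]
    knn_dist = np.take_along_axis(d, knn_idx, axis=1)
    # local density: inverse of the mean distance to the k nearest neighbors
    density = 1.0 / knn_dist.mean(axis=1)
    neigh_density = density[knn_idx].mean(axis=1)
    return neigh_density / density

# dense cluster plus one isolated point
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (50, 3)), [[5.0, 5.0, 5.0]]])
scores = density_outlier_scores(pts, k=5)
```

Here the isolated point receives a score far above 1, while cluster points score close to 1; the brute-force distance matrix is only viable for small examples.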
Supervised outlier detection refers to approaches where distributions are learned with labeled ground truth. One strategy is to learn only the inlier distribution of points and to view any data instance that deviates significantly from the trained model as an outlier. For example, one-class support vector machines (Rätsch et al., 2002, Amer et al., 2013) or one-class kernel Fisher discriminant analysis (Roth, 2004) have been applied with this strategy. In our paper, we propose to learn both inlier and outlier distributions with labeled ground truth from the data. Moreover, we learn class-specific models to better cope with the varying inlier-outlier distributions as a function of the object class.

OUTLIER DETECTION
Reasons for outliers are manifold. Typical sources are human or instrumental errors, and natural variations or unexpected changes in the behavior of a system. In practice, data sets are usually impacted by multiple types of outliers, and it is subject to the application whether a particular type of outlier is of interest or not. For 3D city modeling from aerial images, outlier removal is an essential pre-processing step to generate a cleaner data set for 3D reconstruction. The nature of outliers is one of the key aspects that need to be considered when designing an outlier detection algorithm. According to (Chandola et al., 2009), outliers can be classified into the following three main categories:
• point outlier: a single data instance that deviates significantly from the remaining data set
• collective outlier: a group or sequence of data instances that deviates significantly from the remaining data set, even though the individual data instances may not be anomalous
From a global perspective, P3 would be classified as a normal data instance due to its proximity to cluster C2. However, when examined locally, P3 appears to be a local outlier because its distance to cluster C2 is relatively large compared to the spacing between the data instances of cluster C2. In comparison, data instance P4 should be considered normal, although its distance to the nearest cluster C1 is roughly the same as the distance between P3 and C2. Lastly, the points forming cluster C3 can be classified either as global outliers or as a small regular cluster. It depends on the application whether such micro clusters need to be detected as anomalous or not.
Most point outlier detection methods fail to capture both global and local outliers. Methods tailored to detect local outliers may be able to identify global outliers as well, provided that the global outliers are sparsely distributed and do not form a micro cluster. However, methods tailored to detect global outliers can hardly be applied to detect local outliers (as illustrated in Fig. 1). In general, it is more challenging to detect local outliers than global outliers. First, the definition of locality is non-trivial and often ill-defined, especially if the data exhibits clusters of varying densities. Second, the statistical properties of a data instance are strongly affected if its spatial support includes nearby outliers or normal data instances of different distributions.

METHOD
Aerial MVS point clouds generated from nadir and oblique aerial images inevitably comprise a considerable amount of outliers. The purpose of point cloud filtering is to reduce outliers while preserving inliers. We develop a supervised binary classification scheme to assign each 3D point of a raw, unfiltered point cloud to one of the following two categories:
• 3D points assigned to the inlier point category are assumed to be located close to the underlying surface of the captured scene.
• 3D points assigned to the outlier point category are considered either global or local outliers. Global outliers are caused by systematic deviations or gross errors in the point cloud generation process (e.g., matching errors or inadequate camera calibration), whereas local outliers are induced by random deviations and uncertainties in the camera pose and depth map estimation procedure (e.g., depth quantization).
Ultimately, the filtered point cloud is derived by discarding all 3D points that are predicted as outliers.
The decision whether a 3D point is deemed an inlier or an outlier is primarily dependent on the local point distribution given by the 3D points within its vicinity. The neighborhood of inliers can be characterized by well-defined point distributions, even though the sampling density may vary locally due to the texture of the scene and the spatial configuration of the recorded images. In the context of urban scenes, these local point distributions display mainly planar (e.g., ground, building facades, and roofs) or spherical (e.g., vegetation) patterns. In contrast to these characteristic structures, the point neighborhood of global outliers is typically sparse and does not exhibit a distinct geometric layout.
The characteristic point distribution of inliers and outliers is not only an intrinsic property of urban point clouds in general but rather varies across different semantic classes of urban scenes.
In particular, point cloud regions representing building roofs or ground commonly exhibit a low level of noise, as these scene structures are well captured by nadir and oblique aerial images. However, these point cloud regions may be incomplete and show a varying point density due to low or missing texture of the underlying scene. Point cloud regions representing vegetated areas are usually densely sampled but are impaired by a considerable level of noise due to the repetitive texture of the underlying scene. Vertical scene structures like building facades exhibit more outliers and often have fewer inliers because they typically show repetitive textures and surface areas (e.g., windows) corrupted by specular reflections, two properties which lead to mismatches during image matching within the structure-from-motion pipeline. Further, the orientation of building facades with respect to the viewing direction of the (nadir) camera poses additional challenges to the image matching and depth map estimation procedure (e.g., the invalid assumption of fronto-parallel surfaces).
We follow two approaches for supervised outlier detection in urban point clouds. In the first approach, a discriminative model is trained to distinguish between the local point distribution of inliers and outliers without considering the semantic interpretation of the 3D points. Thus, the trained model approximates the average behavior of inliers and outliers across different semantic classes. The second, class-specific approach postulates that the local point distribution of inliers and outliers is specific to each of the semantic classes (building facades, roof, ground, vegetation, fields, and water) for the reasons described previously. A discriminative model is trained for each of the semantic classes to better adapt to the individual inlier and outlier distributions per class.

Feature Extraction
We compute 24 standard features from the literature per 3D point Pi that are either adapted from unsupervised outlier detection methods or deduced from LiDAR point cloud labeling methods. The reader is referred to the original works for an in-depth coverage of the applied features, which can be grouped into the following five categories:
• density-based features (Ramaswamy et al., 2000, Breunig et al., 2000, Angiulli and Pizzuti, 2002, Kriegel et al., 2008, Zhang et al., 2009)
• 3D eigenvalue-based features (Weinmann et al., 2013)
• local plane-based features (Chehata et al., 2009)
• height-based features (Weinmann et al., 2015b)
• 2D features (Weinmann et al., 2013)
The local neighborhood Ni of a 3D point Pi is defined as the smallest sphere centered at Pi that encompasses the k ∈ N closest 3D points to Pi with respect to the Euclidean distance in 3D space. 3D points that are located at the same distance to Pi as its k-th nearest neighbor are included in Ni as well. Consequently, the number of neighbors in a local point neighborhood may vary among the 3D points but has a lower limit of at least k neighbors. Note that the 3D point Pi itself is excluded from its local point neighborhood Ni.
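As an illustration of one feature group, the 3D eigenvalue-based features of (Weinmann et al., 2013) can be sketched as follows (a simplified version that takes exactly k neighbors and ignores the distance-tie rule described above; names are ours):

```python
import numpy as np

def eigen_features(points, i, k=10):
    """Eigenvalue-based shape features of the k-nearest-neighbor
    neighborhood of point i (the query point itself is excluded, as in
    the neighborhood definition above). Returns (linearity, planarity,
    sphericity) from the sorted covariance eigenvalues l1 >= l2 >= l3."""
    d = np.linalg.norm(points - points[i], axis=1)
    d[i] = np.inf                          # exclude the query point
    nn = points[np.argsort(d)[:k]]
    cov = np.cov(nn, rowvar=False)
    ev = np.sort(np.linalg.eigvalsh(cov))[::-1]
    ev = np.clip(ev, 1e-12, None)          # guard against numerical zeros
    linearity = (ev[0] - ev[1]) / ev[0]
    planarity = (ev[1] - ev[2]) / ev[0]
    sphericity = ev[2] / ev[0]
    return linearity, planarity, sphericity

# points sampled on the plane z = 0: sphericity should be near zero
rng = np.random.default_rng(1)
plane = np.column_stack([rng.uniform(0, 1, 100),
                         rng.uniform(0, 1, 100),
                         np.zeros(100)])
lin, pla, sph = eigen_features(plane, 0, k=10)
```

By construction the three features sum to one, so they describe the relative weight of linear, planar, and volumetric structure in the neighborhood.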
Following recent trends in 3D scene understanding and classification (Brodu and Lague, 2012), the features are extracted at multiple scales by varying the size k of the local point neighborhood. The rationale behind this approach is threefold: First, it avoids using heuristic or empirical knowledge of the scene to select the scale parameter k. Second, the optimal scale parameter k depends heavily on the local point density and the local 3D structure of the scene and may thus not be identical for each local 3D point neighborhood. In particular, it is presumed that the optimal neighborhood size of both inliers and local outliers is smaller than that of global outliers. Last, feature extraction at multiple scales provides additional information on how the local 3D structure behaves across scales, which in turn may support the discrimination between inliers and outliers. Specifically, it is assumed that the local 3D structure of inliers and possibly of local outliers is stable over a range of scales, whereas the local 3D structure of global outliers alters with varying scale.
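A minimal sketch of the multi-scale idea, concatenating two toy descriptors over several neighborhood sizes k (the actual feature set is far richer; function name and scale values are ours):

```python
import numpy as np

def multiscale_features(points, i, scales=(4, 8, 16)):
    """Concatenate simple neighborhood descriptors of point i over
    several neighborhood sizes k: the mean neighbor distance (a density
    proxy) and the smallest-to-largest covariance eigenvalue ratio (a
    shape proxy). Stability of these values across scales is itself
    informative."""
    d = np.linalg.norm(points - points[i], axis=1)
    d[i] = np.inf                          # exclude the query point
    order = np.argsort(d)
    feats = []
    for k in scales:
        nn = points[order[:k]]
        feats.append(d[order[:k]].mean())  # grows with k
        ev = np.linalg.eigvalsh(np.cov(nn, rowvar=False))  # ascending
        feats.append(ev[0] / max(ev[2], 1e-12))
    return np.array(feats)

rng = np.random.default_rng(2)
pts = rng.normal(0, 1, (200, 3))
f = multiscale_features(pts, 0)            # 2 descriptors x 3 scales
```

Stacking such per-scale vectors for every point yields the feature matrix that the classifier in the next section consumes.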

Supervised Filtering
We train a Random Forest classifier (Breiman, 2001) using the features listed in Section 4.1 to learn the average behavior of inliers and outliers across all semantic classes. Random Forests have been shown to yield good results for many point cloud classification tasks (Chehata et al., 2009, Weinmann et al., 2015a), run efficiently on large data sets and can cope with redundant features. The optimal hyperparameters are determined via grid search and cross-validation. The classifier outputs a binary label per point indicating whether the respective 3D point is predicted as an inlier or an outlier. Eventually, the filtered point cloud is derived by assembling all 3D points that are predicted as inliers.
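The training stage can be sketched with scikit-learn as a stand-in for the Random Forest implementation actually used in the paper (synthetic features replace the real 24-dimensional point descriptors; the parameter grid is illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the per-point feature matrix (n_points x 24)
rng = np.random.default_rng(3)
X_in = rng.normal(0.0, 1.0, (300, 24))    # inlier feature vectors
X_out = rng.normal(3.0, 1.0, (100, 24))   # outliers, shifted distribution
X = np.vstack([X_in, X_out])
y = np.array([0] * 300 + [1] * 100)       # 0 = inlier, 1 = outlier

# hyperparameters determined via grid search with cross-validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=3, scoring="f1",
)
grid.fit(X, y)
pred = grid.best_estimator_.predict(X)
filtered = X[pred == 0]                   # keep points predicted as inliers
```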

Semantically Informed Filtering
In order to allow for different inlier-outlier distributions per object category, we make the supervised classification approach presented in Section 4.2 class-specific. We assume that each 3D point already comes with a class likelihood, which originates from previous image labeling and projection to 3D as described in (Bláha et al., 2016). This additional semantic information per point is used to train multiple classifiers, where each classifier learns the inlier-outlier distribution of a specific semantic class.
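A sketch of the class-specific scheme: one inlier-outlier classifier per semantic class, with each point routed to the model of its class (scikit-learn stand-in on synthetic data; all names are ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_per_class(X, y_inout, sem_labels):
    """Train one inlier/outlier classifier per semantic class."""
    models = {}
    for c in np.unique(sem_labels):
        m = sem_labels == c
        models[c] = RandomForestClassifier(
            n_estimators=50, random_state=0).fit(X[m], y_inout[m])
    return models

def predict_per_class(models, X, sem_labels):
    """Route each point to the classifier of its semantic class."""
    pred = np.empty(len(X), dtype=int)
    for c, model in models.items():
        m = sem_labels == c
        if m.any():
            pred[m] = model.predict(X[m])
    return pred

rng = np.random.default_rng(4)
X = rng.normal(0, 1, (400, 24))
sem = rng.integers(0, 3, 400)             # e.g. facade / roof / vegetation
y = (X[:, 0] > 1).astype(int)             # toy inlier-outlier labels
models = train_per_class(X, y, sem)
pred = predict_per_class(models, X, sem)
```

Each per-class model only ever sees the inlier-outlier distribution of its own class, which is the point of the approach.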

Implementation Details
Our point cloud filtering method is implemented in MATLAB. Initial tests showed that the Random Forest classifier provided in the MATLAB toolbox is incapable of processing large data sets. Furthermore, the hyperparameters of that Random Forest classifier cannot be accessed or modified easily. Because of these limitations, the ETH Random Forest Template Library is incorporated into the implemented point cloud filtering routine. It is written in C++ and is hence suited to processing large data sets. Beyond a considerable decrease in computation time, it further allows the hyperparameters of the classifier to be set manually.

EXPERIMENTS
We evaluate our approach on three large-scale aerial MVS point clouds with different structure and semantic classes. The aerial image sets are Enschede (Netherlands), Dortmund (Zeche Zollern, Germany), and Zurich (Switzerland). The three aerial image sets were acquired in the Maltese cross configuration (i.e., one nadir image and four oblique views to the north, south, east, and west per camera position) to mitigate visibility problems such as foreshortening or occlusion. We use the standard VisualSFM pipeline of (Wu, 2011) to orient the image blocks and the public implementation of plane-sweep stereo (Häne et al., 2014) with semi-global matching as smoothness prior (Hirschmüller, 2008) to estimate per-view depth maps. Further, we apply a multi-class boosting classifier (Benbouzid et al., 2012, Bláha et al., 2016) to predict pixelwise class-conditional likelihoods of the six semantic object classes building (facades), roofs, ground (impervious surfaces), vegetation (trees), fields, and water. Given the depth information and the class likelihoods at each pixel, we generate the input point clouds to our algorithm by back-projecting the pixels into 3D space and assigning to each 3D point the semantic label with the maximal class likelihood of the corresponding image pixel.
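The semantic label assignment per back-projected point reduces to an argmax over the six class-conditional likelihoods, e.g. (toy likelihood values, ours):

```python
import numpy as np

# toy pixelwise class-conditional likelihoods for the six classes
# (building, roof, ground, vegetation, fields, water)
likelihoods = np.array([[0.10, 0.60, 0.10, 0.10, 0.05, 0.05],
                        [0.70, 0.10, 0.10, 0.05, 0.03, 0.02]])
labels = likelihoods.argmax(axis=1)   # index of the most likely class
```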

Data Pre-Processing
The data sets Dortmund, Enschede, and Zurich differ in the number and resolution of the images. Consequently, the generated MVS point clouds have different point densities. In order to ensure fair comparisons, the point clouds need to have roughly the same average point density. We thus balance densities among data sets by adapting the percentage of back-projected image pixels (per view and site) such that the resulting point clouds exhibit a median distance of about 0.45 m between the points. After this preprocessing, we have 5.2 million (Dortmund), 14.8 million (Enschede), and 5.8 million (Zurich) points, respectively.
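The density-balancing step can be sketched as follows (random subsampling stands in for adapting the back-projected pixel percentage; the brute-force distance computation is only viable for small examples, and all names are ours):

```python
import numpy as np

def median_spacing(points):
    """Median distance of each point to its nearest neighbor."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.median(d.min(axis=1))

def subsample_to_spacing(points, target, rng):
    """Randomly drop points until the median spacing reaches the target
    (a coarse stand-in for reducing the back-projected pixel fraction)."""
    pts = points
    while len(pts) > 10 and median_spacing(pts) < target:
        keep = rng.choice(len(pts), size=int(0.9 * len(pts)), replace=False)
        pts = pts[keep]
    return pts

rng = np.random.default_rng(5)
dense = rng.uniform(0, 5, (1000, 3))       # spacing well below 0.45
sparse = subsample_to_spacing(dense, 0.45, rng)
```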

Ground Truth Labeling and Evaluation Strategy
A major shortcoming of the available data sets is their lack of ground truth, i.e. the actual segmentation of the point clouds into inliers and outliers is unknown. To generate ground truth labels, we take the semantic 3D models created by the approach of (Bláha et al., 2016) as a reference, even though they do not reflect reality perfectly. For each 3D point of a raw point cloud, we compare its distance to the corresponding semantic mesh of the 3D model against a manually chosen threshold. If the point-to-mesh distance is below the threshold, we declare the point an inlier. A 3D point whose point-to-mesh distance exceeds the threshold is declared an outlier. We use a two-sided threshold of 0.6 m across all three data sets, which is experimentally determined through visual inspection and corresponds to three times the resolution of the semantic 3D models.
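A sketch of the labeling rule, with the point-to-mesh distance replaced by the distance to the plane z = 0 as a stand-in reference surface (function name ours):

```python
import numpy as np

THRESHOLD = 0.6  # metres; three times the resolution of the reference model

def label_against_reference(ref_dist):
    """Label points as inlier (0) or outlier (1) by thresholding the
    unsigned distance to the reference model. `ref_dist` stands in for a
    true point-to-mesh distance computation."""
    return (np.abs(ref_dist) > THRESHOLD).astype(int)

# stand-in: reference surface is the plane z = 0, so the distance is |z|
rng = np.random.default_rng(6)
pts = rng.uniform(-1, 1, (500, 3))
labels = label_against_reference(pts[:, 2])
```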
We compute confusion matrices and derive the standard measures accuracy, precision, recall, and F1-score for quantitative evaluation. Since we strive for outlier detection, an outlier correctly detected as such is defined as a true positive (TP), whereas a correctly detected inlier is a true negative (TN). Accordingly, an outlier incorrectly classified as an inlier is a false negative (FN), while an inlier wrongly detected as an outlier is a false positive (FP). We consider our MVS point cloud filtering method as a pre-processing step to generate a cleaner data set as input to a 3D reconstruction pipeline.

Figure 2: Cross-validated precision-recall curves of the supervised, non-semantic filtering approach (dashed lines) and the semantically informed filtering approach (solid lines). The colors indicate the semantic classes building, roof, ground, vegetation, fields, and water.
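With the outlier class as positive, the measures follow directly from the confusion-matrix counts, e.g. (toy labels, ours):

```python
import numpy as np

def outlier_metrics(y_true, y_pred):
    """Precision/recall/F1/accuracy with the *outlier* class (1) as
    positive: TP = outlier detected as outlier, FP = inlier removed
    by mistake."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, f1, accuracy

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # 3 outliers, 5 inliers
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])   # one miss, one false alarm
p, r, f1, acc = outlier_metrics(y_true, y_pred)
# p = 2/3, r = 2/3, acc = 6/8
```

High precision under this convention means few inliers are removed, which is exactly the goal stated in the introduction.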
A single model trained across all classes cannot capture inliers and outliers equally well as a class-specific one. While this effect is smaller for classes that show similar point distributions, like vegetation and fields of Dortmund, it becomes apparent for classes with very different inlier-outlier distributions, like building. Roofs, ground, fields, vegetation, and water bodies are mainly horizontally oriented. In contrast, building facades are mostly vertical and dense matching is hampered by textureless regions, repetitive textures, and surface parts corrupted by specular reflections. As a result, building points exhibit a fundamentally different inlier-outlier distribution, which is only inaccurately captured by a classification model averaged over all semantic classes. We provide precision-recall curves in Figure 2. Semantically informed filtering consistently outperforms supervised, non-semantic filtering in terms of both recall and precision. The most striking improvement is observed for the building class (red lines).
Qualitative comparison Figure 3 shows detailed views of the raw, unfiltered point clouds and their filtered versions. Points located in free space are removed correctly by all filtering approaches. However, the quality of the filtering is improved in three different ways if inlier-outlier distributions are learned in a class-specific way. Firstly, isolated building, roof, and water points are correctly removed. Secondly, clustered outliers located between building fronts and erroneous building points in the vicinity of dense roof areas are successfully discarded, too. Thirdly and most importantly, considerably more inliers are retained in low-density areas like vertical building facades.
Generalization capability A general drawback of supervised methods compared to unsupervised ones is that a new model usually has to be trained from scratch per scene. To verify the extent to which our learned models generalize across different scenes, we train on two data sets and test on the third (results are shown in Tab. 3).
Class-specific models consistently outperform average ones across scenes regarding precision and F1-score. An interesting finding is that the precision (and F1-scores) of all classes are high compared to the unsupervised, heuristic baseline (cf. Tab. 1). This indicates that a supervised outlier filter trained on a different scene might still work better than an unsupervised heuristic. However, this has to be taken with a grain of salt due to the limited number of data sets, similar acquisition properties, and similar scene content. As soon as point cloud distributions vary strongly across scenes, this might no longer hold. However, re-training a pre-trained classifier on a very small portion of the new scene might solve the problem. We leave this for future work.

CONCLUSION
In this paper we propose to formulate outlier filtering in MVS point clouds as a supervised classification problem. Further, given point-wise class likelihoods, we show that incorporating class-specific knowledge for outlier detection significantly improves precision while keeping inliers in low-density areas like building facades. The main insights of this work are that (i) inlier-outlier distributions in aerial MVS point clouds are class-specific, (ii) training supervised classifiers per class improves over learning average distributions across all classes, and (iii) once classifiers have been trained on a sufficiently large amount of training data, the models generalize relatively well to new scenes under the assumption that these have been acquired and pre-processed similarly.
Despite the generic nature of the developed point cloud filtering algorithm, a bottleneck is the transferability of a trained model to a new scene and modality. As with any supervised classifier, the learned inlier-outlier distributions are directly related to the acquisition technique (e.g., active or passive measurement method, sensor type, aerial or terrestrial data acquisition, flight plan, etc.) as well as the scene content. Applying the method to entirely new scenes that contain a very different set of classes, or that have been acquired with a different sensor type, requires labeled reference data.
In future work we will investigate this transfer learning problem in more detail. We will experiment with point clouds of different modalities (e.g., LiDAR) and replace the traditional classification pipeline with a 3D deep learning approach.

Figure 1: Illustration of the different types of point outliers by means of a two-dimensional synthetic data set. The data set encompasses three clusters C1, C2, and C3 of normal data instances, two global point outliers P1 and P2, and one local point outlier P3. Unlike P3, the data instance P4 is normal and belongs to cluster C1.
• contextual outlier: a single data instance that is only anomalous in a specific context (e.g., a spatial or temporal context)
Point outliers can be further subdivided into global and local outliers. A global outlier is a single data instance that deviates significantly from the entire data set. A single data instance is considered a local outlier if it differs substantially from other data instances within its vicinity. This notion of global and local outliers is illustrated in Figure 1. P1 and P2 can easily be detected as global outliers, as these data instances exhibit a considerable distance to the remaining points.