SEMANTIC SEGMENTATION OF INDOOR POINT CLOUDS USING CONVOLUTIONAL NEURAL NETWORK

As Building Information Modelling (BIM) thrives, geometry alone is no longer sufficient; an ever-increasing variety of semantic information is needed to express an indoor model adequately. For existing buildings, however, automatically generating semantically enriched BIM from point cloud data is still in its infancy. Previous research on enhancing semantic content relies on frameworks in which specific rules and/or features are hand-coded by specialists. These methods inherently lack generalization and easily break under different circumstances. A generalized framework is therefore needed to generate semantic information automatically and accurately. We propose to employ deep learning techniques for the semantic segmentation of point clouds into meaningful parts. More specifically, we build a volumetric data representation in order to efficiently generate the high number of training samples needed to train a convolutional neural network architecture. Feedforward propagation is then used to perform classification at the voxel level, which yields the semantic segmentation. The method is tested on both a mobile laser scanner point cloud and larger-scale synthetically generated data. We also present a case study in which our method is used to improve the extraction of planar surfaces in challenging cluttered indoor environments.


INTRODUCTION
Semantic information is increasingly becoming an indispensable ingredient of BIM. Applications such as energy flow monitoring, emergency management, retrofit planning, and visualisation (Volk et al., 2014) crucially depend on the availability of class information for the entities in the model. For new constructions, this information is essentially input in the design phase. In contrast, for an existing building, semantic enrichment takes place only after the building has been geometrically modelled. In other words, labels are given to extracted surfaces; the unstructured point cloud is not perceived to possess categorical information.
However, holding this semantic information prior to geometric modelling could greatly contribute to the conventional modelling process. For instance, directly recognizing which points belong to the wall category could bypass the need to calculate surface normals and to assume that horizontal normals form a good basis for identifying that class.
With this interest in employing more meaningful features in modelling, the concept of semantic segmentation has arisen in different research domains, mainly computer vision and robotics (Thoma, 2016). As an important notion towards complete scene understanding, semantic segmentation is applied in numerous applications such as autonomous driving, augmented reality, and computational photography (Garcia-Garcia et al., 2017).
Research on semantic segmentation in indoor modelling is usually performed on RGB-D sensor depth images of small indoor scenes. On the other hand, the importance of 3D point clouds for devising better-performing classifiers has been demonstrated (Koppula et al., 2011). Since then, the robotics community has effectively employed SLAM-based techniques to jointly obtain localization and more meaningful maps of indoor environments. Despite considerable advances in drift error reduction through pose estimation graphs, there are still feasibility issues for large-scale indoor semantic segmentation (Fuentes-Pacheco et al., 2015).
Keeping large-scale indoor building models in mind, besides the inherent scale limitations of these commonly employed data acquisition techniques, the methodological framework that generally accompanies them, probabilistic graphical models such as Conditional Random Fields (CRF), suffers from a similar problem. In order to mitigate the computational burden encountered in optimization, a preliminary clustering such as superpixel grouping (Fulkerson et al., 2009) or line extraction (Jung et al., 2016) has to be applied to the data in large-scale classification applications.
In the last five years we have witnessed the revival of neural networks, as deeper architectures have become practical (Krizhevsky et al., 2012), (Szegedy et al., 2015). In particular, the Convolutional Neural Network (CNN) has been the leading network type in many successful practical applications (Karpathy et al., 2014), (Farfade et al., 2015). In a CNN, a powerful hierarchical representation is generated through self-learned features in a supervised manner directly from the data. Compared to conventional features engineered by a specialist, these self-learned features provide a powerful geometric discriminator among data categories. Although CNNs were initially applied to classifying whole images, a number of other computer vision tasks such as object detection and recognition, and among them semantic segmentation, also benefit from modified networks and algorithms. However, the huge amount of labelled training data necessary to match the depth of the network and the number of parameters to be optimized becomes even harder to obtain when per-pixel labelled training data is required for semantic segmentation.
In this study, a method to directly classify large-scale 3D indoor point clouds using Convolutional Neural Networks has been developed. Our main contributions are as follows:

• Large-scale indoor point cloud classification: We have acquired and prepared suitable datasets, both real and synthetic, defined 3D input-output relationships compatible with a simple, fast CNN architecture, and tuned our algorithm to train and run over a high number of points and a floor-sized indoor space.
• Effective clutter removal based on semantics: We have tackled the problem of clutter in indoor modelling by reframing it as simple semantic filtering.
• Demonstration of enhanced planar extraction: Finally, we demonstrate how geometry reconstruction can benefit from semantic segmentation with a case study of planar extraction enhancement (Fig. 1).

RELATED WORK
The literature related to our research can be divided into five interrelated categories. We start with geometric indoor modelling from point clouds, in which semantics follow the geometry extraction as a separate step. Then important work on semantic segmentation, which originated in image processing and extended to depth images of small-scale indoor scenes, is addressed. Next, the introduction of Convolutional Neural Networks to indoor scene segmentation and the advances achieved in image processing are discussed. The implementation of CNNs on 3D data in general, with a focus on point clouds, is also briefly covered. Finally, in order to provide a background for our case study of planar extraction in indoor modelling, relevant papers are mentioned.

Semantic Indoor Modelling
In indoor modelling from point clouds, it is common practice to first extract planar primitives and subsequently classify them into the horizontal structural elements of ceiling and floor, and vertical walls. Various methods have recently been developed based on projection plane histogram analysis (Okorn et al., 2010), plane sweep (Budroni and Boehm, 2010), surface normals (Sanchez and Zakhor, 2012), stacking (Xiong et al., 2013), and diffusion embedding (Mura et al., 2013). Common to all these methods is the sequential approach of labelling the structural elements according to previously segmented planar surfaces.

Semantic Segmentation
In contrast to the sequential approach, semantic segmentation is the segmentation of the data as a natural result of classifying a basic unit, usually a pixel or superpixel. An early example can be found in the work of Huang et al. (2002) on land cover classification using Support Vector Machines. To provide the general framework and impose consistency on the segments, CRF has been a standard technique for the last decade. In a pioneering work, Silberman and Fergus (2011) applied CRF to achieve dense labelling in small indoor scenes captured by a low-cost depth sensor.

Convolutional Neural Networks on Semantic Segmentation
Recently, convolutional neural networks have become the state of the art for semantic segmentation tasks. For indoor scenes, depth sensors continue to be employed, and a CNN version of full scene labelling was introduced by Couprie et al. (2013). With the advancement of CNN research, different efficient network architectures have been proposed. Among them, DeepLab, which combines CNNs with a fully connected CRF (Chen et al., 2016); Fully Convolutional Networks (FCN), which employ 1×1 convolutions, skip connections and upsampling (Long and Darrell, 2015); Deconvolutional Neural Networks (Noh et al., 2015); CRF-Recurrent Neural Networks (Zheng et al., 2015); and SegNet (Badrinarayanan et al., 2015) can be named as significant developments. For more details about these architectures the reader is referred to the review paper by Garcia-Garcia et al. (2017).

Convolutional Neural Networks on 3D Data
After a pause following the very early attempts to implement convolutional neural networks directly on 3D data, the topic has rapidly attracted attention in a variety of research communities. VoxNet (Maturana and Scherer, 2015), based on a volumetric representation as the name implies, is one of the first effective implementations of CNNs for object detection, in which the whole content of a bounding box is classified as one segment.
On the other hand, multi-view 2D CNNs achieve slightly better results because they exploit models pre-trained on very large image datasets. Recently, in an approach similar to ours, Huang and You (2016) classify urban Lidar points without [...]. It is also worth mentioning another very recent approach that attempts to resolve the dilemma of applying CNNs to 3D data either as volumetric 3D or as multi-view 2D by proposing a new deep learning architecture based on Autoencoders that can operate directly on 3D point clouds (Qi et al., 2017).

Planar Extraction in Indoor Modelling
Primitive extraction is a critical part of many indoor geometry modelling methods for extracting planar surfaces in man-made indoor spaces. Planar surface fitting in laser data is an already extensively studied field in the remote sensing community. A comprehensive study by Nurunnabi et al. (2014) compares some of the existing algorithms. A similar comparison for mobile indoor mapping can be found in Nguyen et al. (2007). For indoor reconstruction, variants of the Hough Transform (Okorn et al., 2010), (Oesau et al., 2014) and RANSAC (RAndom SAmple Consensus) (Dumitru et al., 2013), (Ochmann et al., 2014) are employed, besides plane sweeping by Budroni and Boehm (2010) and EM (Expectation-Maximization) by Thrun et al. (2004). Sanchez and Zakhor (2012) utilized RANSAC in a region growing method to detect planar primitives.

METHOD
The input to our method is the raw point cloud, and the output is the densely labelled point cloud, in which a label is assigned to each point. In order to employ a fast and simple CNN architecture, the point cloud is first densely voxelized and an occupancy representation is formed. For labelling the training data, the manual classification of the point cloud is transferred into the voxel domain by majority voting over the points in each voxel. Subsequently, voxels are agglomerated into cubes (Fig. 2). To provide variability in the training data, cubes are created by shifting along every principal axis direction with the smallest stride size, one voxel. Once the cubes are generated, they are fed into the CNN architecture, which is designed to handle 3D voxel data. The result of the CNN classifier is assigned to the centre voxel, and the classification is repeated to ensure a continuous dense classification of all voxels.
The details of the data preparation and the CNN architecture are explained in this section.

Data Preparation
A deep Convolutional Neural Network classification paradigm is extremely data dependent; therefore the data is of utmost importance.
A major drawback of CNNs for practical applications is the requirement for large amounts of data. Continuous efforts by the computer vision research community have culminated in large-scale image databases such as ImageNet (Deng et al., 2009), which are freely available to researchers.
Though not at the same scale, a similar effort provides RGB-D datasets for indoor scene reconstruction purposes. However, when it comes to per-pixel labelled datasets, the options are still scarce. Moreover, large-scale indoor modelling from point clouds lacks a dedicated point cloud database that could be utilized for training deep networks. Therefore, there is an obvious need for such a database.
A CNN demands a lattice structure as input, while the raw point cloud is unordered and cannot be processed directly by a CNN architecture. Therefore, to remain in 3D, the raw point cloud has to be transferred into a volumetric representation. Depending on whether the task is classification or segmentation, a number of alternative data preparation paths can be taken, which are summarized in Table 1. Initially, the bounding box of the raw point cloud is calculated. We now describe how we divide this bounding box at two scales: voxel and cuboid.
The 3D bounding box is first divided into voxels of a pre-defined size. There are different ways to represent a voxel, the simplest of which is a binary occupancy value that indicates whether any points exist within the voxel. If points are present, the intensity value of the voxel is set to one; otherwise it is zero. A finer way is to count how many points fall into the voxel and assign this point count (density) as the voxel's intensity value. There are also more advanced representations which take probability distributions into account (Maturana and Scherer, 2015). In our case, we select the density representation, which is computationally efficient and preserves more information from the raw point cloud than the occupancy value alone.
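The density voxelization described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments; the function name `voxelize_density` and its return values are assumptions made here, and the 5 cm default simply mirrors the voxel resolution reported later in the paper.

```python
import numpy as np

def voxelize_density(points, voxel_size=0.05):
    """Count points per voxel inside the bounding box of the cloud.

    Returns the density grid, the bounding-box minimum corner, and the
    integer voxel index of every input point (hypothetical interface).
    """
    points = np.asarray(points, dtype=float)
    origin = points.min(axis=0)                      # bounding-box corner
    idx = np.floor((points - origin) / voxel_size).astype(int)
    shape = tuple(idx.max(axis=0) + 1)               # voxels per axis
    grid = np.zeros(shape, dtype=np.int32)
    # Unbuffered add so repeated indices are all counted.
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return grid, origin, idx
```

The binary occupancy representation mentioned in the text follows directly as `grid > 0`.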
Table 1. Data preparation paths (surviving entries: Has no label; Entire Cuboid (from Voxel Label)).

Figure 2. Our Convolutional Neural Network has four main blocks, each having its respective convolution-pooling and activation layers. The input cube is fed into the network and the result of the soft-max layer is assigned to the centre voxel. n is the variable for the number of categories.
Next, the generated voxels are encapsulated in cuboids, which are the input of the neural network. Specifically, for a given voxel, its corresponding cuboid is an aggregation of voxels in three dimensions whose centre voxel is the given voxel. In our setting, each cuboid consists of 21×21×21 voxels.
For voxels located at the boundary of the bounding box, we use zero-padding when generating the corresponding cuboid to ensure that all cuboids have the same size. The relation between voxel and cuboid is analogous to that between pixel and image in the 2D image plane; it expresses the raw point cloud in a 3D raster format suitable as input for a CNN architecture.
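As an illustration of the cuboid construction with zero-padding, the following sketch cuts a fixed-size cube around a given centre voxel. The helper name `extract_cuboid` is hypothetical, and padding the whole grid on every call is chosen for clarity only; a practical pipeline would pad once and reuse the padded grid.

```python
import numpy as np

def extract_cuboid(grid, center, size=21):
    """Return the size^3 neighbourhood centred on voxel `center`.

    Voxels outside the bounding box are filled with zeros, so every
    cuboid has the same shape regardless of where the centre lies.
    """
    half = size // 2
    padded = np.pad(grid, half, mode="constant", constant_values=0)
    x, y, z = center
    # In the padded grid the centre sits at (x+half, y+half, z+half),
    # so the cube starts at the original index on each axis.
    return padded[x:x + size, y:y + size, z:z + size]
```

Shifting `center` by one voxel along each axis reproduces the smallest stride used to generate training cubes.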
For training data, we also need to assign a label to every voxel.
Since manually assigning a label to every voxel is difficult, we manually label the raw 3D point cloud, and the label of each voxel is then determined from the labels of its points by majority voting. The label of a cuboid is directly set to the label of its centre voxel. Empty voxels are ignored in both the training and test data.
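The majority vote from point labels to voxel labels can be sketched as follows. The function name, the use of -1 for empty voxels, and the tie-breaking rule (the lowest class index wins) are assumptions of this sketch, not details given in the paper. The dense vote table is only practical for small grids; a sparse structure would be needed at floor scale.

```python
import numpy as np

def voxel_labels_by_majority(idx, point_labels, grid_shape, n_classes):
    """Assign each voxel the most frequent label among its points.

    idx: (N, 3) integer voxel index of every point.
    point_labels: (N,) integer class per point.
    Empty voxels receive -1 so they can be skipped later.
    """
    idx = np.asarray(idx)
    point_labels = np.asarray(point_labels)
    flat = np.ravel_multi_index(tuple(idx.T), grid_shape)
    votes = np.zeros((int(np.prod(grid_shape)), n_classes), dtype=np.int64)
    np.add.at(votes, (flat, point_labels), 1)        # one vote per point
    labels = votes.argmax(axis=1)                    # ties: lowest class index
    labels[votes.sum(axis=1) == 0] = -1              # mark empty voxels
    return labels.reshape(grid_shape)
```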
For the test dataset, we also generate a cuboid for each non-empty voxel and assign the classification result to the centre voxel of the cuboid. All points which fall into this voxel are assigned the label of this centre voxel. In this setting, we can generate a point-based semantic segmentation result without resorting to a more sophisticated CNN architecture such as FCN or Deconvolutional Neural Networks.
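The test-time procedure can be summarised in a short loop. In this sketch, `classify` is a hypothetical callable standing in for the trained CNN (it receives one cuboid and returns a class index); the function name and the -1 marker for empty voxels are likewise assumptions made for illustration.

```python
import numpy as np

def segment_points(grid, idx, classify, size=21):
    """Label every point by classifying the cuboid around its voxel.

    grid: density grid; idx: (N, 3) voxel index of each point;
    classify: placeholder for the trained network (cuboid -> class).
    """
    half = size // 2
    padded = np.pad(grid, half)                      # zero-padding by default
    voxel_pred = np.full(grid.shape, -1, dtype=int)  # -1 marks empty voxels
    for x, y, z in np.argwhere(grid > 0):            # non-empty voxels only
        cube = padded[x:x + size, y:y + size, z:z + size]
        voxel_pred[x, y, z] = classify(cube)         # result goes to centre voxel
    # Every point inherits the label of the voxel it falls into.
    return voxel_pred[idx[:, 0], idx[:, 1], idx[:, 2]]
```

Because each voxel is classified independently, the loop can be batched or parallelised without changing the result.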

Model
A CNN is essentially a discriminative classifier that models the desired output y in a form where φ is the feature set learned by optimizing the parameters θ, and ω maps the feature set to the output. In the case of a deeper network, these mappings are generated with non-linear activation functions, which renders the classifier highly applicable to non-linear classification problems.
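The displayed equation did not survive in this copy of the text. A form consistent with the definitions given above (features φ parameterized by θ, followed by a mapping ω to the output) would be the following; the input symbol x and the linear form of the final mapping are assumptions, not taken from the source.

```latex
y = f(x;\, \theta, \omega) = \phi(x;\, \theta)^{\top} \omega
```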
The network architecture f(n) is generally composed of convolution and pooling layers, as depicted in Fig. 2.

Optimization
In the optimization process, we employed the cross-entropy cost function with a weight decay value of 0.001. Instead of using the relatively slow Stochastic Gradient Descent, we approximate the minimum with the ADAM method (Kingma and Ba, 2014). A batch size of 256 and 45 training epochs are used throughout the training process.
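To make the optimization setup concrete, the sketch below shows a soft-max cost and a single ADAM update with L2 weight decay in plain NumPy. It assumes the cost is the standard soft-max cross-entropy and that weight decay is folded into the gradient; the learning rate and the β and ε values are the defaults suggested by Kingma and Ba (2014) and are assumptions here, since the paper reports only the weight decay, batch size and epoch count. The experiments themselves were run with MatConvNet, not with this code.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of soft-max outputs and its gradient w.r.t. logits."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    loss = -log_probs[np.arange(n), labels].mean()
    grad = np.exp(log_probs)
    grad[np.arange(n), labels] -= 1.0
    return loss, grad / n

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=1e-3):
    """One ADAM update; `state` holds the step count and moment estimates."""
    grad = grad + weight_decay * w                   # L2 weight decay term
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```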

EXPERIMENTAL RESULTS
We conduct our experiments on two kinds of datasets that differ in their generation sources: dense mobile laser scanner data, and large-scale synthetic point cloud data populated from an architectural CAD model. The overall segmentation results are in line with the complexity and challenge expected from diversifying the sets. The overall classification accuracies are indicated at the bottom right of the confusion matrices. The CNN is implemented with the MatConvNet library (Vedaldi and Lenc, 2015). All experiments are conducted with a 5 cm voxel resolution.
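For readers who wish to reproduce the evaluation metrics, a confusion matrix and the overall accuracy can be computed as sketched below. The convention that rows are ground-truth classes and columns are predictions is an assumption of this sketch; the layout of the paper's tables is not recoverable from the text.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    return np.trace(cm) / cm.sum()
```

Per-class accuracy under this convention is `np.diag(cm) / cm.sum(axis=1)` for classes that occur in the ground truth.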

Indoor Point Cloud Segmentation
Part of a mobile laser point cloud acquired by a TIMMS platform equipped with sideways Faro scanners is used for real data evaluation (Fig. 4). The cluttered indoor environment consists of two semi-detached rooms, one of which contains another room. The point cloud consists of 4.5 million points over a total area of 90 m². For training data preparation, an attempt to benefit from pre-segmentation using algorithms such as connected component analysis or region growing turned out to be ineffective due to the highly cluttered environment. Hence we resort to full manual classification of the points, which proves laborious at large scale (Fig. 3). As our method envisages online testing in a follow-up study, the point cloud is not pre-processed with noise reduction. Nevertheless, 2% of the points could not be recognized and classified by the human operator. In order to see the potential of our method, we select the general computer lab/office area as the training site and reserve the meeting room as the test site (Fig. 4).
Our method is evaluated with 7 classes, and the results are given in Table 2 and Table 3. The overall accuracy is 0.81. In particular, wall detection is very satisfactory and promises substantial leverage for our previous object-driven space partitioning framework (Babacan et al., 2016), whereas the accuracies for small objects such as monitor and shelf are not very satisfying. Object scale appears as a problem in classification, in addition to the class ambiguities themselves. For instance, the category shelf consists of both standalone bookshelves in any part of the room and the longer wall-attached ones. Likewise, the category object includes anything of any size that could be counted as an object and is not covered by other categories, e.g. an artificial tree or a computer case. Other object categories such as desk and chair appear to have fairly well exploitable geometric structures. Detailed results for some particular scenes are shown in Fig. 5.
The full segmentation is depicted in Fig. 6a. By virtue of the very high accuracy of wall detection, the segmented wall points are very representative of the indoor model even in their raw format (Fig. 6b). For reference we digitize a CAD model of the test site in AutoCAD Revit. Despite some individual erroneously classified small clutter and a large cabinet occluding the small wall, the segmented wall points closely follow the indoor model.

Synthetic Data Segmentation
In order to gain further insight, we analyse another indoor dataset, this time synthetic data devoid of any clutter but extended to a floor-scale environment. A synthetic point cloud is populated from a CAD model of a basement consisting of over 40 rooms and corridors (Fig. 7). The model is relatively simple in terms of the diversity of its categories; it only consists of the structural elements of the building: wall, floor, door, and beam. We generate 10 million points for the whole model, but run the training with only 20% of the points due to computational limitations.
As can be seen from Table 4 and Table 5, the CNN is very successful on the dominant horizontal and vertical architectural structures such as walls and floor, and also gives promising door detection results from the door frame alone, as opposed to our previous complete door detection framework (Fig. 8). The majority of beams are also recognized. Apart from a thin slice of wall misclassified as beam, the results are satisfactory in general and inviting for an object detection framework, as there is no other misclassification at the object scale. The overall accuracy is 0.89.

Semantic Planar Segmentation
We finally display the power of CNN-based semantic segmentation in indoor modelling by applying the segmentation results to drive planar extraction on the point cloud. Our motivation is the fact that planar extraction can become extremely challenging in indoor environments, especially in the presence of high clutter and occlusion. We previously circumvented this problem by means of favourable slice selection. However, this solution necessarily limits the information content to a narrow 3D slice, or even converts the problem into 2D line extraction. Leaving aside the pros and cons of this approach, we present an alternative that can be employed effectively and directly in 3D.
Once equipped with semantic information, the approach is simple and straightforward: it comes down to selecting the relevant categories for plane candidates and applying the extraction algorithm to each category individually. Here we exhibit results for the RANSAC algorithm (Schnabel et al., 2007) applied to the mobile laser scanner point cloud dataset.
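The idea of restricting plane extraction to one semantic class can be illustrated with a basic single-plane RANSAC. Note that this is a plain textbook RANSAC written for illustration, not the efficient multi-primitive variant of Schnabel et al. (2007) used for the reported results; the function names, the 2 cm inlier threshold and the iteration count are assumptions.

```python
import numpy as np

def ransac_plane(points, threshold=0.02, iterations=200, seed=0):
    """Fit one plane n.x + d = 0; return ((n, d), inlier mask)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(iterations):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-12:                              # collinear sample, skip
            continue
        normal = normal / norm
        dist = np.abs((points - p0) @ normal)         # point-to-plane distance
        inliers = dist < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, -normal @ p0)
    return best_plane, best_inliers

def planes_for_class(points, labels, wanted_label, **kwargs):
    """Run plane extraction only on points predicted as `wanted_label`."""
    return ransac_plane(points[labels == wanted_label], **kwargs)
```

Running `planes_for_class` on the wall class alone keeps clutter points out of the candidate set, which is the filtering effect the case study relies on.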
As can be seen in Fig. 9, the semantic selection of the wall points effectively reduces the number of irrelevant planes passed on to geometric modelling, and hence increases the percentage of planes corresponding to walls in the overall extraction. This framework is also potentially beneficial in the overall space partitioning process, as demonstrated in our previous work.

CONCLUSION AND OUTLOOK
In this paper we propose a viable method to extract semantic information for indoor modelling. A convolutional neural network is designed for 3D data to obtain a semantic segmentation of indoor point clouds. Experimental results demonstrate that the methodology can adapt to different kinds of datasets, both real and synthetic, at various densities and with different sets of categories. A simple example of how this semantic information can be deployed to mitigate the challenge of geometry modelling is also given.
Yet there is ample room to improve the results. First, the inherent class ambiguity issue should be tackled. This problem is closely related to dataset size and variation; hence the indoor modelling research community needs to put emphasis on producing large datasets with diverse categories. Present datasets are mainly for small scenes and/or object oriented; the relation to real environments that lack prior segmentation information should be established. Finally, we advocate that geometry modelling could benefit immensely from considering semantics, which we strive to pursue in future studies.

Figure 1. The input and outputs of the method: a) raw point cloud of the cluttered indoor environment; b) direct application of the previously trained CNN classifier to the test site; c) selection of the wall points determined by the CNN; d) the result of the planar extraction, applied only to the corresponding wall points.

Figure 3. Clustering algorithms such as connected component analysis and region growing are not effective in cluttered environments (left-hand side). Hence laborious manual labelling for training dataset generation has been applied directly to the 3D point cloud (right-hand side).

Figure 4. The real dataset is generated by a mobile laser scanner. The left-hand partition of the environment is set as the training site, whereas the right-hand partition has similar computer laboratory/office characteristics and also contains a fairly different meeting room space to test our algorithm (green part on the lower right).

Figure 5. Particular details from the segmentation results. The left-hand side of each image pair is the ground truth, while the right-hand side is the CNN prediction. a) of the three humans in the scene, two are correctly labelled while the one in the middle is mistaken for a wall; b) the table in the meeting room is generally segmented along with the surrounding chairs, despite the classifier having been trained only on desk examples; c) shelves are fairly well detected while monitors are largely missed.

Figure 6. a) full semantic segmentation results from the top view; b) labelled wall points largely delineate the floor plan; c) CAD model digitized from the point cloud for quick reference.

Figure 7. The synthetic point cloud generated from a CAD model.

Figure 8. a) the result of the semantic segmentation; among a great majority of correctly labelled points, a façade wall is misclassified as a long beam; b) a close-up of the results; the door frames are recognized well.

Figure 9. RANSAC planar extraction results: a) applied to the raw point cloud; b) applied to wall points detected by the CNN. It can clearly be seen that the number of irrelevant planes declines and a cleaner planar extraction is achieved.