OBJECT DETECTION AND CLASSIFICATION FROM CLUTTERED LARGE-SCALE INDOOR SCENE VIA ANCHOR-BASED GRAPH

Indoor object detection and classification from scanned point clouds have recently attracted considerable research interest. However, detecting and classifying objects with arbitrary upward orientation remains a substantial challenge. This paper presents an anchor-based graph method that exploits the geometric and topological similarity among indoor objects. The misclassification that usually occurs for objects that are not placed upright on the floor is overcome by extracting an anchor in each graph from the nodes' geometric attributes and by matching graphs through the topological relationship between nodes and anchor, rather than through features along the upward orientation. A region growing-based method along the anchor's upward orientation is proposed for classifying the unlabeled over-segmented parts. This anchor-based method preserves both the accuracy of object classification and the geometric integrity of each object. Experimental tests on three real-world 3D scans of indoor environments show the effectiveness and feasibility of the proposed method.


INTRODUCTION
3D indoor object detection and classification have received increasing attention in recent years (Landrieu, 2018; Mattausch et al., 2014). They form a fundamental research area for applications such as autonomous vehicles (Mattausch et al., 2014; Naseer et al., 2018), indoor reconstruction (Wang et al., 2016) and robotics (Breuer et al., 2011). Moreover, recent advances in scanning technology have greatly accelerated data acquisition and improved the accuracy of scanned point clouds (Wang et al., 2016; Mattausch et al., 2014). Together, these factors have contributed to the growth of research on 3D indoor object recognition and scene understanding.
Indoor object classification from scanned point clouds is still remarkably challenging (Naseer et al., 2018). The procedure is complicated by restrictions in the data and by the complexity of indoor environments, which may exhibit high levels of clutter and occlusion. Despite recent research efforts, a satisfactory solution for indoor object classification is still lacking, especially in cluttered indoor areas (Mattausch et al., 2014; Wang et al., 2016). Specialized object classification methods try to reduce the complexity of the indoor scene by assuming that all indoor objects are placed upright on the floor (Mattausch et al., 2014; Nan et al., 2012) or, more restrictively, that all objects have the same upward orientation (Armeni et al., 2016). Although these assumptions may hold for some indoor objects, many objects in the real world deviate from them. More recently, the focus has shifted to segmented patch-based methods (Czerniawski et al., 2018; Mattausch et al., 2014), which segment the raw point cloud into a set of patches and cluster these patches by their geometric attributes. These methods treat object classification as a patch segmentation and clustering problem, which is effective in many cases but has difficulty classifying objects because the relationships among segments are not captured. More recent works (Dai, 2017; Qi et al., 2016; Qi et al., 2017; Shi et al., 2019) exploit object repetitions to segment the indoor scene and classify the objects with learning-based methods. The limitation of these approaches is the need to carefully acquire and learn the 3D geometry of each type of object one wishes to detect, which takes a large amount of time. Thus, detecting and classifying objects with arbitrary poses in a cluttered environment, without sufficient training data, is still a challenge for these methods.
As observed by Spina (2015), Laga et al. (2013) and Fu et al. (2008), there is a strong correlation in geometric shape and upward orientation between the functional parts (referred to as anchors) of man-made objects. In light of these observations, this study proposes an anchor-based graph method for detecting and classifying indoor objects, which describes each object as a graph formed by connecting its anchors with its other parts. The raw point cloud is the input, and labeled points representing object types are the output.
The remainder of this paper is organized as follows. Important related works are introduced in Section 2. The proposed method is described in Section 3. Experiments on the three datasets are presented in Section 4, followed by a discussion in Section 5. Finally, the conclusions are drawn in Section 6.

RELATED WORKS
According to how they deal with the orientation of indoor objects, current methods for object detection and classification can be classified into two groups: upward orientation-based and graph-based.

Upward Orientation-based Methods
The methods in this group assume that all indoor objects are placed upright on the ground and extract geometric features along the upright orientation in each cell to classify indoor objects. The cell can be a patch, a voxel or a point.
With the cell defined as a patch, these works classify indoor objects by patch clustering based on the Manhattan-world hypothesis, where the point cloud is first segmented into patches. Mattausch et al. (2014) provide a patch similarity measure and apply DBSCAN clustering in a diffusion embedding, which automatically segments and classifies the whole scene consistently. The diffusion embedding reduces the geometric deviation among similar patches, so that both furniture and indoor structures, such as walls, doors and windows, can be detected, especially in working environments. Valero et al. (2016) recognize indoor furniture by comparing height and shape templates with pre-constructed models, where the shape templates differ with the type of object. For instance, a tabletop template is a large horizontal rectangle, whereas chair leg templates are located at the vertices of a regular polygon or form a star-like patch after the legs are projected onto the floor. This approach can both recognize typical indoor objects and generate semantic 3D models of furnished interiors for TLS datasets without occlusions. Czerniawski et al. (2018) employ a six-dimensional DBSCAN-based method to obtain better segmentations by appending the points' normals to the Euclidean coordinates as the last three dimensions. The method successfully segments indoor structures but has difficulty classifying objects because the relationships among segmented patches are not captured.
Using geometric features extracted from partitioned cells, some works (Tchapmi et al., 2017; Wang et al., 2017) extend the well-studied 2D convolutional neural network (CNN), which has been widely used for images, to a 3D CNN to segment and classify indoor objects, although the performance of these voxel-based methods is limited by the voxel resolution (Liang et al., 2019). Some deep learning-based studies (Li et al., 2018; Qi et al., 2017) take raw point clouds as input without extra pre-processing and directly exploit the geometric similarity among points to classify indoor objects. Although these works develop a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing, they rely heavily on the local geometric information extracted from voxels (or points) and do not use the local relationships among them, which limits their robustness.
The above methods demonstrate that objects can be detected from geometric features computed on various cells, and they perform well on indoor objects that are upright with respect to the ground.
Because sliding windows are effective in reducing the computational workload of deep learning-based methods, many works (Armeni et al., 2016; Simonovsky and Komodakis, 2017; Armeni et al., 2017; Wang et al., 2018; Liang et al., 2019) use an adjacency graph to strengthen the relationships among cells when classifying indoor objects. Armeni et al. (2016) employ an adjacency graph among the voxels in a sliding window to refine the detected objects, which are extracted by geometric feature similarity after the indoor scene is partitioned into a k-by-k-by-k voxel grid. Although this work can extract both indoor structures and indoor objects, its object classification is still limited by the voxel resolution. Graph CNN-based methods (Liang et al., 2019; Simonovsky and Komodakis, 2017; Wang et al., 2018) perform semantic segmentation directly on raw point clouds, using local features extracted from each point's geometric attributes and contextual information from the KNN graph of the center point in each fixed sliding window. Although the contextual information improves the consistency of the classified objects, these approaches still depend on an a priori defined upright orientation.
The cited studies show that object detection and classification is still far from satisfactory, and that the main deficiency of existing methods is their inability to handle non-upright orientations in a cluttered indoor environment.

Overview
In this section, we introduce our object detection and classification method. The proposed method takes raw point clouds as input and produces labeled points representing object types as output. It consists of three main steps:


Pre-processing: The input point cloud is first segmented into a collection of nearly planar patches, and the indoor structure patches are filtered out by the areas of their fitting rectangles. Then the anchor-based graphs in the scene are constructed from the extracted anchors and the patches' adjacency relationships.
Graph clustering: The graphs are roughly clustered by their anchors' geometric similarity, followed by a graph matching algorithm via super-graph that finds the corresponding nodes among graphs with maximum likelihood. Then each clustered group is labeled as one type by matching the graphs in this group with prepared template-graphs.
Object refinement: Each detected object is refined by extending its anchor's fitting rectangle along its normal to add the unlabeled patches to it.

Patch Segmentation:
The indoor scene is first partitioned into a set of nearly planar patches by the region-growing method (Rabbani et al., 2012; Truong-Hong and Laefer, 2015). An initial seed point is selected in the area with the smallest curvature among the points that have not yet been assigned to a patch. A point $p$ is added to the patch of the current seed point $s$ if the following conditions are satisfied:

$$\angle(n_p, n_s) < \theta_{th}, \qquad d(p, s) < d_{th}$$

The point $p$ is also added to the list of potential seed points, from which the growing continues, if the following condition is satisfied:

$$c_p < c_{th}$$

The process is applied iteratively until all points are segmented and assigned to patches. Here $n_p$ is the normal of $p$ and $n_s$ is the normal of $s$. $\theta_{th}$ is a smoothness threshold, specified in terms of the angle between the normals of $p$ and $s$.
$d_{th}$ is a distance threshold, specified in terms of the Euclidean distance $d(p, s)$ between $p$ and $s$.
$c_p$ is the curvature of point $p$ and $c_{th}$ is a curvature threshold, which can be specified as a percentile of the sorted curvatures.
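The growing rule can be illustrated with a minimal sketch. It assumes that points, unit normals and curvatures are already available, and that a point joins a patch when both the smoothness and the distance test pass, and becomes a new seed when its curvature is below the curvature threshold. The function and parameter names (`region_grow`, `theta_th`, `d_th`, `c_th`) are illustrative and not taken from the original implementation, and the brute-force neighbour scan is used only to keep the example short.

```python
import math


def angle_between(n1, n2):
    """Unsigned angle (radians) between two normals; the sign of a normal is ignored."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm = math.hypot(*n1) * math.hypot(*n2)
    return math.acos(min(1.0, abs(dot) / norm))


def region_grow(points, normals, curvatures, theta_th, d_th, c_th):
    """Assign a patch id to every point (assumed reading of the growing rule)."""
    labels = [-1] * len(points)
    # Seeds are tried in order of increasing curvature.
    order = sorted(range(len(points)), key=lambda i: curvatures[i])
    patch_id = 0
    for start in order:
        if labels[start] != -1:
            continue
        labels[start] = patch_id
        seeds = [start]
        while seeds:
            s = seeds.pop()
            # Brute-force scan over all points; a spatial index would be used in practice.
            for p in range(len(points)):
                if labels[p] != -1:
                    continue
                close = math.dist(points[p], points[s]) < d_th
                smooth = angle_between(normals[p], normals[s]) < theta_th
                if close and smooth:
                    labels[p] = patch_id
                    # Low-curvature points become seeds and keep the patch growing.
                    if curvatures[p] < c_th:
                        seeds.append(p)
        patch_id += 1
    return labels
```

With a small floor strip and a small wall strip meeting at a corner, the sketch separates the two planes because the normals differ by 90 degrees even though the points are close to each other.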
For each patch, the fitting rectangle is constructed by taking the bounding box of the patch and projecting it onto the plane spanned by its first two dominant axes, which describes the geometric shape of the patch. Indoor structures that occupy a large spatial area (Ochmann et al., 2019; Wang et al., 2016; Zolanvari et al., 2018), such as the main walls, floors and ceilings, are filtered out by the areas of their fitting rectangles. As a result, the indoor scene is partitioned into a set of patches with their fitting rectangles.

Anchor-based Adjacency Graph Construction:
A straightforward strategy for constructing the adjacency graph from the segmented patches is to connect every adjacent patch successively. However, this yields a large number of combinations and very large topological graphs, which makes graph matching difficult. According to earlier observations (Spina, 2015; Laga et al., 2013; Fu et al., 2008), one object can be represented by an anchor-based graph whose anchors connect the other parts of the object. Because an indoor scene contains sub-scenes formed by several objects that are spatially close to each other (Figure 1b), rather than only independent objects (Figure 1a), we propose the sub-anchor to handle these cases.
Given a patch set, the adjacency graph $G(V, E)$ is constructed by connecting every pair of adjacent patches. It can represent one object or one sub-scene with several objects, where $V$ and $E$ denote the node set and the edge set of the graph, with nodes $v \in V$ and edges $e \in E$. In each graph, a node whose fitting rectangle area is larger than the threshold (0.3 in our experiments) and which has more than one neighbor is referred to as a sub-anchor. The sub-anchor with the maximum number of neighbors in each graph is assigned as the anchor. If two adjacent nodes are both assigned as sub-anchors, the one whose normal forms the larger angle with the anchor's normal is removed. Finally, the anchor-based graphs are constructed by connecting the anchors and sub-anchors with all their adjacent nodes.
The adjacency graph of one chair is shown in Figure 1a and the adjacency graph of a sub-scene with one table and two chairs is shown in Figure 1b.
Figure 1. The anchor-based graphs: (a) the graph of one chair, where the red node represents the anchor and the yellow lines between the red node and the blue nodes represent edges; (b) the graph of a sub-scene with one table and two chairs, where the yellow node represents the sub-anchor.
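The anchor selection rule can be sketched as follows, assuming each graph is given as a dictionary of fitting rectangle areas and a dictionary of neighbor sets. The function name `extract_anchors` and the data layout are hypothetical. The sketch only covers the area and neighbor-count tests and the choice of the node with the most neighbors; the removal of one of two adjacent sub-anchors by normal angle is left out.

```python
def extract_anchors(areas, adjacency, area_th=0.3):
    """Return (anchor, sub_anchors) for one adjacency graph.

    areas:     dict mapping node -> fitting rectangle area
    adjacency: dict mapping node -> set of neighboring nodes
    area_th:   area threshold (0.3 is the value quoted in the text)
    """
    # Sub-anchor candidates: large fitting rectangle and more than one neighbor.
    candidates = [v for v in adjacency
                  if areas[v] > area_th and len(adjacency[v]) > 1]
    if not candidates:
        return None, []
    # The candidate with the most neighbors is taken as the anchor.
    anchor = max(candidates, key=lambda v: len(adjacency[v]))
    sub_anchors = [v for v in candidates if v != anchor]
    return anchor, sub_anchors
```

For a single chair, the seat is typically the only node that passes both tests, so it becomes the anchor and no sub-anchor remains. In a table-and-chairs sub-scene, the tabletop would become the anchor and the chair seats the sub-anchors, provided the tabletop has the most neighbors.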

Graph Descriptors:
The attributes of a node comprise an anchor flag, a center point and the features of the corresponding patch. The anchor flag represents the node type: 1 and 2 for the two anchor types (anchor and sub-anchor) and 0 for other nodes. Because the patch representation is intended to reduce the complexity of the input data, the segmented patches may suffer from over-segmentation. Thus, for each node, we compute the geometric features mainly from its fitting rectangle, as shown in Figure 2. The node features used in the present work are shown in Table 1. In Figure 2, the red box of the patch represents the fitting rectangle, whose length is l and width is w.
The attributes of an edge include an edge flag and the features of the relationship between a node and its anchor (or sub-anchor). The edge flag gives the edge type: 1 for an edge that involves the anchor and 0 for other edge types. The edge features used in the present work are shown in Table 2. The distance is the Euclidean distance between the center points of the two nodes joined by the edge. The edge orientation is the vector between these two center points.

Distance
Angle between the edge orientation and the anchor's normal
Angle between the edge orientation and the node's normal
Angle between the anchor's normal and the node's normal
Table 2. The edge features.
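The four edge features can be computed from two center points and two normals. The sketch below assumes the edge joins an anchor and a neighboring node with distinct centers and non-zero normals. The function name `edge_features` and the dictionary keys are illustrative and not part of the original implementation.

```python
import math


def _angle(u, v):
    """Angle (radians) between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    cos = dot / (math.hypot(*u) * math.hypot(*v))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding errors


def edge_features(anchor_center, anchor_normal, node_center, node_normal):
    """Distance and the three angles listed as edge features."""
    # Edge orientation: vector from the anchor center to the node center.
    orientation = tuple(b - a for a, b in zip(anchor_center, node_center))
    return {
        "distance": math.dist(anchor_center, node_center),
        "angle_orientation_anchor_normal": _angle(orientation, anchor_normal),
        "angle_orientation_node_normal": _angle(orientation, node_normal),
        "angle_between_normals": _angle(anchor_normal, node_normal),
    }
```

Because these features depend only on relative positions and normals, they do not change when the whole object is rotated, which is consistent with the goal of avoiding features tied to a fixed upward direction.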

Graph Clustering
In the previous section, the indoor scene was converted into a set of anchor-based graphs. The object classification problem can therefore be considered as assigning each node of each graph in this set one type from the type set. The procedure consists of three parts: rough clustering via anchor similarity, graph clustering via super-graph and cluster labeling.

Rough Clustering:
The graphs in the graph set are first clustered roughly by anchor similarity. This procedure starts by selecting two graphs randomly, each with its corresponding anchor, where a selected graph can be one object or one sub-scene. Inspired by Mattausch et al. (2014), a geometric similarity is measured between the two anchors from five per-feature terms, indexed by i = 1, 2, 3, 4, 5, each of which is defined as a minimum. Once this similarity exceeds the threshold, the two graphs have similar anchors and are clustered into one group. This process is performed iteratively until all graphs in the graph set have been grouped, so that graphs with similar anchors fall into the same group.
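The rough clustering step can be sketched under explicit assumptions, since the exact similarity formula is not recoverable from the text. Each anchor is assumed to be described by five positive features, each per-feature term is assumed to be the smaller of the two feature ratios, and the five terms are averaged. The ratio form, the averaging, the greedy comparison with the first member of each group, and the names `anchor_similarity` and `rough_cluster` are all assumptions made for illustration.

```python
def anchor_similarity(feats_a, feats_b):
    """Mean of per-feature min-ratio terms (assumed form; features must be positive)."""
    terms = [min(a / b, b / a) for a, b in zip(feats_a, feats_b)]
    return sum(terms) / len(terms)


def rough_cluster(anchor_features, threshold):
    """Greedily group graphs whose anchors are similar to a group's first member.

    anchor_features: list of feature vectors, one per graph anchor
    Returns a list of groups, each a list of graph indices.
    """
    groups = []
    for idx, feats in enumerate(anchor_features):
        for group in groups:
            representative = anchor_features[group[0]]
            if anchor_similarity(feats, representative) > threshold:
                group.append(idx)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([idx])
    return groups
```

Each min-ratio term lies in (0, 1] and equals 1 only when the two feature values agree, so a threshold close to 1 groups only nearly identical anchors.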

Graph Clustering:
For each roughly clustered group, an initial seed graph is selected as the graph with the maximum number of anchor neighbors and the most geometrically complete anchor. The first indicator ensures the integrity of the edges of the selected graph, while the second reflects the geometric completeness of its anchor. Then the proposed graph matching algorithm is performed on the super-graph constructed from the seed graph and another graph in the group; this graph matching algorithm is described in Section 3.3.3. The sub-graph returned by the matching represents objects of the same type and is partitioned into two pieces along its axis of symmetry. After all graphs in the group have been tested against the seed graph, all returned sub-graphs are clustered as one group, while the partitioned pieces are re-merged into one graph that is added to the current group. A new seed graph is then selected from the remaining graphs and matched against the remaining graphs in the group.
The above process repeats until all graph sets have been clustered and the indoor scene has been classified into a cluster set, in which each cluster represents one type of object.

Graph Matching:
Given two graphs $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$, graph matching can be seen as finding the optimal node correspondences between them, which can be represented as a binary indicator matrix $X \in \{0,1\}^{|V_1| \times |V_2|}$. If $v_i \in V_1$ matches $v_j \in V_2$, the corresponding entry of $X$ is 1, i.e., $X(i, j) = 1$, and 0 otherwise. After flattening the matrix into a vector $x \in \{0,1\}^{|V_1||V_2| \times 1}$, graph matching between $G_1$ and $G_2$ can be formulated as finding the optimal correspondences $x^*$ that maximize the matching similarity between $G_1$ and $G_2$:

$$x^* = \arg\max_x S(x \mid G_1, G_2), \qquad x \in \{0,1\}^{|V_1||V_2| \times 1} \qquad (6)$$

where $S(x \mid G_1, G_2)$ is a function measuring the matching similarity between $G_1$ and $G_2$ under the node-pair indicator $x$. This implies that the maximum of $S(x \mid G_1, G_2)$ can be found by traversing all combinations of node pairs between $G_1$ and $G_2$.
Because graph matching between $G_1$ and $G_2$ is formulated as searching all possible corresponding node pairs from $G_1$ and $G_2$ to maximize $S(x \mid G_1, G_2)$, we introduce a super-graph $\bar{G}(G_1, G_2)$ to represent all corresponding node pairs between $G_1$ and $G_2$, as shown in the black box in Figure 3. The super-graph is constructed by connecting the anchor of $G_1$ with the anchor of $G_2$ via an edge and connecting every other node $v_i \in V_1$ with $v_j \in V_2$ by a virtual edge that represents a corresponding node pair between $G_1$ and $G_2$.
$\bar{G}(G_1, G_2)$ contains all nodes and edges of $G_1$ and $G_2$; the reconstructed super-graph is shown in Figure 3. After reconstructing $\bar{G}$, $S(x \mid G_1, G_2)$ can be rewritten as $S(p_1, p_2, \ldots, p_k \mid \bar{G})$, where $p_i$ denotes a corresponding node pair and $k$ is the number of node pairs, which is reduced successively during the search, as described in Eq. 7. Thus, finding the optimal $x^*$ can be stated as finding the sub-graph of $\bar{G}$ that maximizes $S(p_1, p_2, \ldots, p_k \mid \bar{G})$. The searching details are shown in Algorithm 1.
In Eq. 7, the first term captures the geometric similarity between the nodes of a node pair, the second term measures the connective similarity between the edges of an edge pair, and the third term is the splitting/merging penalty for the nodes. A virtual edge links a node of one graph to a node of the other graph, and the three terms are combined with scalar weights that sum to 1. The three terms are defined by Eq. 8, Eq. 9 and Eq. 11, respectively.
A sub-anchor has a high likelihood of being an anchor for its neighbors. Thus, directly assigning a sub-anchor as one part of a certain object may lead to misclassification of its adjacent objects. Inspired by Alhashim et al. (2015), the anchor likelihood of a sub-anchor is proposed to estimate the probability of merging the sub-anchor into the current graph. It is calculated as the ratio between the volumes of the enclosing cuboid after and before splitting the sub-anchor and its neighbors from the graph. The smaller this likelihood is, the smaller the penalty for assigning the sub-anchor to the current graph becomes. The penalty of a node whose anchor flag equals 0 is 0.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume V-2-2020, 2020 XXIV ISPRS Congress (2020 edition)
Each of the three similarity measures may influence the final clustering results differently. In this paper, we place more value on the connective similarity than on the geometric similarity, which is sensitive to clutter and missing parts. The weights used in our experiments are 0.3 and 0.5 for the first two terms; the third follows from the constraint that the three weights sum to 1.
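As a reading aid for the matching objective of Eq. 6, the exhaustive traversal of node-pair combinations can be sketched as a brute-force search over one-to-one correspondences. This is not the super-graph search of Algorithm 1. The function name `best_matching` is hypothetical, and the score is a plain sum of a user-supplied node similarity, without the edge term or the splitting/merging penalty.

```python
from itertools import permutations


def best_matching(nodes_a, nodes_b, similarity):
    """Exhaustively search one-to-one correspondences from nodes_a into nodes_b.

    similarity: callable (node_a, node_b) -> float, higher means more similar.
    Returns (pairs, score) for the correspondence with the highest total score.
    """
    if len(nodes_a) > len(nodes_b):
        raise ValueError("nodes_a must not be larger than nodes_b")
    best_pairs, best_score = [], float("-inf")
    # Every ordered selection of len(nodes_a) nodes from nodes_b is one candidate.
    for selection in permutations(nodes_b, len(nodes_a)):
        pairs = list(zip(nodes_a, selection))
        score = sum(similarity(a, b) for a, b in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return best_pairs, best_score
```

The number of candidates grows factorially with the graph size, which illustrates why a restricted search structure such as the super-graph is attractive in practice.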

Cluster Labelling:
Once the scene has been partitioned and grouped into a cluster set, each cluster is labeled as one type from the type set by matching the graphs in the cluster with the template-graphs, which are constructed from the objects in Alhashim et al. (2015).
For each graph in a cluster, the similarity with every template-graph is computed via the graph matching algorithm in Section 3.3.3. The graph is labeled with the type of the template-graph for which it obtains the maximum similarity score.
A cluster is labeled as a given type if most of the graphs in it are labeled as that type, and all graphs in the cluster are then relabeled as that type.
Because tables are usually surrounded by chairs, which occlude the structure under the tabletop, an unlabeled patch with a relatively large area that is surrounded by more than two chairs is labelled as a table.
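The two labelling rules, best template per graph followed by a vote per cluster, can be sketched as below. The similarity scores are assumed to be precomputed by the graph matching step. The names `label_graph` and `label_cluster` are hypothetical, and the vote is implemented as a plurality, which is one possible reading of "most of the graphs".

```python
from collections import Counter


def label_graph(template_scores):
    """Pick the template type with the highest similarity score for one graph."""
    return max(template_scores, key=template_scores.get)


def label_cluster(per_graph_scores):
    """Label a cluster by the most frequent graph label and relabel all its graphs.

    per_graph_scores: list of dicts mapping template type -> similarity score
    Returns (cluster_label, relabeled_graph_labels).
    """
    votes = Counter(label_graph(scores) for scores in per_graph_scores)
    cluster_label = votes.most_common(1)[0][0]
    return cluster_label, [cluster_label] * len(per_graph_scores)
```

The relabeling step means that a graph whose own best template disagrees with the rest of its cluster inherits the cluster label, which makes the result less sensitive to individual graphs with missing parts.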

Object Refinement
Although most patches have been classified, some extracted patches remain unlabeled, for three reasons. The first reason is that a patch may have missing parts, especially for the legs of chairs or desks, as shown in Figure 4a. The second reason is that a patch may suffer from over-segmentation, as shown in the red box in Figure 4b. The last reason is that some tiny parts of the objects, such as chair handrails, are removed during graph matching, as shown in the green box in Figure 4b: the handrails of other chairs may not have been scanned, so no match on handrails is found among the chairs.
By the definition of the anchor, the legs and tiny parts of an object tend to be covered by the oriented bounding box of the anchor along its normal, as shown in Figure 4. Thus, we extend each anchor's fitting rectangle by a given distance along its normal. Once an extended box touches an unlabeled patch, this patch is merged into the object.
Figure 4. The unclassified cases: (a) missing parts in the chair leg (blue box), (b) the anchor over-segmentation (red box) and chair's tiny parts (green box).
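The refinement test can be sketched as a point-in-box check in the local frame of the anchor. The sketch assumes the fitting rectangle is described by its center, two in-plane unit axes, its unit normal, and its length and width, and that the rectangle is extended by the same distance on both sides of the plane. The two-sided extension, the "at least one point inside" criterion, and the names `inside_extended_box` and `merge_unlabeled` are assumptions.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def inside_extended_box(point, center, u_axis, v_axis, normal, length, width, extend):
    """True if the point lies in the fitting rectangle extruded by `extend` along the normal."""
    offset = tuple(p - c for p, c in zip(point, center))
    return (abs(_dot(offset, u_axis)) <= length / 2
            and abs(_dot(offset, v_axis)) <= width / 2
            and abs(_dot(offset, normal)) <= extend)


def merge_unlabeled(patches, anchor, extend):
    """Return the names of unlabeled patches with at least one point inside the extended box.

    patches: dict mapping patch name -> list of 3D points
    anchor:  dict with keys center, u_axis, v_axis, normal, length, width
    """
    return [name for name, pts in patches.items()
            if any(inside_extended_box(p, extend=extend, **anchor) for p in pts)]
```

Because the box is defined in the anchor's own frame, the same test applies to a chair lying on its side, where the legs extend horizontally rather than vertically.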

EXPERIMENTS
The proposed method was tested on three real datasets of indoor scenes, as shown in Figure 5a. The statistics of these datasets are shown in Table 3. The algorithm was implemented in C++ with CloudCompare and in MATLAB. All experiments were performed on a 3.60 GHz Intel Core i7-4790 processor with 12 GB of RAM.
Dataset-1 and -2 were taken from the S3DIS dataset, captured by a Matterport scanner (Armeni et al., 2016). Dataset-3 was obtained by a hand-held active-light scanner (MantisVision Inc.) (Nan et al., 2012). Clutter and occlusion were present in all of these datasets. Dataset-1 and -2 were acquired with RGB-D sensing, and the density of their point clouds was moderate. Dataset-1 and -2 provided highly detailed objects, although a large amount of data was still missing due to occlusions and restricted accessibility. Dataset-1 was a conference room with various chairs and a conference table, while Dataset-2 was a large-scale cluttered environment containing office rooms (part-1 and -2) and a storage room (part-3). Dataset-1 and -2 were used to test common office environments with missing data, and Dataset-3 was used to test a cluttered and noisy environment containing objects with various poses.
Quantitative evaluations on the classified results were conducted by using three metrics: completeness, correctness and quality.
$$\text{completeness} = \frac{TP}{TP + FN}, \qquad \text{correctness} = \frac{TP}{TP + FP}, \qquad \text{quality} = \frac{TP}{TP + FP + FN}$$

where TP represents true positives, i.e., the number of objects detected in both the classified result and the ground truth; FP represents false positives, i.e., the number of classified objects that could not be found in the ground truth; and FN represents false negatives, i.e., the number of ground-truth objects that were not classified.
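The three metrics can be computed directly from the object counts. The sketch below uses the standard definitions of completeness, correctness and quality; the function name `detection_metrics` and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def _ratio(numerator, denominator):
    """Safe division; returns 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, fn):
    """Return (completeness, correctness, quality) from object counts."""
    completeness = _ratio(tp, tp + fn)   # share of ground-truth objects that were found
    correctness = _ratio(tp, tp + fp)    # share of classified objects that are real
    quality = _ratio(tp, tp + fp + fn)   # penalizes both kinds of error
    return completeness, correctness, quality
```

Quality can never exceed either of the other two metrics, since its denominator contains both error counts.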

DISCUSSION
Figure 5 shows the experimental results of the proposed method. The original point clouds are shown in the first column. The points in each room of the input data are shown in the second column. The patch segmentations are shown in the third column, and the classified results in one room are shown in the fourth column (one colour represents one type of object).
For the quantitative analysis of the classified results, the three metrics for the different types of objects are shown in Table 4. As shown in Table 4, the chairs in every dataset had good correctness, which indicates that all classified chairs could be found in both the raw data and the ground truth. The completeness and quality of the chairs were higher than 0.8 in all datasets except Dataset-2 (part-3), which shows that our method is effective in classifying indoor chairs. However, some chairs remained unclassified, as shown in the red box in Figure 6, where almost all parts of the chair except the back were missing due to occlusion.
As Table 4 shows, most of the tables in the indoor scenes could be detected and classified, except in Dataset-2 (part-1), where the L-shaped table was over-segmented and labeled as two long tables because of their similar geometric shape. The bookcases were relatively easy to classify because of their relatively large volume and well-structured topological graph, which differ from those of other objects. Dataset-3 contained various chairs with different poses, and our method performed well on this dataset.
These results show that the proposed method is robust in detecting and classifying indoor objects, even with various upward orientations. However, the test on Dataset-2 (part-1) indicates that the patch segmentation method has difficulties with L-shaped tables.
Figure 6. The failure cases in Dataset-2 (part-3).

CONCLUSIONS
In this work, an anchor-based graph matching method is proposed for detecting and classifying indoor objects with arbitrary upward orientation. The graphs are matched by successively performing rough graph clustering via anchor similarity, super-graph segmentation via graph similarity and geometric refinement of the objects.
The proposed method was tested on three real indoor scenes. The experiments showed that the proposed method can classify indoor objects without a training dataset. Quantitatively, almost all completeness, correctness and quality values were above 0.8. These findings show that the presented method is appropriate for classifying objects with arbitrary upward orientation, and they demonstrate its effectiveness and feasibility.
However, the presented method can currently segment only planar surfaces and shows weakness in classifying L-shaped tables. These topics will be addressed in future work.