A TWO-STEP CLASSIFICATION APPROACH TO DISTINGUISHING SIMILAR OBJECTS IN MOBILE LIDAR POINT CLOUDS

Nowadays, lidar is widely used in cultural heritage documentation, urban modeling, and driverless car technology for its fast and accurate 3D scanning ability. However, full exploitation of the potential of point cloud data for efficient and automatic object recognition remains elusive. Recently, feature-based methods have become very popular in object recognition on account of their good performance in capturing object details. Compared with global features describing the whole shape of the object, local features recording fine details are more discriminative and are applicable to object classes with considerable similarity. In this paper, we propose a two-step classification approach based on point feature histograms and the bag-of-features method for automatic recognition of similar objects in mobile lidar point clouds. Lamp posts, street lights and traffic signs are grouped into one category in the first-step classification because they resemble one another far more than they resemble trees and vehicles. A finer classification of lamp posts, street lights and traffic signs, based on the result of the first-step classification, is carried out in the second step. The proposed two-step classification approach is shown to yield a considerable improvement over the conventional one-step classification approach.


INTRODUCTION
With the rapid development of 3D lidar technology, point clouds are becoming ever more widely used across various application fields, including building modeling, road mapping, road monitoring and, more recently, driverless car technology. Compared with traditional data acquisition methods in photogrammetry and remote sensing, lidar scanners are faster and generally more accurate in their collection of huge amounts of unstructured 3D point data. However, raw point clouds contain geometric information only, and extracting semantics such as object types is a challenging task. Considering the complexity of the data and the variety of object types, it is difficult and labour-intensive to extract objects from point clouds manually. Therefore, the development of methods for automated and efficient object recognition in point clouds is of considerable importance.
A common approach to object recognition in point clouds is supervised classification based on the geometric features of objects of interest. The types of features used in classification methods can be divided into two main categories: global features and local features (Bayramoglu and Alatan, 2010; Castellani et al., 2008). Global features, such as size and height, describe the overall shape of the object, whereas local features captured at key points characterize the object surface within local neighborhoods (Tangelder and Veltkamp, 2008). Although global features are useful for the recognition of objects with large intra-class variability (Lehtomäki et al., 2010a; Vosselman et al., 2004), they are not sufficiently discriminative for object classes which are similar, such as the different pole-like objects shown in Figure 1.
In contrast, local features capture the detail of objects, which makes it possible to distinguish similar objects with local differences. However, in order to achieve an acceptable classification accuracy, local features need to be encoded into a sufficiently discriminative high-dimensional feature vector. Consequently, a relatively large number of training samples is required to sufficiently train the classifier, which is a practical challenge in point cloud classification (Khoshelham and Oude Elberink, 2012). In the literature, similar objects have often been grouped into more general categories, such as pole-like objects (Rodríguez-Cuenca et al., 2015; Yokoyama et al., 2013). Poor classification accuracies have generally been reported for similar objects when they have not been grouped (Golovinskiy et al., 2009; Pu et al., 2011). In this paper, we investigate the problem of recognizing similar objects in point clouds by using a segment-based classification approach based on local features, namely point feature histograms (Rusu et al., 2008) encoded by the bag of features (Csurka et al., 2004) method. We propose a two-step classification approach to overcome training sample limitations. We experiment with different supervised classifiers to evaluate the performance of our method.
The paper is organized into five sections. Section 2 provides a review of the state of the art. In Section 3, the proposed classification method based on point feature histograms and bag of features is described. Section 4 discusses the results of classification experiments and, finally, conclusions are provided in Section 5.

LITERATURE REVIEW
In segment-based methods, individual points are grouped into segments, and segment features are used for classification (Golovinskiy et al., 2009; Khoshelham et al., 2013; Pu et al., 2011). Golovinskiy et al. (2009) considered multiple object classes in an urban area and reported low classification accuracy (60%) for some object classes. Velizhev et al. (2012) improved this workflow by using the Spin image and implicit shape model (ISM) and achieved a precision of 68% and 72% for cars and light poles, respectively. These were the only object classes considered in their experiment. Yang et al. (2015) achieved a good accuracy level for the extraction of urban objects based on segmentation of super-voxels, rather than individual points, using a set of rules defined for uniting separate segments. However, the design of rules and the setting of thresholds in the identification and classification of different object classes required manual interpretation and interaction based on the shape (geometric structure), height and width information of each object.
Detection of pole-like objects such as tree trunks, traffic signs, and light poles has been widely studied because of their distinctive structure (Cabo et al., 2014; Landa and Ondroušek, 2016; Lehtomäki et al., 2010b; Yokoyama et al., 2011). The 3D Hough transform combined with RANSAC has been shown by Vosselman et al. (2004) to work well in the detection of pole-like structures. In the research of Brenner (2009) (2015). Scan line information has also proved very useful in the automated detection of vertical pole-like structures in road environments (Lehtomäki et al., 2010a). Cabo et al. (2014) detected pole structures by the inner and outer radius after voxelization of points. While most of these methods assume that poles are vertical, the pairwise 3-D shape context method introduced by Yu et al. (2015) works independently of the pose of pole-like objects.
The classification of pole-like objects is mainly achieved by setting a series of thresholds for feature values (Aijazi et al., 2013; Li and Elberink, 2013; Masuda et al., 2013; Pu and Vosselman, 2009; Yang et al., 2015; Yokoyama et al., 2013). The shortcoming of these knowledge-based methods is that the thresholds have to be adjusted for different scenarios, and the classification accuracy depends strongly on the threshold setting. Supervised machine learning methods have been adopted for the classification of similar pole-like objects because they do not depend on threshold setting (Fukano and Masuda, 2015; Lai and Fox, 2009). However, the limitation of methods based on supervised machine learning is that they require a large number of training samples.
The limitation and imbalance of training samples for some object classes pose a significant challenge for the classification of point clouds. Khoshelham et al. (2013)

METHODOLOGY
The conceptual framework of the proposed approach is shown in Figure 2. It includes data pre-processing, feature description, and object classification. In the pre-processing phase, points on the ground and on building façades are removed. The remaining points are grouped into individual segments. These individual segments are manually labelled for the training and evaluation of the adopted classifier. Next, point feature histograms are computed as local features and encoded into high-level features using the bag-of-features method. Finally, the classification of the segments is performed in two steps.
Figure 2. The conceptual framework of the method consisting of data pre-processing, feature extraction and classification.

Pre-processing
In segment-based classification methods, the removal of the points on the ground, along with those on building facades, roofs, and fences, helps achieve a better segmentation result.

Figure 2 panels: Pre-processing (removal of ground and façade points; segmentation into components); Feature Description (feature extraction); Classification (classification of tree, vehicle and pole-like object; classification of poles into lamp post, street light and traffic sign).

Removal of Ground and Façade Points:
To remove the ground points, we use a variant of the progressive TIN densification algorithm (Axelsson, 2000) which is implemented in Lastools (1). The algorithm requires the setting of four parameters for the filtering of ground points, namely step, spike, offset and standard deviation. In the Lasground tool, suitable values for these parameters are recommended according to land cover type. In this paper, "city and warehouse" was selected as the land cover type, based on the data in this experiment. The building roofs and façades were removed manually in Cloud Compare (2).

Connected Component Segmentation:
After the filtering of ground and façade points, a connected component segmentation is applied to group the points into individual segments. The connected component segmentation is based on the assumption that points that are closer than a certain distance belong to one connected component. Different parameter settings were compared in terms of segmentation performance. After comparative experiments, the maximum distance between points and the minimum number of points per segment were set to 0.2 m and 500, respectively. To perform the connected component segmentation more efficiently, the point cloud is first restructured in an octree data structure. After the segmentation, every individual object should ideally be segmented as one single component. In this paper, we focus on the evaluation of classification performance and, therefore, over-segmentation and under-segmentation errors are manually removed at this stage.
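The grouping rule described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: it assumes points are plain (x, y, z) tuples, and it swaps the octree for a uniform grid of cells as a simpler spatial index. The function name `connected_components` and its parameters are hypothetical; the defaults mirror the 0.2 m and 500-point settings reported in the text.

```python
from collections import defaultdict, deque
from itertools import product
import math


def connected_components(points, max_dist=0.2, min_points=500):
    """Group point indices into segments of mutually reachable points.

    Two points are linked when their Euclidean distance is at most
    max_dist; a segment is a set of points connected by such links.
    Segments with fewer than min_points points are discarded.
    """
    def cell_of(p):
        # Cell side equals max_dist, so every neighbour within max_dist
        # lies in the same cell or in one of the 26 adjacent cells.
        return tuple(math.floor(c / max_dist) for c in p)

    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[cell_of(p)].append(idx)

    visited = [False] * len(points)
    segments = []
    for seed in range(len(points)):
        if visited[seed]:
            continue
        visited[seed] = True
        queue = deque([seed])
        segment = []
        while queue:  # breadth-first growth of one component
            i = queue.popleft()
            segment.append(i)
            cx, cy, cz = cell_of(points[i])
            for dx, dy, dz in product((-1, 0, 1), repeat=3):
                for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                    if not visited[j] and math.dist(points[i], points[j]) <= max_dist:
                        visited[j] = True
                        queue.append(j)
        if len(segment) >= min_points:
            segments.append(sorted(segment))
    return segments
```

The grid keeps the neighbour search local in the same spirit as the octree; for a real mobile lidar strip, a dedicated point cloud library would be the more practical choice.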

Feature Extraction:
To avoid the influence of occlusions and low point density, we extract local features at every point in each segment rather than at key points only. To extract local features, we use Point Feature Histograms (PFH), and to combine local features into segment features we use the bag of features method (Csurka et al., 2004). The PFH is a robust multi-dimensional feature descriptor that describes the local geometry around the surface points (Wahl et al., 2003). It has been demonstrated to be effective in labelling 3D points based on the type of surface they belong to, and it is very discriminative in classifying various geometric primitives (cylinder, plane, sphere, cone, torus, corner and edge) (Rusu et al., 2008).
In order to calculate the PFH, k neighbors of the query point are selected. A value of 6 was experimentally found suitable for k. For each pair of points (P_t, P_s) and their associated normals (n_t, n_s), a Darboux frame coordinate system is defined as shown in Figure 3. The axes of the coordinate system are defined as:

$$u = n_s, \qquad v = u \times \frac{P_t - P_s}{\lVert P_t - P_s \rVert_2}, \qquad w = u \times v \tag{1}$$

Figure 3. The Darboux frame coordinate system defined for calculating three angular features for a pair of points. (Reproduced from point cloud library (3).)

The angular features describing the difference between the two normals n_s and n_t are defined as:

$$\alpha = v \cdot n_t, \qquad \phi = u \cdot \frac{P_t - P_s}{d}, \qquad \theta = \arctan\left(w \cdot n_t,\; u \cdot n_t\right) \tag{2}$$

The distance d between points P_t and P_s is the Euclidean distance. After all the triplets <α, φ, θ> between each pair of points in the k-neighbourhood are computed, the set of all triplets at the query point is binned into a histogram, the PFH. In this process, the value range of each feature is divided into 5 subdivisions and the number of occurrences in each subinterval is counted. In order to avoid the overlapping of the values of these three features in each bin interval, a histogram with 5^3 = 125 bins in a fully correlated space is created.

(1) https://rapidlasso.com/lastools/
(2) http://www.danielgm.net/cc/
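As an illustration of the pairwise features and the 5 x 5 x 5 binning described above, the following sketch computes a PFH for one neighbourhood. It rests on several assumptions that are not stated in the text: the Darboux frame follows the definition commonly given in the Point Cloud Library documentation, the axis v is normalised, normals are unit vectors, α and φ are binned over [-1, 1] and θ over [-π, π], and each pair is taken in the order given rather than choosing a preferred source point. All function names are hypothetical.

```python
import math
from itertools import combinations


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def pair_features(p_s, n_s, p_t, n_t):
    """Return (alpha, phi, theta, d) for one pair, or None if degenerate."""
    diff = _sub(p_t, p_s)
    d = math.sqrt(_dot(diff, diff))
    if d == 0.0:
        return None
    direction = tuple(c / d for c in diff)
    u = n_s
    v = _cross(u, direction)
    v_norm = math.sqrt(_dot(v, v))
    if v_norm == 0.0:  # source normal parallel to the connecting line
        return None
    v = tuple(c / v_norm for c in v)  # assumption: v is normalised
    w = _cross(u, v)
    alpha = _dot(v, n_t)
    phi = _dot(u, direction)
    theta = math.atan2(_dot(w, n_t), _dot(u, n_t))
    return alpha, phi, theta, d


def bin_index(alpha, phi, theta, divisions=5):
    """Map one angular triplet to a single bin of the correlated histogram."""
    def step(value, low, high):
        k = int((value - low) / (high - low) * divisions)
        return min(max(k, 0), divisions - 1)

    a = step(alpha, -1.0, 1.0)
    p = step(phi, -1.0, 1.0)
    t = step(theta, -math.pi, math.pi)
    return a + divisions * p + divisions * divisions * t


def pfh(neighbourhood, divisions=5):
    """Histogram over all point pairs; neighbourhood is [(point, normal), ...]."""
    hist = [0] * divisions ** 3
    for (p_s, n_s), (p_t, n_t) in combinations(neighbourhood, 2):
        feats = pair_features(p_s, n_s, p_t, n_t)
        if feats is not None:
            hist[bin_index(feats[0], feats[1], feats[2], divisions)] += 1
    return hist
```

Combining the three bin steps into one index is what makes the histogram fully correlated: each of the 125 bins counts a joint occurrence of the three features rather than each feature separately.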

Feature Encoding:
Feature encoding by bag-of-features is performed by constructing a vocabulary of dominant local features directly from the data and creating a histogram of these features for each segment. The vocabulary is constructed with the k-means clustering algorithm (MacQueen, 1967). Each cluster center is the mean of a number of similar point feature histograms and represents a frequently appearing local surface characteristic. Once the vocabulary is constructed, a bag of features is created for each segment by counting the number of point feature histograms assigned to each cluster. The resulting histogram encodes the point feature histograms into a feature vector for the segment.
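The encoding step can be sketched as follows, under simplifying assumptions: descriptors are plain sequences of numbers, the k-means loop is a basic Lloyd iteration initialised with the first k descriptors (a deterministic choice made only to keep the example reproducible), and the names `build_vocabulary` and `encode_segment` are hypothetical.

```python
def nearest(centers, x):
    """Index of the cluster center closest to descriptor x (squared distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centers[c], x)))


def build_vocabulary(descriptors, k, iterations=20):
    """Cluster descriptors with a basic k-means loop and return the centers."""
    centers = [list(d) for d in descriptors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(centers, d)].append(d)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def encode_segment(segment_descriptors, centers):
    """Count how many descriptors of a segment fall into each cluster."""
    hist = [0] * len(centers)
    for d in segment_descriptors:
        hist[nearest(centers, d)] += 1
    return hist
```

With a vocabulary of 30 centers, every segment is reduced to a 30-element count vector regardless of how many points it contains, which is what allows segments of very different sizes to be fed to the same classifier.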

Classification
After all point feature histograms are encoded into segment features, a classifier can be trained using the manually labelled training samples. In order to classify the segments, we consider five object classes: vehicle, tree, lamp post, traffic sign and street light. As we will see, the application of the classifier in a single step will result in poor classification accuracies for similar object classes, i.e. lamp post, traffic sign, and street light, which are less abundant in the data, and are, therefore, represented with fewer training samples. To alleviate this problem, we propose a two-step classification approach. In the first step, a classifier is trained to classify the segments into three general classes: vehicle, tree and mixed pole. In the second step, a classifier is trained to classify the mixed poles into three specific classes: lamp post, traffic sign, and street light.
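The routing logic of the two-step scheme can be sketched as below. The two classifiers are represented by arbitrary callables that take a feature vector and return a label; these callables, the constant `POLE_CLASSES` and the function names are placeholders for illustration, not part of the described implementation.

```python
POLE_CLASSES = {"lamp post", "traffic sign", "street light"}


def to_first_step_label(label):
    """Collapse the three similar classes into the general 'mixed pole' class."""
    return "mixed pole" if label in POLE_CLASSES else label


def two_step_predict(features, first_step, second_step):
    """Apply the coarse classifier, then refine only the mixed-pole segments."""
    coarse = first_step(features)
    if coarse != "mixed pole":
        return coarse
    return second_step(features)
```

Training labels for the first classifier would be obtained by passing the five original labels through `to_first_step_label`, while the second classifier is trained only on segments whose original label is one of the three pole classes.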
We experiment with several classifiers: Gaussian support vector machines (SVM) (Andrew, 2000), random forest (Breiman, 2001), decision tree (Quinlan, 1986), and discriminant analysis (Klecka, 1980). SVM classifiers try to find the hyperplanes with the largest margin that separate one class from another. The performance of SVM is highly dependent on the kernel applied and the data to be classified. In our experiment, the Gaussian kernel is selected for its superior performance over the other two kernels. A decision tree makes a prediction by following the decisions in the tree from the root node down to a leaf node. Random forest is an ensemble method that aggregates the results of multiple weak classifiers, each trained on a bootstrap subset of the training set. The quadratic discriminant classifier computes a quadratic decision boundary between training samples of different categories. The parameters of the different models under each classifier were tuned by optimization experiments and k-fold cross validation.

Data Description
The experimental dataset was collected by the German company TopScan in December 2008 using an Optech Lynx Mobile Mapper system, the basic specifications of which are shown in  The result of the ground point removal is shown in Figure 6. After removal of building and ground points, further segmentation and labeling were conducted, as indicated in Figure 7. The lidar data collected in strips 4, 5, 6, 12, and 13 are used in this paper.
The training dataset was created by manually labeling the segments using an interface developed in Matlab, which allowed 3D viewing and rotation of each segment, as seen in Figure 8. The number of segments after pre-processing is described in Table 1.
A feature vector was generated for each segment by extracting point feature histograms and encoding these by the bag of features method. We used 125 bins for the PFHs and initially set the vocabulary size in the bag of features method to 30, resulting in a feature vector of length 30 for each segment.

Table 1. Number of labelled segments after segmentation and manual labeling.

One-step Classification
Using the labelled segments, each represented by a bag of features, we trained the Gaussian SVM, random forest, decision tree and quadratic discriminant classifiers. The scale of the Gaussian kernel in the Gaussian SVM was set to $1/\sqrt{P}$, where P is the number of features. In the decision tree, a value of 13 was experimentally found appropriate for the minimum leaf size, and the twoing rule (Steinberg, 2009) was adopted as the split criterion. In the random forest, the tree template had the same settings as the decision tree, and the number of learners was experimentally set to 100.
An experiment with different vocabulary sizes (i.e. the number of features per segment) was then conducted to test the performance of the classifiers. We set the lower and upper limits to 15 and 100 according to the number of labelled segments. The accuracy of the classifiers against vocabulary sizes 15, 30, 50 and 100 is shown in Figure 9. Usually, a larger vocabulary size provides higher discriminative power at the cost of an increase in both storage and processing time (Alonso et al., 2011). The result of the test with different vocabulary sizes reveals that the classifiers showed good performance at a relatively small vocabulary size of 30. Thus, in the following experiments, the vocabulary size is set to 30. Also, the quadratic discriminant classifier is not included in the following experiments due to its low accuracy at all vocabulary size settings.
The classifiers were then evaluated in a one-step classification using a five-fold cross-validation scheme. The training dataset was divided into five folds, and in five iterations the classifier was trained with four folds and tested with the remaining fold.
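A minimal sketch of such a five-fold scheme, together with per-class precision and recall computed on a held-out fold, is given below. How the folds were actually drawn is not stated in the text; the interleaved, unshuffled assignment used here is an assumption, and the function names are hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists; sample i is assigned to fold i mod k."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def precision_recall(true_labels, predicted_labels, cls):
    """Precision and recall of one class from paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Averaging the per-fold precision and recall of each class over the five iterations gives one pair of scores per class and classifier. With classes as unbalanced as those in this dataset, a stratified split that preserves class proportions in every fold would be a reasonable refinement.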
The average precision and recall over the five tests were used as performance measures for the classifiers. The result of the one-step classification is shown in Figure 10. As can be seen, vehicles and trees are classified with higher precision and recall than the other three classes. Lamp posts are classified with a low recall, and street lights are classified with a low precision. For the class traffic sign, all classifiers perform poorly, as both precision and recall values are below 30%. The number of correctly recognized items is recorded in Table 2.

Table 2. The number of correctly recognized items in the one-step classification method.
A possible reason why traffic signs have a low classification recall and precision is the limited number of samples compared to other categories. Another reason is the complexity and diversity of traffic sign designs within the dataset, and their similarity to lamp posts and street lights, as shown in Figure 11. In addition, the lamp posts and street lights in this dataset all have sign boards attached in some way, which makes the classification of light poles and traffic signs more complex and difficult. Light poles with attached sign boards can be seen in Figure 12.

Two-step Classification
As shown in Figure 10, the one-step classification accuracy for the street light, traffic sign, and lamp post classes is relatively low. Compared with tree and vehicle, these three categories all share a similar pole-like structure and have a low number of samples. In order to improve the classification result, a two-step classification is performed. In the first step, three general classes, i.e. tree, mixed class (lamp post, traffic sign, and street light), and vehicle, are classified. In the second step, the classifiers are first trained with the manually labelled segments, and then applied to those segments that were classified as mixed in the first step. The result of the first-step three-class classification in the hierarchical classification scheme is shown in Figure 13, and the number of correctly recognized items in this classification is recorded in Table 3. From the first step, we can see that SVM outperforms the other two classifiers. The result is consistent with the result of one-step classification. Thus, the classification result of each classifier in the first step is adopted as testing data in the second step, while the manually labelled segments are used as training data. In the decision tree, a minimum leaf size of 5 was experimentally found to be the best choice. The result of the second-step classification is shown in Figure 14. The number of correctly recognized items in the two classification steps is recorded in Table 4.

Table 4. The number of correctly recognized items in the second step.
The result of the two-step classification, obtained by combining the results of the two steps, is shown in Figure 15. From a comparison of the first and second steps of the two-step classification, we can see that SVM outperformed the ensemble (random forest) and decision tree methods in the first step, where tree, mixed, and vehicle are each very distinguishable. However, in the second step, where the items are more difficult to separate, the SVM yielded relatively poor performance compared with the two other classifiers. The two-step classification generated a better result than the one-step method, largely because the one-step classification could not handle an unbalanced dataset well when it also had to distinguish between basically similar objects belonging to different categories, i.e. traffic signs, lamp posts and street lights.

CONCLUSION
In this paper, the application of point feature histograms combined with the bag-of-features method for segment-based classification of mobile lidar point clouds was investigated. The proposed two-step classification approach for distinguishing similar objects with unbalanced data was shown to yield a significant improvement over the conventional one-step classification approach. The PFH-based bag-of-features method provides an effective representation of local surface characteristics of objects and therefore has the potential for the classification of point clouds into more specific object classes and object parts. In the future, other local features will be tested, combined with various encoding methods, to examine performance differences. Additionally, the proposed two-step method will be tested on more categories with intra-class variability, not only on pole-like objects.

Figure 1. Street light, lamp post, and traffic sign have global similarity but local differences.
and Weinmann et al. (2015) used feature selection methods to reduce the required number of training samples. Azadbakht et al. (2016) investigated different sampling strategies to overcome an imbalance in the distribution of training samples. In this paper, we propose a two-step classification method for distinguishing similar objects with insufficient and unbalanced training data.

Figure 4 .
The data was recorded over the city of Enschede, Netherlands. Two rotating scanning sensors were mounted on the top of the vehicle with scanning planes at 45-degree angles to the central lane, and perpendicular to each other. The vehicle drove at 50 km/h and scanned 20 km of road. The strip overview is shown in Figure 5.

Figure 5. Overview of the scanned strips from the experimental dataset.

Figure 6. Classification of ground and non-ground points. Ground points are in red and non-ground points in green.

Figure 8. Interface for manual labeling of the segments.

Label          Number
Tree           82
Lamp post      30
Traffic sign   19
Vehicle        182
Street light   52

Figure 9. Classification accuracy with different vocabulary sizes.

Figure 10. Recall and precision of one-step classification obtained from 5-fold cross validation with vocabulary size equal to 30.

Figure 11. Structural diversity of traffic signs.

Figure 12. Light poles are sometimes attached to sign boards.

Figure 13. Recall (a) and precision (b) values obtained from 5-fold cross validation in the first step.

Figure 14. Recall and precision values in the second step of two-step classification.

Figure 15. Recall and precision of two-step classification.

Table 3. Number of correctly recognized objects in the first step.