FEATURE RELEVANCE ANALYSIS FOR 3D POINT CLOUD CLASSIFICATION USING DEEP LEARNING

ABSTRACT: 3D point clouds acquired by laser scanning and other techniques are difficult to interpret because of their irregular structure. To make sense of this data and to allow for the derivation of useful information, a segmentation of the points into groups, units, or classes fit for the specific use case is required. In this paper, we present a non-end-to-end deep learning classifier for 3D point clouds using multiple sets of input features and compare it with an implementation of the state-of-the-art deep learning framework PointNet++. We first start by extracting features derived from the local normal vector (normal vectors, eigenvalues, and eigenvectors) from the point cloud, and study the result of classification for different local search radii. We extract additional features related to spatial point distribution and use them together with the normal vector-based features. We find that the classification accuracy improves by up to 33% as we include normal vector features with multiple search radii and features related to spatial point distribution. Our method achieves a mean Intersection over Union (mIoU) of 94%, outperforming PointNet++'s Multi Scale Grouping by up to 12%. The study presents the importance of multiple search radii for different point cloud features for classification in an urban 3D point cloud scene acquired by terrestrial laser scanning.


INTRODUCTION
3D point clouds are sets of point coordinates (x, y, z) in space which can be acquired by Light Detection and Ranging (LiDAR), or by Dense Image Matching (DIM) using multiple images taken from different points of view. Laser scanning uses LiDAR measurements to record vast numbers of points in a geographic scene that represent the geometry and textural information of objects (Otepka et al., 2013). It can even capture (micro-)topography of the surface that might be occluded by vegetation and is a powerful remote sensing technique that has been used, for example, in generating 3D models of cities (Rebecca et al., 2008), for flood and glacier modelling (Telling et al., 2017), sand dune modelling (Dong, 2015), and extraction of vegetation parameters (Rosette et al., 2009).
For extraction of information from a point cloud containing different objects, typically a classification step is required. By classification, we refer to the assignment of a semantic label to each point in the point cloud. For example, if vegetation should be separated from other classes, we can use a label (e.g., the numeric value 1) for vegetation points while assigning a different label to other classes (e.g., 0). The assignment of labels depends on the type of information we want to retrieve. In this paper, we assign labels to points by the semantic object class they represent. In the context of an urban scene, typical classes are Building, Vegetation, Natural terrain, Man-made terrain, Water, Vehicle, and Scanning artefact (Noise).
After classification, the point cloud can serve as a data basis for analyses that may answer specific questions, such as the separation of natural and man-made terrain in urban planning tasks. With recent advances in technology, Mobile Laser Scanning (MLS) has emerged to scan large areas, such as entire cities (Balado et al., 2018). The 3D model produced from such scanning may, for example, serve as an input map to autonomous systems for navigation (Whitty et al., 2010; Yue et al., 2018), or automatic inspection of power transmission lines (Qin et al., 2018). For such autonomous systems, the classification is important to recognize the type of object represented by the point cloud. Thus, automatic classification of point clouds is crucial to achieve a higher rate of automation and is therefore a frontier research topic.
Several traditional machine learning classification algorithms such as Support Vector Machine (SVM), Decision Trees and Random Forest are commonly applied to 3D point cloud data. However, the capacity of these models to learn is limited and they cannot extract additional features from the training data other than the ones obtained beforehand and input to the model. Neural networks (Rumelhart et al., 1986) have been around for many years, but have recently evolved into deep neural networks because of increased computational capacities and larger amounts of data readily available for training. Such deep neural networks can model highly varying and complex functions to learn the mapping between the input and the output (Brahma et al., 2016). Deep learning methods have recently been shown to be successful in various real-world problems such as image classification, object detection, image captioning, language translation, and text classification in computer vision and natural language processing (LeCun et al., 2015).
There are two ways of classification using deep learning techniques. One approach learns directly from the raw data, which we call end-to-end learning. The other approach is to first extract features from the data which are input to the neural network for learning. We call this approach non-end-to-end learning. Attempts have been made for the classification of 3D point clouds using both approaches, for example, by classifying 3D point clouds on a voxel grid as in VoxNet (Maturana and Scherer, 2015), or in a graph-based approach as in SEGCloud (Tchapmi et al., 2017). These methods are non-end-to-end as they do not work directly on the primary point cloud data when classifying it through neural networks. Recently, end-to-end methods such as PointNet (Qi et al., 2017a) and PointNet++ (Qi et al., 2017b) have been developed that work directly on the 3D coordinates of the point cloud for classification. A drawback of end-to-end methods is that larger amounts of data are required for training than for non-end-to-end methods, and the features used for classification by the network are difficult to interpret. Few attempts have been made to classify 3D point clouds by first extracting geometrical and statistical features from the point cloud before classifying the points using a neural network (Zhang et al., 2018). However, the study by Zhang et al. (2018) does not give an interpretation of how different combinations of extracted features affect the result of the neural network-based classification. Furthermore, they only used the normal vector and height above ground as geometrical and statistical features.
In our research, we investigate the neural network-based classification of a terrestrial laser scanning point cloud by extracting meaningful point cloud features. We examine (i) how different combinations of features affect the result of classification, (ii) which sets of features are important for the classification of different class types, and (iii) compare our results to the deep learning framework PointNet++ (cf. Winiwarter and Mandlburger, 2019). The advantage of our approach is that it does not require a large amount of data to train the neural network, as we extract useful features from our geographic knowledge of the point cloud scene. From the results in this study, however, we cannot interpret the effects of distance from the scanner on the classification, which can have an impact on the classification results, as objects closer to the sensor have a higher local point density than those farther away from the sensor (Kaasalainen et al., 2011).

Dataset
For our research, we use the Semantic3D dataset (Hackel et al., 2017), consisting of around four billion points acquired by terrestrial laser scanning in 30 scenes of diverse urban places, including market squares and streets. Of this dataset, we select the Bildstein scene and use the first scan ('bildstein1') only. For each 3D point coordinate in the Semantic3D dataset, the laser return intensity (I), as well as Red (R), Green (G), and Blue (B) values corresponding to the true colors of the scene are provided. The data is classified according to the types in Tab. 1.

Extraction of Point Cloud Features
The features we use in this paper are established in the 3D point cloud domain and commonly used for 3D point cloud classification. The surface normals of points are estimated based on the local neighborhood in the point cloud via plane fitting, which is solved with an eigenvalue decomposition of the covariance matrix (structure tensor) of the points. The normal vector features (NF) include the normal vector, eigenvalues (Weinmann et al., 2013), eigenvectors and quality of the plane fit derived from residual point distances to the plane (Dorninger and Nothegger, 2007). The feature Echo Ratio (ER) (Höfle et al., 2009) is a measure of the penetrability of the objects represented by the points. Intensity (I) is the strength of the laser signal scattered back from an object (Höfle and Pfeifer, 2007).
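As an illustration of this decomposition, the normal vector features of one local neighborhood can be computed as follows. This is a minimal NumPy sketch, not the OPALS implementation used in this study:

```python
import numpy as np

def normal_features(neighbors):
    """Normal vector features (NF) of one local neighborhood: the normal
    vector, the eigenvalues of the covariance matrix (structure tensor),
    and the plane-fit quality from the residual point-to-plane distances."""
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered / len(neighbors)   # 3x3 structure tensor
    eigval, eigvec = np.linalg.eigh(cov)           # eigenvalues in ascending order
    normal = eigvec[:, 0]                          # smallest eigenvalue -> plane normal
    residuals = centered @ normal                  # signed distances to fitted plane
    quality = np.sqrt(np.mean(residuals ** 2))     # RMS residual as fit quality
    return normal, eigval, quality
```

The eigenvector belonging to the smallest eigenvalue of the structure tensor is the estimated surface normal, and the RMS of the point-to-plane residuals serves as the quality measure of the plane fit.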

Features extracted for different search radii
We derive further features using Neighborhood Statistics (NStat) in the local 2D neighborhood of points, namely the range, normalized height, variance, rank, points/volume, and openness. Range is the vertical distance from the highest to the lowest z value in the 2D neighborhood. Normalized height is the relative height of the current point's z value above the minimum z value in its 2D neighborhood. Variance describes the variance of all z values in the local 2D neighborhood. The rank of a point represents its relative vertical position within its 2D neighborhood and accordingly is 100% if the point is a local maximum and 0% if the point is a local minimum. Points/volume describes the point count per unit volume of a cylinder of the neighborhood radius. Openness is defined as the degree of dominance or enclosure of a location on an irregular surface (Yokoyama et al., 2002).
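A minimal sketch of these neighborhood statistics, given only the z values of the points inside a cylindrical 2D neighborhood. Openness is omitted, and the cylinder height for the points/volume feature is assumed to equal the local z range, which is an illustrative assumption:

```python
import numpy as np

def neighborhood_stats(z_point, z_neigh, radius):
    """Neighborhood statistics (NStat) for one point, given the z values of
    all points inside its cylindrical 2D neighborhood of the given radius.
    Openness is omitted; see Yokoyama et al. (2002) for its definition."""
    z_min, z_max = z_neigh.min(), z_neigh.max()
    height = z_max - z_min if z_max > z_min else 1.0   # assumed cylinder height
    below = (z_neigh < z_point).sum()                  # neighbors strictly below
    return {
        "range": z_max - z_min,                        # vertical extent
        "norm_height": z_point - z_min,                # height above local minimum
        "variance": z_neigh.var(),                     # variance of z values
        "rank": below / max(len(z_neigh) - 1, 1),      # 1.0 = local max, 0.0 = local min
        "pts_per_volume": len(z_neigh) / (np.pi * radius ** 2 * height),
    }
```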
We calculate the described features on the point cloud data for different search radii as shown in Tab. 2. For the calculation of the point cloud features, we use the point cloud processing software OPALS (Pfeifer et al., 2014).

Preparation of Training Dataset
We calculate the features in neighborhoods of different search radii, which we denote as a subscript to the name of the feature set. For example, (NF)1,5 means the normal vector, eigenvalues, eigenvectors, and quality of the plane fit are calculated for search radii of 1 cm and 5 cm, respectively. Additionally, we use the intensity value for each point. We do not use the RGB values from the dataset, as our objective is to study the classification only from features derived directly from the LiDAR point clouds.
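Conceptually, a multi-radius feature set concatenates, per point, the features computed at each search radius. A simplified sketch using a k-d tree; the fallback to the three nearest neighbors for sparse neighborhoods is an illustrative assumption, and the study itself used OPALS for feature calculation:

```python
import numpy as np
from scipy.spatial import cKDTree

def multi_radius_features(points, feature_fn, radii):
    """Concatenate per-point features computed at several search radii,
    mirroring the subscript notation (NF)1,5,... `feature_fn` maps a
    neighborhood (n, 3) array to a fixed-length feature vector."""
    tree = cKDTree(points)
    rows = []
    for p in points:
        feats = []
        for r in radii:
            idx = tree.query_ball_point(p, r)
            if len(idx) < 3:                 # too sparse: fall back to 3 nearest
                _, idx = tree.query(p, k=3)
            feats.append(feature_fn(points[idx]))
        rows.append(np.concatenate(feats))
    return np.vstack(rows)
```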
It is important to note that the number of points per class in the training dataset for experiment 1 is highly imbalanced, as shown in the histogram of point numbers per class in the training set (Fig. 1). The balance is improved in the training dataset of experiment 2, which was created with the alternative approach described in section 2.1.

Training the Neural Network
After calculating features for the training datasets, we input them to a fully connected neural network. For meaningful comparisons of different sets of features, we use a fixed strategy for defining the neural network architecture for all sets of features. The fixed hyperparameters for experiments 1 and 2 are listed in Tab. 4.
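A minimal forward-pass sketch of such a fully connected network; the layer widths, ReLU activations, and random initialization here are illustrative placeholders, not the hyperparameters of Tab. 4:

```python
import numpy as np

def make_mlp(n_features, n_classes, hidden=(64, 64, 64), seed=0):
    """Forward pass of a fully connected network with ReLU hidden layers
    and a softmax output over the classes (randomly initialized weights)."""
    rng = np.random.default_rng(seed)
    sizes = [n_features, *hidden, n_classes]
    params = [(rng.normal(0.0, np.sqrt(2.0 / a), (a, b)), np.zeros(b))
              for a, b in zip(sizes[:-1], sizes[1:])]

    def forward(x):
        for i, (W, b) in enumerate(params):
            x = x @ W + b
            if i < len(params) - 1:
                x = np.maximum(x, 0.0)                 # ReLU on hidden layers
        e = np.exp(x - x.max(axis=1, keepdims=True))   # numerically stable softmax
        return e / e.sum(axis=1, keepdims=True)

    return forward
```

Each point's feature vector is fed through the network, and the softmax output assigns a probability to each semantic class.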
Since we aim to achieve a correct classification for all classes and the dataset is highly imbalanced, we chose Intersection over Union (IoU, Eq. 1), which is a per-class measure, as evaluation metric. The evaluation measure per class i is calculated as

IoU_i = c_ii / (c_ii + Σ_{j≠i} c_ij + Σ_{k≠i} c_ki)    (1)

where c_ii is the number of samples from the ground-truth class i predicted as class i in the confusion matrix, c_ij is the number of samples from the ground-truth class i predicted as class j in the confusion matrix, and c_ki is the number of samples from the ground-truth class k predicted as class i in the confusion matrix.
To quantify overall performance, we use the mean IoU (mIoU) over all classes without weighting. The same evaluation metric is used by the Semantic3D benchmark (Hackel et al., 2017). We study sets of different feature combinations and their effect on the classification result by training a neural network. The results of our study are presented in the subsequent section.
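Starting from a confusion matrix with ground-truth classes in rows and predicted classes in columns, the per-class IoU and the unweighted mIoU can be sketched as:

```python
import numpy as np

def per_class_iou(conf):
    """Per-class IoU from a confusion matrix with ground-truth classes in
    rows and predicted classes in columns."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                                   # correctly classified samples
    denom = conf.sum(axis=1) + conf.sum(axis=0) - tp     # union of GT and prediction
    return np.divide(tp, denom, out=np.zeros_like(tp), where=denom > 0)

def mean_iou(conf):
    """Unweighted mean over the per-class IoUs."""
    return per_class_iou(conf).mean()
```

Because every class contributes equally to the mean, rare classes such as scanning artefacts influence the mIoU as much as frequent ones, which is the rationale for this metric on an imbalanced dataset.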
Table 4. Hyperparameters of the Neural Network

PointNet++
As a state-of-the-art baseline, we employ PointNet++. Since it automatically extracts features from the neighborhood, we assume this comparison accurately depicts the quality of our feature choices. However, PointNet++ does not calculate the full neighborhood representation for every point, but employs two different grouping techniques: Multi-Resolution Grouping (MRG) and Multi-Scale Grouping (MSG). Because of the smaller memory footprint, we used MSG (Qi et al., 2017b).
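The grouping idea can be sketched as follows. This is an illustrative simplification: centroids are chosen by farthest point sampling and neighbor groups are gathered per radius, while the point networks that abstract each group into a feature vector in PointNet++ are omitted:

```python
import numpy as np
from scipy.spatial import cKDTree

def farthest_point_sample(points, m, seed=0):
    """Greedily pick m well-spread points (farthest point sampling)."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(points)))]
    dist = np.linalg.norm(points - points[idx[0]], axis=1)
    for _ in range(m - 1):
        idx.append(int(dist.argmax()))     # point farthest from the chosen set
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[-1]], axis=1))
    return np.array(idx)

def msg_grouping(points, m, radii):
    """Gather, at each of m sampled centroids, the neighbor indices for
    every search radius; PointNet++ then abstracts each group into a
    feature vector with a small point network (omitted here)."""
    centers = points[farthest_point_sample(points, m)]
    tree = cKDTree(points)
    return [{r: tree.query_ball_point(c, r) for r in radii} for c in centers]
```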
In MSG, the whole point cloud is only used for feature extraction at the smallest search radius (e.g., 25 cm). This information is gathered at a subset of the total points. For the next radius (e.g., 50 cm), only this subset is used for the extraction. The subset sizes and search radii used in our study are presented in Tab. 5. This inevitably leads to a loss of information, but allows the network to consider larger neighborhoods at all, which would not be possible without subsampling due to high memory requirements (Winiwarter and Mandlburger, 2019).

RESULTS

Experiment 1
We use the hyperparameters shown in Tab. 4 for training the neural network. In Tab. 6, we present the results corresponding to different combinations of features for experiment 1. The results shown here are for the test dataset, and the IoU as well as the mean of IoUs (mIoU) scores have been rounded to two digits for readability.

Experiment 2
For experiment 2, we use L2 regularization in all layers and also reduce the number of hidden layers to three as compared to experiment 1. The architecture of PN was unchanged. The results of the classification show that when we consider feature set A as input to the neural network, it results in a low IoU, especially in the classes 4 (Low vegetation) and 7 (Scanning artefact). This is due to the features being calculated based on only a single search radius. Thus, they cannot take into account how the same feature will vary depending on the size of the object as we adapt the search radius. The difference in the IoU of class 3 (High vegetation) and class 4 (Low vegetation) can be attributed to the fact that these classes have similar properties and are only distinguished based on a threshold criterion of height. As the number of examples in class 3 is higher than in class 4 (cf. Fig. 1), this resulted in a low IoU for class 4, as most of the points of class 4 were classified as class 3, as shown by the normalized confusion matrix in Tab. 7.
Class 7, representing artefacts/noise, is difficult to classify using only one search radius for feature calculation. By calculating NF at multiple search radii, we are able to strongly increase the classification accuracy. This results in an average increase of 33% with respect to the IoU for each class, with an increase in IoU for class 4 (Low vegetation) of 261%, and for class 7 (Scanning artefact) of 34%. This increase can be attributed to the fact that normal vector-based features for different search radii extract information from the point cloud at different scales, representing objects of different sizes. The pattern of variation, however, is similar for the same type of object. For example, in the case of class 6 (Hard scape), the change in the normal vector features is expected to be small when increasing the search radius from 1 cm to 200 cm. However, in the case of class 8 (Car), the NF will change as the search radius increases, since a search radius of 200 cm, for example, might include points from around the object itself. Since there will be a distinct pattern of NF derived for multiple search radii for each object, it can be recognized and learned by the neural network, effectively improving the classification result.

CONCLUSION
In this paper, we present a method for the classification of 3D point clouds using deep learning techniques by manually extracting features from the point cloud and then feeding them through a Fully Connected Neural Network.We show that multiple search radii for normal vector features improve the result of classification compared to considering only a single search radius.
The result further improves when we include features derived from neighborhood statistics of the point cloud. We can see that even though our training dataset is highly imbalanced in experiment 1, we achieve comparable classification accuracies for all classes. With only 37% of the data as training set in experiment 2, we find that our model generalizes well on the test set. The state-of-the-art deep learning framework PointNet++ in experiment 1 performs equally well as our model except for classes 4 and 7, two classes with exceptionally low frequency in the dataset. Similar results are observed in experiment 2, except for the classification of class 1, where PointNet++ performs even worse than feature set A. For the overall classification of point clouds, we can conclude that features extracted with multiple search radii, either manually or using the end-to-end deep learning framework PointNet++, greatly improve the classification result. An interesting continuation of this work would be to investigate the further generalization of the model for similar objects. For example, it could be examined whether the model can also learn to classify other vehicles, such as a minibus or truck, as car, even though they were not part of the training set.
For the neural network architecture in experiment 1 and 2, we can use Automated Machine Learning (AutoML) techniques such as Adanet (Cortes et al., 2017) for the best neural network architecture search as part of further research.
Further, the ground truth data we use is labelled by humans and always contains some human error associated with the labelling process. An approach to overcome this could be to use synthetic data generated by LiDAR simulation, for example using HELIOS (Bechtold and Höfle, 2016). Such data will have no error in the ground truth examples, and it is further possible to augment training examples by scanning a scene from different viewpoints. Using simulated data, the training examples can easily be balanced over the classes, improving the classification result of the rare classes. Furthermore, using different search radii not only for the normal vector-based features, but also for the neighborhood statistics may give better results.

Figure 1. Histogram showing the number of examples in the training set

We prepare the training set by combining different features. The feature sets are listed in Tab. 3 with IDs, to which we subsequently refer. For instance, {(NF)1, (ER)5, (I)} represents one feature set of training data.

ID  Features
A   {(NF)1, (ER)5, (I)}
B   {(NF)1,5,10,50,100,200, (ER)5, (I)}
C   {(NF)1,5,10,50,100,200, (ER)5, (NStat)50, (I)}
PN  PointNet++ Multi Scale Grouping using search radii of 25, 50, 100 and 200 cm

Table 3. Different sets of features used in the experiment

Table 6. Classification results of the test dataset for different combinations of features

From the classification results, we observe that as we increase the number of features, the IoU increases for each class. It is evident that when we use feature set A for training the neural network, class 4 (Low vegetation) and class 7 (Scanning artefact) are classified with low IoUs of 0.21 and 0.52, which results in a low mIoU of 0.69. When including normal vector features derived from multiple search radii as in feature set B, the IoU increases for all classes. The increase is highest in the case of class 4 (Low vegetation) and class 7 (Scanning artefact). To also consider the height of points in their local neighborhood, we include normalized height, as well as other features derived from point cloud statistics, as in the case of feature set C. Although this does not affect the IoU of class 4 (Low vegetation), the IoU of class 7 (Scanning artefact) increases from 0.78 to 0.87. This increase is because, without the additional information from the neighborhood statistics, many points of class 7 (Scanning artefact) were misclassified as class 5 (Building) and class 6 (Hard scape) with feature set B.

Figure 2. Ground truth labels of the test dataset

Classification using the PointNet++ Multi Scale Grouping framework results in a mIoU of 0.84, compared to 0.94 for feature set C. PN achieves a higher mIoU than feature set A, where only a single search radius is taken into account. The performance of PN is good for all classes except class 4 and class 7, where the number of training examples per class is low compared to other classes. The ground truth test dataset is visualized in Fig. 2. The classification results for feature set C and feature set A are shown in Fig. 3.
In Tab. 10, we present the results for feature IDs A, C and PN. The classification results for feature set C and feature set A are shown in Fig. 4. From Tab. 10, we can see that the IoUs for class 1 and class 3 have strongly improved with feature set C, resulting in an overall increase in the mIoU by 31% compared to feature set A. Comparing PN and C to A shows that the improvement in class 3 is similar, but PN does not improve as much as C in class 1, in fact performing worse than feature set A. Most misclassification occurs in class 5 (Building) and class 6 (Hard scape) with all three feature sets. They are emphasized by the arrows marked with (3) in Fig. 4. The normalized confusion matrix in Tab. 11 shows 33.93% misclassification of class 5 into class 6, and 26.29% misclassification of class 6 into class 5 using feature set C.

Figure 4. Classification by network with arrows showing the locations of misclassification for (a) feature set C and (b) feature set A

The arrow marked with (2) in Fig. 4b shows the misclassification of class 3 (High vegetation), for which almost all points are predicted as class 5 (Building). This misclassification is largely reduced with feature set C, as shown by the arrows marked with (2) in Fig. 4a. Similar to experiment 1, the misclassification at locations of geometry change is reduced when we consider NF with multiple search radii and NStat, as shown by the arrows marked with (1) in Fig. 4a and Fig. 4b. An occurrence of misclassification of class 2 (Natural terrain) into class 6 (Hard scape) is shown by the arrows marked with (4) in Fig. 4. From the normalized confusion matrix, it can be seen that 4.10% of the points from class 2 were predicted as class 6.

Table 5. Number of points per PointNet++ level, search radii and neighbor count.
The parameters were selected to best represent the search radii used in feature set C, but may be tuned to achieve better results.

Table 10. Classification results of the test dataset for different combinations of features

The results of experiment 2 are consistent with those of experiment 1 (Winiwarter and Mandlburger, 2019). The mIoU for experiment 2 is lower than in experiment 1 because we used only 37% of the points in the training set, compared to 72% in experiment 1, and, more importantly, the points in the training and test set are not correlated. [...] (Low vegetation), of which more than 3% (cf. Tab. 9) are classified as Hard scape. Class 7 points (Scanning artefact) are also misclassified as class 1 (Man-made terrain) in more than 7% of the test cases, leading to a lower mIoU. Since artefacts do not have a clear geometric signature and make up only 0.7% of the dataset, they cannot be effectively learned by the end-to-end approach of PointNet++, which would need more training data (Winiwarter and Mandlburger, 2019).