TRANSFER LEARNING FOR INDOOR OBJECT CLASSIFICATION: FROM IMAGES TO POINT CLOUDS

Indoor furniture is of great relevance to building occupants in everyday life. Furniture occupies space in the building, gives comfort, establishes order in rooms, and indicates where services and activities are located. Furniture is not always static; rooms can be reorganized according to current needs. Keeping building models up to date with the current furniture is key to working with indoor environments. Laser scanning technology can acquire indoor environments in a fast and precise way, and recent artificial intelligence techniques can correctly classify the objects they contain. The objective of this work is to study how to minimize the use of point cloud samples, which are tedious to label, in neural network training, replacing them with images obtained from online sources. For this, point clouds are converted to images by means of rotations and projections. The conversion of 3D vector data to a 2D raster allows the use of Convolutional Neural Networks, the generation of several images for each acquired point cloud object, and the combination with images obtained from online sources, such as Google Images. The images have been distributed among the training, validation, and testing sets following different percentages. The results show that, although point cloud images cannot be completely dispensed with in the training set, only 10% of them is enough to achieve high classification accuracy.


INTRODUCTION
Furniture is a key element of indoor environments. These objects allow people and autonomous robots to interact with buildings, locate services and tools, and recognize spaces based on the type of objects they contain. Some models, such as the CityGML standard at its highest level of detail (Biljecki et al., 2016), integrate objects within buildings to record the space occupied and the services available. Indoor environments also change: rooms are usually reorganized and adapted to current needs. Therefore, it is essential to provide methods to acquire and map these objects quickly, minimizing manual intervention.
Indoor laser scanning technology has evolved significantly in recent years. The platforms on which the laser scanner is mounted have diversified into trolleys (Chen et al., 2019), backpacks (Rönnholm et al., 2015), hand-held devices (Maboudi et al., 2017), mixed reality devices (Khoshelham et al., 2019), robots (Frías et al., 2019), etc. These variants allow indoor environments to be acquired more quickly than with conventional Terrestrial Laser Scanning, thus obtaining more data. However, these data are often not enough and must be labelled if Deep Learning (DL) technologies are to be applied. Therefore, the task of acquiring and labelling samples is a time-consuming manual process. Although there are datasets with labelled indoor point clouds (Uy et al., 2019), these data do not always match the user's needs, or the number of samples is too low to employ certain techniques.
The objective of this work is to evaluate the use of images of indoor objects to minimize the number of point clouds needed to train Convolutional Neural Networks (CNN). Images are easier and faster to obtain and label than point clouds, and objects maintain a clear visual relation between both representations. Different training sessions are carried out, varying the percentage of images obtained from online sources and images generated from point clouds.
The rest of this paper is organized as follows. Section 2 reviews related work on object classification with Machine Learning (ML) techniques. Section 3 presents an overview of the designed method. Section 4 analyses the results. Finally, Section 5 concludes this work.

RELATED WORK
Object classification is a well-studied topic, both in point clouds and in images. Many object classification techniques can be applied indoors and outdoors interchangeably (Balado et al., 2020). Objects in point clouds can be classified with ML techniques by extracting features, by converting point clouds into 2D or 3D images, or by using point cloud-based neural networks.
ML techniques need a low number of samples for training compared with DL techniques. ML techniques must be designed to extract the most relevant features of point cloud objects. The choice of features is a design decision, so depending on the designer's knowledge, relevant features can be lost and less relevant ones can be added. A tendency in the use of these techniques is to extract all available features and let the classifier detect those that are relevant. ML classifiers, such as SVM, Random Forest, decision trees, etc., obtain good results in non-complex problems, with low computational cost and little time spent in dataset generation. Lai and Fox (2010) extract features from Google's 3D Warehouse to obtain more data samples. Roynard et al. (2016) use 991 features to train a Random Forest classifier. Oesau et al. (2016) transform point cloud objects into histograms via planar abstraction.
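As an illustration of this family of techniques, the sketch below extracts a small set of geometric features from a point cloud object and trains a Random Forest with scikit-learn. The feature set is hypothetical, chosen only for illustration; it is not the feature set used by the cited works.

```python
# Hypothetical sketch of a feature-based ML pipeline: each object is
# reduced to a small geometric descriptor and classified with a Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def geometric_features(points):
    """Reduce an Nx3 point cloud object to a fixed-length feature vector."""
    extent = points.max(axis=0) - points.min(axis=0)   # bounding box size
    cov = np.cov(points, rowvar=False)                 # 3x3 covariance matrix
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # shape eigenvalues
    eigvals = eigvals / (eigvals.sum() + 1e-9)         # normalized
    return np.hstack([extent, eigvals])                # 6 features per object

def train_classifier(objects, labels):
    """objects: list of Nx3 arrays; labels: class names, e.g. 'chair'."""
    X = np.vstack([geometric_features(p) for p in objects])
    return RandomForestClassifier(n_estimators=100).fit(X, labels)
```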
Object classification in images with 2D-CNNs is one of the most widespread research lines today. A wide variety of network architectures is available; implementation is quick and does not require a deep understanding of the problem to be addressed. When generating 2D samples from 3D data, data augmentation with object rotations can be applied, significantly reducing the number of acquired objects required for training (Tchapmi et al., 2017). The main drawback is that the 3D-to-2D conversion loses one data dimension. To mitigate this, some authors use orthogonal sections of the object (Gomez-Donoso et al., 2017), while others transform the cloud into depth images (Pang and Neumann, 2016).
The first network to address the classification problem directly in 3D was VoxNet (Maturana and Scherer, 2015). This network uses 32x32x32 voxels as input, so the point cloud must be structured into a 3D image. The main problems when adapting vector data to 32 levels in each dimension are the loss of resolution and the generation of empty voxels. In addition, some authors consider that 2D-CNNs with multi-views obtain better results than these 3D-CNNs (Griffiths and Boehm, 2019; Qi et al., 2016b).
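As a minimal sketch of that structuring step (a plain occupancy grid, assuming NumPy arrays; VoxNet's own occupancy estimation is more elaborate), the following also illustrates the empty-voxel problem:

```python
# Minimal sketch of the voxelization a 3D-CNN such as VoxNet requires:
# an unstructured Nx3 point cloud is binned into a 32x32x32 occupancy grid.
import numpy as np

def voxelize(points, grid=32):
    """Convert an Nx3 point cloud into a grid x grid x grid occupancy volume."""
    mins = points.min(axis=0)
    extent = points.max(axis=0) - mins
    # Scale coordinates into [0, grid) and clamp boundary points.
    idx = ((points - mins) / (extent + 1e-9) * grid).astype(int)
    idx = np.clip(idx, 0, grid - 1)
    volume = np.zeros((grid, grid, grid), dtype=np.float32)
    volume[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0  # occupied voxels
    return volume  # unoccupied voxels stay 0, illustrating the sparsity issue
```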
Recently, some authors have designed network architectures that use point clouds directly as input. These architectures are based on spatial relationships (Qi et al., 2017, 2016a) and graph theory (Feng et al., 2019; Wang et al., 2018). The strong point of these networks is that no information is lost through the conversion of the point cloud to other formats. Their weak point is that they incur a much higher computational cost than the alternatives. Garcia-Garcia et al. (2016) train PointNet with CAD models of objects to classify them, and Wu et al. (2019) also work directly on point clouds.
With regard to the mentioned works, briefly compared in Table 1, the method presented in this paper opts for the conversion of point clouds to images in order to use a 2D-CNN. This decision is based on the following reasons: (1) the shape of the object, one of the most relevant factors for classification, is preserved; (2) data augmentation can be applied, generating multiple samples per object; (3) computation time and cost are reduced compared to 3D techniques; (4) existing 2D networks are better optimized than their 3D equivalents and than manual feature extraction techniques; (5) point cloud images can be combined with images obtained from online sources.

METHOD
The classification is based on images downloaded from online sources and images generated from point clouds (hereinafter called point cloud images). Depending on the number of samples per class, multi-view data augmentation is applied to obtain enough samples to carry out the training and assess the behavior of the algorithm. The samples are then distributed among the training, validation, and testing sets (Figure 1). In this section, the generation of images from point clouds, the selection of the CNN and the adaptation of the images are explained.

Image generation from point clouds
The input data are individualized point clouds of objects $P = [X\ Y\ Z\ R\ G\ B]$, where the first three columns are the 3D coordinates and the last three contain the color information. The conversion from point clouds to images is done through an isometric projection. The point cloud is projected onto a plane, which can be visualized and saved as an image. In pixels where more than one point is projected, the assigned color is the average color of the corresponding points. White is assigned to pixels without points. A point cloud rasterization (Balado et al., 2017) is not necessary, since the aspect ratio is not maintained when adapting the images to the CNN input.
If it is necessary to rotate the point cloud P to generate multiple views of the same point cloud object (data augmentation), a rotation is executed with an angle resolution r on the Z axis. Equation 1 shows the rotation matrix on the Z axis for a step i of angle r:

$$R_Z(i \cdot r) = \begin{bmatrix} \cos(i \cdot r) & -\sin(i \cdot r) & 0 \\ \sin(i \cdot r) & \cos(i \cdot r) & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (1)$$

The number of rotations coincides with the number of final images per object. In this way, multiple images can be created per object, as long as the angle r ensures that images of the same object are sufficiently distinct.
For the visualization of the object in isometric projection, and after rotating the object if multi-view generation is necessary, a rotation of 30 degrees is executed on the Y axis according to Equation 2:

$$R_Y(30^{\circ}) = \begin{bmatrix} \cos 30^{\circ} & 0 & \sin 30^{\circ} \\ 0 & 1 & 0 \\ -\sin 30^{\circ} & 0 & \cos 30^{\circ} \end{bmatrix} \qquad (2)$$

Then, the point cloud is projected onto the YZ plane by removing the X coordinate.
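A minimal sketch of this image generation step is given below, assuming the point cloud is an Nx6 NumPy array with RGB values in 0-255; the output resolution is arbitrary, since the images are later resized to the network input anyway:

```python
# Sketch of the multi-view image generation described above: rotate the
# object i*r radians around Z (Equation 1), tilt 30 degrees around Y
# (Equation 2), drop X and rasterize the YZ plane with averaged colors.
import numpy as np

def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rot_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def cloud_to_image(P, i, r, pixels=256):
    """P: Nx6 array [X Y Z R G B]; i: rotation step; r: angle step (radians)."""
    xyz = P[:, :3] @ rot_z(i * r).T @ rot_y(np.radians(30)).T
    yz = xyz[:, 1:3]                                  # drop X (projection)
    mins = yz.min(axis=0)                             # map Y,Z to pixel indices
    scale = (pixels - 1) / (yz.max(axis=0) - mins + 1e-9)
    cols, rows = ((yz - mins) * scale).astype(int).T
    rows = pixels - 1 - rows                          # image row 0 at the top
    img = np.zeros((pixels, pixels, 3))
    count = np.zeros((pixels, pixels, 1))
    np.add.at(img, (rows, cols), P[:, 3:6])           # accumulate point colors
    np.add.at(count, (rows, cols), 1)
    # Average colors where points project; white background elsewhere.
    img = np.where(count > 0, img / np.maximum(count, 1), 255)
    return img.astype(np.uint8)
```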

Classification
The InceptionV3 architecture (Szegedy et al., 2016) is used for the classification, as it is one of the networks with the best accuracy in relation to the number of operations required for training (Canziani et al., 2016). This architecture has proven to work well in a multitude of object classification applications (Saini and Susan, 2019; Xia et al., 2017). The InceptionV3 network has an input size of 299x299x3 pixels. Since both the images obtained from online sources and the images obtained from point clouds are in RGB format, there is no need to adjust the color channels. Since the images have different sizes, they are resized to fit the network input (Gao and Gruev, 2011). Color assignment is performed by bicubic interpolation: each output pixel value is a weighted average of the pixels in the nearest 4x4 neighborhood.
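A minimal sketch of this setup, assuming a TensorFlow/Keras implementation (the paper does not state the framework) and a standard classification head on top of InceptionV3; the head and the training settings shown are illustrative assumptions, not the authors' exact configuration:

```python
# Sketch: InceptionV3 adapted to the five furniture classes, with inputs
# resized to 299x299 pixels by bicubic interpolation.
import tensorflow as tf

NUM_CLASSES = 5  # board, bookshelf, chair, sofa, table

base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

def preprocess(image):
    """Resize any RGB image to the network input with bicubic filtering."""
    image = tf.image.resize(image, [299, 299], method="bicubic")
    return tf.keras.applications.inception_v3.preprocess_input(image)
```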

Data
The point clouds used for training, validation, and testing of the neural network were obtained from areas 1 to 4 of the 2D-3D Stanford Dataset. The dataset contains indoor point clouds colored in RGB. The furniture classes and the number of objects available in the dataset are 56 boards, 179 bookshelves, 676 chairs, 21 sofas and 145 tables. Objects have an average density of 10 thousand points per square meter. The number of samples among classes is clearly unbalanced. For each class, 200 point cloud images were generated following the abovementioned method (projection and data augmentation). For each class, 550 images were downloaded from Google Images using the "Download All Images" extension. Figure 2 shows samples for each class.
Figure 2. Samples of the five classes: above, images obtained from online sources; below, images obtained from point clouds.

Training
Once sufficient samples for each class were available, they were distributed and the CNN was trained. For each class, 500 samples were used for training, 50 for validation and 100 for testing. The training set consists of 500 images per class, of which a small percentage (between 0% and 10%) corresponds to point cloud images; the remaining images (100% down to 90%, respectively) are downloaded images. This variation was done in 2% increments (10 samples per class), as sketched below.
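The construction of these mixed training sets can be sketched as follows; the helper below is hypothetical (the paper provides no code), but it follows the sample counts stated above:

```python
# Hypothetical sketch of how the training mixtures are built: 500 training
# samples per class, with the point cloud share varying from 0% to 10%
# in 2% steps (10 images per step and class).
import random

TRAIN_PER_CLASS = 500

def build_training_set(online_imgs, cloud_imgs, cloud_fraction):
    """online_imgs, cloud_imgs: lists of image paths for one class."""
    n_cloud = int(TRAIN_PER_CLASS * cloud_fraction)   # e.g. 0.10 -> 50 images
    n_online = TRAIN_PER_CLASS - n_cloud              # remainder from Google Images
    return (random.sample(cloud_imgs, n_cloud)
            + random.sample(online_imgs, n_online))

# One training set per mixture: 0%, 2%, 4%, 6%, 8% and 10% point cloud images.
mixtures = [p / 100 for p in range(0, 12, 2)]
```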

Results and discussion
Figure 3 and Figure 4 show the evolution of the loss in the successive training sessions with online images and with point cloud images in the validation set, respectively. All the networks converged satisfactorily; however, those using online images as the validation set show a faster convergence, since they do not consider the same feature selection as the point cloud images. Table 2 and Table 3 compile the results obtained from the different training sessions on the testing set. Figure 5 shows images of correctly classified objects. Without any point cloud image in the training set (0% of point cloud samples), the neural network was unable to learn appropriate features to identify each object. Therefore, point cloud images colored in RGB were not similar enough to online images to obtain a satisfactory classification. Adding point cloud images to the training set improves the accuracy. The first ingestion of 10 samples per object (2% of point cloud samples) in the training set doubled the accuracy, to 0.67. As point cloud images continued to be introduced into the training set, accuracy increased steeply to 0.88 and 0.87 (depending on the validation set) with 50 samples per object (10% of point cloud samples in the training set). This accuracy positions the proposed method, which minimizes the number of point cloud objects, very close to the state of the art on the 2D-3D Stanford Dataset (Turkoglu et al., 2018), and even above some other works (McCormac et al., 2017; Tchapmi et al., 2017; Turkoglu et al., 2018). However, those works present semantic segmentation methods for indoor point clouds, not only the object classification proposed here, which would require a prior phase of segmenting objects from structural elements and individualizing them.
No great differences in accuracy were observed between using or not using point cloud images in the validation set. Therefore, point cloud images can be eliminated from the validation set to further reduce the number of point cloud samples. Table 4 and Table 5 show the confusion matrices for the training sessions with 10% of point cloud images. The classes with the highest accuracy were board and chair. From the analysis of the images and the errors, the causes of the most relevant confusions can be deduced. Bookshelves were confused with other objects because of their great variation in shapes, textures, and contents. Sofas showed a high confusion with chairs, since the chair set contains some easy chairs. Finally, the table class includes tables of different shapes as well as desks; in most cases, tables have objects on top of them that hinder visualization. It was also observed that the objects in point cloud images contained some errors, caused during acquisition and subsequent representation, that may influence training and classification. These point clouds often presented diffuse contours, differences in density between objects and between areas of the same object, and strong occlusions (Figure 6). Noise can create shapes that confuse the CNN, and occlusions can hide object shapes that the CNN needs for identification.

Table 5. Confusion matrix of the CNN trained with point cloud images in the validation set.

Figure 6. Samples with strong changes in intensity, occlusions and shape variation: a) bookshelves, b) chairs and c) tables.

CONCLUSIONS
In this work, the use of online images has been studied to minimize the number of point cloud samples needed to train a neural network for the classification of indoor objects. Classification with a CNN has been adopted, so point clouds have been converted into images. Several training sets have been designed in which the percentages of samples obtained from point clouds and from online images are varied.
Colored point clouds provided by the 2D-3D Stanford Dataset and images from online sources were used to classify five classes of indoor objects. The results show that online images cannot be used exclusively to train a CNN whose objective is to classify point clouds (even if these have color). The accuracy of the classifier increases gradually as the number of images obtained from point clouds in the training set increases. With 10% of point cloud images in the training set, an accuracy of 0.88 was achieved. Although the proposed method minimizes the number of point cloud samples, the choice of how many samples to use in training depends on the creator of the dataset, the number of available samples, and the final accuracy desired. Future work will focus on studying how occlusions and other anomalies in point clouds of objects affect the classification results.