COMPLETION OF SPARSE AND PARTIAL POINT CLOUDS OF VEHICLES USING A NOVEL END-TO-END NETWORK

ABSTRACT: Completing the 3D shape of vehicles from real scan data, i.e., estimating the complete geometry of vehicles from partial inputs, plays an important role in the fields of remote sensing and autonomous driving. With the recent popularity of deep learning, plenty of data-driven methods have been proposed. However, most of them require additional prior knowledge about the input, for example, semantic labels or symmetry assumptions. In this paper, we design a novel end-to-end network, termed S2U-Net, to complete the 3D shapes of vehicles from partial and sparse point clouds. Our network consists of two modules, an encoder and a generator. The encoder extracts the global feature of the incomplete and sparse point cloud, while the generator produces a fine-grained and dense completion. In particular, we adopt an upsampling strategy to output a more uniform point cloud. Experimental results on the KITTI dataset show that our method outperforms the state of the art in terms of distribution uniformity and completion quality. Specifically, when evaluating the completed results on a point cloud registration task, we improve translation accuracy by 50.8% and rotation accuracy by 40.6%.


INTRODUCTION
Recently, 3D shape completion for vehicles from real scanned data has been regarded as a fundamental and intriguing problem in the fields of remote sensing, computer vision, and robotics. Especially with the development of autonomous driving, more and more companies and research institutes adopt different sensors (e.g., mobile laser scanning systems or depth cameras) to collect 3D point clouds in urban scenarios. However, the point clouds of the real world acquired by these sensors are usually incomplete and noisy. For example, as shown in Figure 1 (Top), the illustrated point clouds of cars from the KITTI dataset (Geiger et al., 2013) are sparse and incomplete, and some of them contain only a very small number of points, which hampers downstream recognition tasks. To tackle this problem, completing low-quality point clouds by recovering geometric surfaces from sparse and partial points is a practical but challenging solution. If the point clouds of vehicles can be completed, recognition algorithms in the autonomous driving field would benefit, and a solid foundation would be provided for perception tasks (Wen et al., 2019).
Given a partial and sparse point cloud of a vehicle, shape completion aims to recover the missing points and output a uniform and dense point cloud with complete geometry. With the recent popularity of deep learning-based methods and the emergence of synthetic datasets of 3D CAD models, such as Pix3D (Sun et al., 2018), ObjectNet3D (Xiang et al., 2016), and ShapeNet (Chang et al., 2015), using a data-driven strategy with neural networks to accomplish this mission has proved to be feasible and realistic. For example, network structures like 3D-EPN (Dai et al., 2017) and SSCNet (Song et al., 2017) have revealed excellent performance in inferring 3D shape information through learning on synthetic datasets. However, these methods usually adopt a voxelized representation to organize the data, which limits the output resolution since 3D voxel grids are effectively a down-sampled representation. To some extent, these voxel-based methods also struggle to achieve fine-grained completions, because voxels are an artifact of artificial discretization.
To this end, more and more researchers choose to operate directly on raw point clouds. Compared with the 3D voxel representation, point-cloud-based representations can boost completion performance in terms of the fine-grained details of objects. Recently, the Point Completion Network (PCN) (Yuan et al., 2018) was proposed to produce a dense and complete geometric shape from a partial 3D point cloud. However, the points in its output are particularly non-uniform and are often over-concentrated in areas such as the wheels of vehicles. Inspired by this work, in this paper we propose an innovative 3D shape completion network, termed S2U-Net. The main contributions of this paper are as follows:
• We design a novel network, S2U-Net, which operates directly on a partial point cloud for 3D shape completion. By applying an upsampling strategy, S2U-Net can recover a uniform, dense, and complete point cloud of vehicles from real scan data.
• Experimental results show the improved performance of S2U-Net over state-of-the-art methods, and the completed results significantly help the point cloud registration task.
We train the proposed S2U-Net on the vehicle class of the synthetic ShapeNet dataset (Chang et al., 2015) and test it on two real-scanned datasets, KITTI (Geiger et al., 2013) and TUM (Gehrung et al., 2017). Furthermore, we also investigate the completion performance through a point cloud registration task. Experimental results demonstrate that our method outperforms state-of-the-art methods with respect to distribution uniformity and completion quality.

Geometry-based Shape Completion
According to geometric cues (e.g., continuity of local surfaces or volumetric smoothness) of incomplete inputs, geometry-based approaches (Kazhdan et al., 2006, Tagliasacchi et al., 2011, Wu et al., 2015a) can successfully retouch small holes on the surfaces of point clouds. When recovering significant missing regions, hand-designed heuristics are applied to complete the 3D shape of objects. For example, (Schnabel et al., 2009) employed combinations of planes and cylinders to guide 3D shape completion for partial point clouds. Furthermore, (Li et al., 2011) proposed an innovative method to learn global relations between such primitives. Considering that man-made objects usually exhibit structural regularity, some studies (Pauly et al., 2008, Zheng et al., 2010) proposed approaches to find regular or periodic structures in geometric models and then use them to complete missing surfaces. However, these methods heavily rely on the assumption that the partial input point cloud already has a moderate degree of completeness.

Template-based Shape Completion
Apart from geometry-based completion, another common shape completion strategy is to retrieve the most similar shape or template from a large-scale database as a reference, and then deform or reconstruct the input shape according to the retrieved reference. (Pauly et al., 2005) produced complete 3D shapes using geometric priors for missing regions from a given 3D shape database, but manual interaction is required to limit the categories of objects. Similarly, (Rock et al., 2015) explored a method to automatically complete a 3D model of any class from a single depth image. However, these methods strongly depend on the capacity of the 3D shape database. To avoid a high dependency on large databases, (Shen et al., 2012) proposed an assembly-based approach using geometric primitives to recover 3D structures with a small-scale shape dataset. (Sung et al., 2015) predicted the geometric information of an input model and then used a global optimization to reconstruct the entire underlying surface. However, these methods still suffer from several limitations. First, the optimization schemes are usually too expensive in computational cost for online applications. Second, each shape in the pre-prepared 3D shape database must be labeled and segmented manually. Last but not least, they are sensitive to noise.

Learning-based Shape Completion
Recently, deep learning-based methods for 3D shape completion have become a popular topic. Most of them complete shapes directly from a partial input using an end-to-end neural network. (Wu et al., 2015b) constructed a large-scale synthetic object dataset named ModelNet and proposed Convolutional Deep Belief Networks (CDBNs) to learn shape distributions for completing point clouds. (Thanh Nguyen et al., 2016) integrated CDBNs and Markov Random Fields (MRFs) to recover incomplete shapes. (Sharma et al., 2016, Varley et al., 2017) applied encoder-decoder networks for shape completion. However, these methods all selected voxelization as the 3D data representation since it allows 3D convolutions. (Dai et al., 2017) explored a 3D Encoder-Predictor Network (EPN) for estimating a sparse but complete shape, which is then refined through nearest-neighbor-based volumetric post-processing. One recent work, (Yuan et al., 2018), proposed to operate directly on point clouds for 3D shape recovery, which is most related to our goal. Nevertheless, this approach exhibits two limitations: the completed shape is not uniform, with many regions over-concentrated, and the output point clouds tend to lose detailed information.

Overview
We convert the completion of point clouds into a set prediction problem: given a partial and sparse point set X obtained from the observed surfaces, we aim to predict a dense and uniform point set Y that is uniformly sampled from the complete surfaces of the object. Notably, Y is not necessarily a superset of X; the two sets are sampled from the object surfaces independently.
Our proposed S2U-Net consists of two modules, the encoder and the generator. The former extracts a global feature of the partial input, while the latter produces a dense, complete, and uniform point cloud from the latent space. Besides, we adopt a coarse-to-fine strategy in the generator: it first predicts a sparse point cloud with complete geometry and then refines local regions using an upsampling procedure. For a better understanding, the network architecture of the S2U-Net is shown in Figure 2, which includes two components: (1) To robustly extract the global features of the partial inputs, two stacked PointNet (Qi et al., 2017a) layers are employed in the encoder. (2) To generate the dense, uniform, and complete point cloud, the S2U-Net uses the coarse-to-fine strategy in the generator: the coarse network estimates a complete but sparse point cloud, and the fine network then upsamples it by combining local information and global structure. The details of the network architecture are introduced in the following sub-sections.

Encoder
Our encoder is based on the recently advanced feature extraction network PointNet (Qi et al., 2017a), which operates directly on point clouds. Its extension PointNet++ (Qi et al., 2017b) further improves feature extraction by hierarchically aggregating features over local neighborhoods.

Coarse-to-fine Generator
After obtaining the final global feature F2, the generator, shown at the bottom of Figure 2, predicts the 3D coordinates of the output point cloud. We explore a coarse-to-fine approach for generating the complete point cloud. First, inspired by the 3D object reconstruction network RealPoint3D, we use fully-connected layers, which work well for estimating a sparse set of points carrying the complete geometric information of the object. Therefore, the feature F2 is fed into three fully connected layers, the last of which outputs a 3072-dimensional vector that is reshaped into a coarse point cloud of size 1024 × 3.
However, since the regression of points using fully connected layers does not constrain the local density, a large number of points will be over-concentrated if we only employ fully connected layers to output the dense point cloud. Thus, in the second step, we apply a feature expansion strategy to enhance the distribution uniformity and completeness of the output point cloud, where an up-down-up expansion unit is used. This unit consists of an up-feature operator and a down-feature operator, which have been proven to enhance feature variations effectively. Last, we add a farthest point sampling step to retain 16384 points in the output point cloud.
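The farthest point sampling step can be sketched as follows. This is a minimal NumPy version with toy sizes (the helper name `farthest_point_sampling` is ours); the actual network retains 16384 points from the expanded point set:

```python
import numpy as np

def farthest_point_sampling(points, k):
    # Greedy farthest point sampling: repeatedly pick the point farthest
    # from the set already chosen, yielding a spatially uniform subset.
    n = points.shape[0]
    chosen = np.zeros(k, dtype=int)       # chosen[0] is an arbitrary seed
    dist = np.full(n, np.inf)             # distance to nearest chosen point
    for i in range(1, k):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = int(dist.argmax())    # farthest from all chosen so far
    return points[chosen]

rng = np.random.default_rng(1)
dense = rng.standard_normal((4096, 3))    # expanded (upsampled) candidates
subset = farthest_point_sampling(dense, 256)
print(subset.shape)  # (256, 3)
```

Because each iteration maximizes the distance to the already-selected set, the retained points spread evenly over the surface instead of clustering.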

Loss Function
We expect the topological distance between the generated result and the ground truth to be as small as possible. The distance function must be highly efficient and invariant to permutations of the points. Furthermore, it needs to be robust to outliers, because each point cloud is an unordered point set with underlying noise. Inspired by (Fan et al., 2017), we adopt the Chamfer distance (CD) and the Earth Mover's Distance (EMD) to optimize the S2U-Net.
CD is defined as follows:

CD(Q_1, Q_2) = \frac{1}{|Q_1|} \sum_{x \in Q_1} \min_{y \in Q_2} \|x - y\|_2 + \frac{1}{|Q_2|} \sum_{y \in Q_2} \min_{x \in Q_1} \|x - y\|_2, (1)

where Q_1, Q_2 ⊆ R^3; Q_1 and Q_2 are the generated point cloud and the ground truth, respectively. Notably, the sizes of Q_1 and Q_2 can be different.
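As a sanity check, CD can be computed directly with NumPy. This is a brute-force O(|Q_1|·|Q_2|) sketch; practical implementations use spatial data structures or GPU kernels:

```python
import numpy as np

def chamfer_distance(q1, q2):
    # Symmetric Chamfer distance: average nearest-neighbour distance
    # from Q1 to Q2 plus the same from Q2 to Q1.  The two sets may
    # have different sizes.
    d = np.linalg.norm(q1[:, None, :] - q2[None, :, :], axis=2)  # pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

a = np.array([[0.0, 0, 0], [1, 0, 0]])
b = np.array([[0.0, 0, 0], [1, 0, 0], [1, 1, 0]])
print(chamfer_distance(a, b))  # 1/3: only the extra point (1,1,0) is unmatched
```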
EMD is defined as follows:

EMD(Q_1, Q_2) = \min_{\phi: Q_1 \to Q_2} \frac{1}{|Q_1|} \sum_{x \in Q_1} \|x - \phi(x)\|_2, (2)

where Q_1, Q_2 ⊆ R^3 and \phi: Q_1 → Q_2 is a bijection. Unlike CD, the sizes of Q_1 and Q_2 must be the same.
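The bijection φ makes EMD an assignment problem. The toy sketch below simply enumerates all bijections, which is only feasible for a handful of points; real implementations use approximate assignment solvers:

```python
import itertools
import numpy as np

def emd(q1, q2):
    # Earth Mover's Distance for equal-sized sets: minimum average
    # point-to-point distance over all bijections phi: Q1 -> Q2.
    # Brute force over permutations -- for tiny toy sets only.
    assert len(q1) == len(q2)
    best = min(
        sum(np.linalg.norm(q1[i] - q2[j]) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(len(q2)))
    )
    return best / len(q1)

a = np.array([[0.0, 0, 0], [1, 0, 0]])
b = np.array([[1.0, 0, 0], [0, 0, 0]])
print(emd(a, b))  # 0.0: the optimal bijection matches identical points
```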
Since we adopt a coarse-to-fine training method, the loss function involves two terms:

L(P_{coarse}, P_{fine}, P_{gt}) = EMD(P_{coarse}, \tilde{P}_{gt}) + γ CD(P_{fine}, P_{gt}), (3)

where P_{coarse} is the sparse point cloud generated by the first step of the generator and \tilde{P}_{gt} is the sub-sampled ground truth point cloud with the same size as P_{coarse}. The second term is the Chamfer distance between the final predicted point cloud P_{fine} and the full ground truth point cloud P_{gt}, in which γ is a hyperparameter balancing the two terms.

EXPERIMENTS
For evaluating the effectiveness of our S2U-Net, experiments on 3D shape completion are conducted on synthetic shapes from the ShapeNet (Chang et al., 2015) dataset and on real raw point clouds from the KITTI (Geiger et al., 2013) and TUM (Gehrung et al., 2017) datasets. We first introduce how the training data are prepared and give some implementation details of our network. Next, we compare the proposed method with state-of-the-art methods. Last, we apply the completed results to a point cloud registration task to demonstrate the performance of our method.

Dataset
We train our network on the ShapeNet (Chang et al., 2015) dataset, a richly-annotated and large-scale source of synthetic 3D CAD models. ShapeNet covers 220,000 models and 3,135 categories of objects. In this work, we only use the category of cars as training samples, giving a total of 5677 objects. For the preparation, on the one hand, 16384 points are sampled uniformly from the mesh surfaces of each object as the ground truth point cloud. On the other hand, we create partial point clouds by converting depth images to point clouds. For data augmentation, multiple depth images rendered from different viewpoints (e.g., eight directions) are used for each sample. Therefore, each training sample includes eight partial point clouds and one corresponding ground truth point cloud. Notably, the sizes of these partial point clouds can differ.
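Uniform surface sampling of the ground truth can be sketched as follows (a NumPy example on a toy two-triangle mesh; the helper name is ours). Triangles are selected with probability proportional to their area, and points are drawn with uniform barycentric coordinates:

```python
import numpy as np

def sample_mesh_surface(vertices, faces, n, rng):
    # Area-weighted triangle selection followed by uniform barycentric
    # sampling inside each selected triangle.
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    idx = rng.choice(len(faces), size=n, p=areas / areas.sum())
    u, v = rng.random((2, n))
    flip = u + v > 1.0                  # fold points back into the triangle
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    return (v0[idx] + u[:, None] * (v1[idx] - v0[idx])
                    + v[:, None] * (v2[idx] - v0[idx]))

# Toy mesh: two triangles forming a unit square in the z = 0 plane.
verts = np.array([[0.0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]])
faces = np.array([[0, 1, 2], [0, 2, 3]])
pts = sample_mesh_surface(verts, faces, 16384, np.random.default_rng(2))
print(pts.shape)  # (16384, 3)
```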
In addition, we choose the KITTI (Geiger et al., 2013) dataset for evaluating the performance of our method on real scan data. The KITTI dataset provides raw point clouds collected by a Velodyne HDL-64E rotating 3D laser scanner and annotations for seven classes of dynamic objects in the form of 3D bounding box tracklets. In this work, we select the category of cars for shape completion.

Training strategy
The proposed network S2U-Net is implemented in the TensorFlow framework and trained on a single Nvidia Titan Xp GPU. We use Adam as the optimizer for the whole network and train for 100 epochs. Due to the limitation of GPU resources, the batch size is set to 8. The coarse output point cloud in the generator includes 1024 points. We set the initial learning rate to 0.0001 and gradually reduce it by a decay rate of 0.7 every 50K steps until it reaches 10^-6.
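This staircase schedule can be written as a small helper (a sketch of the schedule described above; the function and parameter names are ours):

```python
def learning_rate(step, base=1e-4, decay=0.7, decay_steps=50_000, floor=1e-6):
    # Staircase exponential decay: multiply by 0.7 every 50K steps,
    # clipped below at 1e-6.
    return max(base * decay ** (step // decay_steps), floor)

print(learning_rate(0))        # 0.0001
print(learning_rate(100_000))  # 1e-4 * 0.7**2, about 4.9e-05
```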

Results and Comparisons
Shape Completion on Real-scanned Data

In this experiment, we compare our proposed method S2U-Net with PCN (Yuan et al., 2018) on partial real-scanned data from the KITTI (Geiger et al., 2013) dataset. Following the experimental setting in PCN, we extract 2483 partial point clouds of cars according to their labeled bounding boxes in each frame. Note that we first need to convert the original scan data into the box coordinates, since the training model is based on cars from the synthetic 3D model dataset, in which all point clouds are normalized. Besides, because there are no ground truth point clouds of vehicles in the KITTI dataset, we only compare the two methods qualitatively. Some visual results are shown in Figure 3.
To further demonstrate the better performance of our proposed method, we randomly select ten different cars from the KITTI dataset for comparison. The first column is the sparse and incomplete input; the second and fourth columns are the results of PCN and S2U-Net, respectively. Note that the output point clouds all contain 16384 points; it may create the visual illusion that PCN outputs more points, because the point clouds generated by PCN are clustered and disorganized. From Figure 3, we can see that our S2U-Net produces more uniform point sets than PCN, which tends to generate noisy and non-uniform point clouds. For example, in the first row of Figure 3, the points of the car created by PCN are messy, and most of them are over-concentrated on the wheels of the vehicle, whereas the points of the result completed by S2U-Net are uniformly scattered over the geometric surface.
Especially from the bottom-up views (the third and fifth columns), it can be seen that S2U-Net recovers the fine-detailed structure of cars. For instance, in the first row of Figure 3, S2U-Net recovers the complete wheels of the vehicle, while PCN almost misses them. In the eighth row, some points escape the surface of the car with the PCN method, whereas these points lie flat and uniform on the geometric plane with our method. We attribute this to the generator's feature expansion capability and the farthest point sampling strategy.
In addition, our approach is robust to varying input densities. As shown in Figure 3, the points of the ten inputs range from tens to hundreds. For example, in the second, third, sixth, and tenth rows, although the inputs contain fewer than 50 points, our proposed S2U-Net still generates complete, smooth, and uniform surfaces for the missing structures. This also demonstrates that our method is capable of filling large gaps in point clouds, which we attribute to the powerful global feature extraction capability of the encoder.

Applications with completion results
3D point cloud completion of vehicles is helpful for many applications in the autonomous driving field, for example, registration between point clouds. Therefore, we conduct a set of registration experiments between vehicle point clouds of adjacent frames from the KITTI dataset. As is well known, errors arise when estimating the spatial transformation that aligns two different point clouds. To demonstrate that our completion results can significantly reduce errors in point cloud registration tasks, we take the original partial point clouds and the completion results of our proposed S2U-Net as inputs, respectively.
Besides, we implement the registration with a point-to-point ICP (Besl, McKay, 1992) method. ICP is a classic point cloud matching algorithm, which registers point clouds by iteratively minimizing the distances between points from two scans. It is simple and easy to implement using the PCL library (Rusu, Cousins, 2011). Following (Yuan et al., 2018), we select the average rotation and translation errors as evaluation metrics, i.e., the angular difference between R1 and R2 and the distance between T1 and T2, averaged over all registered pairs, where R1 and T1 are the ground truth rotation and translation from the KITTI dataset, and R2 and T2 are the rotation and translation estimated by the ICP method.
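A common way to compute these two errors, sketched here in NumPy under the assumption that the rotations are given as 3×3 matrices, is the geodesic angle between R1 and R2 and the Euclidean norm of T1 − T2 (the helper names and the `rot_z` test rotation are ours):

```python
import numpy as np

def rotation_error_deg(r1, r2):
    # Geodesic angle between two rotation matrices:
    # theta = arccos((trace(R1^T R2) - 1) / 2).
    cos_t = (np.trace(r1.T @ r2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

def translation_error(t1, t2):
    # Euclidean distance between the two translation vectors.
    return float(np.linalg.norm(np.asarray(t1, float) - np.asarray(t2, float)))

def rot_z(deg):
    # Rotation about the z axis, used only to exercise the metrics.
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1.0]])

print(rotation_error_deg(rot_z(10), rot_z(25)))  # about 15.0 degrees
print(translation_error([0, 0, 0], [3, 4, 0]))   # 5.0
```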
The results are shown in Table 1, where we refer to the sparse and partial point clouds from the original scans as "Origin". From Table 1, we can clearly see that using the point clouds completed by S2U-Net significantly improves the effectiveness of the registration task: translation accuracy and rotation accuracy are improved by 50.8% and 40.6%, respectively. We believe the completed point clouds achieve this performance because, thanks to the points generated by S2U-Net, they have more overlapping regions than the original scan data. This also demonstrates that our method is of great significance for real applications.

Test with TUM dataset
To further demonstrate the effectiveness of our proposed S2U-Net on real scan data, we select the Mobile Laser Scanning (MLS) dataset (Gehrung et al., 2017) of the TUM City Campus for evaluation. This dataset covers around 80,000 m² with annotations and was originally acquired by the Fraunhofer Institute of Optronics, System Technologies and Image Exploitation (IOSB) with two Velodyne HDL-64E laser scanners. Here, we use the category of vehicles as the testing data. Some results, including failure cases, are visualized in Figure 4. Notably, there are also no corresponding ground truth point clouds of vehicles in this dataset.
As seen in the first three rows of Figure 4, S2U-Net produces dense, uniform, and complete point clouds for these partial inputs of different sizes. However, the remaining rows show failure cases of our proposed method, where the generated points do not lie on the surface and many points escape it. Because the TUM City Campus dataset provides no bounding boxes of vehicles, we have to extract the point clouds of vehicles manually. In addition, this dataset contains considerable noise; for example, in the second row, there are many unrelated points on the wheels of the cars. These factors limit the accuracy of the data preprocessing.

CONCLUSION
In this paper, we proposed a novel end-to-end network, S2U-Net, which recovers a more uniform and fine-detailed structure for 3D shape completion of vehicles from partial point clouds in real scan data. Compared with previous methods, we explored an upsampling strategy to enhance the distribution uniformity and completion quality in the process of generating complete point clouds. Experimental results demonstrated that our method outperforms other state-of-the-art methods.
However, we found that the variation among generated point clouds is less noticeable. One reason is the limited intra-class variation of vehicles in the ShapeNet dataset used to prepare the training data. Thus, we believe this is a limitation for any method using this synthetic 3D CAD model database as training data, including 3D-EPN and PCN. Besides, S2U-Net can only complete the class of vehicles from real scan data. We will explore more categories in the urban scene in the future, such as traffic signs and buildings.