RIDF: A ROBUST ROTATION-INVARIANT DESCRIPTOR FOR 3D POINT CLOUD REGISTRATION IN THE FREQUENCY DOMAIN

Registration of point clouds is a fundamental problem in the community of photogrammetry and 3D computer vision. Generally, point cloud registration consists of two steps: the search of correspondences and the estimation of transformation parameters. However, to find correspondences from point clouds, generating robust and discriminative features is of necessity. In this paper, we address the problem of extracting robust rotation-invariant features for fast coarse registration of point clouds under the assumption that the pairwise point clouds are transformed with rigid transformation. With a Fourier-based descriptor, point clouds represented by volumetric images can be mapped from the image to feature space. It is achieved by considering a gradient histogram as a continuous angular signal which can be well represented by the spherical harmonics. The rotation-invariance is established based on the Fourier-based analysis, in which high-frequency signals can be filtered out. This makes the extracted features robust to noises and outliers. Then, with the extracted features, pairwise correspondence can be found by the fast search. Finally, the transformation parameters can be estimated by fitting the rigid transformation model using the corresponding points and RANSAC algorithm. Experiments are conducted to prove the effectiveness of our proposed method in the task of point cloud registration. Regarding the experimental results of the point cloud registration using two TLS benchmark point cloud datasets, featuring with limited overlaps and uneven point densities and covering different urban scenes, our proposed method can achieve a fast coarse registration with rotation errors of less than 1 degree and translation errors of less than 1m.


INTRODUCTION
Registration of point clouds is a fundamental problem in the community of photogrammetry and 3D computer vision. To some extent, the result of point cloud registration serve as prerequisite for many applications, such as, construction monitoring (Bosché et al., 2015, Tuttas et al., 2017, forestry monitoring , urban planning (Vosselman, Maas, 2010), and 3D modeling (Lafarge, Mallet, 2012, Yang et al., 2013. Briefly, the objective of point cloud registration is to unite the data (i.e., 3D points) acquired in various views, platforms, or times into a common coordinate system. Generally, the registration methods mainly consist of two essential categories: coarse registration and fine registration. For fine registration, iterative closest point (ICP) (Besl, McKay, 1992, Chen, Medioni, 1991, Zhang, 1994 and its variants (Habib et al., 2010, Yang et al., 2015 are the representative algorithms and have been proved to be effective and efficient. Additionally, normal distribution transform (NDT) (Biber, Straßer, 2003) is also a commonly used standard algorithm for fine registration. However, in the step of fine registration, proper initial transformation values are of great importance for fine registration to avoid local optimum. In this paper, we address the problem of how to provide good initial transformation parameters to find registration, namely the problem of coarse registration. To achieve coarse registration of point clouds, at least two steps are required: finding the correspondences and estimating the * Corresponding author transformation parameters. Among the two steps, finding the correspondences between point clouds is important and is the basis for estimating transformation parameters.
For most general registration tasks, one commonly used strategy of finding correspondences is to use a feature-based strategy, which can be divided into three steps: • Detecting of key geometric points or primitives; • Constructing features for the detected key points or primitives; • Determining the correspondences using the extracted features.
However, the estimation of correspondences of unordered and unstructured point clouds is still challenging due to the following reasons: • Uneven point densities caused the varying observation distances of the scanners; • Noise and outliers resulting from temporary or moving objects; • Low-overlapping rate between point clouds; • High-similarity and symmetry, especially for point clouds acquired in urban areas; • Large data amount of point clouds.
To tackle the aforementioned problems, we present a robust and fast registration workflow. The workflow consists of four steps: representing unordered 3D point clouds by 3D volumetric images, constructing rotation-invariant features for each voxel in volumetric images, searching for corresponding voxels in feature domain, and estimating the transformation parameters with a robust estimator. Differing from the commonly used featurebased strategy, as mentioned before, in the proposed workflow, we skip the step of detecting key geometric points or primitives to eliminate the errors or inaccuracies caused in the step of detecting edges, planes or other primitives. For most of the pointwise or voxel-wise rotation-invariant descriptors, the rotation invariance is achieved by aligning the local reference frame (LRF) by estimating the dominant gradient direction. However, the descriptor used in this paper for constructing rotationinvariance can map arbitrary voxels in volumetric images to a feature field, which is also the reason why detecting key points or primitives can be skipped in the proposed workflow. Besides, the rotation-invariant features are robust to noise without the step of estimating the dominant gradient direction. With these factors taken into consideration, the main concept is to represent point clouds with robust and rotation-invariant features in the frequency domain, so that a robust alignment between point clouds can be achieved.

RELATED WORK
In this paper, we mainly address the problem of coarse registration. Generally, many kinds of research have reported different solutions for coarse point cloud registration. The registration methods can be primarily categorized into two classes: Feature description-based registration and geometric constraint-based registration.

Feature description-based registration
For the feature description-based registration methods, as mentioned before, feature descriptor plays an important role in feature matching. Corresponding points are identified via the similarity measurement between features. In the literature, many different descriptors have been reported to be useful in the feature matching, such as scale-invariant feature transform (SIFT) (Flitton et al., 2010), fast point feature histogram (FPFH) (Rusu et al., 2009), rotational projection statistics (RoPS) (Guo et al., 2013) and signature of histogram of orientations (SHOT) (Tombari et al., 2010) and so on. Generally, two characteristics of a sufficient descriptor have to be addressed, which are high descriptiveness and rotation-invariance. However, descriptors, such as SIFT, highly depend on intersecting point detectors. Thus, it is impossible for dense feature extraction using this kind of descriptor. Additionally, the strategy used for achieving rotation in-variance is based on pose normalization. For example, SIFT produces rotation-invariant features by aligning LRF to the dominant gradient orientation. However, most of the LRF-based feature descriptors are not robust to surface noise and outliers. The other way is to obtain the statistics of the local geometry. It is of easy conduction and fast computation. But the only problem for this strategy is that the features may suffer from low descriptiveness.
Apart from the point-based feature matching methods, many other researches utilized geometric primitives as features instead of points, such as lines (Habib et al., 2005, curves (Yang, Zang, 2014), planes (Xiao et al., 2013) and plane pairs (Chen et al., 2019). On the other hand, the combination of different kinds of primitives is also an effective direction for the task of registration and simultaneously improve the robustness of these geometric features (Xu et al., 2017). No matter the process of detecting keypoints or forming geometric primitives, the accuracy of the registration highly relies on the accuracy of the estimation of these geometric representations.
Sometimes, outliers will be induced in the process of finding edges, lines, or planes.

Geometric constraint-based registration
In addition to the feature description-based method, some methods follow a different registration scheme. In this type of method, the registration is not achieved by generating features and find feature correspondences; instead, they will identify corresponding points by predefined geometric constraints. Here, 4-points congruent set (4PCS) (Aiger et al., 2008), as well as its variants, such as Super4PCS (Mellado et al., n.d.), keypoint-based 4PCS (K4PCS) (Theiler et al., 2014), and semantic keypoint-based 4PCS (SK4PCS) (Ge, 2017) are representative registration methods using this strategy, in which pairwise congruent point sets are determined by exploiting the rule of intersection ratios. For point clouds, intersection ratios of congruent points are invariant under the affine transformation, and they are used to determine whether two point sets are matched or not. Compared with feature description-based registration, these algorithms are more robust to noise, occlusions, and uneven densities. Additionally, in this method, the candidates for the transformation parameters can be largely reduced, which releases the pressure on computation. Meanwhile, some other strategies are based on the 4PCS strategy but conducted using different primitives. Representative extensions of this strategy are the ones using planes replacing points. Voxel-based 4-plane congruent sets (V4PCS)  and 4-planes congruent sets (4-PlCS) (Bueno et al., 2017) are symbolic of the implementation of structured planes as congruent sets with angle ratio constraints.
The assembly or extracting of planes from points will powerfully overcome the uneven dot density and outlines that ensure consistent alignment of planes of specific angles. The distance measured of points used by conventional 4PCS approaches, as a comparison, is more sensitive to outliers and noises. Another interesting example is the volumetric 4PCS (V-4PCS) (Huang et al., 2017), which extends the four coplanar points to volumetric non-coplanar data and theoretically shows a significant reduction in computational complexity.

Our contributions
Although the literature mentioned above is not comprehensive, a few pieces of research have addressed the characteristics of 3D point clouds in the frequency domain. In 2D image matching, methods based on Fourier transforms are commonly utilized, such as phase correlation (Nagashima et al., 2006). In (Huang et al., 2019), the strategy of Fourier-based 2D image matching was successfully applied to 3D point cloud registration. In (Liu et al., 2014), a 3D rotation-invariant descriptor was proposed and tested in the task of shape retrieval. In our work, we embed the concept of achieving rotation-invariance in the frequency domain in the framework of point cloud registration. The major contributions of this paper are abstracted as follows: • We proposed an efficient framework for fast coarse registration of point clouds, which is sufficient in the low-overlapping dataset. In the proposed workflow, point clouds are represented using volumetric images and mapped to feature space using a rotation-invariant descriptor, spherical harmonic HoG (SHHOG), in the frequency domain.
• The descriptor has two main benefits. The descriptor is not restricted by geometric shape extraction. On the other hand, the descriptor is robust to noise and outliers since high-frequency signals are filtered out when representing the corresponding signal with spherical harmonics.

METHODOLOGY
The proposed methodology for point cloud registration comprises four essential steps: voxelization of point clouds, local rotation-invariant feature extraction from two volumetric images, finding pairwise correspondences, and estimation of transformation parameters, as illustrated in Fig. 1. In the first step, point clouds are individually transformed into 3D volumetric images, in which each voxel in the 3D grids represents the number of points located within the corresponding spatial boundaries. Second, features for each voxel are constructed using Fourier-based HoG features in a continuous way. Subsequently, the corresponding voxels are matched with a fast search method using the features extracted in the previous step. Finally, the transformation parameters are calculated using the corresponding voxels and a RANSAC-based strategy for robust estimation. The detailed explanation of each step in the workflow is introduced in the following sub-sections.

Local rotation-invariant feature extraction in the frequency domain
The local rotation-invariant feature extraction consists of three steps: voxel-wise HoG feature extraction with spherical harmonic representations, computation of regional features, and the final feature extraction. In Fig. 2, the key parts in feature extraction are illustrated.
3.1.1 Voxel-wise feature extraction The first step is to create a distribution for each voxel. Let a 3D location of any voxel in a volumetric image C be (x, y, z) in Cartesian coordinates, and the corresponding 3D gradient to be d(x, y, z) denoted as the gradient in each direction. Thus, the SH coefficients will be: where the Schmidt semi-normalized SHs are defined as: where P l m are Legendre polynomials, which can be calculated by given functions in Matlab. With the issue of noise and outliers considered, only low-frequency degrees in the SH representation are computed, which has the same effect as low-pass filtering.
Then, with spatial aggregation and normalization, the SHHOG field can be expressed as: Assuming that the 3D signal is rotated with l-th degree, the feature behavior under the rotation can be derived as: Thus, the spherical tensor operations are needed to create rotation-invariant regional descriptors from the SH HOG field.
3.1.2 Regional feature extraction To obtain rotationinvariant features and improve the descriptiveness of the features, multiple concentric voxels around a selected voxel are used to describe the surroundings of the corresponding voxel.
Before the identification of rotation-invariance of features, it should be noted that the inner-product of two spherical tensors are rotation-invariant, which can be proved by Thus, we can produce rotation-invariant features by coupling the filtering output of the same rank as the inner product. Finally, the descriptor can be calculated by where Gn denotes the spherical Guassian derivatives.

Procedures of local rotation-invariant feature extraction
In summary, the procedures for extracting rotationinvariant features are given below: • By applying Fourier transformation in spherical coordinates on each voxel in the volumetric image, the magnitude features can be obtained given the corresponding Fourier order numbers, which can be denoted as f 1 m = Dm , m = 0, 1, 2, ..., m. This part is called the magnitude part.
• Using Eq. 5, rotation behavior can be compensated. Thus, the second part of features is absolute rotation-invariant features, which can be presented as FmFm.
• In the former two parts, the phase information is completely removed. Aiming at reducing the loss of phase information, we derive the relative rotation-invariant features by the convolution of two spherical tensor fields using the tensor product using Eq. 6.
Combining the aforementioned parts, we can finally derive the rotation-invariant features. In general, there are three main advantages of this Fourier-based rotation-invariant features. First, continuous dense feature computation can be achieved using this descriptor. Second, a low-frequency filtering is conducted with the spherical harmonic representation, which makes the descriptor robust to noise and outliers. Third, the final features contain not only the information of the corresponding voxels but also the surroundings, which makes this descriptor of high descriptiveness.

Generating pairwise correspondences
The identification of corresponding voxels can be conducted in a fast way using the estimated features. The correspondences  between features are estimated by the criteria set with Euclidean distances in the feature domain. The selection of the threshold for feature matching varies for different datasets and is set empirically. It is set by checking the distribution of the distances between each voxel and its corresponding nearest voxel and also the number of voxel-pairs selected to the matched voxel list. The specific workflow for finding correspondences are provided in Algorithm 1.

Estimation of transformation parameters
After finding the corresponding voxel pairs, the transformation parameters of the rigid transformation, including 3D rotations and translation, between point clouds can be estimated. For achieving a robust estimation for the parameters, a RANSACbased strategy is applied. In each round of RANSAC, a set of transformation parameters can be obtained. Then the estimated transformation parameters are utilized to transform all Algorithm 1: Finding feature correspondences: Given feature representations of volumetric images of two scans, find the matched voxels from the two scans using the features. Require: Volumetric images I1 and I2 and corresponding feature fields F1 and F2 Ensure: Matched voxel pairs denoted by their corresponding positions 1: For each voxel in I1, search for the nearest points in feature domain using the Tree(F2) created with the corresponding features 2: Judge whether the distance between F1 and the matched F2 are smaller than the threshold σ 3: If not, jump to the next voxel 4: If yes, add the voxel pairs to the matched voxel list the matched voxels in the target scan. Here, the thresholds for the RANSAC trials are set to be two-fold. One should be the maximum number of trials, and the other one is the minimum threshold of the registration score. For any trials, if the registration score is lower than the threshold, the transformation parameters will be the final estimated transformation parameters, and the RANSAC process will be stopped. The other case is that the RANSAC process will be stopped when the number of trial reaches the maximum amount. In this case, the transformation parameters in the last trial will be saved.

Experimental data
The proposed method for performing the registration is tested using two TLS point cloud datasets, which are acquired with different scanners and from different urban scenes, as illustrated in Fig. 3. The first one is a point cloud pair from the ThermalMapper project acquired by the Jacobs University Bremen (Borrmann et al., 2013). The second dataset is one pair of point clouds chosen from the Real-world Scans with Small Overlap (RESSO) dataset (Chen et al., 2019). Compared with the first dataset, the overlapping rate is relatively lower. For the first dataset, the source and target point clouds are manually aligned as ground truth. For the second one, it is provided as a benchmark dataset, in which the accurate transformation parameters between the source and target point clouds are provided.
All the experiments were conducted on a 64 Bit Windows 10 PC with 8 GB RAM and an Intel (R) Core (TM) i7-6700 @ 3.4 GHz CPU.

Evaluation metric
For evaluation, we use the following evaluation metrics: the rotation errors and the translation errors, which are computed using the references provided by manually registration or ground truth. Providing the reference transformation matrix Tgt and the estimated transformation matrix Te, the rotation error e r and translation error e t can be computed by: wherein tr(·) denotes the trace.

Experimental results
The coarse registration results using two datasets are illustrated in Fig. 4. In the experiments, the voxel size used for the Bremen dataset is set to 2m, and the voxel size for the Resso dataset is set as 1m. The thresholds for the identification of feature correspondence are both set to be 0.5, and the maximum iteration number of RANSAC is set as 1 million times. In Table 2, the number of matched voxels are provided. From the visualization aspect, it is shown that the point clouds are well aligned. Meanwhile, the quantitative results for the registration are given in Table 3 and 4. For the two scenes of Bremen and Resso, the rotation errors are less than 1 degree. As for translation errors, they are both less than 1m. The results reveal that the proposed method can achieve successful registration among different point cloud datasets, even for the datasets which have a comparatively low overlapping rate. On the other hand, based on the coarse registration results, the registration can be further improved by applying fine registration.
To further validate the effectiveness, the experimental results are compared with three benchmark methods using the Bremen dataset. The baseline methods we used for comparison are FPF-HSAC (Holz et al., 2015), K4PCS (Theiler et al., 2014), and V4PCS . The FPFHSAC method registered point clouds using FPFH features and a RANSAC process for the robust estimation of transformation parameters. The K4PCS method is under the framework of 4PCS. The difference is that this method using keypoints instead of random points, which largely reduces the number of candidates and improves the efficiency and robustness of the algorithm. V4PCS provided a different solution for utilizing the 4PCS framework. The 4PCS are replaced by voxel-based 4-planes congruent sets. In this method, more geometric constraints are added, which improved the robustness and has been proved to be a fast and effective way for point cloud registration. The registration results for these methods are listed in Table 3. It is shown that all com- pared methods provide satisfying performance. Comparatively, our proposed method offers a weaker performance considering the rotation error and translation error. The main reason lies in the two factors which highly influence the registration accuracy. The first one is the geometric resolution of each voxel.
In our experiments, considering the limitation of computation memory and the covered areas of the point clouds, the voxel size can not be set to a small value, which largely limits the representation of details of the scene. In future work, the representation of voxels and the processing way for extracting the features can be changed to reduce the required memory. On the other hand, the criteria for estimating correspondences is also an important factor. The smaller the threshold is set, the fewer correspondences are extracted. The time spent on the RANSAC process will be largely reduced. However, a strict setting for the threshold will also decrease the flexibility of the registration framework in considering noise, outliers, or even artifacts caused by our former voxelization processing. As for the running time, although the time performance of our proposed method is slightly weaker than V4PCS, it is obviously better than the other two feature-based methods. However, it is not fair to draw a conclusion for the time performance since our proposed method was conducted on Matlab while others were conducted on C++. Additionally, as stated before, the voxel size and the criteria for estimating correspondence are two important factors influencing time performance. The time spent on the RANSAC process can be largely reduced with larger voxel size and smaller threshold.   Additionally, we also compare our registration results with the baseline coarse registration method, PLADE, provided by the data provider of Resso. The PLADE method offers a solution for the small-overlapping dataset, which utilizes strong constraints between different planes as descriptors for point clouds.
As shown in Table 4 and Fig. 4, our proposed method can also achieve acceptable results for the dataset with a small overlap. Although our result is not as satisfying as the result provided by PLADE, our result is at an acceptable level of coarse registration. It proves that our method is also adaptive to lowoverlapping cases.

SUMMARY
In this work, we present a marker-free registration method which combines a robust rotation-invariant descriptor with a RANSAC estimator. Specifically, the Fourier-based descriptor enables us to describe each point cloud with rotation-invariant features in a continuous dense way. The main benefit of this descriptor can be concluded as high-descriptiveness, rotationinvariance, robustness, and independent on shape feature extraction. The RANSAC-based strategy provides the possibilities for excluding some outliers in the process of finding correspondences. The experimental results using TLS benchmark datasets have proved the effectiveness of our proposed method.
Meanwhile, the proposed method also provides satisfying results for datasets with small overlaps. However, there are also some limitations of the proposed methods. The extraction of features using the rotation-invariant descriptor relies on the representation of point clouds in a volumetric way, which has a high requirement for the computation memory. In the future, we will try to embed a more efficient and less computationexpensive data structure to this framework. On the other hand, since this descriptor is less dependent on geometric feature extraction, such as edges, lines, or planes, it may be sufficient to utilize this descriptor for the registration in irregular-shaped areas (i.e., forested areas, non-urban areas).