Generation and Weighting of 3D Point Correspondences for Improved Registration of RGB-D Data

Registration of RGB-D data using visual features is often influenced by errors in the transformation of visual features to 3D space as well as the random error of individual 3D points. In a long sequence, these errors accumulate and lead to inaccurate and deformed point clouds, particularly in situations where loop closing is not feasible. We present an epipolar search method for accurate transformation of the keypoints from 2D to 3D space, and define weights for the 3D points based on the theoretical random error of depth measurements. Our results show that the epipolar search method results in more accurate 3D correspondences. We also demonstrate that weighting the 3D points improves the accuracy of sensor pose estimates along the trajectory.


INTRODUCTION
Since their recent introduction to the market, RGB-D cameras, such as the Kinect (Microsoft, 2010), have become popular for indoor mapping, modelling and navigation. The Kinect sensor captures depth and colour images at a rate of 20 to 30 frames per second, which can be combined into a coloured point cloud, also referred to as RGB-D data. Compared to laser scanning, Kinect RGB-D data have lower accuracy and resolution (Khoshelham, 2011). However, the high data acquisition rate and the great flexibility of the Kinect make it an attractive sensor for mapping and modelling indoor environments.
A primary step in mapping by RGB-D data is the registration of successive frames. The common approach is based on visual features, i.e. point correspondences extracted from the colour images by keypoint extraction and matching methods such as SIFT (Lowe, 2004) and SURF (Bay et al., 2008). These point correspondences are transformed to 3D space by using the depth data, and are then used to estimate the rotation and translation between every pair of frames.
The pairwise registration is prone to error due not only to the random error of individual points but also to errors in the transformation from the colour space to the depth space. In a long sequence, the pairwise registration errors accumulate and lead to deformation in the resulting point cloud. To cope with registration errors, loop closing has been used (May et al., 2009; Du et al., 2011; Endres et al., 2012; Henry et al., 2012). A loop in the trajectory of the sensor can be detected when the sensor returns to a previously observed scene. Loop closing is essentially a global adjustment of the sensor pose (position and rotation) simultaneously for all frames in a sequence.
Loop closing is not always feasible, for example when mapping a long narrow corridor, or when the two frames at the closing do not have sufficient overlap or reliable keypoint matches. In such situations, improvement of the pairwise registrations is important as it can reduce the error and deformations in the final point cloud.
In this paper, we look into two sources of error in pairwise registration based on visual features: the error in the transformation from the RGB space to the depth space, and the random error of individual points in the 3D space. We present a method for accurate transformation of point features from the RGB space to the depth space, and propose a weighting scheme to adjust the contribution of the 3D point correspondences in the estimation of the registration parameters. Our experiments show the role of relative orientation in the accuracy of the 3D point correspondences. We also demonstrate that weighting point correspondences based on their theoretical random error improves the registration accuracy.
The paper proceeds with a review of related literature in Section 2. In Section 3, the methods for the generation and weighting of 3D point correspondences are described. Section 4 describes the experiments and results of registration using weighted point correspondences. The paper concludes with final remarks in Section 5.

RELATED WORK
The popular approach to registering point clouds is the iterative closest point (ICP) algorithm (Besl and McKay, 1992; Chen and Medioni, 1992). Izadi et al. (2011) showed real-time registration of Kinect depth images using a GPU implementation of the ICP algorithm. The method of Fioraio and Konolige (2011) was also based on ICP, but could integrate features from the colour image.
Since ICP is a fine registration method requiring a close approximation of the registration parameters, it has often been used to refine an initial coarse registration. In the work of Henry et al. (2010), the initial registration parameters were estimated from SIFT keypoints (Lowe, 2004) extracted from and matched across the colour images, where outliers were removed using RANSAC (Fischler and Bolles, 1981). Du et al. (2011) followed a similar approach but allowed user interaction. The RGB-D SLAM method (Engelhard et al., 2011; Endres et al., 2012) and the method of Bachrach et al. (2012) were both based on the idea of initial registration using visual feature points, although they used different feature extraction operators. Dryanovski et al. (2012) performed the initial registration based on edge features extracted from the colour images. Steinbrucker et al. (2011) adopted an energy minimization approach to registering RGB-D data.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume II-5/W2, 2013. ISPRS Workshop Laser Scanning 2013, 11-13 November 2013, Antalya, Turkey
For loop closing, several methods have been used. Graph-based optimization methods (Olson et al., 2006; Grisetti et al., 2007; Kummerle et al., 2011) represent the poses and their constraints as nodes and edges of a graph, and apply an optimization method such as gradient descent to minimize the error. Sparse bundle adjustment (Lourakis and Argyros, 2009) involves least-squares (re-)estimation of pose parameters by minimizing the re-projection error in the image space.

GENERATION AND WEIGHTING OF 3D POINT CORRESPONDENCES
In this paper, we follow the concept of initial pairwise registration using point features extracted from the colour images. We focus on two aspects in this approach: transformation of the colour image features to the depth image for the generation of 3D point correspondences, and weighting of the 3D point pairs based on the theoretical random error of individual points.

3D point correspondences from 2D keypoints
We use SURF (Bay et al., 2008) to extract and match keypoints in successive colour images as it is considerably faster than similar algorithms. The keypoints are defined in the 2D coordinate system of the colour image. For the estimation of the pairwise registration parameters, the 2D points should be transformed to 3D space by using the depth data. We define the 3D coordinate system of the point cloud with its origin at the centre of the infrared camera, the Z axis perpendicular to the image plane, the X axis perpendicular to the Z axis in the direction of the baseline between the infrared camera centre and the laser projector, and the Y axis orthogonal to X and Z making a right-handed coordinate system.
To generate 3D correspondences from the 2D keypoints, some previous works have assumed that a shift of the depth image pixels (applied within the driver) is sufficient to align the depth image with the colour image (Engelhard et al., 2011; Endres et al., 2012; Henry et al., 2012). As we will show, there are cases where the shift between the coordinates of conjugate points in the colour image and the depth image has a large variance, even when the image coordinates are corrected for lens distortions.
A more appropriate way to transform the coordinates from the colour image to the depth image is by using the relative orientation parameters between the two cameras (three rotations and three translations, as opposed to photogrammetric relative orientation, which involves five parameters). This of course requires that the relative orientation parameters are estimated in a previous calibration procedure. For the estimation of relative orientation parameters, stereo calibration with a calibration grid has been used (Khoshelham and Elberink, 2012). This method provides relative orientation parameters but with relatively low accuracy due to the short length of the baseline between the two cameras in proportion to the distance to the calibration grid.
Another approach is to use a 3D calibration field with markers that can be measured in the depth image as well as in the colour image. By measuring the markers in the depth image, the 3D coordinates of the points are obtained in the infrared camera coordinate system. Using the 3D coordinates in the infrared frame and the corresponding 2D coordinates in the RGB frame, the transformation between the two frames can be obtained by a least-squares space resection procedure.
The estimated orientation parameters allow the transformation of 3D points to the colour image (back projection), whereas we need to transform the 2D keypoints to the 3D space. This is an ill-posed problem. To overcome that, we make use of the epipolar geometry in the following procedure:
Given a keypoint in the RGB frame:
1. calculate the epipolar line in the depth frame using the relative orientation parameters;
2. define a search band along the epipolar line using the minimum and maximum of the range of depth values (0.5 m and 5 m respectively);
For all pixels within the search band:
1. calculate 3D coordinates and re-project the resulting 3D point back to the RGB frame;
2. calculate and store the distance between the re-projected point and the original keypoint;
Return the 3D point whose re-projection has the smallest distance to the keypoint.
Note that interior orientation parameters (including lens distortion) are used in both frames to transform back and forth between pixel coordinates and image coordinates. When the distance between the keypoint and the nearest re-projected point is larger than a threshold (e.g. 2 pixels), the keypoint is flagged as not having a valid 3D correspondence. Figure 1 illustrates the procedure in a test scene.
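The search procedure can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it assumes ideal pinhole cameras without lens distortion, a depth image given as a nested list of depths in metres (0 for invalid pixels), and a relative orientation `R`, `t` that maps infrared-frame points to the RGB frame. All function and parameter names (`epipolar_search`, `band`, `max_err`, and so on) are hypothetical.

```python
import math

def project(K, X):
    """Pinhole projection of a 3D point X with intrinsics K = (f, cx, cy)."""
    f, cx, cy = K
    return (f * X[0] / X[2] + cx, f * X[1] / X[2] + cy)

def back_project(K, u, v, z):
    """3D point in the camera frame for pixel (u, v) at depth z."""
    f, cx, cy = K
    return ((u - cx) * z / f, (v - cy) * z / f, z)

def ir_to_rgb(R, t, X):
    """Assumed convention: X_rgb = R * X_ir + t."""
    return tuple(sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3))

def rgb_to_ir(R, t, X):
    """Inverse of ir_to_rgb: X_ir = R^T * (X_rgb - t)."""
    d = [X[c] - t[c] for c in range(3)]
    return tuple(sum(R[c][r] * d[c] for c in range(3)) for r in range(3))

def dist_to_segment(p, a, b):
    """Distance from 2D point p to the line segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    len2 = dx * dx + dy * dy
    s = 0.0 if len2 == 0 else ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / len2
    s = max(0.0, min(1.0, s))
    return math.hypot(p[0] - (a[0] + s * dx), p[1] - (a[1] + s * dy))

def epipolar_search(kp, depth, K_rgb, K_ir, R, t,
                    z_min=0.5, z_max=5.0, band=2.0, max_err=2.0):
    """Return the 3D point (infrared frame) matching RGB keypoint kp, or None."""
    # Epipolar segment: images of the keypoint ray at the two depth limits.
    ends = [project(K_ir, rgb_to_ir(R, t, back_project(K_rgb, kp[0], kp[1], z)))
            for z in (z_min, z_max)]
    rows, cols = len(depth), len(depth[0])
    u0 = max(0, math.floor(min(e[0] for e in ends) - band))
    u1 = min(cols - 1, math.ceil(max(e[0] for e in ends) + band))
    v0 = max(0, math.floor(min(e[1] for e in ends) - band))
    v1 = min(rows - 1, math.ceil(max(e[1] for e in ends) + band))
    best, best_err = None, float("inf")
    for v in range(v0, v1 + 1):
        for u in range(u0, u1 + 1):
            z = depth[v][u]
            if z <= 0 or dist_to_segment((u, v), ends[0], ends[1]) > band:
                continue  # invalid depth or outside the search band
            X = back_project(K_ir, u, v, z)
            pu, pv = project(K_rgb, ir_to_rgb(R, t, X))
            err = math.hypot(pu - kp[0], pv - kp[1])
            if err < best_err:
                best, best_err = X, err
    # Flag keypoints whose best re-projection is too far away.
    return best if best_err <= max_err else None
```

The defaults for `z_min`, `z_max` and `max_err` follow the values quoted in the text; the band half-width of 2 pixels is an arbitrary choice for the sketch.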

Definition of weights
Pairwise registration involves the estimation of a rotation matrix $R$ and a translation vector $t$ between two sets of corresponding points, which minimize the error:
$$ e = \sum_i w_i \left\| X_{i,j-1} - \left( R X_{i,j} + t \right) \right\|^2 $$
where $X_{i,j-1}$ and $X_{i,j}$ are the 3D coordinates of point $i$ in frames $j-1$ and $j$ respectively, and $w_i$ is the weight associated with point pair $i$. Since points in the Kinect point clouds do not have a uniform precision (Khoshelham, 2011), it is reasonable to weight the points according to their random errors.
As Kinect depth images are typically captured at a frame rate of 20 to 30 fps, resulting in small rotation and translation parameters between successive frames, we can approximate our observation equations with $v_i = X_{i,j-1} - X_{i,j}$, for which the weight can be defined as inversely proportional to the variance of the observation:
$$ w_i = \frac{k}{\sigma^2_{v_i}} = \frac{k}{\sigma^2_{X_{i,j-1}} + \sigma^2_{X_{i,j}}} $$
where $\sigma^2_X$ is the variance of point $X$ and $k$ is an arbitrary constant.
We define the weights for every pair of corresponding points based on the theoretical random error of their depth values ($Z$) only. This is because weighting based on the error of the $X$, $Y$ coordinates would reduce the contribution of points with increasing distance from the centre of the point cloud, which is counter-intuitive as off-centre points are expected to play a more important role in the correct alignment of two surfaces. It has been shown that the variance of the depth $\sigma^2_Z$ has the following relation with the variance of the measured disparity $\sigma^2_d$ (Khoshelham and Elberink, 2012):
$$ \sigma^2_Z = c_1^2 Z^4 \sigma^2_d $$
where $c_1$ is a depth calibration parameter. This gives us the following equation for the weight of a point pair:
$$ w_i = \frac{k}{c_1^2 \sigma^2_d \left( Z_{i,j-1}^4 + Z_{i,j}^4 \right)} $$
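As a small numerical illustration of the weighting, the sketch below evaluates the weight of a point pair from the two depth values, assuming the quartic relation between depth variance and disparity variance, $\sigma^2_Z = c_1^2 Z^4 \sigma^2_d$. The function names and the values chosen for `c1` and `sigma_d` are placeholders for illustration, not calibrated parameters from this paper.

```python
def depth_variance(z, c1, sigma_d):
    """Theoretical depth variance, assuming sigma_Z^2 = c1^2 * Z^4 * sigma_d^2."""
    return (c1 ** 2) * (z ** 4) * (sigma_d ** 2)

def pair_weight(z_prev, z_curr, c1, sigma_d, k=1.0):
    """Weight of a point pair: k over the summed depth variances of both points."""
    return k / (depth_variance(z_prev, c1, sigma_d)
                + depth_variance(z_curr, c1, sigma_d))

# Placeholder values (assumptions): c1 in 1/m, sigma_d in pixels.
C1, SIGMA_D = 2.85e-3, 0.5

for z in (1.0, 2.0, 4.0):
    w = pair_weight(z, z, C1, SIGMA_D)
    print(f"Z = {z:.1f} m  ->  weight = {w:.3e}")
```

Because the variance grows with the fourth power of depth, doubling the distance of both points reduces the weight of the pair by a factor of 16, so nearby correspondences dominate the estimation.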

Pairwise registration
Once the corresponding 3D points and their associated weights are obtained, the point clouds of two successive frames can be registered. The common approach, which is also used here, is to combine the least-squares estimation method with RANSAC to eliminate the outliers (Hartley and Zisserman, 2003). To speed up the registration, we use Horn's closed-form solution (Horn, 1987) to estimate the registration parameters for each random sample within RANSAC. Once the inliers are identified, a final iterative least-squares estimation using weighted inlier points is performed to obtain the registration parameters.
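The closed-form step might be sketched as follows. This is a weighted variant of Horn's quaternion solution in plain Python, given only as an illustration: the function name `weighted_horn`, the convention `A[i] ≈ R·B[i] + t`, and the shifted power iteration used to find the dominant eigenvector are assumptions of this sketch. It covers neither the RANSAC loop nor the final iterative least-squares adjustment.

```python
import math

def weighted_horn(A, B, w, iterations=1000):
    """Estimate R, t minimising sum_i w[i] * |A[i] - (R B[i] + t)|^2."""
    sw = sum(w)
    ca = [sum(wi * a[k] for wi, a in zip(w, A)) / sw for k in range(3)]
    cb = [sum(wi * b[k] for wi, b in zip(w, B)) / sw for k in range(3)]
    # Weighted cross-covariance of centred points: S[r][c] = sum w * b_r * a_c.
    S = [[sum(wi * (b[r] - cb[r]) * (a[c] - ca[c]) for wi, a, b in zip(w, A, B))
          for c in range(3)] for r in range(3)]
    (Sxx, Sxy, Sxz), (Syx, Syy, Syz), (Szx, Szy, Szz) = S
    # Horn's symmetric 4x4 matrix; its dominant eigenvector is the quaternion.
    N = [[Sxx + Syy + Szz, Syz - Szy, Szx - Sxz, Sxy - Syx],
         [Syz - Szy, Sxx - Syy - Szz, Sxy + Syx, Szx + Sxz],
         [Szx - Sxz, Sxy + Syx, -Sxx + Syy - Szz, Syz + Szy],
         [Sxy - Syx, Szx + Sxz, Syz + Szy, -Sxx - Syy + Szz]]
    # Shift the spectrum so the largest eigenvalue is also largest in magnitude,
    # then run a plain power iteration (assumes non-degenerate input and a
    # start vector that is not orthogonal to the solution).
    shift = sum(abs(x) for row in N for x in row)
    q = [1.0, 0.0, 0.0, 0.0]
    for _ in range(iterations):
        q = [sum(N[r][c] * q[c] for c in range(4)) + shift * q[r] for r in range(4)]
        norm = math.sqrt(sum(x * x for x in q))
        q = [x / norm for x in q]
    q0, qx, qy, qz = q
    R = [[q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz), 2*(qx*qz + q0*qy)],
         [2*(qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
         [2*(qz*qx - q0*qy), 2*(qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz]]
    # Translation aligns the weighted centroids after rotation.
    t = [ca[r] - sum(R[r][c] * cb[c] for c in range(3)) for r in range(3)]
    return R, t
```

In a RANSAC setting the same function could be called with unit weights on each minimal sample of three correspondences, and with the depth-based weights on the final inlier set.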

EXPERIMENTS
To show the effect of relative orientation on the transformation of keypoints from the RGB space to the depth space, we made a test scene with markers that could be measured manually in both the depth image and the colour image. The markers were captured and measured in seven pairs of images. Figure 2 shows one of the seven pairs. The coordinates of the markers were then transformed from the colour image to the depth image using the epipolar search method as described in Section 3.1.
The transformation was done using two sets of relative orientation parameters. The first set was obtained by a standard stereo calibration procedure using a calibration grid. The second set was obtained by the space resection method using a 3D calibration field similar to the scene shown in Figure 2. The discrepancies between the manually measured coordinates of the markers in the depth image and the transformed coordinates obtained by each of the two sets of relative orientation parameters provide an indication of the error in transforming the keypoints from the 2D space to the 3D space.
Figure 3(a) first shows the difference between the colour image coordinates and the depth image coordinates of the markers (both corrected for lens distortion using the model of Brown (1971)) to test whether the transformation is only a shift. Clearly, there is a large variance in the shift between the two sets of coordinates. Figure 3(b) shows the discrepancies between the measured and the transformed coordinates of the keypoints, where the relative orientation parameters from the stereo calibration are used for the transformation. Figure 3(c) shows the discrepancies between the measured and the transformed coordinates of the keypoints, where the relative orientation parameters from the space resection method are used. It can be seen in Figure 3(c) that the transformed points have a variance of about 1 pixel. This shows that transforming the points by the epipolar search method and using the relative orientation parameters from the space resection is more accurate and reliable than the other methods.
To study the effect of weighting 3D point correspondences in pairwise registration, a set of six RGB-D sequences from an office environment was acquired. Since obtaining ground truth trajectories was difficult, the sequences were acquired such that the first and the last frame of each sequence had sufficient overlap and could be registered to form a closed loop. This allowed the calculation of the closing error for each trajectory based on the following equation:
$$ \Delta = H_1^2 H_2^3 \cdots H_{n-1}^{n} H_n^1 = \begin{bmatrix} \delta R & v \\ 0^T & 1 \end{bmatrix} $$
where $H_i^j$ denotes the transformation from frame $i$ to frame $j$, $n$ is the number of frames in the sequence, and $\Delta$ is a residual transformation matrix containing a closing translation vector $v$ and a closing rotation matrix $\delta R$. From these we calculated two error metrics to evaluate the accuracy of each trajectory: a closing distance from $v$ and a closing angle as the sum of (absolute) rotation angles in $\delta R$. Figure 4 shows the closing distances and closing angles for the six sequences after the pairwise registration with and without weights. The sequences were sorted in order of increasing length, and the horizontal axes show sequence length. It can be seen that both the closing distances and closing angles are improved as a result of using weights in pairwise registrations. Table 1 shows the average closing distance and closing angle over all sequences registered with and without weights.
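The two closing-error metrics can be computed from the chain of pairwise transformations as in the sketch below. It assumes 4x4 homogeneous matrices stored as nested lists, multiplied in the order in which they are listed, and an Euler-angle decomposition of the form Rz·Ry·Rx for the rotation angles. Both conventions, and the function names, are assumptions made for illustration and may differ from those used in the experiments.

```python
import math

def matmul4(A, B):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def closing_error(transforms):
    """Return (closing distance, closing angle in degrees) for a closed loop.

    transforms: pairwise 4x4 homogeneous matrices around the loop, including
    the one that registers the last frame back to the first.
    """
    delta = [[float(r == c) for c in range(4)] for r in range(4)]
    for H in transforms:
        delta = matmul4(delta, H)
    # Closing distance: length of the residual translation vector v.
    distance = math.sqrt(sum(delta[r][3] ** 2 for r in range(3)))
    # Closing angle: sum of absolute Euler angles of the residual rotation,
    # assuming the decomposition R = Rz(kappa) * Ry(phi) * Rx(omega).
    omega = math.atan2(delta[2][1], delta[2][2])
    phi = -math.asin(max(-1.0, min(1.0, delta[2][0])))
    kappa = math.atan2(delta[1][0], delta[0][0])
    angle = abs(omega) + abs(phi) + abs(kappa)
    return distance, math.degrees(angle)
```

For a perfectly consistent loop the product is the identity matrix, so both metrics are zero; any accumulated drift shows up as a non-zero distance or angle.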

CONCLUSIONS
When registering long RGB-D sequences, pairwise registration errors accumulate and lead to inaccurate and deformed point clouds, particularly in situations where loop closing is not feasible. We showed that accurate transformation of keypoints from the RGB space to the depth space using an epipolar search method results in more accurate 3D point correspondences. We also showed that assigning weights based on the theoretical random error of the depth measurements improves the accuracy of pairwise registration and sensor pose estimates along the trajectory.
Using weighted observations in pairwise registration allows the estimation of covariance matrices for the estimated pose vectors. These can be used to weight pose vectors in the global adjustment, and further improve the sensor pose estimates in a closed loop.
A drawback of registration using visual features is the influence of synchronization errors between the RGB camera shutter and the IR camera shutter on the transformation of keypoints to the 3D space. This emphasises the importance of a fine registration step using point and plane correspondences extracted from the depth images to generate accurate point clouds from RGB-D data.
Taguchi et al. (2012) combined points and planes for the registration of RGB-D data. Dou et al. (2013) combined planes with visual features in both pairwise registration and global adjustment. A comparison of RANSAC and Hough transform for plane extraction and mapping using RGB-D data is presented by Nasir et al. (2012).

Figure 1. Finding 3D points in the depth image (right) corresponding to 2D keypoints in the colour image (left) by searching along epipolar lines (red bands).

Figure 2. Manually measured markers in the disparity (left) and colour image (right).

Figure 3. Discrepancies between the manually measured and transformed coordinates of the markers using only a shift (a), using parameters from stereo calibration (b) and using parameters from space resection (c).

Figure 4. Closing distances and closing angles for the six sequences, registered with and without weights.
For one of the sequences, Figure 5 compares the trajectory obtained by weighted registration (blue curve) with that obtained by registration without weights (red curve). The black curve is the closed loop obtained by a global adjustment of the sensor poses. It can be seen that the trajectory from the weighted registration follows the globally adjusted trajectory more closely. Example point clouds of an office environment obtained by the weighted registration of RGB-D sequences are shown in Figure 6.

Figure 5. Trajectory obtained by weighted registration of an RGB-D sequence (in blue) compared with the trajectory obtained by registration without weights (in red) and one obtained by global adjustment (in black).

Figure 6. Example point clouds of an office environment obtained by weighted registration of RGB-D sequences.

Table 1. Average closing errors for registrations with and without weights.