DEEPLIO: DEEP LIDAR INERTIAL SENSOR FUSION FOR ODOMETRY ESTIMATION

Having a good estimate of the position and orientation of a mobile agent is essential for many application domains such as robotics, autonomous driving, and virtual and augmented reality. In particular, when LiDAR and IMU sensors are used as inputs, most existing methods still rely on classical filter-based fusion to achieve this task. In this work, we propose DeepLIO, a modular, end-to-end learning-based fusion framework for odometry estimation using LiDAR and IMU sensors. Our network learns an appropriate fusion function by considering the different modalities of its input latent feature vectors. We also formulate a loss function that combines both global and local pose information over an input sequence to improve the accuracy of the network predictions. Furthermore, we design three sub-networks with different modules and architectures derived from DeepLIO to analyze the effect of each sensory input on the task of odometry estimation. Experiments on the KITTI benchmark dataset demonstrate that DeepLIO outperforms existing learning-based and model-based methods in orientation estimation and shows only a marginal difference in position accuracy.


INTRODUCTION
Odometry estimation, i.e. estimating the 3D position and orientation of an agent through time and space using sensory information such as cameras, light detection and ranging sensors (LiDARs), and inertial measurement units (IMUs), is a highly active field of research in computer vision with a wide range of application domains such as autonomous driving (Geiger et al., 2012), robotics (Cadena et al., 2016), building information modeling (BIM) (Roca et al., 2014), and virtual and augmented reality (Klein and Murray, 2007). In one of the first papers on visual odometry (VO) (Nistér et al., 2004), the authors solve the VO task by extracting image features (i.e. Harris corners) and tracking them frame by frame over time. The main issues of VO, however, are scale drift due to scale ambiguity, especially when monocular cameras are used, and unreliable predictions when images lack sufficient features.
To tackle these problems, many researchers have fused VO predictions with other sensory information such as IMU measurements (Nützi et al., 2011, Chu et al., 2012), e.g. by applying an Extended Kalman Filter (EKF).
In the case of rangefinder sensors such as LiDARs, mainly Iterative Closest Point (ICP) based algorithms (Segal et al., 2010, Besl and McKay, 1992) are used to match consecutive frames by minimizing the distance between corresponding points in two point clouds. Here, too, an EKF is used in a tightly (Ye et al., 2019) or loosely (Tang et al., 2015) coupled fashion to fuse the ICP results with the IMU measurements. Unfortunately, most ICP algorithms suffer from 1) wrong correspondences, 2) inaccurate initialization, 3) the lack of a correct covariance matrix, and 4) high computational cost.
Recently, some efforts have been made to apply learning-based methods, i.e. neural networks, to LiDAR measurements to estimate the pose of a mobile agent over time (Velas et al., 2018). Analogously to these works, we propose a supervised, end-to-end trainable, learning-based fusion method, which utilizes neural networks to extract relevant features from the multi-modal LiDAR and IMU measurements and fuses these latent features to regress the motion encoded in them. To the best of our knowledge, this work is the first comprehensive study of a supervised learning-based fusion method for odometry estimation that uses both LiDAR and IMU measurements. In particular, our contributions can be summarized as follows:
• Introducing a novel learning-based fusion architecture for 6D pose estimation, i.e. 3D translation and 3D rotation, using LiDAR point clouds and IMU measurements.
• Introducing a local and global windowed loss function.

RELATED WORKS
Many research efforts have focused on multi-sensor odometry over the last decades. In this section, we give an overview of model-based and deep-learning-based methods applied to this problem.

Model-based Lidar-(Inertial) Odometry Estimation
Most of the proposed algorithms for LiDAR-based pose estimation on range data are based on or related to ICP (Besl and McKay, 1992). Since its introduction, many variants have been suggested to improve the robustness and convergence of the algorithm. In (Segal et al., 2010) the authors formulate ICP as a general probabilistic framework (GICP), which, depending on the parameterization, reduces to a point-to-point, point-to-plane, or plane-to-plane algorithm. The method proposed in (Serafin and Grisetti, 2015) uses the normal and curvature information of a local surface estimated around a point to find correspondences in an integral-image-based fashion. Furthermore, the normal information combined with the Cartesian coordinates of the corresponding points is used to estimate the transformation by minimizing a least-squares loss. A comparison to other methods showed that NICP registration offers better results and is more robust against poor initial guesses. To improve ICP-based pose estimation, the authors of (Xue et al., 2019) introduce a loosely coupled Extended-Kalman-Filter-based IMU-ICP fusion, where the IMU measurements are used at different processing stages. According to their experiments, the best odometry estimation result is achieved using the IMU together with the Lidar Odometry and Mapping (LOAM) method (Zhang and Singh, 2014), a feature-based point cloud registration approach built on a novel strategy for extracting edge and surface features.

Learning-based Lidar-(Inertial) Odometry Estimation
The sparsity and unstructured nature of 3D point clouds make the odometry estimation problem more difficult to solve. Hence, reliable and accurate learning-based 6DoF pose estimation on point clouds is still an open problem. In (Nicolai et al., 2016) the authors introduce one of the first deep-learning-based odometry methods on point clouds by projecting the point clouds to an image before feeding them into their CNN for pose estimation. However, the presented results could not compete with model-based ICP matching methods. Another CNN-based architecture, proposed in (Velas et al., 2018), also uses an image representation of the input point cloud consisting of three channels (depth, height, intensity). Even though the translation estimation results, formulated as a regression task, were promising, the orientation estimation results were rather poor, regardless of whether orientation was formulated as a regression or a classification task. In contrast to these approaches and inspired by (Yang et al., 2018), (Li et al., 2019) introduces LO-Net, a supervised, end-to-end trainable 6DoF pose estimation method based on two consecutive LiDAR frames. Therein, odometry is formulated as a 7D regression problem with a 3D translation vector and a 4D quaternion. Here, too, the sparse and irregular point clouds are first projected into 2D matrices by cylindrical projection. Both a supervised and an unsupervised method for the odometry task between two consecutive LiDAR frames are introduced in (Cho et al., 2019). The network consists of two separate ResNet-like (He et al., 2016) branches, which are trained on 2D matrix projections of the 3D Cartesian coordinates of the points (vertices) and their normal vectors. Depending on the selected loss function, the network can be trained in supervised or unsupervised mode.

Spherical Projection
A point cloud captured by a LiDAR is an unordered set of 3D points p_i = (x, y, z). To be able to apply CNN techniques to the point clouds, we encode them into a 2D image. To this end, we use the spherical projection (Milioto et al., 2019), defining the mapping function Π : R^3 → R^2 as

u = 1/2 [1 − arctan2(y, x) / π] W,    v = [1 − (arcsin(z / r) + f_up) / f] H,

where r = ‖p_i‖_2 is the range of the point and f = f_up + f_down is the vertical field of view of the sensor. In most scenarios the mobile agent is moving while the sensor captures a frame. This results in skewed scans, where there is no direct one-to-one mapping from a pixel coordinate (u, v)_i to a point p_i, e.g. multiple points can fall into the same image pixel. For this reason, we sort the point cloud in descending range order, so that the point closest to the LiDAR sensor is the one assigned to each pixel, following (Li et al., 2019, Milioto et al., 2019). The resulting image I ∈ R^{M×N×C} can be augmented with multiple channels C. In this work we set C = (x, y, z), see fig. 1.
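The projection above can be sketched in a few lines of numpy. The field-of-view angles below are illustrative values typical for a Velodyne HDL-64e, not parameters taken from this work:

```python
import numpy as np

def spherical_projection(points, H=64, W=720, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) point cloud onto an H x W vertex image.

    fov_up/fov_down (degrees) are illustrative sensor angles.
    """
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up - fov_down

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)

    # azimuth / elevation -> pixel coordinates
    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * W
    v = (1.0 - (np.arcsin(z / r) - fov_down) / fov) * H
    u = np.clip(np.floor(u), 0, W - 1).astype(int)
    v = np.clip(np.floor(v), 0, H - 1).astype(int)

    # write points in descending range order so the closest point wins
    # when several points fall into the same pixel
    order = np.argsort(r)[::-1]
    img = np.zeros((H, W, 3), dtype=np.float64)
    img[v[order], u[order]] = points[order]
    return img
```

Note how the descending-range sort implements the tie-breaking rule described above: later (closer) writes overwrite earlier (farther) ones.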

Normal Estimation
As shown in (Serafin and Grisetti, 2015), normal vectors are strong features for point cloud registration. A commonly used strategy to calculate the normal vector n_i at a point p_i is to sample some points in its neighborhood N_i = {p_{i,1}, . . . , p_{i,n}}, build the covariance matrix, and calculate the normal vector based on the eigendecomposition. Corresponding to the formulation in (Li et al., 2019), given a point p_i = [x, y, z]_i and its N neighbors p_{i,j}, j = 1, 2, . . . , N, we can estimate the normal N(p_i) := n_i ∈ R^3 by minimizing the associated least-squares objective. Instead of directly solving the objective defined in eq. (4), we use a simplified normal estimation method as in (Moosmann, 2013), where the normal at each point of an encoded point cloud is calculated by computing the weighted cross products over p_i's four neighbors,

n_i = Σ_{j=1}^{4} ω_{i,j} (Δp_{i,j} × Δp_{i,j+1}),    Δp_{i,j} = p_{i,j} − p_i.

To make sure that faraway neighboring points, e.g. around the borders of an object, do not distort the normal estimation, each distance vector Δp_{i,j} is weighted by ω_{i,j} = φ(Δp_{i,j}). The weighting function φ(x) : R^3 → R is defined by

φ(x) = e^{−α ‖x‖_2},

where the weighting is controlled by the parameter α. Figure 1 shows the entire encoding pipeline, where a point cloud is first encoded as an image I(u, v) with three channels C = [x, y, z], and subsequently the normal vectors are estimated using the x, y, z information at each pixel of this image.
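A minimal, unvectorized sketch of this weighted cross-product scheme, operating on the vertex image of the previous section; the neighbor ordering and the exact way α enters the weight are illustrative choices, not taken verbatim from the paper:

```python
import numpy as np

def normal_image(vertex_img, alpha=0.8):
    """Estimate per-pixel normals of an (H, W, 3) vertex image from
    weighted cross products over each pixel's four grid neighbors.
    Boundary pixels are left with zero normals."""
    H, W, _ = vertex_img.shape
    normals = np.zeros_like(vertex_img)
    # neighbor offsets (dv, du) in circular order: right, up, left, down
    offsets = [(0, 1), (-1, 0), (0, -1), (1, 0)]
    for v in range(1, H - 1):
        for u in range(1, W - 1):
            p = vertex_img[v, u]
            n = np.zeros(3)
            for k in range(4):
                dv1, du1 = offsets[k]
                dv2, du2 = offsets[(k + 1) % 4]
                d1 = vertex_img[v + dv1, u + du1] - p
                d2 = vertex_img[v + dv2, u + du2] - p
                # down-weight distant neighbors, e.g. across object borders
                w = np.exp(-alpha * (np.linalg.norm(d1) + np.linalg.norm(d2)))
                n += w * np.cross(d1, d2)
            norm = np.linalg.norm(n)
            if norm > 1e-8:
                normals[v, u] = n / norm
    return normals
```

On a planar patch all four cross products agree, so the weighted sum simply reinforces the common direction; across a depth discontinuity the exponential weight suppresses the distorted term.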

DEEPLIO ARCHITECTURE
We introduce a cascaded modular network architecture that learns a robust fusion of LiDAR frames and IMU measurements for the task of relative pose estimation between a pair of consecutive LiDAR frames Ft−1 and Ft. The input to the network is a sequence of consecutive LiDAR frames and the corresponding IMU measurements in the same time interval. In essence, our network learns to extract appropriate features from both sensory inputs and subsequently learns a sophisticated fusion strategy for mapping from the combined LiDAR and IMU feature space to the space of se(3),

f_θ : R^{H×W×2C} × R^{S_imu×6} → se(3),

where (H, W, C) are the height, width, and number of channels of the projected LiDAR frames and S_imu is the sequence length of the IMU data measured between two consecutive LiDAR frames.
As shown in fig. 2, the network consists of three main modules.
1. Feature-Nets: This module itself consists of the following two sub-modules for feature extraction.

LiDAR Feature Network
Similar to (Cho et al., 2019), we also construct the LiDAR-Feature-Nets as a siamese network, with the two branches referred to as VertexNet (top branch) and NormalNet (bottom branch). In contrast to (Cho et al., 2019), however, we use a more lightweight network based on the fire and squeeze-and-excitation modules introduced in (Iandola et al., 2017, Hu et al., 2018).

IMU Feature Network
Usually, IMUs operate at a much higher frequency than LiDARs, which results in a sequence of IMU measurements (S_imu) during a full LiDAR spin. The IMU sub-network takes this sequence of raw IMU measurements m_(t−1,t) = [α, ω] ∈ R^{S_imu×6}, sampled between two consecutive frames, as input and learns the dynamics encoded in this sequence for the task of pose estimation. For this reason, we experiment with three network configurations, one made of fully-connected layers and the other two constructed from LSTM and GRU modules (Hochreiter and Schmidhuber, 1997, Cho et al., 2014).
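To illustrate why a recurrent module suits this input, the following is a minimal hand-rolled GRU cell run over an IMU sequence; the hidden size and the random weights are placeholders for illustration, not the trained network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyGRU:
    """Minimal GRU cell consuming an (S_imu, 6) sequence of [accel, gyro]
    samples; the last hidden state serves as the IMU feature vector."""

    def __init__(self, in_dim=6, hid_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hid_dim)
        # update gate z, reset gate r, candidate state h~
        self.Wz, self.Uz = rng.normal(0, s, (hid_dim, in_dim)), rng.normal(0, s, (hid_dim, hid_dim))
        self.Wr, self.Ur = rng.normal(0, s, (hid_dim, in_dim)), rng.normal(0, s, (hid_dim, hid_dim))
        self.Wh, self.Uh = rng.normal(0, s, (hid_dim, in_dim)), rng.normal(0, s, (hid_dim, hid_dim))
        self.hid_dim = hid_dim

    def __call__(self, seq):
        h = np.zeros(self.hid_dim)
        for x in seq:  # one IMU sample per recurrence step
            z = sigmoid(self.Wz @ x + self.Uz @ h)
            r = sigmoid(self.Wr @ x + self.Ur @ h)
            h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h))
            h = (1 - z) * h + z * h_cand
        return h
```

Because the hidden state is threaded through every IMU sample, the final feature summarizes the motion dynamics of the whole inter-frame interval, which a single fully-connected layer over the flattened sequence cannot do as naturally.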

Fusion Network
Concatenating different features is a standard fusion strategy, with the drawback that the different modalities and noise characteristics are not considered. For instance, concatenating feature vectors of multiple redundant sensory inputs without considering their reliability in certain environments may result in a sub-optimal and non-robust solution. To remedy this issue, we introduce an attention-based soft fusion (Vaswani et al., 2017),

s_L = σ(W_L [f_L, f_I]),    s_I = σ(W_I [f_L, f_I]),

where [., .] denotes the concatenation operation, σ is the sigmoid function, and W_{[L,I]} are the corresponding weights. s_L and s_I form the weights corresponding to each element of the combined LiDAR and IMU feature vector. Finally, we multiply these weights with their corresponding feature vectors for an automatic recalibration and feature selection,

f_f(f_L, f_I; θ_f) = [s_L ⊙ f_L, s_I ⊙ f_I],

where f_f characterizes the fusion function with all its parameters accommodated in θ_f.
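The described gating can be sketched as follows; `W_l` and `W_i` stand in for the learnable weight matrices W_L and W_I, which are trained jointly with the rest of the network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_fusion(f_lidar, f_imu, W_l, W_i):
    """Soft (attention-style) fusion sketch: sigmoid gates derived from
    the concatenated features reweight each element of both feature
    vectors before they are passed to the odometry network."""
    f = np.concatenate([f_lidar, f_imu])  # [f_L ; f_I]
    s_l = sigmoid(W_l @ f)                # per-element LiDAR gates in (0, 1)
    s_i = sigmoid(W_i @ f)                # per-element IMU gates in (0, 1)
    return np.concatenate([s_l * f_lidar, s_i * f_imu])
```

Because each gate depends on both feature vectors, the layer can learn to suppress one modality when the other indicates it is unreliable, rather than treating all channels equally as plain concatenation does.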

Odometry Network
The input to this module is a sequence of fused LiDAR and IMU features. Hence the network should be constructed so that it can learn reasonable temporal correlations encoded in its input sequence. For this reason, we construct this sub-network using LSTM-modules (Hochreiter and Schmidhuber, 1997).

LOSS FUNCTIONS
So far, we have defined the inputs, outputs, and network architecture of DeepLIO. In this section, we introduce the structure of the loss function used to train the network.

Global and Local Loss
Learning to predict a robust estimate of two different quantities, namely rotation and translation, is a challenging task. Even if we assume that the network can predict the relative motion between two frames reasonably well with only a small deviation, the propagation of uncertainty means that accumulating these local predictions still results in an inaccurate global trajectory estimate. Therefore, the loss function should be designed so that it forces the network to learn reasonable features for robust local and global pose estimation.
Given a sequence S of pairwise stacked and projected LiDAR frames l_i ∈ R^{H×W×2C} and their corresponding IMU measurements m_i ∈ R^{S_imu×6} from a dataset, the input X and the ground truth poses Y are defined by

X = {x_i = (l_i, m_i)},    Y = {Y^l, Y^g},

where Y^l is the set of local relative motions between consecutive frames and Y^g is the set of global motions between the first frame and every other frame in the sequence.
Furthermore, let us denote the network's local motion prediction at timestamp i as (t̂_i, ω̂_i) = f_θ(x_i) ∈ se(3), where f_θ is the neural network mapping function with θ being the state of all parameters learned during training. The set of all local motion predictions for a given input sequence S is Ŷ^l = {ŷ^l_i}. Based on these definitions, and w.l.o.g., we can convert these local motion estimates into global estimates with respect to the first frame in the sequence. Building on that, we define the set of global motion predictions as Ŷ^g = {ŷ^g_i = g(ŷ^l_1, . . . , ŷ^l_i)}, where ŷ^g_i = (p̂_i, q̂_i) are the corresponding global position and orientation estimates at timestamp i, calculated using the function g, which transforms a set of local motions into a global motion with respect to the first element of the input set. Endowed with this information, we can define the local loss L^l(ŷ^l_i, y^l_i) and the global loss L^g(ŷ^g_i, y^g_i) as two functions that measure the discrepancy between the predicted motion and the ground truth. Figure 3 demonstrates the relation between the network predictions, the global and local loss functions, and a sequence of input data.
An important aspect of designing the loss function is the representation of the output quantities the network has to predict. While a position displacement between two frames can be expressed in Euclidean space without any effort, representing rotational movement is not as straightforward: a rotation can be represented, among other forms, as Euler angles, a rotation matrix, or a quaternion, and each of these has its advantages and disadvantages. For instance, Euler angles need only three parameters, but they do not provide a unique parameterization and suffer from gimbal lock; rotation matrices are unique, but they are over-parameterized and the orthogonality constraint hampers the optimization.
Regarding the number of parameters, quaternions are a compromise, but they need to be normalized to unit length.
For these reasons, and under the assumption that the relative motion between two frames is small, we represent the local frame-to-frame motion in the space of se(3), since rotations are not constrained in this representation. The global trajectory, on the other hand, is represented by quaternions q ∈ H for the reasons above. Hence, the local and global loss functions are defined by

L^l(ŷ^l_i, y^l_i) = ‖t̂_i − t_i‖_γ + ‖ω̂_i − ω_i‖_γ,
L^g(ŷ^g_i, y^g_i) = ‖p̂_i − p_i‖_γ + ‖q̂_i − q_i‖_γ,

where ‖·‖_γ denotes the distance norm. In this work we set γ = 2 for all loss functions.
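The conversion function g that chains local motions into global poses can be sketched as follows, assuming the axis-angle part of each se(3) prediction is mapped to a rotation matrix via the exponential map (Rodrigues' formula):

```python
import numpy as np

def exp_so3(w):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-10:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def locals_to_globals(local_motions):
    """Chain local (t_i, w_i) frame-to-frame motions into global poses
    relative to the first frame (a sketch of the function g above)."""
    R_g, t_g = np.eye(3), np.zeros(3)
    globals_ = []
    for t_i, w_i in local_motions:
        t_g = t_g + R_g @ t_i      # translate in the current global frame
        R_g = R_g @ exp_so3(w_i)   # then accumulate the rotation
        globals_.append((t_g.copy(), R_g.copy()))
    return globals_
```

This accumulation is also what makes the global loss informative: a small bias in every local rotation, invisible to the local loss, turns into a steadily growing global position error.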

Homoscedastic Weighted Sum Loss
Learning both translation and rotation simultaneously is challenging due to the unit and scale differences between these quantities. Therefore, the loss function needs a sophisticated balancing strategy between both tasks. In (Kendall et al., 2015) the authors suggest a linearly weighted sum loss, with the weighting as a hyperparameter. Since an optimal balancing value is not known in advance, this requires a sophisticated hyperparameter tuning strategy.
In contrast to the linearly weighted loss function, and inspired by prior work on uncertainty-based multi-task weighting, we define the loss function as a homoscedastic weighted sum loss. Homoscedastic uncertainty is one of the two subcategories of aleatoric uncertainty; it stays constant for different inputs but varies between different tasks. One major advantage of this loss is that all hyperparameters used for balancing the tasks become learnable, so they can be optimized during training.
Assuming a Gaussian likelihood, we can model the network output for a single local pose prediction as

p(y_i | f_θ(x_i)) = N(ŷ_i, σ²),

where ŷ_i is the network pose estimate and N is a Gaussian distribution with mean ŷ_i and variance σ². From now on we drop the subscript i for better readability.
Given the network outputs, the position p and orientation q are conditionally independent; hence the likelihood factorizes as

p(p, q | f_θ(x)) = N(p̂, σ_p²) · N(q̂, σ_q²).

In maximum likelihood estimation (MLE), rather than maximizing the likelihood function itself, we maximize its log; since the log is a monotonic function, the parameters that maximize a function also maximize its log. The log of a Gaussian distribution is, up to an additive constant,

log N(y; ŷ, σ²) ∝ − ‖y − ŷ‖² / (2σ²) − log σ.

We can now formulate the loss function L(θ, σ_p, σ_q) as a function of the network parameters θ and (σ_p, σ_q),

L(θ, σ_p, σ_q) = ‖p − p̂‖² / (2σ_p²) + log σ_p + ‖q − q̂‖² / (2σ_q²) + log σ_q,

where the constant terms are neglected since they do not change the optimization result. Note that this loss function now also depends on the weighting hyperparameters (σ_p, σ_q), so we do not need to set them manually. Furthermore, these hyperparameters now represent the model's observation noise, i.e. they capture how much noise is present in the outputs. For numerical stability and to avoid division by zero, the log variance s := log σ² is learned instead.
By extending this method and combining both the local and global loss functions, the final homoscedastic weighted sum loss (HWS-Loss) is defined by

L_HWS = e^{−s1} (L^l_t + L^g_p) + s1 + e^{−s2} (L^l_ω + L^g_q) + s2,

where (s1, s2) are the learned observation uncertainties corresponding to the position and orientation predictions.
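A minimal sketch of the resulting weighting, assuming the log-variance parameterization above; the scalar `loss_pos` and `loss_rot` stand in for the combined local and global position and orientation terms:

```python
import numpy as np

def hws_loss(loss_pos, loss_rot, s1, s2):
    """Homoscedastic weighted sum sketch: the log-variances s1, s2 are
    learned jointly with the network. exp(-s) down-weights noisy tasks,
    while the additive +s terms keep the variances from growing without
    bound (which would otherwise zero out the loss)."""
    return np.exp(-s1) * loss_pos + s1 + np.exp(-s2) * loss_rot + s2
```

In training, gradients flow into s1 and s2 like into any other parameter, so the balance between translation and rotation is found automatically instead of being tuned by hand.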

KITTI Odometry Dataset
In this work we use the KITTI odometry dataset (Geiger et al., 2013) to train and test DeepLIO. The dataset consists of 22 sequences, of which 11 are provided with ground truth for training and testing. Each sequence is a trajectory captured by driving a sensor-equipped car through Karlsruhe, Germany. The LiDAR frames are captured by a Velodyne HDL-64E with 64 beams and a sampling frequency of 10 Hz. An IMU with a sampling frequency of 100 Hz and a GPS module are used to compute the ground truth poses.

Implementation details
Using the aforementioned spherical projection, we encode each input point cloud into an image matrix with W = 720 and H = 64 pixels; furthermore, we set α = 0.8 for the normal image calculation. During training, we build a sequence of five paired consecutive frames and their corresponding IMU measurements. Furthermore, to assess the capability of the proposed DeepLIO network to improve the odometry results by fusing both partly redundant sensor modalities, we evaluate and compare the accuracy of three different sub-network architectures, which we denote as:
• DeepIO: This network uses only IMU measurements to predict the resulting odometry.
• DeepLO: This network uses only LiDAR images and normals to estimate the odometry.
• DeepLIO: this network uses both LiDAR as well as IMU measurements and fuses them to estimate the odometry.
To facilitate comparison with other learning-based odometry methods, we use sequences 00-08 for training and 09-10 for testing. We evaluate the results using the metric defined by the KITTI dataset, i.e. the average translational RMSE t_rel (%) and rotational RMSE r_rel (°/100 m) over distances of 100 m to 800 m. We employed the Adam solver (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.99, and a weight decay of 10^-5 for all aforementioned network configurations. We set the batch size and the sequence length to B = 8 and S = 5. We implemented the entire framework using PyTorch (Paszke et al., 2019).

Evaluating DeepIO
By deploying the deep inertial odometry network (DeepIO), we aim to investigate the ability of different IMU-only networks to extract the hidden information about the dynamic motion of the mobile agent from the IMU measurements and to predict the current pose of the agent w.r.t. its previous pose. To this end, we designed three different network configurations. The first one is built only from fully-connected layers, where we experimented with different hidden-layer settings. The other two networks are built from LSTM and GRU units. As the evaluation results in table 1 show, both recurrent networks (LSTM, GRU) outperformed the fully-connected network, which confirms our assumption that these modules are better at learning the dynamics hidden in their inputs.

Evaluating DeepLO
Consistent with the last section, we next evaluated the deep lidar odometry network (DeepLO). For this, we configured the DeepLIO framework such that only the LiDAR feature networks are activated and used. Furthermore, we deployed two architectures, one based on FlowNet (Fischer et al., 2015) and the other based on PointSeg, for both siamese Vertex- and Normal-Net feature encoders. As the results in table 2 show, the network based on PointSeg is able to learn and extract relevant features from the LiDAR images for odometry estimation, whereas the network based on FlowNet does not converge to a good solution.

Evaluating DeepLIO
In the last sections, we presented and analyzed different network combinations to investigate the effect of each sensory input and to measure their predictive strength. We now fuse the strengths of each of these networks into one unified network using the proposed fusion framework. In this section we also evaluate and compare DeepLIO with other ICP-based, deep-learning-based, and optimization-based methods. We reduced the training time by means of transfer learning, based on the already pre-trained sub-networks introduced in the last two sections. Table 3 shows that, compared to both ICP-based methods, point-to-point and point-to-plane, DeepLIO achieves better results across all sequences. On the other hand, regarding position estimation, LOAM and LO-Net achieve slightly better results throughout all sequences. Regarding orientation estimation, however, DeepLIO outperforms all other methods. Furthermore, comparing the results in table 3 with table 1 and table 2, we can see that the DeepLIO network was able to learn a sophisticated mapping to fuse both feature streams, improving its estimation capabilities compared to the individual sub-networks. The estimated global trajectories in fig. 4 validate these results.

CONCLUSION AND OUTLOOK
In this paper, we proposed DeepLIO, a novel modular deep-learning-based fusion framework for robust odometry estimation using LiDAR and IMU sensors. Furthermore, we introduced a loss function based on homoscedastic uncertainty, which considers both global and local motion. In contrast to most other deep-learning fusion methods, where two state vectors are simply concatenated, we proposed a self-adaptive, attention-based fusion strategy, so that the fusion layer learns during training to weight each input based on its modality. We also showed that our method, evaluated on the KITTI odometry dataset, outperforms both learning-based and model-based methods regarding orientation estimation and shows only minor differences regarding position estimation. One reason for this may be the considerable amount of information lost during the quantization of the point clouds. In a further extension of this work, we will therefore investigate the effect of using raw point clouds and scene flow estimation on the accuracy of odometry estimation.