BUNDLE ADJUSTMENT FOR MULTI-CAMERA SYSTEMS WITH POINTS AT INFINITY

We present a novel approach to rigorous bundle adjustment for omnidirectional and multi-view cameras, which enables an efficient maximum-likelihood estimation with image and scene points at infinity. Multi-camera systems are used to increase the resolution, to combine cameras with different spectral sensitivities (Z/I DMC, Vexcel Ultracam) or – like omnidirectional cameras – to augment the effective aperture angle (Blom Pictometry, Rollei Panoscan Mark III). Additionally, multi-camera systems gain in importance for the acquisition of complex 3D structures. For stabilizing camera orientations – especially rotations – one should generally use points at the horizon, observed over long periods of time, within the bundle adjustment; classical bundle adjustment programs are not capable of this. We use a minimal representation of homogeneous coordinates for image and scene points. Instead of eliminating the scale factor of the homogeneous vectors by Euclidean normalization, we normalize the homogeneous coordinates spherically. This way we can use images of omnidirectional cameras with a single viewpoint, such as fisheye cameras, and scene points which are far away or at infinity. We demonstrate the feasibility and the potential of our approach on real data taken with a single camera, the stereo camera FinePix Real 3D W3 from Fujifilm and the multi-camera system Ladybug 3 from Point Grey.


INTRODUCTION
The presented work is part of the visual odometry package within the DFG project Mapping on Demand (MoD) at the University of Bonn and the Technical University of Munich, in which we use a lightweight autonomously navigating unmanned aerial vehicle (UAV). The goal of the project is the development and testing of procedures and algorithms for the fast three-dimensional semantic mapping of inaccessible objects.
The on-board sensing of lightweight UAVs has to be designed with regard to their limitations in size and weight, and limited on-board processing power requires highly efficient algorithms. In this project we use a quadrocopter equipped with a GPS unit, an IMU, ultrasonic sensors, a 3D laser scanner, a high resolution camera and four fisheye cameras, which are mounted as two stereo pairs, one looking ahead and one looking backwards, providing a large field of view, see Fig. 1. The two stereo cameras are used (a) besides the ultrasonic sensors and the laser scanner for obstacle perception in the environment for autonomous navigation and (b) besides the GPS unit and IMU for ego-motion estimation. The goal is to use the on-board processed ego-motion as an initial estimate for the orientation of the images of the high resolution camera, which takes images at about 1 Hz, for near real-time semantic surface reconstruction on a ground station.
The four cameras with Lensagon BF2M15520 fisheye lenses with a field angle of up to 185° capture four image sequences at a frame rate of 14 Hz in a synchronized way. The baseline between the cameras amounts to 20 cm, providing highly overlapping views at each time of exposure, see Fig. 2. The monochromatic images have a resolution of 752×480 pixels. In this paper we treat the issue of visual odometry for real-time ego-motion estimation using the synchronized images of the multi-camera system in a keyframe-based fast incremental bundle adjustment.
Bundle adjustment is the work horse for orienting cameras and determining 3D points. It has a number of favorable properties, e.g. it is statistically optimal in case all statistical tools are exploited, and it is highly efficient in case sparse matrix operations are used and variable reordering is done to prevent unnecessary fill-in. Our previous work yielded a rigorous batch bundle adjustment for omnidirectional and multi-view cameras for an efficient maximum-likelihood estimation with scene points at infinity, called BACS (bundle adjustment for camera systems), see (Schneider et al., 2012). Multi-camera systems can be used like omnidirectional cameras to augment the effective aperture angle. Far or even ideal points, i.e. points at infinity, e.g. points at the horizon, have proven to be effective in stabilizing the orientation of cameras, especially their rotations.

Figure 1: Illustration of the UAV, one stereo pair is looking forward and one backwards, providing a wide field of view.
We want to use the power of bundle adjustment but with a high performance, enabling a real-time keyframe-based visual SLAM application. Keyframe-based methods select only a small number of past frames on which to compute a global bundle adjustment. Incremental bundle adjustment avoids periodical batch steps with recurring calculations by performing only calculations for entries of the information matrix, i.e. the normal equation matrix or inverse covariance matrix, that are actually affected by new measurements. Kaess et al. (2012) provide a sparse nonlinear incremental optimization algorithm called iSAM2, which is highly efficient, as only variables that have not converged yet are relinearized and fill-in is avoided through incrementally changing the variable ordering.
In this paper we focus on our concept for visual odometry with a multi-view camera system consisting of omnidirectional cameras, using the iSAM2 algorithm for a keyframe-based incremental real-time bundle adjustment.
The paper is organized as follows. In the next section we present our prototype, realizing the image processing for data acquisition and reliable data association and the robust orientation of a set of frames taken in a synchronized way. Further, we show how the sparse non-linear incremental optimization algorithm iSAM2 can eliminate the need for periodic batch bundle adjustment steps for mapping and motion estimation on sets of keyframes. We follow up in section 3 with first experiments regarding the achieved timings and observed properties of the algorithm used. Finally we conclude and give an outlook on our future work in section 4.

Related Work. Real-time bundle adjustment has been tackled intensively in the photogrammetric community, see e.g. the review by Grün (1987). It has recently attracted researchers in videometry for real-time visual odometry. Basic techniques for achieving real-time capabilities are Kalman filtering (Davison, 2003; Choi and Lee, 2012), sliding window bundle adjustment (Engels et al., 2006) and incremental bundle adjustment (Mouragnon et al., 2009), the last two techniques aiming at higher stability in determining the pose parameters. It is useful to exploit the sparsity during a sliding window bundle adjustment, cf. (Sünderhauf et al., 2005; Klein and Murray, 2007). Recently, filtering techniques based on bundle adjustment have intensively been investigated. They use current image information to improve the past pose and map information. Strasdat et al. (2012) show that filtering all frames is inferior to using keyframes and that a high number of features is superior to a high number of frames. Incrementally updating the normal equations can be replaced by updating the QR factorization, described in detail in (Golub and Van Loan, 1996) and e.g. proposed for aerial on-line triangulation (Grün, 1984). It has been realized for generic incremental maximum a posteriori estimation by Kaess et al. (2008) in a first version without considering reordering or relinearization. A fluid relinearization and reordering has recently been published by Kaess et al. (2012) using a Bayes tree, published in (Kaess et al., 2010).

Overview
Visual odometry consists in determining the pose of the cameras in real-time. Our system uses feature points. The process consists of several steps:

1. The data acquisition and association detects feature points, performs the matching and provides camera rays associated with the previous and the other cameras.

2. The orientation of individual frames aims at providing approximate values for the subsequent bundle adjustment and allows to select keyframes.

3. The incremental bundle adjustment uses the new information at a keyframe and merges it optimally with the previous information.
Figure 2: Example frame set taken with the four fisheye cameras in a synchronized way; each frame contains 50 feature points, which are tracked using the OpenCV implementation of the KLT tracker.
As the last step uses all available data and therefore is the most costly one, it needs to be efficient to ensure real-time capability, and it requires the previous steps to guarantee outlier-free information. Therefore we chose the software package iSAM2 for "incremental smoothing and mapping" for the third step and aim at efficient methods for reliable data association. The steps are now described in more detail.

Data Acquisition and Association
Our system for visual odometry is based on interest points, which are tracked in the individual cameras using the OpenCV implementation of the KLT tracker. Interest points are corners in the gradient image with a large smallest eigenvalue of the structure tensor, cf. (Shi and Tomasi, 1994); they are detected using the function cvGoodFeaturesToTrack and tracked using the function cvCalcOpticalFlowPyrLK, which implements the iterative Lucas-Kanade method with pyramids according to Bouguet (2000). Fig. 2 shows an example of extracted feature points in the images of the four fisheye cameras.
Tracked feature points are converted into ray directions pointing to the observed scene points in the individual camera systems. For this we model the fisheye objective with the equidistant model described in (Abraham and Förstner, 2005), allowing for ray directions with an intersection angle equal to or larger than 90° to the viewing direction. The interior orientation of each camera is determined in advance by camera calibration according to Abraham and Hau (1997) using Chebyshev polynomials. Using the equidistant projection and applying all corrections to the feature points, we obtain image points ᵉx lying closer to the principal point H than the gnomonic projections ᵍx of the observed scene points, see Fig. 3. The spherically normalized ray direction ᵏxˢ can be derived from ᵉx by using the normalized radial distance r = |ᵉx|, which grows with the angle φ between the viewing direction and the camera ray.
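The mapping from a corrected image point to a spherically normalized ray can be sketched as follows. This is a minimal illustration of an ideal equidistant model r = f·φ; the focal length value and the +Z viewing-axis convention are assumptions for the example, not the calibrated Chebyshev-corrected model of the paper:

```python
import math

def equidistant_ray(u, v, f):
    """Map a corrected fisheye image point (u, v), given relative to the
    principal point, to a spherically normalized ray direction using the
    ideal equidistant model r = f * phi. The focal length f and the +Z
    viewing-axis convention are illustrative assumptions."""
    r = math.hypot(u, v)              # radial distance |ex|
    if r == 0.0:
        return (0.0, 0.0, 1.0)        # point on the optical axis
    phi = r / f                       # angle between viewing direction and ray
    s = math.sin(phi) / r
    return (u * s, v * s, math.cos(phi))  # unit 3-vector on the sphere

# phi = 0.5 rad, ray lies in the x-z plane:
ray = equidistant_ray(100.0, 0.0, 200.0)
```

Since cos φ may become negative, the model naturally accommodates intersection angles beyond 90°, which a gnomonic (pinhole) projection cannot represent.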
The uncertainty of the image coordinates can be transformed to the uncertainty of the direction ᵏxˢ of the camera ray via variance propagation, yielding the covariance matrix Σᵏxˢᵏxˢ. In all cases the covariance matrix of the camera ray is singular, as the normalized 3-vector only depends on two observed image coordinates.
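The variance propagation and the resulting rank deficiency can be sketched numerically; the equidistant model, focal length and noise level below are illustrative assumptions:

```python
import numpy as np

def ray(u, v, f=200.0):
    # ideal equidistant model (illustrative assumptions as in the text)
    r = np.hypot(u, v)
    phi = r / f
    return np.array([u * np.sin(phi) / r, v * np.sin(phi) / r, np.cos(phi)])

def ray_covariance(u, v, sigma=1.0, f=200.0, eps=1e-6):
    """Propagate an isotropic image-point covariance sigma^2 * I2 to the
    3x3 covariance of the unit ray via a finite-difference 3x2 Jacobian.
    Being an outer product of a 3x2 matrix, the result is singular
    (rank 2), as stated in the text."""
    J = np.column_stack([
        (ray(u + eps, v, f) - ray(u - eps, v, f)) / (2 * eps),
        (ray(u, v + eps, f) - ray(u, v - eps, f)) / (2 * eps),
    ])
    return J @ (sigma ** 2 * np.eye(2)) @ J.T

S = ray_covariance(100.0, 50.0)
```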
To match feature points in the overlapping images of a stereo camera pair, we determine the correlation coefficients between the local 7×7 image patches at the feature points in the left and right images. Using the known rotation and translation from the left into the right camera of the respective stereo pair, which is determined in advance according to Schneider and Förstner (2013), we can reduce the set of possible candidates to feature points lying, with regard to their uncertainty, close to the corresponding epipolar lines, see Fig. 4, by checking the significance of the contradiction to the coplanarity constraint. We assume the feature point with the highest correlation coefficient ρ1 to match if ρ1 is above an absolute threshold, e.g. 0.8, and – if there is more than one candidate close to the epipolar line – the closest-to-second-closest ratio r = ρ2/ρ1 with the second highest correlation coefficient ρ2 is lower than an absolute threshold, e.g. 0.7. Finally we check that this criterion also holds for all feature points in the left image if there is more than one feature point on the corresponding epipolar lines. This procedure leads in some rare cases to wrong matches, which can be detected with a third observing ray from another pose. Matched feature pairs in the stereo cameras are used, besides visual odometry, for fast obstacle detection by directly intersecting the corresponding camera rays using the known mutual orientations between the cameras within a stereo pair.
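The candidate selection can be sketched as follows; the fixed residual bound eps stands in for the significance test based on the ray uncertainties and is an assumption of this example, while the thresholds 0.8 and 0.7 follow the text:

```python
import numpy as np

def skew(t):
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def match(ray_l, cand_rays, corr, R, t, eps=0.02, rho_min=0.8, ratio_max=0.7):
    """Sketch of the candidate selection: keep candidate rays that nearly
    satisfy the coplanarity constraint x_r^T [t]_x R x_l = 0, then accept
    the best-correlated one using the thresholds from the text."""
    E = skew(t) @ R                     # essential matrix of the stereo pair
    ok = [i for i, xr in enumerate(cand_rays) if abs(xr @ E @ ray_l) < eps]
    if not ok:
        return None
    ok.sort(key=lambda i: corr[i], reverse=True)
    best = ok[0]
    if corr[best] < rho_min:            # absolute threshold on rho1
        return None
    if len(ok) > 1 and corr[ok[1]] / corr[best] > ratio_max:
        return None                     # ambiguous: fails the ratio test
    return best
```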

Orientation of a Set of Frames
A map in our context is a set of scene points X = {Xi, i = 1, ..., I}. It is initialized by forward intersecting the matched ray directions in the stereo pairs in the initiating frame set.
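The forward intersection of two matched rays can be sketched with the midpoint method; a minimal least-squares illustration, not the estimator used in the paper:

```python
import numpy as np

def intersect_rays(c1, d1, c2, d2):
    """Least-squares forward intersection (midpoint method) of two camera
    rays with centers c and unit directions d: a minimal sketch of how
    matched stereo rays initialize a scene point."""
    # solve min_{s,u} | (c1 + s*d1) - (c2 + u*d2) | for the two depths
    A = np.column_stack([d1, -d2])
    s, u = np.linalg.lstsq(A, c2 - c1, rcond=None)[0]
    return 0.5 * ((c1 + s * d1) + (c2 + u * d2))  # midpoint of both rays

# two rays from a 20 cm stereo baseline observing the point (0.1, 0, 2):
d1 = np.array([0.1, 0.0, 2.0]); d1 /= np.linalg.norm(d1)
d2 = np.array([-0.1, 0.0, 2.0]); d2 /= np.linalg.norm(d2)
X = intersect_rays(np.zeros(3), d1, np.array([0.2, 0.0, 0.0]), d2)
```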
The initiating frame set is chosen as the first keyframe set with a fixed pose Mk, defining the coordinate system of the map up to scale. The index k denotes the motion of a keyframe set Kk out of the set of all keyframe sets K = {Kk, k = 1, ..., K} ⊂ T = {Tt, t = 1, ..., T}, where T is the set of all frame sets Tt and t is the index referring to the time of exposure of a set of frames taken in a synchronized way.
After initialization of the map, robust estimates for the motion Mt of the camera system in the map are computed at each time of exposure t via simultaneous resection of all cameras. We determine the solution for the six pose parameters of Mt by a robust iterative maximum likelihood-type estimation, down-weighting observations with large residuals by minimizing the robust Huber cost function, see (Huber, 1981). The rigid motions Mc are determined in advance using a rigorous bundle adjustment estimating the system calibration, as described in (Schneider and Förstner, 2013). Using the pose Mt−1 as the initial approximate value, the estimation of Mt converges in most cases after 2-3 fast iterations. This allows the orientation of sets of frames taken with a high frame rate. A track of observations that receives a low weight is put on a blacklist. Tracks on the blacklist are not considered in the following frames anymore.
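The Huber down-weighting inside such an iteratively reweighted estimation can be sketched on a toy problem; the tuning constant k = 1.345 is a common default and an assumption of this example, as is the one-parameter surrogate standing in for the 6-parameter resection:

```python
import numpy as np

def huber_weights(res, k=1.345):
    """Huber down-weighting used in maximum likelihood-type estimation:
    weight 1 for |r| <= k, k/|r| beyond. The tuning constant k (in units
    of the standard deviation) is an assumed common default."""
    a = np.abs(res)
    return np.where(a <= k, 1.0, k / np.maximum(a, 1e-12))

# a few IRLS steps on a toy 1-parameter problem with one gross outlier,
# standing in for the 6-parameter resection:
obs = np.array([1.0, 1.1, 0.9, 8.0])
x = obs.mean()                       # naive estimate, pulled towards the outlier
for _ in range(10):
    w = huber_weights(obs - x)
    x = np.sum(w * obs) / np.sum(w)  # reweighted estimate
```

The outlier keeps a small but nonzero weight; a track whose observations persistently receive low weights would be blacklisted as described above.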

Keyframe-Based Incremental Bundle Adjustment
Bundle adjustment refers to the sets of keyframes, which reduces the processing to some geometrically useful, tracked observations. A new keyframe set with motion Mk is initiated in case a minimal geometric distance to the last keyframe set with motion Mk−1 is exceeded. In case a new keyframe set is initiated, the observations xikc are used to update and refine the scene points in X and the poses in K in the incremental bundle adjustment. We classify the tracked observations into two sets, x1 and x2, where x1 are the observations of scene points that are already in the map and x2 denotes those observing new scene points. The map is continually expanded as new keyframe sets are added. Initial values for new tracked scene points are obtained by forward intersection with the observations x2, where we require that each track consists of at least three keyframe sets. Care has to be taken with the sign: we assume the negative Z-coordinate of each camera system to be the viewing direction. The homogeneous representation of the scene points then needs to have non-negative homogeneous coordinates Xi,4. In case of ideal points, we therefore need to distinguish the scene point [Xi; 0] and the scene point [−Xi; 0], which are points at infinity in opposite directions. Intersected scene points that show large residuals in the observations are put on the blacklist and deleted in the data association. The observations x2 are assumed to be free of corrupted tracks due to the former robust resection.
The map X and the set of poses in K are simultaneously refined using bundle adjustment. For our real-time application the processing of a new keyframe set Kk needs to be finished by the time the next keyframe set is added. For the first ten keyframe sets we use batch bundle adjustments, as the map contains only a small number of scene points yet. After that, the new information is incrementally merged with the previous information, yielding a fast optimal solution for the bundle adjustment using iSAM2, see (Kaess et al., 2012).
Incremental Bundle Adjustment. For bundle adjustment we use the measurement equation (1), which is not linear in the scene point and pose parameters Xi and Mk. Linearization of the non-linear model at the actual linearization points, which is shown in detail in (Schneider et al., 2012), leads to a least squares optimization problem with the Jacobian A, which contains the block entries Aikc for each observation xikc, the right hand side (RHS) vector b, the unknown updates ∆x for the scene point and pose parameters, and, for proper weighting, the covariance matrix Σ that contains the covariance matrices of all xikc, which we assume to be mutually uncorrelated. Using the decorrelated Jacobian Ā = Σ^(−1/2) A, which retains the sparsity pattern of A, yields in eq. (3) a metric with Σ as identity matrix.

QR Factorization. The decorrelated Jacobian is factorized as Ā = Q [R; 0], where R is upper triangular and Q is orthogonal, see (Golub and Van Loan, 1996). Applying this factorization to the least squares problem in eq. (3) and multiplying the RHS by Qᵀ turns the problem into the triangular system R∆x = d, with d being the corresponding part of the transformed RHS vector. As R is upper triangular, we can solve for ∆x with simple back-substitution. The expensive part is done by the QR decomposition. The matrix R is called the square root information matrix, as the information matrix, i.e. the normal equation matrix, is given by ĀᵀĀ = RᵀR. When a keyframe set is initiated with new measurements, generating the Jacobian W and RHS g, the previously calculated components of R can be reused, yielding the new system updates ∆x1 and ∆x2 for the old and new parameters, respectively. Instead of applying an expensive full QR factorization, the entries below the diagonal can be efficiently eliminated by using Givens rotations, see (Golub and Van Loan, 1996), to obtain the updated square root information matrix.
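The Givens-rotation update of the square root information matrix can be sketched densely as follows; a minimal illustration of appending one measurement row, not the sparse implementation used by iSAM2:

```python
import numpy as np

def givens_update(R, w):
    """Append one new (decorrelated) measurement row w to an existing
    square root information matrix R and restore the upper-triangular
    form with Givens rotations instead of refactorizing from scratch."""
    n = R.shape[1]
    A = np.vstack([R, w]).astype(float)
    for j in range(n):
        a, b = A[j, j], A[-1, j]
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r                      # rotation zeroing A[-1, j]
        top, bot = A[j, j:].copy(), A[-1, j:].copy()
        A[j, j:] = c * top + s * bot
        A[-1, j:] = -s * top + c * bot
    return A[:-1]                                # appended row is zeroed out

R0 = np.array([[2.0, 1.0], [0.0, 1.0]])          # current factor
w = np.array([1.0, 1.0])                         # new measurement row
R_new = givens_update(R0, w)
```

The update preserves the information content: the new factor satisfies R_newᵀ R_new = R0ᵀ R0 + wᵀw.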
The iSAM2 Algorithm. Applying this procedure iteratively would keep the linearization point unchanged, while relinearization would involve the recalculation of R. As new measurements often have only a local effect and fill-in may become time-expensive, Kaess et al. (2012) encode the conditional densities of cliques in a so-called Bayes tree, see Fig. 6, which allows for an efficient incremental reordering, just-in-time relinearization and partial solving when parameters change only locally. Using this data structure also eliminates the need to form either of the complete matrices A and R explicitly.
When a new measurement is added, i.e. a factor f(Mk, Xi), only the paths in the Bayes tree between the root and the cliques containing Mk and Xi are affected by the update. The affected part is turned into a factor graph, the new factors are added to it and a new elimination ordering is applied to avoid fill-in. A new Bayes tree is formed and all unaffected sub-trees can be reattached.
Relinearization is performed only on variables whose estimated update is larger than a threshold β, which can be defined for different types of variables.If a variable is chosen to be relinearized, all cliques up to the root have to be removed from the Bayes tree and replaced by the relinearized factors involved.The approach is called fluid relinearization (Kaess et al., 2012) and is done in combination with the update step.
Solving via back-substitution starts at the root and continues to all leaves. A nearly exact but computationally efficient solution does not require solving for all variables, as only the top of the Bayes tree is affected by new factors and relinearization. Processing a sub-tree stops in case a clique is reached referring to variables whose updates are smaller than a small threshold α.
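The idea of stopping the back-substitution early can be sketched on a dense triangular system; in the actual algorithm the stopping criterion is applied per sub-tree of the Bayes tree rather than globally, so this is only a one-dimensional stand-in:

```python
import numpy as np

def partial_backsub(R, d, alpha=0.0):
    """Back-substitution for R * dx = d, starting at the last variable
    (corresponding to the root of the Bayes tree) and descending; once an
    update falls below alpha, the remaining (deeper) variables are left
    unchanged at zero."""
    n = R.shape[0]
    dx = np.zeros(n)
    for i in range(n - 1, -1, -1):
        dx[i] = (d[i] - R[i, i + 1:] @ dx[i + 1:]) / R[i, i]
        if abs(dx[i]) < alpha:        # nearly converged: stop descending
            break
    return dx

R = np.array([[2.0, 1.0], [0.0, 1.0]])
d = np.array([3.0, 1.0])
dx = partial_backsub(R, d)            # alpha = 0 gives the exact solution
```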

EXPERIMENTS
We have integrated our measurement model in eq. (2) into the software package GTSAM 2.3.0, which provides a fast implementation of the iSAM2 algorithm and is described by Dellaert (2012). GTSAM provides a comprehensive MATLAB interface that allows us to perform first experiments, as our prototype for visual odometry is mainly running in MATLAB so far.
To test the real-time capabilities and the optimality of the incremental bundle adjustment, we use an image sequence taken with the four fisheye cameras from our UAV performing two circular motions. The image sequence consists of 1,800 frame sets taken at 14 Hz. We apply a high-weighted prior on the 6-parameter pose of the first keyframe to define the coordinate system of the map. The scale is defined by the known mutual orientations in the multi-camera system. Our system initiates a new keyframe set after each 1 m of motion, resulting in 107 keyframe sets. Tracking 50 feature points in each camera and setting β for the rotations to 0.5° and for the translations to 3 cm yields a very fast processing of the bundle adjustment that is always faster than 1 second on a 3.6 GHz machine, see Fig. 7. The required time is independent of the number of new factors added to the Bayes tree but rather depends highly on the number of cliques related to variables that need to be relinearized. The choice of β has a significant influence on the required time and the obtained accuracy of the estimated parameters. Setting β too low leads to relinearization of all variables on every keyframe, and setting β too large decreases the accuracy of the estimates. As we expect to have better initial parameters with DGPS observations, which will be included into the estimation process soon, the need for frequent relinearizations should decrease.
The root mean square error (RMSE), which is determined after each incremental bundle adjustment, is in the order of 2-3 pixels, see Fig. 8, which is quite large as we assumed a standard deviation of σl = 1 pixel for the extracted feature points. We apply the batch bundle adjustment BACS on the observations used by iSAM2, using the estimated values of iSAM2 as approximate values. We retain the pose of the first keyframe set to use the same gauge definition as with iSAM2. We obtain an empirical variance factor of σ0² = 2.0², which is quite large, being in the order of the RMSE. Applying the robust Huber minimizer shows no significant outliers and yields a practically equal robust empirical variance factor of σ0² = 1.96². The reason for the poor accuracy of the observations still has to be identified. It could lie e.g. in the tracking, in the assumption that the multi-camera system is rigid, not taking into account the possibility of vibrations, or in a poor image quality. Differences in the estimated pose parameters between iSAM2 and BACS are shown in Fig. 9 for each set of keyframes. These differences are within their estimated uncertainty, shown in Fig. 10. The results show that iSAM2 provides estimates which are optimal in a statistical sense, like the rigorous batch bundle adjustment BACS.

CONCLUSIONS AND FUTURE WORK
We presented our system for visual odometry performing a keyframe-based bundle adjustment for real-time structure and motion estimation in an unknown scene. Incremental bundle adjustment is performed by using the iSAM2 algorithm for sparse nonlinear incremental optimization in combination with our measurement equations, allowing for multi-view cameras, omnidirectional cameras and scene points at infinity. First experiments show the high potential of the incremental bundle adjustment w.r.t. time requirements and optimality. Future work will focus on the issue of vibrations due to the rotors of the UAV affecting the mutual orientations of the multi-camera system, the integration of a down-weighting function within the fluid relinearization process and the integration of GPS observations into the visual odometry system. Further, we have to solve the issue of estimating scene points which are at first near to the camera system and, as the camera system moves away, lie numerically at infinity. Furthermore we are porting the prototype, which is mainly running in MATLAB, to a fast C++ implementation.

Figure 3: Relation between sensor point, viewing direction and viewing ray.

Figure 4: Example images taken in a synchronized way in the left and right camera of a stereo pair. The extracted feature point in the left image on the rightmost car has the illustrated epipolar line in the right image. The matching point in the right image lies on the indicated yellow line and the corresponding local image patches show a high correlation.

Figure 5: A two-camera system with fisheye cameras c = 1, 2 with projection centers Ztc, rigid motions Mc and time-varying motion Mt, having a field of view larger than 180°, shown at two exposure times t = 1, 2, observing two points Xi, i = 1, 2, X1 being close by and X2 at infinity.

We use a generalized camera model with multiple projection centres c = 1, ..., 4 and known system calibration, described by the motion Mc of each single camera within the camera system. The measurement equation, which considers mutually fixed single-view cameras, allows the single cameras to be omnidirectional and allows for far or ideal scene points. It reads as

vitc = Jᵀ(xitc) N([I3 | 03] Mc⁻¹ Mt⁻¹ Xi)    (1)

with the homogeneous scene point Xi, the motion matrices Mt and Mc, and the observed ray direction ᵏxˢitc and residual vitc of scene point i in camera c at time t, see Fig. 5, whereby J(x) = null(xᵀ) and N(x) = x/|x|.
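The measurement equation (1) can be evaluated numerically as sketched below; a minimal numpy illustration with J(x) computed via SVD, using identity motions and a simplified +Z viewing axis as assumptions of the example:

```python
import numpy as np

def nullspace(x):
    """J(x): a 3x2 basis of the null space of x^T, computed via SVD."""
    return np.linalg.svd(x.reshape(1, 3))[2][1:].T

def residual(x_obs, M_t, M_c, X):
    """Evaluate eq. (1): v = J^T(x) N([I3|03] Mc^-1 Mt^-1 X) for a
    spherically normalized observed ray x_obs, 4x4 motion matrices and a
    homogeneous scene point X; it works unchanged for ideal points
    (X[3] = 0)."""
    p = (np.linalg.inv(M_c) @ np.linalg.inv(M_t) @ X)[:3]  # [I3|03] keeps the first 3 rows
    return nullspace(x_obs).T @ (p / np.linalg.norm(p))    # 2-vector of residuals

# an ideal point straight ahead, observed exactly, gives zero residual:
x_obs = np.array([0.0, 0.0, 1.0])
v = residual(x_obs, np.eye(4), np.eye(4), np.array([0.0, 0.0, 1.0, 0.0]))
```

Projecting the normalized predicted direction onto the 2D tangent space of the observed ray via Jᵀ(x) is what makes the residual minimal (2 instead of 3 components), matching the two degrees of freedom of a spherically normalized direction.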

Figure 6: (a) A small exemplary factor graph and the associated Jacobian matrix A, where the scene points X1 and X2 are observed from the poses M1, M2 and M3 and odometry measurements between the poses are given. (b) The so-called chordal Bayes net and the associated square root information matrix R result from factorization of the factor graph with the elimination ordering X1, X2, M1, M2, M3. The clique containing the last eliminated variable is called the root. (c) The Bayes tree and the associated square root information matrix R describe the clique structure in the Bayes net. The association of cliques and their conditional densities with rows in the R factor is indicated by color. Adapted from Kaess et al. (2012).

Figure 7: (a) Required time for processing the incremental bundle adjustment using iSAM2. (b) Number of cliques related to relinearized variables (solid) and total number of cliques in the Bayes tree (dashed); note the effect on (a). (c) Number of new factors added; note that this number has no effect on (a).

Figure 9: Deviations between the estimated rotation angles and translations of BACS and iSAM2 on all sets of keyframes. The z-axis points in flight direction, the x-axis points upwards and the y-axis is orthogonal to both.

Figure 10: Estimated standard deviations of the estimated rotation angles and translations of BACS on all sets of keyframes. All are up to the fourth digit identical to those provided by iSAM2. The axes are oriented as described in Fig. 9.