TOWARDS STRUCTURELESS BUNDLE ADJUSTMENT WITH TWO- AND THREE-VIEW STRUCTURE APPROXIMATION

Global approaches solve SfM problems by independently inferring relative motions, followed by a sequential estimation of global rotations and translations. This approach is fast but not optimal, because it relies only on pairs and triplets of images and does not perform a joint optimisation. In this publication we present a methodology that increases the quality of global solutions without the usual computational burden tied to bundle adjustment. We propose an efficient structure approximation approach that relies on relative motions known upfront. Using the approximated structure, we are able to refine the initial poses at very low computational cost. Evaluated against several benchmark datasets and software solutions, our approach improves processing times while maintaining good accuracy.


INTRODUCTION
Photogrammetry and computer vision Structure from Motion (SfM) algorithms have undergone a remarkable evolution during the past two decades. Much of this evolution was driven by advances in sensor technology, growing computing capabilities and the democratisation of both fields through different software solutions (Snavely et al., 2006, Pierrot Deseilligny and Cléry, 2011, Moulon et al., 2016, Schonberger and Frahm, 2016). Existing SfM approaches for pose estimation become computationally inefficient as the number of images increases. Two families of solutions exist: sequential and global methods. Sequential methods (Schonberger and Frahm, 2016, Snavely et al., 2006) generate one or multiple (e.g. hierarchical methods, Toldo et al., 2015) seed images and sequentially concatenate overlapping images. Non-linear least squares bundle adjustment is then used to systematically eliminate accumulated errors (Triggs et al., 1999, Förstner and Wrobel, 2016). On the other hand, global methods (Govindu, 2001, Moulon et al., 2013, Wilson and Snavely, 2014) simultaneously and independently find relationships between pairs or triplets of images. These relationships are encoded in an epipolar graph, which forms the basis for inferring the global rotations and translations. Global methods require the knowledge of image correspondences only in the first phase, when computing the two-view or three-view geometries. This makes them faster and more computationally efficient. However, identifying outliers becomes challenging because the computation is bound to a pair or a triplet. Global solutions to SfM problems provide only an initialisation of the camera poses. Ordinarily, a final bundle adjustment (BA) including poses and point correspondences is run to calculate the optimal solution (Triggs et al., 1999), which is costly. The objective of this work is to move towards a truly structureless bundle adjustment by refining the initial poses with very few points. We propose a structure (i.e. 3D points) approximation algorithm that relies on pairs and triplets of images. We first introduce a robust method for the relative motion calculation. Then, we redefine the per-relative-motion structure through the intermediary of an ellipsoid. The goal here is to generate a minimal set of 3D points that is sufficient to infer the local structure. This set is later used to refine the global solution in a bundle adjustment routine. Please note that we do not present a global motion approach in this work. Instead, to produce an initialisation for the BA, we use the global approach of SfmInit (Wilson and Snavely, 2014) or MicMac (Pierrot Deseilligny and Cléry, 2011, Rupnik et al., 2017). We evaluate our approach on the Strecha benchmark (Strecha et al., 2008), as well as on sequential acquisitions by a mobile mapping platform and a drone. All methods are implemented and available from github.com/micmacIGN. The sequential datasets and their respective ground truth are also available upon request.

RELATED WORK
On global motion methods. Determination of global motions is performed separately for the rotational and translational parts (Govindu, 2001). To calculate the rotations, existing solutions either ignore the orthogonality constraint and employ linear methods (Martinec and Pajdla, 2007, Arie-Nachimson et al., 2012, Moulon et al., 2013), or optimise with a robust ℓ1 norm, as Hartley et al. (Hartley et al., 2011), or as Chatterjee et al. (Chatterjee and Govindu, 2013), who base their algorithm on the Lie group structure. A review of rotation averaging algorithms was published by Hartley et al. (Hartley et al., 2013). The translation component is always computed in second place. Unlike for the rotational component, there is no direct way to compute scale-consistent translations from a set of available pairwise constraints. Some methods exploit only pairwise constraints (Govindu, 2001, Martinec and Pajdla, 2007, Arie-Nachimson et al., 2012, Wilson and Snavely, 2014), but are susceptible to a degenerate solution when all cameras are aligned. Other methods formulate the problem using trifocal dependencies (Courchay et al., 2010, Zach et al., 2010, Jiang et al., 2013, Moulon et al., 2013). Moreover, approaches based on linear solvers (Arie-Nachimson et al., 2012, Govindu, 2001) and on ℓ∞ norm optimisation (Sim and Hartley, 2006, Kahl and Hartley, 2008, Moulon et al., 2013) can be distinguished. Methods formulating the BA estimation problem in terms of epipolar constraints (Steffen et al., 2010, Rodriguez et al., 2011, Indelman et al., 2012, Cefalu et al., 2016) are not discussed herewith. On outliers in relative motions. Relative motions are the main ingredient of global pose estimation methods. The optimisation scheme is particularly sensitive to outliers due to the limited number of motion observations. Since the motions are derived from automated algorithms for sparse correspondence search (e.g. 
SIFT (Lowe, 2004)), they inherently contain outliers, especially across ambiguous scenes with repetitive patterns, low texture, moving objects, etc. Existing approaches cope with the outliers either by eliminating them prior to optimising for global rotations and translations (i.e. filtering), or by adopting robust optimisation techniques. Examples of the latter are the robust ℓ1 optimisation used for rotation averaging (Hartley et al., 2011, Chatterjee and Govindu, 2013) and for translation solving (Dalalyan and Keriven, 2009, Olsson et al., 2010, Zhu et al., 2017). Some recent works have also addressed the evident complementarity of sequential and global methods. In Zhu et al. (Zhu et al., 2017, Zhu et al., 2018) many local incremental SfMs help to discard mismatches and invalid motions. Among the solutions based on filters, the pioneering works (Govindu, 2006, Sim and Hartley, 2006, Martinec and Pajdla, 2007) focus on pairwise constraints and remove the outliers by: RANSAC epipolar sampling (Govindu, 2006); iterative removal techniques (Sim and Hartley, 2006); or identifying false matches via partial reconstructions (Martinec and Pajdla, 2007). Later works exploit the redundancy in the epipolar graph by looking at triplets of images. Courchay et al. (Courchay et al., 2010) introduce a trifocal graph parametrisation and estimate robust global poses using linear programming with imposed loop constraints and RANSAC. Similarly, Zach et al. (Zach et al., 2010) use loopy belief propagation to identify erroneous geometries in loops. Jiang et al. (Jiang et al., 2013) establish a trifocal coplanarity constraint that minimises a geometric error. The work of Moulon et al. (Moulon et al., 2013) fuses the Bayesian inference of Zach et al. (Zach et al., 2010) with the weighted graph filtering of Enqvist et al. (Enqvist et al., 2011) to remove outliers. Sweeney et al. (Sweeney et al., 2015) update the fundamental matrices by optimising a cost function defined over the trifocal point transfer. Finally, Wilson et al. (Wilson and Snavely, 2014) propose a purely pairwise method adapted to very large image collections, where pairwise translations are projected onto 1D subspaces and a global ordering fitting the pairwise constraints is sought.

CONTRIBUTION AND OVERVIEW
To increase the robustness of the relative motions, our method combines several hypotheses and tests each with a suitable direct algorithm (Section 3.1). To infer the triplets we take a similar perspective to (Jiang et al., 2013) and (Sweeney et al., 2015). We present a linear algorithm that rapidly calculates the triplet motions (Quick triplet algorithm). The robust triplet algorithm (Section 3.2) tests two hypotheses to see whether the camera in question has a short, normal, or long focal length. As long focal lengths approach the orthographic projection, we then replace the perspective camera model with the one proposed by Tomasi et al. (Tomasi and Kanade, 1992). Otherwise, the pose of the "third" image of the triplet is found with spatial resection (Grunert, 1841). We intentionally avoided the trifocal tensor parametrisation, knowing its unstable behaviour on flat surfaces (Julià and Monasse, 2017). Our structure reduction approach (Section 4) approximates the original points with an ellipsoid, and generates a small set of virtual points within its volume. This reduction allows for a lighter BA without having to resort to PCG-based optimisations (Agarwal et al., 2010, Jeong et al., 2011, Kushal and Agarwal, 2012). We therefore keep access to the reduced camera system's covariance matrix, which is advantageous for many metrological applications that necessitate a degree of confidence associated with their measurements.

BUILDING THE RELATIVE MOTIONS
The computation of relative motions consistent within triplets of images is divided into three stages. First, given a set of image correspondences, we create the epipolar hypergraph H(V, E), where each vertex Vi ∈ V represents an image, and an edge Eij connects a pair of images (Vi, Vj) (Section 3.1). The connection implies that the relative motion {Rji, trji} is known. We refer to H(V, E) as a hypergraph because, for a given Eij, we have access to a list of connected vertices Vk. The connected vertices are then explored to construct triplets complying with the respective edges (Section 3.2). We assume that the cameras are calibrated, i.e. at least their focal lengths and principal points are known. The local coordinate frame within an edge Eij or a triplet (Vi, Vj, Vk) is always associated with the first camera (i.e. Vi). In the text we often refer to the "third" image of a triplet, by which we mean the image associated with Vk.
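The hypergraph bookkeeping described above can be sketched as follows. This is a minimal illustration in Python; the class and method names are ours and not part of the published implementation.

```python
from collections import defaultdict

class EpipolarHypergraph:
    """Adjacency view of the epipolar hypergraph H(V, E): vertices are image
    ids, and an edge (i, j) stores the known relative motion {R_ji, tr_ji}."""
    def __init__(self):
        self.edges = {}              # (i, j) with i < j  ->  (R, tr)
        self.adj = defaultdict(set)  # image id -> set of connected image ids

    def add_edge(self, i, j, R, tr):
        self.edges[(min(i, j), max(i, j))] = (R, tr)
        self.adj[i].add(j)
        self.adj[j].add(i)

    def connected_vertices(self, i, j):
        """Vertices Vk seen from both Vi and Vj: the candidate 'third'
        images completing a triplet on edge Eij."""
        return sorted((self.adj[i] & self.adj[j]) - {i, j})
```

The `connected_vertices` query is what makes the structure a hypergraph in the sense used above: each edge Eij carries a list of vertices Vk that can complete it into a triplet.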

Pairwise relative motions
Several direct algorithms are tested to compute robust relative motions between pairs of images. We begin with the hypothesis that the corresponding points contain many outliers, and slowly relax this constraint by allowing more points into the computation. The full set of correspondences entering the method amounts to 500 points chosen randomly. In total we test 6 calculation variants, see Table 1. The difference between variants T1 and T2 lies in the way the points are drawn out of the set of 500 points. While T2 is purely random, in T1 we bias the selection by forcing the points to be well distributed across the image plane. This way we avoid sampling points that are clustered in one zone. Variants T1-T4 are solved with the 8-point algorithm (Hartley and Zisserman, 2003) across several RANSAC samples. After each draw the current solution is refined with an ℓ2 solver. In T5 we employ the full set of points, and perform ℓ1 and ℓ2 estimations. The T6 variant tests the planarity of the scene by computing a homography and decomposing it into a relative motion (Faugeras and Lustman, 1988). Finally, the variant with the smallest re-projection error is refined in a coplanarity-based bundle adjustment. In Table 1, MEss is the 8-point essential matrix algorithm, and H is the homography decomposed into a relative motion (Faugeras and Lustman, 1988); (*) indicates that the points were drawn with a bias forcing a homogeneous distribution within the image.
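The biased selection used in variant T1 can be sketched with a simple grid-bucket draw; the bucketing scheme below is our own illustration of one way to force a homogeneous spread, not the paper's exact sampler.

```python
import random
from collections import defaultdict

def biased_sample(points, img_w, img_h, n_draw, n_cells=4, rng=None):
    """Draw n_draw points while forcing a homogeneous spread over the image:
    points are bucketed into an n_cells x n_cells grid and the cells are
    visited round-robin, so no single image zone dominates the sample."""
    rng = rng or random.Random(0)
    cells = defaultdict(list)
    for (x, y) in points:
        cx = min(int(x / img_w * n_cells), n_cells - 1)
        cy = min(int(y / img_h * n_cells), n_cells - 1)
        cells[(cx, cy)].append((x, y))
    for bucket in cells.values():
        rng.shuffle(bucket)          # random choice within each cell
    sample = []
    while len(sample) < n_draw and any(cells.values()):
        for key in list(cells.keys()):
            if cells[key]:
                sample.append(cells[key].pop())
                if len(sample) == n_draw:
                    break
    return sample
```

With this draw, a RANSAC iteration never receives points clustered in one zone, which is the failure mode the T1 variant is designed to avoid.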

Triplet motions
The triplet motions are ideally determined between "relevant" triplets of images. In order to select a subset of suitable triplets, we first determine the triplet motions between all possible candidates (see Quick triplet algorithm and Algorithm 1), and use them to define a per-triplet quality index (see Triplet selection).
The final triplet motions, calculated on a pre-selected set, result from the Robust triplet algorithm.
On straight lines intersection. For the sake of clarity, we first lay out the line intersection algorithm that is the building block of the Quick triplet algorithm. Let us define two straight lines as (cf. Figure 1 (a)):

P(p) = C_P + p · U_P ,    Q(q) = C_Q + q · U_Q ,

where C_P, C_Q are points on the lines, U_P, U_Q their direction vectors, and p, q scalar parameters. The intersecting 3D point belongs to the segment that minimises the distance between the lines P and Q; hence, this segment is perpendicular to both of them. We define it as (P − Q), and impose the following:

(P − Q) · U_P = 0 ,    (P − Q) · U_Q = 0 .

Using the above, the parameters p and q are inferred. The final 3D point is equal to ½ (P(p) + Q(q)).

Quick triplet algorithm. The rotation of the "third" image in the triplet (Vi, Vj, Vk) is calculated twice: directly as R′3 = R31, and indirectly as R″3 = R21R32; the definite rotation is R3 = ½ (R′3 + R″3). The orthogonality of R3 is enforced by mapping it to its nearest orthogonal rotation with the Singular Value Decomposition. The computation of the perspective center proceeds in two steps, as presented in Figure 1 and Algorithm 1. Our objective is to force the perspective center to lie close to both the translation direction tr31 calculated in the preceding step, and the directions associated with the positions of the 3D points. To achieve this, our first step is to determine the 3D position of two corresponding points using the image measurements in (Vi, Vj) (cf. Figure 1 (b)). Then, the new perspective center of Vk results from intersecting the translation direction tr31 with the direction tr3PInt, defined by the image measurement in Vk and attached at the position of the 3D point. For N corresponding points, one obtains N estimates of the perspective center (or N estimates of p and q, see Figure 1 (a)). The terminal value is then C3 = tr31 · p̃, where p̃ is the median of the estimates p = {pi}, i ∈ [1, N]. By definition, the proposed algorithm calculates consistent triplets exclusively from image correspondences of manifold 3 (i.e. points visible in 3 images). Statistically speaking, points of higher manifold are less susceptible to outliers, and are therefore expected to produce more stable results. It is also required that a minimum of 8 3-manifold points exists. In practice, many more points are available, and we decimate them by a factor of 50 when their number exceeds 500. Note that in this formulation, the computation of the perspective center is not degenerate even if all three perspective centers are located on a straight line.
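The line intersection building block admits a small closed-form implementation. The sketch below follows the two perpendicularity conditions, solving the resulting 2x2 system for p and q and returning the midpoint of the shortest segment; variable names are ours.

```python
import numpy as np

def intersect_lines(c_p, u_p, c_q, u_q):
    """Closest point of approach between lines P(p) = c_p + p*u_p and
    Q(q) = c_q + q*u_q. The joining segment (P - Q) is perpendicular to
    both direction vectors, which yields a 2x2 linear system in p and q."""
    c_p, u_p = np.asarray(c_p, float), np.asarray(u_p, float)
    c_q, u_q = np.asarray(c_q, float), np.asarray(u_q, float)
    w = c_p - c_q
    a, b, c = u_p @ u_p, u_p @ u_q, u_q @ u_q
    d, e = u_p @ w, u_q @ w
    denom = a * c - b * b          # zero iff the lines are parallel
    p = (b * e - c * d) / denom
    q = (a * e - b * d) / denom
    mid = 0.5 * ((c_p + p * u_p) + (c_q + q * u_q))
    return mid, p, q
```

When the lines truly intersect, `mid` is the intersection point; for skew rays (the usual case with noisy image measurements) it is the midpoint of the shortest connecting segment, and the returned parameter p is the quantity whose median the Quick triplet algorithm takes over all correspondences.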
Triplet selection. Out of a number of initialised triplets, we want to select the K best ones. The notion of a best triplet is translated into a quality index Q, calculated for each triplet in H. The index Q favours wide baselines between images (i.e. large base-to-height ratios, or bh), where TL is a rough bh limit; RViVk is computed between the first and the third image in a triplet, and RVjVk between the second and the third, correspondingly. To rapidly calculate the indices we exploit the triplet motions obtained with the Quick triplet algorithm. We performed all the experiments with K = 1, since with growing K we did not observe much improvement in accuracy while the running times worsened. TL was set to 0.15.

Algorithm 1 (fragment): for each correspondence: {p, q} = Intersect_pq(t31, PInt + U3); Vp.push(p); end for; C3 = t31 · median(Vp); R3 = NearestRot(0.5 · (R31 + R21 · R32)).

Figure 1. Computation of the perspective center of the "third" image (C3) in the Quick triplet algorithm. C1 and C2 are the perspective centers of images 1 and 2, respectively. The relative orientations between images 1 and 2, as well as between 1 and 3, are known and related by a 7-parameter transformation.
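The exact formula of the quality index did not survive extraction, so the sketch below is only a plausible reading consistent with the description: Q grows with the pairwise base-to-height ratios and saturates once every pair of the triplet passes the rough limit TL. The combination by minimum over the three pairs is our assumption.

```python
import numpy as np

def base_to_height(c_a, c_b, depth):
    """bh ratio of an image pair: baseline length over mean scene depth."""
    return np.linalg.norm(np.asarray(c_a, float) - np.asarray(c_b, float)) / depth

def triplet_quality(c1, c2, c3, depth, t_l=0.15):
    """Hypothetical quality index favouring wide baselines: it saturates at 1
    once every pair of the triplet passes the rough bh limit t_l (0.15 in the
    experiments above)."""
    bhs = [base_to_height(c1, c2, depth),
           base_to_height(c1, c3, depth),
           base_to_height(c2, c3, depth)]
    return min(min(bhs), t_l) / t_l
```

Under this reading, selecting the K best "third" images per edge reduces to ranking the candidate Vk by `triplet_quality` using the motions from the Quick triplet algorithm.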
Robust triplet algorithm. At this stage, every edge Eij in H contains a list of at most K "third" images selected in the preceding step. We begin by re-estimating the poses of the "third" images for each Eij in H. This estimation is embedded in a RANSAC framework, and uses two direct estimation models: the spatial resection algorithm (Grunert, 1841), and the approach of Tomasi et al. to orientating images with long focal lengths (Tomasi and Kanade, 1992). We always test both algorithms and choose the better result by comparing their reprojection errors. We set the number of RANSAC draws to 100, and take random subsets of 500 and 30 corresponding points for the spatial resection and the long focal length algorithms, respectively. Finally, a per-triplet bundle adjustment is run on the randomly reduced points.

STRUCTURE APPROXIMATION

Figure 2 illustrates a toy example of the structure approximation algorithm. Given a set of initial 3D points resulting from per-pair or per-triplet intersection (see On straight lines intersection), we calculate an ellipsoid inscribed in the points. The ellipsoid is represented by its eigenvectors and eigenvalues, as well as its center of gravity. These parameters serve to generate a set of new, fictitious 3D points and their corresponding image measurements. In the same vein, Mayer (Mayer, 2014) suggested a point reduction approach to speed up the merging process in hierarchical SfM.

The ellipsoids are established for both triplets and pairs of images. To begin with, the center of gravity µP and the covariance matrix KP are determined as the weighted mean and weighted covariance of the points:

µP = Σi Pds,i Pi / Σi Pds,i ,    KP = Σi Pds,i (Pi − µP)(Pi − µP)ᵀ / Σi Pds,i ,

where P = {Pi} are the point correspondences and Pds is a weight function that penalises large reprojection errors (er) and small base-to-height ratios (bh): Pds = 1 / (1 + (er / (α · bh))²). In our experiments α was fixed to 10. Then, the eigenvalues and eigenvectors of the KP matrix are retrieved with the Jacobi method (Press et al., 1992). We choose to approximate the structure with 5 symmetrically distributed points (see Figure 2 (d)) because it is the minimal number of points sufficient for direct pose estimation algorithms such as the 5-point algorithm (essential matrix) and spatial resection, as well as for a similarity transform that brings two stereo reconstructions into a common coordinate frame (e.g. in hierarchical SfM). Given the eigenvalues ei and eigenvectors Ei, we generate a new point Hi from:

Hi = µP + hi1 √e1 E1 + hi2 √e2 E2 + hi3 √e3 E3 ,

where µP is the ellipsoid's center of gravity; E1, E2, E3 are the first, second and third unit eigenvectors, and e1, e2, e3 their corresponding eigenvalues; [hi1, hi2, hi3] is the position of a fictitious point in the coordinate frame of the eigenvectors. In our 5-point configuration from Figure 2 (d), the points are symmetrically distributed within the ellipsoid; for the random variant, the reported results are always a mean value over 100 or 1000 repeated random calculations. In the bundle adjustment, the reduced points are weighted according to the number Nb of initial points that contributed to calculating the ellipsoid, with NbMax set to 100.
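The ellipsoid construction can be sketched as follows. The weighted mean/covariance and the placement of the 5 fictitious points (centre plus points along the semi-axes) are plausible readings of the description above, not the authors' exact implementation; numpy's `eigh` stands in for the Jacobi method.

```python
import numpy as np

def approximate_structure(points, errs, bhs, alpha=10.0):
    """Reduce a 3D point set to 5 fictitious points on its inscribed
    ellipsoid (centre of gravity, eigenvectors and eigenvalues)."""
    P = np.asarray(points, float)
    # Pds weight: penalise large reprojection error and small bh ratio
    w = 1.0 / (1.0 + (np.asarray(errs, float) / (alpha * np.asarray(bhs, float))) ** 2)
    mu = (w[:, None] * P).sum(0) / w.sum()
    d = P - mu
    K = (w[:, None, None] * np.einsum('ni,nj->nij', d, d)).sum(0) / w.sum()
    e, E = np.linalg.eigh(K)   # ascending eigenvalues; eigenvectors in columns
    # fictitious points Hi = mu + sum_k h_ik * sqrt(e_k) * E_k;
    # the symmetric 5-point placement below is an assumption
    h = np.array([[0, 0, 0], [1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
    H = mu + (h * np.sqrt(np.maximum(e, 0.0))) @ E.T
    return H, mu, K
```

The five rows of `H` then replace the full point cloud in the per-motion bundle adjustment, with a weight derived from the number of initial points Nb.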

RESULTS
We run experiments on three groups of datasets: the Strecha benchmark (Strecha et al., 2008), a sequential dataset (CAR) acquired by a mobile mapping system, and a drone dataset (UAV). The number of images, the number of corresponding points and the number of triplet pairs per dataset are shown in Tables 2 and 3. In Tables 4, 6 and Figure 3 the results are evaluated with respect to ground truth data, and with respect to other software solutions. To bring the results into the same coordinate frame, we apply a 7-parameter transformation to the cameras' perspective centers. The average positional error refers to the differences between the respective perspective centers once the transformation has been applied. The Strecha images are provided with their ground truth poses. The CAR dataset contains 647 images that amount to an approximately 2 km trajectory without loops (see Figure 4). The camera ground truth poses were calculated using highly accurate ground control points measured in the field with classical surveying techniques (point positional accuracy in the range of a few mm). To avoid having to measure many ground control points across the entire trajectory length, the vehicle moved in circles around a block of buildings, and during processing the image correspondences were extracted only between immediately adjacent cameras. The SfmInit (Wilson and Snavely, 2014) solution did not succeed in orientating the CAR image set; therefore, we needed to resort to an alternative global SfM available in MicMac (Pierrot Deseilligny and Cléry, 2011). A possible explanation is that the CAR dataset is a linear acquisition, with potentially many cameras located on the same 3D line. Such acquisitions form a degenerate case for algorithms based only on pairwise constraints. For the same reason, the BA refinement on structure approximated with pairs of images (Ep) did not converge. The UAV dataset contains 73 images over an area of 150 x 160 m, as illustrated in Figure 5.
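The 7-parameter transformation used in the evaluation can be sketched with the closed-form similarity fit of Umeyama (1991); whether the authors used this particular estimator is our assumption, and the function and variable names are ours.

```python
import numpy as np

def similarity_align(X, Y):
    """Fit scale s, rotation R and translation t mapping camera centres X
    onto ground-truth centres Y (closed form via SVD, Umeyama 1991), and
    report the average positional error after alignment."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    mx, my = X.mean(0), Y.mean(0)
    Xc, Yc = X - mx, Y - my
    U, S, Vt = np.linalg.svd(Yc.T @ Xc / len(X))
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:   # keep a proper rotation (det = +1)
        D[2, 2] = -1
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / Xc.var(0).sum()
    t = my - s * R @ mx
    err = np.linalg.norm((s * X @ R.T + t) - Y, axis=1).mean()
    return s, R, t, err             # err = average positional error
```

After the fit, `err` corresponds to the "Average positional error" reported in the tables: the mean distance between transformed and ground-truth perspective centers.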
Analogously to the CAR dataset, the ground truth poses were calculated using ground control points measured in the field. Here as well, SfmInit did not manage to provide an initialisation for the final BA; hence, we resorted to MicMac (Pierrot Deseilligny and Cléry, 2011). The general pipeline we followed includes:
1. Extraction of correspondences with SIFT (Lowe, 2004);
2. Building of relative motions (Section 3);
3. Approximating the structure (Section 4), with variants based on: (a) pairs, Ep; (b) triplets, Etri; (c) pairs and triplets, Ep+tri; (d) pairs and triplets with a random point distribution, Ernd p+tri;
4. Computation of global motions with SfmInit (Wilson and Snavely, 2014).
Running times. Tables 5, 6 and Figure 3 report on the running times. All our experiments were carried out on a machine with an Intel Core i7-8550U, 1.8 GHz x 8 processor. The times for the Ours approaches concern only the final BA; to deduce the integral processing time, one should add to it the time spent on calculating the initial solution. For a fair evaluation, in Table 5 we look at the ratio SfmI_BA / Ours, which compares the BA processing times with SIFT points to our approximated structure approach. We are faster by several factors in all instances.
Structure approximation variants. Approximating the structure with triplets only (Etri), or with triplets and pairs (Ep+tri) of images, turns out to be the best solution throughout all processed acquisitions. We can assume that for well connected images, which is the case for the three acquisitions tested, using triplets only is sufficiently accurate and fast. For sparser connections, adding pairs may be indispensable. We have also observed a mediocre performance when the structure was approximated with pairs only (Ep). In general, the refinement step succeeds in improving the accuracy with respect to the initial poses, and the variants with structure approximated by triplets of images perform at least as well as the sequential SfM solutions.
Deterministic or random fictitious points. The results of the three experiments using random points do not stand out from those of the 5-point configuration. On the Strecha benchmark as well as on the UAV dataset, the random distribution is only slightly worse than the best solutions. On the other hand, in the CAR dataset, the random selection sometimes outperforms the other three variants. We perceive that what discriminates the acquisitions is the scene geometry: the CAR dataset scene is characterised by very large depth variations, which allows even randomly sampled points to fully model the local structure.
Limitations. The structure approximation approach is not meant for image sets with unstructured connectivity graphs (e.g. Internet photo collections). There, the overwhelming redundancy of images does not reduce the information content across the scene, hence our reduction does not accelerate the processing.

CONCLUSION
There are two leading contributions in this publication. First, we introduce a combinatorial approach to estimating relative motions which enhances the robustness to identified degenerate cases. Second, for a given relative motion, we propose a way of abstracting its 3D structure such that the motion's estimation properties are preserved. The abstracted structure, injected into a bundle adjustment, effectively refines the initial solution at very low computational cost. The concept is validated with respect to ground truth data as well as against other software solutions. Future work will concentrate on a purely structureless bundle adjustment, with exclusively view-dependent constraints and a thorough propagation of the minimal structure via relative and absolute motion covariance matrices.

Table caption (fragment): ... (Schonberger and Frahm, 2016) SfMs, as well as our approach (Ours). SfmI is the initial solution found with (Wilson and Snavely, 2014) that we relied upon in our BA. Ernd p+tri corresponds to the solution with structure approximated by pairs and triplets with randomly distributed points; a mean value over 1000 repeated calculations is reported. Ep, Etri, Ep+tri are the solutions where the structure is approximated with 5 points (see Figure 2 (d)), with pairs, triplets and the fusion of both, respectively.

Table 6. UAV dataset. Average position error (err) and running times (τ). MMInit and MM are the global and sequential SfMs in MicMac. MMInit BA corresponds to the BA refinement on the initial solution. Note that the times for Ours concern only the BA step. In Ernd p+tri a mean value over 1000 repeated calculations is reported.

Figure caption (fragment): ... (Schonberger and Frahm, 2016). MicMac and Colmap results were computed with SIFT (Lowe, 2004). In variant Ernd p+tri a mean value over 100 repeated calculations is reported. The high-frequency drift is due to the cyclic nature of the acquisition geometry.