ROTATION ONLY BUNDLE-ADJUSTMENT FOR THE GENERALIZED CAMERA MODEL AND ITS APPLICATION FOR LARGE-SCALE UNDERWATER IMAGE-SETS

ABSTRACT: The generalized camera model allows handling a broad range of imaging systems for which the common central-perspective model no longer applies. While offering greater modeling flexibility, designated orientation procedures are needed to support its application. In that respect, models that facilitate the computation of the bundle-adjustment solution are imperative as a means to solve the motion and structure of a large set of images. Being non-linear, the bundle-adjustment solution requires good initial estimates and the filtering of outlying matches. To facilitate such a solution, we propose in this paper a method for estimating the global rotations of a set of views independently of their translation and the scene structure. This method is then used to improve the robustness, efficiency, and convergence of the solution when applied as a refinement of the global rotations. The proposed pipeline is evaluated by experiments on synthetic and real-world data of axial flat-refractive cameras. The results show that the proposed method produces more accurate initial estimates of the global rotations than the state-of-the-art rotation-averaging method, thereby easing the computational demand on the complete bundle adjustment while improving the estimated parameters that are introduced into it and the ability to filter outlying matches.


INTRODUCTION
The use of the generalized camera model (GCM) has seen growing interest in recent years with the advancement of imaging systems and multi-camera installations (Miraldo and Cardoso, 2020). Notable examples include fish-eye (Castanheiro et al., 2021), central or non-central catadioptric, rolling-shutter (Fan and Dai, 2021), panoramic (Ji et al., 2020), and underwater cameras (Telem and Filin, 2010), as well as multi-view rigs (Miraldo and Cardoso, 2020). The generalized form facilitates the modeling of a broad range of imaging systems, but it also requires suitable pose estimation procedures to support it. Because of the deviation from the central perspective form, integration into structure-from-motion (SfM) and bundle adjustment (BA) solutions also requires robust procedures that generate good initial estimates for the pose parameters and filter outlying matches. The predominant strategy of recent years initializes such solutions by estimating the global motion of each view via an averaging procedure whose aim is to recover global motions from a set of relative ones. Such a scheme, commonly termed motion averaging, provides a fast and accurate method for camera motion estimation with a wide range of applications, including view registration, robotic path estimation, super-resolution, etc. (Chatterjee and Govindu, 2013; Eriksson et al., 2018).
Motion averaging can be partitioned into rotation averaging (RA) followed by translation averaging (TA) procedures (Hartley et al., 2013; Chatterjee and Govindu, 2017; Schonberger and Frahm, 2016; Eriksson et al., 2019; Chen et al., 2021). RA consists of estimating global camera orientations (in reference to a predefined datum) that best agree with the complete set of plausible pairwise relative orientations within the image set. These global orientations are estimated such that the disagreement is minimized and the errors are distributed over the entire set of pairwise constraints. Differing from the BA solution, in RA only the rotations are estimated, as a function of the relative pairwise rotations, but not such that the image-related reprojection errors are minimized (Chen et al., 2021). As the set of unknown parameters is small compared to the full BA framework (involving structure + motion), the computation is faster and the procedure is simpler to perform. Being an indirect minimization form, not in reference to the image measurements, and as all relative rotations are treated equally, even when a different number of points is utilized for the individual estimations, this procedure is not optimal. Recently, a method that considers both pair-wise rotation estimates and image measurements was proposed for central cameras (Lee and Civera, 2021). There, given the RA estimates, a rotation-only bundle adjustment (ROBA) solution, which optimizes the rotations over all image measurements, was performed. For that, the authors extended a pair-wise relative orientation model that allows the rotation to be recovered independently of the translation into a multiple-view framework. While offering better initial estimates, their solution only fits central cameras and cannot handle non-conventional imaging systems.
Considering the growing utilization of generalized camera models, we study in this paper the computation of global rotations as an optimization form and use the image-ray directions as the direct input. This solution can be integrated into the SfM pipeline to refine the initial absolute rotations obtained by RA methods before the BA solution commences. While our solution form is general, we demonstrate its application on underwater flat-refractive imaging configurations. It has been identified that the flat-refractive configuration is axial by nature (cf. Telem and Filin, 2010), and it has also been demonstrated that the nature of this system translates to depth dependence in terms of the image-related corrections (Telem and Filin, 2010; Nocerino et al., 2021). The literature shows that the predominant method to handle the axial nature of this system is by approximating it as a perspective one with distortions (Chadebecq et al., 2020). This produces aberrations and inaccuracies in the reconstruction because of the depth-dependent 3-D error in the system. To alleviate such errors, reweighted BA procedures have been recently proposed, where the weights assigned to the image measurements are changed according to the error introduced by their depth (Nocerino et al., 2021). Aspiring for a physically exact solution to the flat-refractive camera model, we utilize the GCM and solve the global orientations as initial estimates for the BA solution. The advantages of such a representation are the direct form by which the measurements are introduced into the model and the physically exact modeling of the system.

Figure 1: Geometry of the generalized relative pose problem for multi-camera systems. The unknowns are the transformation parameters between the two viewpoints b and b′, given by t and R. The observation vectors v_i and v′_i and the positions of the camera centers c(1) and c(2) with respect to b and b′ are given by t_{c(1)} and t_{c(2)}, respectively.
We extend a pairwise rotation-only relative orientation solution to handle multiple views by aggregating a set of two-view costs and minimizing them through nonlinear optimization. We test the proposed generalized rotation-only BA through a set of simulated tests and on real-world data. As the results show, under typical configurations, the obtained parameters are sufficiently close to the actual ones and may facilitate a variety of applications such as coarse reconstruction or coarse localization. Finally, we compare our results to the state-of-the-art RA method (Chatterjee and Govindu, 2017) and demonstrate consistent and significant gains in accuracy. The organization of this paper is as follows: Sec. (2) develops the generalized two-view rotation-only solution for the flat-refractive setup. Sec. (3) describes the generalized rotation-only solution. In Sec. (4) we present the experimental results, and Sec. (5) presents the discussion and conclusions.

GENERALIZED CAMERA MODEL
Image-related measurements for the GCM are usually expressed by Plücker coordinates, a 6-D vector of which the first three elements correspond to the ray direction, and the latter three are given by the cross-product between a point on the line and its direction. For a multi-camera system whose center does not coincide with a specific camera center, we denote its positions at two different epochs by b and b′ and relate them by a 3-D rigid-body transformation, where t and R are the respective translation and rotation (Fig. 1). We also consider two different camera centers, c(1) and c(2), each viewing at a different epoch the same object-space point by the two respective rays, v_i and v′_i, in reference to b and b′ (Fig. 1). In that form, the Plücker line coordinates of the two observations are given by:

    L_i = [v_i^T, (t_{c(1)} × v_i)^T]^T,   L′_i = [v′_i^T, (t_{c(2)} × v′_i)^T]^T.    (1)

Integration of the Plücker line transformation and the intersection constraint (cf. Förstner and Wrobel, 2016, for more details) leads to the generalized epipolar constraint (GEC):

    L′_i^T [ [t]_× R   R ; R   0 ] L_i = 0,    (2)

where [t]_× represents the skew-symmetric form of t. Substituting Eq. (1) into Eq. (2), we obtain the expanded form of the GEC:

    v′_i^T ([t]_× R) v_i + v′_i^T R (t_{c(1)} × v_i) + (t_{c(2)} × v′_i)^T R v_i = 0.    (3)

Similar to the central case, this formulation allows solving linearly for the relative pose. However, the linear solution has a largely redundant parametrization and requires 17 correspondences for solving only 6 DoF (Kim et al., 2009).

Figure 2: Geometry of the axial camera relative pose problem as a multi-camera system. The unknowns are the transformation parameters between the two viewpoints b and b′, given by t and R. The observation vectors v_i and v′_i and the positions of the camera centers c(1) and c(2) with respect to b and b′ are given by t_{c(1)} = k n = [0, 0, −k]^T and t_{c(2)} = k′ n = [0, 0, −k′]^T, respectively. The axial nature dictates that all camera centers lie on the same system axis, n (dashed red line), for each one of the viewpoints. Note that b and b′ must also lie on the axes of the systems.
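The constraint above can be checked numerically. The following sketch (plain NumPy; all function names are ours, not from the paper) builds the Plücker line coordinates of two corresponding rays and evaluates the GEC residual, which vanishes for a noise-free correspondence under the motion convention x₂ = R x₁ + t:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def plucker(v, c):
    """6-D Plücker coordinates of the ray through camera center c with direction v."""
    return np.hstack([v, np.cross(c, v)])

def gec_residual(L1, L2, R, t):
    """Generalized epipolar constraint residual; zero for a noise-free
    correspondence under the motion x2 = R @ x1 + t between the viewpoints."""
    v1, m1 = L1[:3], L1[3:]
    v2, m2 = L2[:3], L2[3:]
    return v2 @ skew(t) @ R @ v1 + v2 @ R @ m1 + m2 @ R @ v1
```

A quick synthetic check: generate a 3-D point, view it from two body frames with per-view camera-center offsets, and verify the residual is (numerically) zero at the true motion and non-zero otherwise.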

Application to flat-refractive geometry of underwater cameras
When applied to underwater cameras that image through a flat interface, one needs to account for refraction at the interface, which bends the incident ray direction and introduces a non-linear trajectory (Fig. 2). We can maintain the collinearity of the incident ray by offsetting the camera center along an axis whose direction is the normal to the interface, such that the modified ray direction under such a formulation becomes

    v_{R_i} = [x_i, y_i, k̂_i f]^T,    (4)

where x_i, y_i are the image-plane coordinates given in the calibrated camera frame, and k̂_i is a correction factor to the principal distance, f, whose magnitude depends on the image point coordinates (cf. Appendix A). Next, defining k_i as the corresponding offset of the camera center along the axis, where t_{c(i)} = k_i n = [0, 0, −k_i]^T and n = [0, 0, 1]^T, the image ray can be expressed in the Plücker line representation by

    L_i = [v_{R_i}^T, (k_i n × v_{R_i})^T]^T,    (5)

such that k_i n is the modified position of the camera center and v_{R_i} is the ray direction. Note that Eq. (5) defines a linear form of the ray trajectory through refraction which is expressed by image-related quantities only. Substituting the Plücker line coordinates from Eq. (5) into Eq.
(2) and with further derivations we obtain:

    v′_{R_i}^T ([t]_× R) v_{R_i} + v′_{R_i}^T R (k_i n × v_{R_i}) + (k′_i n × v′_{R_i})^T R v_{R_i} = 0,    (6)

thereby establishing the analogy of the flat-refractive camera model to the GCM and GEC (Eq. 3) forms.

GEC rotation-only solution

Kneip and Li (2014) expressed Eq. (6) as follows:

    g_i^T t̃ = 0,    (7)

where g_i is termed the generalized epipolar plane normal vector, and t̃ = [t^T, 1]^T the homogeneous translation vector, which has an arbitrary scale. Having n generalized normal vectors, the following constraint can be generated:

    G t̃ = [g_1, …, g_n]^T t̃ = 0.    (8)

This expression constrains t̃ by an n × 4 matrix that depends only on the rotation parameters. As the trivial solution is not allowed, the rank of G has to be 3. Hence, given an arbitrary number of correspondences, n, a rank minimization of G over R can be reached by minimizing the smallest eigenvalue of

    H = G^T G.    (9)

Solving R through H can be done by the minimization

    R* = argmin_R λ_H(R),    (10)

where λ_H(R) is the smallest eigenvalue of H, which is a function of R. Let aλ⁴ + bλ³ + cλ² + dλ + e = 0 be the fourth-degree polynomial whose roots are the eigenvalues of H. The coefficients {a, b, c, d, e} can be derived from det(H − λ I_{4×4}), and the smallest root can then be obtained in closed form by applying Ferrari's solution. Hence, the R* that minimizes the GEC (Eq. 6) can be obtained directly by nonlinearly solving Eq. (10). The GEC rotation-only solution allows computing pair-wise orientations between all image pairs with a sufficient number of correspondences.
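A minimal sketch of this cost is given below. The 4-D vector g_i follows from rearranging the GEC as ((R v₁) × v₂)·t + (v₂·R m₁ + m₂·R v₁) = 0; for brevity the smallest eigenvalue of H is taken from NumPy's symmetric eigensolver rather than from the closed-form Ferrari solution used in the paper, and all function names are ours:

```python
import numpy as np

def g_vector(v1, m1, v2, m2, R):
    """4-D 'generalized epipolar plane normal' g_i with g_i @ [t, 1] = 0:
    the GEC rearranged as (R v1 x v2).t + (v2.R m1 + m2.R v1) = 0."""
    return np.hstack([np.cross(R @ v1, v2), v2 @ R @ m1 + m2 @ R @ v1])

def lambda_h_min(corrs, R):
    """Smallest eigenvalue of H = G^T G; (near) zero at the true rotation.
    corrs is a list of (v1, m1, v2, m2) ray directions and moments."""
    G = np.array([g_vector(v1, m1, v2, m2, R) for v1, m1, v2, m2 in corrs])
    H = G.T @ G
    return np.linalg.eigvalsh(H)[0]   # eigenvalues in ascending order
```

At the true rotation, [t̃] spans the null space of G, so the cost vanishes regardless of the (unknown) translation; any rotation error raises it, which is what makes a rotation-only minimization possible.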

GENERALIZED ROTATION ONLY BUNDLE ADJUSTMENT (G-ROBA)
Given the set of plausible relative orientations, it is possible to construct a view-graph G that encodes all connections between pairs of views. We define G = {V, E} such that ∥V∥ = N and ∥E∥ = M, where V is the set of N cameras and E is the set of M edges representing the relative orientations between individual cameras. In this representation the edge ij ∈ E represents the computed relative rotation Rij between cameras i and j, such that i, j ∈ V. We denote the collection of all relative rotations by RE. The set of all 3-D rotations RV = {R1, R2, . . . , RN} completely specifies the global rotations of all the cameras with respect to a given reference frame.
If E spans the entire view graph, the global rotations of all the cameras can be solved by using the pairwise relative rotations.
As cameras i and j have global rotations Ri and Rj, respectively, in a given reference frame, the relative rotation between them should obey the relationship

    R_{ij} = R_j R_i^T.    (11)

The problem of relative rotation averaging can be stated as follows: given a sufficient number of relative rotations Rij ∈ RE, we seek an estimate of the global camera rotations RV. In practice, we always have a larger number of edges than what is required to span the view-graph, i.e., M > N − 1, implying that we have a redundant set of observations. We also note that, due to the presence of noise or outliers, the set of relative rotations is inconsistent, i.e., we cannot find a solution RV = {R1, R2, . . . , RN} that exactly satisfies all constraints {Rij = RjR_i^T | ∀ij ∈ E}.
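When E spans the view graph, a simple initialization chains the relative rotations along a spanning tree, with one camera fixing the datum. The sketch below (our names, using the convention R_ij = R_j R_i^T from the text) does this with a BFS from camera 0:

```python
import numpy as np
from collections import deque

def spanning_tree_rotations(n_views, rel_rots):
    """Initialize global rotations by chaining relative ones (R_ij = R_j R_i^T)
    along a BFS spanning tree of the view graph; view 0 fixes the datum."""
    adj = {i: [] for i in range(n_views)}
    for (i, j), Rij in rel_rots.items():
        adj[i].append((j, Rij))
        adj[j].append((i, Rij.T))   # reversed edge: R_ji = R_ij^T
    R = {0: np.eye(3)}
    q = deque([0])
    while q:
        i = q.popleft()
        for j, Rij in adj[i]:
            if j not in R:
                R[j] = Rij @ R[i]   # R_j = R_ij R_i
                q.append(j)
    return R
```

Such a chained solution accumulates the noise of every edge along the path, which is precisely the redundancy that the averaging in the next step exploits.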
We seek to find an estimate of RV that is most consistent with the observed relative rotations. This can be obtained by minimizing a cost function that penalizes the discrepancy between the observed relative rotations Rij and the ones suggested by the estimate, RjR_i^T, i.e.,

    R*_V = argmin_{R_V} Σ_{ij∈E} ρ( d(R_{ij}, R_j R_i^T) ),    (12)

where d(·) is a distance measure between two rotations in SO(3) and ρ(·) is a loss function defined over this distance measure. We follow Chatterjee and Govindu (2017), who apply a two-step approach in which an L1 iteratively reweighted least-squares (L1-IRLS) solution is first used for initialization and is then switched to an L1/2-IRLS for additional refinement.
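For concreteness, the objective of Eq. (12) can be evaluated as below (a sketch with our own names; the actual L1-IRLS/L1/2-IRLS solver of Chatterjee and Govindu additionally linearizes in the Lie algebra, which is not shown here):

```python
import numpy as np

def geodesic(Ra, Rb):
    """Angular distance d(Ra, Rb) = arccos((tr(Ra Rb^T) - 1) / 2), in radians."""
    c = (np.trace(Ra @ Rb.T) - 1.0) / 2.0
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def ra_cost(rel_rots, R_glob, rho=lambda d: d):
    """Rotation-averaging objective: sum of rho(d(R_ij, R_j R_i^T)) over edges.
    rho(d) = d gives the L1 cost; IRLS minimizes such robust losses by
    repeatedly re-solving a weighted least-squares problem."""
    return sum(rho(geodesic(Rij, R_glob[j] @ R_glob[i].T))
               for (i, j), Rij in rel_rots.items())
```

The cost is zero for a perfectly consistent set of rotations and grows with every edge whose observed relative rotation disagrees with the current global estimates.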
Next, we extend the idea of utilizing the two-view rotation-only solution for a rotation-only BA (ROBA, Lee and Civera, 2021) to the case of generalized cameras. Given the set of all edges E, the global rotations of the N cameras are computed using the RA. This provides the set of global rotations denoted by {R1, . . . , RN}. For each edge in E, a constraint on two rotations from the set of N global rotations can be generated using Eq. (7). This form allows minimizing the reprojection error, as it is directly related to the image measurements. However, in this form it can only relate two views. To extend the relative orientation problem to handle multiple views, we define a cost given by λ_{H,min}, which can be considered a measure of how good the estimate is and amounts to zero when no noise exists in the system. For each edge, the value of λ_{H,min} is computed while using the global rotation estimates as inputs, and the sum of all costs is then minimized in a single optimization. Lee and Civera (2021) showed that the cost λ_{H,min} performed better and improved the convergence rate. Hence, the optimization problem can be formulated as

    R*_V = argmin_{R_1,...,R_N} C(R_1, . . . , R_N),    (13)

with

    C(R_1, . . . , R_N) = Σ_{ij∈E} λ_{H,min}(R_i, R_j).    (14)

To solve Eq. (13) iteratively, ADAM (Kingma and Ba, 2014), a first-order gradient-based optimization algorithm for stochastic objective functions, was used. The hyper-parameters β1 and β2 were set to the default values (i.e., β1 = 0.9, β2 = 0.999). The step size was initially set to α = 0.01 and then switched permanently to α = 0.001 once the cost increased in five successive iterations.
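The update rule and the step-size schedule described above can be sketched as follows. This is a generic ADAM minimizer (our implementation, with a numerical gradient for illustration); in the actual pipeline the objective would be the summed per-edge λ_{H,min} cost of Eq. (14):

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Central-difference gradient; stands in for an analytic gradient."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def adam_minimize(f, x0, iters=500, beta1=0.9, beta2=0.999, eps=1e-8):
    """ADAM with the schedule used in the paper: alpha = 0.01, dropped
    permanently to 0.001 once the cost increases in five successive iterations."""
    x = x0.astype(float).copy()
    m, v = np.zeros_like(x), np.zeros_like(x)
    alpha, prev_cost, n_increase = 0.01, np.inf, 0
    for t in range(1, iters + 1):
        g = num_grad(f, x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        mh = m / (1 - beta1 ** t)        # bias-corrected first moment
        vh = v / (1 - beta2 ** t)        # bias-corrected second moment
        x -= alpha * mh / (np.sqrt(vh) + eps)
        cost = f(x)
        n_increase = n_increase + 1 if cost > prev_cost else 0
        if alpha > 0.001 and n_increase >= 5:
            alpha = 0.001                # permanent step-size reduction
        prev_cost = cost
    return x
```

In practice the rotations would be parametrized by rotation vectors (one 3-vector per camera) stacked into x, with one camera held fixed as the datum.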

EXPERIMENTS
To test the performance of the proposed global rotation estimation solution, experiments were carried out via simulations and validated using real-world data. For the simulated experiments, we use a 640 × 480 pixel frame camera with a 525 pixel focal length. The distance to the interface, d, was set to 30 mm, and the indices of refraction to µ0 = 1 and µ1 = 1.33. We generated 3-D points at random distances D ∼ U(1, 5) m from the xy-plane while ensuring that each view observed at least 10 corresponding points with neighboring views. The image points were formed by forward-projecting the generated 3-D points using ray-tracing (Kunz and Singh, 2008). They were then perturbed by Gaussian noise characterized by N(0, σ²). Image pairs with more than nin corresponding points formed an edge in E. Our base setup consisted of nviews = 100, and we set nin = 50 and σ = 1 pixel. We then evaluated also the influence of the following alterations to the base setup: i) an increase in the number of minimal corresponding points to nin = 100, thereby limiting the number of permissible pairwise connections; ii) fewer views covering the same scene, nviews = 30, thereby reducing the overlap; iii) more views, nviews = 300, thereby increasing the overlap; iv) a decreased noise level, σ = 0.5 pixel; and v) an increased noise level, σ = 2 pixels. Each setup was simulated by 200 different configurations of randomly sampled camera rotations and a 3-D set of points. We tested these setups on two imaging scenarios: i) a closed loop where n cameras were uniformly distributed one unit apart from one another, while their optical axis direction was uniformly perturbed by θ ∼ U(0, 20°) with respect to the z-axis; and ii) an image block made of nstrips strips, with nviews images per strip. The rotations were uniformly perturbed by θ ∼ U(0, 5°) off the nadir direction.
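The forward projection above traces rays through the flat port. As a sketch of the underlying model, the snippet below refracts an image ray at the interface via the vector form of Snell's law and recovers the equivalent axial parametrization of Sec. (2.1): a modified direction [x, y, k̂f] and a camera center shifted along the optical axis. Constants match the simulation (f = 525 px, d = 30 mm, µ0 = 1, µ1 = 1.33); the sign conventions and function names are our assumptions:

```python
import numpy as np

F, D = 525.0, 30.0        # principal distance [px], interface distance [mm]
MU0, MU1 = 1.0, 1.33      # refractive indices of air and water

def refract(d, n, mu_in, mu_out):
    """Vector form of Snell's law; d and the result are unit directions,
    n is the unit interface normal pointing toward the incoming ray."""
    d = d / np.linalg.norm(d)
    cos_i = -d @ n
    mu = mu_in / mu_out
    cos_t = np.sqrt(1.0 - mu ** 2 * (1.0 - cos_i ** 2))
    return mu * d + (mu * cos_i - cos_t) * n

def modified_ray(x, y):
    """Axial re-parametrization of the in-water ray: returns (v_R, c) with
    v_R = [x, y, k_hat * f] and a center c shifted along the z-axis, so the
    line through c with direction v_R coincides with the refracted ray."""
    d = np.array([x, y, F])            # in-air ray, camera center at the origin
    p = (D / d[2]) * d                 # intersection with the flat port at z = D
    w = refract(d, np.array([0.0, 0.0, -1.0]), MU0, MU1)
    s = -p[0] / w[0] if abs(w[0]) > 1e-12 else -p[1] / w[1]
    z0 = p[2] + s * w[2]               # backward extension crosses the axis here,
                                       # behind the camera (the paper's [0,0,-k])
    v_R = w * (x / w[0]) if abs(w[0]) > 1e-12 else w * (y / w[1])
    return v_R, np.array([0.0, 0.0, z0])
```

Because the in-water ray is steeper than the in-air one, k̂ > 1 and the equivalent center lies behind the physical one, with an offset that varies per image point, which is the depth-dependent axial behavior discussed in the introduction.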
To simulate as realistic a scenario as possible, the relative rotation estimates were computed using the method from Sec. (2.2). The rotation Rij between images i and j initialized the RA solution. Finally, the output of the RA was introduced into the G-ROBA solution. To compute the RA solution, the state-of-the-art model by Chatterjee and Govindu (2017) was used; its implementation was based on the code publicly shared by the authors. For performance evaluation, we note that our method was implemented in Python and tested on an Intel i7-3770 CPU @ 3.40 GHz PC with 16 GB RAM.
As the simulated parameters are given in absolute terms and our solution is in reference to an arbitrary datum, the global rotation estimates (R̂1, . . . , R̂N) do not share the same reference frame as their ground-truth counterparts (R^gt_1, . . . , R^gt_N). Therefore, they must first be aligned with the ground truth to evaluate the accuracy. To do so, it is customary to estimate a rotation that transforms the estimated global rotations to the ground-truth system. Being a nonlinear single-rotation-averaging problem, it is solved iteratively (Hartley et al., 2013). We compute this by minimizing the L1 and L2 norms, yielding two such rotations, R_{L1} and R_{L2}, respectively, that are estimated by:

    R_{L1} = argmin_R Σ_{i=1}^{N} d(R^gt_i, R̂_i R)    (15)

and

    R_{L2} = argmin_R Σ_{i=1}^{N} d(R^gt_i, R̂_i R)²,    (16)

where d(·, ·) denotes the geodesic distance between two rotations, i.e., d(R_1, R_2) = arccos((tr(R_1 R_2^T) − 1)/2). Next, all the estimated global rotations are rotated by R_{L1} and R_{L2} to produce two different alignments corresponding to the L1 and L2 minimization norms. The mean and median angular errors using these two alignment methods are presented and analyzed.
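A compact stand-in for this alignment step is sketched below. Instead of the iterative geodesic L1/L2 minimization used in the paper, it solves the closed-form chordal-L2 variant (minimizing Σ ‖R^gt_i − R̂_i R‖_F²) via an SVD projection onto SO(3); the names are ours:

```python
import numpy as np

def align_rotations(R_est, R_gt):
    """Closed-form chordal-L2 alignment: the rotation R minimizing
    sum_i || R_gt_i - R_est_i @ R ||_F^2 (a single-rotation-averaging
    stand-in for the iterative geodesic alignment used in the paper)."""
    M = sum(Re.T @ Rg for Re, Rg in zip(R_est, R_gt))
    U, _, Vt = np.linalg.svd(M)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # keep det = +1
    return U @ S @ Vt

def mean_angular_error_deg(R_est, R_gt):
    """Mean geodesic error (degrees) after aligning the estimates."""
    R = align_rotations(R_est, R_gt)
    errs = []
    for Re, Rg in zip(R_est, R_gt):
        c = (np.trace((Re @ R) @ Rg.T) - 1.0) / 2.0
        errs.append(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))
    return float(np.mean(errs))
```

For noise-free estimates that differ from the ground truth only by a datum rotation, the recovered alignment is exact and the mean angular error vanishes.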
We compare the performance of the RA and the G-ROBA in terms of the mean angular error following the L1 and L2 alignments. For evaluation, we define the metrics mn1 and mn2 as the mean angular errors (in degrees) following the L1 and L2 alignments, respectively. Similarly, we denote the median angular errors for the L1/L2 alignments as md1 and md2, respectively. Figures 4, 5, 6, and 7 illustrate the results; on each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles (Q1 and Q3), respectively. The whiskers extend to the most extreme data points not considered outliers, and the outliers are plotted individually as blobs.
Closed-loop configuration - Figs. 4 & 5 plot the results of the closed-loop simulation. In all settings considered, the G-ROBA solution demonstrated an improvement over the RA. For example, for the baseline case and using the L1 alignment (Fig. 4), the mean angular error dropped from 2.41° ± 0.63° when using the RA to 1.38° ± 0.57° using the G-ROBA. For the L2 alignment, the mean angular error dropped from 2.39° ± 0.67° to 1.41° ± 0.59° (Fig. 5). The addition of more points improved the results, reaching a mean as low as 1.05° when using our method. Furthermore, using fewer views improved the results when compared to the baseline and more-views cases, with 0.29° compared to 1.38° and 3.89°, respectively. Finally, the noise test showed that our method scales well with the increase of noise, reaching means of 1.22°, 1.38°, and 1.62° for 0.5, 1, and 2 pixels, respectively. This demonstrates the model's capability to handle large quantities of noise while also maintaining a linear trend.

Figure 3: Simulated imaging configurations. (left) We simulate a block structure with nstrips strips, each containing nviews views. The rotations are perturbed by random angles θ ∼ U(0, 5°); (right) we uniformly distribute n cameras on a circle on the xy-plane such that neighbors are evenly spaced along the circle perimeter. After aligning their optical axes with the z-axis, we perturb the rotations by random angles θ ∼ U(0, 20°).

Figure 4: Results of the synthetic data test settings. Comparison between RA and G-ROBA (initialized by RA) in terms of the mean angular error after the L1 alignment.
Block configuration - For the base case, and using the L1 alignment, the mean angular error was 1.18° ± 0.29° when using the RA and 0.51° ± 0.18° for the G-ROBA solution. Using the L2 alignment, the respective mean angular errors were 1.42° ± 0.36° and 0.62° ± 0.24° (Fig. 5). Additional points improved the results, with a mean error of 0.39°, and fewer views improved them even further, reaching a 0.17° mean error. These results show, similarly to the closed-loop simulation, that both methods performed better when more points were added and fewer views of the same scene were observed. Comparing these settings to the baseline and more-views cases shows an improvement, with 0.5° and 1.15° for the base and more-views cases, respectively. Hence, for all settings considered, the G-ROBA solution demonstrated improved results compared to the RA. Here also, the angular error for all six settings was lower than that of the closed-loop configuration. We attribute this to the number of edges gained by the side overlap between adjacent strips.

Real-world experiments
To validate our method under real-world conditions, we compared the computation of our global rotations to a publicly shared dataset by Bender et al. (2013). The vehicle was designed for high-resolution, georeferenced imaging (Bryson et al., 2013). It includes 11,278 stereo image pairs and an onboard Imagenex DeltaT 260 kHz multibeam sensor. Integration of both data streams was used in a SLAM solution that is utilized here as a baseline for validation. The set of images covers a seabed strip more than 4 km long. The trajectory consists of a 4 km long straight transect and a zig-zag path that traverses this transect, crossing it five times (Fig. 8). In our analysis we studied the RA and G-ROBA performance on i) the 4 km long open-ended strip and the zig-zag path; ii) both setups using only the left and then only the right cameras; and iii) the whole dataset, whose overall length was 8 km and which facilitates a loop closure.
Data processing: We used SIFT (Lowe, 2004) to extract feature points and the FLANN matcher (Muja and Lowe, 2014) for the matching. To improve the efficiency of the matching stage, and as the image order was known, we limited the possible matches to the five neighboring images in both directions. Considering that the images were acquired in a strip-like campaign, this simplification had hardly any effect on the edges in the view graph. The underwater system parameters were calibrated using the recently proposed method of Elnashef and Filin (2022), and the relative rotation of each edge was then estimated according to Sec. (2.2). As a means to remove bad matches, we computed the relative orientation over all stereo-pairs and removed outlying matches using the random sample consensus (RanSaC) algorithm. Next, we computed the RA solutions and refined them using the proposed G-ROBA solution.
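The outlier filtering step follows the usual RanSaC pattern. Below is a model-agnostic skeleton (our names throughout; in the pipeline, `estimate_fn` would be the relative-orientation solver of Sec. (2.2) and `residual_fn` the per-match GEC residual; the test exercises it with a toy 1-D model instead):

```python
import numpy as np

def ransac(data, estimate_fn, residual_fn, n_min, thresh, iters=200, rng=None):
    """Generic RanSaC skeleton for filtering outlying matches: repeatedly fit
    a model to a minimal sample and keep the largest consensus set, then
    re-estimate the model on all of its inliers."""
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(data), dtype=bool)
    for _ in range(iters):
        sample = rng.choice(len(data), size=n_min, replace=False)
        model = estimate_fn(data[sample])
        inliers = residual_fn(model, data) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return estimate_fn(data[best_inliers]), best_inliers
```

The number of iterations would normally be chosen from the expected outlier ratio; 200 is only a placeholder.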
Validations: Results of the three scenarios and test sets are listed in Table 3. For all scenarios and subsets, our solution outperformed the RA error-wise. In scenario #1, we computed the angular error for the two subsets, namely a zig-zag path with 6278 stereo images and a straight path with 5000 stereo images, with respect to their baseline. We observe that in the straight-path set the RA error was large, reaching a mean error of 6.22°; these results were improved by our solution, which reduced the error to 4.89°. We also list the run-time for each set and show that the RA method is more efficient by at least an order of magnitude compared to our solution (Table 3).

Figure 8: (Top) The data are partitioned into two patches, a zig-zag path and a straight path; (Middle) the data are partitioned into the left and right images of the stereo-pairs. Note that the offset between the two lines is enlarged for better illustration; (Bottom) full dataset with a side-view magnification at the intersection between the zig-zag and straight paths.
This is a consequence of adding more edges at the loop-closure positions, namely the intersection between the zig-zag and straight paths (Fig. 8). To measure the consistency of the solution, we computed the angular error of the alignments between the left (11,278 images) and right (11,278 images) sets (Table 3). The mean errors were as low as 0.16° when using G-ROBA. Applying both methods over the entire dataset (scenario #3) yielded mean errors (mn1) of 3.52° for the RA compared to 2.27° for the G-ROBA, demonstrating once again that our solution provides an improvement over the RA and reduces the overall error.

CONCLUSIONS
This paper proposed a generalized version of the rotation-only bundle adjustment for the generalized camera model. It presented a complete pipeline adapted for that purpose that begins with a relative orientation of this axial camera form, through the establishment of a view-graph for the image set, the initialization by applying a rotation averaging procedure, and the generalized refinement of the global rotations. The sequential process poses little computational demand on the complete bundle adjustment solution while improving the estimated parameters that are introduced into it and the ability to filter outlying matches. Our experiments demonstrate that the proposed pipeline is general and performs well even when no loop closure is enforced on the image block. With the introduction of loop closure, the estimated rotations further improve, in some typical constellations providing results sufficiently accurate for coarse mapping tasks. Our evaluations also demonstrated that in all setups the G-ROBA solution outperformed the standard RA. Future research will study the integration of these solutions into global bundle adjustment and SfM procedures, as a means to obtain a suitable underwater global orientation solution.

Table 3: Comparison of the real-world results against the baseline dataset (Bryson et al., 2013), divided into three scenarios. (1) The data are partitioned into two sets, a zig-zag path (6278 stereo images) and a straight path (5000 stereo images), solved independently of one another; both sets are compared to their respective baseline counterparts. (2) The data are partitioned into left (11,278 images) and right (11,278 images) images; the two sets are then compared to their baseline counterparts and to one another (relative rotation between the left and right trajectories).

ACKNOWLEDGMENT
The authors would like to acknowledge the Australian Center for Field Robotics' marine robotics group for providing the data. Funding was provided in part by the Neubauer family foundation.