MLPnP - A Real-Time Maximum Likelihood Solution to the Perspective-n-Point Problem

In this paper, a statistically optimal solution to the Perspective-n-Point (PnP) problem is presented. Many solutions to the PnP problem are geometrically optimal but do not consider the uncertainties of the observations. In addition, it is desirable to have an internal estimate of the accuracy of the estimated rotation and translation parameters of the camera pose. Thus, we propose a novel maximum likelihood solution to the PnP problem that incorporates image observation uncertainties while remaining real-time capable. Further, the presented method is general, as it works with 3D direction vectors instead of 2D image points and is thus able to cope with arbitrary central camera models. This is achieved by projecting (and thus reducing) the covariance matrices of the observations to the corresponding vector tangent space.


INTRODUCTION
The goal of the PnP problem is to determine the absolute pose (rotation and translation) of a calibrated camera in a world reference frame, given known 3D points and corresponding 2D image observations. The research on PnP has a long history in both the computer vision and the photogrammetry community (here called camera or space resectioning). Hence, we would first like to emphasize the differences between the definitions used in both communities.
PnP Classically, two definitions of the problem exist in the literature (Hu and Wu, 2002). In the first, the distance-based definition, the problem is formulated in terms of the distances from the projective center to each 3D point, leading e.g. to minimal P3P solutions (Haralick et al., 1991). The second definition is transformation-based. Here, the task is to determine the 3D rigid body transformation between object-centered and camera-centered coordinate systems, e.g. (Fiore, 2001, Horaud et al., 1989). In both PnP definitions, however, the camera is always assumed to be calibrated and known, allowing image measurements to be transformed into unit vectors that point from the camera to the 3D scene points. More recent approaches extend this classical definition by including unknown camera parameters in the formulation, such as the focal length (Wu, 2015) or radial distortion (Kukelova et al., 2013).
Camera resectioning The task in the photogrammetric definition of the problem is to find the projection matrix in λu = P3×4 X that transforms homogeneous 3D scene points X to homogeneous 2D image points u (Hartley and Zisserman, 2003, Luhmann et al., 2006). The projection matrix is given by P3×4 = K[R|t] and hence contains the camera matrix K as well as the rotation R and translation t.
Fundamentally, then, the main difference between the two communities' definitions is that PnP solves only for the absolute pose of a given, calibrated camera, whereas in camera resectioning the camera is assumed to be unknown and is thus part of the problem formulation. In this paper, we assume the camera to be calibrated and known; we therefore present a solution to the Perspective-n-Point problem.
Still, the need for efficient and accurate solutions is driven by a large number of applications. They range from localization of robots and object manipulation (Choi and Christensen, 2012, Collet et al., 2009) to augmented reality (Müller et al., 2013) implementations running on mobile devices with limited resources, which call for fast solutions. Especially in industrial, surveying or medical environments involving machine vision, (close-range) photogrammetry (Luhmann et al., 2006), point-cloud registration (Weinmann et al., 2011) and surgical navigation (Yang et al., 2015), methods are demanded that are not only robust but also return a measure of reliability.
Even though research on the PnP problem has a long history, little work has been published on efficient real-time solutions that take the observation uncertainty into account. Most algorithms focus on geometric rather than statistical optimality. To the best of our knowledge, the only work that includes observation uncertainty in its framework is the Covariant EPPnP of Ferraz et al. (Ferraz et al., 2014a).
In this paper, we propose a novel formulation of a Maximum Likelihood (ML) solution to the PnP problem. Further, a general method for variance propagation from image observations to bearing vectors is exploited to avoid singular covariance matrices. In addition, we benchmark our real-time method against the state-of-the-art and show on a ground truth tracking data set how our statistical framework can be used to obtain a measure of the accuracy of the unknowns, given the knowledge about the uncertain observations.

RELATED WORK
The minimal number of points to solve the PnP problem is three. Closed-form solutions to this minimal configuration return up to four solutions, and a fourth point can be used for disambiguation. Prominent solutions to the P3P problem requiring exactly three points are (Kneip et al., 2011), (Xu, 2011) and (DeMenthon and Davis, 1992). The stability of such algorithms under noisy measurements is limited, hence they are predominantly employed in RANSAC (Fischler and Bolles, 1981) based outlier rejection schemes. Apart from the P3P methods, P4P (Fischler and Bolles, 1981, Triggs, 1999) and P5P (Triggs, 1999) algorithms exist, still depending on a fixed number of points.
Most solvers, however, can cope with an arbitrary number of feature correspondences. They can be broadly categorized into iterative and non-iterative, as well as polynomial and non-polynomial solvers. Table 1 lists all methods that are additionally evaluated in the experimental section.
Iterative solutions differ in the objective function that is minimized. In the case of LHM (Lu et al., 2000), the pose is initialized using a weak perspective assumption. The method then proceeds by iteratively minimizing the object space error, i.e. the orthogonal deviations between observed ray directions and the corresponding object points in the camera frame. In the Procrustes PnP (Garro et al., 2012), the error between the object points and the back-projected image points is minimized, where the back-projection is based on the iteratively estimated transformation parameters. As iterative methods are usually only guaranteed to find local minima, (Schweighofer and Pinz, 2008) reformulated the PnP problem as a semidefinite program (SDP). Despite its O(n) complexity, the runtime of this global optimization method remains substantial.
Likewise, early non-iterative solvers were computationally demanding, especially for large point sets. Among them are (Ansar and Daniilidis, 2003) with O(n^8), (Quan and Lan, 1999) with O(n^5) and (Fiore, 2001) with O(n^2). The first efficient non-iterative O(n) solution is the EPnP of (Moreno-Noguer et al., 2007), which was subsequently extended by (Lepetit et al., 2009) with a fast iterative method to improve the accuracy. The efficiency comes from reducing the PnP problem to finding the positions of four control points, of which every 3D point is a weighted sum. After obtaining a linear solution, the weights of the four control points are refined using Gauss-Newton optimization.
The most recent non-iterative state-of-the-art solutions are all polynomial solvers. The Robust PnP (RPnP) (Li et al., 2012) first splits the PnP problem into multiple P3P problems, resulting in a set of fourth-order polynomials. Then the squared sum of those polynomials as well as its derivative is calculated, and the four stationary points are obtained. The final solution is the stationary point with the smallest reprojection error. In the Direct Least-Squares (DLS) method (Hesch and Roumeliotis, 2011), a nonlinear object space cost function is formulated, and polynomial resultant techniques are used to recover the (up to 27) stationary points of an equation system of fourth-order polynomials. A drawback of this method is the parametrization of the rotation in the cost function by means of the Cayley parameters. To overcome this, the Accurate and Scalable PnP (ASPnP) (Zheng et al., 2013b) and the Optimal PnP (OPnP) (Zheng et al., 2013a) use a quaternion-based representation of the rotation matrix. Subsequently, the (up to 40) solutions are found by applying the Gröbner basis technique to the optimality conditions of the algebraic cost function.
The linear, non-iterative Unified PnP (UPnP) goes one step further and also integrates a solution to the NPnP (Non-Perspective-n-Point) problem. The DLS formulation is extended to include non-central camera rays, and the stationary points of the first-order optimality conditions on the sum of object space errors are found using the Gröbner basis methodology.
Thus far, all methods assume that the observations are equally accurate and free of erroneous correspondences. The first PnP method that includes an algebraic outlier rejection scheme within the pose estimation is an extension of the EPnP algorithm called Robust Efficient Procrustes PnP (REPPnP) (Ferraz et al., 2014b). Outliers are removed from the data by sequentially eliminating correspondences that exceed a threshold on an algebraic error. The procedure remains efficient, as the algorithm operates on the linear system spanned by the virtual control points of EPnP and thus avoids recalculating the full projection equations in each iteration. After removing the outliers, the final solution is attained by iteratively solving the closed-form Orthogonal Procrustes problem.
Yet another extension of the aforementioned EPPnP is termed Covariant EPPnP (CEPPnP) (Ferraz et al., 2014a). It is the first algorithm to inherently incorporate observation uncertainty into the framework. Again, the linear control point system of EPnP is formulated. The covariance information of the feature points is then transformed to that space using its Jacobian. Finally, the Maximum Likelihood (ML) minimization is approximated by an unconstrained Sampson error.
In this paper, we propose a new real-time capable, statistically optimal solution to the PnP problem that leverages observation uncertainties. We propagate 2D image uncertainties to general bearing vectors and show how a linear ML solution with non-singular covariance matrices can be obtained by using the reduced observation space presented by Förstner (Förstner, 2010). In contrast to CEPPnP, where the ML estimator is used to obtain an estimate of the control point subspace, we optimize directly over the unknown quantities, i.e. rotation and translation, and thus directly obtain pose uncertainties. Finally, we compare the results of our algorithm to a ground truth trajectory and show that the estimated pose uncertainties are very close to the ground truth.

MLPNP
The task of Perspective-n-Point (PnP) is to find the rotation R ∈ SO(3) and translation t ∈ R^3 that map the I world points pi, i = 1, .., I to their corresponding observations vi in the camera frame. This relation is given by:

λi vi = R pi + t, (1)

where λi are the depths of the points and the observations vi are the measured and thus uncertain quantities of unit length, i.e. ||vi|| = 1. In the following, the methodology of our algorithm is explained. The parameters and observations are depicted in Fig. 1.
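As a minimal illustration of this relation, the observation of a world point can be simulated as follows. This is a sketch in Python/NumPy (the paper's implementations are in Matlab and C++), and the function name is ours:

```python
import numpy as np

def observe(R, t, p_w):
    """Map a world point to its unit-length bearing vector via Eq. 1.

    Returns the observation v_i and the depth lambda_i such that
    lambda_i * v_i = R @ p_w + t.
    """
    p_c = R @ p_w + t            # point expressed in the camera frame
    lam = np.linalg.norm(p_c)    # depth of the point along the ray
    return p_c / lam, lam
```

Given vi and λi, the world point can be recovered as pi = R^T(λi vi − t), which is just Eq. 1 solved for pi.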

Observations and uncertainty propagation
Let a world point pi ∈ R^3 be observed by a calibrated camera and let the uncertain observation in the image plane be:

x = [x, y]^T with covariance Σxx, (2)

where the uncertainty about the observed point is described by the 2D covariance matrix Σxx. Assuming an arbitrary interior orientation parametrization, the image point x is projected to its corresponding three-dimensional direction v' in the camera frame using the forward projection function π (e.g. K^-1 in the perspective case):

v' = π(x), (3)

with Jπ(x) being the Jacobian of the forward projection. Thus, the uncertainty of the image point x is propagated using:

Σv'v' = Jπ(x) Σxx Jπ(x)^T, (4)

where the rank of the covariance matrix Σv'v' is two, i.e. it is singular and not invertible. Subsequent spherical normalization yields the final and general observations, which we will refer to as bearing vectors:

v = v' / ||v'||. (5)

Following (Förstner, 2010), the covariance is propagated using:

Σvv = J Σv'v' J^T, with J = (I3 − vv^T) / ||v'||. (6)

Observe that the covariance matrix Σvv remains singular, i.e. a Maximum Likelihood (ML) estimation based on the three residual components of a bearing vector is invalid. Thus, a minimal representation of the covariance information for the redundant representation of the homogeneous vector v is desirable. In the following section, we introduce the nullspace of bearing vectors and show how it can subsequently be used to get an initial estimate of the absolute pose of the camera.
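The propagation chain of Eqs. 2-6 can be sketched as follows for the perspective case, where π reduces to multiplication with K^-1 (an illustrative Python/NumPy sketch, not the authors' code):

```python
import numpy as np

def pixel_to_bearing(x, K_inv, sigma_xx):
    """Lift an uncertain 2D image point to a bearing vector with covariance.

    x:       2D image point [x, y] with 2x2 covariance sigma_xx (Eq. 2)
    K_inv:   inverse camera matrix, acting as the forward projection pi
    """
    v_prime = K_inv @ np.array([x[0], x[1], 1.0])      # Eq. 3
    J_pi = K_inv[:, :2]                                # Jacobian of pi w.r.t. x
    sigma_vpvp = J_pi @ sigma_xx @ J_pi.T              # Eq. 4 (rank 2, singular)
    n = np.linalg.norm(v_prime)
    v = v_prime / n                                    # Eq. 5
    J = (np.eye(3) - np.outer(v, v)) / n               # Jacobian of the normalization
    sigma_vv = J @ sigma_vpvp @ J.T                    # Eq. 6 (still singular)
    return v, sigma_vv
```

Note that Σvv annihilates v (Σvv v = 0), which is exactly the singularity that motivates the tangent-space reduction.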

Nullspace of bearing vectors
The following was developed by (Förstner, 2010). The nullspace of v spans a two-dimensional coordinate system whose axes, denoted r and s, are perpendicular to v and lie in its tangent space:

Jvr(v) = null(v^T) = [r, s]. (7)

The function null(·) calculates the Singular Value Decomposition (SVD) of v^T and takes the two singular vectors corresponding to the two zero singular values. Further, Jvr is an orthonormal matrix, i.e. Jvr(v)^T Jvr(v) = I2. Note that Jvr additionally represents the Jacobian of the transformation from the tangent space to the original vector. Thus, the transpose Jvr^T yields the transformation from the original homogeneous vector v to its reduced equivalent:

vr = Jvr(v)^T v (8)
with nonsingular covariance

Σvrvr = Jvr(v)^T Σvv Jvr(v). (9)

Another way to think about vr is as a residual in the tangent space. In the following, we will exploit Eq. 8 to get a linear estimate of the rotation and translation of the camera in the world frame by minimizing this residual in the tangent space.
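A small sketch of Eqs. 7-9 (Python/NumPy, function names ours): the tangent basis is taken from the SVD, and reducing a singular 3×3 bearing covariance with it yields an invertible 2×2 matrix:

```python
import numpy as np

def nullspace_basis(v):
    """J_vr(v) = null(v^T) = [r, s] (Eq. 7): the two right singular vectors
    of v^T that belong to its zero singular values."""
    _, _, Vt = np.linalg.svd(v.reshape(1, 3))
    return Vt[1:, :].T                        # 3x2 matrix, columns r and s

def reduce_observation(v, sigma_vv):
    """Reduced observation (Eq. 8) and its nonsingular covariance (Eq. 9)."""
    J_vr = nullspace_basis(v)
    v_r = J_vr.T @ v                          # ~ [0, 0] for the observed v itself
    sigma_rr = J_vr.T @ sigma_vv @ J_vr       # 2x2, invertible in general
    return v_r, sigma_rr, J_vr
```

For the measured vector itself, vr is the zero vector; nonzero values of Jvr(v)^T u for some other vector u are residuals in the tangent plane of v.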

Linear estimation of the camera pose
Using Eqs. 1 and 7, we can reformulate Eq. 8:

0 = Jvr(vi)^T (R pi + t), (10)

with λi ≠ 0. Thus, if we knew the absolute pose of our camera, the projection of a world point pi to the tangent space of vi should result in the same reduced coordinates, i.e. zero residual. Expanding Eq. 10 yields:

0 = r1(r11 px + r12 py + r13 pz + t1) + r2(r21 px + r22 py + r23 pz + t2) + r3(r31 px + r32 py + r33 pz + t3)
0 = s1(r11 px + r12 py + r13 pz + t1) + s2(r21 px + r22 py + r23 pz + t2) + s3(r31 px + r32 py + r33 pz + t3) (11)

with p = [px, py, pz]^T. Both equations are linear in the unknowns, i.e. we can stack them in a design matrix A to obtain a homogeneous system of linear equations:

A u = 0, (12)

with u = [r11, r12, r13, r21, r22, r23, r31, r32, r33, t1, t2, t3]^T. As each observation yields two residuals, at least I > 5 points are necessary to solve Eq. 12. Assuming uncorrelated observations, the stochastic model is given by the block-diagonal matrix of inverse reduced covariances:

P = diag(Σvrvr,1^-1, ..., Σvrvr,I^-1), (13)

and the final normal equations are:

N = A^T P A. (14)

We find the u that minimizes Eq. 14 subject to ||u|| = 1 using the SVD:

N = U D V^T. (15)

The solution is the particular column of V that corresponds to the smallest singular value in D. It is determined only up to scale, thus the translational part t̂ only points in the right direction. The scale can be recovered from the fact that the norm of each column r̂1, r̂2 and r̂3 of the estimated rotation matrix R̂ must equal one. Hence, the final translation is:

t = 3 t̂ / (||r̂1|| + ||r̂2|| + ||r̂3||). (17)

The exploited constraint shows that the nine estimated rotation parameters do not define a correct rotation matrix. This can be solved by calculating the SVD of R̂:

R̂ = UR DR VR^T, (18)

and the best rotation matrix minimizing the Frobenius norm is found as:

R = UR VR^T. (19)

Up to this point, a linear ML estimate of the absolute camera pose is obtained. To increase the accuracy, a non-linear refinement procedure is used. A subsequent refinement of the initial estimate is a common procedure, performed e.g. in (Lepetit et al., 2009), (Ferraz et al., 2014b) or (Lu et al., 2000).
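The linear solver of Eqs. 10-19 can be condensed into a few lines of NumPy. This sketch simplifies in two ways: it uses an identity stochastic model (unweighted SVD of A instead of the weighted normal matrix) and resolves the global sign of u via det(R̂) > 0; it is an illustration, not the authors' implementation:

```python
import numpy as np

def tangent_basis(v):
    """Nullspace of v^T (Eq. 7), columns r and s."""
    return np.linalg.svd(v.reshape(1, 3))[2][1:, :].T

def mlpnp_linear(points_w, bearings):
    """Linear pose estimate from I >= 6 non-planar points (Eqs. 10-19)."""
    A = []
    for p, v in zip(points_w, bearings):
        J = tangent_basis(v)
        for k in range(2):                     # one row per tangent axis (Eq. 11)
            a1, a2, a3 = J[:, k]
            A.append([a1 * p[0], a1 * p[1], a1 * p[2],
                      a2 * p[0], a2 * p[1], a2 * p[2],
                      a3 * p[0], a3 * p[1], a3 * p[2],
                      a1, a2, a3])
    u = np.linalg.svd(np.asarray(A))[2][-1]    # Eqs. 15-16, ||u|| = 1
    R_hat, t_hat = u[:9].reshape(3, 3), u[9:]
    if np.linalg.det(R_hat) < 0:               # resolve the sign ambiguity of u
        R_hat, t_hat = -R_hat, -t_hat
    scale = np.mean(np.linalg.norm(R_hat, axis=0))
    t = t_hat / scale                          # Eq. 17
    U, _, Vt = np.linalg.svd(R_hat)            # Eqs. 18-19: nearest rotation
    return U @ Vt, t
```

With noise-free observations the nullspace of A is one-dimensional and the true pose is recovered exactly (up to numerical precision).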

Non-linear refinement
We apply a Gauss-Newton optimization to iteratively refine the camera pose. Specifically, we minimize the tangent-space residuals defined in Eq. 10. This is reasonably fast for two reasons. On the one hand, the nullspace vectors are already calculated and we simply have to compute the dot products between the tangent space vectors and each transformed world point. On the other hand, the linear estimate is already close to a local minimum, so the Gauss-Newton optimization converges quickly. In practice we found that a maximum of five iterations is sufficient. To arrive at a minimal representation of the rotation matrix, we chose to express R in terms of the Rodrigues parametrization.
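The refinement step can be sketched as follows (Python/NumPy). For brevity, this illustration uses a forward-difference numerical Jacobian instead of the analytic one derived for the actual method, and a small hypothetical rodrigues helper:

```python
import numpy as np

def rodrigues(w):
    """Rotation matrix from a Rodrigues (axis-angle) vector."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0., -k[2], k[1]], [k[2], 0., -k[0]], [-k[1], k[0], 0.]])
    return np.eye(3) + np.sin(theta) * K + (1. - np.cos(theta)) * (K @ K)

def refine_pose(w0, t0, points_w, tangent_bases, iters=5):
    """Gauss-Newton on the tangent-space residuals of Eq. 10 over [w, t]."""
    def residuals(theta):
        R, t = rodrigues(theta[:3]), theta[3:]
        return np.concatenate([J.T @ (R @ p + t)        # [dr_i, ds_i]
                               for p, J in zip(points_w, tangent_bases)])
    theta = np.concatenate([w0, t0])
    for _ in range(iters):
        r = residuals(theta)
        eps = 1e-7                            # numerical Jacobian, column by column
        Jac = np.column_stack([(residuals(theta + eps * e) - r) / eps
                               for e in np.eye(6)])
        theta = theta - np.linalg.lstsq(Jac, r, rcond=None)[0]
    return rodrigues(theta[:3]), theta[3:]
```

Because the tangent bases are precomputed, each iteration only costs a handful of dot products plus one small 6-parameter least-squares solve.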

Planar case
In the planar case, the SVD (Eq. 15) yields up to four solution vectors, as the corresponding singular values become small (close to zero). In this case, the solution is a linear combination of those vectors. To solve for the coefficients, an equation system with non-linear constraints would have to be solved. We avoid this using the following trick. Let M = [p1, p2, .., pI] be the 3 × I matrix of all world points. The eigenvalues of S = MM^T give us information about the distribution of the world points. In the ordinary 3D case, the rank of S is three and the smallest eigenvalue is not close to zero. In the planar case, the smallest eigenvalue becomes small and the rank of S is two. If the world points lay on a plane spanned by two coordinate axes, i.e. one coordinate of all world points were constant, we could simply omit the corresponding columns from matrix A and obtain a distinct solution.
In general, the points can lie on an arbitrary plane in the world frame. Thus, we use the eigenvectors of S as a rotation matrix RS and rotate the world points to a new frame:

pi' = RS^T pi. (20)

In this frame, we can identify the constant element of the coordinates and omit the corresponding columns from the design matrix A. Note that this transformation does not change the structure of the problem. The rotation matrix obtained after the SVD (Eq. 19) simply has to be rotated back to the original coordinate frame:

R = R̂ RS^T. (21)

To keep our method as general as possible, the matrix S is always calculated. We then simply switch between the planar and the ordinary case, depending on the rank of S. To determine the rank, we use a rank-revealing QR decomposition with full pivoting and set a threshold (1e-10) on the smallest eigenvalue.
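The planarity test and the frame rotation can be sketched as follows (Python/NumPy; note that, as in the Z = 0 experiments, this eigenvalue test on S = MM^T assumes the plane passes through the origin, and we use eigh rather than a rank-revealing QR for brevity):

```python
import numpy as np

def planar_frame(points_w, tol=1e-10):
    """Detect a planar point set via the eigenvalues of S = M M^T and return
    a rotation R_S whose last row is the plane normal (smallest eigenvector)."""
    M = np.asarray(points_w).T                  # 3 x I matrix of world points
    eigvals, eigvecs = np.linalg.eigh(M @ M.T)  # ascending eigenvalues
    planar = eigvals[0] < tol * max(eigvals[-1], 1.0)
    R_S = eigvecs[:, ::-1].T                    # rows sorted by descending eigenvalue
    if np.linalg.det(R_S) < 0:                  # make it a proper rotation
        R_S[2, :] = -R_S[2, :]
    return planar, R_S
```

For planar points, the rotated coordinates R_S @ pi then have a (near-)zero third component, so the corresponding columns can be dropped from A, and the estimated rotation is rotated back afterwards.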

RESULTS
In this section, we compare our algorithm to all state-of-the-art algorithms using synthetic as well as real data. To reasonably assess the runtime performance of each algorithm, we categorize the state-of-the-art solvers according to their implementation:
Our method is implemented in both Matlab and C++. We integrated the C++ version into OpenGV (Kneip and Furgale, 2014). The Matlab version will be made publicly available on the website of the corresponding author, i.e. all results are reproducible. All experiments were conducted on a laptop with an Intel Core i7-3630QM@2.4GHz.

Synthetic experiments
We use the Matlab toolbox provided by the authors of (Ferraz et al., 2014a, Li et al., 2012, Zheng et al., 2013a) and compare our algorithm in terms of accuracy and speed. The simulation configurations as well as the evaluation metrics are also part of the toolbox and are briefly described in the following. All experiments are repeated T = 250 times and the mean pose errors are reported.
Assume a virtual calibrated camera with a focal length of f = 800 pixels. First, 3D points are randomly sampled in the camera frame on the interval [-2, 2] × [-2, 2] × [4, 8] and projected to the image plane. Then, the image plane coordinates x are perturbed by Gaussian noise with different standard deviations σ, yielding a covariance matrix Σxx for each feature. At this point, we have arrived at Eq. 2. Now, the image plane coordinates x are usually transformed to normalized image plane coordinates by applying the forward projection π. In this synthetic experiment, it simplifies to the perspective division v' = [x/f, y/f, 1]^T. Most algorithms work with the first two elements of v', i.e. normalized image plane coordinates. Instead, we apply Eq. 5 and spherically normalize the vectors to simulate a general camera model. Further, we propagate the covariance information using Eq. 6.
Finally, the ground truth translation vector tgt is chosen as the centroid of the 3D points and the ground truth rotation Rgt is randomly sampled. Subsequently, all points are transformed to the world frame yielding pi.
The rotation accuracy in degrees between Rgt and R is measured as max_{k=1..3}(arccos(r_{k,gt}^T · r_k) × 180/π), where r_{k,gt} and r_k are the k-th columns of the respective rotation matrices. The translation accuracy is computed in % as ||tgt − t|| / ||t|| × 100.
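These two metrics translate directly into code; a small NumPy sketch (function names ours):

```python
import numpy as np

def rotation_error_deg(R_gt, R):
    """Maximum angle in degrees between corresponding columns of R_gt and R."""
    cos_angles = np.clip(np.sum(R_gt * R, axis=0), -1.0, 1.0)  # r_k,gt^T r_k
    return np.degrees(np.max(np.arccos(cos_angles)))

def translation_error_percent(t_gt, t):
    """Relative translation error in percent."""
    return np.linalg.norm(t_gt - t) / np.linalg.norm(t) * 100.0
```

The clip guards against arccos receiving values marginally above 1 due to floating-point round-off.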
The first column of Fig. 3 depicts the ordinary case and the second column the planar case, where the Z coordinate of the world points is set to zero. The first two rows of Fig. 3 depict the mean accuracy of rotation and translation for an increasing number of points (I = 10, .., 200). Following the experiments of (Ferraz et al., 2014a), the image plane coordinates are perturbed by Gaussians with σ = [1, .., 10], with 10% of the features perturbed by each noise level, respectively.
For the third and fourth rows of Fig. 3, the number of features is kept constant (I = 50) and only the noise level is increased. This time, a random σ is chosen for each feature between 0 and a maximum value, which ranges from 0 to 10.
The experiments for the ordinary case show that MLPnP outperforms all other state-of-the-art algorithms in terms of accuracy. Moreover, it stays among the fastest algorithms, as depicted in Fig. 2. For fewer than 20 points, the algorithm is even faster than EPnP, which remains the fastest PnP solution overall.

Real Data
To evaluate the performance of our proposed MLPnP method on real data, we recorded a trajectory of T = 200 poses of a moving, head-mounted fisheye camera in an indoor environment. The camera is part of a helmet multi-camera system depicted in Fig. 4a. We use the camera model of (Scaramuzza et al., 2006) and the toolbox of (Urban et al., 2015) to calibrate the camera. All sensors have a resolution of 754×480 pixels and are equipped with identical fisheye lenses, each covering a field of view of 185°. The ground truth poses Mt_gt are obtained by a Leica T-Probe mounted onto the helmet and tracked by a Leica laser tracker. This tracker was additionally used to measure the 3D coordinates of the world points. The position accuracy of this system is σpos ≈ 100 µm = 0.1 mm and the angular accuracy is specified as σangle ≈ 0.05 mrad. For each frame t of the trajectory, the epicenters of I passpoints (planar, black filled circles) are tracked over time, as depicted in Fig. 4b. The number of passpoints I visible in each frame ranges from 6 to 13.
To assess the quality of the estimated camera poses, we calculate the mean standard deviations (SD) σ̂R and σ̂t of the relative orientation between camera and T-Probe. The relative orientation for each frame is calculated as:

ΔMt = Mt^-1 Mt_gt, (22)

where Mt = [R|t] is the current camera pose, estimated by each of the 13 algorithms respectively.

Including Covariance Information
For the first frame t = 1, the covariance Σxx of the measurements is set to the identity I2 for each feature. At this point, the only thing we know about the features is that they are measured with an identical accuracy of 1 pixel. For the following frame t = 2, however, estimation theory can tell us how well a point was measured in the previous frame, given the estimated parameter vector.
Let A be the Jacobian of our residual function (Eq. 10). Then, the covariance matrix of the unknown rotation and translation parameters is:

Σrt = (A^T P A)^-1, (23)

and the cofactor matrix of the observations is:

Qv̂v̂ = A Σrt A^T. (24)

Observe that this gives us the cofactor matrix in the reduced observation space. Thus, we project it back to the true observation space of the respective point i:

Σvivi = Jvr(vi) Qv̂iv̂i Jvr(vi)^T. (25)

Now, we can set the covariance matrices Σvivi for frame t = 2 to the estimated values from the first frame t = 1, and so on. The results of this method are depicted as MLPnP+Σ in Fig. 5. Clearly, the consistent use of the full covariance information gives another performance increase. Fig. 5a shows the mean standard deviation of the three orientation components.

External vs. internal accuracies
Usually, it is desirable to have a measure of reliability for the result of an algorithm. A common approach in geometric vision is to quantify the quality of the pose in terms of the reprojection error. Most of the time, this may be sufficient: many computer vision systems are based on geometric algorithms and need a measure of how accurately and robustly objects are mapped into the camera images. Sometimes, however, it is useful to get more information about the quality of the estimated camera pose, e.g. for probabilistic SLAM systems. As MLPnP is a full ML estimator, we can employ estimation theory to obtain the covariance of the estimated pose parameters and thus their standard deviations as a measure of reliability. Let r be the vector of stacked residuals [dri, dsi]^T. First, we calculate the variance factor:

σ̂0² = r^T P r / b, (26)

with P being the stochastic model and b = 2I − 6 the redundancy of our problem. Then, the 6×6 covariance matrix Σrt of the unknowns is extracted with Eq. 23 and scaled by σ̂0². Finally, the 6×1 vector of standard deviations of the camera pose is:

σ̂rt = sqrt(diag(σ̂0² Σrt)). (27)

We extract this covariance matrix for every frame of the trajectory and calculate the mean standard deviations over all frames. Tab. 2 depicts the results compared to the standard deviations calculated against the ground truth. The marginal differences between the internal uncertainty estimates and the external ground truth uncertainties provide empirical evidence that our proposed ML estimator is statistically optimal.
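This uncertainty extraction (Eqs. 26-27 together with Eq. 23) fits in a few lines; a NumPy sketch with shapes as described in the text:

```python
import numpy as np

def pose_standard_deviations(A, P, r):
    """Internal accuracy of the estimated pose.

    A: (2I x 6) Jacobian of the residual function, P: (2I x 2I) stochastic
    model (block-diagonal inverse observation covariances), r: stacked
    residuals [dr_i, ds_i]. Returns the variance factor and the 6x1 vector
    of pose standard deviations.
    """
    b = A.shape[0] - 6                             # redundancy 2I - 6 (Eq. 26)
    s0_sq = (r @ P @ r) / b                        # variance factor
    Sigma_rt = s0_sq * np.linalg.inv(A.T @ P @ A)  # scaled covariance (Eq. 23)
    return s0_sq, np.sqrt(np.diag(Sigma_rt))       # Eq. 27
```

Since A and P are already available from the last Gauss-Newton iteration, this adds only one 6×6 inversion per frame.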

CONCLUSION
In this paper, we proposed a Maximum Likelihood solution to the PnP problem. First, the variance propagation from 2D image observations to general 3D bearing vectors was introduced. The singularity of the resulting 3×3 covariance matrices of such bearing vectors motivated the subsequent reduction of the covariance to the vector tangent space. In addition, this reduced formulation allows obtaining a solution to the PnP problem in terms of a linear Maximum Likelihood estimation, followed by a fast iterative Gauss-Newton refinement. Finally, the method was tested and evaluated against all state-of-the-art PnP methods. It shows execution times similar to the fastest methods and outperforms the state-of-the-art in terms of accuracy.

ACKNOWLEDGMENT
This project was partially funded by the DFG research group FG 1546 "Computer-Aided Collaborative Subway Track Planning in Multi-Scale 3D City and Building Models". In addition, the authors would like to thank Patrick Erik Bradley for helpful discussions and insights.

[Fig. 5: (a) mean rotation SD and (b) mean translation SD σ̂t [cm] for MLPnP, MLPnP+Σ, LHM, EPnP+GN, RPnP, DLS, PPnP, ASPnP, SDP, OPnP, EPPnP, CEPPnP and UPnP.]