Fast global stereo matching via energy pyramid minimization

: We deﬁne a global matching framework based on energy pyramid, the Global Matching via Energy Pyramid (GM-EP) algorithm, which estimates the disparity map from a single stereo-pair by solving an energy minimization problem. We efﬁciently address this minimization by globally optimizing a coarse to ﬁne sequence of sparse Conditional Random Fields (CRF) directly deﬁned on the energy. This global discrete optimization approach guarantees that at each scale we obtain a near optimal solution, and we demonstrate its superiority over state of the art image pyramid approaches through application to real stereo-pairs. We conclude that multiscale approaches should be build on energy pyramids rather than on image pyramids.


INTRODUCTION
Precise Digital Surface Models (DSM) are widely employed in urban monitoring, geological surveys, architecture, or archeology.DSM are now mostly generated using remote sensing surveys based on optical stereo-imaging, interferometric Synthetic Aperture Radar (SAR), or Light Detection And Ranging (LiDAR) acquisitions (Li et al., 2005).In this paper, we focus on optical stereo-imaging.
The volume of available optical aerial and satellite images has sky-rocketed during this last decade.Moreover, the resolution and the size of these images have also vastly increased, and it is now common for satellites with push-broom sensors to produce images of more than 35,000 × 30,000 pixels with resolution up to 50 cm Ground Sampling Distance (GSD) (Worldview, 2014).Aerial imaging with frame camera achieve resolution better than 10 cm GSD with 20,000 × 13,000 pixels per images (UtraCam, 2014).
DSM with impressive accuracy and resolution can now been computed using multi-stereo pairs (Acute3D, 2014, 123DCatch, 2014, UtraMap, 2014, Pix4d, 2014, Micmac, 2014), such techniques are based on the redundancy of view points as one ground point is viewed by up to hundred of images.We work in a slightly different context where only one stereo-pair is available, which is typical of satellite imagery.
DSM computation from stereo pairs involves a number of steps described for instance in (Hartley and Zisserman, 2004): calibration, aerotriangulation, eventually rectification, and then the estimation of the disparity map; finally, the DSM is computed from the geometry of acquisition and the estimated disparity map.In this paper, we only focus on the estimation of the disparity map in epipolar geometry (Zhang, 1998).
Numerous methods have been designed to estimate disparity maps e.g.(Scharstein et al., 2001, Brown et al., 2003, Ansar et al., 2004, Tombari et al., 2008).State of the art approaches estimate disparity maps by defining an energy to minimize, which commonly enforces a notion of matching and a notion of regularity on the disparity maps.To speed up the computation, the photogrametry community has developed semi-global matching approaches (Hirschmuller, 2005, Pierrot-deseilligny andPaparoditis, 2006) that compute local optimum solutions along different directions and aggregate them together.However, no mathematical guarantee has been given on the aggregation of local solutions to form a global optimum.Under different constraints, the computer vision community has developed more advanced techniques to globally optimize the matching problem (Szeliski et al., 2008, Kappes et al., 2013).These techniques called global matching are mathematically more sound than semi-global matching as they guarantee a near global optimum (Boykov et al., 2001).However, due to their computational complexity, they have only been applied to small images, i.e., less than 1000 × 1000 pixels, as proof of concept (Middleburry, 2014, Klaus et al., 2006, Kolmogorov and Zabih, 2001) and are not yet scalable to accommodate the large sizes of remote sensing images.Very recently, the discrete optimization community has started to show some interest for variable grouping as a technique to solve global optimization more efficiently, (Bagon and Galun, 2012, Komodakis, 2010, Kim et al., 2011).However, only results from (Middleburry, 2014) have been presented for stereo imagery.
Our contribution is to bring the global matching techniques up to the dimensionality of remote sensing data and to offer the photogrammetry community a sound mathematical framework both in terms of modeling and optimization.In this paper, we describe how to estimate disparity maps based on energy minimization.Then, we propose an algorithm to efficiently solve such optimization problems with a multiscale approach based on an energy pyramid rather than the traditional image pyramid.Finally, we demonstrate the improvements of our approach through an application to real stereo acquisitions.

Probability formulation
Let Ir be a reference image and G = (V, E) its associated graph.The set of nodes V consists of the pixels of Ir and the set of edges E is defined by the 4 connectivity as illustrated in Fig. 1   ) .We measure how a given d fits the data Ir and It by defining: The definition of P is context dependent, but most approaches enforce: (1) a notion of the similarity between Ir and It •(id+d) (where id refers to the identity operator) by expressing PM (d|Ir, It) and; (2) a notion of regularity for d by expressing PR(d|Ir).If we suppose that these two probabilities are independent, we can write: Instead of directly working with probabilities, we prefer using the energy domain as it is easier to define a measure on images.One can simply relate probability density function to energy via the Gibbs measure: With Z being a normalization factor so that the integral of the probability function equal to 1.

Energy formulation
Through Eq. 3, we relate E, EM , and ER to P , PM , and PR respectively, which gives the following energy: We define a pixel-wise similarity measure based on the similarity function ρ.Commonly used similarity functions are L1 or L2 norms (Birchfield and Tomasi, 1998), Normalized Cross Correlation (NCC) or Zero Normalized Cross Correlation (ZNCC) (Brown, 1992), and the different versions of the Mutual Information (Viola and Wells, 1995, Kim et al., 2003, Hirschmuller, 2005).In any case, the matching energy of a pixel p is defined as: If the similarity measure is defined on a patch, we apply a rigid translation to the patch.Here, we use the ZNCC coefficient as it is robust to changes of illumination and contrast between Ir and It that appear due to specular objects or different acquisition times.
To enforce the regularity of d we choose to penalize the L1-norm of its discretized gradient, modulated by a weight function w.For each edge pq ∈ E: With : λ1, λ2, and σ are scalar parameters.The L1-norm of the gradient naturally enforces piece-wise constant disparities.The weight function w(p, q) relaxes the regularization on radiometric discontinuities of the reference image as in (Gamble andPoggio, 1987, Boykov et al., 2001).This is an effective heuristic as most of the edges of the disparity maps are also edges of the image Ir.
Alternatively, we could use the L1-norm of the laplacian rather than the gradient but this would lead to optimizing second order Conditional Random Fields that despite recent progress in solvers remain intractable in the context of this study (Komodakis and Paragios, 2009, Ishikawa, 2011, Fix et al., 2011).

Discrete Conditional Random Fields
We have built the measure EM of how d fits our data and the measure ER of how d respects the a priori knowledge on the disparity map.However, we still need to find the most probable disparity map d * , i.e., the maximum a posteriori, by minimizing Eq. 8 over d ∈ D: Finding d * means optimizing a continuous Conditional Random Field (CRF), continuous because d belongs to the continuous space D and conditional because PR depends on the data, i.e., Ir.This task is extremely difficult mainly because d is continuous and non-convex as Fig. 2 illustrates.Instead, we discretize D for each nodes p ∈ V to the discrete set Dp, so that we now deal with a discrete CRF that is much easier to solve, with d living in (Dp)p.
We use the vocabulary of the discrete optimization community and we introduce the notion of graph, unary term, edge cost, distance function, and label set for a CRF.
Our graph is directly the one of the image Ir, i.e., G = (V, E).
We introduce D = p∈V Dp, the union of all tested disparities.
To each disparity of D is associated a label, i.e. an index, in the label space L. Hence, each node p has a different label space Lp ⊂ L that relates to the disparity set Dp.
To each node p ∈ V and each label l ∈ L, we associate a unary term corresponding to Eq. 5: If l / ∈ Lp the configuration is impossible, and we associate an infinite unary cost.
To each edge pq ∈ E, we associate an edge cost : To each pair of labels (l1, l2) ∈ L we associate a distance function: For the sake of completeness, each edge pq ∈ E and each pair of labels (lp, lq) ∈ L defines a pairwise term.
We finally optimize the energy of the following discrete CRF: After optimizing Eq. 13 we obtain a labelling l * that relates to the disparity map d * .For each nodes p ∈ V, the following holds: We note CRF 1 = [G, c, w, δ] the CRF computed from Ir and It and defined by Eq. 13.CRF 1 belongs to the class of first order CRF because for each edges pq ∈ E, the distance function δ only depends on the two labels lp and lq.The size of CRF 1 depends on: (1) its spatial component, i.e., the number of nodes V and the number of edges E; and (2) its label component i.e., the number of label that relates in our case to the number of disparities per node p to evaluate.
The complexity to solve a first order CRF depends on the relative contribution to the energy of the unary terms and the pairwise terms.The higher the contribution of the pairwise terms, the more complex is the CRF optimization.The nature of the distance functions δ is also important.If δ is issued from a metric function, as in our case, then we have mathematical guarantee to retrieve a solution close to the global optimum while optimizing the CRF (Boykov et al., 2001, Komodakis andTziritas, 2007).
During the last decade numerous advances have been achieved to optimize this problem (Geman and Geman, 1984, Felzenszwalb and Huttenlocher, 2004, Kolmogorov, 2006, Boykov et al., 2001, Komodakis et al., 2007a, Komodakis, 2010).Our attention was drawn to Fast-PD, (Komodakis andTziritas, 2007, Komodakis et al., 2007b) for two different reasons: (1) it is the fastest algorithm currently available (Kappes et al., 2013); ( 2) it has the ability to use an input as initialization.This last property is extremely interesting as our multi-scale approach benefits from a hot-start of the CRF optimization, using the solution found at a previous scale.

MULTI-SCALE SCHEME AND SPARSE CRF
3.1 The multi-scale scheme In our remote sensing context images and disparity to recover are large.Directly minimizing Eq. 13 would be inefficient both in terms of memory consumption and computational speed because the needed discretized disparity space D will be very large.Instead, we propose a multi-scale approach to efficiently explore the solution space D with a succession of discretized disparity spaces D.
A multi-scale approach is valid because of the particular structure of the disparity map to retrieve.Indeed, while the full disparity range is large at the image scale, one can notice that locally the range is only a mere fraction of the full range.Natural topography is the main contributor to widening the range of disparities, which has a relatively low spatial frequency at the resolution we work at.Moreover, the local range of disparities is mainly due to manmade objects such as buildings, which tend to have higher spatial frequencies than natural features.Hence, the largest low spatial frequency disparities are resolved at the coarsest scales, while the smallest high spatial frequency disparities are resolved at the finest scales.
Our multiscale approach builds coarse to fine sequences of CRF, CRF n , ..., CRF f , ..., CRF 1 , with CRF 1 being defined at the full resolution of the images Ir and It, and CRF f being defined at a downsampling factor f .Our multiscale approach reduces the size of the CRF on both the spatial and label components.To build CRF f from CRF 1 we can use two different approaches: (1) image pyramid or (2) energy pyramid.We compare the two approaches in 4.1.
For both pyramid approaches, a crucial step is to define the discretized disparity space  We introduce the notion of energy manifold as the hyper-surface formed by the set of unary terms.The energy manifold lives in a 4 dimensional space.One dimension is linked to each spatial component of the unary terms.One dimension is linked to each disparity component of the unary terms.

Image pyramid: the GM-IP algorithm
The Global Matching with Image Pyramid (GM-IP) algorithm is an image pyramid approach based on downsampling the images to reduce both components of CRF 1 .First, we downsample Ir and It to I f r and I f t .This directly reduces the spatial component of the CRF and it smooths the energy manifold, i.e., the label component of CRF.The smoothing allows to reduce the sampling density for (D f p )p, trimming the label component of the CRF.The image pyramid is defined in Algorithm 1.
As the images have been downsampled by a factor f , it is enough to use a 0.5 × f discretization factor for (D f p )p.
At first glance, the downsampling seems to be beneficial as it reduces both components of the CRF.Unfortunately, the smoothing on the label component destroys the energy manifold as illustrated by the Fig.
at coarse scales might be discarded.This generates, especially at coarse scales, artifacts in the retrieved disparity maps d f .Some of these artifacts can be corrected at a finest scale only if they are not too important.This phenomenon is well known of the optical flow community (Horn and Schunck, 1981).While different heuristics, such as post-processing filtering between scales have been proposed (Sun et al., 2010), none can guarantee to correct these artifacts.This is the main motivation to build a proper representation of the energy manifold through an energy pyramid.
Figure 3: The unary terms (solid line) computed with the image pyramid (k = 8) poorly represent the energy manifold (dashed line).

Enegy pyramid: the GM-EP algorithm
Contrary to the GM-IP algorithm, the Global Matching via Energy Pyramid (GM-EP) algorithm is based on an energy pyramid approach that builds a more faithful representation of the energy as it directly downsamples CRF 1 .Our approach is inspired from the work of (Bagon andGalun, 2012, Kim et al., 2011).We explain the downsampling procedure in three steps: (1) we define how to downsample the graph G of CRF 1 ; (2) we define the coarse edge costs; and (3) given a search space, we define the coarse unary terms.

Downsampling the graph
From a downsampling factor f we define G N odes a function that groups nodes V of G in packets of f × f nodes.These packets define a set of nodes V f .This is illustrated by Fig. 4 as black circled nodes of V are grouped in a 2 × 2 fashion, creating the squared nodes of V f (f=2).
The grouping of nodes separates the edges of E in two different kinds: (1) edges that belongs to the same node of V f , i.e., the black thin solid lines in Fig. 4; and ( 2) edges that belongs to two different nodes of V f , i.e., the dashed lines in Fig. 4. We discard the first kind of edges as they only contribute to a constant energy term.We define G Edges as the function that groups the second kind of edges together.G Edges defines the new set of edges E f , i.e., the thick solid lines in Fig. 4.
So far, the grouping procedure defines from

Downsampling the edge costs
The grouping procedure also naturally defines the edges costs w f of CRF f as for each edges pq f ∈ E f we have: 3.4 Using sparsity to reduce the size of CRF Thus far, we have defined two pyramid approaches to reduce both the spatial and label components of Eq. 13 that build a sequence of CRF, [CRF n , ..., CRF f , ..., CRF 1 ].While we move through the scales, we always center the search space around the revised solution d.Consequently, due to Eq. 9, many unary terms may have an infinite value as illustrated in Fig. 6.Hence, we know beforehand that some configurations of d f are impossible.
To speed up the optimization process we indicate the solver, Fast-PD, not to evaluate the impossible configurations.
A potential approach is to remove all infinite unary terms, reorganize the CRF with respect to the label space and modify the distance function δ to take into account dinit.This leads to a new distance function δ The main drawback is that the new distance function δ ′ is not derived from a metric as it does not respect the triangular inequality.Therefore, we are only mathematically guaranteed to find a local optimum during the optimization (Komodakis et al., 2007a).
Instead, we remove all infinite unary terms but we do not reorganize the CRF and we keep δ as a distance function.This leads Figure 6: The multiscale approach generates sparsity in the CRF sequence, and only the coarsest CRF is assured to be full.The sparsity is due to defining search space around the previous up sampled scale solution dinit to a sparse CRF that keeps the metric properties of the distance function δ.Hence, we are mathematically guaranteed to find a near global optimum to the solution.We design our own implementation of Fast-PD to accept sparse CRF.Modifying Fast-PD for sparse CRF input is beyond this paper's scope, but this will be the subject of a future publication.

Factory
Urban Figure 7: Image Ir from subsets of a stereo pairs.Each subset is 2500 × 2500 pixels with a GSD of 10 cm.
We use a stereo pair from an aerial survey above an urban environment acquired with (UtraCam, 2014) of 11000 × 20000 pixels.We use our own calibration and aerotriangulation to transform the images into an epipolar geometry limiting the optimization of d to the horizontal direction.We process overlapping tiles of 5000 × 5000 pixels and extract the two subsets presented in Fig. 7 to compare the disparity maps.

Comparing GM-EP and GM-IP algorithms
We compare the GM-EP and GM-IP algorithms by reporting the disparity maps among each scale with the associated final energy and global computation time.For both algorithms we use 4 scales, f ∈ [8, 4, 2, 1], with a search range of [−5 : 0.5 : 5] × f around dinit, and we only perform one iteration per scale.We also define a baseline by solving CRF 0 over a large range of disparities, [−40 : 0.5 : 40].Note that the baseline can here be computed because we only focus on a small subset of the entire image.The results are presented in Fig. 8.
The black dots that appear in the upper left part of the disparity maps of the image pyramid and the baseline are due to moving vehicles as the stereo pair is issued from quasi-simultaneous acquisitions.Moreover, the direction of motion is very different from the epipolar lines.Our model does not cope with such motion and we do not account for these artifacts in our comparison.
The coarsest scales, f = 8 and f = 4, illustrate how the energy pyramid of the GM-EP algorithm performs a better approximation of the energy manifold than the image pyramid of the GM-IP algorithm.Indeed, the GM-IP disparity maps show smoothed factory edges.However, the same edges appear sharp in the GM-EP results, and even the chimney in top right part of the factory is well defined at the scale f = 4.At the finest scales f = 2 and f = 1 the GM-IP is only able to sharpen some of the blurred edges.For instance, the two tubular elements in the center of the images remain blurred while they appear sharp in GM-EP disparity maps.The visual inspection is confirmed by the energy measurements in Table 1 as the final energy of the GM-EP approach is lower than that of GM-IP.The computation time given in Table 1, is however slightly in favor of GM-IP.
In Fig. 9, we verify how well the GM-EP compares to the baseline.Both results are very close.Nevertheless, little details such as lampposts in the lower left part of image are better represented in the baseline.This is also confirmed in Table 1 as the GM-EP final energy is slightly superior to the baseline energy.However, the computation of the baseline is more than 3 times superior to GM-EP, and this non-pyramidal approach is non-feasible for large images due to exponential memory consumption.

MicMac comparison
We also compare the GM-EP to (Micmac, 2014), and we only report the final disparity map as the computation time is implementation dependent and the energy is linked to the model, which differs for both algorithms.There are three main differences between the GM-EP and MicMac: (1) the GM-EP model is more complex than the one used in MicMac, i.e., both models seem equivalent if λ2 = 0 in Eq. 7; (2), MicMac uses an image pyramid approach while the GM-EP works on an energy pyramid approach; and, (3) MicMac relies on semi-global optimization while GM-EP uses global optimization that produces near optimum solutions.
On both images, MicMac produces disparity maps with noisy areas that correspond to shadows the in stereo-pair, while the GM-EP is unaffected by shadows.This is due to the difference of models where the weight term of Eq. 7 increases the regularization on constant radiometric areas of the reference image.
Like the GM-IP, MicMac uses an image pyramid that tends to produce blurred edges when dealing with large disparity range.This is illustrated in the top left image of Fig. 10     The GM-EP algorithm obtains a lower energy than GM-IP for a slightly longer processing time.The Baseline obtains the lowest energy with a significant gap compared to GM-IP, but is more than 3 times slower on an Intel Xeon 8 core CPU computer (the code is not optimized).Therefore, the GM-EP algorithm defines good compromise between effective optimization and computation time.
MicMac GM-EP Figure 10: Disparity maps on the Factory and Urban subsets.

CONCLUSIONS AND EXTENSIONS
We proposed a rigorous yet computationally efficient framework for disparity maps estimation using a single stereo pair.This approach naturally lead to a global optimization problem of a CRF.Instead of relying on a semi-global optimization with no mathematical guarantee on the optimality of the solution, we use global discrete optimization which is mathematically guaranteed to yield a near optimum solution.The global optimization is speed-up by building a truthful representation of the energy manifold with an energy pyramid multiscale approach.We also exploited the sparsity of the resulting sequence of CRF to derive the GM-EP algorithm.From a practical standpoint, we demonstrate through experiments on real stereo-pairs the superiority of the GM-EP algorithm over image pyramid approaches whether they rely on global optimization, such as the GM-IP, or semi-global optimization, such as MicMac.Future work will extend the experimental evaluation to standard photogrammetric tools to further assess the performance of our approach.Nevertheless, the comparison with MicMac outlines the importance of the energy modeling for disparity map estimations.Therefore, future work will extend our model to a symmetric definition of the energy with respect to the input images.Indeed, the choice of the reference and the target images from a stereo-pair is completely arbitrary and swapping them produces slightly different disparity maps.We think that a symmetric energy might enhance the quality of estimated disparity maps as the problem is by nature symmetric.We also plan to integrate the notion of occlusions and moving objects in our future model.
Finally, instead of using a multiscale approach, future work will focus on a multigrid scheme to further speed-up the optimization.Indeed, an inherent drawback of multiscale approaches is to force all nodes to work at the same scale, which might not be required as some parts of the disparity map might be well-resolved even at coarse scale.
. Let It be the target image.Given Ir and It, we need to find the most probable 2D deformation d which associates each pixel of Ir to a pixel of It with d being a function of p ∈ V → d(p) ∈ (R × R).Thus, d lives in

Figure 1 :
Figure 1: Graph G = (V, E) of a 4 by 4 image with a 4 connectivity.
(Dp)p by adequate sampling of D. The key property lays in noticing that due to the interpolation kernel used for computing It • (id + d(p)), the values of the unary terms are continuous with respect to d(p) and have a limited frequency support.Figure 2 represents typical values of unary terms issued from ZNCC coefficients with a very fine sampling along one component of D. Ideally, we would like to sample D such that the unknown minimum of Eq. 8, d * , is included in the samples.However, we have no prior knowledge of d * .Therefore, we suppose the image well-sampled and we choose to respect the Shanon sampling theorem.As the ZNCC coefficient is defined on a patch, this doubles the frequency support of the unary terms compare to the frequency support of the images Ir and It.Thus, we sample D to half a pixel as in Fig. 2.

Figure 2 :
Figure 2: Illustration of the non convexity of unary terms with a 0.5 pixel discretization step to respect the Shanon sampling theorem.

Figure 4 :
Figure 4: Grouping a set of 4 × 4 nodes into a set of 2 × 2 nodes and implications on the edge set.

3
Figure 5: The erosion produces eroded unary terms, the crosses, that are the lowest of the k = 8 neighbors intermediate unary terms (the circles).The decimation defines the final unary terms (the squares).

Algorithm 2 :
Energy pyramid, GM-EP algorithm Data: Ir,It Result: d Set d = 0 for f = coarsest to finest downsampling factor do Set dinit = d; Downsample dinit by f → d f init ; Define the set of disparity to evaluate from d f init → D f Define the grouping from f → G N odes and G Edges Apply the grouping on G by the large chimney, the two tubular objects and the numerous bridges.This also appears on the bell tower in the lower right part of the bottom GM-EP GM-IP Figure 8: Top four rows f = [8, 4, 2, 1], black values represents disparities of -15 pixels or below, and white values represents disparities of 35 pixels or above.Last row absolute difference with respect to the baseline, brighter greys are higher difference.left image of Fig. 10.The edges produces by the GM-EP are sharper and better defined.MicMac suffers from the inherent issues limitations of image pyramid approaches, which are overcome by the GM-EP algorithm.Moreover, as the model of MicMac is less complete than ours, it introduces noisy areas.GM-EP Baseline no pyramid Figure 9: Final disparity maps.Black values represents disparities of -15 pixels or below.White values represents disparities of 35 pixels or above.
3. Disparities that should have been retrieved