MEMORY EFFICIENT SEMI-GLOBAL MATCHING

: Semi-Global Matching (SGM) is a robust stereo method that has proven its usefulness in various applications ranging from aerial image matching to driver assistance systems. It supports pixelwise matching for maintaining sharp object boundaries and ﬁne structures and can be implemented efﬁciently on different computation hardware. Furthermore, the method is not sensitive to the choice of parameters. The structure of the matching algorithm is well suited to be processed by highly paralleling hardware e.g. FPGAs and GPUs. The drawback of SGM is the temporary memory requirement that depends on the number of pixels and the disparity range. On the one hand this results in long idle times due to the bandwidth limitations of the external memory and on the other hand the capacity bounds are quickly reached. A full HD image with a size of 1920 × 1080 pixels and a disparity range of 512 pixels requires already 1 billion elements, which is at least several GB of RAM, depending on the element size, wich are not available at standard FPGA-and GPU-boards. The novel memory efﬁcient (eSGM) method is an advancement in which the amount of temporary memory only depends on the number of pixels and not on the disparity range. This permits matching of huge images in one piece and reduces the requirements of the memory bandwidth for real-time mobile robotics. The feature comes at the cost of 50% more compute operations as compared to SGM. This overhead is compensated by the previously idle compute logic within the FPGA and the GPU and therefore results in an overall performance increase. We show that eSGM produces the same high quality disparity images as SGM and demonstrate its performance both on an aerial image pair with 142 MPixel and within a real-time mobile robotic application. We have implemented the new method on the CPU, GPU and FPGA. We conclude that eSGM is advantageous for a GPU implementation and essential for an implementation on our FPGA.


MOTIVATION
The justification for SGM arises from the limitations of other global approaches for the real-world applications.
Graph Cuts (Kolmogorov andZabih, 2001, Szeliski et al., 2008) is a global stereo method that translates the matching problem into a graph and seeks a partition, which effectively approximates the minimization of a global energy.The method is not only slow, but also quite memory intensive.Belief Propagation (Felzenszwalb andHuttenlocher, 2004, Szeliski et al., 2008) is another approach for minimizing a global cost function.It passes messages iteratively in a three dimensional grid structure for approximating the global minimum.The size of the temporary memory is linear to the number of pixels and the disparity range, as explained above.A Dynamic Programming stereo matching method (Birchfield and Tomasi, 1998) can be implemented with much less memory requirements, since the method optimizes the global function along each scanline separately.Unfortunately, this leads to streaking artifacts.
The Semi-Global Matching (SGM) method (Hirschmüller, 2008) minimizes a global cost function only in one dimension like Dynamic Programming, but the direction is not oriented along scanlines.Instead, it is performed symmetrically from eight directions towards all pixels in the image.SGM does not suffer from streaking artifacts like Dynamic Programming and does not require iterations like Belief Propagation.The rather simple and regular integer operations of the SGM algorithm make it suitable for implementations on the GPU (Rosenberg et al., 2006, Gibson and Marques, 2008, Ernst and Hirschmüller, 2008) and FPGA (Gehrig et al., 2009, Banz et al., 2010), which permits real-time performance.
In practice, SGM appears rather robust and insensitive to the choice of parameters in contrast to other methods like Graph Cuts.This makes SGM suitable for real world applications like (tilewise) aerial image matching (Hirschmüller, 2008, Gehrke et al., 2010) and automotive applications (Steingrube et al., 2009).
The drawback of SGM is its memory consumption that depends, as for all global methods, on the number of pixels and the disparity range.This limitation may be overcome in principle with tilling the input data.While this approach is usually sufficient for CPUs it does not take advantage of the high degree of parallelism commonly found in FPGAs and GPUs.

MEMORY EFFICIENT SEMI-GLOBAL MATCHING
The SGM method (Hirschmüller, 2008) aims to minimize the global cost function (1) The function consists of one term that sums the matching costs C over all pixels p, according to given disparities Dp.For rectified images, the absolute difference would be calculated by (2) This just serves as an example.In practice, more robust matching costs should be used (Hirschmüller and Scharstein, 2009).The second term of (1) sums small penalties P1 for all pixels, where the disparity difference to the neighbor is at most one pixel.The last term sums a higher penalty P2 for all pixels with a higher ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume I-3, 2012 XXII ISPRS Congress, 25 August -01 September 2012, Melbourne, Australia disparity difference to neighboring pixels.The goal is to find the disparity image D that minimizes this function.

Review of the SGM Method
In SGM (Hirschmüller, 2008), the minimization of ( 1) is done by going along one dimensional paths L in eight directions r through the image.Thus, at each pixel p, paths from eight directions meet as shown on the left in Figure 1.(3) Thus, for each pixel p and disparity d, the cost is computed by the sum of the matching cost and the minimum path cost of the previous pixel p − r, considering the penalties P1 and P2.The latter is done by computing the minimum over four values.The first value is the path cost at the previous pixel at the same disparity.This value is taken without any penalty.The second and third value is the path cost at the previous pixel with the next lower and higher disparity.Here, the small penalty P1 is added.The last value is the minimum cost at the previous pixel over all disparities with the additional higher penalty P2.
The last term of equation (3) subtracts the minimum cost at the previous pixel from all costs of the current pixel.This is done for keeping the values L low for using a small data type.In fact, an arbitrary value could be chosen, as long as it is constant for all disparities.The minimum of the previous pixel is used, because it is already available and subtraction will never make the whole term negative.
Since disparity changes are usually indicated by intensity changes, the penalty P2 is adapted to the intensity gradient between the current and the previous pixel, according to The information from all paths is fused for all pixels and disparities by Lr(p, d). (5) The disparity for each pixel corresponds to the minimum cost, i.e.

DL(p) = argmin
Subpixel interpolation can be performed by either fitting a parabola through the neighboring cost values or by using an equiangular line fit (Shimizu and Okutomi, 2005).The choice depends on the matching cost.
Unlike Dynamic Programming solutions, the pathwise cost computation does not contain any handling for occlusions.This is not possible because the paths are not oriented in the direction of epipolar lines.Therefore, the disparity of the right image DR is computed by either a diagonal search in S (Hirschmüller, 2008), or by switching the roles of the left and right image and computing from scratch.A consistency check is performed by comparing DL with DR and differing disparities are set to invalid or interpolated if needed (Hirschmüller, 2008).
Aggregation can be performed in two passes.The first pass goes from the top left pixel line wise through the image and computes the path from left, diagonally from top-left, from top and diagonally from top-right.For each pixel, the four paths are continued from the previous pixels to the current pixel according to (3) as shown on the left in Figure 2.This requires storing the path costs Lr for the previous pixels for all disparities for each direction separately.Consequently, the required memory has a size of 3 × w × dmax + dmax elements, with w as width of the image and dmax as disparity range.After computing the four pathwise costs for a pixel, the result is added according to (5) to the array S, which has the size w × h × dmax elements.Thereafter, a second pass is required that starts from the bottom, right pixel and goes upwards for computing the remaining four paths.This completes the computation of S. The disparity image is derived from S immediately according to equation ( 6).Thus, the total temporary memory requirement of SGM is The main problem is the memory size of S as it depends on the width, height and disparity range.In contrast, the memory for pathwise costs only depends on the width and disparity range.Since a disparity range that is larger than the total image size does not make sense, the required memory will be at most in the order of the size of the input image.

The Novel eSGM Method
In the SGM algorithm, the disparity is determined as the index that corresponds to minimum cost of all disparities of a pixel (6).However, each of the eight paths that contribute to the cost for a pixel, carries its own preference for the location of the minimum.Figure 3 shows the costs of the eight paths from different directions at pixel p for all disparities.Ideally, the location of the minimum of the eight paths would be the same disparity.However, we have to anticipate that paths from some directions may be disturbed, e.g.near depth discontinuities.Nevertheless, at least one path should predict the disparity correctly.The idea is to focus only at the locations of S, where the eight paths have their minima.These are at most eight distinct places.For all other disparities, the costs need not to be stored.Thus, S becomes a sparse S ′ , whose size does not depend on the number of disparities any more.
Figure 3: Costs for all disparities of a certain pixel from paths of eight different directions that are summed to S. The minimum of S may differ from the minima of the pathwise costs.
The immediate question is if we do not reduce the number of possible disparities too early?What if the minimum of S is at a different location than the minimum of any of the eight paths?The answer is that it is very unlikely that true disparity is not detected by any of the paths and at the same time appears as minimum in S. We claim that if the minimum of S is at a different place than any of the eight pathwise minima, then it can only be the correct disparity by pure chance.Thus, the new eSGM method should have qualitatively the same output as the SGM method.
The new method can be implemented with three passes.The first pass works like in the original method, except for storing the result to S. Instead of storing the costs for all disparities, for each pixel the four minima of the four paths that meet in this pixel are determined and the costs are only stored for these four disparities.For subpixel interpolation, the costs of the next lower and higher disparity is stored as well.Figure 4 shows the values that need to be stored for one pixel.The second pass computes the remaining four paths from the bottom up.The costs are added at the places that were identified by the first pass.These four places are complete, which allows the computation of an intermediate result by choosing the lowest cost among the four and calculating the sub-pixel disparity position.This frees the memory of the four minima of the first pass.The memory is used for storing the costs at the four minima of the paths from the second pass, which are in general different to the minima of the first pass.The aggregation is completed by a third pass that computes the same paths as in the first pass, but adds the costs at the minima that where identified in the second pass.The final minima of each pixel can then be selected among these four minima and the intermediate result of the second pass.Thus, for this algorithm, the amount of temporary memory is just

S(d
Obviously, the computational effort of eSGM is increased by 50% in comparison to SGM due to the necessary third pass. The new eSGM method does not allow the direct derivation of the right disparity image from the cost array S by a diagonal search (Hirschmüller, 2008).Therefore, we suggest another kind of fast consistency check that projects DL directly into DR by considering the costs that are available at the disparities of DL.For rectified images it can be written as DR(x − DL(x, y), y) := DL(x, y).Double mappings are resolved by storing the disparity with the lower associated cost into DR.
The reasoning behind the new method permits the derivation of matching confidence as the number of paths that have the minimum at the same disparity as the final disparity.Obviously, if paths from all directions have the same minimum, then the trust in the found disparity is very high.In contrast, if other path directions suggest a different disparity, then the confidence in the correctness of the match is lower.Computing this confidence number is computationally very cheap, because the eSGM algorithm already computes the individual minima of all paths.

IMPLEMENTATIONS
We have implemented the new method on different hardware platforms for different applications.As matching cost, we have focused on Census, since this matching cost appears to have the highest radiometric robustness (Hirschmüller and Scharstein, 2009).

CPU
We have implemented the SGM and eSGM method on the CPU as explained in Section 2 using unsigned integer values with 16 bit as basic elements.The pathwise cost calculation loop is computed using SSE2 vector commands.In case of SGM, the Census matching cost was computed only once and the pairwise matching costs stored for each pixel and disparity, which doubles the memory requirement, but offers fastest processing.Thus, the total temporary memory requirement in bytes of our SGM implementation is four times equation (7).
Our eSGM implementation uses unsigned 16 bit integer values as well and computes the Hamming distance of Census transformed images for each of the three passes, using xor and popcnt commands, that are part of the SSE 4.2 or SSE 4a specification.Thus, the total amount temporary memory in bytes is two times equation (8).

GPU
The GPU implementation of eSGM uses conventional OpenGL/Cg (Segal andAkeley, 2009, NVI, 2009) programming techniques (Rosenberg et al., 2006, Ernst andHirschmüller, 2008).Several render buffers of an OpenGL frame buffer object keep the data in a 16 bit float data format while the arithmetic operations are usually done in 32 bit float precision.All the work is carried out through the execution of OpenGL rendering commands with specialized fragment and vertex shaders.With respect to the memory bandwidth, necessary for direct matching cost calculation, we have chosen a Census cost function with a 5 × 5 window and alternatively Mutual Information as matching cost (Hirschmüller and Scharstein, 2009).Mutual Information requires a hierarchical eSGM algorithm.
Due to the two-step approach, the computing time for the path cost values is theoretically doubled as compared to SGM.It can be reduced if the minimal path cost values are saved during the first step for the second one.The memory size for S ′ with eight values per pixel and the data structures for storing the minimal path cost indices and values is sufficient for computing images up to 2048 × 2048 pixels on a GPU with OpenGL/Cg.A disparity range of up to 2048 pixels is only limited by the 16 bit float data format of the buffers.For consistency checking on the GPU, the right disparity image is calculated from scratch with swapped input images.Alternatively the fast consistency check method of Section 2.2 can be used.

FPGA
Programmable hardware like FPGAs have been used for stereo methods like SGM (Banz et al., 2010, Gehrig et al., 2009).FP-GAs have a low power consumption and high computational power that make them suitable for energy aware embedded systems that demand real-time image processing.A FPGA implementation of the original SGM algorithm on a high-end FPGA board matches two pairs of 320x200 images at 27 frames/s using ZSAD as matching cost (Gehrig et al., 2009).During our FPGA implementation efforts it became clear that the required memory bandwidth is the limiting factor for increased image resolutions or deploying low cost FPGAs in mass production.Unlike other FPGA implementations (Gehrig et al., 2009) we are using Census as matching cost, as it is well suited for hardware realization and because of superior performance under realistic conditions (Hirschmüller andScharstein, 2009, Hirschmüller andGehrig, 2009).The window size of the Census transform can be set to any arbitrary size.For our real-time setup, the census size was chosen to be 5 × 5, since we found that it gives the best trade-off between hardware impact and image quality.
All hardware components are designed with the hardware description language VHDL, following a hardware operating system concept (HW-OS).The HW-OS allows for a fast interchange of already deployed and proven hardware modules, at the expense of 5-10 % additional resource utilization.The eSGM core is customizable to any disparity and image resolution, which must be a power of two.Our hardware platform consists of a mid-sized Xilinx Virtex 5 FPGA (XC5VSX95T) with a PCI express interface to the image sensors and a GPU.The GPU is used for pre-and post-processing as shown in Figure 5. Based on this hardware environment, a stereo image pair with a resolution of 1024 × 1024 pixels and a maximum of 64 disparity steps can be processed at nearly 10 Hz.VGA sized image pairs are processed at 33 Hz with the same disparity range.The consistency check doubles the execution time.A second independent matching core may be instantiated in a FPGA with more logic resources.The novel eSGM method has also proven its feasibility on low cost Xilinx Spartan Devices.

RESULTS
We have tested our implementations of SGM and eSGM against each other on a standard test set, on huge images and on different platforms for finding out the advantageous and disadvantageous of our new method.

Evaluation on Middlebury Datasets
Figure 6 shows the standard data sets from the Middlebury web page (Scharstein and Szeliski, 2010, Scharstein and Szeliski, 2002, Scharstein and Szeliski, 2003).We used the same parameter settings for SGM and eSGM and for all images.The consistency check was used for identifying occlusions, which were interpolated as explained in Section 2.1.
The disparity images of SGM and eSGM are almost identical.In contrast, eSGM with the fast consistency check produces worse results.The last row of Figure 6 shows the confidence images that were computed by eSGM as explained in the end of Section 2.2.Obviously, lower confidence is estimated near object borders, which is in fact the place where most errors occur.
All pixels that differ by more than one pixel from the ground truth are counted as error.Table 1 shows the results over all nonoccluded pixels.The performance of SGM and eSGM is very similar, which confirms that the eSGM method produces virtually the same output as the SGM method.The original SGM implementation is listed as SemiGlob in the table.The difference to our implementation is the matching cost.We have used Census.
The runtime is given in the last column of Table 1.We have measured it on a Xeon X5570 CPU with 3 GHz.The implementation is single threaded, thus only one CPU core was used.In this example, eSGM is 56% slower than SGM, which confirms the theoretical runtime overhead of 50%.

Matching of Huge Images
The main advantage of SGM is the possibility to match huge images in one piece without tiling.Figure 7 shows a rectified aerial image pair.Both images have a size of 9782 × 14580 = 142 MPixel.The disparity range in this example is 1788 to 2300, thus 512 pixels, mainly due to the tower, which has a height of ≈ 370 m in reality.
eSGM computed the disparity and confidence image, as shown in Figure 7, in one piece using 4.8 GB of temporary memory.It required 72 Minutes on a Xeon X5570 CPU with 3 GHz in a single threaded implementation.The SGM implementation would require 272 GB of memory for computing the whole image, which is absolutely unrealistic.Instead the disparity image is computed in 7×11 tiles of the size 1478×1406 pixels and required 4 GB of temporary memory.Thus, the tiles have an overlap of 40 pixels.The computation time was 51 Minutes.The resulting disparity image looks the same as the eSGM result and is therefore omitted.The problem of tiling is the definition of the overlap.For well textured, rather flat scenes, as the current one, a low overlap of 20-40 pixels is sufficient.However, low-textured scenes, water or high depth differences require a higher overlap.With eSGM, this problem is avoided.
In this example, eSGM is 41% slower than SGM.In practive, even higher disparity ranges are typical, because the disparity range is not known and therefore is often overestimated for being on the safe side.Furthermore, the ground resolution in the example was 15 cm/pixel, but 5 cm/pixel can be obtained by aerial cameras, which increases the disparity range by factor 3 on scene with the same relative depths.Finally, scenes with mountains and canyons will have larger depth ranges.Thus, disparity ranges of thousands of pixels have to be considered in aerial image matching.For tiled processing, it means using much smaller tiles.In contrast, eSGM does not have this problem.

Computation Time on Different Hardware
As discussed before, we have implemented the new method on three different hardware platforms.All implementations use Census as matching cost and a full consistency check.The CPU version runs single threaded on one CPU core.Parallelization into several threads appears feasible, but is not used.Table 2 shows the runtime using two different image configurations.These results show that eSGM has special advantageous on the GPU and the FPGA, because these platforms typically have not as much memory available as the CPU implementation and the ratio between computation power and memory bandwidth is worse.On the GPU, especially the memory size is a problem.An image of size 2048 × 2048 pixels and 1024 pixel disparity range cannot be computed by SGM, but by eSGM.On the FPGA, the memory size as well as the memory bandwidth is a problem, which is relaxed by the eSGM method.

CONCLUSION
We have shown that the novel eSGM method improves the SGM algorithm by making its temporary memory requirement independent from the disparity range without sacrificing matching quality.The novel feature comes at the cost of around 50% more computation time as compared to SGM.
We have argued that memory efficiency is very important for applying global stereo methods to real world problems.As the image resolution increases in future, the disparity range increases as well.This means that the problem will become worse in future, if the main memory does not increase over-proportional to the image resolution.
The novel eSGM method permits matching of huge images up to several hundred MPixel in one piece on the CPU, which is not possibly by any other similarly accurate stereo method.The benefit of eSGM is even higher on the GPU and the FPGA as the main memory size is more limited on these devices and the ratio between computational power to memory bandwidth is worse.

Figure 1 :
Figure 1: Aggregation of costs in disparity space.Along each path Lr, the minimum cost to reach all disparities of a pixel p on the path is computed according to the global cost function (1).The minimum cost along a path is visualized on the right in Figure 1.Mathematically, the cost computation is done by recursively computing Lr(p, d) =C(p, d) + min(Lr(p − r, d), Lr(p − r, d − 1) + P1, Lr(p − r, d + 1) + P1, min i Lr(p − r, i) + P2) − min k Lr(p − r, k).

Figure 2 :
Figure 2: Calculation of the eight path directions in a top-down pass (left) and a bottom-up pass (right).

Figure 4 :
Figure4: Definition of 18 data elements that need to be stored for each pixel in the eSGM method.

Figure 5 :
Figure 5: Overview of the functional blocks and the underlying heterogeneous hardware platform.

Figure 6 :
Figure 6: Comparison of our SGM and eSGM implementation on the Middlebury datasets.

Figure 7 :
Figure 7: Left and right rectified aerial image with 142 MPixel, disparity image by eSGM and corresponding confidence image.ing disparity estimation: Architecture and fpga-implementation.In: IEEE Conference on Embedded Computer Systems: Architectures, Modeling and Simulation.

Table 2 :
Runtime in seconds of SGM and eSGM on one CPU

Table 1 :
(Klaus et al., 2006)ed areas of the Middlebury datasets using the standard threshold of one pixel.Algorithm Tsukuba Venus Teddy Cones Average error Runtime for Teddy AdaptingBP(Klaus et al., 2006)