LOCK-FREE MULTITHREADED SEMI-GLOBAL MATCHING WITH AN ARBITRARY NUMBER OF PATH DIRECTIONS

This paper describes an efficient implementation of the semi-global matching (SGM) algorithm on multi-core processors that allows a nearly arbitrary number of path directions for the cost aggregation stage. The scanlines for each orientation are discretized iteratively once, and the regular substructures of the obtained template are reused and shifted to concurrently sum up the path cost in at most two sweeps per direction over the disparity space image. Since path overlaps do not occur at any time, no expensive thread synchronization will be needed. To further reduce the runtime on high counts of path directions, pixel-wise disparity gating is applied, and both the cost function and disparity loop of SGM are optimized using current single instruction multiple data (SIMD) intrinsics for two major CPU architectures. Performance evaluation of the proposed implementation on synthetic ground truth reveals a reduced height error if the number of aggregation directions is significantly increased or when the paths start with an angular offset. Overall runtime shows a speedup that is nearly linear to the number of available processors.


INTRODUCTION
Solving the correspondence problem is fundamental to photogrammetry as it will allow the reconstruction of 3D points from 2D bitmaps if the underlying camera geometry is known. Numerous algorithms have been proposed to find identical pixels in two calibrated images among which semi-global matching (SGM) (Hirschmüller, 2008) has been outstandingly successful in terms of speed and quality of the resulting disparity maps (Scharstein et al., 2014). Since the proposal of the classic algorithm and the availability of its original single-threaded implementation, many contributions have been made to improve and accelerate SGM, and some of them discharged into scientific and commercial-grade software packages.
Notably, the tSGM algorithm of (Rothermel, 2016) reduces both the number of calculations and the amount of memory required for the cost aggregation stage of SGM. This is achieved through dynamic bounds on the disparity ranges for each image pixel. The bounds are obtained from matching subsampled copies of the input stereo pair and propagating the disparity estimates to higher levels of the image pyramid. In its multithreaded commercial incarnation named SURE, the cost aggregation scheme of tSGM is identical to classic SGM and relies on a fixed number of typically 8 or 16 path directions (nFrames GmbH, 2014). Thus, in practice, homogeneous areas sometimes lack point coverage, and low-amplitude artifacts from the approximate "semiglobal" minimization of the energy function involved in the matching process can be observed in the produced disparity maps. Characteristic quality issues related to SGM are specifically addressed by (Facciolo et al., 2015) in their more global matching (MGM) approach. In MGM, information is transferred between adjacent cost aggregation paths of a particular direction resembling belief propagation in graphs. It is demonstrated that the technique promotes smoothness and consistency in the generated disparity maps. However, the introduced data dependency comes at the price of a speed penalty, and it is likely that parallelization will be adversely impacted. Other modifica-tions to SGM target its cost function and smoothness term to enhance the accuracy and coverage of the resulting depth estimates leaving the actual cost aggregation strategy untouched. This for instance includes the use of weighted similarity metrics (Miclea, Nedevschi, 2017), extended discontinuity penalties (Michael et al., 2013), their automatic adaptation to the image content (Karkalou et al., 2017), and combinations. Lately, machine learning techniques to predict SGM parameters have been discussed (Seki, Pollefeys, 2017). This paper describes an implementation of SGM that supports a nearly arbitrary positive number of path directions for the cost aggregation stage. Hence, as a brute-force alternative to MGM, the user is given full control over the degree of approximation of the global energy minimization problem to be solved during stereo matching. In the implementation, for each direction, cost aggregation is performed along linear paths. The shapes of the paths are precomputed using an iterative line rasterization algorithm. Path-wise cost summing repeatedly reuses and shifts the resulting regular substructures, or runs, from the rasterizer, which happens in at most two partial sweeps per direction over the entire disparity space image. Since adjacent paths are properly aligned and sweeps are coordinated, it is guaranteed that the discrete scanlines do not overlap. Because there is no data dependency, path-wise multithreading can be applied without the need for time-consuming synchronization primitives. Computational efficiency is further improved by dynamically gating the disparities to a small range estimated from subsampled versions of the input images like in tSGM. On the machine level, the expensive cost calculation and aggregation stages of the proposed algorithm are parallelized using single instruction multiple data (SIMD) commands for the omnipresent Intel x86-64 and ARM CPU architectures.
The performance regarding both the runtime and disparity map quality of the new SGM implementation is evaluated on a synthetic 3D scene which gets rendered into error-free oriented pinhole camera images and depth ground truth. Following stereo matching, the reconstructed points are compared to the known scene geometry. The influence of high path direction counts and asymmetry on the height error and point coverage is examined for detailed scene elements and critical low-texture objects. Extensive optimization of the SGM penalty functions or disparity map refinement techniques remain unconsidered, however, these may complement the proposed aggregation scheme.

REVIEW OF THE BASELINE SGM ALGORITHM
The classic SGM algorithm comprises four basic processing stages. Assuming two oriented rectified images with parallel epipolar lines on the same vertical coordinate (Fusiello et al., 2000), the matching cost is initially computed for each pixel p at position (x, y) and disparity value d within the allowed range for the stereo shift. The cost calculation stage yields a disparity space image (DSI) which actually is a three-dimensional boxlike volume C(p, d) storing the similarity between potentially corresponding pixel pairs p(x, y) and p (x + d, y). A popular choice for the cost function serving as a similarity measure is the census transform because it is tolerant to radiometric differences between the images and can be efficiently calculated (Zabih, Woodfill, 1994).
Global stereo matching algorithms now attempt to optimally fit a 2D surface through the disparity space image minimizing the total cost, or energy. However, due to ambiguities in the DSI caused by homogeneous and repetitive areas in the input bitmaps, the neighborhood of a pixel must be incorporated using dynamic smoothness terms. This makes finding the desired surface and thus the corresponding pixels NP-complete and prohibitively time-consuming on current computers, at least when energy minimization is performed in two or more dimensions (Boykov et al., 2001). For one-dimensional structures, the optimal solution can be derived efficiently using dynamic programming (Cox et al., 1996) which leads to "semi-global" matching.
In SGM, the two-dimensional surface is approximated by minimizing equation 1 along a set of paths of direction r. Here the cost L r of pixel p at disparity d comprises the initial cost C from the DSI plus the minimum of the costs at the previous pixel along the path at the same disparity, at the disparity differing by one and at the disparity differing by more than one. A disparity change during the transition from p − r to p however gets penalized by P 1 and P 2 ≥ P 1 . This introduces a smoothness constraint which permits small depth changes and discourages depth jumps. The last term subtracted in the equation ensures L r ≤ C max + P 2 to avoid numerical overflows with C max denoting the maximum DSI value possible.
The final cost per pixel and disparity is the sum of the L r over all path directions where the number of directions often is fixed to 8 or 16. In the first case, only paths running strictly horizontally, vertically and diagonally must be considered. The next pixel on the line is computed by simply incrementing either the x or y coordinate, or both simultaneously. For sixteen orientations, two pixel steps in the major line direction are followed by one pixel step in the minor line direction. For n dir directions, the maximum aggregated cost to be kept in memory will be n dir (C max +P 2 ) which becomes relevant when the data type for the S(p, d) is to be chosen. If equation 1 is implemented reusing the previously computed L r along the respective path, the time and space complexity of the described cost aggregation step for a DSI of w by h pixels and d disparities is O(whd) for any constant value n dir .
In a third stage, to eventually derive the pixel matches, the stereo shift d min with the minimum cost is taken from the set of S(p, d). Subpixel precision can be obtained by for instance fitting a curve through the costs at the disparities next to d min and recalculating the final d min from the minimum function value. The last stage comprises the consistency check where erroneous stereo shifts due to mismatches or occlusions are eliminated. This often is achieved by reversing the role of the input images, re-matching the bitmaps and testing the new disparities against their counterparts from forward matching. Optionally, before being written to a bitmap, the surviving consistent disparity values can be filtered for example with a local median operator to eliminate outliers.

SGM WITH AN ARBITRARY DIRECTION COUNT
To support any positive number of directions in SGM, the simple discretization scheme for 8 and 16 path orientations must be abandoned. Instead, Bresenham's four-connected component line rasterizer is used to construct the paths for cost aggregation (Bresenham, 1965). This algorithm, which originates from computer graphics, supports arbitrary slopes and uses efficient integer additions and subtractions only. Cost aggregation now comprises at most two coordinated sweeps over the disparity space image as shown in figure 1.

Primary sweep
For the primary sweep performed first, the line shape of direction r is rasterized by running the Bresenham algorithm exactly once over the full horizontal or vertical extent of the xy plane of the DSI depending on the slope. Thus, path sampling must begin in one of the DSI corners at the minimum disparity, i.e. at (p 0 , d 0 ). As a result, the longest sequence of linear segments, or runs, in which the horizontal or vertical coordinate remains constant is obtained for r (figure 2). The runs in general have a variable length and characterize the fast-changing major direction of the line to be discretized.
To sum up the cost along a particular SGM path, the run sequence archetype is followed starting at the DSI boundary at position (p, d) = (p 0 , d 0 ) according to the sweep scheme. For each run pixel, the cost L r as of equation 1 is computed. The obtained value gets added to cell (p, d) of the aggregation volume S which has the same dimensions as the disparity space image. Position p is incremented or decremented by one in the major path direction. When the current run of the precalculated sequence is exhausted, the pixel position p will be incremented or decremented with respect to the minor path direction, and calculation will continue with the next run. If the DSI is eventually left, a new path of direction r next to the current one will be processed. The position p is reset to the DSI boundary and Figure 1. Primary (red) and secondary (black) sweeps over the disparity space image for paths with an arbitrary slope α Figure 2. Runs r 1 , r 2 , r 3 (red) of the rasterized line r with a length of five, three and four pixels altered by one pixel in the minor line direction, and the run sequence gets rewound to its start.
The primary sweep will be complete as soon as a new path of direction r begins outside the DSI. When this happens, at least 50% (diagonal path, square image) and at most 100% (horizontal or vertical path) of the matching cost for direction r have been aggregated. Because adjacent rasterized paths are guaranteed to be piece-wisely parallel and tightly aligned when the same run sequence gets repeatedly replayed as described, each DSI pixel and its counterpart in the aggregation volume are visited exactly once during the primary sweep. Therefore, redundant calculations are avoided. More importantly, no cost update will be omitted or performed multiple times for a particular (p, d). This ensures that the aggregation volume will remain unbiased when the minimum cost stereo disparities are chosen.

Secondary sweep
Except for perfectly horizontal and vertical path directions, a secondary sweep will be needed to aggregate the matching cost for the DSI pixels not visited during the primary sweep. Like in the first pass, the run sequence is followed to obtain the DSI pixel p for the calculation of L r . However, the start position of the k-th secondary path along the respective DSI boundary is computed as the accumulated length of the runs r 1 , ..., r k of the rasterized line prototype. Also, the path gets stripped off its first k runs to be piecewisely adjacent to the paths of the primary sweep (see figure 3). Overlaps and gaps are avoided during cost aggregation between scanlines of the secondary sweep. The sweep scheme prevents collisions with the positions visited in the first pass over the disparity space image.

IMPLEMENTATION AND OPTIMIZATION
As a proof of concept, the SGM algorithm has been reimplemented in platform-independent C++ based on run sequences for the cost aggregation stage. The software prototype takes two rectified RGB or graylevel TIFF input images with 8 or 16 bits per pixel and performs forward (left to right) and reverse (right to left) stereo matching. On success, this will generate a pair of floating-point TIFF bitmaps directly encoding the set of consistent disparity values. Optionally, the median filter, small segment removal and interpolation to fill up image areas with invalid disparities from neighbor pixels can be applied to the output in a postprocessing step.
Cost calculation utilizes the census function with a kernel of 7x7 pixels. The disparity space image and aggregation volume are stored as partially consecutive 16 bit unsigned integers. Assuming a penalty P 2 = 100 for depth discontinuities, the maximum SGM path direction count before an aggregation overflow occurs is calculated as n max = (2 16 − 1)/(7 2 + P 2 ) = 439 as of equation 2. Since in theory the runtime of the algorithm will increase linearly with the number of SGM path directions, the code has been applied optimizations to speed up all stages of the matching process. Aside from pointer arithmetic, this includes disparity gating, SIMD intrinsics and extensive multithreading.

Disparity gating
Like in tSGM, instead of running the matcher for a fixed predetermined range of stereo shifts, disparity gating dynamically restricts the range for the disparities d per DSI pixel (p, d) depending on the input image content. Therefore, the proposed SGM algorithm is run on recursively subsampled bitmaps built from the stereo pair. At the lowest resolution, the full image width is taken as the range for the disparities d producing a first estimate for the stereo shifts that occur for the pictured scene.
To accommodate mismatches, as a variation to tSGM, a local histogram filter with a fixed support of 51x51 pixels is applied to the disparity maps. It sets the disparity bounds to the lower and upper 16th percentile. The histogram can be quickly updated as described in (Faugeras et al., 1993). When the window is shifted horizontally or vertically over the disparity images, O(max(f w , f h )) time complexity is achieved for each move where f w and f h denote the filter dimensions.
The minimum and maximum disparities are subsequently passed up the matching hierarchy range-gating the stereo shifts for the image pair of the next higher resolution. Since the estimates tend to become more precise at upper levels, less disparities must be considered per pixel during cost aggregation along the SGM paths compared to a predefined constant interval. Also, the computed disparity range gets concentrated more accurately around the true value. This reduces matching ambiguities and improves the time efficiency of the algorithm. Runtime is further optimized with a limit on the number of SGM path orientations for the coarse tiers of the image pyramid while the maximum direction count will only be used for the full-resolution level of detail.
Congruently to the speedup, the amount of storage space for the DSI and aggregation volume will be greatly reduced when only the necessary values with respect to the disparity interval and reduced image dimensions are kept. Therefore, instead of a three-dimensional memory block accommodating the full resolution, the proposed implementation utilizes a single array of 16 bit unsigned integers sized to the sum over all per-pixel disparity ranges of the current SGM hierarchy level. Access to individual matching cost values is facilitated indirectly through an index structure. For each pixel p, the index holds the minimum disparity considered d 0 , the disparity count and the address of the first cost value assigned to p, i.e. the offset to (p, d 0 ). Since the array is consecutive in memory, pointer arithmetic can be used to efficiently navigate adjacently stored costs, and a suitable alignment prepares the storage for vector instructions.

Multithreading
On a more technical level, the SGM implementation utilizes multiple execution threads for its submodules based on OpenMP (OpenMP Architecture Review Board, 2015). This will ideally lead to a speedup that is directly proportional to the number of available processor cores. For the cost computation, histogram filter, minimum disparity calculation and postprocessing stages, the involved local operators are run concurrently for each horizontal image scanline with static scheduling.
The cost aggregation loop over the set of paths of direction r along which the values L r (p, d) are summed up is multithreaded using dynamic scheduling. In contrast to the default scheduling policy of OpenMP using a fixed number of paths per thread, dynamic scheduling effectively assigns each path its own thread and starts processing a new scanline as soon as another path has been entirely covered. Thus, with the lines of varying length in the secondary SGM sweep, all CPU cores remain fully loaded for a maximum of efficiency. Because there are no overlaps between the paths, there will be no race condition when the costs L r are stored concurrently to the cells (p, d) of the aggregation volume. Therefore, time-consuming thread synchronization to ensure exclusive data writes becomes unnecessary, and the absence of any memory locks will positively contribute to the scalability of the SGM implementation to a high CPU core count. As a drawback, the use of multithreading introduces duplicate data structures and hence a small memory overhead. For example, the current and previously aggregated matching costs along the path L r (p, d) and L r (p − r, d) now must be kept and updated separately for each thread.

SIMD instructions
On the hardware level, census calculation and cost aggregation of the SGM implementation are further parallelized using vector or SIMD (single instruction multiple data) commands supported by current processors. In these stages, instead of looping over single data elements, blocks of input image pixels and consecutive cells of the DSI and aggregation volume are uniformly processed by a single instruction in few clock cycles. The block size which affects the achievable speedup depends on the actual CPU architecture. For the dominating Intel and ARM processors, 8 (Intel SSE2/3/4, ARM NEON), 16 (Intel AVX/AVX2) or 32 (Intel AVX512BW) unsigned integers of 16 bits as stored by both three-dimensional data structures can be processed at once (Intel Corporation, 2019)(ARM Limited, 2019).
For the census matching cost, the crucial part is to derive a bitmask that indicates which elements of the neighborhood have an intensity greater or equal to the input image pixel under the filter mask center. Using SSE2 instructions and retaining a symmetric kernel, the comparison can be performed simultaneously for a row of seven pixels as shown in figure 4, and consequently, the filter size is set to 7x7. In the C++ code, the SIMD instructions are prefixed with mm and take either 128 bit SSE registers or memory pointers aligned to 16 bytes as arguments. For SGM cost aggregation, the loop over the disparities contained in equation 1 is rewritten. Here blocks of consecutive stereo shifts d n , ..., d n+b−1 with b being the SIMD block size are simultaneously processed. In the sample code for the SSE2 instruction set, the calculation of the L r (p, d) comprises multiple vector additions, the subtraction of the min k L r (p − r, k) kept in a local variable and the 4-minimum operator ( figure 5). This operator, which is not directly available, must be emulated with three calls to the binary minimum intrinsic.

PERFORMANCE ANALYSIS
Performance analysis of the proposed algorithm focuses on the influence of the number of path directions on the quality of the resulting disparity maps and the scalability of lock-free pathwise multithreading on modern multi-core processors. For this Figure 5. C++ snippet with SSE2 intrinsics for cost aggregation purpose, the optimized SGM implementation is passed a set of RGB stereo images of a synthetic 3D scene (Frommholz, 2019). The input bitmaps are 8192 by 6144 pixels (50.3 megapixels) each with 8 bits per color channel. They cover an area of roughly 500 by 375 length units at a ground sampling distance (GSD) of about 0.07 length units. The images have been rendered by a raytracing software where the virtual pinhole cameras satisfy the stereo normal case. Except for inherent aliasing artifacts suppressed by excessive oversampling with 81 rays per pixel, the bitmaps are free of distortions and noise. Congruently to the stereo pair, the raytracer also provides depth ground truth that comprises the z coordinates of the intersections when the emitted rays hit the scene objects. If the disparity maps of SGM are triangulated keeping only the z coordinates, both the actual and ground truth z images will become directly comparable, and as long as the z axis and viewing vector approximately coincide, matching errors will get reflected adequately. Also, by working on the z coordinates instead of raw stereo shifts, there will be an immediate connection to the height values of the pictured synthetic 3D geometry which simplifies manual point sampling. Figure 6 depicts the reference data set.

Matching quality
For the matching quality, the SGM implementation is run on the synthetic input images with a steadily increasing number of path directions n dir . Aggregation for most passes is carried out symmetrically along the k = 0, ..., n dir − 1 lines with inclinations of k · 360 • /n dir . However, the initial path angle of runs 8/11 and 16/7 is 11 • and 7 • respectively. The prime offsets have been chosen to reduce the chance of a correlation between the direction of the scanlines and the predominant orientation of the imaged scene objects. Penalties remain fixed at P 1 = 90 Output stereo shifts for the left-to-right and right-to-left SGM passes are checked for consistency and filtered allowing a maximum deviation of one pixel. Any other postprocessing or interpolation remains disabled. The produced disparity maps are transformed into z images of which the left-to-right bitmap is compared to the ground truth pixel by pixel. Statistics are compiled for the entire scene and two subsets comprising detailed objects (i.e. Sierpinsky tetrahedra of 4, 5 and 6 iterations) and a homogeneously textured cube. These special cases tend to be prone to matching errors in practice and hence are of particular interest.
Matching quality will be quantified in terms of coverage and accuracy. For this purpose, the number of successfully reconstructed points n valid , the median height deviation from the ground truth∆h and the aggregated count of pixels with an absolute height error less or equal than 0.2 length units n ∆h≤0.2 are determined. The threshold has been empirically chosen as roughly three times the ground sampling distance. Table 1 lists the statistics for the whole scene where the row labeled "gt" has the reference values. It shows that the number of pixels with a valid correspondence basically stagnates between 85.5% and 85.95% when the path count increases. The median distance error to the ground truth first falls quickly, decreases by around 11% from n dir = 8 to n dir = 96 and stabilizes for the upper zone of directions tested. Similarly, over the medium range of orientations, the number of accurate pixels with a height deviation below three times the GSD grows by about 2%.
However, for n dir = 4, 8, 12 and 16, spikes occur in the number of valid pixels. For these cases, the median height error remains in line, and the values n ∆h≤0.2 mostly increase disproportionately compared to the neighboring direction counts. Hence, the new consistent disparities likely have contributed to the accurate pixel class, and there must have been a move of already successfully matched stereo shifts towards smaller distance deviations. A review of the produced z images for the respective configurations indicates that the cause for this effect is local minima in the disparity space image which specifically occur perpendicularly to the direction of perspective distortions, e.g. on wall surfaces of high-rise buildings (figure 7). Due to their low cost, these regions will be followed by aggregation paths that have a similar orientation, which is the correct solution according to equation 1. For dense and accurate surface reconstruction, it may therefore be beneficial to analyze the DSI three-dimensionally for linear structures and run SGM with a set of paths coinciding with their orientation. In congruence to this observation, when the direction count remains constant but the start angle of the paths is altered breaking the alignment to the DSI minima, the number of valid pixels significantly drops. Remarkably, for 8 directions/11 • and 16 directions/7 • respectively, using oddly oriented aggregation scanlines reduces the median distance error to the ground truth to values obtained for 24 and 32 directions. The number of accurate pixels also drops by 0.73% (n dir = 8/0 • to n dir = 8/11 • ) and 1.02% (n dir = 16/0 • to n dir = 16/7 • ) which, for eight directions, is less than the decrease in consistent disparities. Thus, if the distance error is to be optimized with SGM in general, it may be advantageous to start cost aggregation with an angular offset and choose a low direction count for speed.
For the high-detail subset of the synthetic stereo pair, table 2 contains the coverage and accuracies. The number of successfully matched pixels is about 3-4% below the values for the full scene. Except for the settings with an angular offset, it grows slowly with a rising path direction count by 1% from n dir = 8 to n dir = 256. The median distance deviation to the ground truth is greater by roughly 0.02 length units. Introducing angular offsets for the paths decreases the number of valid pixels and lowers the height error to a value that even outperforms the configuration with 256 directions. Hence, for detailed objects to be reconstructed, either a substantial number of orientations or oddly rotated paths is the strategy of choice for SGM. However, on the examined Sierpinsky tetrahedra, if the number of directions is increased and the final aggregated cost S(p, d) stabilizes, the distribution of valid pixels will also change. Isolated disparity pixels and small regions vanish in favor of larger consolidated segments. This may be disadvantageous for a subsequent interpolation step because supporting points inside the nodata areas will get lost. Figure 8 illustrates selected results.
For the homogeneously textured cube, the statistics are given in the SGM path direction count gets ramped up with slight drops on the angular offsets. Starting with an initial prime slope does not reduce the median height error which is approximately 1% higher than for the detailed objects and basically remains static beyond n dir = 32. The increasing number of new valid samples is accompanied by a growth in accurate pixels whose number rises by roughly 6% over the entire direction range shown and by nearly 3% from n dir = 8. Hence, although the absolute distance deviation can hardly be lowered by a vast number of path directions in semi-global matching, the point density will benefit when little information is present in the input data. In addition, when a high orientation count is selected, the characteristic streaking artifacts of SGM that were addressed by the more global matching approach will diffuse into an undirected point pattern as depicted in figure 9.

Runtime and scalability
The impact of a high number of path directions on the runtime of the optimized SGM implementation is evaluated for both the Since the server hardware offers plenty of resources, measurements are on the full-scale synthetic images that were used for matching quality analysis. The SGM implementation is compiled to optimized binaries without platform-specific tuning. For x86-64, the executables have been built from the platformindependent C++ code and the C++ code manually enhanced with SSE2/SSE4.2 and AVX2 intrinsics for the cost calculation and aggregation steps as outlined previously. The software utilizes 24 threads of execution, i.e. one per physical processor core. Due to the lack of memory, the ARM SBC is given a re-  In practice, runtime grows nonlinearly with the direction count. The increase from 8 to 16 orientations is 12% on x86-64 with SSE intrinsics and 26% on ARM using plain C++ code. Quadrupling the number of directions from 8 to 32 extends the runtime by 36% and 77% respectively. Moreover, processing speed does not always benefit from SIMD optimization. On both platforms, for low to medium numbers of path directions, the binaries built from plain C++ code outperform their manually vectorized counterparts. This indicates that current C++ compilers already generate very efficient machine code. Also, when a sparse set of aggregation paths predominantly runs through DSI pixels that have been assigned narrow disparity ranges, the data transfer overhead from and to the dedicated SIMD registers may outweigh the efficiency gain from simultaneously processing blocks of stereo shifts. When n dir exceeds 16 (Intel) and 32 (ARM), SIMD becomes clearly faster than unvectorized code.
On the EPYC CPU, SSE commands beat the wider and more recent AVX2 instructions. This may be due to limitations of its floating-point/vector execution pipelines (Fog, 2019).
To assess the scalability of path-wise parallelization for the cost aggregation stage, the SGM implementation is executed with a constant number of 16 scanline directions but an increasing number of processing threads. On both platforms, doubling the number of processing threads lowers the runtime by a factor of 1.69 to 1.98 until the physical CPU cores are exhausted. Hence, the SGM implementation with path-wise multithreading almost scales linearly on the test machines. There is a marginal speedup when the logical cores are included and competition for memory bandwidth and arithmetic units of the processor starts.

CONCLUSION
This paper has outlined an implementation of semi-global matching that allows an arbitrary number of directions for cost aggregation. For each direction, a template sequence of line runs is constructed utilizing Bresenham's rasterizer. The runs are followed in at most two coordinated sweeps over the image to update the costs for each pixel and disparity. Since overlaps cannot occur, each path can be traced in its own thread without write locks. Runtime is further minimized through disparity gating and the use of SIMD instructions. Tests on synthetic stereo images with a high number of aggregation directions have revealed an improved point coverage and reduced height error. Future work will explore how the orientation count and path slopes can be dynamically adjusted to the image content and initial SGM cost volume in order to minimize the deviation and optimize the data throughput on real-world imagery. For mobile devices with limited CPU resources, a GPU-based implementation of the proposed algorithm is to be evaluated.