TEXTURE-AWARE DENSE IMAGE MATCHING USING TERNARY CENSUS TRANSFORM

: Textureless and geometric discontinuities are major problems in state-of-the-art dense image matching methods, as they can cause visually significant noise and the loss of sharp features. Binary census transform is one of the best matching cost methods but in textureless areas, where the intensity values are similar, it suffers from small random noises. Global optimization for disparity computation is inherently sensitive to parameter tuning in complex urban scenes, and must compromise between smoothness and discontinuities. The aim of this study is to provide a method to overcome these issues in dense image matching, by extending the industry proven Semi-Global Matching through 1) developing a ternary census transform, which takes three outputs in a single order comparison and encodes the results in two bits rather than one, and also 2) by using texture-information to self-tune the parameters, which both preserves sharp edges and enforces smoothness when necessary. Experimental results using various datasets from different platforms have shown that the visual qualities of the triangulated point clouds in urban areas can be largely improved by these proposed methods.


INTRODUCTION
Recent innovations in aerial camera systems (Fritsch and Rothermel, 2013) and dense image matching (DIM) (Haala, 2013) approaches have led to the development of more imagebased large-scale 3D city reconstruction. Modern DIM approaches are generally formulated as a discrete optimization problems on the disparity space image (Scharstein and Szeliski, 2002). As 2D optimization on the disparity image is inherently NP-complete, the computational complexities are prohibitively high for large-scale aerial applications, even with relatively efficient approximate optimization algorithms, such as loopy belief propagation (Sun et al., 2003) and graph cut (Kolmogorov and Zabih, 2001). Fortunately, it is possible to separately carry out one-dimensional optimization for each scan line through dynamic programming (DP) in polynomial time (Birchfield and Tomasi, 1999). The industry proven method of Semi-global Matching (SGM) (Hirschmuller, 2008) extends DP in multiple directions, and is effective in both airborne and satellite-borne applications (Alobeid et al., 2010;Remondino et al., 2014).
The quality of photogrammetric point clouds is widely acknowledged to be inferior to that of laser scanning (Gerke, 2009;Nex and Gerke, 2014), particularly regarding noise and the preservation of sharp features. Insufficient texture and geometric discontinuity are crucial problems in DIM algorithms, particularly in urban areas with artificial structures and sharp edges. First, the lack of texture causes ambiguities in the similarity measurement (referred to as matching cost) of the local appearances of image intensities (Yoon and Kweon, 2006). Second, unlike sparse feature matching, most DIM methods explicitly integrate smoothness constraints to obtain dense point clouds in textureless areas. However, the smooth assumption is contradictory to the preservation of sharp features, which is crucial for its successful application. The balance between matching cost and smoothness is usually tuned through manually assigned parameters (Kolmogorov and Zabih, 2001;Sun et al., 2003;Hirschmuller, 2008), but in complex urban scenes, it is much more difficult to reach a compromise between noise and sharp feature preservation through a single set of parameters.
To address these problems, SGM is extended in this study, in terms of both the matching cost computation and disparity optimization. First, a ternary census transform is proposed to handle the ambiguities in textureless areas. Compared to the binary census transform (Birchfield and Tomasi, 1998), the proposed method also considers the equality of pixel intensities, as textureless areas typically have similar grayscale values and the relative order is more easily affected by random noise. Second, a texture-aware SGM is proposed to adaptively tune the parameters, to both enforce smoothness and preserve sharp edges. Because textureless areas not only represent similar intensities but also disparity continuities and vice versa (Stentoumis et al., 2014;Yang, 2015), this study quantitatively measures the texture information then self-tunes the parameters of SGM accordingly.

RELATED WORKS
With recent developments in aerial camera systems (Fritsch and Rothermel, 2013) and the automation of 3D reconstruction, DIM have now been a hot topic in the photogrammetry community (Cavegn et al., 2014;Kuschka et al., 2014;Krombach et al., 2015). Most DIM methods typically have four stages (Scharstein and Szeliski, 2002;Yang, 2015): 1) matching cost computation, 2) matching cost aggregation, 3) disparity computation/optimization and 4) disparity refinement. Taking traditional area-based matching as an example, the matching cost is the square difference of pixel intensities, matching cost aggregation is generally accomplished using the Normalized Correlation Coefficient (NCC) in a rectangular window, the disparity computation is carried out through a local winnertake-all (WTA) operation in a small search window and screening with a simple threshold, and the final step typically involves sub-pixel interpolation and other post processing. DIM methods are classified as either local or global, distinguished by the use of smoothness constraints and the inference of disparities (Scharstein and Szeliski, 2002;Lazaros et al., 2008). The former depends only on local intensity values, while global reasoning is generally used in the latter for disparity optimization.
DIM methods in both approaches generally face common problems, such as insufficient texture and the preservation of sharp features, which often appear in complex urban areas (Remondino et al., 2014). Textureless areas will cause ambiguities in determining matching cost and therefore magnify errors from the random noise of pixel intensities. Because of the difficulties to obtain local optimum in matching of textureless areas, DIM methods will generally fail in this area without smoothness constraint (Lazaros et al., 2008). The simplest strategy to exploit smoothness is to aggregate matching cost in a larger window (Xin et al., 2012), as the aggregation window implicitly uses the assumption of the same disparity value (Yang, 2015). It is also possible to enforce smoothness through segmentation: the images are first clustered into homogenous segments using color information, and pixels in each segment are implicitly assumed to share similar or smoothed disparities (Bleyer and Gelautz, 2004;Hong and Chen, 2004;Zhang et al., 2009). Similar effects can also be achieved through plane fitting, where the plane proposals can originate from color consistencies (Klaus et al., 2006) or initial feature matches (Wu et al., 2011;Wu et al., 2012;Sinha et al., 2014). In the global scheme, the smoothness constraints can be explicitly enforced by an energy term that penalizes neighbor pixels with different disparities (Kolmogorov and Zabih, 2001;Sun et al., 2003;Hirschmuller, 2008).
Insufficient texture requires the DIM algorithms to enforce smoothness and use global reasoning to alleviate noise and the ambiguities of matching cost. However, a smoothness constraint also flattens sharp features or causes the loss of boundary points due to failures in cross-checking (Remondino et al., 2014). Small rather than large aggregation windows are preferred for sharp features, or in extreme cases, pixelwise cost without aggregation can be used (Scharstein and Szeliski, 2002), such as absolute difference, sampling-insensitive absolute difference (Birchfield and Tomasi, 1998), mutual information (Hirschmuller, 2008), and census transform (Zabih and Woodfill, 1994). Weighing the matching window to avoid edges is another widely used strategy. For example, Yoon and Kweon (2006) assign appropriate weight by color similarity and spatial proximity, which transpires to be analogous to the bilateral filter (Hirschmüller and Scharstein, 2009). Bilateral filtering can also be applied to the minimum spanning tree, to achieve non-local aggregation and linear time complexity (Yang, 2015). Weighting can also be applied to non-parametric census transform through duplicating code (Spangenberg et al., 2013) or adaptively modifying the aggregation window (Stentoumis et al., 2014). It is possible to preserve geometric discontinuities through global methods by loosening the smoothness constraints through parameter tuning, e.g., P1 and P2 for SGM (Hirschmuller, 2008), and  for graph cut (Kolmogorov and Zabih, 2001).
A textureless area will inevitably decrease the performance of the matching cost per se. The most intuitive solution is to increase the resiliency of matching cost to noise and other conditions. In a comprehensive comparison of many matching costs, the census transform has been claimed to demonstrate the best and most robust overall performance (Hirschmüller and Scharstein, 2009). However, as relative intensity order does not take equality into account in computing the binary census transform, in a textureless area with many similar pixel intensities, a high level of incorrect ordering through random noises, camera qualities, and vignetting are expected. Using ternary order, which accounts for equality, So in this paper, we extend the binary order results with ternary order that takes equality into account (Tan and Triggs, 2010;Murala et al., 2012), and modify the pattern accordingly to allow for efficient computation.
Using adaptive filtering to preserve sharp features is in general only suitable for local methods, which are usually less accurate than global methods (Stentoumis et al., 2014). However, in global methods it is contradictory to both enforce smoothness and preserve discontinuities in complex scenes by a single parameter set. Following the suggestion to adaptively tune parameters in complex scenes (Remondino et al., 2014), we incorporate texture information to achieve both smoothness and discontinuity in different scenarios.

Overview
Because SGM is an industry proven DIM algorithm that demonstrates superior precision and production-level efficiency (Alobeid et al., 2010;Remondino et al., 2014), the method proposed in this paper is based on SGM. As the basic concept is compatible with any global method using pixelwise matching cost, similar strategies can be easily applied to other DIM algorithms. The overall workflow of the proposed method is summarized in Figure 1. The method differs from others as it uses ternary census transform for matching cost computation and incorporates texture information into SGM to self-tune the parameters for the disparity computation.
DIM methods typically use an epipolar relationship to constrain the correspondence search in a one-dimensional direction. The epipolar relationship can be calculated dynamically in disparity optimization, but for efficient implementation we pre-rectify the stereo image pairs to obtain horizontal epipolar images. For frame images, a homography transform is used for the rectification (Fusiello et al., 2000). For pushbroom satellite images, though the epipolar geometry is a quadratic curve instead of a line, in a relative small region, a simple affine transform can also be used (Wang et al., 2010), due to the stability of the satellite platform during image acquisition. After rectification we convert the epipolar image into corresponding codes using the proposed ternary census transform, where each pixel encodes a 64-bit census code (detailed in Section 3.2). The matching cost is measured using the hamming distance and dynamically computed in a single instruction 1 , rather than reallocated in the cost volume. Forward and backward texture-aware SGM are then conducted to achieve the final disparity image in a coarse-to-fine scheme (detailed in Section 3.3). SGM can be formulated as DP in multiple directions, and in a single direction it solves the following problem: where P1 and P2 control the penalties for adjacent pixels with different disparities and other parameters are found in (Hirschmuller, 2008). P1 is imposed on small and smooth changes of disparity and P2 on larger and acute changes. In this study we adaptively tune these parameters according to the texture information, to achieve less noise in textureless areas and to preserve geometric discontinuities in shared edges.
Potential outliers in the pairwise disparity map are first removed using speckle filters (Hirschmuller, 2008). The disparity map is then refined by parabola interpolation, using the matching cost of three adjacent pixels (Wang, 1990). The invalid pixels that fail the cross-check are left unchanged without interpolation, as suggested by the original SGM used for aerial images (Hirschmuller, 2008), as this may cause further outliers that are smooth and difficult to remove.
For multiple overlapping views of the same area, multiple disparity maps are fused into the depth map defined in the original image space. The fusion is carried out using a weighted average, and weights for every pair are determined by the length of the baseline. When converting the depth maps to corresponding point clouds, gridded median filters are used. This process also resamples the point clouds and removes potential duplicate points.

Ternary Census Transform
The census transform is a non-parametric matching cost, which relies only on the ordering of pixel intensities and is therefore invariant to any radiometric changes that preserve the intensity order (Hirschmüller and Scharstein, 2009). For any given rectangular window, a pixel will be encoded into a bit string, which concatenates the relative order of neighboring pixels, as described in Figure 2a. The census transform is defined with where  is the concatenation operator, CT() is the code image after transformation, I() is the epipolar image, and b() is the binary operator as in the following.
As previously described, small variations in textureless areas with similar pixel intensities will cause differences in the binary code. The sources of radiometric variations may be illumination settings, image noise, vignetting, or even jpeg compression. To enhance the robustness of the matching cost in textureless areas, we take the equality of the intensity comparison into account. As Figure 2b shows, instead of using the binary operator in Equations (2) and (3), the proposed census transform takes the ternary operator, following the local ternary pattern concept (Tan and Triggs, 2010;Murala et al., 2012),  A single comparison results in two bits rather than one, so it appears that the storage and computation may be two-fold in the binary census transform. However, unlike the traditional pixelby-pixel comparison carried out with a 9  7 grid (Hirschmüller and Scharstein, 2009), we use a predefined pattern on a 9  9 grid, where only the shaded squares are evaluated, as shown in Figure 3. This sparse and symmetrical pattern has two possible advantages: 1) a dense grid may have a close correlation between adjacent pixels, and 2) the pattern is designed to follow the Gaussian kernel, thus giving more weight to pixels close to the center. As only 32 grids are used for the evaluation, the total length for a single pixel is 64-bit and the matching cost can be evaluated in a single instruction. Figure 3 A symmetrical pattern on a 9  9 grid. For a single pixel, the proposed ternary census transform results in 64-bit binary codes.
It should be noted that the matching cost of the ternary census transform is also measured by the Hamming distance, which is the number of different bits in the coded string. The matching cost between 00 and 11 is 2, and between equality and inequality is 1. The ternary census transform will therefore result in an intermediate cost for smaller differences.

Texture-Aware Semiglobal Matching
Texture is the frequency of color variation and represents the level of image detail (Malik et al., 2001). The premise of texture-aware SGM is that color variation is in direct proportion to the variation of the disparity and thus the smoothness. The richer the texture, the larger the parameter P in Equation (1) should be. The texture information can be quantitatively measured by information entropy (Zhu et al., 2007), a number of filters with different orientations and scales (Malik et al., 2001), and by simple gradient analysis (Xue et al., 2014). Textureless areas are known to result in a low gradient and variation in color appearance, so this study uses the above two cues to quantitatively gauge the texture information, as; where g(x,y) is the average of gradient in both directions and (x,y) is the standard deviation of intensities. W(x,y) is a local window and in this study the same 9  9 grid is used. Finally, T(x,y) is the weighted texture information using both of these factors, where w1=w2=0.5 is used in this paper.
The original SGM exploits this idea by simple adaption with derivatives (Hirschmuller, 2008) as P2=P2'/|Ip-Iq| and another excellent SGM implementation, SURE, uses canny edges to adaptively change between two values as P2 or P2-P3 (Rothermel et al., 2012). However, in real applications we have found that the derivatives are quite sensitive to noise, and the canny edges require two parameters to explicitly binarize the image. A comparison of the above two approaches and the proposed method is shown in Figure 4. It should be noted that the derivatives are essentially sensitive to noise, and using the same thresholds, the canny method sometimes omits edges or results in dashed edges, illustrated by the red rectangles. However, it may also cause noise edges, as shown by the yellow rectangles.
After obtaining the texture information of the epipolar images, an intuitive solution for adaptive parameter selection is linear interpolation between the user-defined minimum and maximum. However, this requires a further temporary buffer to save the parameters, and an extra load operation for Single Instruction Multiple Data (SIMD) optimization. Similar to SURE (Rothermel et al., 2012), we therefore also binarize the texture and thus the parameter. Manual parameter-tuning is a timeconsuming task, so an adaptive thresholding strategy using local Gaussian weighted mean is used. An inconsistency of even one pixel in disparity has been found to sometimes cause more than 0.2 meters of perturbations in the point clouds, so we also adapt P1 in Equation (1) rather than only P2, as in previous studies (Hirschmuller, 2008;Rothermel et al., 2012). We can then obtain a parameter mapping function for each pixel P1(p) and P2(p), which are linear mapping between user defined ranges as detailed in experiments.

Test data description
To evaluate performance and efficiencies, we tested the proposed methods on three different datasets: Jinyang, Zurich (Cavegn et al., 2014), and Vaihingen (Cramer, 2010), all of which cover built-up urban areas. The images in Vaihingen are collected by large-frame aerial cameras with a relatively large baseline-height-ratio with nadir views. The radiometric qualities of the images can also be superior in this dataset. The pansharpening colour infrared images are used for matching.
The other two datasets both consist of aerial oblique images, but the configuration and tilt angles are quite different. In the Jinyang dataset, the tilt angle is 45 and the focal length is about 50 mm and 80 mm for nadir and oblique views, respectively. For the Zurich dataset, the tilt angle is 35 and the focal length is 50 mm for all the cameras. Both datasets have RGB color images, but the radiometric resolutions are different: 8-bit for the Jinyang dataset and 16-bit for the Zurich dataset. A brief overview of the test data is summarized in Table 1.   (Hirschmuller, 2008), canny edges implemented in SURE (Rothermel et al., 2012), and the proposed texture representation. Top row: comparison in a cement square; Bottom row: residential building roof with small structures.

Results
To analyze the effectiveness of the proposed methods on textureless areas and geometric discontinuities, we evaluate the matching results on the above four datasets using three parameter configurations; smaller parameter, larger parameters, and adaptive parameters. We then compare the matching results on typical scenarios, to visually evaluate the amount of noise and the quality of point clouds on the sharp edges. In all the experiments, no matter what radiometric resolutions and camera types, the matching parameters are all set the same at [28,56] and [70,140] for P1 and P2, respectively. Figure 5 shows the overview of the point clouds in the evaluated area and highlights the textureless and geometric discontinuities with rectangles. Figure 6 through Figure 8 show the matching results for the four different datasets. In each figure, the first column shows the subset of the images. The three right-hand columns (b), (c), and (d) show detailed results for smaller, larger, and adaptive parameters, respectively. In each column, the top row shows the results of a relatively smooth area from the blue rectangle, and the bottom row shows the results of a building area. Figure 6 shows the matching results from the two large-frame cameras, which capture the nadir images. In the top row of the figure detailed matching results for smooth road areas with Delaunay triangulation are shown. As expected, smaller parameters will generate significantly more noise in the planar roads than larger parameters, found when results in columns (b) and (c) are compared. The noise may cause problems in two areas: 1) random noise in image intensities, which leads to an incorrect matching cost; 2) a non-optimal matching cost, which causes small variations in sub-pixel interpolation. The horizontal plane in aerial nadir images will generate almost no disparity change, so it is reasonable to penalize both P1 and P2.
The adaptive parameter cases shown in column (d) share the same qualities as discussed above, as roads are generally labeled as textureless, even with road markers. Another interesting effect in textureless areas is the stripped effect shown in Figure  6, which also appears in the SGM implementation of SURE. This effect is rooted in the nature of DP itself, except that the strip may have different orientations, rather than only a horizontal direction. The bottom row of Figure 6 shows detailed point clouds over a building area. Larger parameters are found to suffer more from losses of matched points than smaller parameters, particularly in the corner area, as the disparity here is not consistent in all SGM scan directions. By over-penalizing for inconsistent disparities, the final matching results will deviate from the matching cost and create a smoothed disparity map; these disparities will probably fail the cross-check and be removed from the final results. By using smaller parameters, the discontinuities can be better preserved, and as the texture variation between roof and ground are much more significant, they can also be used to derive the adaptive results in column (d).
The oblique cases provide similar results, as shown in Figure 7 and Figure 8. Here, the horizontal plane is actually slanted in the camera view, so the disparity will continuously change. To be consistent, we also adaptively tune parameter P1 in this implementation, and the results prove that this strategy will generate better and smoother results in the horizontal planar square and for roads, as shown in the top row.
Airborne LiDAR data is also available in the Vaihingen dataset, so profile analyses are also presented, as shown in Figure 9. The densities of the two datasets are not the same, with an average interval between adjacent points for the DIM point clouds of about 0.1m, and of about 0.3m for LiDAR. When sampling the profiles, we use the larger space as the LiDAR dataset. The profile of the LiDAR data is smoother in the planar area than the photogrammetric point clouds, as shown in Figure 9a, and also preserves better and sharper edges, shown in Figure 9b. For smaller parameters, the variation in the planar area is larger than that of the proposed texture-aware SGM, as the right-hand side of Figure 9a shows. For larger parameters, part of the edge of the building is lost and some sharp features are severely smoothed, as matches on the edges of buildings will not pass the cross-check. Furthermore, because the point clouds from SGM are coarsely sampled, the noises effect are smoothed for the profile analyses.

CONCLUSION
The aim of this study is to improve matching quality in textureless areas and at geometric discontinuities, using a texture-aware DIM method, with a ternary census transform for pixelwise cost computation. This method is based on SGM, but the concept can be generalized and easily adapted to other matching channels. The ternary census transform is designed to handle insufficient texture, where pixel intensities are similar and prone to small noises, while considering equality in the intensity comparison and encoding the results in two bits. Texture information is quantitatively measured and inserted into the cost aggregation step of SGM, and adaptively tunes the SGM parameters P1 and P2. Finally, various camera systems and configurations are used in our extensive experimental evaluation of three different datasets, which visually demonstrates that the proposed method improves the matching quality, as to some extent the noise in textureless areas is reduced and more sharp features are preserved than in traditional SGM.