DENSE MATCHING COMPARISON BETWEEN CENSUS AND A CONVOLUTIONAL NEURAL NETWORK ALGORITHM FOR PLANT RECONSTRUCTION

3D reconstruction of plants is difficult because the complex leaf distribution greatly complicates dense matching. Semi-Global Matching has been successfully applied to recover the depth information of a scene, but its performance varies with the matching cost algorithm used. In this paper, two matching cost computation algorithms, the Census transform and an algorithm based on a convolutional neural network, are tested for plant reconstruction with Semi-Global Matching. High-resolution close-range photogrammetric images from a handheld camera are used for the experiment. The disparity maps generated with the two matching cost methods are comparable and of acceptable quality, which shows the good performance of Census and the potential of neural networks to improve dense matching.


INTRODUCTION
Studying the growth patterns of individual trees helps to understand the health of the corresponding ecosystem, to predict environmental disturbances, and to determine how plants respond to climate change, e.g. drought (Levin, 1999; Gatziolis et al., 2015). Hence the appearance of individual tree crowns is regarded as a major indicator of the ecosystem's static and dynamic properties (Strigul et al., 2008; Strigul, 2012). In order to describe tree crowns precisely, 3D models can be reconstructed to provide intuitive geometric characteristics that take filigree objects such as tree leaves into account.
Spaceborne and airborne stereo imaging sensors are capable of deriving high resolution digital surface models. Frequent forest monitoring can be performed thanks to the sensors' broad coverage; however, only some large scale parameters, e.g. forest canopy height, can be obtained (Tian et al., 2017). Light Detection and Ranging (LiDAR) on airborne or terrestrial platforms can provide dense point clouds to construct detailed 3D tree architectures and derive structural parameters to assist forest management, but the data acquisition is time-consuming and costly (Morsdorf et al., 2004; Maas et al., 2008; Metz et al., 2013).
In the past decade, Semi-Global Matching (SGM) (Hirschmueller, 2008) was designed and outperformed most existing stereo matching approaches in accuracy and efficiency. However, SGM may perform variably when different matching cost computation approaches are adopted. The matching cost computation evaluates the cost of matching two points according to their similarity; it is the first step of dense matching and affects the final output (Scharstein and Szeliski, 2002). Many methods have been designed for this purpose. The Census transform calculates the matching cost based on a small window around the pixel of interest and performs well in the presence of discontinuities (Zabih and Woodfill, 1994; Hirschmueller and Scharstein, 2009). Recently, an algorithm that computes the Matching Cost based on Convolutional Neural Networks (MC-CNN) has been proposed (Zbontar and LeCun, 2016). The algorithm outperforms many previous methods on the KITTI 2012 (Geiger et al., 2013), KITTI 2015 (Menze and Geiger, 2015) and Middlebury (Scharstein and Szeliski, 2002, 2003; Scharstein and Pal, 2007; Hirschmueller and Scharstein, 2009; Scharstein et al., 2014) stereo data sets. However, little attention has been paid to plant reconstruction, which makes it possible to study plants through their geometric properties. Therefore, in this paper, Census and MC-CNN are tested for the 3D reconstruction of an indoor plant to compare their advantages and disadvantages. Two experiments are designed by integrating each of the two selected algorithms into SGM. Very high resolution images are collected conveniently and flexibly with a handheld camera to study a single plant in detail.

Preprocessing
Before SGM is performed for dense matching, some preprocessing is required. Firstly, tie points are selected for camera calibration and relative orientation. Then image rectification is executed to generate epipolar images, in which the search for correspondences is reduced from 2D to 1D. Finally, SGM based on the two selected matching cost algorithms (Census and MC-CNN) is used for plant reconstruction.

Dense Matching
Dense matching looks for dense correspondences between image pairs to recover the object depth information. Usually four steps are followed (Scharstein and Szeliski, 2002). Firstly, a similarity value between two potentially matching pixels is computed to evaluate the matching cost. Then a neighboring window around the pixel being compared is defined, and the matching costs of all pixels inside the window are aggregated to avoid the matching ambiguities that arise when the central pixel is compared alone. Afterwards, the pixel correspondence is determined and the coordinate difference between the matched pixel pair is calculated as the disparity. Finally, the disparity values are refined to obtain a disparity map for further depth information exploration and 3D reconstruction. Three categories of dense matching algorithms are available: local methods, global methods and SGM.

Local Methods
Local methods typically go through all four steps mentioned above for dense matching. According to Farinella et al. (2013), the disparity for the pixel at location p can be computed as

$$d_p = \arg\min_{d_{min} \le d \le d_{max}} E_p(d) \qquad (1)$$

in which dmin and dmax indicate the predefined disparity range and Ep is the local energy in the neighborhood of p.
The local method is intuitive and easy to implement. However, the result can be blurred, and it is difficult to resolve local matching ambiguities, especially in untextured regions.
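The local scheme can be illustrated with a minimal NumPy sketch. It assumes a precomputed cost volume `cost[y, x, k]` in which index k corresponds to disparity dmin + k, and it uses a plain square window sum as the local energy Ep; the function names and the box window are illustrative assumptions, not the aggregation used in this paper.

```python
import numpy as np

def box_aggregate(cost, radius):
    """Sum the matching cost over a (2*radius+1)^2 window, per disparity."""
    h, w, _ = cost.shape
    padded = np.pad(cost, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    energy = np.zeros(cost.shape, dtype=np.float64)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            energy += padded[dy:dy + h, dx:dx + w, :]
    return energy

def local_disparity(cost, d_min, radius=1):
    """Pick, for every pixel, the disparity with the lowest local energy."""
    energy = box_aggregate(cost, radius)
    return d_min + np.argmin(energy, axis=2)

# Toy cost volume (hypothetical data): disparity index 2 is always cheapest.
rng = np.random.default_rng(0)
toy_cost = rng.uniform(1.0, 2.0, size=(5, 6, 4))
toy_cost[:, :, 2] = 0.0
toy_disparity = local_disparity(toy_cost, d_min=10)
```

Because every pixel is decided independently, nothing in this scheme prevents neighboring pixels from receiving very different disparities, which is the ambiguity problem noted above.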

Global Methods
In global methods, besides the matching cost, a smoothness term is also considered to ensure that the generated disparity map is spatially smooth. Therefore, an energy function is defined (Farinella et al., 2013):

$$E(D) = E_{data}(D) + \lambda E_{smooth}(D) \qquad (2)$$

in which E is the energy calculated for the selected disparity map D. Edata measures the consistency between the left and right image and is essentially the sum of the matching costs of all pixels within the image. Esmooth is the smoothness term, which favors equal or similar disparity values between neighboring pixels. λ balances the influence of Edata and Esmooth.
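To make the two terms concrete, the following sketch evaluates such an energy for a given integer disparity map. The data term sums the matching cost selected by the map; the smoothness term is assumed here to be the absolute disparity difference between 4-connected neighbors, which is only one possible choice and is not specified in the text. The function name and arguments are hypothetical.

```python
import numpy as np

def global_energy(cost, disparity, lam=1.0):
    """Return E_data + lam * E_smooth for an integer disparity map."""
    h, w, _ = cost.shape
    rows, cols = np.indices((h, w))
    # Data term: cost of the disparity chosen at each pixel.
    e_data = cost[rows, cols, disparity].sum()
    # Smoothness term (assumed form): absolute differences between
    # vertical and horizontal neighbours.
    e_smooth = (np.abs(np.diff(disparity, axis=0)).sum()
                + np.abs(np.diff(disparity, axis=1)).sum())
    return float(e_data + lam * e_smooth)
```

A global method searches for the map D that minimizes this value; the sketch only scores a candidate map and does not perform the optimization.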
The goal is to minimize the above energy function and take the corresponding disparity map as the matching result. However, finding the minimum is an NP-hard problem (Boykov et al., 2001). Many optimization strategies have been used to approximate the energy minimum, such as Dynamic Programming (Birchfield and Tomasi, 1998) and Graph Cut (Kolmogorov and Zabih, 2001), but these methods usually achieve good results at the cost of long runtime.

SGM
In this project, SGM is selected due to its good performance and efficiency (d'Angelo and Reinartz, 2011; d'Angelo, 2016). The matching cost C(p, d) is calculated using either of the two selected algorithms (Census and MC-CNN) and then aggregated based on Cross-Based Cost Aggregation (CBCA) (Mei et al., 2011). In the path cost Lr(p, d) of Equation 3, P1 is the penalty applied when the disparity of the previous pixel on the path differs by 1, and P2 is a larger penalty for larger disparity differences. The sum of Lr(p, d) calculated in each direction undergoes CBCA once more before disparity computation.

Matching Cost Computation
In all three matching schemes mentioned above, the matching cost computation is an indispensable step to estimate the similarity between pixels. Many algorithms are available to calculate the matching cost (Hannah, 1974; Anandan, 1989; Kanade, 1994; Zabih and Woodfill, 1994). Among them, the Rank and Census transforms (Zabih and Woodfill, 1994; Hirschmueller and Scharstein, 2009) perform well. Furthermore, Convolutional Neural Networks are currently a popular topic in computer vision. They have been used to solve several vision problems such as classification (Krizhevsky et al., 2012) and recognition (Lawrence et al., 1997). Hence in this paper, Census and an algorithm based on Convolutional Neural Networks are tested to compute the matching cost, in order to verify the performance of the classic Census algorithm and to explore the benefits of using neural networks for plant reconstruction.
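As a concrete instance of a matching cost, the Census cost detailed in the next subsection can be sketched in a few lines of NumPy. The sketch follows the rule stated there (a bit is 1 if the neighbor is darker than the central pixel) and compares bit strings with the Hamming distance. The function names, the edge padding at image borders, and the infinite cost assigned to pixels without a partner are assumptions made for illustration; `radius=4` would correspond to the 9 × 9 window used in the experiments.

```python
import numpy as np

def census_transform(image, radius=1):
    """Return an (h, w, n_bits) boolean array of Census bit strings."""
    h, w = image.shape
    padded = np.pad(image, radius, mode="edge")  # border handling is an assumption
    bits = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            if dy == radius and dx == radius:
                continue  # skip the central pixel itself
            neighbour = padded[dy:dy + h, dx:dx + w]
            bits.append(neighbour < image)  # 1 if the neighbour is darker
    return np.stack(bits, axis=-1)

def census_cost(left_bits, right_bits, d):
    """Hamming distance between left pixel (y, x) and right pixel (y, x - d)."""
    h, w, _ = left_bits.shape
    cost = np.full((h, w), np.inf)  # pixels without a partner keep an infinite cost
    differing = left_bits[:, d:] != right_bits[:, :w - d]
    cost[:, d:] = np.count_nonzero(differing, axis=-1)
    return cost
```

Stacking `census_cost` over all candidate disparities gives the cost volume C(p, d) that the subsequent aggregation steps operate on.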

Census
Census transform is a non-parametric measure that is insensitive to radiometric differences between images and behaves well when discontinuities exist (Hirschmueller and Scharstein, 2009; d'Angelo and Reinartz, 2011). Hence, it is an appropriate choice for matching cost computation because discontinuities are ubiquitous among plant leaves. Census constructs a bit string for each pixel, in which each bit is determined by comparing the intensity of one pixel in a predefined neighborhood with that of the central pixel. The bit for a neighbor is set to 1 if the neighbor has a lower intensity than the central pixel, as shown in Figure 1. In this way, the bit string roughly represents the local image structure, and the cost of matching two pixels is measured as the Hamming distance between their bit strings.

MC-CNN
Convolutional Neural Networks provide a new possibility in dense matching (Luo et al., 2016; Zbontar and LeCun, 2016). Zbontar and LeCun (2016) proposed to train a net on pairs of small image patches with known true disparity, in order to learn a similarity measure for matching cost computation. The KITTI and Middlebury stereo data sets, for which ground truth disparity maps are available, have been used to construct a binary classification data set. At each image location, a positive and a negative training example are extracted. The positive example is a pair of patches from the left and right image whose central pixels are projected from the same object point, while the negative example is a pair of patches for which this geometric condition is not satisfied. Then two network architectures are designed and trained on the extracted training examples for matching cost computation. With a pair of patches as input, both architectures extract feature vectors to represent each patch and output the similarity measure between them. The first architecture, called the fast architecture, concentrates on efficiency and uses a fixed similarity measure, while the second, called the accurate architecture, learns to measure the similarity between feature vectors during training to achieve higher accuracy. In this paper, the accurate architecture is used because plant reconstruction demands high quality.
The accurate architecture is a siamese network whose two subnetworks share the same weights (Bromley et al., 1993). As shown in Figure 2, each subnet consists of several convolutional layers, each of which is followed by a rectified linear unit. The outputs of the two subnets are then concatenated and passed through a number of fully-connected layers, each followed by a rectified linear unit. Finally, one more fully-connected layer produces a single number, which is transformed by the sigmoid nonlinearity into the final output.
The output is the similarity score between the input patches; the matching cost is the negative of this score.
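The data flow through the accurate architecture can be sketched as a plain NumPy forward pass. This is only a structural illustration under strong assumptions: the weights are random and untrained, biases are omitted, and the patch size (9 × 9), the number of layers, and the layer widths are hypothetical values chosen to keep the example small; they are not the settings of Zbontar and LeCun (2016).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3x3_valid(x, kernels):
    """x: (h, w, c_in), kernels: (3, 3, c_in, c_out) -> (h-2, w-2, c_out)."""
    h, w, _ = x.shape
    out = np.zeros((h - 2, w - 2, kernels.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            out += x[dy:dy + h - 2, dx:dx + w - 2, :] @ kernels[dy, dx]
    return out

def subnet(patch, conv_weights):
    """One siamese branch: convolutional layers, each followed by a ReLU."""
    x = patch[:, :, None].astype(np.float64)
    for kernels in conv_weights:
        x = relu(conv3x3_valid(x, kernels))
    return x.reshape(-1)

def similarity(left_patch, right_patch, conv_weights, fc_weights, out_weights):
    """Both branches use the same conv_weights (shared weights)."""
    features = np.concatenate([subnet(left_patch, conv_weights),
                               subnet(right_patch, conv_weights)])
    for weights in fc_weights:          # fully-connected layers with ReLU
        features = relu(weights @ features)
    return float(sigmoid(out_weights @ features))  # final layer + sigmoid

# Hypothetical, untrained weights: four 3x3 layers reduce a 9x9 patch to 1x1.
rng = np.random.default_rng(0)
conv_weights = [rng.normal(scale=0.3, size=(3, 3, 1, 8))]
conv_weights += [rng.normal(scale=0.3, size=(3, 3, 8, 8)) for _ in range(3)]
fc_weights = [rng.normal(scale=0.3, size=(16, 16)) for _ in range(2)]
out_weights = rng.normal(scale=0.3, size=16)

left_patch = rng.uniform(size=(9, 9))
right_patch = rng.uniform(size=(9, 9))
score = similarity(left_patch, right_patch, conv_weights, fc_weights, out_weights)
matching_cost = -score  # the cost is the negative of the similarity score
```

With trained weights, evaluating this score for every pixel and every candidate disparity would yield the cost volume that SGM takes as input.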

Disparity Computation and Refinement
The disparity for each pixel is calculated using the winner-takes-all strategy to generate a disparity map. Following Zbontar and LeCun (2016) and Mei et al. (2011), some post-processing steps are implemented to refine the quality of the disparity map, including a left-right consistency check, subpixel enhancement, a median filter, and a bilateral filter.
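The first two of these steps can be sketched as follows. Winner-takes-all selects the disparity index with the lowest cost, and the left-right consistency check compares the left disparity of a pixel with the right disparity of its matched pixel. The tolerance of one pixel, the non-negative integer disparities, and the convention that a left pixel at column x matches the right pixel at column x - d are assumptions of this sketch.

```python
import numpy as np

def winner_takes_all(cost):
    """Return, per pixel, the disparity index with the minimum cost."""
    return np.argmin(cost, axis=2)

def left_right_check(disp_left, disp_right, tolerance=1):
    """Return a boolean mask of pixels that pass the consistency check."""
    h, w = disp_left.shape
    cols = np.arange(w)[None, :] - disp_left       # matched column in the right image
    inside = (cols >= 0) & (cols < w)               # match must fall inside the image
    rows = np.arange(h)[:, None]
    matched = disp_right[rows, np.clip(cols, 0, w - 1)]
    return inside & (np.abs(disp_left - matched) <= tolerance)
```

Pixels for which the mask is False can then be marked as invalid or filled in by the later refinement steps.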

Dataset
In the experiment, the data are collected using a digital high-resolution handheld camera (Canon EOS-1D X). Details about the image acquisition are available in Table 1.
Five stereo pairs are constructed using four collected images as shown in Table 2. The four images are taken in order with the camera moving along a line at a distance of approximately 3 meters from the plant. A reference disparity map is generated from the matching results of the first three pairs, while the last two are used to observe the sensitivity of Census and MC-CNN to the baseline length, which is described in detail in Section 3.3.

Table 1. The camera setting for data collection

3D Reconstruction
MicMac (Rosu et al., 2015) is utilized for camera calibration, relative orientation and image rectification to generate epipolar image pairs in which corresponding points have a y-disparity of less than 1 pixel. The epipolar images generated from stereo pair 1 are shown in Figure 3.
In order to compare Census and MC-CNN for matching cost computation, two experiments have been designed as shown in Figure 4. The input for both experiments is the same image pair, preprocessed by MicMac to generate epipolar images. The epipolar images then go through two processing lines, both of which use SGM but with different matching cost algorithms (Census and MC-CNN, respectively).
For both experiments, the penalties P1 and P2 for SGM are set adaptively according to the intensity difference between the target pixel and its previous pixel. In experiment 1, the Census algorithm uses a 9 × 9 window. In experiment 2, the net pretrained on the Middlebury data by Zbontar and LeCun (2016) is directly used for matching cost computation since our data have

Result Evaluation
Comparing the results in Figure 5, it is found that the plant is reconstructed well by both Census and MC-CNN. Both are able to preserve details of the plant, and even the complex outlines of leaves are recovered to a certain extent. It should be noted that the net used for MC-CNN is not trained specifically for plant reconstruction but still achieves results comparable to those of Census, which indicates the potential of neural networks.
For the crown of the plant, a more detailed comparison is shown in Figure 6, in which the color range is adjusted according to the local disparity change. Firstly, a line is selected (marked red in the master epipolar image), and the disparity change along it is displayed by the curves on the right. The horizontal axis represents the pixel number along the red line, while the vertical axis is the corresponding disparity value. Disparity values that lie far from the trend of the curve are ignored, because they would obscure the detailed disparity change of the majority of points. It is found that the results from Census and MC-CNN are almost consistent with each other and that the predicted disparity change describes the depth trend largely correctly. However, some unnatural disparity changes are still detected on surfaces that are almost flat, e.g. the leaf surface in (a). Furthermore, four leaves are selected to compare the reconstruction performance of Census and MC-CNN.
The selected leaves are marked with blue rectangles; two of them are close to the camera (in (a) and (b)), while the others are relatively far away (in (c) and (d)). In (a) and (b), the outlines of the selected leaves are reconstructed better by MC-CNN than by Census. For the leaves shown in (c) and (d), however, Census performs better.
In order to make a further comparison between Census and MC-CNN, a reference disparity map is generated as follows. Firstly, an additional left-right consistency check is applied to each disparity map generated from stereo pairs 1, 2 and 3, so that six new disparity maps are acquired (for each pair, one from the Census result and one from the MC-CNN result). The pixels with inconsistent matching are assigned one unified high value on the new maps as a mark of invalid matching; the rest are regarded as valid matching. Afterwards, all six new disparity maps are used to produce a dense point cloud. Finally, taking the plane of the master epipolar image of stereo pair 1 as the reference plane, the point cloud is reprojected to generate a disparity map that serves as our reference disparity map. For each point on the reference plane onto which more than one point is reprojected, the point with the maximum disparity (i.e. minimum depth) is kept. Figure 7 shows the procedure to generate the reference disparity map with the merged information of the three stereo pairs.
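The last rule, keeping the maximum disparity where several points fall onto the same reference pixel, can be sketched with NumPy as below. The sketch assumes the point cloud has already been reprojected, so that each point is given by an integer pixel position (row, column) and a disparity on the reference plane; the function name and the marker used for pixels that receive no point are hypothetical.

```python
import numpy as np

def merge_reprojected(rows, cols, disparities, shape, invalid=-1.0):
    """Keep, per reference pixel, the largest disparity (closest point)."""
    reference = np.full(shape, -np.inf)
    # Unbuffered maximum: repeated (row, col) indices are all taken into account.
    np.maximum.at(reference, (rows, cols), disparities)
    reference[np.isinf(reference)] = invalid  # pixels hit by no point (assumed marker)
    return reference
```

Keeping the closest point acts as a simple visibility test: a leaf in front hides whatever lies behind it in the reference view.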
The two new disparity maps from stereo pair 1 (termed A) can be directly compared with the reference map as they share the same master epipolar image. However, in order to observe the influence of the baseline length, the results of stereo pairs 4 and 5 should be reprojected onto the same reference plane. Therefore, for each single disparity map from stereo pairs 4 and 5, the same procedure as described in the previous paragraph is used, resulting in the newly reprojected disparity maps B (for stereo pair 4) and C (for stereo pair 5). Thus the reference disparity map is compared with A, B and C respectively to calculate a valid matching point percentage. The percentage is calculated as

$$w = \frac{n_i}{n_{ref}} \times 100\% \qquad (4)$$

where w is the valid matching percentage, and ni and nref are the numbers of valid matching points in A, B or C and in the reference disparity map, respectively. One random region within the plant crown is selected for the computation, as shown in Figure 8. The results are recorded in Table 3, which shows that MC-CNN performs slightly better than Census in the reconstruction density of the disparity map. The valid matching point percentage decreases as the stereo baseline length increases, which reflects the growing matching difficulty.

Figure 7. The reference disparity map generation
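The percentage can be computed with a few lines of NumPy. The sketch assumes that w is the ratio of ni to nref expressed in percent, that invalid pixels carry one known marker value, and that the evaluation region is given as a pair of slices; all names are hypothetical.

```python
import numpy as np

def valid_percentage(disparity, reference, invalid_value, region=None):
    """Return 100 * n_i / n_ref inside the optional region (a tuple of slices)."""
    if region is not None:
        disparity = disparity[region]
        reference = reference[region]
    n_i = np.count_nonzero(disparity != invalid_value)
    n_ref = np.count_nonzero(reference != invalid_value)
    return 100.0 * n_i / n_ref
```

For example, a map with 6 valid pixels evaluated against a reference with 8 valid pixels in the same region gives 75 percent.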

CONCLUSION
Plant reconstruction from stereo imagery is difficult due to the complexity of leaves, which exhibit similar shape and intensity information in images. Hence the matching cost computation should be accurate enough to adequately represent the similarity between patches as the basis for the final disparity computation. In this paper, SGM using Census to calculate the matching cost is tested and an acceptable reconstruction result is generated. With the same procedure but Census replaced by MC-CNN, a comparable result is obtained. The MC-CNN result is based on a pretrained net, which demonstrates its potential for dense matching. The evaluation shown in this paper is not complete due to the lack of ground truth data.
Dense matching for stereo data with large baselines or stereo angles is always difficult. The problems include perspective distortion and occlusion, which are even more serious when the target to be reconstructed is a plant with dense leaves, as in our case. Some leaves can be occluded, so that matching them is almost impossible. In the future, the matching algorithms should be improved to obtain more detailed geometric information of plants. The refined reconstruction model can help to monitor plant health.
SGM is a combination of local and global methods. The disparity is computed independently for each pixel, yet the disparities of neighboring pixels are also considered to pursue smoothness. In order to approximate the global energy minimization in 2D, SGM defines several paths (e.g. 16) going through each pixel in different directions and performs the minimization on the aggregated cost of each direction to calculate the disparity. In one direction, the cost at location p is computed as

$$L_r(p, d) = C(p, d) + \min\big(L_r(p - r, d),\; L_r(p - r, d - 1) + P_1,\; L_r(p - r, d + 1) + P_1,\; \min_i L_r(p - r, i) + P_2\big) \qquad (3)$$

in which Lr(p, d) is the cost along the path traversed in direction r for the pixel p at disparity d, and C(p, d) is the matching cost.
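The recursion of Equation 3 can be sketched for a single direction, here left to right along each image row. The sketch assumes a cost volume `cost[y, x, d]`, constant penalties `p1` and `p2` (the experiments in this paper set them adaptively), and initializes the path cost at the first column with the matching cost; the function name is hypothetical.

```python
import numpy as np

def aggregate_left_to_right(cost, p1, p2):
    """Path cost L_r for direction r = (0, 1), following Equation 3."""
    h, w, _ = cost.shape
    path = np.empty(cost.shape, dtype=np.float64)
    path[:, 0, :] = cost[:, 0, :]              # no predecessor at the first column
    for x in range(1, w):
        prev = path[:, x - 1, :]               # L_r(p - r, .), shape (h, n_disp)
        minus = np.full(prev.shape, np.inf)
        minus[:, 1:] = prev[:, :-1] + p1       # L_r(p - r, d - 1) + P1
        plus = np.full(prev.shape, np.inf)
        plus[:, :-1] = prev[:, 1:] + p1        # L_r(p - r, d + 1) + P1
        jump = prev.min(axis=1, keepdims=True) + p2   # min_i L_r(p - r, i) + P2
        best = np.minimum(np.minimum(prev, minus), np.minimum(plus, jump))
        path[:, x, :] = cost[:, x, :] + best
    return path
```

Running the same recursion for the remaining directions, summing the resulting path costs, and taking the disparity with the minimum sum at each pixel gives the SGM disparity before refinement.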

Figure 1. The bit string construction of Census

Figure 2. The accurate architecture to calculate matching cost

Figure 3. The epipolar image pair for dense matching

Figure 6. The detailed plant crown comparison. From left to right in each subset: the master epipolar image, Census results, MC-CNN results, and the disparity change for one selected line

Table 3. Valid matching point percentage comparison between Census and MC-CNN results