Enhancement of Depth Map by Fusion using Adaptive and Semantic-Guided Spatiotemporal Filtering

Extracting detailed geometric information about a scene relies on the quality of the depth maps (e.g. Digital Elevation Surfaces, DSM) to enhance the performance of 3D model reconstruction. Elevation information from LiDAR is often expensive and hard to obtain. The most common approach to generate depth maps is through multi-view stereo (MVS) methods (e.g. dense stereo image matching). The quality of single depth maps, however, is often prone to noise, outliers, and missing data points due to the quality of the acquired image pairs. A reference multi-view image pair must be noise-free and clear to ensure high-quality depth maps. To avoid such a problem, current researches are headed toward fusing multiple depth maps to recover the shortcomings of single-depth maps resulted from a single pair of multi-view images. Several approaches tackled this problem by merging and fusing depth maps, using probabilistic and deterministic methods, but few discussed how these fused depth maps can be refined through adaptive spatiotemporal analysis algorithms (e.g. spatiotemporal filters). The motivation is to push towards preserving the high precision and detail level of depth maps while optimizing the performance, robustness, and efficiency of the algorithm.


Background
Over the last few decades, a large number of Very High Resolution (VHR) Satellites are established to provide sub-meter resolution imageries, with frequent re-visiting times during the year to allow extracting comprehensive 3D geometrical information about the scene. Algorithms such as Multi-view stereo (MVS) highly depend on the spatial and temporal resolutions of sensors to facilitate generating reliable 3D reconstructed models. However, dense stereo image matching algorithms often rely on the quality of the image pair and nature of the captured surface. Images captured by satellite sensors are prone to spectral inconsistencies and distortions, which may affect the dense stereo matching algorithm and produce incorrect or missing height information, and therefore, degrade the quality of the depth map. In particular, MVS algorithms are very sensitive to temporal inconsistencies between the images, thus, they cannot be used directly to obtain 3D models and generate height information like the Digital surface model (DSM). The acquisition conditions and measurement errors such as distance to sensor, the lighting conditions, occlusions in the scene (e.g. Tree obstructing a building), and not enough overlap between the images can also complicate unique feature matching (Qin, 2019). Object properties and their pattern in the scene can also increase the uncertainty of the generated height, such as thin structures, texture-less surfaces, featureless areas, and repeated patterns or structures directly affect stereo matching. All these errors can cause the height data to be temporally inconsistent and lead to holes, noise, missing data, blurry artifacts, and fuzzy edges and boundaries, hence, incomplete and unreliable representation of the 3D information.
To enhance low-quality depth maps (i.e. height map) generated from MVS algorithms, researchers suggest fusing several depth maps to utilize the redundant information in the temporal data. A common approach to fusing depth maps is the simple median filtering (Kuschk, 2013;Matyunin et al., 2011;Ozcanli et al., * Corresponding author 2036Neil Avenue, Columbus, Ohio, USA. qin.324@osu.edu 2015Qin, 2017). The median filter is preferred in many works related to depth refinement and processing due to its simplicity, and ability to eliminate outliers while preserving the details. Other methods to process fusion include global approaches such as Markov Random Field (MRF) and total variation (TV) which optimizes the solutions (Zhu et al., 2010;Liu et al., 2015 ;Kuschk & d'Angelo, 2013;Lasang et al., 2016;Kuschk, et al., 2017). However, they are mostly used to fused depth maps resulted from RGB-D images from Kinect or video scenes, and despite the effectiveness of these algorithms, there are still some limitations that strict their usages. One limitation is the necessity to acquire noise-free and clear set of images to improve stereo matching and the corresponding depth map, and unlike Kinect and video scenes where many images are captured indoor within a few milliseconds with consistent acquisition and lighting conditions, satellite images are more exposed to noise and outliers due to atmospheric conditions, seasonal variations, sun and satellite angles, image time, occlusions, etc. (Qin, 2019). Another issue is that current fusion algorithms are not adaptive to the scene objects. Urban features made of concrete and asphalt like buildings and roads are time-invariant, which means they experience a low rate of change in the temporal depth map. Vegetation, on the other hand, is time-variant, where they tend to change frequently during the year due to atmospheric conditions and seasonal changes. The time variance is not the only factor that should be considered while designing the fusion algorithm; the characteristics of the object should also be considered. For example, narrow objects like road and ground tend to be blocked by shadows, or trees, which increase their height uncertainty. The discrepancies in the height of objects in the depth maps produces a nonlinear type of change, which cannot be resolved directly using simple filters or fusion techniques that process all objects in the scene similarly. Therefore, in our work, we emphasize on the importance of analysing the type of class to develop an adaptive spatiotemporal fusion that processes each pixel based on its class stability.

Related Works and Rationale
Generating high-quality depth maps using very high-resolution multi-view satellite images to improve 3D reconstruction model has been an ongoing research topic in the past few years. Multiview Stereo (MVS) methods are widely used to extract elevations from multiple satellite views due to its efficiency and lower cost in comparison to direct methods like LiDAR. However, the depth map generated from MVS may be of bad quality and come with noises, outliers, and missing data due to the temporal and spectral inconsistencies between the images pair used to generate the depth map. In dense image matching the number of matched points between the image pair can play a major role in depth quality, if no or few points can be matched between the image pair due to other factors such object surface properties (e.g. smooth or texture-less surface, repeated patterns, etc.), it can produce inaccurate depth map.
A solution to recover the depth map generated from MVS algorithms is multi-view depth fusion, which has been explored by researches in two contexts either global or local approaches. The global approaches mainly involve optimization and energy function to minimize the losses and sometimes include smoothing and regularization terms as additional constraints. Markov Random Field (MRF) is one example that has been widely used in depth fusion. For instance, Zhu et al., (2010) and Liu et al.,(2015) both used spatiotemporal MRF to fuse depth maps by achieving temporal coherence. Weighted total variation (TV) and total generalized variation (TGV) methods are also popular approaches in global fusion of depths (Kuschk & d'Angelo, 2013;Lasang et al., 2016); Kuschk, et al., 2017). Neural networks are also effective fusion techniques that have been mostly used to fuse depth extracted from video scenes. Although global methods are useful and can provide accurate depth results, they are mostly applied to Kinect RGB-D sensors or video scenes, which in comparison to satellite images have an optimal indoor environment to capture numerous numbers of frames in a few milliseconds, therefore, provides lower distortions and better dense image matching. Additionally, background and foreground objects in the frames can be separated easily because of the fixed acquisition and lighting conditions, unlike satellite images where all objects are interrelated and difficult to extract directly. For depths generated from satellite images, local approaches are more popular, where fusion is performed mostly using local filtering. The most common fusion technique is the median filter because of its stability and robustness to outliers (Kuschk, 2013;Matyunin et al., 2011;Ozcanli et al., 2015;Qin, 2017). Other algorithms such as (Reinartz, Müller, Hoja, Lehner, & Schroeder, 2005) used average filtering to fuse DSMs obtained from stereo techniques using SPOT-5 and radar data obtained from SRTM, but because average filtering techniques smooth high-frequency data it tends to discard high level of details and generate outliers. Recent methods include median clustering which has been proposed in (Facciolo et al., 2017;Rumpler et al., 2013), where clustering is considered as an effective method to assess the temporal homogeneity of height data by measuring the inter-and intraclass similarity and dissimilarity. Local strategies are efficient in terms of time and robust to small outliers but are not able to solve the problems of extensive nonlinear noise and object boundaries. Moreover, the current filtering methods used in fusion often ignores objects class and process the image using fixed parameters. Since objects in the scene and nature have different responses and reflection properties in the captured satellite image, it is important to assess the elevation uncertainty for each class and incorporate it into the fusion algorithm.

Proposed Method and Rationale
The spatiotemporal analysis is an effective way to solve problems related to data consistency, noise, missing data, outliers, etc. Using redundant data, we can fuse the depth maps to result in a reliable and accurate single depth map. Fusing depth maps using an adaptive spatiotemporal algorithm is an ongoing research topic where very few researches have investigated this area of research (Qin, 2017). Therefore, in our work, we aim to investigate and analyze the role of class to improve the multidepth fusion algorithm that is adaptive to scene elements, efficient, robust, and able to recover the gaps mentioned in the literature.
This paper is organized in the following order, in Section 2, we will mention the methodology and analysis, where we will discuss the dataset used, the pre-processing methods, and the analysis that supports our proposed work. Section 3 includes the experimental results, with reasonable explanations and validations. Finally, the conclusion and future works will be addressed in Section 4.

Data Description and Pre-Processing
In our experiment, we chose 3 different datasets with varying complexities to examine and fuse; dataset I, is a commercial buildings area, dataset II is a more open area with natural objects such as vegetation and a water surface, and dataset III is a condensed housing area (for more details see figure 1). We follow the same pre-processing method for all datasets, wherein each dataset we use VHR image pairs from Worldview 3 satellite to generate the corresponding multispectral orthophoto and the temporal DSMs. We use RSP (RPC Stereo Processor) software developed by (Qin, 2016) that performs hierarchical semi-global matching (SGM) algorithm (Hirschmuller, 2008) to generate and register the Orthophoto and the DSMs. We then generate the mask for each class using the 8-band multispectral orthophoto. We categories the classes into tree, grass, buildings, and a combined category for ground/road for all datasets. We use indices such as the Normalized difference vegetation index (NDVI) along with normalized DSM (nDSM) generated using top har reconstruction (Qin and Fang, 2014) to extract the masks. For instance, the NDVI helps to determine the locations of trees and grass, and with the appropriate nDSM we can find the position of trees based on their heights, and determine the location of buildings, ground, and road accordingly. For more details on the pre-processing steps, see the diagram in figure 2.

Data Analysis
In our data analysis, we aim to investigate the stability and uncertainty of the height of objects and the way it varies according to the class. We categories the objects in the scene to either: 1) Time-invariant (e.g. buildings, ground/roads.) or timevariant (e.g. vegetation) objects. 2) Based on its surface properties (e.g. smooth texture-less, flat, etc.).
To assess the height of classes, we perform per-pixel processing by taking the standard deviation in the temporal DSMs of the DSMs for each class. We then summarize the analysis results by viewing the histogram of standard deviations to be able to see the distribution of every class and its average standard deviation (see Figure 3). From figure 3, we can see that each class has different standard deviation distribution. Most classes follow normal distribution, which can give us a clue about which type of probability to use in the weight measurement of our adaptive spatiotemporal fusion. The average standard deviation in Table 1 tells us the average uncertainty of height for each class. We can notice from Table 1 that for all three datasets vegetation (including trees and grass) has higher uncertainty than other classes, which means that height in the temporal DSM varies more than other objects. The buildings class comes in second to have higher height uncertainty. Among all classes, ground/ road appear to have better and higher elevation stability, where in all datasets it had the lowest values of uncertainty.  Figure 3. The distribution of the standard deviations of temporal heights in dataset I. The x-axis shows the standard deviation.

The Proposed Spatiotemporal Fusion Algorithm
Our fusion method is based on spatiotemporal filter; the generic formula for the fusion process is as follows (1) Where = final fused pixel i, j = location of pixels along the width and height ℎ = median height in the temporal DSMs = spectral weight = spatial weight ℎ = temporal height weight = total weight Inspired from the basic median filter, where its purpose is to suppress noise and maintain the sharpness of edges, we extract the median height (ℎ ) from the temporal for each pixel. We choose the median as the base to get the final fused height value and to calculate the height weight as in the following equation (2) Where ℎ = the height bandwidth The height bandwidth determines the extent of filtering in the temporal direction of the DSMs. ℎ is assigned for every pixel based on its class, and its value is decided empirically. With the prior knowledge of the extracted labeled image (mentioned in the pre-processing section 2.1.), we can determine which ℎ to use for each pixel.
We also analyze each DSM spatially, where spatial information is known to best reflect similarities in the neighborhood. To further improve the results, we use the spectral information in the orthophoto (to get the RGB image) and smooth the final DSM. Thus, we calculate the weights ( , ) in the formula as follows Where I = RGB image i, j, l, k = the location of pixels in the window = range bandwidth = spatial bandwidth

EXPERIMENTAL RESULTS
The results of our proposed spatiotemporal fusion are shown and discussed in this section. We will specify the parameters used and the outcomes for the three datasets. We validate our results per pixel, and against other existing methods.

Parameters
The two main parameters that we require in this work are the window size and bandwidths (spectral, spatial, and height), the choice of these parameters is made empirically. The window size is set and fixed to a moderate value of 7. Similarly, the spatial and spectral bandwidths ( , ) are set to (11, 50) (Tomasi and Manduchi, 1998). Validating the adaptive spatiotemporal fusion concept requires examining the bandwidth under different scenarios. For example, for objects with high uncertainty such as the time-variant objects like vegetation, we might need high ℎ . For flat, narrow, and featureless objects like ground and road, we might also need high ℎ . We also want to compare our adaptive approach to the fixed value of ℎ . We also chose the height bandwidth ( ℎ ) empirically, but in a manner that allows validating the assumption and the analysis made in section 2.2. Therefore, we chose ℎ based on several criteria indicated in Table 2.

Results and Validations
The fused DSM for all datasets is shown in figure 4, where we can see that most of the distortions such as noise, missing datasets, and holes are filled and taken into consideration. We can notice that the adaptive spatiotemporal fusion algorithm produces good results visually, in comparison to other methods. For instance, if we compared it to the simple median filter we can see that median filter fusion results can produce overly smoothed outcomes, in addition to blurring some details. The adaptive median, on the other hand, is better in terms of capturing the details of the buildings since they use an adaptive window for buildings, but smaller details in ground are also overly smoothed. The C-median clustering can generate fuzzy and partially noisy results as in dataset II and III.

Dataset I Dataset II Dataset III
Adaptive spatiotemporal fusion Adaptive median filter Simple median K-median clustering Figure 4. The fusion results of the three datasets using adaptive spatiotemporal fusion, the adaptive median filter, simple median filter, and C-median clustering.
We evaluate the results of the adaptive spatiotemporal fusion statistically by showing a comparison between different values of ℎ for all classes, and a comparison between other existing approaches. We compared the results of the proposed adaptive spatiotemporal fusion to the ground truth height data (from LiDAR) and measured the accuracy of each dataset to a 6-meter level of difference. The results are shown in tables 3, 4, and 5. Table 3. Accuracy assessment of dataset II. Accuracy assessment of dataset I. Note: T=Tree, GRS=Grass, B=Building, GRD= Ground/Road, and OA = Overall Accuracy.
From the table 3, we note that that the highest overall accuracies (95.6254%, 99.0925%, and 99.9962%) for all datasets are located in the second category (explained in Table 2.), where greater values of weight Wh are given to objects with high elevation uncertainty. We also note that the adaptive approach provides slightly better results (0.02% higher) than the fixed bandwidth parameter (see last rows in table 3). The second part of the table explains how different classes correspond to different sigma values. We also show this in Figures 5, 6, and 7. For instance, we can notice from Figure 7 that trees and grass achieve the highest accuracies at large values of ℎ (30 and 35), while buildings require lower ℎ to achieve high accuracy at ℎ = 20 to 25. In all three datasets, grass always require larger ℎ , whereas urban objects such as buildings and ground can achieve high accuracies at moderate values of ℎ ≈ = 20 to 25. This can lead us to conclude that the height of time-variant objects are less certain and require larger compensation using higher ℎ . We can also conclude that fixed bandwidth parameters do not achieve optimal results; this can be seen from Figures 5, 6, and 7, and confirmed from the overall accuracy in table 3 at a fixed value of ℎ =15. We can use the information in these figures, to determine the optimal sigma values for each class and use it to get the optimal fused depth map. However, we can see that the optimal height bandwidth patterns between the classes differ with the dataset depending on the complexity and objects in the scene. For instance, for areas with few trees and many large commercial buildings as in dataset I, trees and grass required least ℎ of values 13 and 18 respectively, while, dominant objects like buildings and ground required larger ℎ of values 23 and 27 respectively. On the contrary, dataset II and III had trees and grass as dominant objects, thus, they require larger ℎ than the other classes.   We also use existing methods such as simple median filter, adaptive median filter by (Qin et al., 2017), and C-median clustering by (Facciolo et al., 2017) to evaluate our method (see table 4). We find that our adaptive method provides marginally higher accuracy with an increased range between almost 0.01-2%. Similarly, the accuracy of the majority of classes has higher accuracy in the adaptive case than the other methods with fixed bandwidths. Table 4. Accuracy assessment for all datasets. Note: ADPT_MED= adaptive median filter, SIMP_MED= simple median filter, C-MEDCLUST= C-median clustering, and ADPT_STF= adaptive spatiotemporal fusion.

CONCLUSION AND FUTURE WORKS
In our work, we show that the adaptive spatiotemporal fusion technique can provide a better solution for objects with a high level of elevation uncertainty. The overall accuracy in all three datasets showed that optimal results could be achieved using a class-adaptive approach rather than the fixed parameter. Our analysis also shows that for classes with a high level of uncertainty like vegetation, more emphasis should be given by adjusting their height bandwidths to larger values. We also compare our results to existing work and found that it achieved slightly better overall accuracy ranging from 0.01 to 2%. In the next step, we will extend this work to determine the value of the height bandwidth automatically based on the scene informationusing machine learning (ML) methods. We also would like to obtain the label image more efficiently using better nonparametric classification methods with indices such as NDVI, Morphology index, etc. to extract varying objects, their class and the corresponding classification map.