STATE OF THE ART IN DIGITAL SURFACE MODELLING FROM MULTI-VIEW HIGH-RESOLUTION SATELLITE IMAGES

Data from the optical satellite imaging sensors running 24/7, is collecting in embarrassing abundance nowadays. Besides more suitable for large-scale mapping, multi-view high-resolution satellite images (HRSI) are cheaper when comparing to Light Detection And Ranging (LiDAR) data and aerial remotely sensed images, which are more accessible sources for digital surface modelling and updating. Digital Surface Model (DSM) generation is one of the most critical steps for mapping, 3D modelling, and semantic interpretation. Computing DSM from this dataset is relatively new, and several solutions exist in the market, both commercial and open-source solutions, the performances of these solutions have not yet been comprehensively analyzed. Although some works and challenges have focused on the DSM generation pipeline and the geometric accuracy of the generated DSM, the evaluations, however, do not consider the latest solutions as the fast development in this domain. In this work, we discussed the pipeline of the considered both commercial and opensource solutions, assessed the accuracy of the multi-view satellite image-based DSMs generation methods with LiDAR-derived DSM as the ground truth. Three solutions, including Satellite Stereo Pipeline (S2P), PCI Geomatica, and Agisoft Metashape, are evaluated on a WorldView-3 multi-view satellite dataset both quantitatively and qualitatively with the LiDAR ground truth. Our comparison and findings are presented in the experimental section. * Corresponding author


INTRODUCTION
Benefit from the revolution of satellite sensors and orbit revisit technology, more and more high-resolution optical Earth Observation Satellites have been launched, such as WorldView-1~4, GeoEye-1, Plé iades-1A&1B, and SuperView-1. The growing numbers of satellites and the increasing accessibility of the high-resolution satellite imagery have attracted considerable interest in large-scale 3D reconstruction. In the foreseeable future, any area of interest on the earth's surface can be observed by tens to hundreds of high-resolution satellites. Compared with the data processing capability, data obtained from the optical Earth Observation Satellites is available in embarrassing abundance nowadays. Leveraging the highresolution multi-view satellite images in 3D reconstruction fuels remote sensing applications in a variety of domains, such as ecological monitoring, 3D city-scale modelling, urban planning, and navigation (Gruen, 2012;Haala & Kada, 2010).
Given both more complicated imaging geometry models and the vast number of pixels for the satellite images created a considerable gap between the photogrammetry and computer vision community. The orientation parameters are provided as standard Rational Polynomial Coefficients (RPC) files with satellite images and other metadata by different satellite image venders, which helps both in hiding the physical sensor model and allowing high accurate mapping and surveying (Gong, 2003). This paper reviewed the latest solutions for digital surface modelling from high-resolution multi-view satellite images. Related works are analyzed in Section 2. Section 3 gives the details on the datasets and the methodology. In Section 4, the DSM generation steps for the selected software packages are introduced. We did the experiments and gave the analysis Section 5 demonstrates the. Last but not least, Section 6 concludes this paper.
Almost all the present photogrammetric software packages implement these components in their digital surface modelling workflow, and DIM might be the most challenging part. In the past decades, DIM methods have been facing unprecedented development. Many algorithms were developed and adopted both in the photogrammetry and computer vision community, examples include, multi-image matching (Zhang, 2005), dynamic programming (Van Meerbergen et al., 2002), semi-global matching (Hirschmüller, 2005(Hirschmüller, , 2008, patch-based matching (Furukawa & Ponce, 2010, 2012, and deep learningbased stereo matching (Zbontar & Lecun, 2015. For high-resolution multi-view satellite images, the digital surface modelling methods can be grouped into two categories. On the one hand, a true multi-view matching approach that leverages of pixel or feature information from multiple images simultaneously when computing correspondences. On the other hand, a multi-stereo depth fusion method that performs binocular stereo matching based on selected stereo pairs, and fuses the resulting point clouds or depth maps. A distinct advantage for multi-view matching is that the leverage of the multi-view information can increase the robustness of matching reliability. Multi-stereo methods offer speed performances by performing matching in the epipolar space compared to multi-view methods. Also, the multi-stereo strategy gained more favours in practice due to its relatively lighter implementation complexity and flexibility (Han et al., 2020).
Previous work (Ozcanli et al., 2015) already discovered that the multi-stereo methods delivered better accuracy than the multiview methods. The possible reason could be considering multiview images at the same time may introduce larger areas of occlusions that may affect the solution of the global energy minimization.

DATASET AND METHODOLOGIES
In our experiments, a 1×1 km 2 area of interest with complicated ground object classes is focused on, charactered with the elevated road/bridge, low-rise residential buildings, high-rise buildings arranged in octagonal patterns, and a high water tower formed by three cylinders are selected. The satellite images and corresponding ground truth LiDAR data are covering this region created for the IARPA Multi-View Stereo 3D Mapping Challenge (Marc Bosch et al., 2016;M Bosch et al., 2017). WorldView-3 satellite imagery is provided courtesy of DigitalGlobe, and IARPA provides ground truth LiDAR. As shown in Figure 1, there exist 50 WorldView-3 satellite images covering 100 km 2 area near San Fernando, Argentina, collected between 2014 and 2016. The ground sampling distance (GSD) of the panchromatic images is approximate to 0.3 m, and 0.2 m airborne LiDAR ground truth data is provided by IARPA, which is collected in June 2016. The reference DSM at 0.3 m GSD is generated from the LiDAR data as the ground truth. The ground truth DSM of the AOI in this paper is shown in Figure  1(middle). For DSM generation from high-resolution multi-view satellite images, we firstly select appropriate stereo pairs by analyzing their metadata, then using the strategy considering the maximum off-nadir angle, intersection, and temporal proximity of images at the same time (Facciolo et al., 2017). Secondly, DSMs are separately generated for each selected pairs using solutions including S2P, PCI Geomatica, and Agisoft Metashape that includes individual sequentially steps in a multistereo manner.
A modified version of the semi-global matching (SGM) method is used as the core dense image matching (DIM) algorithm for stereo matching in PCI and Metashape, while the S2P (Carlo de Franchis, Enric Meinhardt-Llopis, et al., 2014a, 2014b utilizes Census-based correlator more global matching (MGM) (Facciolo et al., 2015). Considering the RPC parameters might be inaccurate, DSMs of each selected pair are registered and then fused pairwisely using their own procedure in PCI Geomatica and S2P. It should be noted that Agisoft Metashape utilized all the selected images and combined them for possible stereo matching first, then a fused DSM is computed in its own strategy, which is unknown for a commercial reason. Finally, the fused DSMs are used to compare with the LiDAR-DSM after a 3D translation coregistration.

Images pair selection
Existing researches(d' Angelo et al., 2014;Facciolo et al., 2017) consider metadata when capturing the images, such as the capture date, the intersection angle, the off-nadir angle, and sun angles are of critical importance factors that can affect the matching quality. Image pair selection is one of the most critical issues for multi-stereo matching since a dramatic difference in terms of accuracy could be computed even with along-track stereo images. As shown in Figure 2, the capture time for all the stereo images (a), (b), (e), (f) is on the same day.
However, considerable failures in the dense image matching process can be observed in the computed point cloud (d) from (e) and (f), while acceptable performance can be seen in point cloud c computed from a and b, It can be seen that there exist apparent radiometric distortions between image (e) and (f).
We adopted the method proposed by Facciolo et al. (2017), which is used to select "good pairs" and won the IARPA Multi-View Stereo 3D Mapping Challenge 2016. We firstly choose pairs by the intersection angle within [5º ,45º ] from all the 1225 pairs combined by the 50 images, then pairs with the maximum off-nadir angle of image equal or greater than 40º are eliminated from the pair list, then we rank all the remind pairs according to the temporal proximity. Top five pairs are chosen for the follow-up computing in this paper.

To bundle adjustment or not?
RPCs can achieve the same geolocation precise level as the rigorous model. While due to the measurement on the satellite platform, the inaccuracies could be tens of pixels. Hence bundle adjustment is critical for correcting inaccuracies in the RPC models. However, it depends on detecting and matching a sufficient quantity of accurate correspondences across the multiview images, which is a problem when the study area is restricted to a small test area of interest. Moreover, previous work (Marí et al., 2019) highlighted that post-co-registration of the computed multi-stereo DSMs showed high robustness and accuracy compared to the bundle adjustment for small area scenario. Hence in this paper, no bundle adjustment is conducted before multi-stereo matching; individual DSM was computed from binocular stereo pairs without prior bundle adjustment. It is accessible to co-registration and fuses the computed point clouds and DSMs for the considered software packages. a

Solutions selection and configuration
There are several software packages (both commercial and open-source) available for photogrammetric processing for satellite images with RPC models, such as the ERDAS IMAGINE (Geosystems, 2004), PCI Geomatica (Geomatics, 2005) S2P, which stands for Satellite Stereo Pipeline, is an automatic 3D reconstruction workflow from 2D high-resolution satellite images implemented in Python and C language. Detailed instructions reference could be found in the link here, https://github.com/glostis/s2p.
A trial version of Agisoft Metashape Professional Version 1.6.0 is adopted for the test in this work. Agisoft is a geo-information company based in Russia. Metashape is a stand-alone software product updated from the photogrammetric processing of digital images, such as aerial and close-range photography. It is capable for satellite imagery in the latest version and 3D spatial data generation for Geographic Information System (GIS) applications, previously known as Agisoft PhotoScan, which the Professional 1.6 version released in December 2019 is capable of processing the panchromatic and multispectral satellite images. We first load all the images of the selected pairs, then follow a standard workflow in Metashape to compute the final DSM by fusing the pairwise DSMs. To get the best performance on Metashape, the Ultra-high accuracy was selected adaptively for sparse matching in alignment and dense image matching stages.
The OrthoEngine Banff in a trial PCI Geomatica Banff Edition, which was released on November 25th, 2019, was adopted for this work. PCI Geomatica is developed by the PCI Geomatics, which is a geo-imaging products and solutions company based in Canada. The fast Fourier transform phase matching method is adopted in the tie point collection stage, and the SGM (Semi-Global Matching) instead of NCC (Normalized Cross-Correlation) is selected for dense image matching. The merge function is available in the latest version for fused DSM generation, and the window size of 13 is followed for cleaning up building edges.
The matched DSMs computed from the three solutions are then aligned with the LiDAR-DSM by a 3D translation (X, Y, Z) based on the least aquae method to eliminate potential systematic error introduced by orientation or the biases caused by the software packages.

EXPERIMENTS AND RESULTS
In our experiments, the testing data on San Fernando, Argentina, were provided by Johns Hopkins University Applied Physics Lab (JHUAPL), and all the multi-view high-resolution satellite images in the dataset were WorldView-3 panchromatic images. The MVS3D dataset consists of 50 images collected between 2014 and 2016. The ground sampling distance (GSD) of the dataset is approximately 0.3m for panchromatic images.
We followed the workflow described in Section 4 to compute the DSMs results of the testing data, as shown in Figure 3. The first row in Figure 3 shows the DSMs computed from the selected pairs using S2P and PCI Geomatica, while the second rows show DSMs computed from the selected pairs using Agisoft Metashape and the ground truth LiDAR. Figure 3 shows that each of the three considered solutions is able to achieve good digital surface models in the area of interest, including elevated road/bridge, low-to-high buildings. However, some mismatches occurred in untextured or weakly textured or shadow regions, e.g., roads, building roofs and building boundaries, due to matching uncertainties on these objects. For building boundaries, there exists an extension for all the matched DSMs when comparing with LiDAR-DSM.
Moreover, some areas covering with trees (especially the trees in the shadow caused by the tall building boundaries) were not well reconstructed, due to there existing a time gap between imaging and LiDAR acquisition, besides seasonal changes on the vegetation areas among temporal images, which caused the high matching uncertainty reason. For better analysis on the performance of different solutions quantitatively, firstly, the ground truth DSM is generated from the LiDAR points cloud provided by IARPA. Given a matched DSM from one of the three latest solutions and the LiDARderived gound truth DSM, the root means square error (RMSE) of all the pixels on the DSM is computed. In order to further give a better visual comparison and determine the digital surface modelling performance of an individual software package on a specific ground feature, we also computed the spatial error distribution maps, as shown in Figure 4.  Figure 4. Error distribution map of the computed DSMs from considered solutions (red and blue indicate the largest distance).
As is shown in Figure 5(a), green, red, and yellow masks indicate three areas of interest, where the green masked area is charactered with low-rise residential buildings, the red masked area is charactered with high-rise buildings arranged in octagonal patterns, and the yellow masked area is the road/bridge. We give the cross-section analysis in Figure 5 Figure 5 (a), (b), and (c) show that there exist noises in the LiDAR-derived DSM. These noises are partially caused by moving objects, such as driving vehicles on the road or flying birds in the sky. However, these noises do not lower the accuracy measure since the LiDAR-derived DSM is not strictly ground truth, but the RMSE can still relatively indicate the performances of the considered solutions in this work. From the cross-section analysis, it can be seen that the matched DSM is consistent with the LiDARderived DSM. a. Selected AOI_1/2/3 b. AOI_1 (low-rise residential buildings) c. AOI_2 (high-rise buildings) d. AOI_3 (elevated road/bridge) Figure 5. Cross-section analysis of three computed DSMs with the ground truth LiDAR. In (a), we use green, red, and yellow to indicate three areas of interest.

CONCLUSION
In this paper, we described the digital surface model (DSM) generation pipeline for mapping and 3D modelling from multiview high-resolution satellite images. Three top solutions were considered, the open-source S2P was used to win the IARPA Multi-View Stereo 3D Mapping Challenge 2016, Agisoft Metashape, and PCI Geomatica are all high-end geoinformation commercial solutions in the industry. The Agisoft Metashape supports multi-view satellite images 3D reconstruction since December 2019, and DSMs merge function is available in PCI Geomatica since November 2019, which are the two latest commercial solutions available for assessment in this work. Furthermore, the initial results show that S2P, which adopts an MGM method, achieved the highest accuracy.
We will consider more areas of interest with different ground covers for a better understanding of the performance on different ground objects in future research. Image selection method and DSM fusion method should be further investigated for digital surface modelling from high-resolution multi-view satellite images.