OPTIMIZING MESH RECONSTRUCTION AND TEXTURE MAPPING GENERATED FROM A COMBINED SIDE-VIEW AND OVER-VIEW IMAGERY

Image-based 3D modelling are rather mature nowadays with well-acquired images through standard photogrammetric processing pipeline, while fusing 3D dataset generated from images with different views for surface reconstruction remains to be a challenge. Meshing algorithms for image-based 3D dataset requires visibility information for surfaces and such information can be difficult to obtain for 3D point clouds generated from images with different views, sources, resolutions and uncertainties. In this paper, we propose a novel multi-source mesh reconstruction and texture mapping pipeline optimized to address such a challenge. Our key contributions are 1) we extended state-of-the-art image-based surface reconstruction method by incorporating geometric information produced by satellite images to create wide-area surface model. 2) We extended a texture mapping method to accommodate images acquired from different sensors, i.e. side-view perspective images and satellite images. Experiments show that our method creates conforming surface model from these two sources, as well as consistent and well-balanced textures from images with drastically different radiometry (satellite images vs. street-view level images). We compared our proposed pipeline with a typical fusion pipeline Poisson reconstruction and the results show that our pipeline shows distinctive advantages.


Introduction*
Surface models through meshing point clouds are important presentations in Geomatics and Computer Graphics community. State-of-the-art approaches are well developed for well-acquired and single-source images, which result in many successful software packages and processing pipelines, such as Pix4D (2014), ContextCapture (Antila et al., 2011), Vricon (Landry et al., 2016), RSP (Qin, 2016). However, meshing conflated 3D dataset generated from images of different view and sensors for surface reconstruction is extremely challenging. Nowadays the tasks of 3D modeling and meshing may utilize diversified data sources, such as RGB or depth images from low-cost cameras/depth ranger or expensive LIDAR (Light Detection and Ranging) data. However, the current practice in using multisource data for meshing is still rather intuitive. For example, data from different sources are often processed separately using well-developed approaches to generate point clouds, and then apply a point cloud based meshing algorithms (Kazhdan et al., 2006), followed by a post texture mapping using available oriented images. It is known that meshing from image-based point clouds often utilizes the visibility information that codes for each point, which image (if available) observes it, as such information implicitly provides the surface information with respect to the camera positions. Meshing from the level of combined point clouds (from different sources) are likely to miss the visibility information, because on one hand, most of the image-based 3D modeling software packages do not output the visibility information, and on the other hand, when multiple * Corresponding author cameras are involved, implementing individual camera models for visibility can be trivial processes. In this paper, we take the challenge by proposing a meshing method that integrates point clouds generated from the overview satellite images and sideview perspective images. Instead of re-implementing different camera sensor models into an image-based point cloud meshing pipeline, we assume the point clouds generated from the satellite images are associated with top-view parallel cameras, which is much easier for encoding the visibility information. In addition, we take this idea to further extend an existing texture mapping method to incorporate satellite images with street-view images for seamless texture mapping.

Related Works
Existing surface reconstruction from point set methods can be categorized into implicit surface methods and Delaunay-based methods (Labatut et al., 2009). Implicit methods construct a function of space from samples, a surface can be implicitly defined as a level-set of the function allowing smooth and approximating surface reconstruction, however sometimes they are sensitive to noise, outliers or non-uniform sampling or even simply by a lack of reliable and consistent normal (Labatut et al., 2009). Among implicit methods, Poisson surface reconstruction (Kazhdan et al., 2006) is regarded one of the most commonly used methods, even though relies on surface normal, and have been implemented in many software packages, including CloudCompare (Girardeau-Montaut, 2016), MeshLab (Cignoni et al., 2008), etc. Another family of Delaunay-based methods that construct tetrahedra that partition the space to compact simplices and eliminate facets of Delaunay tetrahedra according to certain criteria such as being in an inner or outer surface. One of the most widely used techniques for large-scale surface reconstruction was the work of Labatut, et al. (2009), which was further improved by Jancosek and Pajdla (2014). In this approach, a s-t (source-sink, acyclic) graph (Dantzig and Fulkerson, 2003) is derived from Delaunay tetrahedra, the minimal s-t cut of graph labels each tetrahedra as either occupied or free. As a result, surfaces located at the interface between source labeled and sink labeled tetrahedron. To construct s-t graph, Labatut, et al. (2009) introduced ray visibility information derived from structure-from-motion (SFM) (Moulon et al., 2016) processing to compute weights for each tetrahedron. Our surface reconstruction pipeline is extended from this work, and the visibility information plays an essential role carrying ray information.
As compared to the normal of points, the ray visibility preserves raw observations which helps perform robust surface reconstruction: as shown in Figure 1a, a normal will deterministically pick a specific tangent plane on one point in surface optimization, which can be affected by outliers and noises. In contrast, the rays cover a range of possible orientations that surface may have at this point (Figure 1b), which allows more flexible formulation for surface optimization. However, the visibility information is not always available, such as point clouds generated from commercial software packages, or LIDAR datasets. The image-based 3D mesh reconstruction algorithms are wellpracticed in single-sourced data, while processing data from multiple sources is non-trivial. In this paper, taking 3D point clouds generated from satellite images, and 3D point clouds produced by an SFM pipeline (with the associated visibility and poses of the images), we propose a method to reconstruction textured meshes from such data, specifically, in our method, 1. We extend state-of-the-art Delaunay tetrahedron based surface reconstruction method which modelled ray visibility by adapting satellite stereo height map to create large scale, coarse-fine fused surface model.

2.
We extend a texture mapping method to accommodate images acquired from different sensors, i.e. side-view perspective images and satellite images.
We will introduce our proposed mesh reconstruction method in Section 2 and texture mapping method in Section 3 in detail. We practice our proposed method on overview satellite point clouds and street-view point clouds and report the results in Section 4; Section 5 concludes this paper and proposes our future works.

Data Description
Our method focuses on multi-source fusion specifically for sideview and over-view dataset as shown in Figure 2 and Figure 3. Furthermore, the satellite dataset comes approximately georeferenced (based on their sensor accuracy) (Barazzetti et al., 2016;Mandanici et al., 2019) in which the Z values of the resulting point clouds refers to the evaluation. The approach we used takes building boundaries as the common feature aligning street-view and overview datasets. The co-registration achieves an RMSE of 1.44 m in error, which are reasonable considering that the satellite point clouds have a resolution of 0.5 m (Qin, et al, 2020).We also assume that side-view images have been oriented with their resulting point clouds co-registered with the satellite point clouds, an example of the dataset is shown in Figure 4.  The resolution of side-view and over-view sources are significantly different, as the GSD (ground sampling distance) of the satellite point clouds is about 0.5 m, and pixel GSD of side-view image is less than 0.01 m. The resolution differences lead to different uncertainty of the resulting point accuracy. Therefore, it is expected that the resulting meshes can be heterogeneous in terms of their level of details.

The Proposed Multi-source Mesh Reconstruction Pipeline
The base method (Labatut et al., 2009) takes the constructed Delaunay tetrahedra of the point clouds as the input to extract the surface. These tetrahedra can be viewed as a connected graph, in which the tetrahedra are the nodes and shared/common faces are edges. Our pipeline extends from this base algorithm by incorporating point clouds generated from the satellite images.
In a multi-view stereo scenario, each point in 3D space can be determined by at least two rays, which connect the object points and camera centers, and these rays indicate the visibility of this object point on the image connected to it. Based on ray visibility, tetrahedra intersected with rays are evaluated by their probability to be in a free space (outer space), and tetrahedra behind the ray endpoint are evaluated by their probability belonging to the full space (inner space). Labatut (2009)'s work minimized the s-t cut approach to solve the labeling problem. The final surfaces are the common faces of the tetrahedra labeled as free and full spaces ( Figure 5). Figure 5. Left: Green network is Delaunay triangulation, yellow region (free space) are tetrahedra which intersected with rays (dash arrows), white region is tetrahedra labelled as full space. Right: Red lines are surfaces between full and free space, which are common faces shared by those tetrahedra. Firstly, we form Delaunay tetrahedra with the combined point set. Next, rays are created from visibility clue through triangulated points from dense matching and orthographic projection of satellite orthophoto to weight the tetrahedra. Finally, solve the minimum s-t cut for labeling problem and extract surface.

Delaunay 3D Triangulation
3D Triangulation or tetrahedralization is extended from 2D triangulation, which partitions a polyhedron into nonoverlapping basic 3D elements, where the vertices of tetrahedra take the vertices of the original polyhedron. Delaunay tetrahedron reconstruction (Van Kreveld et al., 2000) divides the convex hull of points into compact simplices, where neither extremely long edge nor extremely sharp angle is included. Many well-known commercial packages and open source projects have implemented the algorithm that creates Delaunay tetrahedron from point set, here we use CGAL (Fabri and Pion, 2009) an open source computational geometric algorithm library to construct tetrahedra.

Visibility
Dense points associated with registered images are the most common source of visibility in our approach. To setup a minimum s-t cut problem, we first set the tetrahedra covering camera centers (start point of rays) to be free space and assign s label to them. We further initialize the tetrahedron behind object points as full, assign t label to them. As for tetrahedra that intersect with visibility rays while not belonging to any cases mentioned above, we assign weights to the edges of the graph (being the faces shared by tetrahedra) as smooth terms. The aforementioned algorithm was implemented by an open-sourced project OpenMVS (Cernea, 2015).
Ideally, we should implement the satellite camera sensor models (e.g. Rational Polynomial Coefficients) (Qin, 2016) to record the visibility information, however that could painstakingly trivial and complex. By considering that the point clouds can be associated with the orthophoto through a parallel projection, we then propose a two-step method to implement this strategy: 1. We first project the satellite points on to a grid, and in this process, for each planimetric location, we only mark the highest points as being visible.
2. Create vertical rays for these visible points.

Assigning Weights for the Graph
Section 2.3 introduces the constructed graph for the surface reconstruction problem. The strategy of weighting the edges considers the visibilities. The idea is intuitive: if a face has been traversed by many rays, it is unlikely to be a surface as the points are all behind this surface (which the ray is shooting to from the camera center). In this process, the confidence value of the triangulated points, if available, can also be incorporated when assigning the weights, otherwise, we keep it as a constant, denote as .
Our method follows a so-called soft visibility weighting model that was used by the base algorithm (Labatut et al., 2009): As shown in Figure 7, the left-most tetrahedron which contains camera, gets a scaled link to the source, as discussed in Section 2.2. The green faces that passed through by the ray get scaled , and the scale factor serves as the penalty that computed by its distance to the object points, being, the closer the face is to the object point, the less trusty of the ray. As shown by Figure 7, the yellow curve in labelled with oriented facet weights represent scaled along the camera-point ray. The right-most tetrahedron gets scaled and is link to sink.
Once weighting procedure for the edges is done, we use IBFS (Incremental Breadth First Search) (Goldberg et al., 2011) maximum flow algorithm to solve minimum s-t cut problem. Finally, the common faces between source and sink tetrahedra are extracted to build up optimum surface model (Figure 8b).

The Proposed Multi-source Texture Mapping Pipeline
Our texture mapping framework is based on Waechter (2014)'s work which has been well practiced and widely used in many famous open source projects, e.g. TexRecon (Guthe, 2020), OpenMVS (Cernea, 2015), Open Drone Map (Dakota et al., 2017), etc. We consider the street-view images are perspective and the satellite orthophotos are in parallel projection. Our texture mapping considers the orthophoto as one of the images with only few simple modifications: we balanced data term of ortho images to compensate resolution gap and make ortho images as the default sources for texturing. Given these many images serving as potential source of texture, the first step is to pick a best image for each triangular face. Lempitsky and Ivanov (Lempitsky et al., 2006) compute a labeling that assign a view to be used as texture for each mesh face using a pairwise Markov random field energy formulation (Waechter et al., 2014): (1) Figure 9. Our pipeline of multi-source texture mapping (a) Street-view image only texture mapping (b) Our multi-source texture mapping Figure 10. Textured models using only the street-view images (a), and textured models using both satellite and street-view images in our texture mapping method (b).

Best-View selection
The base method (Waechter et al., 2014) determines face visibility (distinct from ray visibility) for all combinations of views and faces by first performing back face and view frustum culling, then renders faces onto images, using depth buffer to determine the nearest faces. This base approach takes projective projection as general camera model (Equation 2), but it did not assume images with parallel projections. In our work, we extend face visibility module to make it support orthographic projection (Equation 3) to adapt orthophoto. Projective projection model: (2) Orthographic projection model: (3) Data Term: The best view is defined as the closest, orthogonal image with highest-resolution and without imagery quality defects (e.g. blur). Gal et al. (2010) use the gradient magnitude of the image integrated over the face's projection, as Equation 4 shown. The data term is assigned only if the given face is visible for street view image . The visibility is checked by projecting to a given view and perform depth testing (or z-buffering) like graphics rendering system (Greene et al., 1993). In our case, the challenge is handling data from different sensors and distinct resolutions. First, the normalization factor is introduced to balance the scale difference which theoretically is the ratio of average GSD of overview images and street view images. By adjusting manually, we can control the preference between street view and orthophoto. The second challenge we met is sampling issue in z-buffering phase as each pixel in orthophoto might cover multiply triangular faces in practice because the mesh contains high resolution point clouds. Equation 5 shows the data term we used for orthophoto images, the denotes i-th ortho image. If the face is visible for ortho photo , the term will be computed by gradient magnitude; if neither street view nor ortho image can see the face , every ortho photo will be assigned a constant value which is less than infinity (infinity is the trivial cost for unlabeled face). The data term (Equation 5) for orthophotos can make sure at least orthophotos are available to provide texture, which is a common practice in photogrammetry digital products generation (Zebedin et al., 2006). If there are more than one ortho images are provided, the smooth term will determine the which orthophoto to use based on face connectivity. = image of i-th street / ortho view = any constant value that less than infinity = average GSD of orthophoto divided by average GSD of the street-view images.
Smoothness Term: Waechter (2014) proposed a smoothness term based on Potts model (Wu, 1982): , which works well and very fast.

Seamless Texture Fusion
Waechter (2014) proposed a global and local color adjustment method to blur the seams, which extended Lempitsky and Ivanov's (Lempitsky et al., 2006) color adjustment approach.
The original approach only accounts for color difference on vertices to measure color difference along the seam line, called global adjustment. The extended method added a local adjustment with Poisson editing (Pérez et al., 2003), affect border strip of image patches.
In our case, since the resolution of orthophoto is way lower than the street-view images, prior to applying the fusion of image patches, we up-sampled orthophoto to the same resolution as that of the street-view images. After color balancing and Poisson editing, color differences can be well adjusted and seams are successfully been blurred, as shown in Figure 11.

EVALUATION
We collected 6 World-View 2 images fully covering the campus region of The Ohio State University and processed with RSP satellite imagery stereo reconstruction package (Qin, 2016) The resulting DSM (or point clouds) and orthophoto cover 5 km x 3 km, with 0.5m GSD. As for close-range side view dataset, we extract a sequence of ~3000 images taken with GoPro camera mounted on a car. GoPro images are registered with state-ofthe-art SFM algorithm (in-house developed approach, following standard SFM / photogrammetry pipeline (Moulon et al., 2016)), and the dense matching were performed using OpenMVS through multiple view 3d reconstruction algorithm (Shen, 2013). The datasets are then co-registered with the free SFM bundle with satellite DSM points ( Figure 12) using approach described in (Qin, et al, 2020). As shown in Figure 13, our method outperforms Poisson reconstruction with our dataset, when processing dataset coming from multiple sources. Our method benefits from visibility information comparing with Poisson surface reconstruction (Kazhdan et al., 2006). Since Poisson reconstruction requires oriented points (with normal) but our dataset does not provide such information, we preprocessed the dataset with normal estimator (Mitra and Nguyen, 2003). The Poisson surface reconstruction we used is implemented in CloudCompare (Girardeau-Montaut, 2016), and the voxel size of the algorithm is configured to match with our average point distance. Figure 14 shows results of the texture mapping method we proposed with our reconstructed mesh models. Our method can eliminate less confident vertices from satellite (e.g. tree crowns in Figure 13a), and create smooth, nice looking model with color adjustment.
Here we summarized run-time statistics and the model complexities of the generated model using different methods. As shown in Table 1, two methods have similar processing time, but it turns out our method can generate more concise model without appearance loss.

CONCLUSION
We present an approach to mesh heterogeneous 3D data by using Delaunay tetrahedralization based mesh reconstruction with ray visibility weighting. The surface reconstruction is cast as an energy minimization problem that can be globally optimized by solving minimum s-t cut. Our method recovers visibility from DSM data which acquired by matching satellite images and scaled to adapt to street-view images with orientations. It turns out that combining complementary data sources could enhance surface representative significantly if all information is well processed. The proposed multi-source meshing pipeline which extends state-of-the-art Delaunay tetrahedron based surface reconstruction method and multisource texturing mapping pipeline which accommodates images acquired from different sensors (i.e. side-view perspective images and satellite images) can create large scale, coarse-fine fused, and seamless textured surface model.
The experiments show that our method creates visually consistent surface model from these two drastically different sources, as we as textures with well-balanced color although the source data are with very different radiometry (satellite images vs. street-view level images). We compared our proposed pipeline with a typical fusion pipeline -Poisson surface reconstruction and the results show that our pipeline shows distinctive advantages.
Our approach assumes that the satellite point clouds are associated with orthophotos, which does exclude the possible points from facades, as well as other scenarios in which different types of sensors from different views are used. Our planned future work includes methods that deal with 3D point clouds generated from other sources, such as MLS (Mobile Mapping System), ALS (Airborne LIDAR System).