Combining visibility analysis and deep learning for refinement of semantic 3D building models by conflict classification

Semantic 3D building models are widely available and used in numerous applications. Such 3D building models display rich semantics but no façade openings, chiefly owing to their aerial acquisition techniques. Hence, refining models' façades using dense, street-level, terrestrial point clouds seems a promising strategy. In this paper, we propose a method of combining visibility analysis and neural networks for enriching 3D models with window and door features. In the method, occupancy voxels are fused with classified point clouds, which provides semantics to voxels. Voxels are also used to identify conflicts between laser observations and 3D models. The semantic voxels and conflicts are combined in a Bayesian network to classify and delineate façade openings, which are reconstructed using a 3D model library. Unaffected building semantics is preserved while the updated one is added, thereby upgrading the building model to LoD3. Moreover, Bayesian network results are back-projected onto point clouds to improve points' classification accuracy. We tested our method on a municipal CityGML LoD2 repository and the open point cloud datasets TUM-MLS-2016 and TUM-FAÇADE. Validation results revealed that the method improves the accuracy of point cloud semantic segmentation and upgrades buildings with façade elements. The method can be applied to enhance the accuracy of urban simulations and facilitate the development of semantic segmentation algorithms.


INTRODUCTION
Semantic 3D building models at levels of detail (LoD) 1 and 2 are widespread and commonly applied in urban-related studies (Biljecki et al., 2015). Such 3D models are frequently reconstructed using a combination of 2D building footprints and multi-view stereo (MVS) or airborne laser scanning (ALS) techniques, as in the example of more than eight million reconstructed buildings in Bavaria, Germany (Roschlaub and Batscheider, 2016). This reconstruction strategy enables detailed modeling of roof surfaces but renders generalized façades neglecting openings such as windows and doors, as shown in Figure 1b.
Reconstructing façade elements becomes a key factor enabling automatic 3D building modeling at LoD3, for which an increasing demand has been expressed by numerous applications, including estimating heating demand (Nouvel et al., 2013), preserving cultural heritage, calculating solar potential (Willenborg et al., 2018), and testing automated driving functions (Schwab and Kolbe, 2019).
Since point clouds are deemed one of the best data sources for 3D modeling, dense, street-level mobile laser scanning (MLS) point clouds appear especially suitable for at-scale façade reconstruction (Xu and Stilla, 2021). For this purpose, however, point clouds require semantic classification, which has recently been approached using machine and deep learning methods yielding promising results (Matrone et al., 2020). Yet, these methods can have limited accuracy when classifying objects that are translucent (e.g., windows) or have an inadequate amount of training data (e.g., doors). On the other hand, points' rays intersecting with 3D models can provide geometrical cues about possible façade openings, but without differentiating between classes, such as window, door, or underpass (Tuttas and Stilla, 2013). In this paper, we present a strategy that combines both ray- and region-based methods for conflict classification (Wysocki et al., 2022a). This approach leads to refinement of both 3D building models and the accuracy of segmented point clouds; our contributions are as follows:
• a CityGML-compliant strategy for upgrading LoD2 to LoD3 models by model-driven 3D window and door reconstruction;
• a method classifying conflicts between laser observations and 3D building models using deep learning networks;
• a method improving the semantic segmentation results of deep learning networks by analyzing ray-traced points and 3D building models.

RELATED WORK
The internationally used CityGML standard establishes the LoD of semantic 3D city objects (Gröger et al., 2012). One of the chief differences between LoD2 and LoD3 is the presence of façade openings in the latter. In our case, we search for absent façade elements in the input models using point clouds and then carry out their reconstruction. Therefore, we deem methods as related if they deal with detecting missing features (Section 2.1) and façade reconstruction using point clouds (Section 2.2).
Visibility analysis using point clouds
Hebel et al. (2013) employ visibility analysis to detect changes between different point cloud epochs. The method addresses the uncertainty of ALS measurements using the Dempster-Shafer theory (DST). Ray tracing on a voxel grid is introduced to identify occupied, empty, and unknown states per epoch. Based on the comparison of epochs, they distinguish consistent, disappeared, and appeared states.
Visibility analysis is utilized to remove dynamic objects from point clouds, too (Gehrung et al., 2017). MLS observations' rays are traced on an efficient octree grid structure introduced by Hornung et al. (2013). Each traced ray provides occupancy probabilities, which are accumulated per voxel using the Bayesian approach. Moving objects are removed based on decreasing occupancy probability of ray-traversed voxels.
A multimodal approach to visibility analysis is proposed by Tuttas et al. (2015). They investigate how to monitor the progress of a construction site using photogrammetric point clouds and building information modeling (BIM) models. The Bayesian approach and the octree grid structure are employed to analyze the points' rays and vector models. The as-is (point cloud) to as-planned (3D model) comparison differentiates between potentially built, not visible, and not built model parts.
In our previous work (Wysocki et al., 2022a), we introduced visibility analysis to refine semantic 3D building models with underpasses using MLS point clouds. The method compares ray-traced points with building objects on an octree grid in a probabilistic fashion. Contours of underpasses are identified based on an analysis of conflicts between laser observations and building models, supported by vector road features.

Façade openings reconstruction using point clouds
Substantial research effort has been devoted to methods using images for façade segmentation (Szeliski, 2010). Nevertheless, 2D images require additional processing to enable semantic 3D reconstruction. 3D point clouds, however, provide an immediate 3D environment representation, which makes them one of the best datasets for urban mapping (Xu and Stilla, 2021).
When analyzing laser observations, openings are often assumed to represent holes due to their translucent characteristics or face-intruded position (Tuttas and Stilla, 2013; Fan et al., 2021). For example, windows are detected based on building interior points, which imply opening existence (Tuttas and Stilla, 2013). Borders of openings are delineated based on the ray tracing of interior points and the detected façade plane in point clouds. Zolanvari et al. (2018) propose a slicing method to identify openings using horizontal or vertical cross-sections. The method finds façade planes using the RANSAC algorithm and removes noisy points based on their deviations from the planes. Gaps occurring in horizontal or vertical cross-sections delineate possible openings.
Layout graphs are proposed by Fan et al. (2021) to identify façade structures. Spatial relations among detected objects are encoded and exploited by the Bayesian framework to deduce the whole façade layout.
Recently, however, data-driven methods based on machine and deep learning approaches have provided promising results for classifying point clouds, especially when using the self-attention mechanism (Zhao et al., 2021). These great strides have influenced façade segmentation of point clouds, too (Matrone et al., 2020). Modified versions of the DGCNN deep learning architecture have been proposed to classify façade elements in point clouds. The method employs features stemming from machine learning approaches to improve deep learning network accuracy.
Little research attention has been given to investigating the automatic upgrade of LoD2 to LoD3 building models using point clouds, except, to the best of our knowledge, our previous works refining overall façade geometry (Wysocki et al., 2021a,b) and reconstructing underpasses (Wysocki et al., 2022a). However, related work is proposed by Hensel et al. (2019) for detecting and reconstructing openings, not by point clouds but by exploiting the textures of semantic city models. They apply the Faster R-CNN deep neural network to identify the bounding boxes of windows and doors on textured CityGML building models. To minimize inaccuracies in the alignment of openings, they apply mixed-integer linear programming. Then, bounding boxes serve as reconstructed opening elements in LoD3 building models.

METHODOLOGY
In contrast to our previous work devoted to refining building models with underpasses (Wysocki et al., 2022a), in this paper we focus on detecting and reconstructing outstanding façade openings, such as windows and doors. Moreover, our method refines point cloud segmentation by back-projecting classified conflicts onto the input point clouds.
As presented in Figure 2, the method evaluates and assigns uncertainties to the input datasets (Section 3.1). While a neural network is trained on points representing façade elements (Section 3.2), the points ray tracing process performs probabilistic classification of a scene into occupied, empty, and unknown voxels (Section 3.3). Subsequently, labeled voxels are compared to segmented points to derive static and remove dynamic points in voxels (Section 3.5). The voxels are also compared to vector 3D models to identify confirmed, conflicted, and unknown voxel labels (Section 3.4). If conflicted and static features exist, probabilistic classification is carried out, where a Bayesian network identifies unmodeled openings and other objects (Section 3.6). These are back-projected onto the point cloud, refining its segmentation accuracy. If the Bayesian network detects windows or doors, shape extraction is conducted (Section 3.7); otherwise, another module can be triggered, such as the underpass reconstruction (Wysocki et al., 2022a). Opening shape extraction is followed by shape generalization, which delineates fitting borders for 3D reconstruction (Section 3.8). Window and door 3D models are automatically fitted to the shapes based on the respective geometry and opening class (Section 3.9). Afterward, unchanged and new semantics are assigned to geometries, following the CityGML standard for LoD3 (Gröger et al., 2012).

Data with uncertainties
Uncertainties in laser measurements and vector objects can stem from various sources, such as imprecise metadata, data transformations, and acquisition techniques. Uncertainties are application-dependent, too. Therefore, the proposed façades refinement involves uncertainties concerning the global positioning accuracy of point clouds and building models. To quantify these uncertainties, we introduce the confidence interval (CI), which is estimated using the confidence level (CL), its associated z value (z), standard deviation (σ), and mean (µ).
Let σ1 be the location uncertainty of point clouds and σ2 the location uncertainty of 3D model walls. These are estimated based on the assumed point cloud global registration error e1 and the global location error of 3D model walls e2. Then, the façade's CI is calculated based on σ = √(σ1² + σ2²). The maximum upper and lower bounds are given by [µi − 2σi, µi + 2σi], assuming the L1 norm and a Gaussian distribution (Suveg and Vosselman, 2000). CL1 and CL2 quantify the operator's confidence level in true-value deviations for laser measurements and 3D model walls, respectively. Depending on the CL value, corresponding zi values are assumed. The division of µi by zi estimates the standard deviation σi (Hazra, 2017).
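To make the uncertainty quantification concrete, the following minimal Python sketch reproduces the calculation under our reading of the text, assuming µi = ei / 2 and the parameter values later listed in the parameter settings (e1 = 0.3 m, e2 = 0.03 m, z = 1.64 for CL = 90%); variable names are illustrative and not taken from the implementation.

```python
import math

def sigma_from_error(e, z):
    """Standard deviation from an assumed maximal error e and z value:
    treat mu = e / 2 as the mean deviation and divide by z (cf. Hazra, 2017)."""
    mu = e / 2.0
    return mu / z

# Assumed example values (cf. the parameter settings section)
e1, z1 = 0.30, 1.64   # point cloud global registration error, CL1 = 90%
e2, z2 = 0.03, 1.64   # 3D model wall location error, CL2 = 90%

sigma = math.sqrt(sigma_from_error(e1, z1)**2 + sigma_from_error(e2, z2)**2)
upper_ci = 2.0 * sigma  # maximal deviation bound mu + 2*sigma around the facade
print(f"sigma = {sigma:.3f} m, upper CI = {upper_ci:.2f} m")  # ~0.18 m, i.e. ~0.2 m
```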

Semantic segmentation
The goal of semantic segmentation is to divide a point cloud into several subsets based on the semantics of the points. Following Wysocki et al. (2022b) and as shown in Figure 3, eight relevant classes for façade segmentation and reconstruction tasks are considered: arch (dark blue), column (red), molding (purple), floor (green), door (brown), window (blue), wall (beige), and other (gray).
The segmentation is performed using a modified Point Transformer self-attention network (Zhao et al., 2021) extended by the use of geometric features improving the network performance, such as height of the points, roughness, volume density, verticality, omnivariance, planarity, and surface variation. The last three mentioned features are based on the normalized eigenvalues λi (λ1 > λ2 > λ3), which are derived from the 3D point coordinates within a considered spherical neighborhood ri (Weinmann et al., 2013;Grilli and Remondino, 2020).
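As an illustration of the eigenvalue-based features, the sketch below derives omnivariance, planarity, and surface variation for a single point from its spherical neighborhood, following the standard definitions of Weinmann et al. (2013); it is a simplified stand-alone example rather than the network's actual feature extraction code.

```python
import numpy as np

def eigen_features(neighborhood_xyz):
    """Omnivariance, planarity, and surface variation from the normalized
    eigenvalues lambda1 >= lambda2 >= lambda3 of a point's neighborhood."""
    cov = np.cov(neighborhood_xyz.T)                  # 3x3 covariance of the neighbors
    eig = np.clip(np.linalg.eigvalsh(cov)[::-1], 1e-12, None)
    l1, l2, l3 = eig / eig.sum()                      # normalized eigenvalues
    omnivariance = (l1 * l2 * l3) ** (1.0 / 3.0)
    planarity = (l2 - l3) / l1
    surface_variation = l3 / (l1 + l2 + l3)
    return omnivariance, planarity, surface_variation

# Example: points roughly on a plane yield high planarity, low surface variation
pts = np.random.rand(100, 3) * [1.0, 1.0, 0.01]
print(eigen_features(pts))
```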
Finally, using a softmax output layer, we obtain an output vector of probabilities for each predicted class, which becomes fundamental for running our conflict classification approach (Section 3.5).

Ray tracing
Points ray tracing is performed to identify absent structures in existing 3D building models ( Figure 4). To enable comparison between these modalities, we employ a 3D occupancy grid. The grid adapts its size to the input data since it utilizes an octree structure. 3D voxels are the octree structure's leaves, and their size vs is selected based on the relative accuracy of laser observations.
Every laser observation is traced from the sensor position si, following the orientation vector ri, to the reflecting point pi = si + ri. Voxels containing pi are labeled as occupied (blue), those traversed by a ray as empty (pink), and the untraversed ones as unknown (gray). The labels are assigned based on a probability score that considers multiple laser observations zi: the log-odds estimate L(n|z1:i) of a voxel n is updated from the previous estimate L(n|z1:i−1), initialized with the prior probability P(n), and clamped by the thresholds lmin and lmax (Hornung et al., 2013; Tuttas et al., 2015):

L(n|z1:i) = max(min(L(n|z1:i−1) + L(n|zi), lmax), lmin), where L(n) = log [P(n) / (1 − P(n))]

The grid is vector-populated by inserting 3D model faces and their quantified uncertainties (Section 3.1). Hence, each face has an assigned façade's maximal deviation range (upper CI) and its confidence level (CL). Ultimately, the grid's 3D voxels include attributes such as location, size, as well as state probability stemming from laser observations and a building model.
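A minimal occupancy-update sketch in the spirit of Hornung et al. (2013) is given below; the increment and clamping values are the ones later assumed in the parameter settings, and the function is illustrative rather than our actual octree implementation.

```python
import math

L_OCC, L_EMP = 0.85, -0.4    # log-odds increments for a hit / a traversed ray
L_MIN, L_MAX = -2.0, 3.5     # clamping thresholds

def update_logodds(l_prev, hit):
    """Clamped log-odds update of one voxel for a single laser observation."""
    return max(min(l_prev + (L_OCC if hit else L_EMP), L_MAX), L_MIN)

# Voxel initialized with the uniform prior P(n) = 0.5, i.e. log-odds 0,
# then hit by two rays and traversed by one
l = 0.0
for hit in (True, True, False):
    l = update_logodds(l, hit)
p_occupied = 1.0 - 1.0 / (1.0 + math.exp(l))   # back to a probability (~0.79)
```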

Voxels to model comparison
As shown in Figure 4, each voxel is analyzed in relation to its intersection with a façade: occupied voxels that intersect with façades are labeled as confirmed (green); empty voxels that intersect with façades are labeled as conflicted (red); unknown voxels hold their status, as they represent unmeasured space. Voxels are projected onto the intersected façade, forming the model comparison texture map layer.
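The comparison reduces to a simple state mapping per voxel; a sketch of one possible formulation (with hypothetical string labels) is:

```python
def compare_voxel_to_model(state, intersects_facade):
    """Map a voxel's ray-tracing state to a model comparison label.
    `state` is "occupied", "empty", or "unknown"; `intersects_facade` marks
    voxels lying within the facade's confidence interval."""
    if state == "unknown" or not intersects_facade:
        return state                 # unmeasured space or off-facade voxels keep their status
    if state == "occupied":
        return "confirmed"           # observation supports the modeled wall
    return "conflicted"              # empty space where the model expects a wall
```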

Voxels to point cloud comparison
Ray tracing provides physical, per-voxel occupancy indicators, while semantic segmentation yields educated, per-point semantic classes. Both of these sources provide their semantic information with a probability measure. The fusion of voxels and points is conducted to transfer per-point semantic classes to occupancy voxels and suppress the impact of dynamic points ( Figure 6). The rationale behind this fusion is that static, occupied voxels (yellow) are building-related; dynamic, unoccupied voxels (gray) represent moving objects, such as pedestrians or cars, and can be suppressed by multiple laser observations, as shown by Gehrung et al. (2017) and in Figure 7.
Semantic points are inserted into the voxel grid to enable comparison between the two representations. Then, the median probability score P(B) is derived from the point classes within each voxel. The occupancy probability P(A) and the median class probability P(B) are two dependent events, for which the existence probability score Pex is calculated as Pex(A ∩ B) = P(A) · P(B|A). Voxels are deemed static if the existence probability score Pex is greater than or equal to the static threshold probability, Pex ≥ Pstatic; otherwise, voxels represent the dynamic state. Points within dynamic voxels are relabeled to the other class and are back-projected onto the input point cloud. Static voxels with semantics are projected onto the façade, forming the points comparison texture map layer with labels corresponding to the classes, as shown in the example of windows (orange) in Figure 8. As in the model comparison layer (Section 3.4), the cell spacing of the texture map follows the projection of the voxel grid onto the plane.
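The fusion step itself amounts to one multiplication and a threshold per voxel; a minimal sketch, assuming the static threshold value used later in the parameter settings:

```python
import numpy as np

P_STATIC = 0.7  # static threshold probability (assumed value, cf. parameter settings)

def fuse_voxel(occupancy_p, class_probs):
    """Combine the voxel occupancy P(A) with the median per-point class
    probability P(B|A) into the existence score Pex and decide static/dynamic."""
    p_ex = occupancy_p * float(np.median(class_probs))
    return ("static" if p_ex >= P_STATIC else "dynamic"), p_ex

# A well-observed voxel containing confidently classified window points
state, p_ex = fuse_voxel(0.95, [0.90, 0.85, 0.92])   # -> ("static", ~0.86)
```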

Probabilistic classification: the Bayesian approach
Model comparison and points comparison textures are utilized to identify façade openings using a Bayesian network (BayNet). The network estimations are also back-projected onto semantic point clouds to enhance their segmentation accuracy.
As shown in Figure 9, the designed BayNet comprises one target (red), two input (yellow), one decision (blue), and two output nodes (green). Each directed link represents a causal relationship between the X and Y nodes. The conditional probability table (CPT) prescribes weights for each state and node combination (gray). The target opening state is calculated using the joint probability distribution P(X, Y) and the CPT. The marginalization process is used to calculate the probability of the target node Y being in the opening state y. The process sums conditional probabilities of the states x stemming from the parent nodes X (Stritih et al., 2020). Since the network consists of texture layers with state probabilities, the data evidence represents so-called soft evidence (Stritih et al., 2020). In the inference process, soft evidence is added to update the joint probability distribution. This process provides the most likely node states by estimating the posterior probability distribution (PPD).
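For a single texture pixel, the marginalization with soft evidence boils down to a weighted sum over the parent states; the sketch below uses hypothetical state names and CPT weights purely for illustration, not the values of the actual network.

```python
# P(opening | model comparison state, points comparison state); illustrative weights only
CPT = {
    ("conflicted", "opening"): 0.95,
    ("conflicted", "other"):   0.40,
    ("confirmed",  "opening"): 0.20,
    ("confirmed",  "other"):   0.05,
}

def p_opening(model_soft, points_soft):
    """Marginalize the target node over both parents given soft evidence,
    i.e. probability distributions over the parent states."""
    return sum(CPT[(m, p)] * pm * pp
               for m, pm in model_soft.items()
               for p, pp in points_soft.items())

# Soft evidence of one pixel from the two texture layers
p = p_opening({"conflicted": 0.9, "confirmed": 0.1},
              {"opening": 0.8, "other": 0.2})        # ~0.77
```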
Pixel classes from the model comparison and points comparison textures form clusters if they have a neighbor in any of the eight directions of the pixel. Co-occurring conflicted, window, and door cluster classes lead to a high probability of unmodeled openings. This output is used for further 3D modeling of openings and is back-projected onto segmented point clouds as either the window or door class. On the other hand, co-occurring confirmed, window, and door clusters lead to a low probability of existing openings. These clusters are also back-projected to improve the accuracy of semantically segmented point clouds: either as the molding class, if close to an opening, or otherwise as the wall class. The low probability Plow and high probability Phigh labels are assigned to clusters based on the probability threshold Pt: Phigh > Pt ≥ Plow.
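The clustering and thresholding described above can be sketched with standard connected-component labeling; the snippet below assumes a per-pixel opening probability raster and a binary class mask, and is only an illustration of the 8-connectivity rule.

```python
import numpy as np
from scipy import ndimage

P_T = 0.7  # probability threshold separating high and low probability clusters

def label_clusters(opening_probability, class_mask):
    """8-connected clustering of one pixel class and high/low labeling
    of each cluster by its mean opening probability."""
    labels, n = ndimage.label(class_mask, structure=np.ones((3, 3)))
    cluster_labels = {i: ("high" if opening_probability[labels == i].mean() > P_T else "low")
                      for i in range(1, n + 1)}
    return labels, cluster_labels
```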

Openings shape extraction
The high probability clusters Phigh are extracted from a Bayesian probability texture as opening shape candidates. Adding to existing shape indices (Basaraner and Cetinkaya, 2017), we introduce the completeness index rcp, which measures the ratio of the outer shape area to the inner-hole area. Candidates are rejected if their area is smaller than the chosen area threshold bs and if their completeness index score rcp is smaller than the threshold rcp_t.
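One literal reading of the completeness index, computed on a binary cluster mask, is sketched below; the exact definition and threshold handling in the implementation may differ.

```python
import numpy as np
from scipy import ndimage

def completeness_index(cluster_mask):
    """Ratio of the outer shape area to the inner-hole area of a cluster
    (one literal reading of r_cp); clusters without holes are fully complete."""
    filled = ndimage.binary_fill_holes(cluster_mask)
    hole_area = int(filled.sum()) - int(cluster_mask.sum())
    if hole_area == 0:
        return float("inf")
    return float(cluster_mask.sum()) / hole_area
```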

Openings shape generalization
Yet, the extracted candidates can still display distorted, noisy shapes. A morphological opening operation is applied to minimize the effect of spiky and weakly connected contours. Subsequently, these shapes are generalized to minimum bounding boxes, for which a modified rectangularity index (Basaraner and Cetinkaya, 2017) is calculated. The modification considers the relation of the bounding box sides a and b, where outliers are rejected based on the upper PEup and lower PElo percentiles of the index score.
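A compact sketch of the generalization step, using a morphological opening and an axis-aligned bounding box as a stand-in for the minimum bounding box; the side ratio returned here only approximates the modified rectangularity index described above.

```python
import numpy as np
from scipy import ndimage

def generalize_shape(cluster_mask):
    """Denoise a cluster and generalize it to a bounding box with a side-ratio score."""
    opened = ndimage.binary_opening(cluster_mask, structure=np.ones((3, 3)))
    rows, cols = np.nonzero(opened)
    if rows.size == 0:
        return None                              # cluster removed by the opening
    a = rows.max() - rows.min() + 1              # bounding box side a
    b = cols.max() - cols.min() + 1              # bounding box side b
    side_ratio = min(a, b) / max(a, b)           # outliers rejected via percentiles of this score
    return (rows.min(), cols.min(), rows.max(), cols.max()), side_ratio
```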

Model-driven 3D reconstruction
Identified bounding boxes are used as fitting boundaries for window and door 3D models, which are loaded from a predefined library. The opening model's coordinate origin is reset to the bottom-left corner of the model. The offset to global coordinates is calculated between the opening model origin and the bottom-left corner of the respective bounding box. After the shift, the rotation is applied as the difference between the façade's face orientation and the opening model orientation. Aligned 3D models are scaled to fit the bounding box boundaries, as presented in Figure 10 and Figure 11.
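The fitting procedure can be sketched as a rotate-scale-translate sequence on the library model's vertices; axes, rotation convention, and variable names below are assumptions for illustration only.

```python
import numpy as np

def fit_opening_model(model_vertices, bbox_min, bbox_max, facade_yaw):
    """Fit a library opening model (origin at its bottom-left corner) into a
    detected bounding box: rotate about the vertical axis to the facade
    orientation, scale to the box extent, and shift to its bottom-left corner."""
    c, s = np.cos(facade_yaw), np.sin(facade_yaw)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    rotated = model_vertices @ rot_z.T
    extent = rotated.max(axis=0) - rotated.min(axis=0)
    target = np.asarray(bbox_max, float) - np.asarray(bbox_min, float)
    scale = np.where(extent > 0, target / extent, 1.0)   # avoid division by zero
    return (rotated - rotated.min(axis=0)) * scale + np.asarray(bbox_min, float)
```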

Semantic modeling
Since 3D solid libraries of openings are employed for 3D reconstruction, we opt to model them as solid geometries, too, following the CityGML encoding recommendation (Special Interest Group 3D, 2020). Based on the identified opening class, windows and doors are assigned to the respective CityGML Window and Door classes; as such, they link to the building entity (Gröger et al., 2012). The unchanged semantics of input elements is preserved, except for the LoD which is upgraded to LoD3.

Datasets
The method was tested using MLS point clouds and governmental CityGML building models at LoD2 representing the Technical University of Munich (TUM) main campus, Munich, Germany.
The acquired LoD2 building models were created using 2D cadastre footprints and aerial measurements (Roschlaub and Batscheider, 2016). The TUM-MLS-2016 point cloud (Zhu et al., 2020) was transformed into the global coordinate reference system (CRS) and used to perform point cloud ray tracing. The TUM-FAÇADE dataset was deployed for training, as it comprises façade-annotated point clouds (Wysocki et al., 2022b). For computational reasons, we subsampled the original dataset, removing all redundant points within a 5 cm distance. In this way, we compressed an initial dataset of about 118 million points to a still resolute but lightweight version of about 10 million points. The subsampled point cloud was divided into 70% training and 30% validation sets (Figure 12). Additionally, the 17 available classes were consolidated into seven representative façade classes: molding was merged with decoration; wall included drainpipe, outer ceiling surface, and stairs; floor comprised terrain and ground surface; other was merged with interior and roof; blinds were added to window; while door remained intact.

Parameter settings
The uncertainties of the true façade locations were estimated considering the global registration error of MLS point clouds and building models: for point clouds, these were set to e1 = 0.3 m, µ1 = 0.15 m, CL1 = 90%, and z1 = 1.64; for building models, to e2 = 0.03 m, µ2 = 0.015 m, CL2 = 90%, and z2 = 1.64. This yielded the façades' upper CI score of 0.2 m and CL = 90%.
Ray casting was employed on a grid with the voxel size set to vs = 0.1 m, considering the opening size, the point cloud density, and their relative accuracy. The voxels were initialized with a uniform prior probability of P = 0.5. Log-odd values were set to locc = 0.85 for the occupied and lemp = −0.4 for the empty state, corresponding to Pocc = 0.7 and Pemp = 0.4, respectively. Clamping parameters were set to lmin = −2 and lmax = 3.5, corresponding to Pmin = 0.12 and Pmax = 0.97, respectively, following (Tuttas et al., 2015; Hornung et al., 2013); an exemplary implementation is provided in our repository. For the fusion of voxels and points, the static threshold was set to Pstatic = 0.7, while the empty voxels' occupancy probability was fixed to 0.4 for processing acceleration.
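As a quick consistency check of these parameter pairs, the log-odds transform l = ln(P / (1 − P)) reproduces the listed values:

```python
import math

def logodds(p):
    return math.log(p / (1.0 - p))

# P_occ = 0.7 -> ~0.85, P_emp = 0.4 -> ~-0.4, P_min = 0.12 -> ~-2.0, P_max = 0.97 -> ~3.5
for p in (0.7, 0.4, 0.12, 0.97):
    print(f"P = {p:>4} -> log-odds = {logodds(p):+.2f}")
```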
As regards the semantic segmentation procedure, taking into consideration the main characteristics of the buildings and the classes to be detected, and following prior work, we identified 0.8 m as the optimal neighborhood search radius ri for the features roughness, volume density, omnivariance, planarity, and surface variation, and 0.4 m for verticality.
The proposed BayNet has two input soft evidence layers: points comparison and model comparison textures. These had associated confidence levels, which scored 70% and 90% for point and model comparison layers, respectively. The opening state probability was defined by the probability threshold: Pt = 0.7.
The opening candidates' area threshold bs was set to 0.3 m², while the completeness threshold rcp_t was set to 0.1 to suppress noisy, patchy clusters. Over-elongated bounding boxes were suppressed by calculating the modified rectangularity index, where the upper PEup and lower PElo percentiles were set to the 95th and 5th percentile, respectively.

Validation of improved semantic segmentation
Semantic segmentation results were validated on unseen ground-truth point clouds of the TUM-FAÇADE dataset. For evaluation, we use the overall accuracy (OA); the F1 score per class; and the average precision (µP), recall (µR), F1 score (µF1), and intersection over union (µIoU). The arch and column classes were omitted in the validation, since they were absent in the ground-truth building. As shown in Table 1, the Point Transformer (PT) network (Zhao et al., 2021) served as the baseline for the validation. The presented feature-extended version of the PT network (PT+Ft.) served as the input for the proposed conflict classification (CC) method.

Validation of openings reconstruction
Reconstructed openings were validated using manually modeled ground-truth building openings (Table 3). The detection rate (DR) was calculated based on the on-site inspection of all existing façade openings (AO) and the openings measured by the laser scanner (MO) (Table 2). The validation was performed for façades A, B, and C, shown in Figure 10 and Figure 11.

DISCUSSION
Experiments revealed promising results for refining both building models and classified point clouds. They corroborate that the DR depended on the density of measurements per façade: for the densely covered façade A, the method reached a DR of 90% with respect to all openings (DR-AO) and 100% with respect to measured openings (DR-MO); for the highly occluded side façade C, it reached 28% and 50%, respectively (see Table 2 and Figure 15).
The method significantly improves the reconstruction performance in comparison to reconstruction conducted only on segmented point clouds from the baseline PT architecture, as shown in Figure 13. When compared to the ground-truth openings, the proposed reconstruction reached roughly 90% accuracy (Table 3); yet, the method is limited when windows are only partially measured (e.g., blinds in front of windows), as exemplified by several windows in the third row of Figure 13b.
The back-projected, classified conflicts increased the accuracy of semantic point cloud segmentation by approximately 12% (Table 1). Note that the precision and intersection over union scores for CC remained similar to the PT+Ft. scores, while the F1 score for floor dropped by about 6%. Remarkably, the proposed CC method improves the segmentation of the window, door, and other classes by approximately 11%, 12%, and 21%, respectively.

CONCLUSION
Our work has led us to the conclusion that refinement is a promising alternative to a from-scratch reconstruction. The refinement preserves input semantics, minimizes model-specific planarity issues, and enables consistent city model updates. Moreover, existing LoD3 elements can be extracted and directly employed as refinement features for buildings at lower LoDs.
The validation shows that the method reaches a high accuracy of 92% in detecting observable windows and a low false alarm rate of approximately 1%. Refined point clouds also score a low false negative rate, indicated by a high recall score of 79%. This trait of our method could be of particular importance for feature-dependent applications, where robustness is favored over visualization, such as simulations of automated driving functions (Schwab and Kolbe, 2019). On the other hand, façade occlusions and the laser range could limit the method's applicability for visualization-oriented purposes, where further prediction of unseen objects could be employed.
Experiments corroborate that combining visibility analysis with a region-based approach improves segmentation accuracy. In the future, we plan to embed occupancy information directly into the training of the deep neural network. Furthermore, including radiometric point cloud features (e.g., intensity) in training datasets could facilitate detecting windows covered by blinds.
Tested façades presented challenging, varying measuring conditions; for similar façade and opening styles, the method is expected to provide comparable results. Yet, the limited testing sample size implies that caution must be exercised. It is worth noting that adjacent static objects that do not belong to façade elements (e.g., traffic signs, bus shelters) can negatively influence the semantic back-projection results. To further our research, we plan to test the method on a higher number of façades.