3D INDOOR MAPPING WITH THE MICROSOFT HOLOLENS: QUALITATIVE AND QUANTITATIVE EVALUATION BY MEANS OF GEOMETRIC FEATURES

3D indoor mapping and scene understanding have seen tremendous progress in recent years due to the rapid development of sensor systems, reconstruction techniques and semantic segmentation approaches. However, the quality of the acquired data strongly influences the accuracy of both reconstruction and segmentation. In this paper, we direct our attention to the evaluation of the mapping capabilities of the Microsoft HoloLens in comparison to high-quality TLS systems with respect to 3D indoor mapping, feature extraction and semantic segmentation. We demonstrate how a set of rather interpretable low-level geometric features and the resulting semantic segmentation achieved with a Random Forest classifier applied on these features are affected by the quality of the acquired data. The achieved results indicate that, while allowing for a fast acquisition of room geometries, the HoloLens provides data with sufficient accuracy for a wide range of applications.


INTRODUCTION
Rapid 3D mapping and scene understanding for indoor environments have seen tremendous progress in recent years, enabling a rich diversity of applications including scene modeling, navigation and perception assistance, and future use cases like telepresence. Besides 3D reconstruction based on RGB imagery (Stathopoulou et al., 2019), RGB-D data (Zollhöfer et al., 2018) or data acquired via mobile indoor mapping systems (Lehtola et al., 2017; Chen et al., 2018; Nocerino et al., 2017; Masiero et al., 2018), there has also been an increasing interest in augmenting the acquired 3D data with virtual contents or semantics. In this regard, mobile Augmented Reality (AR) devices like the Microsoft HoloLens allow for the in-situ visualization of virtual contents (e.g., Building Information Modelling (BIM) data or information directly derived from the acquired data) which, in turn, facilitates numerous applications addressing facility management, cultural heritage documentation or educational services.
While the HoloLens has recently been evaluated regarding its capabilities as an AR device (Liu et al., 2018) and regarding the spatial stability of holograms (Vassallo et al., 2017), there have also been first investigations on the spatial accuracy of triangle meshes acquired by the HoloLens in comparison to ground truth data acquired with a terrestrial laser scanning system (Khoshelham et al., 2019; Hübner et al., 2019). However, to the best of our knowledge, the impact of the quality of the acquired data on the extraction of geometric features and thus on the results of semantic segmentation (Weinmann, 2016; Poux, Billen, 2019) still remains an open issue, although it has recently been shown that the robustness of such geometric features strongly depends on the quality of the underlying data (Dittrich et al., 2017).
In this paper, we address 3D indoor mapping with the Microsoft HoloLens (Version 1) with a particular focus on a quantitative and qualitative evaluation by means of geometric features. We use a set of rather interpretable low-level geometric 3D and 2D features (Weinmann, 2016; Weinmann et al., 2017), which are extracted from the local neighborhood of each query point and concatenated to define the respective feature vector. The latter, in turn, serves as input to a classifier, for which we use a Random Forest classifier (Breiman, 2001) in the scope of our work. We compare the behavior and expressiveness of the involved features to their counterparts extracted from downsampled TLS data, and we analyze the impact of different feature sets on the classification results.
This paper is organized as follows. We first briefly discuss related work with respect to sensor systems and recent progress in indoor mapping in Section 2. Subsequently, we explain the applied methodology in Section 3. In Section 4, we present and compare the results achieved for an indoor environment that has been acquired with the Microsoft HoloLens and a TLS system in two independent scan campaigns. These results are discussed in Section 5. Finally, a summary and concluding remarks as well as suggestions for future work are provided in Section 6.

RELATED WORK
In the following, we briefly summarize related work with respect to sensor systems (Section 2.1) and recent progress in indoor mapping (Section 2.2).

Sensor Systems
For highly accurate geometry acquisition within indoor environments, Terrestrial Laser Scanning (TLS) systems are typically used. While the quality of a range measurement generally depends on a variety of influencing factors (Weinmann, 2016), remaining errors often tend to be negligible and are mainly caused by either (i) the characteristics of the observed scene in terms of object materials, surface reflectivity, surface roughness, etc. or (ii) the scanning geometry (i.e., the relative distance and orientation of object surfaces with respect to the used scanning device). However, a TLS system is rather expensive, and a single scan may typically not be sufficient to achieve a full coverage of the considered indoor scene. Hence, several scans have to be acquired and transformed into a common coordinate system, as indicated in Figure 1. In this context, different side constraints such as range constraints and/or incidence angle constraints may be taken into account and, if done manually using artificial markers, this process may be laborious and time-consuming.
Figure 1. Raw TLS data (left), TLS data downsampled via a voxel-grid filter using a voxel size of 3 cm (center), and HoloLens data (right).
To facilitate indoor scene acquisition, a diversity of Mobile Laser Scanning (MLS) systems or Mobile Mapping Systems (MMSs) have been presented with only little loss in measurement accuracy. In this regard, commonly used systems are represented by trolley-based systems (e.g., the NavVis mobile mapping system), UAV-based systems (Hillemann et al., 2019), backpack-based systems (Nüchter et al., 2015; Blaser et al., 2018) or hand-held systems (e.g., the Leica BLK2GO). However, such systems tend to be rather expensive due to the involved laser scanning device(s) and/or the involved multi-camera system. Furthermore, trolley-based systems encounter challenges in stairways, while UAV-based systems require an expert to fly the sensor platform and backpack-based systems have a significant weight. Thus, applicability for the end-user is typically reduced.
To address the required expenses, low-cost RGB-D cameras (e.g., the Microsoft Kinect or the Intel RealSense) have been presented which can be used as a hand-held device for scene acquisition. Such RGB-D cameras allow for scene capture with high frame rates and are therefore often suitable for acquiring both static and dynamic scenes. Among a diversity of approaches, KinectFusion (Izadi et al., 2011) and respective improvements (Nießner et al., 2013; Kähler et al., 2016; Dai et al., 2017b; Stotko et al., 2019) have become popular methods for fast scene reconstruction. For a detailed survey on 3D scene acquisition with RGB-D cameras, we refer to Zollhöfer et al. (2018). Due to their low-cost design, however, such systems tend to provide limited accuracy of geometry acquisition. In particular, errors are caused by sensor noise, limited resolution and misalignments due to drift (Zollhöfer et al., 2018).
As a trade-off between accurate scene acquisition and low cost, the Microsoft HoloLens, a mobile, head-worn AR device, has become a popular choice. The HoloLens provides the capability to map its direct environment in real-time in the form of triangle meshes and to simultaneously localize itself within the acquired meshes. The latter is achieved based on four gray-scale tracking cameras, while the 3D mapping relies on a time-of-flight (ToF) range camera operating to distances of up to about 3.5 m. Besides these mapping capabilities, the HoloLens is capable of augmenting the physical environment of the user with virtual content. This AR capability of the device could be used to guide the user by providing information about where to look in order to have a scene coverage as high as possible. This makes the Microsoft HoloLens rather easy to use for non-expert end-users.

Indoor Mapping
Besides geometry acquisition as possible with various sensor systems described in the previous section, indoor mapping may also address further tasks, such as the acquisition of the given room topology as well as Building Information Modeling (BIM) (Tran et al., 2017; Nikoohemat et al., 2019; Ochmann et al., 2019), and semantic segmentation (Armeni et al., 2016; Engelmann et al., 2017; Poux et al., 2018; Poux, Billen, 2019). The latter can be done on point-level (i.e., each point is assigned a class label indicating one of the defined object categories) and on instance-level (i.e., each point is assigned a semantic class label indicating one of the defined object categories and an instance label indicating the respective object in the scene), and by using traditional approaches relying on the use of hand-crafted features or by using modern deep learning techniques.
To foster research on indoor scene reconstruction and understanding, a variety of datasets have been presented:
• The dataset released with the ISPRS Benchmark on Indoor Modelling contains five indoor scenes, each captured with a different sensor.
• The ScanNet dataset (Dai et al., 2017a) represents a large-scale RGB-D video dataset containing more than 1.5k indoor scenes annotated with respect to camera poses, surface reconstructions, and semantic segmentation.
• The Matterport3D dataset (Chang et al., 2017) is a large-scale RGB-D dataset containing 90 indoor scenes and more than 2000 rooms annotated with respect to surface reconstruction, camera poses, and both 2D and 3D semantic segmentations suitable for several scene understanding tasks.
• The House3D dataset (Wu et al., 2018) contains more than 45k human-designed, visually realistic 3D indoor scenes characterized by a diversity of 3D objects, textures and scene layouts.
• The Replica Dataset (Straub et al., 2019) contains high-quality reconstructions of a variety of indoor scenes, whereby the focus was set on obtaining visually, geometrically, and semantically realistic models of the world.
While all these datasets have been created with a focus on the mapping and/or modeling of large-scale indoor scenes, none of them contains data for the same scene acquired with different sensor systems.

METHODOLOGY
For evaluating the influence of the quality of the acquired data on the expressiveness of geometric features and on the accuracy of semantic segmentation, we focus on a traditional workflow.
To describe each 3D point via geometric features, characteristics of the spatial arrangement of neighboring points have to be encoded appropriately. Accordingly, we first need to recover the local neighborhood for each point of the point cloud (Section 3.1) in order to encode the local 3D structure via geometric features (Section 3.2). The derived encoding then serves as input for classification (Section 3.3).

Recovery of Local Neighborhoods
To recover the local neighborhood for each point Xi of the point cloud, we focus on local neighborhoods with a locally-adaptive neighborhood size (Weinmann, 2016). Instead of relying on an identical scale parameter (represented by the number k of nearest neighbors) that needs to be determined once for the considered dataset, this definition relies on the idea that the selection of an optimal neighborhood size parameterized by ki = ki,opt might depend on the local 3D structure and thus, to some degree, the considered classification task. To achieve such a local adaptation, we focus on eigenentropy-based scale selection (Weinmann, 2016) that has proven beneficial compared to dimensionality-based scale selection (Demantké et al., 2011).
The main idea of eigenentropy-based scale selection (Weinmann, 2016) consists in considering different values of the scale parameter to derive multiple neighborhoods for each 3D point and selecting the value of the scale parameter that corresponds to the minimal disorder of 3D points across the considered local neighborhoods. More specifically, for different values of the scale parameter (here: k), the 3D coordinates of a query point Xi = Xi,0 and its k nearest neighbors Xi,j with j = 1, . . . , k are used to calculate the 3D structure tensor Si, which represents a 3D covariance matrix with respect to the local barycenter X̄i:
$$S_i = \frac{1}{k+1} \sum_{j=0}^{k} \left( X_{i,j} - \bar{X}_i \right) \left( X_{i,j} - \bar{X}_i \right)^{T} \tag{1}$$
$$\bar{X}_i = \frac{1}{k+1} \sum_{j=0}^{k} X_{i,j} \tag{2}$$
Since Si is symmetric and positive semi-definite, its three eigenvalues exist, are non-negative and indicate the dispersion magnitude along their corresponding eigenvectors (Dittrich et al., 2017). Normalizing these eigenvalues by their sum yields normalized eigenvalues λi,1, λi,2 and λi,3. Without loss of generality, we assume that λi,1 ≥ λi,2 ≥ λi,3 ≥ 0, and that these normalized eigenvalues can be expressed as a function of the neighborhood size: λi,j = f (k). The optimal neighborhood size ki,opt minimizes the eigenentropy Ei (i.e., a measure for the disorder of neighboring 3D points) according to
$$k_{i,\mathrm{opt}} = \underset{k \in K}{\arg\min} \; E_i(k) \tag{3}$$
In accordance with related work (Demantké et al., 2011; Weinmann, 2016), we consider scale parameters within the interval K = [kmin, kmax]. The lower boundary kmin = 10 ensures that the statistics are derived from a sufficient number of neighboring points, and the upper boundary kmax = 100 limits the computational burden.
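The scale selection described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used in this work: the function names, the brute-force neighbor search and the small constant `eps` that guards the logarithm are assumptions made for the example, and the eigenentropy is computed with the common definition as the Shannon entropy of the normalized eigenvalues.

```python
import numpy as np


def normalized_eigenvalues(points):
    """Eigenvalues of the 3D structure tensor, sorted descending, summing to 1."""
    centered = points - points.mean(axis=0)        # subtract the local barycenter
    tensor = centered.T @ centered / len(points)   # 3x3 covariance matrix
    eigvals = np.linalg.eigvalsh(tensor)[::-1]     # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)          # remove tiny negative round-off
    total = eigvals.sum()
    if total == 0.0:                               # degenerate case: all points coincide
        return np.full(3, 1.0 / 3.0)
    return eigvals / total


def eigenentropy(lams, eps=1e-12):
    """Shannon entropy of the normalized eigenvalues (eps is an assumed guard)."""
    lams = np.clip(lams, eps, None)
    return float(-np.sum(lams * np.log(lams)))


def optimal_neighborhood_size(cloud, i, k_min=10, k_max=100):
    """Return (k_opt, E_min) for query point cloud[i] by testing every k in [k_min, k_max]."""
    dists = np.linalg.norm(cloud - cloud[i], axis=1)
    order = np.argsort(dists)                      # the query point itself comes first
    k_max = min(k_max, len(cloud) - 1)
    best_k, best_e = None, np.inf
    for k in range(k_min, k_max + 1):
        neighborhood = cloud[order[:k + 1]]        # query point plus k nearest neighbors
        e = eigenentropy(normalized_eigenvalues(neighborhood))
        if e < best_e:
            best_k, best_e = k, e
    return best_k, best_e
```

For point clouds of realistic size, the brute-force distance computation would typically be replaced by a spatial index such as a k-d tree; the selection logic itself stays the same.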

Extraction of Geometric Features
To encode characteristics of the spatial arrangement of points within the local neighborhood of a query point Xi, we consider the corresponding 3D structure tensor Si and its normalized eigenvalues λi,1, λi,2 and λi,3. The latter, in turn, are used to derive the dimensionality features of linearity Li, planarity Pi and sphericity Si as well as further eigenvalue-based features represented by omnivariance Oi, anisotropy Ai, eigenentropy Ei and change of curvature Ci (West et al., 2004; Pauly et al., 2003):
$$L_i = \frac{\lambda_{i,1} - \lambda_{i,2}}{\lambda_{i,1}}, \qquad P_i = \frac{\lambda_{i,2} - \lambda_{i,3}}{\lambda_{i,1}}, \qquad S_i = \frac{\lambda_{i,3}}{\lambda_{i,1}}$$
$$O_i = \sqrt[3]{\prod_{j=1}^{3} \lambda_{i,j}}, \qquad A_i = \frac{\lambda_{i,1} - \lambda_{i,3}}{\lambda_{i,1}}$$
$$E_i = -\sum_{j=1}^{3} \lambda_{i,j} \ln \lambda_{i,j}, \qquad C_i = \frac{\lambda_{i,3}}{\lambda_{i,1} + \lambda_{i,2} + \lambda_{i,3}}$$
Furthermore, we follow (Weinmann, 2016) and consider geometric features represented by the absolute height Hi of the query point Xi, the radius Ri of the local neighborhood, the local point density ρi with
$$\rho_i = \frac{k_i + 1}{\frac{4}{3} \pi R_i^3},$$
the verticality Vi with
$$V_i = 1 - n_z,$$
where nz is the vertical component of the local normal vector, and the maximum difference ∆Hi as well as the standard deviation σH,i of the height values of all points within the local neighborhood.
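The 3D features named above can be computed directly from a neighborhood and its normalized eigenvalues. The following sketch assumes the common definitions of these features from the cited literature; the function names, the dictionary keys and the guard `eps` are illustrative. The normal vector is taken as the eigenvector of the smallest eigenvalue, and its sign ambiguity is resolved by using the absolute value of its vertical component, which is an assumption of this example.

```python
import numpy as np


def eigenvalue_features(lams, eps=1e-12):
    """Eigenvalue-based 3D features from normalized eigenvalues sorted descending."""
    l1, l2, l3 = np.clip(lams, eps, None)          # eps avoids division by zero and log(0)
    return {
        "linearity": (l1 - l2) / l1,
        "planarity": (l2 - l3) / l1,
        "sphericity": l3 / l1,
        "omnivariance": (l1 * l2 * l3) ** (1.0 / 3.0),
        "anisotropy": (l1 - l3) / l1,
        "eigenentropy": -(l1 * np.log(l1) + l2 * np.log(l2) + l3 * np.log(l3)),
        "change_of_curvature": l3 / (l1 + l2 + l3),
    }


def other_3d_features(neighborhood):
    """Further 3D features; neighborhood[0] is the query point, the rest its neighbors."""
    query = neighborhood[0]
    radius = np.linalg.norm(neighborhood - query, axis=1).max()
    centered = neighborhood - neighborhood.mean(axis=0)
    _, eigvecs = np.linalg.eigh(centered.T @ centered)
    normal = eigvecs[:, 0]                         # eigenvector of the smallest eigenvalue
    heights = neighborhood[:, 2]
    return {
        "height": query[2],
        "radius": radius,
        # len(neighborhood) equals k + 1; assumes radius > 0 (distinct points)
        "density": len(neighborhood) / (4.0 / 3.0 * np.pi * radius ** 3),
        "verticality": 1.0 - abs(normal[2]),       # abs() is an assumption (sign ambiguity)
        "height_range": heights.max() - heights.min(),
        "height_std": heights.std(),
    }
```

For a horizontal patch such as a floor or ceiling segment, the normal is vertical and the verticality is close to 0, whereas for a wall segment it is close to 1; this is what makes the feature useful for the classes considered later.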
Finally, we take into account that indoor environments reveal many vertical structures. Accordingly, we apply a 2D projection of the point Xi and its ki nearest neighbors onto a horizontal plane (Weinmann, 2016). Based on the 2D coordinates of these projections, we derive the 2D structure tensor in analogy to the 3D structure tensor. The 2D structure tensor, in turn, has two eigenvalues ξ1 and ξ2 with ξ1 ≥ ξ2 ≥ 0 from which we derive their sum Σξ,i as well as their ratio Rξ,i. Besides these eigenvalue-based 2D features, we also consider the radius ri and the point density ζi on the basis of the 2D projections.
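A minimal sketch of the 2D features is given below. The text does not spell out the ratio and the 2D density, so the example assumes the ratio ξ2/ξ1 (a value in [0, 1]) and a density of k + 1 points per area of the enclosing circle; both choices, as well as the function name and dictionary keys, are assumptions made for illustration.

```python
import numpy as np


def features_2d(neighborhood):
    """2D features of a neighborhood projected onto the horizontal (x, y) plane.

    neighborhood[0] is the query point, followed by its nearest neighbors.
    """
    xy = neighborhood[:, :2]                       # drop the height coordinate
    centered = xy - xy.mean(axis=0)
    tensor = centered.T @ centered / len(xy)       # 2x2 structure tensor
    xi_2, xi_1 = np.linalg.eigvalsh(tensor)        # ascending order: smaller one first
    xi_2 = max(xi_2, 0.0)                          # remove tiny negative round-off
    radius = np.linalg.norm(xy - xy[0], axis=1).max()
    return {
        "eigenvalue_sum": xi_1 + xi_2,
        "eigenvalue_ratio": xi_2 / xi_1 if xi_1 > 0.0 else 0.0,   # assumed: xi_2 / xi_1
        "radius_2d": radius,
        # assumed: points per circle area; undefined if all projections coincide
        "density_2d": len(xy) / (np.pi * radius ** 2) if radius > 0.0 else np.inf,
    }
```

With this definition, the projection of a wall patch collapses to a line and yields a ratio near 0, while the projection of a floor or ceiling patch remains two-dimensional and yields a clearly larger ratio.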
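For our framework, we consider different feature sets as input for a subsequent classification. These feature sets comprise the set S3D,EV including all eigenvalue-based 3D features, the set S3D,other including all other geometric 3D features, the set S3D,all including all 3D features, the set S2D,all including all 2D features, and the set Sall including all 3D and 2D features:
$$S_{\mathrm{3D,EV}} = \{L_i, P_i, S_i, O_i, A_i, E_i, C_i\}$$
$$S_{\mathrm{3D,other}} = \{H_i, R_i, \rho_i, V_i, \Delta H_i, \sigma_{H,i}\}$$
$$S_{\mathrm{3D,all}} = S_{\mathrm{3D,EV}} \cup S_{\mathrm{3D,other}}$$
$$S_{\mathrm{2D,all}} = \{r_i, \zeta_i, \Sigma_{\xi,i}, R_{\xi,i}\}$$
$$S_{\mathrm{all}} = S_{\mathrm{3D,all}} \cup S_{\mathrm{2D,all}}$$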

Supervised Classification
To assign an appropriate class label to a query point, we only consider the corresponding feature vector resulting from the concatenation of all extracted features in the scope of this paper. Smooth labeling techniques (Schindler, 2012) and structured regularization techniques (Landrieu et al., 2017) are left as subject of future work.
Given a set of representative training data, we focus on supervised classification based on a Random Forest classifier (Breiman, 2001). This classifier is a representative of discriminative classification approaches searching for the best separation of data points, independent of underlying probability density functions. More specifically, a Random Forest classifier is generated via a strategic combination of a set of weak learners represented by decision trees. These decision trees are trained on different, randomly chosen subsets of the given training data (Breiman, 1996). When training a single decision tree, the focus is set on a successive splitting of the data into smaller subsets based on specific homogeneity criteria until the resulting subset at a leaf node is as pure as possible. Since all decision trees are trained on independent, randomly different subsets of the given training data, their hypotheses for new unseen data to be classified can be considered as de-correlated. Thus, taking the majority vote across all these hypotheses represents a reasonable class prediction with improved generalization and robustness (Criminisi, Shotton, 2013).
To select the internal settings of the Random Forest (e.g., the number of involved decision trees), we perform a grid search on a suitable raster during the training process. The hypotheses of the involved decision trees also allow interpreting the output of the Random Forest as a soft assignment indicating the probabilities with which a query point Xi is associated to each of the given classes. Such a soft assignment thus represents a measure of confidence with respect to the assigned class label.
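The majority vote and the soft assignment described above can be illustrated independently of any particular Random Forest library. The sketch below assumes that the per-tree hypotheses are already available as an integer array of class indices; the function name and the array layout are hypothetical and only serve the example.

```python
import numpy as np


def aggregate_tree_votes(tree_predictions, n_classes):
    """Combine per-tree class hypotheses into hard labels and soft assignments.

    tree_predictions: integer array of shape (n_trees, n_points) holding the
    class index predicted by each decision tree for each query point.
    """
    tree_predictions = np.asarray(tree_predictions)
    # votes[p, c] = number of trees that voted for class c at point p
    votes = np.stack(
        [(tree_predictions == c).sum(axis=0) for c in range(n_classes)], axis=1
    )
    probabilities = votes / tree_predictions.shape[0]   # fraction of trees per class
    labels = probabilities.argmax(axis=1)               # majority vote
    return labels, probabilities
```

As a usage example, three trees voting `[[0, 1, 2], [0, 1, 1], [0, 2, 1]]` for three points give the labels 0, 1 and 1, and the second point receives the soft assignment (0, 2/3, 1/3), which expresses a lower confidence than the unanimous vote for the first point.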

EXPERIMENTAL RESULTS
In the following, we first focus on the two involved sensor systems represented by the Microsoft HoloLens and a Leica HDS6000 (Section 4.1), and we then describe the acquired datasets (Section 4.2). Subsequently, we demonstrate the impact of the quality of the acquired data on the extraction of geometric features (Section 4.3). Finally, we present the classification results achieved when using different feature sets as input for classification (Section 4.4).

Microsoft HoloLens vs. Leica HDS6000
The Microsoft HoloLens is equipped with a variety of sensors. Among these sensors, a video camera is used to allow recording screenshot videos and pictures, in which the physical environment can be augmented with virtual contents. In addition, there are four gray-scale tracking cameras for a robust self-localization. Two of these are oriented to the front in a stereo configuration with large overlap, while the other two are oriented to the right and left with nearly no overlap to the center pair. Furthermore, the HoloLens contains a time-of-flight (ToF) depth sensing camera providing images with pixel-wise range measurements, whereby range images can be queried in two different modes for the range from 0 m to 0.8 m ("short throw" mode) and the range from 0.8 m to about 3.5 m ("long throw" mode). The respective field-of-view of these sensors is illustrated in Figure 2. More detailed specifications can be accessed via the Microsoft Windows 10 SDK for the device.
The Leica HDS6000 is a standard phase-based terrestrial laser scanner with survey-grade accuracy (within a few mm range) and a field-of-view of 360° × 155°. To obtain complete scene coverage, several scans have to be taken from different positions and, as the data acquired with each scan refers to the local coordinate system of the scanner, all acquired scans have to be transferred into a common reference coordinate system. This process is referred to as point cloud registration.
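When corresponding marker positions are known in two coordinate systems, the rigid transformation between them can be estimated in closed form. The sketch below uses the SVD-based least-squares solution (often called the Kabsch algorithm) as one possible way to do this; it is an assumption made for illustration and not necessarily the procedure applied to the scans in this work, and the function name is hypothetical.

```python
import numpy as np


def estimate_rigid_transform(source, target):
    """Least-squares rotation R and translation t with target ≈ source @ R.T + t.

    source, target: arrays of shape (n, 3) with corresponding marker positions,
    n >= 3 and not all markers collinear.
    """
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)
    # 3x3 cross-covariance of the centered correspondences
    H = (source - src_centroid).T @ (target - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Correct a possible reflection so that det(R) = +1
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t
```

A scan would then be transferred into the reference coordinate system with `scan @ R.T + t`.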

Datasets
The considered scene is represented by an empty apartment consisting of five rooms of different size and one central hallway as shown in Figure 1. For HoloLens-based scene acquisition, an operator wearing the device went through the apartment and achieved a rapid and comfortable mapping of the indoor scene within a few minutes. To create the triangle mesh, the commercially available SpaceCatcher HoloLens App was used, since this allowed directly visualizing the triangle meshes for the operator while they were recorded. The resulting mesh is visualized in the right part of Figure 1 and contains 105,200 points.
For TLS-based scene acquisition, 11 scans were taken from the positions indicated with a circle in the left part of Figure 1 and registered by using artificial planar and spherical markers placed in the apartment to establish correspondence. Subsequently, the complete point cloud was manually cleaned, downsampled via a voxel-grid filter using a voxel size of 3 cm, and finally meshed via Poisson Surface Reconstruction (Kazhdan et al., 2006). The resulting mesh is visualized in the center part of Figure 1 and contains 178,322 points.
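A voxel-grid filter as mentioned above can be sketched as follows. The example assumes that each occupied voxel is represented by the centroid of the points falling into it, which is a common convention but is not stated in the text; the function name is illustrative.

```python
import numpy as np


def voxel_grid_downsample(points, voxel_size=0.03):
    """Replace all points inside each occupied voxel by their centroid.

    points: array of shape (n, 3) in meters; voxel_size: edge length in meters
    (0.03 corresponds to the 3 cm used for the TLS data).
    """
    voxel_index = np.floor(points / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(
        voxel_index, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)                  # shape differs across NumPy versions
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)               # accumulate coordinates per voxel
    return sums / counts[:, None]                  # centroid of each occupied voxel
```

The output contains one point per occupied voxel, so the point density becomes approximately uniform, which makes the downsampled TLS data more comparable to the HoloLens mesh vertices.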

Feature Extraction Results
We use eigenentropy-based scale selection to derive locally-adaptive neighborhoods (i.e., local neighborhoods whose size is optimized for each query point individually; see Section 3.1). Based on these neighborhoods, geometric features are extracted (see Section 3.2). The behavior of the neighborhood size and the different features across the complete mesh is visualized in Figures 3 and 4 for the HoloLens dataset and the TLS dataset, respectively.

Classification Results
For classification, we focus on a rather simple scenario with the three classes "Ceiling", "Floor" and "Wall" in the scope of this work. The ground truth labeling obtained via manual annotation is visualized in Figures 3 and 4 for the HoloLens dataset and the downsampled TLS dataset, respectively.
For training, we take into account that an imbalanced amount of training examples across different classes may have a detrimental effect on the generalization capability of the classifier. Hence, we randomly select 1000 points per class for training and all remaining points for performance evaluation. The latter is carried out based on commonly used evaluation metrics: Overall Accuracy (OA), κ-Index and class-wise F1-scores.
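The class-balanced split and the evaluation metrics can be sketched as follows. The function names, the fixed random seed and the handling of the degenerate κ case are assumptions made for this example; the metrics follow their standard definitions based on the confusion matrix.

```python
import numpy as np


def balanced_split(labels, per_class=1000, seed=0):
    """Draw per_class training indices for every class; the rest is used for testing.

    Assumes that every class contains at least per_class points.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
        for c in np.unique(labels)
    ])
    test = np.setdiff1d(np.arange(len(labels)), train)
    return train, test


def evaluate(y_true, y_pred, n_classes):
    """Overall accuracy, kappa index and class-wise F1-scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)             # rows: reference, columns: prediction
    total = cm.sum()
    oa = np.trace(cm) / total
    # Chance agreement derived from the row and column marginals
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected) if expected < 1.0 else 0.0
    tp = np.diag(cm)
    denom = cm.sum(axis=0) + cm.sum(axis=1)        # equals 2*TP + FP + FN per class
    f1 = 2.0 * tp / np.maximum(denom, 1)
    return oa, kappa, f1
```

Because the evaluation set keeps the original class proportions, OA can be dominated by the largest class; κ and the class-wise F1-scores complement it by accounting for chance agreement and per-class behavior.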
The classification results achieved when using different feature sets as input for the classifier are provided in Tables 1 and 2 for the HoloLens dataset and the downsampled TLS dataset, respectively. Visualizations corresponding to these results are provided in Figure 5.
Table 1. Results (in %) achieved for the classification of the HoloLens dataset when using different feature sets as input for the classifier: Overall Accuracy (OA), κ-Index and class-wise F1-scores (C: Ceiling; F: Floor; W: Wall).
Table 2. Results (in %) achieved for the classification of the downsampled TLS dataset when using different feature sets as input for the classifier: Overall Accuracy (OA), κ-Index and class-wise F1-scores (C: Ceiling; F: Floor; W: Wall).

DISCUSSION
The provided visualizations reveal that the accuracy of the acquired HoloLens dataset is lower than the accuracy of the downsampled TLS dataset. However, the HoloLens allows for a fast acquisition of the room geometry, and the accuracy is still sufficient as initialization to apply voxel representations or plane fitting techniques for creating a 3D model of the indoor scene. The accuracy might also still be sufficient for a quick estimate of the area and volume of the apartment, two cues that are important when calculating the rent for or the costs of the apartment.
The visualizations in Figures 3 and 4 indicate the flexibility of the neighborhood size varying between 10 and 100 nearest neighbors for the query points. Furthermore, they allow for reasoning about expressive features (e.g., the height Hi, the verticality Vi, or the ratio Rξ,i of the eigenvalues of the 2D structure tensor) and less-expressive features (e.g., the radii Ri and ri of the local neighborhood in 3D and 2D, the local point densities ρi and ζi in 3D and 2D, or the sum Σξ,i of the eigenvalues of the 2D structure tensor) with respect to the considered classification task. Of course, some features might be less suitable if there are more classes with a higher similarity or more complex indoor scenes (e.g., scenes covering different floors and/or also containing room inventory).
Among the feature sets, the set S3D,EV including all eigenvalue-based 3D features is not suitable to achieve appropriate classification results (Figures 3 and 4 and Tables 1 and 2). The reason for this is that the respective features describe local characteristics around the query point with respect to the principal axes of the 3D ellipsoid spanned by the neighboring points, while the absolute orientation with respect to horizontal and vertical directions is not taken into account. Likewise, the set S2D,all including all 2D features does not allow appropriately separating the defined classes, since the 2D features ri, ζi and Σξ,i are less expressive and only Rξ,i is expressive. The latter allows separating vertical from horizontal planes, which is not sufficient for separating the defined classes. The other feature sets S3D,other, S3D,all and Sall lead to classification results of almost the same quality, since they have several features in common that are highly relevant for the considered classification task (e.g., the height Hi, the verticality Vi, and the maximum difference ∆Hi and standard deviation σH,i of height values).
A comparison of the classification results achieved for the HoloLens dataset and for the downsampled TLS dataset (Tables 1 and 2 and Figure 5) reveals a decrease in OA when using the HoloLens for data acquisition. This decrease in OA is about 5 % when considering the more suitable feature sets S3D,other, S3D,all and Sall.

CONCLUSIONS
In this paper, we have focused on rapid 3D mapping and scene understanding for indoor environments. In particular, we have addressed 3D indoor mapping with the Microsoft HoloLens with a particular focus on a quantitative and qualitative evaluation by means of geometric features. Considering an indoor scene acquired with either a HoloLens or a TLS system (Leica HDS6000), we have extracted a set of rather interpretable low-level geometric 3D and 2D features and provided these features as input for a Random Forest classifier. We have analyzed the impact of the quality of the acquired point cloud data on the behavior and expressiveness of the interpretable geometric features and on the classification with respect to three classes ("Ceiling", "Floor" and "Wall"). Furthermore, we have evaluated the impact of different feature sets on the classification results.
In future work, we plan to increase the number of considered classes and the complexity of the considered scene (e.g., by considering indoor scenes covering different floors and also containing room inventory). Furthermore, we aim at guiding the user during the acquisition regarding scene completion and densification of sparsely reconstructed areas.