AUTOMATIC TRAINING DATA GENERATION IN DEEP LEARNING-AIDED SEMANTIC SEGMENTATION OF HERITAGE BUILDINGS

In the geomatics domain, the use of deep learning, a subset of machine learning, is becoming increasingly widespread. In this context, the 3D semantic segmentation of heritage point clouds presents an interesting and promising approach for modelling automation, in light of the heterogeneous nature of historical building styles and features. However, this heterogeneity also presents an obstacle in terms of generating the training data for use in deep learning, hitherto performed largely manually. The current generally low availability of labelled data also presents a motivation to aid the process of training data generation. In this paper, we propose the use of approaches based on geometric rules to automate this task to a certain degree. One object class will be discussed in this paper, namely the pillars class. Results show that the approach managed to extract pillars with satisfactory quality (98.5% of correctly detected pillars with the proposed algorithm). Tests were also performed to use the outputs in a deep learning segmentation setting, with a favourable outcome in terms of reducing the overall labelling time (-66.5%). Certain particularities were nevertheless observed, which also influence the result of the deep learning segmentation.


INTRODUCTION
As the documentation of heritage objects is undertaken more and more in 3D, point cloud data has become ubiquitous in the heritage community. With the advent of laser scanners and advanced photogrammetric processing, the documentation process is becoming more and more streamlined. The next issue of interest in the point cloud processing community is how to annotate the geometric point cloud with the addition of semantic attributes. This is required when the point cloud is needed for analysis, modelling, and predictions. A semantically annotated point cloud can thereafter be used to create information-rich 3D GIS and/or HBIM (Heritage Building Information Models) (Campanaro et al., 2016).
Machine learning (ML), and more precisely deep learning (DL), has witnessed a surge in overall interest in this age of big data (Bello et al., 2020). The possibility of using large quantities of data to train the computer to perform semantic segmentation automatically is indeed a very interesting concept and currently a promising field of research, as it provides a robust segmentation result with quick processing time. However, the main bottleneck in implementing DL techniques is the availability of labelled datasets (Maalek et al., 2019). In the case of heritage point clouds, this problem is exacerbated by the diversity of classes and architectural features, as well as the general lack of labelled datasets. As such, the usual way to obtain training data is by manual annotation.
Considering these issues, a combination of more traditional segmentation based on geometric axioms and DL techniques will be presented in this paper. In particular, deep learning techniques will allow validating objects segmented by traditional methods. The algorithmic segmentation uses the functions in the Matlab toolbox M_HERACLES to generate training data for the DL technique. The toolbox is a set of Matlab functions (https://github.com/murtiad/M_HERACLES, accessed 4 October 2021) specifically developed to perform semantic segmentation on heritage objects (Murtiyoso and Grussenmeyer, 2020).
While the geometric approach may be used to directly generate segmented point cloud, many limitations are inherent in this method. Indeed, this type of approach mainly uses case-specific hard-coded prior knowledge and geometric rules in each function, therefore limiting its use for a holistic semantic segmentation. This rigidity is however counterbalanced by rapid processing time and a straightforward and open nature as opposed to the more closed system of DL techniques. As such, we postulate that both geometric rules-based and DL methods have their own advantages and disadvantages which may play well to support each other.
The main idea proposed in this paper is therefore to use the two semantic segmentation approaches, namely the geometric rules-based and DL methods, to complement each other. Our proposed method tries to automate as much as possible the semantic segmentation workflow in the case of heritage buildings, from the generation of DL training data up to the use of DL itself for the abovementioned task. In this case, the DL framework is developed by testing the data on a specific neural network, namely the DGCNN (Dynamic Graph Convolutional Neural Network) (Wang et al., 2019), specially modified for semantic segmentation in the field of cultural heritage (Matrone et al., 2020a; Pierdicca et al., 2020). M_HERACLES will be used to detect objects from the 3D scene (the specific class of pillars was chosen for this study). Its results will then be compared against manually labelled data. Both the manual and automatic (i.e., resulting from M_HERACLES) results will then be used as input training data for our DGCNN framework. The results will then be compared in terms of processing time and quality. The paper does not aim to improve on the results of the neural network; rather, the main objective is to determine whether the use of geometric rules-aided automatic training data may help achieve results similar to those of manual annotation, with an expected gain in overall labelling time.

STATE OF THE ART
Semantic segmentation as a research topic arises as a logical consequence of the use of point clouds as 3D archives. The unorganised nature possessed by point clouds by default requires classification in order to permit a better understanding of the scene (Poux et al., 2016). In the architecture and civil engineering (ACE) domain, this has evolved into the need for a "scan-to-BIM" process with Building Information Models (BIM) as the final product (Macher et al., 2017; Xiong et al., 2013). With regard to heritage point clouds, semantic segmentation enables the otherwise purely geometric data to receive tangible semantic information (Murtiyoso and Grussenmeyer, 2020) and thus may ultimately aid the creation of Heritage Building Information Models (HBIM) (Chiabrando et al., 2016).
Various approaches to point cloud processing may be taken to perform this task. Grilli et al. (2017) categorised these approaches into region growing methods (Bassier et al., 2017a), edge-based reconstruction (Boulaassal et al., 2007), model fitting (Sanchez and Zakhor, 2012), machine learning-based methods (Bassier et al., 2017b), and hybrid approaches. In an article by Nguyen and Le (2013), a dual distinction was made between segmentation by machine learning and segmentation by the use of geometric axioms, also called "constraints" or hard-coded knowledge. Furthermore, Bassier et al. (2017b) refer to the latter as the heuristic approach in contrast to machine learning; in essence, an algorithmic approach to the problem of 3D point classification. With the advent of big data, DL as a part of the machine learning approach is today considered another viable approach (Matrone et al., 2020a). The two methodologies encountered in the literature for the semantic segmentation of point clouds are therefore deep learning and rules-based approaches.

Deep learning for point cloud segmentation
Three-dimensional point clouds are currently used in many applications thanks to the recent wide availability of 3D scanners and reconstruction techniques such as lidar, Structure-from-Motion (SfM) and other sensors like Kinect and Xtion. Point clouds are unstructured sets of 3D vectors that store 3D coordinates and other characteristics such as reflection, colour and normals. One of the pioneering studies that used deep learning to process raw point clouds is presented in Qi et al. (2017a), where the PointNet architecture is used to process point clouds for semantic segmentation and classification. One limitation of this approach is its inability to extract the local features of point clouds, learning only global features through the max-pooling layer. To overcome this problem, the PointNet++ architecture (Qi et al., 2017b) was developed to allow the encoding of local features by dividing the point cloud locally. The authors of the architecture essentially propose hierarchical feature learning for object classification and semantic segmentation of 3D point clouds.
A recent approach inspired by PointNet and PointNet++ is the Dynamic Graph Convolutional Neural Networks (DGCNN) (Wang et al., 2019). Unlike PointNet++, DGCNN extracts local features by using an EdgeConv layer. DGCNN builds a neighbourhood graph that allows exploiting the local geometric structures, defining a link between the central point chosen and the edge vector connecting its neighbours to itself. The main advantage is that DGCNN presents a robust result to the variation of the input to obtain satisfactory results in both indoor and outdoor scenes.
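The neighbourhood graph described above can be illustrated with a minimal sketch. It assumes a brute-force k-nearest-neighbour search and the edge feature formed by concatenating the central point x_i with the offset x_j - x_i to each neighbour; the function name `knn_edge_features` is hypothetical, the central point is excluded from its own neighbourhood by assumption, and the learned layers of the real network are omitted entirely.

```python
import math


def knn_edge_features(points, k):
    """Return, for every point x_i, k edge features [x_i, x_j - x_i].

    points: list of equal-length coordinate tuples.
    The result is a list with one entry per point; each entry is a list of
    k tuples, ordered from the nearest neighbour to the farthest.
    """
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    features = []
    for i, centre in enumerate(points):
        # Rank all other points by Euclidean distance to the central point.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(centre, points[j]),
        )
        edges = []
        for j in others[:k]:
            # Edge vector from the central point to its neighbour.
            offset = tuple(b - a for a, b in zip(centre, points[j]))
            edges.append(tuple(centre) + offset)
        features.append(edges)
    return features
```

In a full network these edge features would be passed through a shared learned function and aggregated per point; the brute-force search shown here is only practical for small illustrative inputs.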
In the Digital Cultural Heritage (DCH) domain, the work of Pierdicca et al. (2020) attempted to exploit an improved DGCNN architecture in order to semantically segment 3D point clouds, which is both interesting and useful for the automatic interpretation of architectural objects. Furthermore, the authors proposed a novel approach using additional point cloud features. A completely new dataset involving both indoor and outdoor scenes was used, belonging to different historical periods and different styles. The dataset has been manually labelled by experts to increase its level of trustworthiness.
However, creating a huge amount of labelled point clouds through manual annotation is very time-consuming and often impractical. This issue is further exacerbated by the complex variations in architectural styles and details in the case of heritage sites. A possible solution to this problem is the creation of synthetic training datasets, even though this procedure is less common in the DCH domain. Pellis et al. (2021) also proposed reprojection of 3D point cloud labelling into the 2D space of photogrammetric images in order to augment 2D semantic segmentation training data in a DCH context. Other recent approaches for the generation of new data include techniques based on generative models, in particular generative adversarial networks (GAN) (Goodfellow et al., 2016).
In this work the DGCNN-Mod network will be used to validate the scenes segmented by the rules-based approach provided by M_HERACLES. This particular network was chosen primarily because it was recently implemented in the DCH domain and showed promising results. Furthermore, as an initial experiment and proof of concept this paper will focus mainly on the pillars class in some cultural heritage scenes.

Geometric rules-based approach to the problem of point classification
The rules-based approach, as has been mentioned previously, employs geometric rules and constraints to detect and classify certain architectural features from the point cloud scene. The rules are generally prior knowledge hard-coded into the algorithm (Maalek et al., 2019) and may range from simple axioms (Macher et al., 2016; Murtiyoso and Grussenmeyer, 2019a) to more complex ontological networks (Poux et al., 2018). This heuristic approach can be rapidly employed without the need for training and is thus adaptable for simpler cases (Nguyen and Le, 2013); however, it may suffer from a higher noise rate and rigidity, which limits its use when compared to other approaches based on ML/DL. Several types of geometric axioms can be used to detect very specific object classes. For example, in Riveiro et al. (2016) the authors detected planar surfaces for wall detection in an outdoor setting. In Rodríguez-Cuenca et al. (2015), the authors attempted to detect pole-like objects from an unorganised point cloud. Poux et al. (2017) took advantage of other point cloud features to perform segmentation, while in Poux et al. (2018) and Drap et al. (2017) more complex ontological relations were designed. Murtiyoso and Grussenmeyer (2019b) used pre-existing GIS layers to perform a similar task in a smaller-scale data setup.
In this paper, a part of the toolbox M_HERACLES was used to perform semantic segmentation for the class of pillars, i.e., structural supports. Structural supports such as columns are of particular interest to the heritage community, as they often present a valuable example of historical engineering and architectural design. Several studies have been conducted in the field of automatic 3D modelling of structural supports (Maalek et al., 2019), but most focus on simple pillars or supports. In this regard, automation for heritage-related structural supports remains difficult due to the many different types linked to the architectural style. M_HERACLES was specifically conceived to tackle this problem using rules-based approaches (Murtiyoso and Grussenmeyer, 2020).
As has been previously established, this study will be focused on the semantic segmentation of pillars. M_HERACLES will therefore be used to detect and classify pillars from the available datasets. These automatically segmented scenes will be used thereafter to train the DGCNN-Mod for the semantic segmentation task and the results will be compared against those obtained from manual labelling.

METHODOLOGY
The tests were conducted on a small, labelled dataset of point clouds of cultural heritage to quantitatively compare the labelling time and the consequent results of the neural network with the two different inputs, namely the dataset manually annotated and the one automatically segmented via M_HERACLES. The toolbox performs segmentation in the case of pillars using geometrical rules, namely the circularity of the cross-sections. Moreover, the scenes of the dataset were divided into training, validation, and test. More specifically, we want to understand the behaviour of the neural network with respect to the two test scenes with completely different architectural styles, namely European and Southeast Asian.
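To make the idea of a circularity rule on cross-sections concrete, the following is a minimal sketch and not the M_HERACLES implementation, whose actual criterion and thresholds are not given here. It assumes that a horizontal slice of a candidate pillar has already been projected to 2D, fits a circle by algebraic least squares, and accepts the slice when the RMS radial deviation stays below a hypothetical relative tolerance of 5%. All function names and the tolerance are illustrative assumptions.

```python
import math


def fit_circle(points):
    """Algebraic least-squares circle fit to 2D points.

    Returns (centre_x, centre_y, radius). Raises ValueError when there are
    fewer than three points or when they are collinear or coincident.
    """
    n = len(points)
    if n < 3:
        raise ValueError("at least three points are required")
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Centroid-centred coordinates reduce the fit to a 2x2 linear system.
    uv = [(p[0] - mean_x, p[1] - mean_y) for p in points]
    suu = sum(u * u for u, v in uv)
    svv = sum(v * v for u, v in uv)
    suv = sum(u * v for u, v in uv)
    suuu = sum(u ** 3 for u, v in uv)
    svvv = sum(v ** 3 for u, v in uv)
    suvv = sum(u * v * v for u, v in uv)
    svuu = sum(v * u * u for u, v in uv)
    det = suu * svv - suv * suv
    if abs(det) < 1e-12:
        raise ValueError("points are collinear or coincident")
    # Solve the normal equations for the centre offset by Cramer's rule.
    rhs_u = 0.5 * (suuu + suvv)
    rhs_v = 0.5 * (svvv + svuu)
    uc = (rhs_u * svv - rhs_v * suv) / det
    vc = (rhs_v * suu - rhs_u * suv) / det
    radius = math.sqrt(uc * uc + vc * vc + (suu + svv) / n)
    return mean_x + uc, mean_y + vc, radius


def circularity_error(points):
    """RMS radial deviation from the fitted circle, relative to its radius."""
    cx, cy, radius = fit_circle(points)
    squared = [(math.hypot(p[0] - cx, p[1] - cy) - radius) ** 2 for p in points]
    return math.sqrt(sum(squared) / len(points)) / radius


def is_circular(points, tolerance=0.05):
    """Hypothetical acceptance rule: True when the slice is close to a circle."""
    try:
        return circularity_error(points) <= tolerance
    except ValueError:
        return False
```

A rule of this kind is fast and transparent, but it also illustrates the rigidity of hard-coded criteria: a square pier or an engaged column that exposes only part of its section may fail the test depending on the chosen tolerance.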
Among the five selected case studies (Figure 1), three are in the European architectural style: the "Sala delle Colonne" of Valentino Palace in Turin, Italy ("Valentino") and two point clouds from the Sacro Monte di Varallo site ("Ghiffa" and "Pilato"). These sites are all included in the UNESCO World Heritage List. Meanwhile, the Southeast Asian dataset comprises two pavilions ("Kasepuhan_1" and "Kasepuhan_2") from the Kasepuhan Palace, a 15th century royal complex in Cirebon, Indonesia. All of these point clouds are part of the public ArCH dataset (Matrone et al., 2020b) (http://archdataset.polito.it/, accessed 5 October 2021).
Two distinct experiments were performed for the purposes of this paper. In the first experiment, the automatic point cloud labelling using the M_HERACLES toolbox will be discussed. In order to quantitatively analyse the results, the results were compared against manually labelled data to assess the proposed rules-based method's performance in terms of statistical qualities and processing time. The main objective of this first experiment is to test M_HERACLES's reliability for automatic point cloud labelling.
The second experiment is the more elaborate of the two. In this experiment, the results of M_HERACLES were fed into a DL network as a real application, in order to test their reliability for DL training data generation. These results were then compared to the ones acquired from the manual labelling process as a point of reference. To ensure repeatability, the second experiment is further divided into two scenarios in which different datasets were used for the training, validation, and testing phases. The configurations of the datasets used in these two scenarios are described in Figure 2.
Scenes containing an adequate number of columns with different characteristics and dimensions were chosen for the training of the network. For the test, on the other hand, the main criterion was the representativeness of different and diverse architectural styles, to test the generalization of the method. Indeed, Scenario 1 will perform the test on the Southeast Asian data while Scenario 2 will be applied to the European one.
The DL framework used in this experiment is the DGCNN-Mod developed by Pierdicca et al. (2020) with the geometric coordinates (XYZ), radiometric component (RGB) and the normal vectors as inputs. The normal vectors were previously computed using the third-party software CloudCompare. The 3D features have not been added in this case (Matrone et al., 2020a), since the purpose of the tests is to investigate, by way of a comparison, if and how the automatic rules-based annotation affects the final prediction. The DGCNN-Mod training was set with hyperparameters identical to the values tested in Pierdicca et al. (2020), namely block dimensions of 1 x 1 m, 4096 subsampled points for each block and a stride value of 1. Tests with a 2 x 2 m block size, 8192 subsampled points and stride 1 (in order to have overlapping blocks) have been carried out as well, but as no relevant differences emerged, only the former will be shown and discussed in this paper. A workstation with an NVIDIA RTX 2080 Ti 11 GB GPU, 128 GB RAM and an Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz was used for DL-related processing.
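The block-wise input described above can be sketched as follows. This is an illustrative assumption rather than the DGCNN-Mod data loader: it covers only the case where the stride equals the block size (non-overlapping 1 x 1 m blocks), assumes coordinates in metres, and resamples each block to a fixed number of points, drawing with replacement when a block holds fewer points than requested. The function name and the `seed` parameter are hypothetical.

```python
import math
import random
from collections import defaultdict


def split_into_blocks(points, block_size=1.0, num_points=4096, seed=0):
    """Group points into non-overlapping XY blocks and resample each block.

    points: sequence of tuples whose first two entries are X and Y.
    Returns a dict mapping (column, row) block indices to lists that each
    contain exactly num_points points.
    """
    rng = random.Random(seed)
    min_x = min(p[0] for p in points)
    min_y = min(p[1] for p in points)
    blocks = defaultdict(list)
    for p in points:
        # Integer grid cell of the point, measured from the scene corner.
        key = (math.floor((p[0] - min_x) / block_size),
               math.floor((p[1] - min_y) / block_size))
        blocks[key].append(p)
    sampled = {}
    for key, members in blocks.items():
        if len(members) >= num_points:
            # Dense block: subsample without replacement.
            sampled[key] = rng.sample(members, num_points)
        else:
            # Sparse block: pad by drawing with replacement.
            sampled[key] = rng.choices(members, k=num_points)
    return sampled
```

Overlapping blocks, such as the 2 x 2 m blocks with stride 1 mentioned above, would require assigning each point to every block window that contains it, which this sketch does not attempt.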
Finally, in this second experiment for each scenario the training will be performed twice: the first uses manually annotated point clouds while in the second the "column" class of the scene was replaced by the one predicted automatically by M_HERACLES as presented at the end of the first experiment.

RESULTS AND DISCUSSIONS
This section will be divided into two subsections, each describing the findings related to the first and second experiments respectively. The first subsection will describe the results of the automatic segmentation and labelling of pillars in the five available datasets using functions taken from the M_HERACLES toolbox. The second subsection will describe results from the second experiment, namely the use of outcomes from the previous subsection as training data for the developed DL algorithm and the assessment of its performance in comparison to manually labelled training data. Two scenarios will be presented in this second part of the section.

Automatic segmentation and labelling
As far as the rules-based segmentation and labelling is concerned, results show that the algorithm managed to properly detect the correct number of pillars in 4 out of the 5 datasets (Figure 3). In the case of the Pilato dataset, one pillar failed to be detected by M_HERACLES since it was attached to a wall segment. Similarly, the algorithm was unable to detect the engaged columns attached to the walls in the Valentino dataset, but otherwise successfully predicted the six free-standing pillars at the centre of the scene. All datasets were processed reasonably quickly, as also shown in Figure 3. This is arguably faster than manual labelling and moreover requires less human intervention, and thus introduces less human-induced error. The experiment was performed using an Intel(R) Xeon(R) E5645 2.4 GHz CPU. A more detailed comparison, especially in terms of processing time vis-à-vis manual labelling, shall be performed further in the text. As far as the processing time is concerned, a lower point count seems to shorten the overall duration. However, the number of detected objects is also an important factor. Indeed, the processing time for Valentino is much higher in spite of it having the fewest pillars because M_HERACLES attempted to detect (and rejected) candidate pillars from the surrounding scene, i.e., walls and engaged columns. In fact, for the Valentino scene M_HERACLES detected 20 potential pillars, of which only 6 were retained. However, this also shows one of the shortcomings of the rules-based approach. Indeed, the ground truth obtained from manual labelling also classes engaged columns as "pillars". M_HERACLES, however, failed to take this into account and instead generated only free-standing pillars in its result.
Quantitatively speaking, the algorithm managed to obtain an average recall of 82.05% across the five datasets, even though the average precision is lower at 73.84% (Table 1). It is also worth noting that in Table 1 the point count for manual labelling and for M_HERACLES is different, because the manual labelling represents the reference ground truth and as such may be assumed to be free from erroneous classification.
Furthermore, as may be inferred from Table 1, the two Kasepuhan datasets gave higher precision values with lower recall rates, while the other three datasets received a better recall score at the expense of the precision. This is due mainly to the nature of the datasets. The two Kasepuhan pavilions are particular as they do not possess ceilings, the roofs having been suppressed automatically by M_HERACLES beforehand (Murtiyoso and Grussenmeyer, 2019a). In contrast, the three Italian datasets possess either ceilings or arches at the top of each column. The 2.5D approach used by M_HERACLES means that it encountered problems when dealing with different classes of objects in the vertical space (Figure 4). Figure 4 further demonstrates the limitations of the current M_HERACLES implementation (blue) of the algorithmic approach in 2.5D. In Valentino, a part of the ceiling was also labelled as "pillar". A similar phenomenon occurred with the arches and the pedestal in Pilato.
Consequently, for the European point clouds Type I error (false positives) was more dominant while for the Kasepuhan Type II error (false negatives) was more important; hence explaining the precision and recall values for both cases. The normalised F1 score value showed less discrepancy between the different architectural styles, with a mean value of 74.73%.
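The relation between the two error types and the reported scores can be made explicit with a short sketch. It assumes that the manual and automatic labels refer to the same points in the same order, which may require a prior correspondence step since the two point counts differ in practice; the function name is hypothetical. IoU is included because it appears among the metrics reported for the deep learning tests.

```python
def binary_metrics(truth, predicted):
    """Point-wise precision, recall, F1 and IoU for a single class.

    truth and predicted are equal-length sequences of booleans, True where
    a point belongs to the class of interest (here, pillars).
    """
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)  # Type I errors
    fn = sum(1 for t, p in pairs if t and not p)  # Type II errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
```

With these definitions, a result dominated by false positives (ceiling or arch points labelled as pillar) lowers precision while leaving recall high, whereas a result dominated by false negatives (missed pillar points) does the opposite.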
Furthermore, this type of semantic segmentation is less robust to noise; indeed, the systematically higher recall value is consistent with the fact that M_HERACLES employs geometric rules to perform the detection. In this regard, due to the hard-coded knowledge embedded within the algorithm, the precision, recall, and F1 score values depend strongly on the choice of geometric parameters given to the algorithm. The main goal of this step is not to create the final semantic segmentation, but rather to accelerate the training data generation for further DL processing. Table 2 shows a comparison between the annotation times required for the different scenes using manual means and automatically using M_HERACLES. This simple comparison shows that manual annotation generally takes more time than the M_HERACLES algorithm. Furthermore, manual annotation requires a domain expert in order to perform the labelling correctly, as opposed to rules-based approaches in which this knowledge is already hard-coded into the algorithm.

Comparison of processing time
If we consider the overall results, manual annotation generally guarantees more accurate predictions. However, the question that we would like to answer in this regard is whether this increase in accuracy at the expense of time and expertise can be replaced by rules-based segmentation, particularly within the context of using them further as input for neural networks. It can be beneficial to find a balance or a compromise between the quality of the results and the pre-processing times. The following subsection shall try to answer this question by feeding the results of the two approaches (manual and M_HERACLES) into our DL network and analysing the quality of the resulting semantic segmentation.

Application in deep learning training
This subsection will describe the use of the results from the previous subsection as input in DL training and how it can affect the semantic segmentation of point clouds. Two types of input scenes are used for training neural networks: manually labelled data and automatically labelled data obtained from M_HERACLES. For the experiments the DGCNN-Mod, a state-of-the-art neural network designed for the cultural heritage domain, will be used. As has already been explained, this work focuses only on the pillars/columns class. Therefore, only results related to this class will be shown and compared. Visual results and the main metrics for the segmentation task will be shown for every test. As has been previously established, experiments were performed on two different scenarios. In the visual results for the first scenario, the left side (red) shows the semantically segmented column class from the training using manually labelled data, while the right side (blue) shows the results of the training that used data derived from M_HERACLES. Another interesting point to note is that in the manually annotated scene, parts of the base of the pillars are recognised as the column class, while in the case of automatic annotation this misclassification is seen to a lesser extent. Theoretically speaking, the result from both approaches should be similar since no column base was included in either class. The reduced number of points in M_HERACLES's case due to its inherent denoising functions may have inadvertently reduced this effect for the automatic training data. Furthermore, since neural networks are by nature opaque in some of their parts, it was not possible at this point to define the nature and source of this particular error.
When considering the metrics of the single class of columns as displayed in Table 3, the precision and the F1-score are greater in the case of manual labelling. The result is inverted in the case of the recall score. The high precision score in this case indicates that manual labelling allows the training of a network that manages to correctly predict the column class. On the other hand, the M_HERACLES-based method, with its higher recall score, creates a more generic network which also assigns points from other classes to the column class.

Scenario 2:
For the second scenario a starker difference between the two training datasets can be observed.
As can be seen in Figure 6, the semantic segmentation results derived from automatic annotation are visually noisier and tend to classify the engaged columns on the walls behind the arcade of the Ghiffa scene as columns. The upper mouldings included by M_HERACLES in the training dataset it provided share many shape and geometric features with the engaged columns and half-pilasters. Furthermore, this behaviour is accentuated by the fact that only the radiometric component and normals have been used as input features. Indeed, the latter are shown to have a significant influence on the semantic segmentation results.
Quantitatively speaking, the general metrics as displayed in Table 4 are lower than those presented in the article by Matrone et al. (2020a), due to the lower number of scenes in the training set. Furthermore, compared to the previous scenario and considering solely those of the class of columns, a greater deviation between the manual and automatically derived results in terms of Precision, Recall, F1-Score and IoU can also be observed. This result comes despite the relatively noise-free result from M_HERACLES, thus adding further to the argument put forward in the previous paragraph, in which the misclassification of mouldings as columns generated a significant error.
Table 4. Results of the tests performed on the second scenario.

CONCLUSIONS
This article described an alternative methodology to be applied for automatic labelling of point clouds. The output of the analysed algorithm can be, in fact, used as input for deep learning techniques. Being a preliminary study, this research focused mainly on the column class to evaluate a possible extension of this methodology to other classes.
The results obtained show that the proposed method reduced processing time, being on average up to six times faster than traditional manual labelling. The first scenario of DL implementation showed that despite the lower annotation accuracy of automatic approaches, in simpler settings the proposed approach managed to attain a quality similar to that of manual labelling. However, in the second scenario the more complex nature of the data presented other challenges. Automatically derived training data was essentially faced with a systematic error in the form of misclassified points. We can therefore observe that annotation accuracy is also important, at least in this case and using this neural network architecture. Additionally, the importance and great relevance of normal vectors in class recognition was demonstrated particularly in this scenario.
Although the metrics derived from the DL training based on automatic annotation are lower than the manual ones in both scenarios (in Scenario 2 more so than in Scenario 1), it is also important to note that the rules-based approach is both modular and adjustable. By knowing specific problems, the M_HERACLES functions can be tuned to adapt to specific cases and can thereafter be easily and rapidly deployed in a repeated manner. Although this may seem counter-intuitive, we argue that the margin of processing time gained by the proposed method permits a further in-depth tune up of rules-based approaches to increase the results.
Most importantly, we argue that the methodology described in this paper can provide a compromise between the pursuit of the best neural network performances and the reduction of overall processing times. Indeed, in the context of DCH this is crucial due to the virtually countless types of potential objects for DL training. Such a compromise is therefore an interesting solution to greatly reduce processing time while adhering to the required geometric specifications.
Some critical issues related to the difficulty of tracing errors back through the proposed neural network remain, as is expected from the black-box nature of deep learning architectures. In this case, this issue meant that errors related to semantic segmentation results were more difficult to ascertain and, in some cases, to prove. Furthermore, the rules-based approach used in this paper as represented by the M_HERACLES toolbox also faces issues such as the requirement for further parameter tuning to improve the results. These two issues, respectively pertaining to ML-based and geometric rules-based approaches, are known and continue to be interesting topics to explore in the future. Future developments of this work may include an expansion of the proposed method to other classes in the cultural heritage context.