Weakly Supervised Pseudo-Label-Assisted Learning for ALS Point Cloud Semantic Segmentation

Competitive point cloud semantic segmentation results usually rely on a large amount of labeled data. However, data annotation is a time-consuming and labor-intensive task, particularly for three-dimensional point cloud data. Thus, obtaining accurate results with limited ground truth as training data is considerably important. As a simple and effective method, pseudo labels can exploit information from unlabeled data for training neural networks. In this study, we propose a pseudo-label-assisted point cloud segmentation method that uses very few sparsely sampled labels, randomly selected for each class. An adaptive thresholding strategy is proposed to generate pseudo labels based on the prediction probability. Pseudo-label learning is an iterative process, and pseudo labels were updated solely upon convergence on the ground-truth weak labels to improve the training efficiency. Experiments on the ISPRS 3D semantic labeling benchmark dataset indicated that our proposed method achieved a result competitive with that of a fully supervised scheme while using only up to 2‰ of the labeled points from the original training set, with an overall accuracy of 83.7% and an average F1 score of 70.2%.


INTRODUCTION
The recent development of convolutional neural networks (CNNs) has been followed by the progress of point cloud semantic segmentation tasks, and there are constantly new methods that achieve state-of-the-art results. Most of these methods focus on designing new network structures or convolution kernels based on the characteristics of point cloud data (Hu et al., 2020; Thomas et al., 2019; Huang et al., 2020; Steinsiek et al., 2017), whereas few focus on data labeling. In experiments, competitive results rely on a large amount of data annotations, which continuously require labeling operators with expert knowledge. The labeling of point cloud data is particularly difficult, and the operator usually needs to confirm the category of a point from multiple perspectives. Although labeling work is difficult and time-consuming, obtaining raw point cloud data has become easier. With the advancement of light detection and ranging sensor technology and the diversification of data acquisition platforms, obtaining massive unlabeled point cloud data is no longer a problem. Thus, extracting information that helps the task of semantic segmentation from unlabeled data is an essential issue. By using a small amount of annotated data to achieve classification results comparable to full supervision, the workload of data annotation would be considerably reduced and the efficiency in related applications would be improved.
Semi- and weakly supervised learning are commonly used to address situations in which the number of labels is scarce. There are several comprehensive reviews on these two types of supervised learning methods (Chapelle et al., 2010; Zhou, 2017; Zhu, 2005). Choosing a suitable weak-label annotation strategy is key to balancing the labeling workload and experimental results. In image processing, weak labels are represented as a few labeled images (Dong and Xing, 2018), a few labeled pixels (Bearman et al., 2016), and a few bounding boxes or categories that appear on the images (Kolesnikov and Lampert, 2016; Papandreou et al., 2015). To date, few works have used weakly supervised methods to process point cloud data. Wei et al. (2020) applied a point class activation map to classify point clouds using only cloud-level labels. However, we believe that this is not practical for point cloud data, particularly in outdoor scenes that cover a wide region, because the complete data must be divided into many small blocks and the categories contained in each block must be specified. In comparison, assigning labels to a few points is a more direct approach. Polewski et al. (2015) used an active learning method to detect standing dead trees from airborne laser scanning (ALS) point cloud data combined with infrared images. Lin et al. (2020) proposed an active and incremental learning strategy for ALS semantic segmentation, in which manual annotation was iteratively added during training. Nonetheless, their weak-label setting annotates all points of selected tiles, and manual intervention is required during training. Through theoretical analysis and experimental results, Xu and Lee (2020) found that when the number of labeled points remains constant, spatially sparse annotations achieve better results than samples gathered in certain object instances, and this weak-label setting was adopted in this work.
An illustration of the sparse weak-label situation is shown in Fig. 1. A weakly supervised semantic point cloud segmentation framework was proposed by Xu and Lee (2020), and a result approximating that of fully supervised learning was obtained using 10% of the labels. However, this type of weak label can be understood as being spatially continuous at a lower resolution, and the workload of labeling is still large. Guinard and Landrieu (2017) utilized a point cloud segmentation method to improve the classification accuracy with very few labels, but the result largely depends on the segmentation accuracy and focuses on classifying the point cloud of the area where the initial weak labels are located, which is transductive learning.
Pseudo-labels, which assign annotations to unlabeled data based on the predictions of the current model, enable the use of unlabeled data in parameter updates. Through the pseudo-label method, the classification model can calculate the loss function and backpropagate from a dataset with more labels, thereby improving the accuracy. Iscen et al. (2019) and Lee (2013) applied pseudo-labels in image classification, and Zou et al. (2021) designed pseudo-labeling and data augmentation to improve the performance of image semantic segmentation. Berthelot et al. (2019) mixed several proven semi-supervision strategies and proposed a holistic framework. Only a few studies have utilized pseudo-labels in point cloud processing. Yao et al. (2020) introduced a pseudo-labeling method into point cloud semantic segmentation. However, similar to Guinard and Landrieu (2017), that framework is also transductive learning, and the performance of the model has not been verified on untrained test data.
To cooperate with pseudo-labels, it is essential to adopt an effective semantic segmentation network structure. Owing to the irregular distribution of point clouds, the network structure is not as uniform as that of two-dimensional (2D) CNNs. Early processing methods project point clouds onto images and directly use mature 2D CNNs to train the model (Boulch et al., 2018). Another branch of methods voxelizes point clouds and proposes three-dimensional (3D) CNNs to process voxel data (Tchapmi et al., 2017). The disadvantage of these two methods is that, as point clouds need to be converted into regularized data, geometric information may be lost. PointNet and PointNet++ (Qi et al., 2017a; Qi et al., 2017b) are pioneers in the development of shared multilayer perceptrons (MLPs) to directly analyze point clouds. However, capturing the spatial distribution relationships of the point cloud requires well-designed MLPs, which complicates network structures. Compared to pointwise MLP networks, graph convolution networks construct a graph through relative spatial positions between points for feature extraction and fusion (Wang et al., 2019). Point convolution networks are similar to graph-based methods, in which point kernels are designed to learn local geometric information (Thomas et al., 2019).
In this study, we explored how to obtain reliable semantic segmentation results with very few labels. A pseudo-label-assisted point cloud semantic segmentation framework with extremely few annotations is proposed. We used KPConv (Thomas et al., 2019), a point convolution network, as our encoder network. Our weak label was defined as sparsely labeled points randomly distributed in space. An initial model was trained using the selected weak labels. Pseudo-labels were then generated by the trained model. Considering that the model obtained from weak-label training underfits the entire data space, an adaptive threshold is proposed to balance the number and accuracy of pseudo-labels. The training procedure, which combines the ground-truth labels, referred to as weak labels, and the pseudo-labels, was performed iteratively. To accelerate training progress and reduce the influence of erroneous pseudo labels on the model, we updated the pseudo labels whenever the model converged on the ground-truth labels. Experiments on the ALS dataset showed the effectiveness of the method.

METHODOLOGY
An overview of the proposed method is presented in Fig. 2. There are two stages in the framework: red lines represent incomplete training and blue lines represent pseudo-label-assisted training. In stage 1, the initial model is trained using sparse labeled points, based on which pseudo labels are generated. In stage 2, the training progress is continued with a combination of the initial ground truth and pseudo labels. Pseudo labels are iteratively updated once the mixed trained model converges on the labels. The following sections include the introduction of the semantic segmentation network, the generation and update of pseudo labels, and the entire training process.

Point Cloud Semantic Segmentation Network
This section briefly reviews the architecture of the adopted network. In this work, KPConv was used as our encoder network because of its state-of-the-art experimental results on several public datasets. KPConv adopts the idea of convolution kernels from image convolutions and can extend kernel points to be deformable to adapt to the local features of point clouds. We chose the rigid point convolution kernel and used the same network architecture as KPConv for the semantic segmentation task. The encoder network contained five convolutional layers with an embedded ResNet-like structure. Skip links were used in the decoder network, and features were propagated by nearest-neighbor upsampling.

Incomplete Supervision
In this study, weak labels are defined as limited and sparse ground-truth labels distributed across the space, which is referred to as incomplete supervision. To emulate manual annotation, we randomly selected several points per category and assigned them their ground-truth labels. Given input points $P \in \mathbb{R}^{N \times D}$ consisting of $N$ points with $D$-dimensional features, $M$ ($M \ll N$) points are assigned labels, denoted as $P_w \in \mathbb{R}^{M \times D}$ with corresponding labels $L_w \in \mathbb{R}^{M}$.
The training process under incomplete supervision is similar to that under full supervision; the only difference lies in the design of the loss function. Because only a few points carry label information, we calculated the loss on these points and performed backpropagation accordingly. We chose the softmax cross-entropy loss on the labeled points:

$$l_{true} = -\frac{1}{M}\sum_{i=1}^{M}\sum_{c=1}^{C}\hat{y}_{ic}\log y_{ic},$$

where $y_{ic}$ and $\hat{y}_{ic}$ are the prediction and label of point $p_i$, respectively, and $C$ is the number of classes.
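As a minimal sketch of this masked loss (assuming NumPy arrays; the function name and the boolean-mask interface are ours, not the authors' code), the cross-entropy can be restricted to the weakly labeled points as follows:

```python
import numpy as np

def masked_cross_entropy(logits, labels, labeled_mask):
    """Softmax cross-entropy averaged over the labeled points only.

    logits:       (N, C) raw network outputs for all N points.
    labels:       (N,) integer class ids; values at unlabeled points are ignored.
    labeled_mask: (N,) bool, True where a ground-truth weak label exists.
    """
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Negative log-likelihood of the true class, restricted to labeled points.
    idx = np.where(labeled_mask)[0]
    nll = -np.log(probs[idx, labels[idx]] + 1e-12)
    return nll.mean()
```

Unlabeled points contribute nothing to the gradient, so the network is updated exactly as under full supervision but from the $M$ weak labels alone.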

Pseudo-Label assisted Learning
In incomplete learning, model parameters are trained on partial points, and the prediction accuracy is significantly lower than that of fully supervised learning. This is because limited labeled points cannot describe the overall distribution of the data; thus, using more training data can achieve predictions with higher accuracy. As a semi-supervised learning method, pseudo-labeling can effectively exploit unlabeled data. The pseudo-label-assisted learning method predicts the unlabeled data using the model parameters obtained in previous training and treats those predictions as ground truth in the backpropagation calculation (Lee, 2013). By increasing the number of pseudo-labels, we intended to simulate the data distribution of the entire scene. We propose a framework for pseudo-label-assisted point cloud semantic segmentation; the details of the generation and update of pseudo-labels are introduced in the following two sections.

Generation of Pseudo-labels
We used the predictions at the unlabeled points of the training set to generate pseudo-labels, and the category of each point was determined as the one with the maximum posterior probability:

$$\hat{l}_i = \arg\max_{c} \, p_{ic},$$

where $p_{ic}$ is the predicted probability of point $p_i$ for class $c$. Because these are pseudo labels, some predictions are inevitably incorrect: the parameters of the trained model are obtained from limited labeled data only and underfit the entire data distribution. Therefore, label selection is necessary to obtain suitable pseudo-labels. Generally, a prediction with a higher posterior probability is more likely to be correct. Thus, a commonly used method is to set a strict threshold and select labels whose predicted probabilities exceed it. Although a high threshold yields more accurate pseudo-labels, it decreases the number of labeled points. However, Yao et al. (2020) found that the experimental results are not sensitive to the threshold setting. Instead of using a fixed value, we developed an adaptive threshold-setting method.
After obtaining the probability of each point toward its predicted category, we chose the average value as the threshold:

$$T = \frac{1}{n}\sum_{i=1}^{n} p_i, \quad p_i = \max_{c}(p_{ic}),$$

where $i$ and $n$ are the index and the number of predictions on the training data, respectively. Predictions with a probability above the threshold are considered reliable and are selected as pseudo labels.
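The adaptive-threshold selection can be sketched as follows, assuming predictions arrive as an (n, C) array of softmax probabilities (the function name is ours):

```python
import numpy as np

def generate_pseudo_labels(probs):
    """Adaptive-threshold pseudo-label selection.

    probs: (n, C) softmax probabilities predicted at the unlabeled training
    points. Returns indices and class ids of the points whose top-class
    probability exceeds the mean top-class probability over all predictions.
    """
    pseudo = probs.argmax(axis=1)   # class with the maximum posterior probability
    p_max = probs.max(axis=1)       # p_i = max_c p_ic
    threshold = p_max.mean()        # adaptive threshold: average of the p_i
    keep = p_max > threshold
    return np.where(keep)[0], pseudo[keep]
```

Because the threshold tracks the current confidence distribution, an underfit early model keeps fewer, safer labels, while a better-fit later model admits more of them without manual retuning.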

Update of Pseudo Labels
We aimed to enhance the performance of the classifier through iterative training and to identify essential pseudo-labels; thus, the previous pseudo-labels were discarded at each update. The initial pseudo-labels were generated by the model obtained from incomplete supervision, and subsequent updates were generated during pseudo-label-assisted learning. The real labels from the ground truth remain essential. Thus, to maintain their influence on model training, we used the loss function of Lee (2013), which calculates the losses of the ground truth and the pseudo-labels separately. A cross-entropy loss was still utilized for the pseudo-labels:

$$l_{pseudo} = -\frac{1}{M_p}\sum_{i=1}^{M_p}\sum_{c=1}^{C}\hat{y}'_{ic}\log y_{ic},$$

where $\hat{y}'_{ic}$ and $M_p$ are the pseudo-labels and their number, respectively. The total loss in pseudo-label-assisted learning is then defined as

$$l = l_{true} + \alpha \, l_{pseudo},$$

where $\alpha$ is the weight coefficient. Although the numbers of ground-truth and pseudo-labels differ considerably, both $l_{true}$ and $l_{pseudo}$ are loss values averaged over their respective points; thus, $\alpha$ was empirically set to 1. The pseudo-labels were updated once the model fitted the current labels.
Unlike the setting in Yao et al. (2020), convergence on the ground truth alone is used as the condition for updating pseudo labels. We believe that pseudo labels contain errors, particularly in complex scenes; thus, completely fitting the pseudo-labels may propagate these errors. In addition, because $l_{true}$ is retained throughout the training process, convergence on the ground truth is sufficient to guarantee the performance of the model. Another advantage is that it increases the frequency of pseudo-label updates and improves training efficiency. A comparison is presented in the Experimental section. Convergence was determined by the minimum training accuracy within one epoch: if this value exceeded 99%, the pseudo labels were updated at the end of the epoch.

Training Progress
The entire training process consisted of incomplete supervised learning and pseudo-label-assisted learning; the procedure is detailed in Algorithm 1. First, we trained the model with the initial ground-truth labels for 100 epochs. The initial pseudo labels were then generated by the trained model, and pseudo-label-assisted learning began. Whenever the model converged on the ground truth, the pseudo-labels were updated. This second stage also lasted 100 epochs.
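The two-stage schedule above could be sketched as follows. This is an illustrative outline, not a reproduction of Algorithm 1: `model` is a hypothetical wrapper whose `train_epoch`, `predict_probs`, and `min_epoch_accuracy` methods stand in for the actual KPConv training pipeline.

```python
import numpy as np

def select_pseudo(probs):
    """Adaptive threshold: keep predictions whose top probability exceeds the
    mean top probability over all training points."""
    p_max = probs.max(axis=1)
    keep = p_max > p_max.mean()
    return np.where(keep)[0], probs.argmax(axis=1)[keep]

def train_with_pseudo_labels(model, points, weak_idx, weak_labels,
                             stage1_epochs=100, stage2_epochs=100, alpha=1.0):
    """Two-stage schedule: incomplete supervision, then pseudo-label-assisted
    training with updates triggered by convergence on the weak labels."""
    # Stage 1: train on the sparse ground-truth weak labels only.
    for _ in range(stage1_epochs):
        model.train_epoch(points, weak_idx, weak_labels)
    # Initial pseudo labels from the stage-1 model.
    pl_idx, pl_labels = select_pseudo(model.predict_probs(points))
    # Stage 2: mixed training; refresh pseudo labels whenever the minimum
    # per-epoch accuracy on the ground-truth labels exceeds 99%.
    for _ in range(stage2_epochs):
        model.train_epoch(points, weak_idx, weak_labels, pl_idx, pl_labels, alpha)
        if model.min_epoch_accuracy(weak_idx, weak_labels) > 0.99:
            pl_idx, pl_labels = select_pseudo(model.predict_probs(points))
    return model
```

Checking convergence only on the weak labels, rather than on the full mixed label set, is what allows the frequent pseudo-label refreshes described above.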

Dataset
The ISPRS benchmark dataset (Rottensteiner et al., 2014) was used in this study, as illustrated in Fig. 3. The dataset contains ALS data obtained with the Leica ALS50 system and a corresponding remote sensing image for extracting color information; both were acquired over the Stuttgart region of Germany. The point density of the data was between 4 and 7 points per m². The corresponding remote sensing image covered the entire area with a ground sampling distance of 8 cm. There were 9 categories in the dataset: powerlines, low vegetation, impervious surfaces, cars, fences, building roofs, building facades, shrubs, and trees. The data were divided into training and testing sets containing 753,859 and 411,721 points, respectively. The number of points per category varied considerably, with most points belonging to four categories: low vegetation, impervious surfaces, building roofs, and trees. These figures are listed in Table 1.
The dataset contained multiple scan strips, and the distance between some points in the overlap areas was very small. To remove redundant points in overlapping ALS strips and maintain an even point density, we set the subsampling grid size to d = 0.4 m and, during testing, assigned labels to deleted points according to their nearest neighbors. After subsampling, the training set contained 401,892 points, and weak labels were selected from the subsampled data. We followed three weak-label selection criteria:
• Weak labels were randomly selected from the training data;
• The number of weak labels in each category did not exceed 10% of the size of that category;
• Smaller weak-label settings were subsets of larger ones. For instance, the labels selected in the 15-weak-label setting were fully included in the 30-weak-label setting.
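One way to realize the three criteria is sketched below. This is our illustration rather than the authors' code; in particular, the seeded-permutation trick used to obtain the nesting property is an assumption about how such subsets could be drawn.

```python
import numpy as np

def select_weak_labels(labels, n_per_class, rng):
    """Randomly pick up to n_per_class weak labels per class, capped at 10%
    of the class size. Drawing each class from one seeded permutation makes
    smaller settings prefixes (hence subsets) of larger ones."""
    chosen = []
    for c in np.unique(labels):
        cls_idx = np.where(labels == c)[0]
        order = rng.permutation(cls_idx)                  # fixed random order
        k = min(n_per_class, int(0.10 * len(cls_idx)))    # 10% cap per class
        chosen.append(order[:k])                          # prefix => nested sets
    return np.concatenate(chosen)
```

Calling this with the same seed for n_per_class = 15, 30, 60, 100 yields nested label sets, matching the third criterion.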
Four settings of the number of weak labels per class were examined: 15, 30, 60, and 100, corresponding to 135, 270, 512, and 832 initially labeled points, respectively. The 100-weak-label setting accounted for approximately 0.2% of the total points.

Implementation
During training, we sliced blocks with a radius of 30 m, and the selected central point each time was the one that had been included in training the fewest times. The models were implemented in the TensorFlow framework and trained on a GeForce GTX 1080 Ti 11 GB GPU. The entire training process lasted approximately 3 h.
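The least-trained-center block sampling could be sketched as follows (a sketch under our assumptions: 2D horizontal distances and a per-point counter array are our choices, not details given in the paper):

```python
import numpy as np

def next_block(xy, trained_count, radius=30.0):
    """Pick the point trained the fewest times as the block centre and return
    the indices of all points within `radius` of it (horizontal distance)."""
    centre = int(trained_count.argmin())
    d = np.linalg.norm(xy - xy[centre], axis=1)
    idx = np.where(d <= radius)[0]
    trained_count[idx] += 1          # bookkeeping for the next draw
    return centre, idx
```

Incrementing the counter for every point in the block biases subsequent draws toward under-visited regions, so the whole training area is covered evenly over an epoch.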

Evaluation
Following the evaluation criteria of the ISPRS benchmark, we used the overall accuracy (OA) and the F1 score to evaluate the performance of our method. OA is the percentage of correctly classified predictions. The F1 score is the harmonic mean of precision and recall:

$$precision = \frac{tp}{tp + fp}, \quad recall = \frac{tp}{tp + fn}, \quad F1 = 2 \times \frac{precision \times recall}{precision + recall},$$

where $tp$, $fp$, and $fn$ are the numbers of true positives, false positives, and false negatives, respectively.
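These metrics are straightforward to compute from label arrays; a small NumPy sketch (function names are ours):

```python
import numpy as np

def overall_accuracy(pred, gt):
    """Fraction of points whose predicted class matches the ground truth."""
    return (pred == gt).mean()

def f1_per_class(pred, gt, c):
    """Per-class F1 from true positives, false positives, and false negatives."""
    tp = np.sum((pred == c) & (gt == c))
    fp = np.sum((pred == c) & (gt != c))
    fn = np.sum((pred != c) & (gt == c))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

The average F1 reported in the tables is the unweighted mean of `f1_per_class` over the nine categories.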

Results and analysis
We first trained the model under the weak-label settings, and pseudo-label-assisted learning was then performed with the trained model parameters. For comparison, a result was also obtained with the KPConv network under full supervision. Figure 4 shows the OA and average F1 score (Avg. F1) under the different conditions. OA and Avg. F1 were significantly improved after pseudo-label-assisted learning. The improvement was most evident in the 15-weak-label setting: the OA improved by nearly 10%, from 70.8% to 80.4%, and the Avg. F1 from 51.9% to 63.1%. In the 100-weak-label setting, the OA and Avg. F1 were 84.2% and 70.5%, slightly below those under full supervision. With pseudo-label-assisted learning, we achieved a result comparable to that of full supervision using only approximately 0.2% of the full labels in the training set. The classification result and error map for the 100-weak-label setting are shown in Fig. 5.
Two methods with leading results, both using fully supervised learning during training, were also compared in this work; the results are presented in Table 2. NANJ2 (Zhao et al., 2018) and WhuY4 (Yang et al., 2018) performed better in terms of OA, reaching 85.2% and 84.9%, respectively, whereas KPConv achieved a higher Avg. F1. This illustrates the effectiveness of KPConv in point cloud semantic segmentation, and it also demonstrates that a result competitive with fully supervised methods was achieved with very limited and sparse ground-truth labels. A detailed comparison of the results before and after pseudo-label-assisted learning is presented in Table 2. The F1 score of almost every category was considerably improved under all weak-label settings, except for the shrub class under the 15-weak-label setting, in which the number of weak labels was very small. This reveals a limitation of pseudo-label methods: a model with poor performance generates pseudo labels with considerable errors. Because these false predictions are treated as ground-truth labels in training, when the error rate in certain categories is too high, false pseudo-labels dominate subsequent training iterations and degrade the model performance in those categories, yielding overconfident predictions (Lokhande et al., 2020).
To verify the effectiveness of our pseudo-label update strategy, a comparison experiment was conducted, as shown in Fig. 6 (comparison of the overall accuracy of validation on the training data in the 100-weak-label setting between the two pseudo-label update strategies). PL-all denotes the strategy in which pseudo labels are updated when the model converges on both the weak and pseudo labels, and PL denotes our method. As pseudo-label-assisted learning is transductive on the training data, we used the OA of the training data to indicate the training progress. We observed that PL achieved better accuracy in early epochs, meaning that the model was trained faster. In addition, the OA of PL was higher than that of PL-all at the end of training. However, because the number of training epochs was fixed at 100, the time consumed was the same for both strategies.
The effect of the pseudo-label method is similar to that of entropy regularization (Grandvalet et al., 2005), which treats the probability of predictions $P$ ($P \le 1$) as 1. Therefore, the posterior probability of predictions increases through the information transmitted by pseudo-labels during training. We found this beneficial for label smoothness and for reducing noise in the predictions. Figure 7 presents a comparison of details in the training set before and after pseudo-label-assisted learning. Within the black boxes, misclassified points of tree and low vegetation were corrected, indicating that the accuracy of the pseudo labels generated from the training set increased with the iterative updates.

CONCLUSION
In this work, we proposed a pseudo-label-assisted point cloud semantic segmentation method for ALS data in urban areas. KPConv was used as the backbone for feature extraction. An adaptive threshold was designed to balance the training accuracy and the required number of pseudo labels. In addition, we used convergence on the ground-truth labels as the condition for updating pseudo-labels, which considerably improved the training efficiency. Experiments indicated that a result competitive with the full supervision scheme was achieved with only 2‰ of the original abundant labeled points.
In the future, the limitations of pseudo-labels mentioned in the experimental analysis remain to be addressed. Because the recursive generation of pseudo labels is completely determined by the probability of the predictions, the pseudo labels are highly correlated with the selected weak labels. Thus, additional constraints are required to control the generation of pseudo-labels. In addition, an elaborated weak-label selection strategy could help achieve better performance with the same number of weak labels, which would further reduce the data annotation effort.

ACKNOWLEDGEMENTS
The dataset was provided by the German Society for Photogrammetry, Remote Sensing, and Geoinformation (DGPF). The work described in this paper was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. PolyU 25211819), and was also supported by a grant from the Hong Kong Polytechnic University (Project No. G-YBZ9).