WEAKLY SUPERVISED LEARNING FOR TREELINE ECOTONE CLASSIFICATION BASED ON AERIAL ORTHOIMAGES AND AN ANCILLARY DSM

Convolutional neural networks (CNNs) effectively classify standard datasets in remote sensing (RS). Yet, real-world data are more difficult to classify using CNNs because these networks require relatively large amounts of training data. To reduce training data requirements, two approaches can be followed – either pretraining models on larger datasets or augmenting the available training data. However, these commonly used strategies do not fully resolve the lack of training data for land cover classification in RS. Our goal is to classify trees and shrubs from aerial orthoimages in the treeline ecotone of the Krkonoše Mountains, Czechia. Instead of training a model on a smaller, human-labelled dataset, we semiautomatically created training data using an ancillary normalised Digital Surface Model (nDSM) and image spectral information. This approach can complement existing techniques, trading accuracy for a larger labelled dataset while assuming that the classifier can handle the training data noise. Weakly supervised learning on a CNN led to 68.99% mean Intersection over Union (IoU) and 81.65% mean F1-score for U-Net and 72.94% mean IoU and 84.35% mean F1-score for our modified U-Net on a test set comprising over 1000 manually labelled points. Notwithstanding the bias resulting from the noise in training data (especially in the least occurring tree class), our data show that standard semantic segmentation networks can be used for weakly supervised learning for local-scale land cover mapping.


INTRODUCTION
CNNs are currently one of the most commonly used classifiers in remote sensing, but their effective application requires large amounts of training data. Because acquiring more training data is not always feasible, many authors focus on approaches for reducing training data requirements. The most common strategies are transfer learning and data augmentation (Kattenborn et al., 2021). In this paper, we instead focus on a third approach: weakly supervised learning. To explain the rationale underlying our choice, we will describe these three approaches and their applications, discussing their advantages and disadvantages.
Transfer learning involves pretraining the network or a part of it on a larger dataset. These datasets are often unrelated to the specific application, and most pretrained networks are based on three-band RGB images, for example ImageNet. As a result, remote sensing practitioners must either transform their data into three bands (Kattenborn et al., 2021) or use more than one encoder, as proposed by Audebert et al. (2018). Their approach can be utilised with multiple pretrained backbones, which can be particularly helpful when analysing multimodal data. Overall, this approach has been successful, but more significant advantages for remote sensing may derive from network backbones pretrained on large multispectral datasets, such as SEN12MS (Schmitt et al., 2019).
Data augmentation is another effective technique that compensates for insufficient training data. This approach augments existing data by changing them slightly, which allows the network to generalise more successfully. The most common approaches to data augmentation consist of rotating the existing training data and introducing random noise into the imagery. This technique has been exceptionally successful; in fact, almost half of all recent studies on CNNs for vegetation remote sensing augment their training data (Kattenborn et al., 2021).
Weakly supervised learning stands out among the less commonly used strategies for overcoming the lack of training data, such as synthetic data creation and semi-supervised learning (Kattenborn et al., 2021). Weakly supervised learning consists of training a network using low-quality labels (Kattenborn et al., 2021). As noted by Schmitt et al. (2020), most remote sensing studies involving weakly supervised learning have focused on object detection, whereas only a couple of studies have used this approach for semantic segmentation. These semantic segmentation studies trained models on sparse or image-level annotations, but Schmitt et al. (2020) worked with dense, noisy labels. More precisely, these authors attempted to classify global land cover from high resolution Sentinel-1/-2 data using CNNs trained on a MODIS-based land cover map. Despite highlighting that this complex classification scheme remains a challenge, they also indicated that weakly supervised learning may be useful in less complex scenarios. For example, Weinstein et al. (2019) successfully detected trees in RGB imagery by first training a network on a LiDAR-based unsupervised annotation and subsequently refining the model on a small number of hand-annotated images. For this reason, we focus on weakly supervised learning for semantic segmentation in this contribution.
Our goal is to discriminate Norway spruce (Picea abies) trees from dwarf pine (Pinus mugo) shrubs in multispectral (visible and near infrared) aerial images. Conventional machine learning approaches (support vector machine, random forest, etc.) were first considered. However, preliminary data exploration using conventional machine learning and image segmentation showed that these classifiers are unsuitable for the given dataset, as was further confirmed by Dvorak (2020). The complex spatial structure of spruces, especially their shadows, and of dwarf pines proved difficult to address using object-based as well as pixelwise classification.
Thus, our approach is similar to that of Weinstein et al. (2019), albeit aimed at semantic segmentation of two vegetation classes while avoiding training on hand-annotated data. For this purpose, we trained two CNNs on a noisy classification, which was independently derived using an ancillary normalised Digital Surface Model (nDSM) and spectral information. The nDSM is available only for a part of our study area and, therefore, cannot be used as a classification feature. Despite the inherent training data noise, we hypothesize that CNNs can identify the spatial structure of trees in multispectral images.

KRKONOŠE MOUNTAINS
The Krkonoše Mountains in the Sudetes Mountain system form a natural border between Czechia and Poland. The area is designated as a UNESCO biosphere reserve and as a national park on both sides of the border, known as Krkonošský Národní Park (KRNAP) in Czechia and as Karkonoski Park Narodowy in Poland (UNESCO, 2016). This area contains an arctic-alpine tundra, a relict ecosystem unique for its flora. The Krkonoše tundra consists of two large, detached sections: Western (1284 ha) and Eastern (2284 ha). Connecting the tundra to spruce-dominated montane forests below is the treeline ecotone (Treml and Chuman, 2015). The alpine treeline ecotone is characterised by an extensive dwarf pine shrub cover and by spruce stands sparsening with altitude. Local dynamics of the ecotone were analysed by Treml and Chuman (2015), who found that the timberline has advanced upwards by 0.43 m per year since the 1930s. The authors noted that this process is comparatively slower in the Sudetes Mountain system than in other European mountain ranges, possibly due to its high pine shrub cover.
These and potential other changes in treeline ecotone dynamics are monitored by the KRNAP administration. In total, 13 transects have been delineated for this purpose, as shown in Figure 1. The transects were defined with different spruce and dwarf pine densities and distributions throughout the area. In total, the transects cover 7.79 km², with six of the transects located in the western tundra and the remaining seven in the eastern tundra.

Aerial imagery:
Our study uses multispectral aerial imagery captured on the 18th and 27th of June 2012. The imagery covers the whole study area and is divided into 2×2.5 km tiles with 8-bit radiometric resolution. As is relatively common in real-world tasks, we had no access to the original data, only to derived products: Red, Green, Blue (RGB) and Near Infrared, Red, Green (CIR) orthorectified composites with a ground sampling distance (GSD) of 0.2 m.
These products share two spectral bands (Red and Green), but their simple combination into a four-band dataset (NIR, R, G, B) is impractical because both composites were radiometrically enhanced separately. Therefore, the red and green bands do not share values between products, so the two datasets were used separately.
An ancillary nDSM served as a basis for creating training and validation labels. The data for this model were gathered with a point cloud density of 5 points/m² using a RIEGL LMS Q-680i full waveform scanner in spring of 2013 (Puchrik and Nýdrle, 2013). We had access to the nDSM as a preprocessed raster with a GSD of 0.125 m. This original raster was subsequently resampled using Nearest Neighbour interpolation to match the multispectral imagery (0.2 m × 0.2 m). Although the two datasets were captured a year apart, we do not consider this to be an issue as there were no observable changes in the area.
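The resampling step can be illustrated with a minimal NumPy sketch. It assumes that the source and target grids share the same top-left origin, which the text does not state; the function name `resample_nearest` and the toy raster are hypothetical, and a GIS library would normally handle georeferencing.

```python
import numpy as np

def resample_nearest(src, src_gsd, dst_gsd):
    """Nearest-neighbour resampling of a 2-D raster to a new GSD.

    Assumption of this sketch: both grids share the same top-left origin.
    """
    # Number of whole target pixels that fit into the source extent.
    rows = int(src.shape[0] * src_gsd / dst_gsd + 1e-9)
    cols = int(src.shape[1] * src_gsd / dst_gsd + 1e-9)
    # Centre of each target pixel, expressed in source pixel coordinates.
    r_idx = np.floor((np.arange(rows) + 0.5) * dst_gsd / src_gsd).astype(int)
    c_idx = np.floor((np.arange(cols) + 0.5) * dst_gsd / src_gsd).astype(int)
    r_idx = np.clip(r_idx, 0, src.shape[0] - 1)
    c_idx = np.clip(c_idx, 0, src.shape[1] - 1)
    return src[np.ix_(r_idx, c_idx)]

# Toy 8x8 raster standing in for an nDSM at 0.125 m GSD.
ndsm_fine = np.arange(64, dtype=np.float32).reshape(8, 8)
ndsm_coarse = resample_nearest(ndsm_fine, src_gsd=0.125, dst_gsd=0.2)
```

Nearest-neighbour resampling copies existing height values rather than averaging them, so sharp canopy edges in the nDSM are not smoothed.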

Experiment setup
Our goal was to classify spruce trees and dwarf pine shrubs in the areas of interest of the Krkonoše treeline ecotone. Consequently, we defined three semantic classes for classification: spruce trees (Picea abies), dwarf pines (Pinus mugo) and background. The complex spatial structure of these classes is extractable using a deep Convolutional Encoder-Decoder network. Accordingly, we trained two deep models, as described in section 3.4.
To confirm the results, we generated 1086 points in the 13 evaluated transects by stratified random sampling: 543 points for the eastern and 543 points for the western part of the territory. The number of points was derived according to Foody (2009) for a two-sided confidence interval α = 2%. The points were subsequently labelled based on visual interpretation of the multispectral imagery, essentially creating a testing dataset.
As the available nDSM covered only a portion of the treeline ecotone, we generated a semiautomatic annotation in the area where imagery and nDSM overlap. This semiautomatically labelled region was subsequently used for training and validating our Encoder-Decoder networks. Five of the eastern transects were also selected to validate our training / validation area. The Encoder-Decoder results found in these five transects were also compared with results in the remaining transects using the aforementioned hand-labelled validation points. This comparison allows us to gauge the generalizability of our models.
All Encoder-Decoders were trained on a Windows workstation equipped with an Intel i9-7940X CPU, 64 GB of RAM and an Nvidia GeForce GTX 1070 GPU (8 GB VRAM). The models were written for PyTorch 1.4 with CUDA 10.2.

Semiautomatic annotation
Creating training data for our task manually is impractical given the overall size of the study area. To simplify this task, we took advantage of spruce tree and pine shrub aboveground heights in the area. As we were operating in the Krkonoše tundra, the only other objects with similar heights were either manmade structures or dead vegetation (Treml and Chuman, 2015). This fact allowed us to use aboveground heights (where nDSM is available) as a feature to create the training and validation dataset.
Thus, when creating our labels, we assumed that: 1) green trees and shrubs have a normalised difference vegetation index (NDVI) higher than zero; 2) trees and shrubs can be identified from their height. We computed NDVI from CIR imagery and assigned all pixels with NDVI below 0 to the background class. To differentiate vegetation types, we labelled pixels with an nDSM value above 2 m as spruce trees and pixels with an nDSM value between 0.12 m and 2 m as dwarf pine shrubs. These arbitrary values were selected after visual comparison of multiple potential height values.
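The labelling rules above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the original implementation: the class ids, the function name `weak_labels`, the handling of NDVI exactly equal to zero (treated as background) and the assignment of vegetated pixels lower than 0.12 m to the background are choices made for this sketch.

```python
import numpy as np

BACKGROUND, PINE, SPRUCE = 0, 1, 2  # class ids chosen for this sketch

def weak_labels(nir, red, ndsm, ndvi_min=0.0, shrub_min=0.12, tree_min=2.0):
    """Derive noisy labels from the NIR and Red bands and an nDSM (metres)."""
    # Convert to float first so that 8-bit sums do not overflow.
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    denom = nir + red
    # NDVI = (NIR - Red) / (NIR + Red); a zero denominator yields NDVI 0.
    ndvi = np.divide(nir - red, denom, out=np.zeros_like(denom), where=denom > 0)
    labels = np.full(ndsm.shape, BACKGROUND, dtype=np.uint8)
    vegetated = ndvi > ndvi_min
    labels[vegetated & (ndsm >= shrub_min) & (ndsm <= tree_min)] = PINE
    labels[vegetated & (ndsm > tree_min)] = SPRUCE
    return labels

# Four toy pixels: tall vegetation, low vegetation, tall non-vegetation, ground.
nir = np.array([[200, 200, 50, 200]], dtype=np.uint8)
red = np.array([[50, 50, 200, 50]], dtype=np.uint8)
ndsm = np.array([[3.0, 1.0, 3.0, 0.05]])
labels = weak_labels(nir, red, ndsm)
```

In this toy example the third pixel is tall but has a negative NDVI, so it stays in the background class, which mirrors how man-made structures and dead vegetation are excluded.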
Our semiautomatic annotation considerably reduced the time required to create enough training data. However, this method introduces a degree of uncertainty. For this reason, we only used the labels for model training and validation, whereas the reported accuracy of the test set was based on visual interpretation of the original imagery.

Encoder-Decoder networks
A family of CNNs sometimes known as Convolutional Encoder-Decoders includes networks such as SegNet or U-Net, which are well suited to pixelwise classification/semantic segmentation. U-Net was selected as the basis for this task, given its recent popularity for solving comparable remote sensing tasks (Brandt et al., 2020; Huang et al., 2018; Kattenborn et al., 2019).
We performed no data augmentation as it may not be needed given the relatively large size of our training and validation dataset. Rotating the input patches may even have been harmful to the classification, given that this would rotate tree shadows, while their location in relation to the trees is a potentially valuable feature for classification.
Spruce trees cover a relatively small area in comparison to the other two classes. We therefore assigned class weights when computing the loss function; a similar scheme was previously used by Audebert et al. (2018). The weights were assigned as follows: background 0.1, pine shrubs 0.2 and spruce trees 0.7. This setup will likely improve the results for spruce trees at the expense of overall accuracy.
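To make the effect of the class weights concrete, the following NumPy sketch computes a class-weighted cross entropy in which each pixel's loss is scaled by the weight of its reference class and the sum is normalised by the total weight. The function name and the toy inputs are hypothetical. PyTorch's `torch.nn.CrossEntropyLoss` accepts a per-class `weight` tensor and, with mean reduction, normalises in the same way; whether the original implementation used it is not stated in the text.

```python
import numpy as np

# Weights from the text: background, pine shrubs, spruce trees.
CLASS_WEIGHTS = np.array([0.1, 0.2, 0.7])

def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """logits: (N, C) raw scores; targets: (N,) integer class ids."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_pixel = -log_probs[np.arange(len(targets)), targets]
    w = weights[targets]
    # Weighted mean: each pixel counts in proportion to its class weight.
    return float((w * per_pixel).sum() / w.sum())

targets = np.array([0, 2])  # one background pixel, one spruce pixel
# Both pixels predicted as background: the spruce pixel is wrong.
loss_spruce_missed = weighted_cross_entropy(
    np.array([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]]), targets)
# Both pixels predicted as spruce: the background pixel is wrong.
loss_background_missed = weighted_cross_entropy(
    np.array([[0.0, 0.0, 5.0], [0.0, 0.0, 5.0]]), targets)
```

With these weights, missing a spruce pixel is penalised roughly seven times more than missing a background pixel, which is the intended bias towards the rarest class.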
Data are fed into the networks in the form of 512×512-pixel tiles with 50% overlap and six-band inputs. Using these relatively large tiles results in 1005 tiles for training and 252 tiles for validation. Both networks were trained using Adam (Kingma and Ba, 2015) as the optimiser and cross entropy as the loss function. During training, a batch size of 2 was selected for both our models due to the memory constraints of our GPU. Suitable values for other hyperparameters were identified using three-fold cross validation, and they are detailed in Table 1. This adjustment results in network outputs with the same spatial resolution as inputs. As in the original U-Net (Ronneberger et al., 2015), the weights and biases of the network were randomly initialized from a Gaussian distribution.
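The tiling scheme can be sketched as follows. With a 512-pixel tile and 50% overlap, neighbouring tiles are offset by 256 pixels. The sketch assumes that incomplete tiles at the image edges are dropped, which the text does not specify; the function name and the toy image size are hypothetical.

```python
import numpy as np

def tile_origins(height, width, tile=512, overlap=0.5):
    """Top-left corners of all full tiles; partial edge tiles are dropped."""
    stride = int(tile * (1 - overlap))  # 256 px for 512 px tiles at 50% overlap
    rows = range(0, height - tile + 1, stride)
    cols = range(0, width - tile + 1, stride)
    return [(r, c) for r in rows for c in cols]

# Toy six-band image with layout (bands, height, width).
image = np.zeros((6, 1024, 1536), dtype=np.uint8)
origins = tile_origins(image.shape[1], image.shape[2])
tiles = [image[:, r:r + 512, c:c + 512] for r, c in origins]
```

Because of the overlap, every interior pixel appears in up to four tiles, which increases the number of training samples without adding new imagery.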

KrakonosNet:
A modification of U-Net was introduced by Dvorak (2020). Its goal is to reduce overfitting of the network for tree classification in aerial imagery. The author proposes four modifications to the original U-Net structure: 1) halving the number of feature maps in each convolution layer, inspired by Wagner et al. (2019); 2) batch normalization after each convolution, inspired by Zhao et al. (2019); 3) dropout after the last convolutional layer (50% probability); 4) a Parametric Rectified Linear Unit (PReLU) activation function instead of the traditional ReLU, inspired by Nwankpa et al. (2018). Other implementation specifics are the same as for our U-Net.
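The effect of the first modification on model size can be estimated with simple arithmetic. The sketch below counts the weights and biases of the 3×3 convolutions in a U-Net-style contracting path, using the feature-map widths of the original U-Net (64 to 1024) and a six-band input. It covers only the contracting path and ignores batch normalization and PReLU parameters; these simplifications and the function names are assumptions of this sketch, not figures reported by the authors.

```python
def conv3x3_params(c_in, c_out):
    """Weights plus biases of one 3x3 convolution layer."""
    return c_in * c_out * 9 + c_out

def encoder_params(in_bands, widths):
    """Two 3x3 convolutions per resolution level, as in the U-Net contracting path."""
    total, c_in = 0, in_bands
    for c_out in widths:
        total += conv3x3_params(c_in, c_out) + conv3x3_params(c_out, c_out)
        c_in = c_out
    return total

UNET_WIDTHS = [64, 128, 256, 512, 1024]        # original U-Net feature maps
HALVED_WIDTHS = [w // 2 for w in UNET_WIDTHS]  # first modification listed above

full = encoder_params(6, UNET_WIDTHS)
halved = encoder_params(6, HALVED_WIDTHS)
ratio = halved / full
```

Since both the input and output channel counts of most layers are halved, the number of convolution weights drops to roughly a quarter, not a half.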

RESULTS
To evaluate our noisy training data, we created a confusion matrix which compares the training data to 382 overlapping hand-labelled points (Figure 2). This analysis showed a relatively high confusion between the background and pine shrubs, especially for pines, which were often misclassified as background. Even more notable is the underrepresentation of spruce trees, with just over a third (38%) of spruces being correctly represented in the weak training labels.
Figure 3 shows patches from the five training / validation transects, highlighting the underrepresentation of spruce trees. The automatically labelled patches also show a characteristic "salt and pepper" effect, which is especially apparent in the confusion between the background and dwarf pines. The resulting classifications have, however, filtered out most of this noise. Differences between individual results obtained with U-Net and KrakonosNet appear minor based on visual interpretation, but the importance of these slight differences may show during quantitative analysis.
Analysis of confusion matrices presented in Figure 4 reveals that both models performed quite similarly, with KrakonosNet slightly outperforming U-Net. However, both models significantly underestimate spruce trees, and the producer's accuracy for this class was only 55.46% (U-Net) and 63.03% (KrakonosNet). This result is nevertheless considerably better than that of the original training data (38%). In other words, the networks learned some properties of trees from noisy annotations and delineated the trees more accurately than the training data. A notable improvement over the noisy labels has also been achieved in confusion between pine shrubs and the background. Estimations of how well a given model generalised are shown in Table 2.
A lower difference between values for the test and training/validation set indicates that the model has learned truly valuable features for the given task, while a larger difference means that the model is potentially overfitted. Both our models had more than 8% difference in F1 score and IoU for pine shrubs and spruce trees. This difference is especially notable given the relatively low agreement between noisy labels and our validation points (Figure 2). KrakonosNet also has lower differences between results for the training/validation and test dataset than U-Net across all classes and metrics while reaching better overall results on the test set.
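For reference, the accuracy metrics used in this section can be derived from a confusion matrix as sketched below. The matrix values are hypothetical and do not correspond to Figures 2 or 4; the row and column convention (rows are reference classes, columns are predictions) is an assumption of this sketch, and a class with no reference or predicted points would cause a division by zero.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of reference points of class i predicted as class j."""
    cm = np.asarray(cm, dtype=np.float64)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp  # reference points of the class that were missed
    fp = cm.sum(axis=0) - tp  # points wrongly assigned to the class
    producers = tp / (tp + fn)            # producer's accuracy (recall)
    users = tp / (tp + fp)                # user's accuracy (precision)
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return producers, users, f1, iou

# Hypothetical counts; class order: background, pine shrubs, spruce trees.
cm = [[50, 5, 0],
      [10, 30, 0],
      [2, 8, 10]]
producers, users, f1, iou = per_class_metrics(cm)
mean_iou = iou.mean()
mean_f1 = f1.mean()
```

In this toy matrix half of the spruce reference points are missed (producer's accuracy of 0.5) although every spruce prediction is correct, which is the same kind of underestimation described above.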

DISCUSSION
Our results show that our Encoder-Decoders have learned useful features of spruce trees and pine shrubs from aerial imagery while being trained on noisy labels. Learning these features allows both models to outperform original labels by filtering out the noise.
KrakonosNet slightly outperformed U-Net across all metrics in the test set. One notable difference between the networks lies in the producer's accuracy for spruce trees, which also led to the higher F1 and IoU scores for KrakonosNet for this class. We can therefore confirm that the network structure adaptations prove useful for this task, but a more in-depth analysis of the usefulness of individual network adaptations should be conducted in further studies. KrakonosNet is also considerably faster to train because it only has half of the convolutional filters of U-Net. A considerable improvement in classification accuracy might have been achieved by using a state-of-the-art Encoder-Decoder, but we believe that there is value in first experimenting with a relatively well-understood architecture.
Contrary to Schmitt et al. (2020), we conclude that standard models such as U-Net (perhaps with a few simple adaptations) may be sufficient for our task given that mapping two vegetation classes on a local scale is considerably easier than the global land cover mapping presented in their contribution.
Although our overall results were satisfactory, a more careful analysis of accuracy metrics and the visualised results identifies several shortcomings of this approach. Firstly, shadows are difficult to successfully classify in this context as the appropriate label cannot always be assigned even based on a visual interpretation. We observe (as illustrated in Figure 3) that lower parts of the spruce trees are often misclassified as dwarf pine shrubs, especially on the shadowed half of the trees. This misclassification is likely caused by a systemic issue within our algorithm for generating training data.
Our training data generation technique assigns parts of spruce trees below two meters to the pine shrub class, thus misclassifying tree edges. This height restriction also taught the network to mistake newly planted forests for dwarf pine shrubs.
To correct this issue, the existing training data may be improved by a simple procedure. For example, the existing labels for spruce trees could be extended by a buffer of 50 cm in all directions, which would help to properly cover the spruce trees. However, this systematic error may also be resolved by creating training data through a different strategy.
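A minimal sketch of such a buffer is given below, implemented as a binary dilation of the spruce mask with a disc. At a GSD of 0.2 m, a 50 cm buffer corresponds to a radius of 2.5 pixels. The function names and class id are hypothetical, and the sketch lets the grown spruce mask overwrite any neighbouring label, which is one possible design choice and not necessarily the intended one.

```python
import numpy as np

def dilate_disc(mask, radius_px):
    """Binary dilation of a boolean mask with a disc of the given radius (pixels)."""
    r = int(np.floor(radius_px))
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    out = np.zeros(mask.shape, dtype=bool)
    h, w = mask.shape
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy * dy + dx * dx <= radius_px ** 2:
                # Shifted copy of the mask: out[y, x] |= mask[y + dy, x + dx].
                out |= padded[r + dy:r + dy + h, r + dx:r + dx + w]
    return out

def buffer_spruce(labels, spruce_id=2, buffer_m=0.5, gsd_m=0.2):
    """Extend the spruce class by a buffer; grown pixels overwrite other labels."""
    grown = dilate_disc(labels == spruce_id, buffer_m / gsd_m)
    out = labels.copy()
    out[grown] = spruce_id
    return out

# Toy label patch: a single spruce pixel surrounded by dwarf pine (class 1).
labels = np.ones((7, 7), dtype=np.uint8)
labels[3, 3] = 2
buffered = buffer_spruce(labels)
```

Libraries such as SciPy offer equivalent morphological operations; the explicit loop is used here only to keep the sketch dependency-free.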
One such strategy for automatically creating an annotation includes finding local maxima of the nDSM raster and creating circular buffers with a set radius around them, thus delineating the spruce tree canopies in the process. The radius of buffers may even be tied to the treetop height, thereby creating larger buffers for higher trees. This relationship between tree height and width would, however, need to be calibrated whenever imagery captured by a different sensor is used, due to differences commonly found between aerial orthoimages caused by the image geometry and rectification process. These strategies for generating training data should be considered as part of a sensitivity study. Such a study should also consider introducing shaded classes.
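The treetop-based strategy could look roughly like the sketch below, which finds local maxima of the nDSM in a sliding window and draws a circular buffer around each one, with the radius proportional to the treetop height. All parameter values, in particular `window` and `radius_per_metre` (crown radius in metres per metre of tree height), are hypothetical placeholders that would need the calibration mentioned above. Flat plateaus above the height threshold would produce several neighbouring treetops in this simple version.

```python
import numpy as np

def treetop_buffers(ndsm, min_height=2.0, window=5, radius_per_metre=0.15, gsd_m=0.2):
    """Boolean spruce mask built from height-scaled discs around nDSM local maxima."""
    h, w = ndsm.shape
    half = window // 2
    padded = np.pad(ndsm, half, mode="constant", constant_values=-np.inf)
    # Sliding-window maximum built from shifted views of the padded raster.
    local_max = np.full(ndsm.shape, -np.inf)
    for dy in range(window):
        for dx in range(window):
            local_max = np.maximum(local_max, padded[dy:dy + h, dx:dx + w])
    # A treetop equals the maximum of its window and exceeds the height threshold.
    tops = np.argwhere((ndsm == local_max) & (ndsm > min_height))
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.zeros(ndsm.shape, dtype=bool)
    for y, x in tops:
        radius_px = radius_per_metre * ndsm[y, x] / gsd_m
        mask |= (yy - y) ** 2 + (xx - x) ** 2 <= radius_px ** 2
    return mask

# Toy nDSM: flat ground with a single 9 m treetop in the centre.
ndsm = np.zeros((21, 21))
ndsm[10, 10] = 9.0
crown = treetop_buffers(ndsm)
```

For the 9 m treetop in this example the buffer radius is about 6.75 pixels, or 1.35 m on the ground.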
Our goal was to assess Convolutional Encoder-Decoders for weakly supervised learning, and the results show that these networks can overcome the noise present in the training imagery. However, we believe that a production-focused solution should also involve data augmentation, transfer learning or a small amount of precisely labelled training data. In particular, data augmentation through patch rotation or the introduction of random noise can be easily implemented, while typically improving the generalization capabilities of models. Transfer learning with remote sensing data is often challenging due to the high dimensionality of the data, but Audebert et al. (2018) have presented a potential avenue for transfer learning in the context of our data. We could essentially run both our three-band datasets through separate pretrained encoders and then combine both signals for the decoder part of our network.
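The two augmentation techniques mentioned above, patch rotation and random noise, can be sketched as follows. The sketch assumes 8-bit tiles with layout (bands, height, width), rotations in multiples of 90° and Gaussian noise; the function name and all parameter values are hypothetical. Labels are rotated together with the image so that both stay aligned.

```python
import numpy as np

def augment(tile, labels, k, noise_sigma, rng):
    """Rotate a tile and its labels by k * 90 degrees and add Gaussian noise.

    tile: (bands, H, W) uint8 image; labels: (H, W) class ids.
    """
    tile_rot = np.rot90(tile, k=k, axes=(1, 2))
    labels_rot = np.rot90(labels, k=k, axes=(0, 1))
    noise = rng.normal(0.0, noise_sigma, size=tile_rot.shape)
    # Noise is added to the image only and the result is clipped to the 8-bit range.
    noisy = np.clip(tile_rot.astype(np.float64) + noise, 0, 255).round().astype(np.uint8)
    return noisy, labels_rot.copy()

rng = np.random.default_rng(0)
tile = np.arange(6 * 4 * 4, dtype=np.uint8).reshape(6, 4, 4)
labels = np.arange(16).reshape(4, 4)
aug_tile, aug_labels = augment(tile, labels, k=1, noise_sigma=2.0, rng=rng)
```

Rotation also changes the direction of tree shadows, so its benefit for this particular task would have to be verified experimentally.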
Besides the already mentioned directions for future research in the context of this task, we aim to adapt weakly supervised learning in the same area to data captured in different time periods and using different sensors. For example, a land cover map created based on 2012 imagery may be used as training data for models trained on 2001 imagery to assess historical changes in the area. To achieve satisfactory results, a model must consider specifics related to tree growth and the quality of available aerial imagery.
In summary, the weak labels proved sufficient to teach useful classification features to standard Encoder-Decoder networks, but the labels contain a systematic error which ultimately limited the performance of the model. Nevertheless, the limitations of our results may be overcome by creating training data in a different manner and by improving the training process with data augmentation or transfer learning.

CONCLUSIONS
Weakly supervised learning proved viable on our task of classifying land cover in the treeline ecotone using labels based on an ancillary nDSM. The networks could overcome noise present in the training data and learn better representations than the labels. Our results still reflect systematic errors introduced when creating training data, but we show that standard semantic segmentation networks may be sufficient for weakly supervised learning for local-scale land cover mapping.