GEOGRAPHICAL TRANSFERABILITY OF LULC IMAGE-BASED SEGMENTATION MODELS USING TRAINING DATA AUTOMATICALLY GENERATED FROM OPENSTREETMAP – CASE STUDY IN PORTUGAL

Synoptic remote sensing systems have been broadly used within supervised classification methods to map land use and land cover (LULC). Such methods rely on high-quality sets of training data that are able to characterize the target classes. Often, training data is manually generated, either by field campaigns and/or by photointerpretation of ancillary remote sensing imagery. Several authors have already proposed methodologies to ease the labour-intensive task of generating training data. One of the preferred datasets used as input training data is OpenStreetMap (OSM), which aims at creating a publicly available vector map of the world with the input of volunteers. However, OSM data is spatially heterogeneous (e.g., capital cities and highly populated areas often have high degrees of completion while unpopulated regions often have a lower degree of completion), and there are still large areas without OSM coverage. In this paper we present a set of experiments that aim at assessing the geographical transferability of satellite image-based segmentation models trained with OSM-derived data. To this end, we chose two locations with different OSM coverage and disparate landscapes (metropolitan region vs natural park region, in different landscape units), and assessed how these models behave when trained in one region and applied in the other. The results show that the mapping of some classes is improved when considering a model trained in a different location.


INTRODUCTION
Land use land cover (LULC) information is central to several fields such as climate change monitoring (Li et al., 2017), national to urban planning (Schneider, 2012), policy (European Commission and Directorate-General for Communication, 2020), among others (Rajib and Merwade, 2017). Satellite systems have been widely used to map such LULC classes, given their revisit capabilities and world coverage (Abdi, 2020; Clark, 2017; Friedl et al., 2002). Such LULC mapping is often based on supervised classification approaches, relying on a set of quality training data that is able to characterize the several target classes. However, the creation of a training dataset is costly and time consuming, given that it is usually performed through visual interpretation of higher resolution imagery (e.g., commercial satellite imagery) and/or field surveys. To avoid this laborious and costly task, and aiming at automating the generation of LULC training data, several researchers have been assessing the use of volunteered geographical information such as the data coming from the OpenStreetMap (OSM) project. This collaborative project relies on the contributions of volunteers to generate an open vector map of the world, which is publicly available online at the OSM website. Several researchers have been extracting LULC maps directly from OSM (Fonte et al., 2017; Jokar Arsanjani et al., 2013; Jokar Arsanjani and Vaz, 2015; Patriarca et al., 2019). These methods focus on automating the conversion from OSM data to a LULC map and on harmonizing the class nomenclature between maps. However, given the spatial heterogeneity of OSM coverage (dependent on user contributions), attention has been given to using such information as training data within supervised classification approaches (Chen et al., 2021; Fonte et al., 2020; Haufel et al., 2018; Schultz et al., 2017). In this way, areas without OSM coverage could be mapped.
Several problems were reported in these studies, such as missing data for some classes (Schultz et al., 2017), positional and thematic errors (Johnson and Iizuka, 2016; Schultz et al., 2017) and class imbalance problems (Fonte et al., 2020; Johnson and Iizuka, 2016). In a recent work, Fonte et al. (2020) mapped eight LULC classes using training data automatically generated from OSM. The authors tested their approach considering two different study areas and several filtering approaches aiming at improving the raw OSM data. Three different datasets, with different filtering approaches, were tested against an official Portuguese thematic map. The authors found that the thematic quality improved when using band ratios (NDVI, NDWI and NDBI) to filter the raw OSM data. In most studies the authors separately assess the classifier's behaviour in regions of their study area with no OSM coverage (Fonte et al., 2020; Schultz et al., 2017). This allows for a better understanding of how representative the OSM training data is of the target classes over a region within the study area but without OSM coverage. However, such geographical transferability is assessed over contiguous parts of the same study area, which are therefore similar to the areas with OSM coverage. In such cases there is no knowledge regarding the behaviour of the model when applied in a region with a different landscape, urbanisation density, terrain morphology or forest and vegetation types. This is relevant given the current spatial (and thematic) heterogeneity of OSM contributions, where metropolitan areas and locations of interest (e.g., touristic places) often have higher concentrations of contributions. If models transfer well, they could be trained with OSM data in a region with a higher degree of OSM coverage and then be deployed in areas with little or no OSM data, regardless of the landscape, urbanisation degree and other possible differences between regions such as terrain morphology.
In this paper, the geographical transferability of such OSM image-based segmentation models is assessed between two different regions: a metropolitan area with a dense built-up environment and a rural area with only one major town, mountains and forest. To this end, training data was generated for the two study areas using OSM data; then, for each region, a model was trained with the corresponding data. These models were then applied to both regions to assess their geographical transferability when compared with the classical approach of using a set of location-specific training data. Section 2 presents the study area and the data used, including the satellite images and the training data. Section 3 presents the experiments, while sections 4, 5 and 6 present the results, discussion and conclusions, respectively.

STUDY AREA AND TRAINING DATASET GENERATION
To assess the geographical transferability of OSM-based image segmentation models, two regions located in continental Portugal were chosen. Despite being a small country, Portugal presents a diverse landscape due to both its geographical situation and a long record of human intervention (Moreira, 2004). With this in mind, the selected regions are in different landscape units, and belong to the Lisbon Metropolitan Area (LXMA) and the Serra da Estrela National Park (SENP). See Figure 1 for an overview of the location of both study areas within Portugal. LXMA presents a high degree of urbanisation, a high population density and a drier, more Mediterranean climate when compared with the SENP region. The SENP region is less urbanised, has a low population density and a less dry climate, and is more mountainous. The experiments were based on two main sets of data: three Sentinel-2 images for each of the study areas, all from 2018 (see Table 1); and the OSM data used as training mask, whose generation is explained in sub-section 2.1. The year 2018 was chosen because there is an official LULC map produced by the Portuguese authorities for that year, COS2018, which stands for "Carta Ocupação e Uso do Solo para Portugal, 2018" (DGT, 2019). Hence, this dataset can be used as reference data for the experiments. This thematic map is produced mainly by visual interpretation of orthophotos with RGB and near infrared bands and a spatial resolution of 25 cm. The overall accuracy of this map is still under assessment; however, the technical specifications require the overall accuracy to be higher than 85%. COS2018 maps LULC with a hierarchical approach, with classes ranging from level 1 to level 4, level 4 being the one with the most thematic detail. Examples of COS2018 can be seen in the top image of Figure 6 and Figure 7.

Training data generation
The first step in generating the training data to be used alongside the Sentinel-2 images for training the models is to convert the OSM data to the class nomenclature of this study. This was performed using the OSM2LULC open-source software package (Patriarca, 2020; Patriarca et al., 2019). OSM2LULC is composed of several modules which automatically convert OSM data to LULC maps. This conversion can be made to the Urban Atlas, Corine Land Cover Level 2 or GlobeLand30 nomenclature. In this study, OSM2LULC was adapted to the COS2018 nomenclature, as shown in Table 2. The general idea regarding the nomenclature was to have classes that are more related to land cover than to land use (e.g., include the urban green spaces in the herbaceous vegetation class instead of the urban level 1 class). The map obtained with OSM2LULC was then filtered considering several band ratios, such as the normalized difference vegetation index (NDVI), the normalized difference water index (NDWI) and the normalized difference built-up index (NDBI). For example, for a pixel to be considered artificial surfaces, not only must that class come from OSM2LULC, but its NDVI and NDBI must also have values compatible with that class. The radiometric index threshold values used to confirm the assignment of pixels to the several classes are shown in Table 3. Figure 2 shows a detail of the resulting map for the LXMA study area. More details about the generation of the training dataset are available in Fonte et al. (2020), in which the authors also compare such filtering of the OSM2LULC-derived maps with other filtering techniques. For this study we chose the best performing approach presented in that study, referred to as TD_2. Table 4 presents the number of pixels per class per region. It also presents, within brackets, the class coverage as a percentage ([0-100]) of the whole study area. Overall, the training data corresponds to 46% and 55% of the study areas for the LXMA and SENP regions, respectively.
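The index-based confirmation step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study or in OSM2LULC: the threshold values `ndvi_max` and `ndbi_min`, the `nodata` code and the function names are hypothetical placeholders (the actual thresholds are those of Table 3), and the sketch assumes NDVI is computed from the near infrared and red bands and NDBI from a short-wave infrared and the near infrared band.

```python
import numpy as np

def normalized_difference(a, b):
    """Return (a - b) / (a + b), set to 0 where the denominator is 0."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    denom = a + b
    out = np.zeros_like(denom)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

def filter_artificial(labels, red, nir, swir,
                      ndvi_max=0.3, ndbi_min=0.0,  # hypothetical thresholds
                      artificial=1, nodata=0):     # nodata code is an assumption
    """Keep 'artificial surfaces' labels only where NDVI and NDBI agree."""
    ndvi = normalized_difference(nir, red)    # high over vegetation
    ndbi = normalized_difference(swir, nir)   # high over built-up surfaces
    compatible = (ndvi <= ndvi_max) & (ndbi >= ndbi_min)
    filtered = labels.copy()
    # Pixels labelled artificial by OSM but not confirmed by the indices
    # are removed from the training mask; other classes are left untouched.
    filtered[(labels == artificial) & ~compatible] = nodata
    return filtered
```

The same pattern would extend to the other classes by pairing each class code with its own index conditions, for instance NDWI for water.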
For many classes, the training data covers less than 5% of the corresponding study area. For LXMA, only classes 1, 3 and 8 have training data covering more than 5% of the area. In SENP, coverage below 5% occurs for classes 1, 3, 6, 7 (not present in this region) and 8.
Figure 3 shows an overview of the steps followed in the experiments for each region. The three Sentinel-2 images and the training masks coming from the filtered OSM data were used as training data for each of the regions. A segmentation model based on convolutional neural networks (see sub-section 3.1) was used to learn the feature representation of the 8 classes present in the OSM data from the Sentinel-2 images of both study areas (LXMA and SENP). For each region, three Sentinel-2 images (considering only the four 10 m resolution bands, B2-4 and B8) were concatenated along the channel axis (12-channel image). The next sub-section details the segmentation model used for the experiments.

Network definition and experiments
Regarding the network, the objective was to use a well-established image segmentation model and apply it for the segmentation purposes of this study. Hence, an adaptation of the densenet121 (Huang et al., 2017) convolutional neural network (CNN) for image segmentation was implemented. Densenet relies on the connection of a given layer with all preceding ones in a feed-forward fashion. This is achieved by stacking consecutive so-called dense blocks. Each dense block is composed of a set of two convolutions with different kernel sizes (k=1 and k=3), as seen in Figure 4. The k=1 convolution aims at reducing the number of filters before the computationally expensive convolution with a kernel size of 3 (k=3) (Szegedy et al., 2015). The transition blocks (Figure 4) are used to downscale the increasingly complex feature maps throughout the network with average pooling. The original densenet was proposed for image classification; hence, it was adapted for the current study to perform image segmentation by introducing a decoder part on top of the original convolutional part of densenet, in a similar way as in Ronneberger et al. (2015). The decoder part consists of a stack of decoder blocks (right side of Figure 5), where each decoder block upscales the feature map and concatenates it with the feature maps coming from the corresponding encoder stage (left side of Figure 5) (Ronneberger et al., 2015). Figure 5 shows how the blocks in Figure 4 are used together to form the final network.
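Because every layer receives the concatenation of all preceding feature maps, the number of channels grows by a fixed amount per layer inside a dense block and is then compressed by the transition block. The sketch below tracks only this channel bookkeeping. It assumes the standard DenseNet-121 settings of Huang et al. (2017): blocks of 6, 12, 24 and 16 layers, a growth rate of 32, 64 channels after the stem and a compression factor of 0.5. The adaptation used in this study may differ from these values, and the function name is illustrative.

```python
def densenet_channels(block_layers=(6, 12, 24, 16), growth_rate=32,
                      stem_channels=64, compression=0.5):
    """Channel count at the output of each dense block (assumed settings)."""
    channels = stem_channels
    outputs = []
    for i, layers in enumerate(block_layers):
        # Each layer appends `growth_rate` new feature maps to its input.
        channels += layers * growth_rate
        outputs.append(channels)
        # A transition block follows every dense block except the last one
        # and reduces the channel count before pooling.
        if i < len(block_layers) - 1:
            channels = int(channels * compression)
    return outputs

print(densenet_channels())  # [256, 512, 1024, 1024]
```

In a U-Net-style decoder, these per-block outputs are the natural candidates for the encoder feature maps that each decoder block concatenates.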

Figure 4. Dense, transitional and decoder blocks used to build the network in Figure 5. The indicated convolutions are always followed by batch normalization and ReLU. >>> indicates input to the concatenation in the decoder block. k, kernel size.

Figure 5. Network architecture to perform image segmentation. Dense block as DB, transitional block as TB and decoder block as DcB. Within brackets, the feature map size of a given block. The indicated convolutions are always followed by batch normalization and ReLU. >>> indicates input to the DcB.
The images and the training data were divided into patches of 128x128 pixels to be fed to the previously defined network. In the end, the input to the network was a 12-channel image and a single-channel label image in which each pixel contains the class assigned by the generated training data. The images were normalized before being fed to the network. Data augmentation was applied using random crops, rotations and zoom.
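The preparation of the network inputs can be sketched as below. The text does not specify the normalization or the tiling scheme, so this example assumes per-channel standardization and non-overlapping tiles with incomplete border tiles discarded. The function names are illustrative, and the augmentation step is left out.

```python
import numpy as np

PATCH = 128  # patch size stated in the text

def stack_dates(images):
    """Concatenate three (H, W, 4) images into one (H, W, 12) array."""
    return np.concatenate(images, axis=-1)

def standardize(image):
    """Per-channel zero mean and unit variance (assumed normalization)."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    std = image.std(axis=(0, 1), keepdims=True)
    return (image - mean) / np.where(std == 0, 1.0, std)

def tile(image, mask, size=PATCH):
    """Split an image and its label mask into aligned size x size patches."""
    rows = image.shape[0] // size
    cols = image.shape[1] // size
    pairs = []
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * size, (r + 1) * size),
                      slice(c * size, (c + 1) * size))
            # The same window is applied to image and mask to keep them aligned.
            pairs.append((image[window], mask[window]))
    return pairs
```

Each returned pair holds a (128, 128, 12) image patch and the matching (128, 128) label patch.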
Only a few hyperparameters were tested: optimizer, learning rate, batch size and number of epochs. These hyperparameters were found through a 3-fold cross-validation approach in which the data for a given study area was divided into training, testing and validation sets. Stochastic gradient descent as optimizer and a learning rate of 0.1 were found to be the overall best performing. The optimal batch size was 8. These hyperparameters were then used to train both models (LXMA and SENP) using the whole dataset of each region. Each model was applied to both the LXMA and SENP image data, generating 4 different maps. These maps were then compared with the COS2018-derived map, which was generated following the nomenclature in Table 2. Confusion matrices were derived for each experiment, from which the overall accuracy and the user's and producer's accuracy are reported for each model applied to both regions.
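For reference, the accuracy measures derived from a confusion matrix can be computed as in the following sketch. It assumes class codes starting at 0 and a matrix with map classes in rows and reference classes in columns. Both conventions are choices made for this illustration and are not taken from the study.

```python
import numpy as np

def confusion_matrix(reference, predicted, n_classes):
    """Rows: predicted (map) class; columns: reference class."""
    ref = np.asarray(reference).ravel()
    pred = np.asarray(predicted).ravel()
    flat = pred * n_classes + ref
    counts = np.bincount(flat, minlength=n_classes ** 2)
    return counts.reshape(n_classes, n_classes)

def accuracies(cm):
    """Return overall accuracy, user's accuracy and producer's accuracy."""
    cm = cm.astype(np.float64)
    diag = np.diag(cm)
    overall = diag.sum() / cm.sum()
    # Classes with no pixels yield NaN instead of raising a warning.
    with np.errstate(divide="ignore", invalid="ignore"):
        users = diag / cm.sum(axis=1)      # share of mapped pixels that are correct
        producers = diag / cm.sum(axis=0)  # share of reference pixels that are found
    return overall, users, producers
```

A low user's accuracy for a class therefore indicates commission errors (the class is over-mapped), while a low producer's accuracy indicates omission errors (reference pixels of the class are missed).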

RESULTS
This section presents the overall accuracy of the four experiments, the user's and producer's accuracy (UA and PA) and details of the generated maps. Table 5 shows the overall accuracies, where each of the resulting maps was compared with the COS2018-derived map. The overall accuracy of the model trained with LXMA and tested in SENP is much lower, while its counterpart, trained on SENP and tested on LXMA, had higher accuracy. Both regions present better results when trained with their respective data.
Table 5. Overall accuracies for the experiments, considering each of the combinations of training and testing datasets. COS2018 used as reference data.
Table 6 and Table 7 show both the UA and PA for each of the testing sites, LXMA and SENP, respectively, considering each of the models trained in each of the regions. The tables also present the difference between these models, where negative values indicate that the model trained in a different region performed better. Focusing on LXMA as testing area (Table 6), the model trained on LXMA provides overall better results for classes 5 to 8, where both UA and PA are higher for that model; the only exception is the UA for class 8, which is 3% lower. However, this is not so clear for classes 1 to 4. Here, the model trained in SENP had higher UA for classes 1 and 3, and higher PA for class 2. For class 4, the SENP model even outperformed the one trained in LXMA in both UA and PA.

Overall accuracies
For the SENP testing case (Table 7), classes 4 to 6 were better mapped with the SENP model, while class 3 was better mapped with the LXMA model, even if the difference between the UA values is small (only 1%). The LXMA model had higher PA for classes 1 and 8, while having lower UA for these same classes.
Table 7. User's and producer's accuracy for each of the classes, considering each of the training datasets, LXMA and SENP; tested only on the SENP dataset.
The UA and PA results are also reflected in Figure 6 and Figure 7. For example, in Figure 6, when testing in the LXMA region, the model trained with SENP data is able to map the artificial surfaces class but struggles to differentiate agricultural areas from open spaces with little or no vegetation and from herbaceous vegetation. In Figure 7, when testing in SENP, the map produced by the model trained with LXMA data reflects its poor accuracy for that region: the model classified most of the region as artificial surfaces or forest areas.

DISCUSSION
In this study we assessed the geographical transferability of OSM-derived image segmentation models. Two distinct regions were chosen given their disparate landscape composition, urbanisation, terrain morphology and population density. Four main experiments were presented, two for each region (SENP and LXMA), in which the model trained with the data from a given region is also applied to the other one. The overall accuracy as well as the user's and producer's accuracy of the experiments were presented, having as reference "Carta Ocupação e Uso do Solo para Portugal, 2018" (DGT, 2019), an official Portuguese LULC map. Considering overall accuracy, the model trained with SENP data performed better, mainly given its overall accuracy when applied to the LXMA region.
Regarding the per-class assessment for the LXMA region, the 'Forest areas' class (class 4) was better mapped by the model trained with SENP data, with both UA and PA being higher. Likewise, but this time testing in SENP and considering 'Herbaceous vegetation' (class 3), the model trained with LXMA data performed better than the one trained with SENP data. However, the SENP model was already unable to map class 3 in the SENP region, presenting a UA and PA close to zero. Class imbalance could be one of the factors influencing the results (Buda et al., 2018). However, 'Shrublands' (class 5) and 'Artificial surfaces' (class 1) seem to indicate that class imbalance by itself does not explain the differences in the results. For example, 'Artificial surfaces' (class 1) corresponds to 12% of the training area of the LXMA study area, while it only corresponds to 1% of the SENP training area; however, this does not translate into a better capability of the LXMA model to detect class 1 when compared with the SENP model. In contrast, 'Forest areas' (class 4), which has a high amount of training data in the SENP study area, was better detected in the LXMA region by the SENP model. While 'Forest areas' (class 4) seems to have similar feature information regardless of the landscape, this is not the case for class 5 'Shrublands'. Still considering the LXMA region and regarding class 1 'Artificial surfaces', the SENP model, even with a much smaller amount of training data, is able to improve the UA by 12%, although at the expense of a 23% lower PA: fewer regions are wrongly classified as artificial surfaces, but more artificial surfaces are missed. The LXMA model, when applied to SENP, improves the PA while making the UA worse, which in this case shows that fewer artificial surfaces were missing from the map, but more regions were wrongly classified as artificial surfaces.
Hence, it seems that the LXMA model is overestimating class 1 'Artificial surfaces'. Some classes may not be present in the training data. In this study, this can be seen for class 7 'Wetlands', which is not present in the SENP training dataset; consequently, the SENP model is not able to map it when applied to the LXMA region.
Many classes, regardless of the considered region, are not well represented in the training data, as reported in previous studies (Schultz et al., 2017). For example, class 6 'Open spaces with little to no vegetation' is barely present in either of the study areas considered, indicating that regardless of the region this class is not well represented in OSM. An adaptation of a CNN widely used for computer vision tasks was employed, which is certainly not optimal for the task at hand, given that these models are often used in problems with hundreds of classes and millions of training samples. In this case, this might have contributed to overfitting, given the low amount of data available for the experiments. Another limitation concerns the reference data used for the experiments. COS2018 is a LULC map with a minimum mapping unit of 1 ha, whereas the generated maps have a 10 m resolution. While a general view of the quality can be obtained, the assessment was not performed at the resolution of the resulting maps. Moreover, given the land use nature of COS2018, airports, for example, are considered as built-up, when in fact their polygons also contain different land cover classes, such as vegetation.
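To put the scale difference between the reference map and the generated maps in pixel terms, the following arithmetic, which assumes square 10 m pixels, relates the minimum mapping unit (MMU) of COS2018 to the pixel area of the generated maps:

```latex
\[
\frac{\text{MMU}}{\text{pixel area}}
  = \frac{1\ \text{ha}}{10\ \text{m} \times 10\ \text{m}}
  = \frac{10\,000\ \text{m}^2}{100\ \text{m}^2}
  = 100\ \text{pixels}
\]
```

Under this assumption, a feature occupying fewer than about 100 pixels may be correctly mapped at 10 m resolution and still be counted as an error, simply because it falls below the MMU of the reference map.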

CONCLUSIONS
The geographical transferability of image-based segmentation models trained with OSM-derived data was assessed in a study conducted in continental Portugal. To this end, two Portuguese regions with distinct landscape, urbanisation degree and forest and vegetation types were used: one is the metropolitan area of Lisbon, and the other, in the NE part of Portugal, is a mountainous region that is part of the Serra da Estrela National Park. For each region, a CNN segmentation model was trained with OSM data; these models were then used to classify the satellite images of each of the regions. The resulting maps were then compared to an official Portuguese LULC map. Overall, the results show that while some classes might be transferable, this is not true for all the classes. For example, 'Shrublands' (class 5) seems to vary significantly from region to region, given that this class always presented better results when using the model trained for the respective region. On the other hand, classes such as 'Forest areas' (class 4) or 'Herbaceous vegetation' (class 3) seem to be more transferable from one region to the other. While the amount of training data might be one of the explanations for these improvements, the same does not happen for other classes presenting similar discrepancies in the amount of training data, such as 'Artificial surfaces' (class 1). Given the different behaviour of the considered classes regarding geographical transferability, LULC image-based segmentation methods could improve their performance by optimally merging the feature information from several regions into a model that is able to accommodate differences related to a specific location and class. This would not only provide training data for otherwise poorly represented classes, but also move further away from the costly and time-consuming generation of training data for a specific location.