AN INVESTIGATION OF THE INFLUENCES OF TRAINING DATA PROPERTIES ON AERIAL IMAGE SEGMENTATION

Abstract: While numerous studies aim to improve neural network performance for image segmentation, studies on the impact of the training data in terms of data quality, bias and labeling noise are comparatively scarce. When state-of-the-art algorithms are applied to large and heterogeneous datasets, they do not achieve the same results as under optimal and controlled conditions, due to a mismatch between the data used for training and the data to be predicted. This paper presents an approach to show the influence of diverging image properties such as scale, contrast, brightness and saturation between the training data of a model and the data to be predicted. For this purpose, a U-Net is trained to segment buildings in aerial images. We found that while changes in brightness have a strong effect on precision, recall and F-score, a change in saturation has little or even a positive effect on segmentation. In general, however, any difference between training and prediction data has a negative effect on segmentation results.


INTRODUCTION
The increasing overlap between remote sensing, computer vision and computer science led to rapid advancements regarding the analysis of geodetic data. The new possibilities, also due to increasing computer performance, opened up new approaches to geodetic data processing (Maxwell et al., 2018). At the same time, breakthroughs in machine learning led to rapid progress and new developments for remote sensing applications (Sagan et al., 2020). While the initial focus of industry and academia was on object detection using bounding boxes (e.g. with YOLO (Redmon et al., 2016)), further developments in the field of neural networks enabled the segmentation of raster data using Deep Learning (e.g. with U-Net (Ronneberger et al., 2015)). This in turn resulted in different approaches for the segmentation of satellite and aerial imagery (Kim et al., 2019), which made it possible to detect objects instead of bounding boxes.
Many of the aerial segmentation studies conducted since then have been concerned with the neural networks themselves, showing advances in the design of architectures that ultimately lead to better detection and segmentation results on benchmark datasets (Rahman M.A., 2016). Compared to the volume of research on neural networks and their (fine-)tuning, relatively little research is being done on the data used. Mostly, data-related research concerns topics such as class imbalance (Japkowicz and Stephen, 2002) or dataset size (Soekhoe et al., 2016), but rarely the actual properties of the data such as exposure, color shift, or contrast values as they occur in the real world and in real-world applications during prediction. Especially in the field of aerial image segmentation, studies on image properties are still lacking.
This paper presents an approach to show the influences of scale, brightness, contrast and saturation as well as the combination of these factors on the segmentation results when predicting buildings from aerial imagery unknown to the neural network. The goal is to investigate whether these properties affect the segmentation results and if so, to what extent. For this purpose, said combinations are contrasted and compared with the segmentation results of data that has not been altered.

Direct dataset evaluation
As mentioned, much of the research on datasets in the field of aerial image segmentation concerns a) the size of datasets (Sun et al., 2017) and b) the class imbalance of the training data (Li et al., 2021). While, to our knowledge, no publications deal with the direct impact of the image properties mentioned in the introduction, in adjacent areas where training sets are equally important, such as audio and signal processing, the impact of data quality and quantity is discussed when new datasets are introduced (Manilow et al., 2019).

Data augmentation
Data augmentation, in this case image augmentation, is the process of automatically creating additional training data by making minor (or, depending on the use case, moderate) changes to it. Image augmentation algorithms can include flips, rotations, color space augmentations, geometric transformations and many more (Shorten and Khoshgoftaar, 2019). Some publications in the field of satellite and aerial segmentation use augmentations and describe those that have been used, such as rotations and chromatic distortions (Li et al., 2018, Khryashchev et al., 2019). At the same time, information about what classification or segmentation results would look like without augmentations is often missing, since this is not the main focus of the given publications. When augmented and non-augmented segmentation and classification results are compared, they are usually not compared granularly, so the effect of each augmentation method is not always clear (Wu et al., 2019). It must also be mentioned that augmentations are sometimes introduced into the training data only to prevent the neural network from overfitting, i.e. from modeling the training data so closely that performance on new data degrades.
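As an illustration, a minimal geometric augmentation step of the kind described above (flips and 90-degree rotations) could look as follows; this is a generic sketch, not the pipeline of any of the cited publications:

```python
import numpy as np

def augment(tile: np.ndarray) -> list:
    """Return the tile together with its mirrored and 90-degree-rotated
    variants -- simple geometric augmentations that preserve label geometry
    when applied identically to the image and its mask."""
    return [
        tile,
        np.fliplr(tile),    # horizontal flip
        np.flipud(tile),    # vertical flip
        np.rot90(tile, 1),  # 90 degrees
        np.rot90(tile, 2),  # 180 degrees
        np.rot90(tile, 3),  # 270 degrees
    ]

variants = augment(np.zeros((512, 512, 3), dtype=np.uint8))
print(len(variants))  # 6 variants per input tile
```

For square tiles, all six variants keep the original tile shape, so they can be fed to the network without any further resizing.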

Segmentation approach and data
In order to perform a study regarding the effects of different image properties on segmentation, images must first be segmented. After successes in the field of biomedical image processing, the U-Net found great appeal in the field of image segmentation and became a popular choice for segmentation tasks in various fields, including aerial and satellite image segmentation (Zeng et al., 2019, Tang et al., 2019, Darapaneni et al., 2020). Accordingly, for this experiment, a U-Net was used to segment aerial images. The training data has three channels (RGB) and a resolution of 10000 x 10000 pixels with a ground sampling distance of 10 cm. The data was chosen to include both developed urban areas and rural areas. So that the images do not have to be reduced in resolution for training, they are divided into smaller tiles of 512 x 512 pixels. The same applies to the validation and test data. The training labels were automatically generated from outline polygons of buildings of the corresponding areas, so labeling noise cannot be ruled out entirely. However, the labels were examined for the proportion of building pixels with respect to background pixels, so the effects of class imbalance can be mitigated to some degree.
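The tiling step described above can be sketched as follows; dropping the edge remainder (10000 is not a multiple of 512) is our simplification, since the paper does not state how partial tiles were handled:

```python
import numpy as np

def tile_image(image: np.ndarray, tile_size: int = 512) -> list:
    """Split an H x W x C orthophoto into non-overlapping square tiles.
    Pixels in the remainder strip at the right/bottom edge are discarded."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            tiles.append(image[y:y + tile_size, x:x + tile_size])
    return tiles

tiles = tile_image(np.zeros((10000, 10000, 3), dtype=np.uint8))
print(len(tiles))  # 19 x 19 = 361 full tiles per orthophoto
```

The same function would be applied to the label rasters so that each image tile keeps a pixel-aligned binary mask.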
The training was carried out with around 20,000 image tiles. The best resulting model was used to predict the segmentation maps of unprocessed images. The segmentation maps of the unaltered images are later used as a reference for examining the influences of the alteration of image properties on segmentation results. Figure 1 shows an overview of the training and prediction process.

Figure 1. After splitting the initial true orthophoto into tiles, the training images and labels (black and white binary images) are used to train the U-Net. Red arrows show the training path, while green arrows show the path for predicting images unknown to the network. The gray arrow (backpropagation) indicates that a training process is included.

Image Alteration
While designing the concept, a decision had to be made about which changes in image properties to consider. To look for clues about discrepancies in datasets, publicly available data
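As an illustration, the brightness, contrast and saturation alterations considered in this section can be sketched with simple per-pixel operations; the exact operators and factors used in the experiment are not reproduced here, so the following is only an assumed formulation:

```python
import numpy as np

def alter(img: np.ndarray, brightness=1.0, contrast=1.0, saturation=1.0):
    """img: float array in [0, 1] with shape H x W x 3.
    brightness scales all values, contrast scales the distance from the
    mean intensity, saturation blends each pixel with its gray value.
    A factor of 1.0 leaves the respective property unchanged."""
    out = img * brightness
    mean = out.mean()
    out = (out - mean) * contrast + mean       # spread around the mean
    gray = out.mean(axis=2, keepdims=True)     # per-pixel gray value
    out = gray + saturation * (out - gray)     # 0 = grayscale, 1 = original
    return np.clip(out, 0.0, 1.0)

img = np.full((512, 512, 3), 0.5)
dark = alter(img, brightness=0.5)
print(float(dark.max()))  # 0.25
```

Applying such a function with factors above and below 1.0, alone and in combination, yields the altered prediction sets whose results are compared below.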

Evaluation
To ensure comparability between the predicted images, the same metrics were chosen for all datasets where image alterations had been made. It cannot be denied that absolute values are ultimately important for segmentation results. For this experiment, however, the comparison of different variations is important. Accordingly, the results are additionally presented as percentages relative to the segmentation results of the original data. Furthermore, precision (equation 1), recall (equation 2) and the F-score (equation 3) are highlighted (Powers, 2007):

Precision = TP / (TP + FP)    (1)
Recall = TP / (TP + FN)    (2)
F-score = 2 · Precision · Recall / (Precision + Recall)    (3)
Here, a True Positive (TP) is an outcome where the model correctly predicts the positive class (the model predicts a building pixel as building) and a True Negative one where it correctly predicts the negative class (a background pixel as background). A False Positive (FP) is an outcome where the model incorrectly predicts the positive class (a background pixel as building) and a False Negative (FN) one where the model incorrectly predicts the negative class (a building pixel as background).
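Based on these definitions, the three metrics can be computed directly from binary prediction and ground-truth masks, for example:

```python
import numpy as np

def precision_recall_f(pred: np.ndarray, truth: np.ndarray):
    """pred, truth: boolean masks where True marks a building pixel."""
    tp = int(np.sum(pred & truth))    # building predicted as building
    fp = int(np.sum(pred & ~truth))   # background predicted as building
    fn = int(np.sum(~pred & truth))   # building predicted as background
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

truth = np.array([[1, 1], [0, 0]], dtype=bool)
pred = np.array([[1, 0], [1, 0]], dtype=bool)
print(precision_recall_f(pred, truth))  # (0.5, 0.5, 0.5)
```

In the toy example, one building pixel is found (TP = 1), one is missed (FN = 1) and one background pixel is falsely detected (FP = 1), so all three metrics equal 0.5.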

RESULTS AND DISCUSSION
The segmentation maps and results generated with the original, unaltered data serve as a reference for the segmentation maps and results generated with the altered data. Table 1 shows the results of the evaluation of the unaltered image data, whereas tables 2, 3 and 4 show the recall, precision and F-score of the evaluation of the altered data, respectively. Table 2 shows the percentage difference of the recall values in comparison to the recall values of the original data. By observing the recall, we can look at the proportion of actual positives that have been identified correctly. According to our results, predicting imagery with lower brightness than our training data, especially in combination with lowered contrast or saturation, will lead to much worse segmentation results (-30 %). The numbers show that the recall suffers in particular when brightness is reduced. With an increased brightness, recall values are up to 11 % better than those of the unaltered image data. We assume that especially when brightness and contrast are reduced, not enough distinctive features remain that can be picked up for a good segmentation. Figure 3 shows an exemplary segmentation map of unaltered images alongside the segmentation map of the images with low brightness and low contrast as well as the ground truth. Looking at the prediction maps, the values appear plausible.
Table 3 shows the percentage difference of the precision values in comparison to the original data. The precision shows the proportion of positive identifications that are actually correct. The numbers show that increasing the brightness leads in every case to massive drops in precision. Increasing the brightness and simultaneously lowering the contrast leads to a percentage change in precision of almost -80 %, which leads to the assumption that these alterations represent a particularly unfavorable combination for recognizing strong and distinctive features for segmentation purposes. Only images with lower contrast or lower saturation than the unaltered image data have shown variations under 10 % compared to the predictions generated with unaltered images. Table 4 shows the percentage difference of the F-score of the segmentations of the altered images in comparison to the F-score of the segmentation of the unaltered images. The F-score combines the precision and recall of our model and is defined as their harmonic mean. As already seen with precision, the F-score for segmenting the altered image data is lower overall. Especially the images with an increased brightness show F-score deviations of consistently more than 50 %. As with precision, a combination of increased brightness and decreased contrast strongly influences the F-score. At the same time, it can be seen that a reduction in saturation or contrast only leads to a decrease of 3 % and 9 %, respectively. In addition to contrast, brightness and saturation, two crop factors were examined: images cropped to 50 % and 90 % of their original extent, respectively, and resized to the original size were predicted using the model trained on the unaltered data. Table 5 shows the results, i.e. the deviations of recall, precision and F-score compared to the results of the unaltered data.
The numbers show that, due to the changed ratio between foreground and background pixels, strong losses in precision, recall and F-score occur when the image data is cropped to 50 % and scaled back to the original resolution. We assume that the shifted ratio between foreground and background pixels leads to this drop. At the same time, we can see that the weaker magnification of the 90 % crop leads to better results compared to the 50 % crop. The reason is the same: in this case the ratio between background and foreground pixels does not change dramatically.
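The crop experiment can be sketched as a center crop followed by resampling to the original size; nearest-neighbour sampling is our simplification here, as the paper does not state which interpolation method was used:

```python
import numpy as np

def center_crop_zoom(img: np.ndarray, keep: float) -> np.ndarray:
    """Center-crop the image to `keep` of each side length (e.g. 0.5 or
    0.9), then resample back to the original size with nearest-neighbour
    indexing, effectively magnifying the image center."""
    h, w = img.shape[:2]
    ch, cw = int(h * keep), int(w * keep)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = img[y0:y0 + ch, x0:x0 + cw]
    ys = np.arange(h) * ch // h   # map output rows to crop rows
    xs = np.arange(w) * cw // w   # map output columns to crop columns
    return crop[ys][:, xs]

img = np.arange(100, dtype=np.uint8).reshape(10, 10)
zoomed = center_crop_zoom(img, 0.5)
print(zoomed.shape)  # (10, 10)
```

With `keep=0.5` the ground sampling distance is effectively doubled, which changes the apparent building size and thus the foreground/background pixel ratio discussed above.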

            50 % CROP    90 % CROP
Precision   -74 %        -29 %
Recall      -46 %        +4 %
F-score     -65 %        -16 %

Table 5. Precision, recall and F-score for the predictions conducted on images cropped to 50 % and 90 % of their initial resolution.

The entire experiment, beginning with the training process, was conducted five times to check for reproducibility. The deviations between the values for accuracy, precision, recall and F-score were less than 11 % on average between experiments.

CONCLUSION
This paper presented an approach to determine the influence of image properties on segmentation results. For this purpose, a U-Net was trained with unaltered image data, and predictions were made on unknown images with and without altered image properties such as contrast, saturation and brightness. The quality parameters generated from the predictions of unaltered images, namely recall, precision and the F-score, served as a reference for the corresponding parameters of segmentations of images with altered properties. It was found that increased brightness in particular had a strong negative effect on precision and F-score. At the same time, it could be observed that, for this specific approach, desaturated images either show no large effect or, in the case of recall, even a beneficial effect. In general, however, a discrepancy between the training data and the data to be predicted has a negative effect on the quality of the segmentation.

FUTURE WORK
The results shown in this paper were obtained with a single dataset and a single U-Net implementation. In order to make more robust statements about the effects of changes in the image properties of training data, or of the divergence of training and prediction data, further investigations have to be performed on different implementations and datasets. In our opinion, TernausNetV2 (Iglovikov et al., 2018) and other implementations for the segmentation of aerial image data, such as SegNet (Badrinarayanan et al., 2017), represent interesting networks for similar tests. Another approach would be to derive properties from the difference between the training data and the data to be predicted, so that corresponding changes can be applied to the latter. Overall, the question of how models trained on data with slight variations from the prediction data can best be used remains open.
Figure 4. Exemplary depiction of segmentation maps when using altered images alongside ground truth and orthophoto