RESIDUAL SHUFFLING CONVOLUTIONAL NEURAL NETWORKS FOR DEEP SEMANTIC IMAGE SEGMENTATION USING MULTI-MODAL DATA

In this paper, we address the deep semantic segmentation of aerial imagery based on multi-modal data. Given multi-modal data composed of true orthophotos and the corresponding Digital Surface Models (DSMs), we extract a variety of hand-crafted radiometric and geometric features which are provided separately and in different combinations as input to a modern deep learning framework. The latter is represented by a Residual Shuffling Convolutional Neural Network (RSCNN) combining the characteristics of a Residual Network with the advantages of atrous convolution and a shuffling operator to achieve a dense semantic labeling. Via performance evaluation on a benchmark dataset, we analyze the value of different feature sets for the semantic segmentation task. The derived results reveal that the use of radiometric features yields better classification results than the use of geometric features for the considered dataset. Furthermore, the consideration of data on both modalities leads to an improvement of the classification results. However, the derived results also indicate that the use of all defined features is less favorable than the use of selected features. Consequently, data representations derived via feature extraction and feature selection techniques still provide a gain if used as the basis for deep semantic segmentation.


INTRODUCTION
The semantic segmentation of aerial imagery in terms of assigning a semantic label to each pixel and thereby providing meaningful segments has been addressed in the scope of many recent investigations and applications. In this regard, much effort has been spent on the ISPRS Test Project on Urban Classification, 3D Building Reconstruction and Semantic Labeling, where one objective is given by a 2D semantic labeling of aerial imagery based on given multi-modal data in the form of true orthophotos and the corresponding Digital Surface Models (DSMs) (Rottensteiner et al., 2012;Cramer, 2010;Gerke, 2014) as shown in Figure 1. While the radiometric information preserved in an orthophoto can already be sufficient to distinguish specific classes, the geometric information preserved in the corresponding DSM might facilitate the separation of further classes, as each modality provides information about different aspects of the environment.
Generally, the semantic segmentation of aerial imagery based on true orthophotos and the corresponding DSMs can be achieved via the extraction of hand-crafted features and the use of standard classifiers such as Random Forests (Gerke and Xiao, 2014;Weinmann and Weinmann, 2018) or Conditional Random Fields (CRFs) (Gerke, 2014). Nowadays, however, many investigations rely on the use of modern deep learning techniques which tend to significantly improve the classification results (Sherrah, 2016;Liu et al., 2017;Audebert et al., 2016;Audebert et al., 2017;Chen et al., 2018;Marmanis et al., 2016;Marmanis et al., 2018).

Figure 1. The challenge: given data in the form of a true orthophoto (left) and the corresponding DSM (center), a labeling close to the reference labeling (right) should be achieved, where the classes are given by Impervious Surfaces (white), Building (blue), Low Vegetation (cyan), Tree (green) and Car (yellow).
Some of these approaches also focus on using hand-crafted features derived from the true orthophotos or from their corresponding DSMs in addition to the given data as input to a deep learning technique. In this regard, the Normalized Difference Vegetation Index (NDVI) and the normalized Digital Surface Model (nDSM) are commonly used (Gerke, 2014;Audebert et al., 2016;Liu et al., 2017). Other kinds of hand-crafted features have however only rarely been involved so far although they might introduce valuable information for the semantic labeling task.
In this paper, we focus on the deep semantic segmentation of aerial imagery based on multi-modal data. We extract a diversity of hand-crafted features from both the true orthophotos and their corresponding DSMs. Based on a separate and combined consideration of these radiometric and geometric features, we perform a supervised classification involving modern deep learning techniques. As standard deep networks (Krizhevsky et al., 2012;Simonyan and Zisserman, 2014) are composed of many layers to learn complex non-linear relationships, such networks tend to suffer from the vanishing gradient problem if they are very deep, i.e. the gradients backpropagated through the layers become very small so that the weights in early layers of the network are hardly changed. This, in turn, causes a decrease in the predictive accuracy of the network and can be resolved by using a Residual Network (ResNet) (He et al., 2016a). Relying on the ResNet architecture originally intended to classify image patches, we present a modified ResNet architecture that allows a dense semantic image labeling. More specifically, we make use of the ResNet-34 architecture and introduce both atrous convolution and a shuffling operator to achieve a semantic labeling for each pixel of the input imagery. We denote the resulting deep network as Residual Shuffling Convolutional Neural Network (RSCNN). Via performance evaluation on a benchmark dataset, we quantify the effect of considering the different modalities separately and in combination as input to the RSCNN. Thereby, we observe that the additional extraction of different types of geometric features based on the DSM and the definition of corresponding feature maps for the RSCNN leads to an improvement of the classification results, and that the best classification results are achieved when using selected feature maps and not when using all defined feature maps.
After briefly summarizing related work (Section 2), we explain the proposed methodology for the deep semantic segmentation of aerial imagery based on multi-modal data (Section 3). Subsequently, we demonstrate the performance of our methodology by presenting and discussing results achieved for a standard benchmark dataset (Sections 4 and 5). Finally, we provide concluding remarks and suggestions for future work (Section 6).

RELATED WORK
For many years, the semantic segmentation of aerial imagery based on multi-modal data has typically been addressed by extracting a set of hand-crafted features (Gerke and Xiao, 2014;Tokarczyk et al., 2015;Weinmann and Weinmann, 2018) and providing them as input to a standard classifier such as a Random Forest (Weinmann and Weinmann, 2018) or a Conditional Random Field (CRF) (Gerke, 2014). Due to the great success of modern deep learning techniques in the form of Convolutional Neural Networks (CNNs), however, many investigations nowadays focus on the use of such techniques for semantically segmenting aerial imagery as they tend to significantly improve the classification results.
Regarding semantic image segmentation, the most popular deep learning techniques are represented by Fully Convolutional Networks (FCNs) (Long et al., 2015;Sherrah, 2016) and encoder-decoder architectures (Volpi and Tuia, 2017;Badrinarayanan et al., 2017). The latter are composed of an encoder part which serves for the extraction of multi-scale features and a decoder part which serves for the recovery of object details and the spatial dimension and thus addresses a more accurate boundary localization. A commonly used encoder-decoder structure has been proposed with SegNet (Badrinarayanan et al., 2017). To aggregate multi-scale predictions, a modification of the SegNet introduces a multi-kernel convolutional layer to perform convolutions with several filter sizes (Audebert et al., 2016). Further developments have been presented with the DeepLab framework (Chen et al., 2016) including i) atrous convolution by introducing upsampled filters to incorporate context within a larger field-of-view without increasing the computational burden, ii) atrous spatial pyramid pooling to allow for robustly segmenting objects at multiple scales and iii) a combination of the responses at the final layer with a fully-connected CRF to improve localization accuracy. Instead of using a deconvolution to recover the spatial resolution, an efficient sub-pixel convolution layer involving a periodic shuffling operator has been proposed to upscale feature maps (Shi et al., 2016;Chen et al., 2018).
Specifically addressing semantic segmentation based on multi-modal data in the form of orthophotos and the corresponding DSMs, different strategies to fuse the multi-modal geospatial data within such a deep learning framework have been presented (Marmanis et al., 2016;Audebert et al., 2016;Audebert et al., 2017;Liu et al., 2017). The consideration of semantically meaningful boundaries in the SegNet encoder-decoder architecture and also in FCN-type models has been addressed by including an explicit object boundary detector to better retain the boundaries between objects in the classification results (Marmanis et al., 2018). As an alternative to involving a boundary detector, it has been proposed to discard fully-connected layers (which reduce localization accuracy at object boundaries) and to additionally avoid the use of unpooling layers (which are more complicated and e.g. used in SegNet) (Chen et al., 2017).
While many investigations have focused on improving the classification pipeline, only little attention has been paid to the input data itself. On the one hand, a true end-to-end processing pipeline from orthophotos and the corresponding DSMs to a semantic labeling (Marmanis et al., 2016) seems desirable. On the other hand, however, it remains unclear to which degree the use of hand-crafted features derived from the orthophotos or their corresponding DSMs in the form of additional feature maps serving as input to the deep network can still affect the quality of the semantic labeling. Such hand-crafted features have already been involved with the Normalized Difference Vegetation Index (NDVI) and the normalized Digital Surface Model (nDSM) (Gerke, 2014;Audebert et al., 2016;Liu et al., 2017), yet other kinds of hand-crafted radiometric or geometric features which can be extracted from a local image neighborhood (Gerke and Xiao, 2014;Tokarczyk et al., 2015;Weinmann and Weinmann, 2018) have only rarely been involved so far although they might introduce valuable information for the semantic labeling task.
In this paper, we investigate the value of different types of hand-crafted features for the semantic segmentation of aerial imagery based on multi-modal data. We extract a variety of hand-crafted features from both the true orthophotos and their corresponding DSMs. Thereby, we involve hand-crafted radiometric features such as the NDVI and one of its variants, but also radiometric features derived from transformations in analogy to the definition of color invariants (Gevers and Smeulders, 1999). Furthermore, we involve hand-crafted geometric features in the form of the nDSM (Gerke, 2014) and features extracted from the 3D structure tensor and its eigenvalues. While the analytical consideration of these eigenvalues allows reasoning about specific object structures (Jutzi and Gross, 2009), we use the eigenvalues to define local 3D shape features (West et al., 2004;Demantké et al., 2011;Weinmann et al., 2015;Hackel et al., 2016) which can efficiently be calculated on the basis of local image neighborhoods (Weinmann, 2016;Weinmann and Weinmann, 2018). Based on a separate and combined consideration of these radiometric and geometric features, we perform a supervised classification using a deep network. For the latter, we take into account the potential of the Residual Network (ResNet) (He et al., 2016a) in comparison to standard networks like AlexNet (Krizhevsky et al., 2012) and the VGG networks (Simonyan and Zisserman, 2014), as it allows for a higher predictive accuracy due to its capability of reducing the problem of vanishing gradients. Relying on the ResNet-34 architecture, we introduce both atrous convolution (Chen et al., 2016) and a shuffling operator (Chen et al., 2018) to assign a semantic label to each pixel of the input imagery.

METHODOLOGY
The proposed methodology addresses the semantic interpretation of aerial imagery by exploiting data of several modalities (Section 3.1) which are provided as input to a deep network (Section 3.2). The result is a dense labeling, i.e. each pixel is assigned a respective semantic label.

Feature Extraction
Given a true orthophoto and the corresponding DSM on a regular grid, the information may be stored in the form of a stack of feature maps (i.e. images containing the values of a respective feature on a per-pixel level), whereby three feature maps correspond to the spectral bands used for the orthophoto and one feature map corresponds to the DSM. Further information can easily be taken into account by adding respective feature maps. In total, we define eight radiometric features (Section 3.1.1) and eight geometric features (Section 3.1.2) for the given regular grid. Based on these features, we define corresponding feature maps which serve as input to a CNN.

Radiometric Features
In our work, we assume that the spectral bands used for the orthophoto comprise the near-infrared (NIR), red (R) and green (G) bands (Cramer, 2010;Rottensteiner et al., 2012;Gerke, 2014). Accordingly, we define the reflectance in the near-infrared domain, in the red domain and in the green domain as features denoted by the variables R_NIR, R_R and R_G, respectively. In addition, we consider color invariants as features. In analogy to the definition of color invariants derived from RGB imagery to improve robustness with respect to changes in illumination, we consider normalized colors which represent a simple example of such color invariants (Gevers and Smeulders, 1999):

nNIR = R_NIR / (R_NIR + R_R + R_G)
nR = R_R / (R_NIR + R_R + R_G)
nG = R_G / (R_NIR + R_R + R_G)

Besides these features derived via radiometric transformation, we extract further radiometric features in the form of spectral indices. In this regard, the Normalized Difference Vegetation Index (NDVI) (Rouse, Jr. et al., 1973) is defined as

NDVI = (R_NIR - R_R) / (R_NIR + R_R)

and represents a strong indicator for vegetation. A slight variation of this definition by replacing R_R with R_G results in the Green Normalized Difference Vegetation Index (GNDVI) (Gitelson and Merzlyak, 1998) which is more sensitive to the chlorophyll concentration than the original NDVI.
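As a minimal sketch, the radiometric feature maps described above can be computed per pixel with NumPy. The function name and the small eps term guarding against division by zero are our additions, not part of the paper:

```python
import numpy as np

def radiometric_features(nir, r, g, eps=1e-8):
    """Per-pixel radiometric feature maps from the NIR, R and G bands.

    nir, r, g: arrays of identical shape holding the band reflectances.
    The eps term (our addition) avoids division by zero in dark pixels.
    """
    s = nir + r + g + eps
    return {
        'nNIR': nir / s,                       # normalized colors
        'nR': r / s,
        'nG': g / s,
        'NDVI': (nir - r) / (nir + r + eps),   # vegetation index
        'GNDVI': (nir - g) / (nir + g + eps),  # green variant
    }
```

Applied to full band images, each entry of the returned dictionary is one feature map of the same size as the input bands.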

Geometric Features
In addition to the radiometric features, we extract a set of geometric features. The most intuitive idea in this regard is to take into account that the heights of objects above ground are more informative than the DSM itself. Consequently, we use the DSM to calculate the normalized Digital Surface Model (nDSM) via the approach presented in (Gerke, 2014). This approach relies on first classifying pixels into ground and off-ground pixels using the LAStools software. Subsequently, the height of each off-ground pixel is adapted by subtracting the height of the closest ground point. Besides the nDSM, we involve a set of local shape features extracted from the DSM as geometric features. Using the spatial 3D coordinates corresponding to a local 3×3 image neighborhood, we efficiently derive the 3D covariance matrix also known as the 3D structure tensor (Weinmann, 2016;Weinmann and Weinmann, 2018). The eigenvalues of the 3D structure tensor are normalized by their sum which results in normalized eigenvalues λ1, λ2 and λ3 with λ1 ≥ λ2 ≥ λ3 ≥ 0 and λ1+λ2+λ3 = 1. The normalized eigenvalues, in turn, are used to calculate the features of linearity L, planarity P, sphericity S, omnivariance O, anisotropy A, eigenentropy E and change of curvature C (West et al., 2004;Pauly et al., 2003) which have been involved in a variety of investigations for 3D scene analysis (Demantké et al., 2011;Weinmann et al., 2015;Hackel et al., 2016):

L = (λ1 - λ2) / λ1
P = (λ2 - λ3) / λ1
S = λ3 / λ1
O = (λ1 λ2 λ3)^(1/3)
A = (λ1 - λ3) / λ1
E = -(λ1 ln λ1 + λ2 ln λ2 + λ3 ln λ3)
C = λ3 / (λ1 + λ2 + λ3)
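These local 3D shape features can be sketched in NumPy as follows, computed from the eigenvalues of the covariance matrix of a small point neighborhood (e.g. the 3×3 DSM stencil). The function name and the eigenvalue clipping against numerical zeros are our assumptions:

```python
import numpy as np

def shape_features(points):
    """Local 3D shape features from the eigenvalues of the 3D structure
    tensor of a small point neighborhood (n x 3 array of coordinates)."""
    cov = np.cov(points.T)                       # 3x3 covariance matrix
    ev = np.sort(np.linalg.eigvalsh(cov))[::-1]  # eigenvalues, descending
    ev = np.clip(ev, 1e-12, None)                # guard against zeros (our addition)
    l1, l2, l3 = ev / ev.sum()                   # normalized eigenvalues, sum = 1
    return {
        'linearity':        (l1 - l2) / l1,
        'planarity':        (l2 - l3) / l1,
        'sphericity':       l3 / l1,
        'omnivariance':     (l1 * l2 * l3) ** (1.0 / 3.0),
        'anisotropy':       (l1 - l3) / l1,
        'eigenentropy':     -(l1 * np.log(l1) + l2 * np.log(l2) + l3 * np.log(l3)),
        'change_of_curvature': l3,               # λ3 / (λ1+λ2+λ3), with the sum being 1
    }
```

For a perfectly planar neighborhood, planarity approaches 1 while linearity and sphericity approach 0, matching the intuition behind these descriptors.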

Supervised Classification
For classification, we focus on the use of modern deep learning techniques in the form of convolutional neural networks, where standard networks like AlexNet (Krizhevsky et al., 2012) and the VGG networks (Simonyan and Zisserman, 2014) are composed of a collection of convolutional layers, max-pooling layers and activation layers followed by fully-connected classification layers. The use of deep networks with many layers allows learning complex non-linear relationships, yet it has been found that the performance of very deep networks tends to decrease when adding further layers via simply stacking convolutional layers as indicated in the left part of Figure 2. This is due to the vanishing gradient problem during training, i.e. the gradient of the error function decreases when being backpropagated to previous layers and, if the gradient becomes too small, the respective weights of the network remain unchanged (He et al., 2016b). One of the most effective ways to address this issue is given with the Residual Network (ResNet) (He et al., 2016a), whose basic units contain skip connections which add the input of the basic unit to its output and are motivated by the fact that optimizing the residual mapping is easier than optimizing the original mapping. The additional gain in computational efficiency allows forming deep networks with more than 100 convolutional layers. A further upgrade of the original residual structure to full pre-activation style (He et al., 2016b) is indicated in the right part of Figure 2 and has proven to be favorable both in theory and in experiments, as no ReLU layer impedes the flow of information and the backpropagation of errors.
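The effect of the skip connection can be illustrated with a minimal pre-activation residual unit. This is a conceptual sketch only: batch normalization is omitted and the convolutions are reduced to plain matrix products, so it is not the actual ResNet building block:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, w1, w2):
    """Sketch of a full pre-activation residual unit (He et al., 2016b):
    y = x + W2 * relu(W1 * relu(x)), with the identity skip connection
    carrying the input unchanged. Normalization is omitted for brevity."""
    f = relu(x) @ w1   # first pre-activation followed by a weight layer
    f = relu(f) @ w2   # second pre-activation followed by a weight layer
    return x + f       # identity skip connection: add input to residual
```

If the weight layers learn nothing (all-zero weights), the unit reduces to the identity, which is exactly why stacking many such units does not degrade the signal path.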
We make use of residual blocks as shown in the right part of Figure 2 to define our deep network. Each residual block is parameterized by the number of filters n, the atrous rate r and the stride s. Relying on these residual blocks, we use the structure of a ResNet-34 and adapt it to the task of dense labeling by introducing atrous convolution (Chen et al., 2016) and a shuffling operator (Chen et al., 2018). Thereby, we take into account that pooling layers or convolutional layers with a stride larger than 1 will cause a reduction of the resolution. We refer to such layers as Resolution Reduction Layers (RRLs). To avoid severe resolution reduction and thus spatial information loss, we only keep the first three RRLs and change the strides of the remaining RRLs to 1. In addition, we remove the layer of global average pooling and its subsequent layers to allow for image segmentation. The resulting network is referred to as Residual Shuffling Convolutional Neural Network (RSCNN) and shown in Figure 3.

Atrous Convolution
As the field-of-view of the deeper layers will shrink after removing RRLs, we involve atrous convolution (Chen et al., 2016), which can be used to compute the final CNN responses at an arbitrary resolution by re-purposing networks trained for image classification to semantic segmentation, and which enlarges the field-of-view of filters without the need for learning any extra parameters. Experiments also reveal that networks adopting a larger atrous rate have a larger field-of-view, thus resulting in better performance. Considering a one-dimensional signal x[i], the output y[i] of atrous convolution with a filter w[k] of length K is defined as

y[i] = Σ_{k=1}^{K} x[i + r·k] · w[k]

where r is the atrous rate. For r = 1, this corresponds to the standard convolution. The use of atrous convolution in our work thus follows the principles mentioned in (Chen et al., 2016).
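A direct (naive) NumPy implementation of one-dimensional atrous convolution might look as follows; the function name is illustrative and only valid output positions are computed:

```python
import numpy as np

def atrous_conv1d(x, w, r):
    """Atrous convolution of a 1D signal x with filter w at rate r:
    y[i] = sum_k x[i + r*k] * w[k], evaluated at valid positions only.
    For r = 1 this reduces to standard (valid-mode) correlation."""
    K = len(w)
    n_out = len(x) - r * (K - 1)  # number of valid output positions
    return np.array([sum(x[i + r * k] * w[k] for k in range(K))
                     for i in range(n_out)])
```

The filter taps are spaced r samples apart, so the effective field-of-view grows with r while the number of learned weights stays at K.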
Shuffling Operator

To achieve a dense prediction, we involve a shuffling operator to increase the resolution by combining feature maps in a periodic shuffling manner. The concept of the shuffling operator has originally been introduced for super-resolution (Shi et al., 2016), where it aims at the upscaling of feature maps. Inspired by this idea, it has been proposed to introduce this operator for the semantic segmentation of aerial imagery (Chen et al., 2018), and respective experiments reveal that the use of a shuffling operator improves the predictive accuracy by forcing the network to learn the upscaling. For example, if we need to double the resolution of the feature map, we can combine four feature maps as shown in Figure 4, which can be expressed as

I'(x, y) = I_{c_i}(⌊x/2⌋, ⌊y/2⌋)  with  c_i = 2 (y mod 2) + (x mod 2)

where I' refers to the feature map after the combination, I_{c_i} refers to the feature maps before the combination, c_i refers to the order of the feature map and (x, y) is the location. The only hyperparameter for the shuffling operator is the upscaling rate u. In our experiments, we adopt the shuffling operator with an upscaling rate of u = 4 and bilinear interpolation to recover the spatial resolution as done in (Chen et al., 2018).
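The periodic shuffling can be sketched in NumPy as follows. Note that the exact channel-ordering convention is an assumption on our part, as different implementations order the u² input feature maps differently:

```python
import numpy as np

def periodic_shuffle(maps, u):
    """Rearrange u*u low-resolution feature maps of shape (u*u, h, w)
    into a single map of shape (h*u, w*u). Output pixel (x, y) is taken
    from channel c = u*(y mod u) + (x mod u) at the low-resolution
    location (x // u, y // u); the ordering convention is an assumption."""
    c, h, w = maps.shape
    assert c == u * u, "number of input maps must equal u*u"
    out = np.empty((h * u, w * u), dtype=maps.dtype)
    for y in range(h * u):
        for x in range(w * u):
            out[y, x] = maps[u * (y % u) + (x % u), y // u, x // u]
    return out
```

Because the operator merely rearranges values, it is parameter-free: the network learns the upscaling by learning the u² channels feeding into it.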

EXPERIMENTAL RESULTS
In the following, we first describe the used dataset (Section 4.1). Subsequently, we summarize the conducted experiments (Section 4.2) and, finally, we present the derived results (Section 4.3).

Dataset
For our experiments, we use the Vaihingen Dataset (Cramer, 2010;Rottensteiner et al., 2012) which was acquired over a relatively small village with many detached buildings and small multi-story buildings. This dataset contains 33 patches of different sizes, whereby the given regular grid corresponds to a ground sampling distance of 9 cm. For 16 patches, a very high-resolution true orthophoto and the corresponding DSM derived via dense image matching techniques are provided as well as a reference labeling with respect to six semantic classes represented by Impervious Surfaces, Building, Low Vegetation, Tree, Car and Clutter/Background. According to the specifications, the class Clutter/Background includes water bodies and other objects such as containers, tennis courts or swimming pools. We use 11 of the labeled patches for training and the remaining 5 labeled patches for evaluation.

Figure 3. The overall structure of our Residual Shuffling Convolutional Neural Network (RSCNN): the symbol "/2" in a convolutional block refers to the stride of the convolutional layers, symbols like "/2" upon arrows indicate that the resolution is reduced to half of the input in length, and symbols like "×3" after a residual block indicate that the corresponding block is repeated three times.

Experiments
For each orthophoto and the corresponding DSM, we extract the set of hand-crafted features (cf. Section 3.1). Based on the orthophoto, we derive eight feature maps containing radiometric information with respect to the reflectance in the near-infrared (NIR), red (R) and green (G) domains, the normalized nearinfrared (nNIR), normalized red (nR) and normalized green (nG) values, the Normalized Difference Vegetation Index (NDVI) and the Green Normalized Difference Vegetation Index (GNDVI). Based on the DSM, we derive eight feature maps containing geometric information with respect to the normalized Digital Surface Model (nDSM), linearity (L), planarity (P), sphericity (S), omnivariance (O), anisotropy (A), eigenentropy (E) and change of curvature (C). A visualization of the behavior of these features for a part of the considered scene is provided in Figure 5.
Then, we focus on a separate and combined consideration of radiometric and geometric information as input to the RSCNN (cf. Section 3.2). We train and evaluate our network using the MXNet library (Chen et al., 2015) on one NVIDIA TITAN X GPU with 12 GB RAM. The network parameters are initialized using the method introduced in (He et al., 2015). Regarding the choice of the loss function, we use the cross-entropy error which is summed over all pixels in a batch of 16 patches. To optimize this objective function, we use standard Stochastic Gradient Descent (SGD) with a momentum of 0.9. During training, the samples in each batch are represented by patches of 56 × 56, 112 × 112, 224 × 224 and 448 × 448 pixels for 200 epochs, 50 epochs, 30 epochs and 20 epochs, respectively. The learning rate is kept at 0.01, as the use of Batch Normalization (BN) (Ioffe and Szegedy, 2015) allows for training with a large learning rate. Each patch fed into the network is normalized by subtracting the mean value and subsequently dividing by the standard deviation. In contrast to the common strategy of preparing patches beforehand, in which patches are regularly cropped from the original large images and saved to disk before training, each sample is cropped randomly and temporarily in our experiments as proposed in (Chen et al., 2017).
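The on-the-fly cropping and per-patch normalization described above can be sketched as follows; the function name and the small eps term guarding against constant patches are illustrative, not taken from the paper:

```python
import numpy as np

def random_crop_normalize(stack, size, rng):
    """Randomly crop a (bands, H, W) feature-map stack to size x size
    and normalize the crop to zero mean and unit standard deviation,
    mimicking on-the-fly sample generation during training."""
    _, h, w = stack.shape
    y = rng.integers(0, h - size + 1)  # random top-left corner
    x = rng.integers(0, w - size + 1)
    patch = stack[:, y:y + size, x:x + size].astype(np.float64)
    return (patch - patch.mean()) / (patch.std() + 1e-8)
```

Drawing crops randomly at each iteration, instead of from a fixed pre-cropped set, exposes the network to far more patch positions at no storage cost.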
In all experiments, the Patches 1, 3, 5, 7, 13, 17, 21, 23, 26, 32 and 37 are used to train the deep network, while the Patches 11, 15, 28, 30 and 34 are used for performance evaluation. As evaluation metrics, we consider the Overall Accuracy (OA), the mean F1-score across all classes (mF1) and the mean Intersection-over-Union (mIoU). To reason about the performance for each single class, we additionally consider the classwise F1-scores.
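These evaluation metrics can be computed from flattened per-pixel label arrays as in the following sketch; the handling of classes absent from both prediction and reference (scored as 1.0) is our assumption:

```python
import numpy as np

def evaluation_metrics(pred, ref, num_classes):
    """Overall Accuracy, mean F1-score and mean IoU from flattened
    per-pixel predicted labels and reference labels."""
    oa = float(np.mean(pred == ref))
    f1s, ious = [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (ref == c))  # true positives for class c
        fp = np.sum((pred == c) & (ref != c))  # false positives
        fn = np.sum((pred != c) & (ref == c))  # false negatives
        denom = tp + fp + fn
        f1s.append(2 * tp / (2 * tp + fp + fn) if denom else 1.0)
        ious.append(tp / denom if denom else 1.0)
    return oa, float(np.mean(f1s)), float(np.mean(ious))
```

The classwise F1-scores are the entries of `f1s` before averaging; mF1 and mIoU are their unweighted means over all classes.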

Results
The classification results derived for different subsets of the defined feature maps are provided in Table 1. This table also contains information about the number of parameters as well as the time required to train the RSCNN for the respective input data. For selected subsets, the achieved labeling is visualized in Figures 6 and 7 for a part of Patch 30 of the Vaihingen Dataset.

DISCUSSION
The derived results (cf. Table 1) clearly indicate that reasonable classification results can already be achieved by only considering true orthophotos (OA = 84.59%, mF1 = 82.81%, mIoU = 59.54%). In contrast, the classification results are significantly worse when only considering geometric information (OA = 70.95 ... 76.07%, mF1 = 59.58 ... 71.04%, mIoU = 39.00 ... 47.45%). For the class Building, the corresponding F1-score reveals a slight decrease in most cases. However, the decrease is more significant for the other classes, particularly for the classes Low Vegetation and Car, for which the F1-scores are reduced by more than 20%. The fusion of radiometric and geometric information improves the classification results. While mF1 and mIoU reach an almost constant level around 83% and 60%, respectively, the OA is improved up to 85.69% when using the NIR, R, G, nDSM, NDVI, L, P and S feature maps as input to the classifier. The derived results also reveal that considering the NDVI feature map in addition to the NIR, R, G and nDSM feature maps does not yield better classification results, which is in accordance with insights of other investigations (Gerke, 2014). In contrast, the consideration of geometric cues given in the L, P and S feature maps in addition to the NIR, R, G and nDSM feature maps leads to slightly improved classification results.
Interestingly, the best classification result is not obtained for the case when all feature maps are used as input to the network (cf. Table 1). This effect is likely to indicate the Hughes phenomenon (Hughes, 1968;Guyon and Elisseeff, 2003) characterized by a decrease in classification accuracy when increasing the number of considered features. This might be due to the fact that redundant and possibly even irrelevant features are included in the semantic segmentation task. As a consequence, the end-to-end processing pipeline with a deep network should still involve feature extraction and feature selection techniques to select appropriate input data for the network. The visualizations of derived results for semantic segmentation (cf. Figure 6) additionally reveal discontinuities in the final prediction when using only radiometric features or only geometric features. These artifacts in the classification results arise from the fact that, due to the limited GPU memory, the whole image is partitioned into patches of 448 × 448 pixels and these patches are fed into the network for prediction. Compared to the pixels at the center of patches, the marginal pixels have a smaller field-of-view which may result in inaccurate predictions and discontinuities. Indeed, the visualized results correspond to the bottom right part of Patch 30. When using both radiometric and geometric information, this issue is resolved as the geometric and radiometric information are complementary and their combination allows a better prediction. When visualizing the used radiometric and geometric features (cf. Figure 5), particularly the nDSM reveals non-intuitive characteristics as local constraints like horizontal ridge lines are not preserved.
Consequently, a potential source for improvement could be to directly approximate the topology of the considered scene from the spatial 3D coordinates using spatial bins and a coarse-to-fine strategy (Blomley and Weinmann, 2017) instead of using the LAStools software to derive the nDSM (Gerke, 2014).

CONCLUSIONS
In this paper, we have focused on the use of multi-modal data for the semantic segmentation of aerial imagery. Using true orthophotos, the corresponding DSMs and further representations derived from both of them, we have defined different sets of feature maps as input to a deep network. For the latter, we have proposed a Residual Shuffling Convolutional Neural Network (RSCNN) which combines the characteristics of a Residual Network with the advantages of atrous convolution and a shuffling operator to achieve a dense semantic labeling. Via performance evaluation on a benchmark dataset, we have analyzed the value of radiometric and geometric features when used separately and in different combinations for the semantic segmentation task.

Figure 6. Visualization of the reference labeling and the results for semantic segmentation when using only radiometric features, when using only geometric features and when using both geometric and radiometric features (from left to right): the color encoding addresses the classes Impervious Surfaces (white), Building (blue), Low Vegetation (cyan), Tree (green) and Car (yellow).

Figure 7. Visualization of the reference labeling and the results for semantic segmentation when using the original data (i.e. only the NIR, R, G and DSM feature maps), when using a specific subset of all defined feature maps (here the NIR, R, G, nDSM, NDVI, L, P and S feature maps) and when using all defined feature maps (from left to right): the color encoding addresses the classes Impervious Surfaces (white), Building (blue), Low Vegetation (cyan), Tree (green) and Car (yellow).
The derived results clearly reveal that true orthophotos are better suited as the basis for classification than the DSM, the nDSM and different representations of geometric information and their combination. However, the combination of both radiometric and geometric features yields an improvement of the classification results. The derived results also indicate that some features such as the NDVI are less suitable, and that the use of many features as the basis for semantic segmentation can decrease the predictive accuracy of the network, which might be attributed to the Hughes phenomenon. We conclude that selected data representations derived via feature extraction and feature selection techniques provide a gain if used as the basis for deep semantic segmentation.