DEEP DOMAIN ADAPTATION BY WEIGHTED ENTROPY MINIMIZATION FOR THE CLASSIFICATION OF AERIAL IMAGES

ABSTRACT: Fully convolutional neural networks (FCN) are successfully used for the automated pixel-wise classification of aerial images and possibly additional data. However, they require many labelled training samples to perform well. One approach addressing this issue is semi-supervised domain adaptation (SSDA). Here, labelled training samples from a source domain and unlabelled samples from a target domain are used jointly to obtain a target domain classifier, without requiring any labelled samples from the target domain. In this paper, a two-step approach for SSDA is proposed. The first step corresponds to a supervised training on the source domain, making use of strong data augmentation to increase the initial performance on the target domain. Secondly, the model is adapted by entropy minimization using a novel weighting strategy. The approach is evaluated on the basis of five domains, corresponding to five cities. Several training variants and adaptation scenarios are tested, indicating that proper data augmentation can already improve the initial target domain performance significantly, resulting in an average overall accuracy of 77.5%. The weighted entropy minimization improves the overall accuracy on the target domains in 19 out of 20 scenarios, on average by 1.8%. In all experiments a novel FCN architecture is used that yields results comparable to those of the best-performing models on the ISPRS labelling challenge while having an order of magnitude fewer parameters than commonly used FCNs.


INTRODUCTION
The automated pixel-wise classification of multispectral aerial orthophotos (MSI) and possibly additional data like digital surface models (DSM) is a highly relevant task, e.g. for the automated generation or updating of maps. One way to address this task is based on machine learning techniques, where labelled training samples are used to train a classification model. Currently, the best performance in a wide range of applications is achieved by deep neural networks, in particular by variants of fully convolutional neural networks (FCN) (Long et al., 2015a). FCNs are highly scalable classification models that can learn very complex mappings between input and output if enough training data, representative for the classification task, is available. If this requirement is not fulfilled, the trained model is very likely to overfit to the training data and, thus, to perform badly on unseen data. However, creating more training data usually implies manual labelling, a costly and time-consuming task that should be avoided if possible. To that end, much research is carried out to prevent overfitting without the requirement of additional data, either by regularizing the model properly or by artificially increasing the variety of the training data, e.g. by data augmentation (Shorten and Khoshgoftaar, 2019).
Another strategy, referred to as transfer learning (TL) (Pan and Yang, 2010), is to transfer knowledge from a source domain, in which training samples are abundant, to a target domain, where only a limited amount of training data or none at all is available. Domain adaptation (DA) is a specific setting of TL, where the domains are assumed to differ only by the joint distribution of the features and the class labels. Regarding the task of aerial image classification (AIC), this corresponds e.g. to a situation where labelled images from one city (source domain) should be used to classify images of another city (target domain) taken with the same sensor type and considering the same class structure. However, the objects in the target domain may have a different appearance; thus, a classifier that was trained only on the source domain possibly performs badly on the target domain.
In the present work, the setting of DA is addressed where only unlabelled samples from the target domain are used to adapt a classifier from the source to the target domain. According to Tuia et al. (2016), this setting is referred to as semi-supervised DA (SSDA). It is particularly interesting, since unlabelled samples from the target domain are always available, because they are to be classified in the first place. However, SSDA is known to be very challenging and can even result in a negative transfer, denoting a decreased performance on the target domain after adaptation compared to training on the source domain only. Analogously, an improvement is called a positive transfer. SSDA is highly relevant when it comes to the classification of aerial images, because on the one hand there is only a very limited amount of freely available data with annotations (Zhu et al., 2017), and on the other hand the appearance of both natural and man-made objects in aerial images has a huge variability, making it difficult for a model to perform well across different domains. These factors can lead to huge domain gaps in AIC. Although this is certainly a huge challenge, recent advances in methods for SSDA in related fields like street scene segmentation indicate that these methods can compensate existing domain gaps in AIC and, thus, increase the applicability of neural networks. However, there is hardly any work addressing SSDA for AIC with deep neural networks, and no work was found that considers imbalanced class distributions.
In this paper, a two-step strategy for SSDA for the pixel-wise classification of aerial images is proposed. First, a model is trained in a supervised way on the source domain, referred to as source training. In the second step, the model is adapted to a target domain by applying implicit instance transfer. Following Vu et al. (2019), this is realized by minimizing the mean entropy of the pixel-wise target domain class predictions. However, direct entropy minimization is assumed to perform badly in cases where the classes are highly unbalanced, as is often the case in AIC. To that end, a novel weighting technique is introduced. To increase the stability of the adaptation, pixels that are close to a predicted object boundary are not considered in the instance transfer.
In order to evaluate the proposed method, FCNs are trained on five different (source) domains and adapted to the respective four other (target) domains, resulting in 20 adaptation scenarios. By varying the amount of data augmentation during source domain training, its influence on the initial target domain performance but also on the adaptability is investigated. In particular, it is assumed that strong data augmentation during source training increases the initial performance on the target domain. The method is compared to the regularized, direct entropy minimization, as proposed in (Vu et al., 2019), to validate the benefit of the proposed variant. A novel FCN architecture is used that has much fewer parameters than common FCNs without a considerable loss of performance. The architecture is mainly based on the combination of partial padding (Liu et al., 2018) and dilated convolutions (Yu and Koltun, 2016), assembled in residual layers (Szegedy et al., 2017). To assess the performance of the architecture, it is evaluated on the Vaihingen benchmark.
The scientific contributions of this paper are as follows:
- A two-step approach for SSDA based on entropy minimization is proposed and applied to the task of aerial image classification with neural networks. A pixel-wise weighting strategy based on the statistics of semi-labels and predicted object boundaries is proposed that improves both the success rate and the performance of the adaptation.
- As an additional, minor contribution, an architecture for a fully convolutional neural network is proposed. By combining residual layers with partial padding and dilated convolutions, the network achieves a performance close to the state of the art while requiring much fewer parameters.
- Lastly, the influence of data augmentation on the initial domain gap and on a succeeding domain adaptation is investigated. It is shown that by using proper data augmentation the domain gap can be reduced significantly.

RELATED WORK
In this section, the state of the art in SSDA in computer vision and photogrammetry is discussed, focussing on the task of pixel-wise classification using FCNs. According to Tuia et al. (2016), SSDA can be based either on representation transfer or on instance transfer. In the following, the two approaches are discussed in further detail.
Representation transfer tries to find mappings from the feature spaces of both domains to a common representation space such that a shared classifier can be applied. In remote sensing, this is often done by finding a mapping that minimizes a statistical distance between the domains, e.g. the maximum mean discrepancy (MMD) (Matasci et al., 2015). This approach was transferred to neural networks for the task of assigning a single class label to an image in (Long et al., 2015b). Ganin et al. (2015) introduced the concept of domain adversarial training and showed that this approach is superior to minimizing the MMD. The concept of domain adversarial training was also frequently used for the pixel-wise classification with FCNs, e.g. in (Huang et al., 2018), (Hoffmann et al., 2018) and (Zhang et al., 2018) for the semantic segmentation of street scenes. While Zhang et al. (2018) apply the domain discriminator to the final layer of the classification network, Huang et al. (2018) propose to perform the representation transfer in multiple layers of the network. Hoffmann et al. (2018) apply the representation transfer to one intermediate layer of the network. Although all above-mentioned approaches yield stable improvements, they are tailored to street scene classification. In (Wittich and Rottensteiner, 2019), the concept of domain adversarial training was applied to the task of aerial image classification. The authors achieve a stable positive transfer of around 1-5% in overall accuracy, evaluated on three domains with three classes. They show that domain adversarial training is highly susceptible to large differences in the marginal class distributions of source and target domain. As the performance of this method highly depends on the network architecture and the adaptation setup seems to be difficult to tune, in this work the alternative concept of instance transfer is explored.
An alternative way of representation transfer operates in the input space of the model and is followed by several approaches for street scene classification, e.g. (Hoffmann et al., 2018) and (Zhang et al., 2018).

Instance transfer aims at adapting the classifier from the source to the target domain by using semi-labelled samples, i.e. target samples receiving their class labels from the current state of the classifier, e.g. (Bruzzone et al., 2008). Approaches based on instance transfer represent the second largest research branch of SSDA for the task of pixel-wise classification. Addressing the task of street scene segmentation, Zou et al. (2018) propose a class-balanced self-training, where they jointly train a network on labelled source domain data and target domain samples with semi-labels. For each class, they select the semi-labelled samples with the respectively highest confidence. Such class-balancing is shown to be necessary when dealing with imbalanced class distributions. Based on the source domain samples, they further compute a spatial prior for each class that is used to regularize the model. Although this approach yields results comparable to representation matching, using a spatial prior seems not reasonable when classifying aerial images, because objects can be located anywhere in the images. Class-balancing is also realized in the present work, however in a different way, because no explicit sample selection is used. An alternative approach based on semi-labelled samples is presented in (Iqbal and Ali, 2019).
The authors propose to use spatially independent samples with a high confidence score from the semi-labelled images, acquired by aggregating predictions at multiple scales. Since their sampling strategy relies on the assumption that the relative class distributions in each image are similar in source and target domain, this approach is probably not applicable to remote sensing applications, where large regions may contain only a single class. An approach for implicit instance transfer is presented in (Vu et al., 2019). Here, the entropy of the class predictions for each pixel is minimized, which corresponds to increasing the probability of the most probable class. Thus, it is conceptually similar to supervised training on semi-labelled samples. Besides the direct entropy minimization, the authors propose an adversarial approach that aligns the entropy distributions of source and target domain samples using a discriminator network. In both cases the model is regularized w.r.t. the predicted target domain class distribution, assuming it is close to the class distribution of the source domain.
Again limited to street scene segmentation, they show that the second approach slightly outperforms the direct entropy minimization, while both methods achieve results comparable to those based on representation transfer. However, both variants assume the target domain class distribution to be close to that of the source domain, which is not generally valid in remote sensing scenarios for the previously mentioned reasons. The method in the present paper is also based on entropy minimization, but only the direct version is explored, extended by a pixel-wise weighting strategy. Consequently, the proposed method does not rely on any assumptions regarding the target domain class distribution. It can be seen as a combination of entropy minimization to realize instance transfer and class-balancing to address imbalanced class distributions. However, the balancing is realized here by weighting each pixel's loss depending on its semi-label. Further, predicted object boundaries are excluded during the adaptation in order to improve its stability, which is not done in any of the mentioned publications.

METHODOLOGY
In this section, the proposed strategy for the supervised source training and the unsupervised adaptation of an FCN is presented.
To that end, a formal description of DA according to Tuia et al. (2016) is given. In DA, a source domain D^S and a target domain D^T are considered, both associated with remotely sensed imagery. The domains are further associated with the joint distributions P^S(X, C) and P^T(X, C) of the image features X and the class labels C. In this paper, the setting of homogeneous DA (Wang and Deng, 2018) is addressed, where the class structure C and the feature space X are assumed to be identical for both domains. The basic assumption of DA is that the joint distributions P^S(X, C) and P^T(X, C) are different, but related. The difference may be due to the marginal distributions of the features, i.e. P^S(X) ≠ P^T(X), or the posteriors, i.e. P^S(C|X) ≠ P^T(C|X). In both cases, the differences must not be too large. In the semi-supervised setting, a training data set T^S of labelled training samples is available in the source domain, each sample consisting of a tuple (x_i^S, c_i^S) with x_i^S ∈ X and c_i^S ∈ C (in the addressed application, (x_i^S, c_i^S) corresponds to a labelled image patch, hence c_i^S is a matrix with one class label per pixel in x_i^S). The information available in D^T is restricted to the set U^T of unlabelled samples x_i^T ∈ X. The task of SSDA is to use the labelled data T^S and the unlabelled data U^T to learn a classifier that predicts the unknown labels c_i^T in the target domain.
In the proposed method, this task is tackled by a two-step strategy. Firstly, an FCN is trained in a supervised way on the labelled source domain training data T^S, resulting in the model M^S. The model is trained by minimizing the deviations between predicted labels and the reference c_i^S, measured by a differentiable loss function ℒ(M^S, x_i^S, c_i^S) as described in section 3.2. In the second step, the model is adapted to the target domain D^T based on the unlabelled data U^T, resulting in the final target domain classifier M^T. The corresponding strategy is described in section 3.3.

Network Architecture
The FCN used in this work is designed to have a large receptive field while being able to propagate low-level details through the network to preserve precise object boundaries. Preliminary experiments using different FCN architectures have shown that these two properties mainly affect the performance in AIC. They are commonly achieved by using encoder-decoder networks with skip connections such as U-Net (Ronneberger et al., 2015). However, this architecture uses strong spatial down-sampling and deep feature maps, which leads to a large number of learnable parameters. For instance, the conventional U-Net has ~30M parameters. Ronneberger et al. (2015) state that convolutions with zero-padding should be avoided because they produce artefacts wherever the receptive field exceeds the boundaries of the input image. This is even more important when a larger receptive field is used. Nevertheless, zero-padding is frequently used in AIC applications, possibly resulting in longer training times, boundary artefacts or even a suboptimal performance.
The proposed architecture combines several techniques to enable a large receptive field while preserving low-level information, without any of the listed drawbacks. This is mainly achieved by combining partial convolution based padding (Liu et al., 2018) with dilated convolutions (Yu and Koltun, 2016). While dilated convolutions can effectively increase the receptive field without heavily increasing the number of parameters, partial convolutions reweight the parameters of each learned convolutional filter in areas where padding is necessary, i.e. at the border of the input. The two concepts are combined in residual blocks. In each residual block, the input is convolved with four dilated convolutional layers with dilation rates of 1, 2, 3 and 4, respectively, using partial convolution based padding. The results are concatenated, merged by another convolutional layer and added to the input of the block. Using multiple filters with different dilation rates is inspired by the Inception-ResNet (Szegedy et al., 2017), where it was shown to improve the performance.
The architecture takes input patches of size 256 × 256 px containing both the MSI and the rasterized height data. Firstly, a down-sampling layer is applied that performs a strided convolution (Springenberg et al., 2015) with a step width of 4 along both spatial dimensions. Next, 8 residual blocks are concatenated, followed by an up-sampling layer that uses a strided transposed convolution (Noh et al., 2015), again with a step width of 4, to scale the feature maps back up to the size of the input. The down-sampling and up-sampling layers also use partial convolution based padding. The last layer predicts the class probabilities for each pixel using the softmax function. As activation function, leaky rectified linear units (leaky ReLU) with a slope of 0.1 are used.
All residual blocks as well as the up-sampling layer include a dropout layer (Srivastava et al., 2014) to regularize the parameters and prevent overfitting to the training data. The architecture is presented in figure 1. All architecture-related parameters were found empirically in preliminary experiments.
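As a back-of-envelope illustration of why this design stays compact, the receptive-field growth and the parameter count of the residual blocks can be estimated in a few lines. The 3 × 3 kernel size and a channel width of 128 are assumptions made here for illustration; neither value is stated in this section.

```python
# Rough estimate of receptive field and parameter count for the residual
# blocks with parallel dilated 3x3 convolutions (dilation rates 1..4).
# Kernel size 3 and 128 channels are ASSUMPTIONS, not values from the paper.

def branch_extent(kernel=3, dilation=1):
    """Spatial extent of a dilated convolution kernel."""
    return (kernel - 1) * dilation + 1

def block_receptive_field(rates=(1, 2, 3, 4)):
    # The branches run in parallel, so the widest branch dominates the growth.
    return max(branch_extent(dilation=d) for d in rates)

per_block = block_receptive_field()      # widest branch (dilation 4): 9 px

# Eight stacked blocks operate on features down-sampled by stride 4:
rf_feature = 1 + 8 * (per_block - 1)     # 65 px in feature space
rf_input = rf_feature * 4                # ~260 px at input resolution,
                                         # i.e. covering a 256 px patch

def block_params(channels=128, rates=(1, 2, 3, 4)):
    branches = len(rates) * channels * channels * 3 * 3   # four dilated 3x3 convs
    merge = (len(rates) * channels) * channels            # 1x1 merge convolution
    return branches + merge

total = 8 * block_params()               # ~5.2M parameters for all blocks
```

Under these assumptions the residual blocks amount to roughly 5M parameters, which is consistent with the claim of an order of magnitude fewer parameters than the ~30M of a conventional U-Net.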

Supervised training
The supervised training of the model is based on minimizing ℒ using mini-batch gradient descent. Instead of the commonly used cross-entropy loss ℒce, the multi-class focal loss ℒfcl is used here, because it is less affected by an unbalanced class distribution, which is often the case in aerial scenes. The focal loss was proposed by Lin et al. (2017) for binary image classification and adapted to the multi-class case in (Yang et al., 2019). Supervised training is carried out for a fixed number of iterations using the Adam optimizer (Kingma and Ba, 2015) and a constant learning rate. The batch size is not fixed but increased during training, which speeds up the training without decreasing the classifier's performance. The tuning strategy and the choice of all training-related hyper-parameters are further described in section 4.2.
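For illustration, the binary form of the focal loss can be sketched for a single pixel. The exponent γ = 2 is the default suggested by Lin et al. and is an assumption here, as the paper does not state its choice.

```python
import math

def focal_loss(p_ref, gamma=2.0):
    """Focal loss for one pixel, given the predicted probability p_ref of the
    reference class. For gamma = 0 it reduces to the cross-entropy -log(p_ref);
    for gamma > 0, well-classified pixels (p_ref close to 1) are down-weighted,
    so frequent, easy classes dominate the gradient less."""
    return -((1.0 - p_ref) ** gamma) * math.log(p_ref)

# A confidently correct pixel contributes far less than a hard one:
easy = focal_loss(0.95)   # ~0.00013
hard = focal_loss(0.30)   # ~0.59
```

This down-weighting of easy samples is the mechanism that reduces the bias towards frequent classes such as Sealed ground compared to the plain cross-entropy loss.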
During training, data augmentation is used to increase the variety of training samples and, thus, to improve the generalization capability of the model (Shorten and Khoshgoftaar, 2019). In AIC, often only weak data augmentation is used, e.g. random cropping of patches, followed by a random rotation in steps of 90° and a horizontal or vertical flip with a respective probability of 50% (Tasar et al., 2019). Sang and Minh (2018) perform only random flipping and Nogueira et al. (2019) do not use data augmentation at all. Yang et al. (2019) perform a slightly stronger geometrical augmentation by rotating the patches in steps of 90°. In this paper, a stronger data augmentation is applied, assuming that it will increase the model performance on the target domain. In each iteration, a random affine transformation is applied in combination with radiometric augmentation, where each channel of the input patch is modified by a random linear transformation. This corresponds to an independent modification of the brightness and contrast of each channel. Although the transformation of height data cannot be considered as radiometric augmentation, it is treated equivalently here, assuming that a random transformation of the height data can compensate for differences in object heights or the ground level in D^S and D^T.
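The per-channel radiometric part of this augmentation can be sketched as follows. This is a toy pure-Python version operating on nested lists instead of image tensors; the distributions N(1, 0.3) for the scale and N(0, 0.3) for the bias are those specified in section 4.2.

```python
import random

def radiometric_augment(channels, rng=random):
    """Independent random linear transform per channel,
    x'_{i,j} = r_{c,j} * x_{i,j} + r_{b,j}, with r_c ~ N(1, 0.3) and
    r_b ~ N(0, 0.3). `channels` is a list of per-channel pixel-value lists
    (a toy stand-in for an image patch); the height channel is treated
    exactly like the spectral channels."""
    augmented = []
    for values in channels:
        r_c = rng.gauss(1.0, 0.3)   # random contrast change
        r_b = rng.gauss(0.0, 0.3)   # random brightness change
        augmented.append([r_c * v + r_b for v in values])
    return augmented
```

Passing a deterministic `rng` (e.g. one whose `gauss` returns the mean) yields the identity transform, which is a convenient sanity check.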

DA using entropy minimization
In this work, the actual DA step is carried out after source training as presented in section 3.2. The proposed approach realizes the concept of instance transfer in order to adapt the initial model M^S to a target domain D^T in an unsupervised way. The idea of instance transfer is to use semi-labelled samples from D^T, obtained by applying M^S to U^T, to retrain the classifier on samples with high confidence, e.g. measured by the entropy of the prediction. In this work, the instance transfer is not realized by supervised training using semi-labelled samples, but instead, following (Vu et al., 2019), by directly minimizing the entropy E of the predicted class distributions of target domain samples. Conceptually, this minimization is directly related to supervised training on semi-labelled samples, because minimizing the entropy corresponds to maximizing the confidence of the currently most probable class. However, instead of explicitly choosing semi-labels with high confidence, this is done implicitly in entropy minimization: samples with high confidence, thus a low entropy, result in larger gradients compared to samples with a high entropy and, thus, contribute more to the stochastic gradient descent. The entropy loss ℒent is derived as follows. Let n_cls be the number of classes; the entropy E_{i,x,y} of the predicted class distribution p_{i,x,y} for the pixel at position (x, y) in image i is defined as

E_{i,x,y} = -1 / log(n_cls) · Σ_c p_{i,x,y}(c) · log p_{i,x,y}(c),    (1)

so that ℒent for a mini-batch with m patches of size h × w becomes

ℒent = 1 / (m · h · w) · Σ_{i,x,y} E_{i,x,y}.    (2)

Minimizing eq. 2 to perform the adaptation is assumed to be not reasonable when dealing with unbalanced class distributions, because the model would tend to increase the probability of the most frequent classes, thus getting biased towards them. To counteract this behaviour, a pixel-wise weighting strategy is proposed as follows. In each iteration of the adaptation phase, the current model M is used to predict the semi-label map ĉ_i for each image x_i in the mini-batch. The weighting function Π(c) for each class c corresponds to the ℓ1-normalized, inverse class ratio, thus

Π(c) = (1 / o_c) / Σ_{c'} (1 / o_{c'}),    (3)

where o_c is the number of pixels with semi-label c in all samples of the current mini-batch. As a second extension, pixel entropies that are closer to a predicted object boundary than δ pixels are not considered in the entropy minimization. This is motivated by the observation that classification models usually predict object boundaries with lower confidence, which leads to a high entropy of the corresponding pixels. Forcing a model to predict boundary regions with high confidence is assumed to be harmful during the adaptation. Formally, a binary boundary region indicator ϕ is introduced. Let B be the set of all pixels in the current mini-batch that have a different semi-label than any of their four adjacent pixels; the boundary region indicator becomes ϕ_{i,x,y} = 0 if any pixel (i, x', y') ∈ B fulfils (x - x')² + (y - y')² ≤ δ², and ϕ_{i,x,y} = 1 otherwise. According to the predicted semi-label, the class weights and the boundary indicators, the weight for pixel (i, x, y) becomes w_{i,x,y} = Π(ĉ_{i,x,y}) · ϕ_{i,x,y}. This leads to the proposed weighted entropy loss

ℒ* = 1 / Γ · Σ_{i,x,y} w_{i,x,y} · E_{i,x,y},    (4)

where Γ denotes the sum of all weighting factors over the current mini-batch, used to normalize the magnitude of the overall loss.
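The weighting scheme can be sketched for a single image as follows. This is a minimal pure-Python version; the paper computes the statistics per mini-batch, and the entropy is normalized by log n_cls following Vu et al. (2019).

```python
import math

def weighted_entropy_loss(probs, delta=2):
    """Sketch of the weighted entropy loss for one image, where probs[y][x]
    is the predicted class distribution of a pixel. Semi-labels are the argmax
    classes, weights are the l1-normalised inverse class frequencies, and
    pixels within `delta` px of a predicted boundary are excluded (phi = 0)."""
    h, w = len(probs), len(probs[0])
    n_cls = len(probs[0][0])
    labels = [[max(range(n_cls), key=lambda c: probs[y][x][c]) for x in range(w)]
              for y in range(h)]

    # Boundary set B: pixels whose 4-neighbourhood contains another semi-label.
    boundary = set()
    for y in range(h):
        for x in range(w):
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != labels[y][x]:
                    boundary.add((y, x))

    # Inverse-frequency class weights Pi(c), l1-normalised (eq. 3).
    counts = {}
    for row in labels:
        for c in row:
            counts[c] = counts.get(c, 0) + 1
    inv = {c: 1.0 / n for c, n in counts.items()}
    z = sum(inv.values())
    pi = {c: v / z for c, v in inv.items()}

    num, gamma = 0.0, 0.0
    for y in range(h):
        for x in range(w):
            near = any((y - by) ** 2 + (x - bx) ** 2 <= delta ** 2
                       for by, bx in boundary)
            weight = 0.0 if near else pi[labels[y][x]]
            entropy = -sum(p * math.log(p) for p in probs[y][x] if p > 0)
            entropy /= math.log(n_cls)   # normalised entropy (eq. 1)
            num += weight * entropy
            gamma += weight
    return num / gamma if gamma > 0 else 0.0
```

On a patch with uniform semi-labels the loss reduces to the mean normalised entropy, while on a patch dominated by boundary pixels all weights vanish and no adaptation signal is produced, which is exactly the intended behaviour.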
During the adaptation, ℒ* is minimized using data samples from the training and test sets of the target domain. The adaptation is carried out for a fixed number of epochs using the Adam optimizer for stochastic gradient descent. The selection of all hyper-parameters of the adaptation phase is described in section 4.2.

Datasets
For the evaluation of the proposed strategy, datasets from five different German cities are used; data from a sixth city was used for tuning purposes only. Firstly, the datasets Potsdam (P) and Vaihingen (V) are used, provided by the ISPRS labelling challenge (Wegner et al., 2017). Secondly, datasets for Schleswig (S), Hameln (H), Buxtehude (B) and Nienburg (N) are used. In the following, each city is treated as a separate domain. The datasets were pre-processed to have a common ground sampling distance of 20 cm, which required a spatial down-sampling of the data for V and P. Furthermore, all reference data were mapped to a common class structure containing the five classes Sealed ground, Building, Natural ground, Vegetation and Vehicle. To this end, the class Clutter of the datasets P, V, S and H was manually relabelled to one of the remaining classes. The classes Water and Soil, originally present in the reference for S and H, were mapped to Natural ground. The reference for B and N, provided by (Vogt et al., 2018), was revised to match the shared class structure. For all datasets the channels near infrared (NIR), red and green are available; thus, these channels are used to compose the MSI. Normalized digital surface models (nDSM) are used as height maps; the nDSM for V was provided by Gerke (2015). While for P and V the original split into training and testing patches was kept, the remaining datasets were randomly split image-wise into disjoint sets for training and testing with a ratio of approximately 2:1, such that the class distribution of each subset roughly corresponds to the overall class distribution of the dataset. The dataset N was only used for tuning the hyper-parameters of the method; to obtain unbiased results, it is not used for the evaluation.

Test setup and evaluation protocol
The evaluation is split into three parts. In the first experiment, the proposed network architecture is evaluated on the Vaihingen benchmark. The other two experiments correspond to the two phases of the proposed strategy for SSDA, i.e. source training and DA. To investigate the influence of data augmentation, two variants Ω ∈ {Ω−, Ω+} are used during source training. Variant Ω− refers to a weak amount of augmentation: samples, drawn from a random position, are randomly rotated by n · 90° with n ∈ {0, 1, 2, 3} before being flipped horizontally or vertically with a respective probability of 50%. This variant corresponds to the frequently used augmentation strategy in AIC (see section 3.2).
The second variant Ω+ is the proposed, strong data augmentation strategy. Here, a random affine transformation is applied to obtain image patches, using bilinear interpolation for the input data and nearest neighbour interpolation for the label maps. The rotation is drawn from the uniform distribution U(0°, 360°), the shear according to the normal distribution N(0, 0.3), and the scales for both spatial dimensions from N(1, 0.3). Additionally, each channel of the patch (including the height map) is linearly transformed with a random bias r_b and scale r_c. Formally, the j-th channel of the sample x_i is transformed as x'_{i,j} = r_{c,j} · x_{i,j} + r_{b,j}. The corresponding random variables are drawn from r_{c,j} ~ N(1, 0.3) and r_{b,j} ~ N(0, 0.3). The resulting classifiers for each source domain D^S and augmentation scenario Ω are denoted as M^{S,Ω}. The proposed network architecture and all training-related hyper-parameters were tuned in advance on domain N, such that the average overall accuracy on the test set of N based on both augmentation variants is maximised. In the tuning process, the following hyper-parameters were obtained: training is done for 100K iterations using the Adam optimizer with a fixed learning rate of 10^-4 and hyper-parameters β1 = 0.9 and β2 = 0.999. The batch size is initialised with 2 and increased by 1 every 6K iterations up to a maximum size of 16. The hyper-parameters related to the adaptation were obtained by maximising the average improvement of adapting the models from N to H. DA is done for 200 iterations by minimizing ℒ* for mini-batches of target domain samples, obtained without data augmentation. The batch size is set to 24 and the boundary margin to δ = 2 px. The Adam optimizer is used with a learning rate of 10^-6 and hyper-parameters β1 = 0.0 and β2 = 0.99.
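The batch-size schedule described above amounts to a one-line rule:

```python
def batch_size(iteration, start=2, step_every=6000, max_size=16):
    """Batch-size schedule used during source training: start at 2 and
    add 1 every 6K iterations, capped at 16."""
    return min(start + iteration // step_every, max_size)
```

Over the 100K training iterations, the maximum of 16 is thus reached after 84K iterations.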
All evaluations are done on the test set of the respective target domain. A sliding window evaluation is performed, i.e. the input image is split into overlapping patches (with an overlap of 50% in both spatial dimensions), each patch is processed, and the resulting probability distributions for pixels in overlapping areas are averaged. Following Kaiser et al. (2017), the quality metrics overall accuracy (OA) and mean F1-score (MF1) are used to assess the resulting label predictions. While the former, being the ratio of correct predictions to the total number of predictions, can be biased for imbalanced class distributions, the MF1 averages the prediction quality, i.e. the harmonic mean of precision and recall, of all classes equally and, thus, is not biased towards classes with higher frequency.
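The sliding-window averaging can be sketched as follows. This is a toy version with one scalar score per pixel instead of class distributions; `predict` is a stand-in for the trained FCN, and the patch size and image dimensions are illustrative only.

```python
def sliding_window_predict(h, w, patch=4, overlap=0.5, predict=None):
    """Sliding-window inference: overlapping windows are processed by
    `predict(y0, x0)` (returning a patch x patch grid of per-pixel scores),
    accumulated per pixel and averaged over the number of covering windows."""
    stride = int(patch * (1 - overlap))
    acc = [[0.0] * w for _ in range(h)]
    cnt = [[0] * w for _ in range(h)]
    y = 0
    while True:
        x = 0
        while True:
            scores = predict(y, x)
            for dy in range(patch):
                for dx in range(patch):
                    acc[y + dy][x + dx] += scores[dy][dx]
                    cnt[y + dy][x + dx] += 1
            if x + patch >= w:
                break
            x = min(x + stride, w - patch)   # clamp the last window to the border
        if y + patch >= h:
            break
        y = min(y + stride, h - patch)
    return [[acc[r][c] / cnt[r][c] for c in range(w)] for r in range(h)]
```

With 50% overlap, interior pixels are covered by several windows, so their scores are genuine averages of multiple predictions, while border pixels may be covered only once.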

Evaluation of the FCN architecture
In this section, the proposed network architecture is evaluated on the original Vaihingen dataset from the ISPRS labelling challenge. The model is trained using the protocol described in section 4.2 with augmentation scenario Ω+. To validate the effect of using the focal loss, a second model is trained using the standard cross-entropy loss ℒce. Two inference protocols (IP) are evaluated. The first (IP1) is the sliding window inference presented in section 4.2. The second protocol (IP2) additionally evaluates vertically and horizontally flipped versions of each image and averages the predictions of corresponding pixels, resulting in a higher redundancy per pixel. Table 2 shows the achieved quality measures, corresponding to those listed on the benchmark website (Wegner et al., 2017).

Table 2. Achieved quality metrics for the Vaihingen benchmark.
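The flip averaging of IP2 can be sketched as follows. `predict` again stands in for the trained FCN; each flipped prediction is un-flipped before averaging, and scalar per-pixel scores replace class distributions for brevity.

```python
def hflip(img):
    """Flip a nested-list image horizontally."""
    return [row[::-1] for row in img]

def vflip(img):
    """Flip a nested-list image vertically."""
    return img[::-1]

def tta_average(image, predict):
    """IP2-style test-time augmentation: evaluate the original, the
    horizontally and the vertically flipped image, undo each flip on the
    prediction and average the per-pixel scores."""
    variants = [(lambda a: a, lambda a: a), (hflip, hflip), (vflip, vflip)]
    h, w = len(image), len(image[0])
    acc = [[0.0] * w for _ in range(h)]
    for forward, inverse in variants:
        pred = inverse(predict(forward(image)))   # undo the flip on the output
        for y in range(h):
            for x in range(w):
                acc[y][x] += pred[y][x] / len(variants)
    return acc
```

For a perfectly flip-equivariant predictor the three votes coincide; in practice they differ slightly, and averaging them increases the redundancy per pixel as described above.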
Having followed the protocol of the ISPRS labelling challenge, i.e. evaluating the model on the reference with eroded class boundaries, the above results can be compared to those achieved by other methods. Currently, the benchmark website lists the best OA as 91.6%; the achieved results are thus slightly worse (~1%). Regarding the loss function, for both IPs the OA is 0.2% higher when minimizing ℒce, but the F1-scores are less balanced: the F1-score of the underrepresented class Car is around 3% higher when minimizing ℒfcl instead. Thus, it is reasonable to use ℒfcl when dealing with unbalanced class distributions.

Evaluation of models before DA
In this experiment, models are trained in a supervised way using the labelled source domain training data of D^S ∈ {P, V, B, S, H} and augmentation variant Ω ∈ {Ω−, Ω+}. The resulting models are evaluated on the test sets of all domains. Evaluating the models on the target domains without any adaptation technique is considered as the baseline when assessing the effectiveness of DA in the further experiments. By training with two different augmentation variants, the initial statement that strong data augmentation can partially alleviate the domain gap is validated. Table 3 shows the resulting OA and MF1. Results printed in bold font correspond to intra-domain (ID) settings, i.e. D^S = D^T. The settings where D^S ≠ D^T are referred to as cross-domain (CD) settings.
In nearly all cases, the quality metrics increase when changing the augmentation strategy from Ω⁻ to Ω⁺. However, in the ID settings a stronger augmentation yields only a minor improvement of ~1% on average for both metrics, while the averaged OA in the CD settings is increased by 8.4% and the average MF1 by 10.5%.
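A minimal sketch of such a training-time augmentation step is given below. The concrete operations (random 90° rotations, flips, brightness/contrast jitter) are plausible assumptions for a "strong" variant, not the paper's exact definition of Ω⁺; the function name `augment` is likewise illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, labels, strong=True):
    """Randomly transform an (H, W, C) patch and its (H, W) label map.

    Geometric operations are applied identically to image and labels;
    the 'strong' variant additionally jitters the radiometry of the
    (already normalized) image channels.
    """
    # Geometric augmentation: random 90-degree rotation and horizontal flip.
    k = int(rng.integers(4))
    image, labels = np.rot90(image, k), np.rot90(labels, k)
    if rng.random() < 0.5:
        image, labels = image[:, ::-1], labels[:, ::-1]
    if strong:
        # Radiometric augmentation: random contrast and brightness shift.
        image = image * rng.uniform(0.8, 1.2) + rng.uniform(-0.2, 0.2)
    return image, labels
```

Note that only the image is perturbed radiometrically; the label map must never be altered by anything but the geometric transforms.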

Evaluation of DA
The models obtained by source training with Ω ∈ {Ω⁻, Ω⁺} are now adapted to the other domains using the proposed DA strategy. Table 4 shows the achieved improvements of OA and MF1 compared to the initial evaluation in table 3; negative transfers are printed in bold font. Because H was used as the target domain to tune the hyper-parameters of the adaptation method, the respective results should be taken with caution.

Ablation study and comparison
In the last experiment, the influence of the proposed loss weighting strategy is investigated. To that end, the source models trained with Ω⁺ are adapted by minimizing ℒent, i.e. the mean entropy without any weighting. Additionally, the direct entropy minimization strategy as proposed in (Vu et al., 2019) is evaluated for comparison. Table 5 shows the resulting metrics, again as differences to the results obtained without adaptation. Negative transfers are printed in bold font.
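The entropy objective, with and without class weighting, can be sketched as below. The paper's exact weighting scheme is not reproduced here; inverse source-class frequencies are only a plausible stand-in, and `entropy_loss` is an illustrative name:

```python
import numpy as np

def entropy_loss(probs, class_weights=None, eps=1e-12):
    """Mean (optionally class-weighted) prediction entropy over all pixels.

    probs: (N, n_cls) soft-max outputs, one row per pixel.
    class_weights: optional (n_cls,) weights, e.g. inverse source-class
    frequencies (an assumption -- not the paper's exact scheme).
    """
    # Per-pixel, per-class entropy contributions: -p * log p.
    ent_terms = -probs * np.log(probs + eps)       # shape (N, n_cls)
    if class_weights is not None:
        ent_terms = ent_terms * class_weights      # up-weight rare classes
    return ent_terms.sum(axis=1).mean()            # scalar loss
```

Without weighting, minimizing this loss pushes every pixel towards a confident prediction for whichever class is currently dominant; the weighting counteracts the tendency of rare classes to be suppressed.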

CONCLUSION
In this paper, an approach for SSDA based on weighted entropy minimization was proposed and evaluated for several adaptation scenarios. The experiments indicate that strong data augmentation can already alleviate the domain-gap significantly. In particular, the average MF1 in cross-domain settings was increased from 61.6% to 72.1%. By applying the proposed adaptation strategy, this metric was further increased to 74.7%. The adaptation approach is considered to yield mostly stable improvements, since only one out of 20 adaptation scenarios resulted in a negative transfer, regardless of the augmentation strategy during source training. In contrast, adaptation without the proposed weighting strategy resulted mainly in negative transfer, indicating that the proposed weighting strategy is necessary when dealing with imbalanced domains. The proposed FCN architecture performs comparably to the state of the art while having one order of magnitude fewer parameters than common architectures like U-Net. Despite the stability of the proposed method, the average cross-domain metrics after adaptation are still ~9% lower than the intra-domain metrics, and seasonal effects were not fully compensated, as shown in the visual evaluation. The results of this work also support the general assumption that training on larger datasets results in better generalizing models.
Future research should analyse whether the proposed method can be combined with other DA methods, e.g. those based on image-to-image translation (Tasar et al., 2019) or domain adversarial training (Wittich and Rottensteiner, 2019). Due to the obviously large impact of proper data augmentation during source training, it further seems reasonable to search for more sophisticated augmentation scenarios that generate a wider range of meaningful augmentations, or to deduce the augmentation parameters from statistical differences between the source and target domains.

Figure 1: Proposed FCN architecture. In a typical scenario with n_ch = 4 input channels and n_cls = 5 classes, the model has about 3.5M parameters, one order of magnitude less than frequently used networks for image classification like U-Net. However, the dilated convolutions in each residual block result in a very large receptive field, in this configuration a theoretical window of over 512×512 px. Since the size of the receptive field is more than two times the input size, all predictions are affected by all input pixels. For instance, the prediction of the class label for the pixel in the lower-left corner is affected by the values of the pixel in the upper-right corner.
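For a stack of stride-1 convolutions, the theoretical receptive field grows by (k − 1)·d per layer with kernel size k and dilation rate d. The helper below and the doubling dilation rates in the test are illustrative assumptions, not the paper's exact configuration, but they show how dilated 3×3 convolutions easily exceed a 512 px window:

```python
def receptive_field(kernel_sizes, dilations):
    """Theoretical receptive field (in pixels, along one axis) of a stack
    of stride-1 convolutions: each layer widens the field by (k - 1) * d."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf
```

For example, two blocks of eight 3×3 convolutions with dilation rates doubling from 1 to 128 yield a receptive field of 1021 px, more than twice a 512 px input, consistent with the observation that every prediction is affected by every input pixel.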
In a pre-processing step, the colour channels of all domains were normalized individually to zero mean and unit standard deviation based on the statistics of each domain. The new value v'_{j,b,d} of band b of pixel j in domain d after normalization is thus computed as v'_{j,b,d} = (v_{j,b,d} - μ_{b,d}) / σ_{b,d}, where μ_{b,d} is the mean and σ_{b,d} the standard deviation of all pixels of band b in domain d. The nDSMs were normalized according to h'_{j,d} = h_{j,d} / u_h with a fixed value of u_h = 5 m to bring them to a value range close to the normalized MSI channels. For the evaluation of the proposed network architecture on the ISPRS labelling challenge, the original version of the Vaihingen dataset with a GSD of 8 cm was used, considering the original classes including Clutter. The channel-wise normalization was carried out as described above.
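The two normalizations above can be sketched directly in NumPy; `normalize_domain` is an illustrative name, and the array layout (N, H, W, B) is an assumption:

```python
import numpy as np

def normalize_domain(msi, ndsm, u_h=5.0):
    """Domain-wise normalization as described in the text.

    msi:  (N, H, W, B) stack of all MSI patches of one domain.
    ndsm: (N, H, W) nDSM patches of the same domain.
    u_h:  fixed height scale in metres (5 m in the paper).
    """
    # Per-band mean and standard deviation over all pixels of the domain.
    mu = msi.mean(axis=(0, 1, 2))
    sigma = msi.std(axis=(0, 1, 2))
    # v' = (v - mu) / sigma per band; h' = h / u_h for the nDSM.
    return (msi - mu) / sigma, ndsm / u_h
```

Because the statistics are computed per domain, each domain is mapped to a comparable value range, which is a prerequisite for the cross-domain experiments.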

Figure 2: Exemplary test set samples and predictions. The first three rows show nDSM, MSI and reference for a 256×500 region of each domain. The remaining rows show predictions after adaptation where D_S ≠ D_T and predictions before adaptation where D_S = D_T.
Vu et al. (2019) propose a one-step approach, where the mixed loss ℒvu = ℒce + λent · ℒent + ℒcp is minimized. Here, ℒcp is an additional loss that penalizes the deviation of the class distribution of each target domain prediction from the source domain class distribution; see (Vu et al., 2019) for further details. The relaxation parameter of ℒcp is set to μ = 0.5 and the entropy weight to λent = 0.001, as proposed by the authors. Although Vu et al. (2019) do not consider any online data augmentation, it is used here to allow for a fair comparison. Further, the training was started after the proposed source training. Training without online augmentation and training from scratch were carried out in additional experiments, not presented for lack of space; both variants resulted in significantly worse results.
Table 5: Improvements in %, achieved by adapting M_S,Ω⁺ to D_T using alternative loss functions.
The adaptation without the proposed class-balancing achieves a slight average improvement in OA of 0.1%, while the MF1 decreases on average by 1.5%. A positive transfer w.r.t. both metrics could only be achieved in 5 out of 20 scenarios. The approach according to Vu et al. (2019) was even less stable, with only 3 cases of positive transfer. The results indicate that both variants perform worse than the proposed method in adaptation scenarios with highly imbalanced classes. Adapting with the proposed strategy without excluding boundary regions and jointly training on labelled source domain samples were also carried out in experiments not listed here, both leading to a slightly worse performance.
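The class-ratio prior term ℒcp can be illustrated as follows. This is only a sketch in the spirit of Vu et al. (2019), not their exact formulation, and `class_prior_loss` is an illustrative name: the batch-averaged predicted class distribution is pulled towards a relaxed version of the source-domain class distribution.

```python
import numpy as np

def class_prior_loss(probs, source_prior, mu=0.5, eps=1e-12):
    """Illustrative class-ratio prior penalty (a sketch, not the exact L_cp).

    probs: (N, n_cls) soft-max outputs per pixel.
    source_prior: (n_cls,) class distribution of the source domain.
    mu: relaxation parameter in [0, 1]; mu = 0 disables the prior entirely.
    """
    pred_dist = probs.mean(axis=0)                       # predicted class ratios
    relaxed = mu * source_prior + (1 - mu) * pred_dist   # relaxed target ratios
    # KL(relaxed || predicted): zero when the ratios already match the prior.
    return float(np.sum(relaxed * np.log((relaxed + eps) / (pred_dist + eps))))
```

With the relaxation, the predicted ratios are only partially pushed towards the source prior, which is what makes the penalty tolerant to moderate distribution shifts between the domains.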

Table 1: Dataset overview. The table shows the size of the training and test sets and the class distribution of each domain.

Table 3: Quality metrics (OA and MF1) in % of non-adapted models, trained on D_S and evaluated on the test set of D_T.

Table 4: Improvements in %, achieved by adapting M_S,Ω to D_T.
do not contain any trees without leaves. Even after adaptation, the models barely detect any of the trees in H, while they are detected when coming from D_S = P, the only other domain captured in autumn.