USING SEMANTICALLY PAIRED IMAGES TO IMPROVE DOMAIN ADAPTATION FOR THE SEMANTIC SEGMENTATION OF AERIAL IMAGES

ABSTRACT: Modern machine learning, especially deep learning, which is used in a variety of applications, requires a lot of labelled data for model training. Having an insufficient amount of training examples leads to models which do not generalize well to new input instances. This is a particularly significant problem for tasks involving aerial images: often training data is only available for a limited geographical area and a narrow time window, thus leading to models which perform poorly in different regions, at different times of day, or during different seasons. Domain adaptation can mitigate this issue by using labelled source domain training examples and unlabelled target domain images to train a model which performs well on both domains. Modern adversarial domain adaptation approaches use unpaired data. We propose using pairs of semantically similar images, i.e., whose segmentations are accurate predictions of each other, for improved model performance. In this paper we show that, as an upper limit based on ground truth, using semantically paired aerial images during training almost always increases model performance with an average improvement of 4.2% accuracy and .036 mean intersection-over-union (mIoU). Using a practical estimate of semantic similarity, we still achieve improvements in more than half of all cases, with average improvements of 2.5% accuracy and .017 mIoU in those cases.


INTRODUCTION
Many applications in a variety of domains rely on automatically extracting useful information from images: person identification, object detection and tracking, defect detection during production, semantic segmentation of images and many more. Semantic segmentation, i.e., assigning a semantically meaningful class label to each pixel in an image, is of particular interest for aerial images. Such a segmentation serves as the basis for applications such as creating and updating maps, tracking city growth over time, or tracking deforestation over time.
Most modern machine learning approaches for computing a semantic segmentation of an image use deep learning techniques. From a collection of training examples, i.e., pairs of images and their respective segmentations, a model is learned. Such models usually consist of tens of millions of learnable parameters, with some of the larger models even reaching into the hundreds of millions of parameters. In other domains, e.g., natural language processing, models with more than a billion parameters exist (Radford et al., 2019).
Training large models requires large, diverse sets of training examples. Too few training examples compared to the number of parameters lead to the model overfitting on the training data. The model merely memorizes all the segmentations used during training. However, for a model to be actually useful in a real application, it has to compute features which generalize well to new images. Furthermore, a model trained to compute a semantic segmentation of large cities such as London or Berlin into buildings, roads and trees will likely perform poorly when used on images from rural areas. Acquiring sufficiently large and diverse datasets is a very time- and labor-intensive process and thus it is expensive and often impractical.
To mitigate this issue, transfer learning, in particular domain adaptation, can be used. The goal of domain adaptation is to reuse knowledge learned from one dataset, the so-called source domain, which consists of sufficiently many training examples, and apply it to another dataset, the so-called target domain, for which little to no training examples are available, i.e., only a set of input images without segmentation is known.
A lot of research on domain adaptation for semantic segmentation using deep learning has been published in recent years. However, the area of focus of most research is the segmentation of street scenes with the goal of improving the semantic data available to autonomous cars. The most popular domain adaptation setting is the transfer of knowledge from the SYNTHIA dataset (Ros et al., 2016) to the Cityscapes dataset (Cordts et al., 2016), both of which consist of street scenes from the point of view of a car. Both datasets have in common that there is a spatial prior for each class: buildings are most likely to appear in the top corners of an image, sky is most likely to appear in the top-center and the road is most likely to appear in the bottom half of an image. These spatial priors, which are visualized for SYNTHIA in Fig. 2 of (Zou et al., 2018), make it more likely for the same pixel position in two random images from either dataset to belong to the same semantic class. Aerial images do not have such spatial priors due to their point of view. Thus, two random street scenes are likely to be semantically similar, while aerial images are not.
Inspired by (Kuhnke, Ostermann, 2019), who use a model's predictions as pseudo ground truth, and by this difference between street scenes and aerial images, we propose using semantically paired images for domain adaptation. In this paper we demonstrate that such a pairing improves the performance of domain adaptation, as shown in Fig. 1. Furthermore, we demonstrate an approach which approximates image pairing based on semantic similarity without using any knowledge of the ground truth segmentation of the target domain. This paper is structured as follows: Sect. 2 explains the theoretical foundations, followed by related work in Sect. 3. Our contribution is then described in Sect. 4 and evaluated in Sect. 5. The paper finishes with a conclusion in Sect. 6.

DOMAIN ADAPTATION
This section explains the foundations necessary for understanding our contribution, while focusing on a deep learning context.

Foundations
The goal of domain adaptation is reusing knowledge learned from one dataset and applying it to another dataset. More precisely, a domain D is a tuple (X, P(X)) where X is a feature space and P(X), with X = {x_1, x_2, ..., x_n}, x_i ∈ X, is a marginal probability distribution. In our case, X is the set of all aerial images and P specifies how likely each image is. Furthermore, there is a task T = (Y, f : X → Y) in a domain adaptation setting, with Y being a label space and f being a function mapping instances x_i ∈ X to their label y_i ∈ Y. Again, in our case, Y is the set of semantic segmentations and f is the model being learned from a set of training examples (x_i, y_i).
In a common supervised machine learning setting there is only one domain D and one task T. However, in domain adaptation, there are two domains and two tasks: the source domain D^S = (X, P(X^S)) with the source task T^S = (Y, f^S) and the target domain D^T = (X, P(X^T)) with the target task T^T = (Y, f^T). The underlying feature space X of both domains is the same, e.g., RGB images, but the distribution P of instances differs between the domains. This difference is called domain shift or domain gap. Also, the same task shall be performed for both domains, e.g., assigning the correct semantic segmentation from Y to each instance x_i^S or x_i^T. However, the actual function f, i.e., the model, used to perform this mapping of instances to elements in Y may differ between domains. Fig. 2 shows a classification task (geometric shape) for two domains (color). It also demonstrates that a model trained on either domain is likely to perform poorly on the other domain.
There are two common domain adaptation settings: in the semi-supervised setting only a small number of training samples (x_i^T, y_i^T) in the target domain are known, while in the unsupervised setting, which is our focus, no target domain training samples are known at all. In both settings, a large amount of source domain training samples (x_i^S, y_i^S) is readily available. Deep learning-based domain adaptation approaches can be classified based on what part of the chain y_i = f_dec(f_enc(x_i)) they affect. Appearance adaptation or instance transfer is the transformation of source instances x_i^S s.t. they look like instances drawn from P(X^T), or vice versa. We denote such transformed source instances as x_i^{S→T}. In feature level adaptation the goal is to map instances x_i^S and x_i^T to a common intermediate feature space X_int shared by both domains. Domain adaptation through parameter transfer means reusing some of the, potentially transformed, parameters of the model f^S in the model f^T. Which parts of a model f are affected by the different kinds of domain adaptation approaches is shown in Fig. 4.

Adversarial Training
The currently most popular approaches for unsupervised domain adaptation of deep learning models are based on adversarial training. The idea comes from so-called Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). In a GAN, a generator model attempts to create artificial data, e.g., images, from noise, while a discriminator model tries to distinguish between real data and artificial data. Properly training these two adversarial models results in a generator model which is capable of transforming data drawn from one distribution (noise) into data drawn from another distribution (images).

Figure 5. Partial Cycle-GAN approach for appearance adaptation. The generator G^{S→T} tries to fool the discriminator, i.e., ideally the set of generated instances X^{S→T} looks indistinguishable from X^T.
In a Cycle-GAN, visualized in Fig. 5, two generator models and two discriminator models are used for appearance adaptation. The generator G^{S→T} maps source instances x_i^S to the target domain's marginal probability distribution P(X^T), while the generator G^{T→S} performs the mapping in the opposite direction. The generators are trained s.t. cycle consistency holds, i.e.,

G^{T→S}(G^{S→T}(x_i^S)) ≈ x_i^S and G^{S→T}(G^{T→S}(x_j^T)) ≈ x_j^T.

Additionally, the generators must generate outputs x_i^{S→T} and x_j^{T→S} which the two discriminators cannot distinguish from real data of the respective domain.
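The cycle-consistency constraint above can be sketched as a loss term. The sketch below uses the L1 norm, as in the original Cycle-GAN formulation; the function name and the generator callables g_st and g_ts (standing in for G^{S→T} and G^{T→S}) are our own illustrative choices, not from the paper.

```python
import numpy as np

def cycle_consistency_loss(x, g_st, g_ts):
    """Mean L1 cycle-consistency loss: mapping an instance to the other
    domain and back should reproduce the input, ||G_TS(G_ST(x)) - x||_1."""
    return float(np.mean(np.abs(g_ts(g_st(x)) - x)))
```

With perfect (identity) generators the loss is zero; any residual distortion of the round trip is penalized proportionally.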
In the DANN approach (Ganin et al., 2016) for feature level adaptation, a discriminator tries to distinguish instances from both domains in the intermediate feature space. The approach is shown in Fig. 6. A gradient reversal layer between the discriminator and the two encoder models f_enc^S and f_enc^T ensures that, while the discriminator gets increasingly better at distinguishing the instances, the encoder models learn to map to a shared intermediate feature space X_int. Thus, eventually it becomes impossible to tell the instances from the two domains apart. When this point is reached, f_dec^S can be used as a decoder for both encoders, in particular for f_enc^T.
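The gradient reversal layer is the identity in the forward pass and flips the sign of the incoming gradient (scaled by a factor λ) in the backward pass. A minimal sketch with the two halves made explicit, assuming manual backpropagation; LAMBDA and the function names are our own, not from (Ganin et al., 2016):

```python
LAMBDA = 1.0  # reversal strength; a hypothetical choice

def grl_forward(x):
    # Forward pass: identity, features flow to the discriminator unchanged.
    return x

def grl_backward(grad_output):
    # Backward pass: flip the gradient sign (scaled by LAMBDA), so the
    # encoder is updated to *fool* the discriminator rather than help it.
    return -LAMBDA * grad_output
```

In an autograd framework this pair would be wrapped in a single custom layer (e.g., a custom autograd function in PyTorch); the sketch only makes the forward/backward asymmetry explicit.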

RELATED WORK
Modern domain adaptation approaches are based on adversarial training, usually combining the DANN approach (Ganin et al., 2016) and the Cycle-GAN approach. Notable approaches combining the two are CyCADA (Hoffman et al., 2017) and CrDoCo (Chen et al., 2019b). Additionally, they enforce cross-domain consistency, i.e., the prediction for an input instance should be the same regardless of which domain said instance currently appears to belong to.
(Kuhnke, Ostermann, 2019) follows a similar approach to ours, but for head pose estimation, i.e., a regression task rather than a semantic segmentation task. They use a model pre-trained on the source domain to generate predictions for the target domain. Then, they sample from each domain during a DANN-like domain adaptation training s.t. the label distributions are similar for samples from both domains. We also propose using the prediction of a pre-trained model. (Tsai et al., 2019) cluster their input images based on the source domain class distribution and use adversarial training to implicitly map each target image to a cluster. This auxiliary mapping task causes the encoders to learn features representing clusters instead of features representing domain-specific details. We propose using an explicit mapping to be able to compose minibatches in a specific way. Also, our similarity metric is stronger than just comparing class distributions.
Other related works use the maximum mean discrepancy (MMD) instead of a discriminator for domain adaptation and related tasks such as source selection. Notable works include (Vogt et al., 2018), which does not rely on deep learning at all, but still is able to reach competitive adaptation performance in terms of the traditional machine learning model used for classification. (Long et al., 2018) use an MMD-based loss function instead of a discriminator in an otherwise DANN-like approach.
While parameter transfer, other than weight sharing, is rather rare in modern approaches, (Rozantsev et al., 2018) explicitly model the domain shift as a linear function mapping the weights of one domain's model to the other domain's model.

SEMANTIC SIMILARITY
This section formalizes our contribution, a metric for measuring the semantic similarity of two instances xi and xj and an approach for approximating this metric for instance pairs for which the ground truth segmentation is unknown for one of the instances. Our contribution is thoroughly evaluated in an empirical manner in Sect. 5.

Semantic Similarity Metric
As explained in the introduction, the class-wise spatial priors of the SYNTHIA and Cityscapes datasets mean that two random images drawn from either dataset are likely to contain a high number of pixel positions which have the same semantic class in both images. This is what we intuitively understand as semantically similar images, and we formalize this intuition as a metric.
Note: For brevity, we often write just xi instead of (xi, yi). The context will make it clear whether we just use xi or also its ground truth segmentation yi.
Definition: Let x_i and x_j be two instances with their respective segmentations y_i : L → C and y_j : L → C, with L being the common set of all spatial locations (pixel positions) in both instances and C being the common set of all semantic classes. The semantic similarity SemSim of these instances is

SemSim(x_i, x_j) = (1 / |L|) · |{l ∈ L : y_i(l) = y_j(l)}|.

This is the same definition as used for accuracy. In other words, by our definition two instances are semantically similar if the segmentation of one instance is an accurate prediction of the segmentation of the other instance.
The semantic similarity is a metric which assigns values from the interval [0, 1] to every pair of instances x_i and x_j. Higher values indicate higher similarity, with a value of 1 being assigned iff the segmentations y_i and y_j are identical. A high similarity also implies a high overlap in the class probability distributions of both segmentations. If a class is likely to appear in one segmentation of two semantically similar instances, then it must also be likely to appear in the other segmentation, otherwise the semantic similarity would be low. These class probability distributions are identical if the similarity is 1. However, the opposite is not true: two segmentations can have the same class probability distribution while their similarity is 0. Imagine a segmentation y_i which is partitioned into two segments of equal size, each with a different class. A segmentation with flipped labels will not be similar by our definition, but the class probability distribution will be identical.
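The definition and the flipped-label counterexample above can be sketched in a few lines; the function name sem_sim and the toy 2×2 label maps are our own illustration, assuming segmentations are stored as integer label arrays.

```python
import numpy as np

def sem_sim(y_i, y_j):
    """Semantic similarity of two segmentations: the fraction of spatial
    locations l in L that carry the same class label in both instances."""
    assert y_i.shape == y_j.shape
    return float(np.mean(y_i == y_j))

# Counterexample from the text: two equal-size segments with flipped labels.
y_i = np.array([[0, 0], [1, 1]])  # top half class 0, bottom half class 1
y_j = np.array([[1, 1], [0, 0]])  # same class distribution, every label flipped
```

Here sem_sim(y_i, y_j) is 0 although both label maps contain each class with probability 0.5, illustrating that identical class distributions do not imply semantic similarity.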

Semantic Similarity Approximation
We will show in Sect. 5 that, in an unsupervised domain adaptation scenario, it is desirable to create pairs of instances x_i^S and x_j^T s.t. SemSim(x_i^S, x_j^T) is maximized over all x_j^T ∈ X^T for any given x_i^S ∈ X^S. Using these pairs together in mini-batches to train a deep learning model will result in an improved model. However, to compute SemSim(x_i^S, x_j^T), as defined in the previous subsection, their respective ground truth segmentations y_i^S and y_j^T must be known. As the latter is unknown, in fact, our very goal is to learn a model which can predict y_j^T well, we need an approximation of SemSim(x_i^S, x_j^T) in our setting.
To approximate SemSim, we use a semantic segmentation model f^S learned using the source training samples (x_i^S, y_i^S), as inspired by (Kuhnke, Ostermann, 2019). The output tensor ŷ = f^S(x) of the model is itself a function ŷ : L × C → [0, 1] which assigns class probabilities to every location l ∈ L, s.t. Σ_{c∈C} ŷ(l, c) = 1 ∀l ∈ L. Some models require an additional Softmax layer, which does not have any parameters which need to be learned, to map class scores to class probabilities.
With the model f^S we compute the approximation SemSim* using the equation

SemSim*(x_i^S, x_j^T, f^S) = (1 / |L|) Σ_{l∈L} ŷ_j^T(l, y_i^S(l)),

with ŷ_j^T = f^S(x_j^T) and y_i^S being the ground truth segmentation of x_i^S. SemSim* is the mean of the probabilities assigned to the ground truth classes y_i^S(l).
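A minimal sketch of this approximation, assuming the model output for the target patch is already a per-location softmax probability tensor of shape (H, W, C); the function name is our own.

```python
import numpy as np

def sem_sim_approx(y_src, probs_tgt):
    """Approximate semantic similarity: the mean probability the model
    assigns, at each location of the target patch, to the class given by
    the source ground truth at that same location.
    y_src:     (H, W) integer ground-truth labels of the source patch.
    probs_tgt: (H, W, C) softmax output of f^S for the target patch."""
    h, w, _ = probs_tgt.shape
    rows, cols = np.indices((h, w))
    # Fancy indexing picks probs_tgt[l, y_src(l)] for every location l.
    return float(np.mean(probs_tgt[rows, cols, y_src]))
```

Because the softmax probabilities enter directly, the score degrades smoothly with model uncertainty instead of flipping whole regions between correct and incorrect, which is exactly the confidence-awareness the text motivates below.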

Mini-Batch Composition
During model training, we do not actually compute SemSim or SemSim*. Rather, we propose pre-computing a mapping M : X^S → X^T. Then, whenever we put an x_i^S ∈ X^S into a mini-batch during training, we also put M(x_i^S) ∈ X^T into the same mini-batch.
We compute the mapping M : X^S → X^T using the equation

M(x_i^S) = argmax_{x_j^T ∈ X^T} SemSim*(x_i^S, x_j^T, f^S).

Using SemSim* is very similar to using the predictions f^S(x_j^T), x_j^T ∈ X^T, as pseudo ground truth segmentations; however, we explicitly take the model's confidence into account. This is done to avoid small changes in f^S or x_j^T, e.g., noise, causing drastic changes in SemSim*(x_i^S, x_j^T, f^S) due to uncertainty in large regions L* ⊆ L causing the labels in said regions to flip to or from the correct ground truth class.
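The pre-computation of M can be sketched as a brute-force argmax over all target patches (O(|X^S| · |X^T|) similarity evaluations); the helper below is self-contained and our own illustration, with source patches given as label arrays and target patches as pre-computed softmax tensors.

```python
import numpy as np

def compute_mapping(src_labels, tgt_probs):
    """Map each source patch index to the index of the target patch that
    maximizes the approximated semantic similarity SemSim*.
    src_labels: list of (H, W) integer ground-truth label arrays.
    tgt_probs:  list of (H, W, C) softmax tensors f^S(x_j^T)."""
    mapping = {}
    for i, y in enumerate(src_labels):
        rows, cols = np.indices(y.shape)
        scores = [float(np.mean(p[rows, cols, y])) for p in tgt_probs]
        mapping[i] = int(np.argmax(scores))
    return mapping
```

The resulting dictionary is computed once before training and then used verbatim for mini-batch composition, matching the fixed-mapping setup described in the evaluation.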

EVALUATION
As a second part of our contribution, we evaluate our proposed metric and its approximation in a thorough empirical evaluation. As evaluation metrics, we use accuracy and mean Intersection-over-Union (mIoU). Though these metrics are commonly used, we still include their definitions as a reminder in the appendix.

Datasets
We used five datasets, each dataset associated with a different German city. We used the two ISPRS 2D Semantic Labeling Benchmark Challenge datasets consisting of images from Vaihingen and Potsdam. Furthermore, we used images of Hannover, Buxtehude and Nienburg. The images are true orthophotos of the respective cities. Tab. 1 contains details about each dataset. To make the datasets comparable in terms of sensor data we resampled Vaihingen and Potsdam to a ground sampling distance of 20cm/pixel. Furthermore, we did not use any depth information as it was unreliable for at least one domain (Hannover) and we did not use the blue channels as no such information is available for Vaihingen. From Vaihingen and Potsdam we only used the 16 images initially released as training set for the respective benchmark challenges. The images of Hannover, Buxtehude and Nienburg were divided into 16 patches of 2500×2500 pixels. One such Hannover patch was omitted due to it containing almost exclusively one class (tree). The images or image patches of each dataset were randomly partitioned into a training set and a validation set by drawing ten images as training images and using the remaining images as validation images. The ground truth segmentations of each dataset include six classes: impervious surfaces, buildings, low vegetation, tree, car, and clutter/background. Their distribution is shown in Fig. 7. Hannover, as the largest city out of the five, has more buildings than the others. The distribution of Vaihingen is closer to that of Potsdam, despite Vaihingen being closer to Buxtehude and Nienburg in size. In the latter two, low vegetation is more common than in the other datasets. While the classes car and clutter are rare in all datasets, Potsdam has significantly more clutter than the other datasets.

Test Setup
Figure 7. Class distribution of the datasets

We created actual training sets for each dataset by randomly cropping 224×224 image patches from the training images. We used random translations, rotations and shearing for data augmentation. At image boundaries, we used reflection padding. We drew 800 image patches from each of the ten images, resulting in 8000 image patches as the training set for each dataset/domain. The validation image patches were created by sliding a 224×224 window with stride 224×224 over the validation images. At the right-hand and bottom boundaries of each image we moved the sliding window inwards s.t. it fits entirely inside the image, creating a small overlap region with the previous window. Thus, we avoided interpolation and padding for the validation image patches.
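The inward shift at the right-hand and bottom boundaries can be sketched for one dimension as follows; the helper name is our own, and with stride equal to the window size this reproduces the non-overlapping validation tiling described above.

```python
def window_origins(size, window, stride):
    """1-D origins of a sliding window over an axis of length `size`.
    If the last regular window would not reach the boundary, an extra
    window is shifted inwards so it fits entirely inside the image,
    overlapping the previous one instead of requiring padding."""
    origins = list(range(0, size - window + 1, stride))
    if origins[-1] + window < size:
        origins.append(size - window)
    return origins
```

For a 500-pixel-wide image with 224×224 windows and stride 224, this yields origins 0, 224 and 276, the last window overlapping the second by 172 pixels.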
As explained in Sect. 4.3, we use mappings M^{S→T} : X^S → X^T to compose mini-batches for training deep learning models. During any model training we only used the segmentations y_i^S of the source domain.
For every pair of domains we computed three mappings:
• Random: We mapped each x_i^S to a random x_j^T.
• Ground Truth (GT): We created a mapping based on Eq. 3, but used SemSim instead of SemSim*. For that, through our data augmentation process, we created 16000 image patches per target image instead of just 800, in order to find a better mapping.
• Approximation: This mapping is similar to the Ground Truth mapping; however, we actually used Eq. 3, i.e., no target ground truth was required.
All mappings were computed once and then stayed fixed throughout all our experiments.
For the domain pair Hannover (source) and Vaihingen (target) we created three additional mappings, two of which are based on the MMD. The MMD is a measure of how similar two probability distributions are. We used the estimate described in (Vogt et al., 2018) to compute it. The additional mappings are:
• Shuffled Ground Truth: Based on the Ground Truth mapping M^{S→T}_GT, this mapping assigns a random x_j^T ∈ M^{S→T}_GT(X^S) to each x_i^S, s.t. M^{S→T}_GT(X^S) is re-used in its entirety. The idea is to have a mapping with the same overall class distribution as the Ground Truth mapping, but with a randomized mini-batch composition.
• Image space MMD: For this mapping we treated each pixel in each image patch x_i^S or x_j^T as a sample in terms of the MMD. The mapping creates pairs which minimize the MMD between paired image patches.
• Feature space MMD: We used a model f^S, comprised of an encoder f_enc^S and a decoder f_dec^S, to create feature tensors z_i^S = f_enc^S(x_i^S) and z_j^T = f_enc^S(x_j^T). Treating each spatial location in each such tensor as a sample, we computed the MMD between feature tensors to create pairs (x_i^S, x_j^T = M^{S→T}(x_i^S)), again minimizing the MMD between paired data.
These additional mappings were used to investigate whether SemSim* offers competitive performance and whether having a similar class distribution across the entire dataset is sufficient or a similar distribution across each mini-batch (Shuffled Ground Truth vs. regular Ground Truth mapping) is beneficial.
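The MMD between two sample sets can be estimated with a kernel-based formula. The sketch below uses a common biased RBF-kernel squared-MMD estimate; it is not necessarily the exact estimator of (Vogt et al., 2018), and the function name and gamma default are our own choices.

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """Biased squared-MMD estimate with RBF kernel k(a,b) = exp(-gamma * ||a-b||^2).
    x: (n, d) samples from the first distribution (e.g., pixels of one patch).
    y: (m, d) samples from the second distribution."""
    def kernel_mean(a, b):
        # Mean of the full kernel matrix between sample sets a and b.
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.mean(np.exp(-gamma * d2))
    return kernel_mean(x, x) + kernel_mean(y, y) - 2.0 * kernel_mean(x, y)
```

An MMD-based mapping then pairs each source patch with the target patch minimizing this value, analogous to the argmax over SemSim* but comparing sample distributions instead of per-location labels.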
As a semantic segmentation model we used DeepLabv3+ with MobileNetv2 (Sandler et al., 2018) as a backbone. We chose the same configuration for the spatial pyramid pooling (SPP) as used by the authors in their reference implementation. Furthermore, we used a large feature space, i.e., we only used an overall downsampling factor of 8 for each dimension. This model was also used for computing SemSim* and the feature space MMD mapping. As a feature space, we picked the output of the 1×1 convolution which combines the concatenated SPP branches into a single feature space with 256 channels per spatial location. To speed up training, we used weights from a pre-training on ImageNet. We then trained the model for an additional 80 epochs with 250 mini-batches, each of size 32, per epoch. We used Adadelta (Zeiler, 2012) with default parameters as optimizer.
As a DANN-like adversarial training approach (Ganin et al., 2016) for feature level adaptation, we took DeepLabv3+ and added a discriminator comprised of three inverted bottlenecks (Sandler et al., 2018). The discriminator assigned a class (source or target) to each spatial location of a feature tensor. We used a shared encoder f_enc for both domains. In each training iteration we first trained f_enc and f_dec to perform a semantic segmentation on the source domain, then we froze the weights of f_enc to train the discriminator using cross-entropy loss. As a third and final step, we froze the discriminator weights and trained f_enc with flipped class labels (source became target and vice versa). This imitates having a gradient reversal layer. Again, we used Adadelta optimizers (one for each step) and trained for 80 epochs with 250 mini-batches each. We had to reduce the batch size to 16, using only a fixed subset of X^S.
For appearance adaptation we used the Cycle-GAN architecture. However, as a final activation for the generators, we used ReLU instead of tanh. We also removed the Sigmoid activation from the discriminators and used mean squared error loss instead of cross-entropy (Mao et al., 2017). Additionally, we performed identity loss training every ten mini-batches (Taigman et al., 2016). We used λ = 10 for the reconstruction/cycle-consistency loss weight and λ = 1 for all other loss weights. We trained the Cycle-GAN for 200 epochs with 250 mini-batches each. The mini-batch size was set to just 10, again forcing us to use only a fixed subset of our datasets for training. Adam (Kingma, Ba, 2014) was used as optimizer for all parameters. After training the Cycle-GAN, we used the adapted source image patches x_i^{S→T} to train a semantic segmentation model. This model was then evaluated on the target validation image patches.

Results
We ran all experiments ten times. The reported numbers in this section are mean values calculated on the target validation sets. Table 2 shows the baseline performance of DeepLabv3+ on our datasets. The model performs well in terms of accuracy; however, accuracy and mIoU together indicate that it performs better on abundant classes than on rare classes. Fig. 8 shows the performance loss when using a model trained on one domain to perform a segmentation of another domain, as opposed to both domains being the same. 13% to 67% accuracy is lost in such cases. Notable domains are Potsdam, which performs poorly as a source domain, Hannover, which performs poorly as a target domain, and the pair Buxtehude and Nienburg, whose models, even without any domain adaptation, already work relatively well for the other domain.
Next, we investigated the pair Hannover (source domain) and Vaihingen (target domain) more closely for a comprehensive comparison of the mapping strategies using the DANN-like approach. The similarity scores for the different mapping strategies are shown in Table 3. Even the Ground Truth strategy often cannot guarantee a perfect match; however, it still has the highest score. The second highest score is achieved by our Approximation strategy, with the other strategies lagging behind. Fig. 9 shows our SemSim*-based Approximation strategy (blue curve in all graphs) performing significantly better than Random or either of the MMD-based strategies. However, in almost all curves a slight downward trend can be observed as model training goes on. This implies that a shorter training period would likely result in a better model; however, the exact epoch after which to stop training cannot be determined in a real-world application due to the lack of target ground truth.
Furthermore, we observe that using a better semantic pairing (GT strategy) results in even better performance and that class distribution similarity only (Shuffled GT strategy) still provides a benefit, but less so than semantic similarity. So while the Approximation strategy already increases performance, there is still room for improvement. A limitation of this experiment is that it cannot answer whether aligning the mini-batch class distribution is already sufficient or whether having the same classes at the same pixel locations, as SemSim and SemSim * are designed to favor, brings an additional benefit.
While the GT strategy is better than Shuffled GT in terms of accuracy, they appear to be similar in terms of mIoU. As Table 4 shows, this is due to the GT strategy performing better on abundant classes such as impervious surfaces and buildings, while Shuffled GT performs significantly better on the rare class clutter.

We performed DANN-like domain adaptation for all 20 possible source and target domain pairs using the strategies Random, GT and Approximation. Our Approximation strategy achieved a mean SemSim score of .447 (std. σ = .212), lagging behind the Ground Truth strategy with a mean score of .667 (σ = .13) but being far ahead of the Random strategy, which reached .24 (σ = .125). Examples of the accuracy and mIoU loss, compared to a model trained using the target domain ground truth, are shown in Fig. 11. Domain adaptation in general already mitigates some of the loss, with our Approximation strategy usually mitigating the loss even further. Sample images of actual segmentations are shown in Fig. 10.
Over all 20 possible domain pairs, the GT strategy improves accuracy and mIoU in 19 instances when compared to the Random strategy. When there is an improvement, the average improvement is 4.2% accuracy and .036 mIoU. The Approximation strategy, again compared to the Random strategy, improves accuracy in 11 instances (average improvement 2.5%) and mIoU in 13 instances (average improvement .017). This shows that semantically paired image patches improve domain adaptation performance. Our proposed approach already brings a measurable performance improvement while not quite reaching the full potential (GT strategy).
As Cycle-GAN-based domain adaptation is much slower, we only performed six experiments using it. We observed accuracy and mIoU improvements in only three cases. In those cases, the average improvements over the Random strategy were .88% accuracy and .009 mIoU (GT strategy) and .64% accuracy and .003 mIoU (Approximation strategy). These improvements are below the observed standard deviations for each experiment; thus our approach likely has no significant effect on appearance adaptation. However, the GT strategy achieved improvements of 1.7% accuracy and .016 mIoU in one case which used Potsdam, a particularly problematic source domain, as the source domain. Thus, there may still be a benefit for difficult domain adaptation problems. Since modern domain adaptation approaches are based on both appearance adaptation and feature level adaptation, SemSim/SemSim*-based mini-batch composition may still improve the performance of these approaches.

CONCLUSION
In this paper we introduced the concept of semantically similar images: two images are semantically similar if the semantic segmentation of one image is an accurate prediction of the semantic segmentation of the other image. We furthermore proposed a way to estimate the similarity of an image pair if only the segmentation of one of the images is known. Additionally, we demonstrated that adversarial domain adaptation methods show improved performance when composing mini-batches s.t. there is a semantically similar target domain image for each source domain image in the mini-batch. Using a DANN-based approach for domain adaptation, we saw an improvement in 19 out of 20 instances (average improvement: 4.2% accuracy and .036 mIoU) when computing the similarity using the target domain ground truth segmentation. Using our estimate of the semantic similarity, we saw accuracy improvements in 11 out of 20 instances (average improvement in those instances: 2.5%) and mIoU improvements in 13 out of 20 instances (average improvement in those instances: .017). Using a Cycle-GAN-based approach instead, we saw no significant improvements aside from one experiment, which used a particularly difficult source domain. This indicates that semantic image pairing helps especially with difficult domain adaptation problems. While the maximum performance increase (i.e., when using target ground truth) offered by semantic pairing is almost always positive, our estimate does not quite reach this limit yet; however, it already is beneficial on average.
In the future, we want to investigate open questions such as whether the pairing also improves performance of combined approaches, e.g., CyCADA or CrDoCo, and whether having mini-batch class distributions that match for both domains is sufficient already. Another open question is whether the ideal point in time when to stop training a model can be estimated, as a downward trend in performance has been observed when a model is trained for too long. Furthermore, preliminary tests have shown that semantic pairing makes the DANN approach more robust w.r.t. discriminator complexity, which may mean that it improves hyperparameter setting robustness in general.