CLASSIFICATION UNDER LABEL NOISE BASED ON OUTDATED MAPS

Supervised classification of remotely sensed images is a classical method for change detection. The task requires training data in the form of image data with known class labels, whose manual generation is time-consuming. If the labels are acquired from an outdated map, the classifier must cope with errors in the training data. These errors, referred to as label noise, typically occur in clusters in object space, because they are caused by land cover changes over time. In this paper we adapt a label noise tolerant training technique for classification so that it takes into account the fact that changes affect larger clusters of pixels. We also integrate the existing map into an iterative classification procedure to act as a prior in regions which are likely to contain changes. Our experiments are based on three test areas, using real images with simulated existing databases. Our results show that this method helps to distinguish between real changes over time and false detections caused by misclassification and thus improves the accuracy of the classification results.


INTRODUCTION
The updating of topographic databases (referred to as maps for brevity) is typically based on a classification of current remote sensing imagery. Comparing the results to the map, areas of change can be detected and the map can be updated accordingly. Supervised classification is commonly used for that purpose, requiring representative training data that are typically generated in a time-consuming manual process. The latter could be avoided by using the existing map to derive the class labels of the training samples. As the map may be outdated, classifiers using the class labels derived from the map for training must take into account the fact that some of these labels will be wrong. Nevertheless, changes typically only affect a relatively small part of a scene, so that one can assume the majority of the training data to be correct.
In machine learning, errors in the class labels of training data are referred to as label noise (Frénay and Verleysen, 2014). In remote sensing, the problem has mostly been dealt with by data cleansing, i.e. by detecting and eliminating wrong training samples, e.g. (Radoux et al., 2014). An alternative is to use probabilistic methods for training under label noise which also estimate the parameters of a noise model. An example of such an approach is the label noise tolerant logistic regression (Bootkrajang and Kabán, 2012), which has been applied successfully in the context of remote sensing in (Maas et al., 2016). However, the underlying noise model of that technique assumes wrong labels to occur at random positions in the image. This is not a very realistic model for change detection, where changes typically occur in clusters, e.g. due to the construction of a new building, and the mismatch may degrade the classification performance.
Using the existing map has another potential benefit. As change is usually a rare event, the existing class labels can be seen as providing observations for the prediction of the new class labels. This may be particularly useful in areas where the classifier cannot distinguish the classes well using the given features, e.g. at object borders. The corresponding probabilities for the classes to be correct are related to the probability of observing a wrong label and, thus, to the parameters of a probabilistic noise model (Bootkrajang and Kabán, 2012). However, such an assumption again neglects the fact that changes typically occur in compact clusters. It would typically lead to a strong bias for maintaining the class label of the map, which is desired in areas without changes, but may limit the prospects of detecting real changes.
In this paper, we propose a new supervised classification method that tries to extract as much benefit as possible from the availability of the existing map. Firstly, our method uses the class labels from the map for training. This is achieved by expanding the method by Bootkrajang and Kabán (2012) to take into account that changes typically occur in clusters, which we expect to improve the results in scenes with a large amount of change. Secondly, the class labels of the existing map are included as observations in a classification procedure based on Conditional Random Fields (CRF). We propose an iterative procedure to reduce the impact of the observed class labels in compact areas that are likely to have changed, which we expect to improve the classification results in areas of weak features without affecting the detection of real changes too much. For evaluation we use three data sets with different degrees of simulated changes.

RELATED WORK
Of the basic strategies for change detection identified in Jianya et al. (2008), we apply the one in which changes are inferred from differences between the independent classification of a current image and the existing map, because no sensor data are assumed to be available for the time of the acquisition of the existing map. Nevertheless, the existing map will be integrated into the classification process, which is also the focus of this paper.
For the reasons pointed out in Section 1, a training procedure taking the class labels of the training samples from an existing map must cope with label noise. Frénay and Verleysen (2014) differentiate three types of statistical models for label noise. The noisy completely at random (NCAR) model does not consider dependencies between label noise and other variables. In the noisy at random (NAR) model, the probability of an error depends on the class label. If the dependencies between labelling errors and the observed data are considered, the model is called noisy not at random (NNAR). This would be an appropriate choice in our case to model that label noise typically occurs in clusters in image space. We do not build a NNAR model explicitly, but we use one implicitly by an iterative strategy for reducing the impact of training samples forming clusters of potentially changed pixels. Existing NNAR models tend to analyse the distributions of the training samples in feature space, e.g. assuming label noise to occur more likely near the classification boundaries or in low-density regions (Sarma and Palmer, 2004). Apart from being drawn from another domain than image classification, this is not a model of local dependencies between label noise at neighbouring data sites.
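To make the difference between the NCAR and NAR models concrete, the following minimal sketch simulates both by drawing noisy labels from a transition matrix. The function names `ncar_matrix` and `add_label_noise` as well as the example matrix are illustrative assumptions and are not taken from the cited works.

```python
import numpy as np

def ncar_matrix(K, rho):
    """NCAR: every label flips with the same probability rho, uniformly
    to one of the other K-1 classes (rows sum to 1)."""
    gamma = np.full((K, K), rho / (K - 1))
    np.fill_diagonal(gamma, 1.0 - rho)
    return gamma

def add_label_noise(labels, gamma, rng):
    """Draw a noisy label for each true label k from row gamma[k].
    With a general row-stochastic gamma this corresponds to a NAR model,
    because the error probability depends on the true class."""
    noisy = np.empty_like(labels)
    K = gamma.shape[0]
    for k in range(K):
        idx = np.flatnonzero(labels == k)
        noisy[idx] = rng.choice(K, size=idx.size, p=gamma[k])
    return noisy

# Illustrative NAR matrix (assumed values): class 0 is rarely affected,
# class 1 is frequently confused with class 2.
nar_example = np.array([[0.98, 0.01, 0.01],
                        [0.05, 0.65, 0.30],
                        [0.05, 0.05, 0.90]])
```

Neither model places the wrong labels in spatial clusters; both draw each label independently of its neighbours, which is the limitation discussed above.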
Frénay and Verleysen (2014) distinguish three strategies for dealing with label noise. First, classifiers that are robust to label noise by design can be used, e.g. random forests, but they still may have difficulties with large amounts of label noise (Maas et al., 2016). The second strategy tries to remove training samples affected by label noise from the training set. Such data cleansing methods have been criticised for eliminating too many instances (Frénay and Verleysen, 2014). The third option is to use a classifier which is tolerant to label noise. In this context, probabilistic approaches learn the parameters of a noise model along with the classifier in the training process; examples are (Bootkrajang and Kabán, 2012), using logistic regression as the base classifier, and (Li et al., 2007), presenting a method based on the kernel Fisher discriminant. An example of a non-probabilistic approach is the label noise tolerant version of a Support Vector Machine (SVM) (An and Liang, 2013). However, non-probabilistic methods typically do not estimate the parameters of a noise model, e.g. transition probabilities containing the probability for the observed label to be affected by a change (Bootkrajang and Kabán, 2012), which may be used as temporal transition matrices in our application, linking the observed class labels of the map to class labels at the second epoch (Schistad Solberg et al., 1996).
In the domain of remote sensing, classification under label noise seems to be based on data cleansing in most cases. An example is (Radoux et al., 2014), who include two techniques for eliminating outliers to derive training data from an existing map. The first technique removes training samples near the boundaries of land cover types and the other one removes outliers based on a statistical test, assuming a Gaussian distribution of spectral signatures. As the method was designed for data of 300 m ground sampling distance (GSD), its model assumptions, e.g. Gaussian distributions, cannot be used directly for high resolution images. A similar method was used for map updating in (Radoux and Defourny, 2010), using kernel density estimation for deriving probability densities. Another data cleansing method is reported in (Jia et al., 2014). Similarly to the method proposed in this paper, all pixels from an existing map are used for training and the resulting label image is compared to the existing map to detect changes. However, no parameters of a model for label noise are estimated in the training process. This is also true for the data cleansing method based on SVM reported in (Büschenfeld, 2013), who eliminate training samples that are assigned to another class than indicated by the given map or that show a high uncertainty. Label noise tolerant training using maps for deriving training data was done by Mnih and Hinton (2012). They propose two loss functions tolerant to label noise to train a deep neural network, but their method only deals with binary classification problems. Bruzzone and Persello (2009) include information of the pixels in the neighbourhood of the training samples in the learning process to achieve robustness to label noise in a context-sensitive semi-supervised SVM. Although the authors argue that such a strategy can be used to integrate existing maps for training, this is not shown explicitly.
In (Maas et al., 2016), we applied label noise tolerant logistic regression (Bootkrajang and Kabán, 2012) to use an existing map for training, integrating it into a CRF for context-based classification. The experiments showed that the method is tolerant to a large amount of label noise if it is randomly spread over the image, as would be expected for a method based on a NAR model. However, experiments with more realistic changes were only shown with a small percentage of wrong training labels, and the class labels from the existing map were not used in the classification process. The latter was done by Schistad Solberg et al. (1996), who applied a temporal model based on transition probabilities to include an outdated land cover map in multitemporal classification, but no local dependencies between changes were considered. In this paper we want to expand our previous work (Maas et al., 2016) by considering the fact that changes occur in clusters. Label noise tolerant logistic regression (Bootkrajang and Kabán, 2012) is applied in an iterative procedure in which the impact of training samples in areas of potential change is reduced, while these samples are not completely eliminated. To consider local context, the resultant classifier is integrated in a CRF, in which we also consider the original class labels as additional observations. In contrast to (Schistad Solberg et al., 1996), the influence of these observations may change in the course of an iterative process if a pixel is situated in a large cluster of potentially changed pixels, so that temporal oversmoothing (Hoberg et al., 2015) can be avoided. Our method can be seen as a combination of "soft" data cleansing (because samples are not eliminated completely) with a probabilistic noise model for including the observed labels from the map. Thus, we expect to be able to cope with a larger amount of real change than our previous method.

LABEL NOISE TOLERANT CHANGE DETECTION
We assume remotely sensed data and an existing but outdated raster map to be available on the same grid. The data consist of N pixels, each pixel n represented by a feature vector $\mathbf{x}_n = [x_n^1, ..., x_n^F]$ of dimension F, calculated from the imagery, and an observed class label $\tilde{C}_n \in \mathbb{C} = \{C^1, ..., C^K\}$ from the existing map. $\mathbb{C}$ denotes the set of classes and K is the total number of classes. As the database may be outdated, the observed labels may differ from the unknown true labels $C_n \in \mathbb{C}$. Collecting the observed and the unknown class labels in two vectors $\tilde{\mathbf{C}} = (\tilde{C}_1, ..., \tilde{C}_N)^T$ and $\mathbf{C} = (C_1, ..., C_N)^T$, respectively, and denoting the observed image data by $\mathbf{x}$, it is our goal to find the optimal configuration of class labels $\mathbf{C}$ by maximising the joint posterior $P(\mathbf{C} \mid \mathbf{x}, \tilde{\mathbf{C}})$ of the unknowns given the observations. In this process, we use the class labels of the outdated map for deriving the class labels of the training samples. We start by outlining our modified version of the training procedure for logistic regression by Bootkrajang and Kabán (2012) (Section 3.1). In Section 3.2, we show how logistic regression is integrated into a CRF (Kumar and Hebert, 2006) together with a model for considering the existing class labels $\tilde{\mathbf{C}}$ as observations. Section 3.3 describes the new iterative procedure for training and inference.

Label noise robust logistic regression
Classification is based on logistic regression, a discriminative probabilistic classifier that directly models the posterior probability $p(C_n \mid \mathbf{x}_n)$ of a class label $C_n$ given the feature vector $\mathbf{x}_n$. A feature space transformation $\Phi(\mathbf{x}_n)$ may be applied to achieve non-linear decision boundaries in the original feature space. In the multiclass case the posterior is modelled by (Bishop, 2006):

$$p(C_n = C^k \mid \mathbf{x}_n) = \frac{\exp\left(\mathbf{w}_k^T \Phi(\mathbf{x}_n)\right)}{\sum_{j=1}^{K} \exp\left(\mathbf{w}_j^T \Phi(\mathbf{x}_n)\right)} \quad (1)$$

where $\mathbf{w}_k$ is a vector of parameters for a particular class $C^k$. As the sum of the posterior over all classes has to be 1, these parameter vectors are not independent, so that $\mathbf{w}_1$ is set to $\mathbf{0}$; the other vectors are collected in a joint parameter vector $\mathbf{w}$.
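The multiclass posterior with the first parameter vector fixed to zero can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments; the function name `posterior` and the array layout are assumptions.

```python
import numpy as np

def posterior(W_free, Phi):
    """Softmax posterior p(C_n = C^k | x_n) for all samples and classes.

    W_free : (K-1, D) array with the free parameter vectors w_2 ... w_K
             (w_1 is fixed to 0 and prepended here).
    Phi    : (N, D) array of (optionally transformed) feature vectors.
    Returns an (N, K) array whose rows sum to 1.
    """
    W = np.vstack([np.zeros((1, W_free.shape[1])), W_free])
    a = Phi @ W.T                          # activations w_k^T Phi(x_n)
    a = a - a.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)
```

Subtracting the row maximum before exponentiation does not change the result but avoids overflow for large activations.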
In our case, each training sample consists of a feature vector $\mathbf{x}_n$ and the observed label $\tilde{C}_n$. In order to consider this fact in training, Bootkrajang and Kabán (2012) model the probability $p(\tilde{C}_n \mid \mathbf{x}_n)$ as the marginal distribution of the observed labels $\tilde{C}_n$ over all values the unknown class labels $C_n$ may take:

$$p(\tilde{C}_n = C^k \mid \mathbf{x}_n) = \sum_{a=1}^{K} p(\tilde{C}_n = C^k \mid C_n = C^a) \, p(C_n = C^a \mid \mathbf{x}_n) \quad (2)$$

where $\gamma_{ak} = p(\tilde{C}_n = C^k \mid C_n = C^a)$ is the probability for a specific type of label noise affecting the two classes $C^a$ and $C^k$. These transition probabilities for all class configurations form the K × K transition matrix Γ. The transition matrix Γ contains the parameters of a NAR model which are estimated along with the parameters $\mathbf{w}$ in eq. 1. Because this kind of model is unrealistic to describe changes in land cover, we introduce a weight $g_n \in (0, 1]$ for every sample n to control its influence in the training process. In the beginning, these weights are all set to 1; Section 3.3.3 describes how they are changed iteratively to consider the assumption that changes occur in local spatial clusters. To determine the unknown parameters $\mathbf{w}$ and Γ, we apply maximum likelihood estimation with a Gaussian prior over $\mathbf{w}$ for regularisation. Taking the negative logarithms of the involved probabilities, this results in the minimisation of the following target function:

$$E(\mathbf{w}, \Gamma) = -\sum_{n=1}^{N} g_n \sum_{k=1}^{K} t_{nk} \ln S_{nk} + \frac{\mathbf{w}^T \mathbf{w}}{2\sigma} \quad (3)$$

In eq. 3, $t_{nk}$ is an indicator variable taking the value 1 if $\tilde{C}_n = C^k$ and 0 otherwise, $S_{nk} = p(\tilde{C}_n = C^k \mid \mathbf{x}_n)$ as defined in eq. 2, and the rightmost term corresponds to a Gaussian prior with zero mean and covariance $\sigma \cdot \mathbf{I}$, where $\mathbf{I}$ is a unit matrix.
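We use the Newton-Raphson method (Bishop, 2006) for minimising $E(\mathbf{w}, \Gamma)$. In each iteration τ, the parameter vector $\mathbf{w}^{\tau}$ is determined from

$$\mathbf{w}^{\tau} = \mathbf{w}^{\tau-1} - \mathbf{H}^{-1} \nabla E(\mathbf{w}^{\tau-1}, \Gamma),$$

where $\nabla E(\mathbf{w}, \Gamma) = [\nabla_{\mathbf{w}_2} E^T, ..., \nabla_{\mathbf{w}_K} E^T]^T$ is the gradient of $E(\mathbf{w}, \Gamma)$:

$$\nabla_{\mathbf{w}_a} E = \sum_{n=1}^{N} g_n \left(f_{na} - \bar{t}_{na}\right) \Phi(\mathbf{x}_n) + \frac{\mathbf{w}_a}{\sigma} \quad (4)$$

In eq. 4 we use the shorthand $f_{na} = p(C_n = C^a \mid \mathbf{x}_n)$ for the posterior in eq. 1, and $\bar{t}_{nj} = f_{nj} \sum_{k=1}^{K} \gamma_{jk} \, t_{nk} / S_{nk}$. The Hessian matrix $\mathbf{H}$ consists of (K-1) × (K-1) blocks $\mathbf{H}_{ij} = \nabla_{\mathbf{w}_i} \nabla_{\mathbf{w}_j} E$; its expression involves the elements $I_{ij}$ of a unit matrix and the Kronecker delta function δ(·), which delivers a value of 1 if the argument is true and 0 otherwise.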
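The weighted target function and its gradient can be sketched in a few lines. The sketch below assumes the negative log-likelihood of the observed labels weighted by gn plus a zero-mean Gaussian prior with covariance σI, as described in the text, and obtains the gradient by differentiating that expression; the Hessian is omitted. The function names and array layouts are illustrative assumptions.

```python
import numpy as np

def softmax_posterior(W, Phi):
    """f_na = p(C_n = C^a | x_n); W is (K, D) with row 0 fixed to zero."""
    a = Phi @ W.T
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def target_and_gradient(W, Gamma, Phi, T, g, sigma):
    """Weighted noise-tolerant objective E(w, Gamma) and its gradient.

    Gamma : (K, K), Gamma[a, k] = p(observed = k | true = a)
    T     : (N, K) one-hot encoding of the observed (map) labels
    g     : (N,) sample weights in (0, 1]
    """
    F = softmax_posterior(W, Phi)        # f_na
    S = F @ Gamma                        # S_nk = sum_a gamma_ak f_na
    E = -np.sum(g * np.sum(T * np.log(S), axis=1)) + np.sum(W ** 2) / (2.0 * sigma)
    T_bar = F * ((T / S) @ Gamma.T)      # f_nj * sum_k gamma_jk t_nk / S_nk
    grad = ((F - T_bar) * g[:, None]).T @ Phi + W / sigma
    grad[0] = 0.0                        # w_1 is fixed to 0, not optimised
    return E, grad
```

A finite-difference comparison on a small random problem is a convenient way to check that the analytical gradient agrees with the target function before it is used in a Newton or gradient-based optimiser.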
Optimising for the unknown parameters w requires knowledge about the transition matrix Γ, which, however, is unknown. Bootkrajang and Kabán (2012) propose an iterative procedure similar to expectation maximisation (EM). Starting from coarse initial values for Γ, the parameters w of the classifier are updated as just described. Using these parameters, the transition matrix Γ is updated afterwards, expanding the updating step presented in (Bootkrajang and Kabán, 2012) by the weights gn. This alternating update of the parameters w and Γ is repeated until a termination criterion is reached. The estimated parameters w are related to a classifier delivering the posterior for the unknown current labels $C_n$, not the noisy labels $\tilde{C}_n$.
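One way to realise the weighted update of the transition matrix is a multiplicative, row-normalised step in the spirit of the EM-like scheme of Bootkrajang and Kabán (2012). The exact expression used in this paper is not reproduced here, so the form below is an assumption; the function name `update_gamma` is hypothetical.

```python
import numpy as np

def update_gamma(Gamma, F, T, g):
    """One weighted, EM-like update of the transition matrix (assumed form).

    Gamma : (K, K), Gamma[j, k] = p(observed = k | true = j)
    F     : (N, K) posteriors f_nj of the current classifier
    T     : (N, K) one-hot encoding of the observed (map) labels
    g     : (N,) sample weights
    """
    S = F @ Gamma                         # S_nk = p(observed = k | x_n)
    R = (F * g[:, None]).T @ (T / S)      # R_jk = sum_n g_n f_nj t_nk / S_nk
    new = Gamma * R
    return new / new.sum(axis=1, keepdims=True)   # rows sum to 1 again
```

In an alternating scheme, this step would follow each update of w, and samples with small gn contribute correspondingly little to the estimated transition probabilities.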
Note that this training with equal weights gn was already used in (Maas et al., 2016). In this paper this is only the case at the beginning of the training procedure. The transition matrix Γ represents the transition between the old database and the current labels only in this initial step with equal weights gn. If the weights of training samples in large clusters of potential changes are low (cf. Section 3.3.3), the majority of the samples affected by label noise will have a low impact on the result, so that Γ only represents residual label noise of small local extents for which the NAR model is a sufficiently good approximation.

CRF considering the existing map
CRFs are graphical models consisting of nodes and edges that can be used to consider local context in a probabilistic classification framework (Kumar and Hebert, 2006). The nodes of the underlying graph represent random variables whereas the edges connect pairs of nodes and describe their statistical dependencies. Here, the unknown nodes correspond to the current labels $C_n$ of all pixels n, and the edges are defined on the basis of a 4-neighbourhood on the image grid. As described above, the observed variables are the image data $\mathbf{x}$ and, different from (Kumar and Hebert, 2006), the observed class labels $\tilde{\mathbf{C}}$ (cf. fig. 1 for the structure of the graphical model). The joint posterior $P(\mathbf{C} \mid \mathbf{x}, \tilde{\mathbf{C}})$ of the unknowns given the observations is modelled by:

$$P(\mathbf{C} \mid \mathbf{x}, \tilde{\mathbf{C}}) = \frac{1}{Z} \exp\left( \sum_{n} A_x(C_n, \mathbf{x}) + \sum_{n} A_m(C_n, \tilde{C}_n) + \sum_{(n,m) \in \mathcal{E}} I(C_n, C_m, \mathbf{x}) \right) \quad (5)$$

where Z is a normalisation constant and $\mathcal{E}$ is the set of edges in the graph. The association potential $A_x(C_n, \mathbf{x})$ connects the unknown label $C_n$ of pixel n with the image data $\mathbf{x}$. Its dependency on the entire input image $\mathbf{x}$ is considered by using site-wise feature vectors $\mathbf{x}_n(\mathbf{x})$, which may be a function of certain image regions. Any discriminative classifier can be used to model this potential (Kumar and Hebert, 2006); here, it is based on the posterior $p(C_n \mid \mathbf{x}_n)$ of logistic regression according to eq. 1:

$$A_x(C_n, \mathbf{x}) = \ln p(C_n \mid \mathbf{x}_n) \quad (6)$$

The interaction potential $I(C_n, C_m, \mathbf{x})$ describes the statistical dependencies between a pair of neighbouring labels $C_n$ and $C_m$.
In this paper, the contrast-sensitive Potts model is used for that purpose, which results in a data-dependent smoothing of the resultant label image (Boykov et al., 2001). In this model (eq. 7), the parameters β0 and β1 describe the overall degree of smoothing and the impact of the data-dependent term, respectively, σD is the average squared distance between neighbouring feature vectors, ∆x = xn − xm is the difference of the two feature vectors xn and xm, and δ(·) is the Kronecker delta function.
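For illustration, one commonly used form of a contrast-sensitive Potts potential is sketched below. The exact functional form of eq. 7 is not reproduced here, so the expression β0 · (1 − β1 + β1 · exp(−‖∆x‖² / (2σD))) · δ(Cn = Cm) is an assumption, as is the function name.

```python
import numpy as np

def contrast_sensitive_potts(c_n, c_m, x_n, x_m, beta0, beta1, sigma_d):
    """Interaction potential for one pair of neighbouring pixels (assumed form).

    Identical labels are rewarded; the reward shrinks with growing feature
    contrast, so that label changes are cheaper across image edges.
    """
    if c_n != c_m:
        return 0.0                       # Kronecker delta: no reward
    diff = np.asarray(x_n, dtype=float) - np.asarray(x_m, dtype=float)
    d2 = float(np.sum(diff ** 2))        # squared feature distance
    return float(beta0 * (1.0 - beta1 + beta1 * np.exp(-d2 / (2.0 * sigma_d))))
```

With β1 = 0 this form reduces to a plain Potts model, while β1 = 1 makes the smoothing depend entirely on the feature contrast.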
The observed labels are related to the unknown class labels by the temporal association potential $A_m(C_n, \tilde{C}_n)$, derived from the probability of the unknown label given the observed one:

$$A_m(C_n, \tilde{C}_n) = \theta_n \cdot \ln p(C_n \mid \tilde{C}_n) \quad (8)$$

In eq. 8, there is an individual weight $\theta_n \in [0, 1]$ for every pixel n. This weight models the influence of the observed label on the classification result of this pixel in inference. As we shall see in Section 3.3, these weights will be adapted in the inference process to reduce the impact of the observed labels for pixels that are very likely to belong to a larger area affected by a change.

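[Figure 1: structure of the graphical model, showing neighbouring label nodes Cn, Cm, Co and Cp together with their observed counterparts from the map.]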

Training and inference
In order to obtain the optimum configuration of the current class labels given the observations by maximising $P(\mathbf{C} \mid \mathbf{x}, \tilde{\mathbf{C}})$ according to eq. 5, a joint iterative training and inference strategy is applied. After the determination of the initial parameters of the association potential and of the temporal association potentials in an initial training phase, an iterative scheme of classification and re-training is applied, in which the weights of pixels in large areas of potential change according to the current classification result are modified to reduce their impact on the results. These steps are described in the subsequent sections.

Initial training and classification:
In the initial training phase, the observed labels and the data are used for label noise robust training of the logistic regression classifier that serves as the basis for the association potentials of the CRF. For that purpose, the method described in Section 3.1 is applied, using identical weights gn = 1 for all training samples. This will result in an initial set of parameters w for the association potentials and a transition matrix Γ that contains the transition probabilities $p(\tilde{C}_n = C^a \mid C_n = C^k)$ of the NAR model (Bootkrajang and Kabán, 2012). According to the theorem of Bayes, these probabilities are related to the probabilities $p(C_n = C^k \mid \tilde{C}_n = C^a)$ required for the temporal association potential (eq. 8) by:

$$p(C_n = C^k \mid \tilde{C}_n = C^a) = \frac{p(\tilde{C}_n = C^a \mid C_n = C^k) \, p(C_n = C^k)}{p(\tilde{C}_n = C^a)}$$

As we have no access to the distribution of the unknown class labels $p(C_n)$, we assume $p(C_n = C^k) \approx p(\tilde{C}_n = C^k)$ to derive the temporal association potential from Γ. These parameters are kept constant in the subsequent iteration process for the reasons already pointed out in Section 3.1: the transition matrix corresponds to the real transition probabilities only in the first iteration (when all training samples have an identical weight gn = 1). The parameters of the interaction potentials (β0, β1; cf. eq. 7) are set to values found empirically.
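The conversion of the transition matrix into the probabilities needed for the temporal association potential can be sketched as follows. The function names are hypothetical, the class prior is assumed to be approximated by the relative frequencies of the map labels, and the denominator is computed by summing the numerator over all classes so that each row is a proper distribution.

```python
import numpy as np

def posterior_true_given_observed(Gamma, prior):
    """Invert the noise model with Bayes' theorem.

    Gamma : (K, K), Gamma[k, a] = p(observed = a | true = k)
    prior : (K,) approximation of p(true = k), e.g. map label frequencies
    Returns P with P[a, k] = p(true = k | observed = a).
    """
    joint = Gamma * prior[:, None]       # p(observed = a | true = k) p(true = k)
    evidence = joint.sum(axis=0)         # p(observed = a)
    return (joint / evidence).T

def temporal_association(P, observed, theta, eps=1e-12):
    """theta_n * ln p(C_n = k | observed label of pixel n) for all classes k.

    observed : (N,) integer map labels, theta : (N,) weights in [0, 1].
    eps (an assumed constant) guards against ln(0).
    """
    return theta[:, None] * np.log(np.clip(P[observed], eps, 1.0))
```

A weight of θn = 0 switches the map observation off for a pixel, which corresponds to the setting used in the initial classification.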
For the determination of the optimal configuration of labels $\mathbf{C} = \mathrm{argmax}\, P(\mathbf{C} \mid \mathbf{x}, \tilde{\mathbf{C}})$, loopy belief propagation is used (Frey and MacKay, 1998). In the initial classification, the weights θn of the temporal association potentials are set to 0 for all pixels, so that this classification is only based on the current state of the association and the interaction potentials. Furthermore, the information about potential areas of change is also used to change the weights θn of the temporal association potentials as explained in Section 3.3.4. Using the updated parameters w and weights θn, another round of inference is carried out, which will lead to an improved classification result. This procedure of updating weights on the basis of the current state of the classification results, re-training and inference is repeated until the proportion of weights that are changed in an iteration is below a threshold or a maximum number of iterations is reached. The procedure is inspired by re-weighting strategies for robust estimation in adjustment theory, e.g. (Förstner and Wrobel, 2016).

Weights gn of training samples:
The weight gn of a training sample n should be high for samples which are probably not affected by a change and low for the others. The weights are initialised by gn = 1 as long as no information about changes is available. After classification, the resulting labels $C_n$ can be compared to the map labels $\tilde{C}_n$ to generate a binary map B_C of potential changes. However, as indicated in fig. 2(b) for an aerial image, this binary map will also contain classification errors.
To distinguish between real changes and classification errors, three assumptions are made. First, classification errors often occur at object boundaries, e.g. because of mixed pixels or because of matching errors if digital surface models (DSM) are used in classification. Thus, a set of connected foreground pixels in B_C forming a line that is thinner than a threshold s is very likely caused by classification errors. Such sets are removed by morphological filtering using a structural element of size s. The second assumption is that changes occur in clusters having a certain minimum size. This is considered by removing all connected components of foreground pixels in B_C which cover an area smaller than a threshold u. The third assumption is that in areas affected by cast shadows, the quality of spectral information or of the DSM (if available) is poor and, thus, potential changes as indicated by B_C are very likely to correspond to classification errors. To detect shadow areas, the median and the mean of the image intensity in each cluster cl are compared to the median and the mean of the entire image. If mean_cl < mean_img / 2 and med_cl < med_img / 2, i.e., if the pixels in the cluster are very dark compared to the image, the pixels belonging to cluster cl are removed from the binary map of potential changes B_C. The remaining foreground pixels in B_C are likely to correspond to real changes (cf. fig. 2(d) for an example).
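The three filtering steps can be sketched with plain NumPy as follows. This is a minimal illustration under several assumptions: the thresholds s (odd, side length of a square structuring element) and u (minimum area) are given in pixels, connected components use a 4-neighbourhood, and all function names are hypothetical.

```python
import numpy as np

def _translations(mask, size):
    """All shifts of mask inside a size x size window (False outside the image)."""
    r = size // 2
    padded = np.pad(mask, r, constant_values=False)
    h, w = mask.shape
    return [padded[dy:dy + h, dx:dx + w] for dy in range(size) for dx in range(size)]

def opening(mask, size):
    """Morphological opening (erosion, then dilation) with a square element."""
    eroded = np.logical_and.reduce(_translations(mask, size))
    return np.logical_or.reduce(_translations(eroded, size))

def connected_components(mask):
    """Label 4-connected foreground components with a simple flood fill."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue
        current += 1
        labels[seed] = current
        stack = [seed]
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    stack.append((ny, nx))
    return labels, current

def filter_change_map(changed, intensity, s, u):
    """Keep only clusters that are thick, large and not in dark (shadow) areas."""
    mask = opening(changed.astype(bool), s)            # 1) remove thin structures
    labels, n = connected_components(mask)
    img_mean, img_med = intensity.mean(), np.median(intensity)
    keep = np.zeros(mask.shape, dtype=bool)
    for c in range(1, n + 1):
        cluster = labels == c
        if cluster.sum() < u:                          # 2) remove small clusters
            continue
        vals = intensity[cluster]
        if vals.mean() < img_mean / 2 and np.median(vals) < img_med / 2:
            continue                                   # 3) remove shadow clusters
        keep |= cluster
    return keep
```

For large images, library routines for morphological opening and connected component labelling would be considerably faster than this pure NumPy version.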
For pixels corresponding to the foreground in B_C, the weights gn are decreased by a constant c, so that in iteration t + 1, the weight of the corresponding samples is given by $g_n^{t+1} = \max(g_n^t - c, \xi)$. The minimal weight is set to a small constant ξ to avoid numerical problems. The weights of pixels that belong to the background in B_C are updated according to $g_n^{t+1} = \min(g_n^t + c, 1)$. As a consequence, the weights of pixels that are considered to be changes will be reduced in each iteration; however, a pixel may regain influence if in a certain iteration its most likely class label is identical to the one from the map, e.g. due to the influence of its neighbours or due to the temporal model.
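The weight update amounts to a clipped additive step, as the following sketch shows. The value of ξ is not stated in the text, so the default used here is an assumption, as is the function name.

```python
import numpy as np

def update_sample_weights(g, change_mask, c=0.1, xi=1e-3):
    """Decrease weights of potential changes, increase all others.

    g           : array of current weights g_n^t
    change_mask : boolean array, True for foreground pixels of B_C
    c           : step size (0.1 in the experiments)
    xi          : minimal weight (assumed value)
    """
    decreased = np.maximum(g - c, xi)
    increased = np.minimum(g + c, 1.0)
    return np.where(change_mask, decreased, increased)
```

With c = 0.1, a sample that is flagged as a potential change in every iteration reaches the minimal weight after ten iterations.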

Test data and test setup
We used three datasets in our experiments. The first one consists of a part of the Vaihingen data of the ISPRS 2D semantic labelling contest (Wegner et al., 2015). We use ten of the training patches, each consisting of about 2000 × 2500 pixels. For each patch, a colour infrared true orthophoto (TOP) and a DSM are available with a ground sampling distance (GSD) of 9 cm. The reference consists of five classes: impervious surfaces (sur.), building (build.), low vegetation (veg.), tree, and car. As cars are not a part of a topographic map, this class was merged with sur. For each pixel, we defined a feature vector xn(x) consisting of the normalised difference vegetation index (NDVI), the normalised DSM (nDSM), the red band of the TOP smoothed by a Gaussian filter with σ = 2, and hue and saturation obtained from the TOP, both smoothed by a Gaussian filter with σ = 10. These features were selected from a larger pool based on the feature importance analysis of a random forest classifier (Breiman, 2001).
The other two data sets are based on satellite imagery and were also used in (Maas et al., 2016). The first one consists of a Landsat image from 2010 of an area near Herne, Germany, with a GSD of 30 m and a size of 362 × 330 pixels. The second dataset consists of a RapidEye image of an area near Husum, Germany, from 2010. Its GSD is 5 m and its size is 3547 × 1998 pixels. In both cases only the red, green and near infrared bands are available. The reference contains four classes: residential area (res.), rural streets, forest (for.) and cropland (crop.). As the class rural streets is underrepresented in both images, we merged it with cropland. In both datasets, 19 features were selected: four Haralick features (energy, contrast, homogeneity and entropy) related to texture, the mean and variance of five spectral features (near infrared band, intensity, hue, saturation and NDVI) in a local neighbourhood of 6 × 6 pixels and the values of the same spectral features smoothed by a Gaussian filter with σ = 5.
For Vaihingen, we used a feature space mapping Φ(xn) based on quadratic expansion, whereas for Husum and Herne no feature space mapping was used. The hyperparameter for regularisation in eq. 3 was set to σ = 10. The initial values for the transition matrix Γ (cf. Section 3.1) were γij = 0.8 for i = j and γij = 0.2/(K − 1) for i ≠ j, where K is the number of classes. The initial values for the parameter vector w of logistic regression were determined by standard logistic regression training without assuming label noise. The parameters of the contrast-sensitive Potts model were set to β0 = 1.0 and β1 = 0.5. The thresholds for updating the weights (Section 3.3.3) depend on the GSD. For Vaihingen, the threshold for object borders s was set to 0.5 m, assuming wrong classifications near object borders to be caused by errors in the nDSM or mixed pixels. The minimal size u of an object is set to 4 m × 4 m (i.e., smaller than a small house). For the satellite data, s is set to 2 pixels, because mistakes caused by the nDSM do not exist, and u is set to 250 m × 250 m, assuming this is the minimum size of a new residential area or field. The value c for updating the weights (Sections 3.3.3 and 3.3.4) was found empirically and set to 0.1. Except for the dataset of Herne, where all pixels are used due to the small image size, just about 20% of the data are used for training to reduce the processing time. The iteration is terminated (Section 3.3.2) if either fewer than 0.01% of the weights for the observed labels in classification change or if at least 40 iterations have been done.
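The initial transition matrix and a quadratic feature space mapping can be sketched as follows. The text does not specify the exact form of the quadratic expansion, so the variant below (constant term, linear terms and all products x_i x_j with i ≤ j) is an assumption, and both function names are hypothetical.

```python
import numpy as np

def initial_transition_matrix(K, diag=0.8):
    """gamma_ij = diag for i == j and (1 - diag) / (K - 1) otherwise."""
    Gamma = np.full((K, K), (1.0 - diag) / (K - 1))
    np.fill_diagonal(Gamma, diag)
    return Gamma

def quadratic_expansion(X):
    """Map each row x to [1, x_1..x_F, x_i * x_j for i <= j] (assumed form)."""
    N, F = X.shape
    i, j = np.triu_indices(F)
    return np.hstack([np.ones((N, 1)), X, X[:, i] * X[:, j]])
```

For F features, this mapping yields 1 + F + F(F + 1)/2 dimensions, e.g. 21 for the five Vaihingen features.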
For all experiments we manually changed the reference to simulate an outdated map. For each patch of the Vaihingen dataset three simulated maps were created, each with a different amount of change. For Herne and Husum, the changed maps from (Maas et al., 2016) were used. Based on these data, we carried out four experiments. In the first experiment (Init), training and classification were carried out as in (Maas et al., 2016), i.e. without iterative re-training and classification (gn = 1 = const, θn = 0 = const). The second experiment (V_g) is based on our method, but without considering the outdated map (θn = 0 = const). It shows the impact of the sample weights gn introduced in Section 3.1 in the training step. The third experiment (V_θ) uses constant training weights gn = 1, but does apply the modified weights θn to include the map information. The last experiment, V_gθ, uses our method with weight modification both in the training and classification steps. In each case, we compare the results to the reference on a per-pixel basis, determining the overall accuracy (OA) as well as completeness and correctness per class (Heipke et al., 1997). Comparing the simulated map with the real reference shows the amount of change in the corresponding data set, and the resultant quality indices are also reported (map); 100% − OA of map gives the amount of simulated change in each experiment. We do not distinguish a training set from a test set because an outdated map is always used, at least for training.
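The per-pixel quality indices can be derived from a confusion matrix, as in the following minimal sketch. The function name is hypothetical; completeness is computed as TP / (TP + FN) and correctness as TP / (TP + FP) per class, and every class is assumed to occur at least once in both label images.

```python
import numpy as np

def quality_indices(reference, predicted, K):
    """Overall accuracy, completeness and correctness per class.

    reference, predicted : integer label arrays of identical shape (0..K-1)
    """
    cm = np.zeros((K, K), dtype=int)
    # cm[r, p] counts pixels of reference class r that were assigned class p
    np.add.at(cm, (reference.ravel(), predicted.ravel()), 1)
    tp = np.diag(cm).astype(float)
    oa = tp.sum() / cm.sum()
    completeness = tp / cm.sum(axis=1)   # TP / (TP + FN)
    correctness = tp / cm.sum(axis=0)    # TP / (TP + FP)
    return oa, completeness, correctness
```

Applied to the simulated map instead of a classification result, 1 − OA gives the fraction of pixels affected by simulated change.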

Results and evaluation
4.2.1 Vaihingen: Fig. 3 shows the OA of all patches achieved for three versions of the outdated map (map 1 - map 3) for Vaihingen. In most cases the variant V_gθ achieves the best OA (85%-90%), but variant V_θ performs at a similar level, and both variants clearly outperform the variants without weights and without considering the outdated map (Init, V_g). Obviously, the inclusion of the outdated map has a relatively high impact on the quality of the results, improving the OA by 2%-10%. This is mainly caused by an improved classification at object boundaries or at individual pixels. In fact, in some cases, the variants not considering the outdated map lead to results where a larger percentage of change than actually present is predicted, so that the corresponding OA is lower than the one indicated by map. The advantage of considering the sample weights gn in training becomes more obvious for experiments with a large amount of change. If the level of change is small, it can be compensated by the original method based on the NAR model (Maas et al., 2016). If the label noise cannot be compensated by the NAR model any more, considering the weights can improve the results.
One example is patch 17 (fig. 4). It contains three buildings with a brighter appearance than the rest (blue rectangles in fig. 4(a)). Only one of them is contained in the outdated map (fig. 4(b)). Without considering the weights (variant Init, fig. 4(c)), one building is mostly classified as veg. In variant V_g the two changed buildings are correctly detected (fig. 4(d)). Another difference between the results of variants Init and V_g is the label of the vineyard, which belongs to the class veg. but is often classified as tree in experiment Init. Without considering weights, the probability p(Cn|xn) is low for all classes in the area of the vineyard, so that the classification results are not reliable. By considering the sample weights in experiment V_g, the probability p(Cn|xn) for the class veg. is much higher than for the other classes. However, because the vineyard has a similar appearance to trees, the tree marked in fig. 4(c) is also classified as veg. in V_g.

For patch 5 (fig. 5) and the outdated map 3 with more than 30% label noise, OA is always below 61% (fig. 3). In this case nearly 50% of all building pixels are labelled as sur. in the outdated map. This amount of label noise cannot be dealt with by the original method (Init). The transition probabilities γii for no change for build. and sur. determined in the initial training step are close to 1 and, thus, not very accurate. Consequently, the iterative weight updating procedure does not converge to the correct solution.

Figs. 6 and 7 show the completeness and the correctness of the results. Both quality indices are higher for variants V_θ and V_gθ than for the others, which again highlights the importance of using the outdated map for classification. Using the sample weights gn in the training process does not improve the completeness in most cases, but it does have a small positive impact on the correctness.
For buildings, we also provide an evaluation on a per-object basis, counting a detected building (i.e. a connected component of pixels classified as build.) as a true positive if more than 70% of its area overlaps with a reference building. Because small buildings are often not included in maps, buildings smaller than 16 m² were excluded from the evaluation. The mean completeness and correctness of all areas are shown in tab. 1. Again, variant V g θ achieves the best completeness (98.9%) and correctness (82.5%). However, variants V θ and V g do not perform significantly worse considering the standard deviations of the quality indices. Nevertheless, one can notice a positive impact of the new developments presented in this paper (variants V g, V θ and, particularly, V g θ) compared to the original algorithm (Init) (Maas et al., 2016).

Figure 7. Average correctness over all patches in Vaihingen.

4.2.2 Herne and Husum: As the amount of change in Husum and Herne is quite small (3%-4%), using the sample weights gn does not affect the results much; the OA changes by less than 0.6%. Thus, this section focuses on the impact of using the outdated map for classification. In tab. 2, OA, completeness and correctness are shown for both datasets for variants Init and V θ. All values are larger for variant V θ by a large margin, the OA increasing by 13.9% for Herne and by 5.2% for Husum. One reason for that increase is the improvement of the delineation of object borders. As the features all depend on a local subset of pixels, borders of objects are blurred in the standard classification process. In variant V θ these areas can be correctly classified in regions without change. If regions of change are smaller than the threshold u (Section 3.3.3), considering the map has the same effect, which may lead to cases in which such small changes are not detected. An example of such a situation in Herne with variant V θ is indicated by a blue rectangle in fig. 8(d).
To highlight the potential for detecting changes despite using the existing map for classification, tab. 3 shows the OA achieved for pixels in the areas affected by a change. The results show that the improved OA for the entire image caused by the inclusion of the outdated map (cf. tab. 2) comes at the cost of a reduced OA in the changed areas. In Husum, this reduction in OA is low (0.8%). In Herne it is somewhat larger (7%), though still considerably smaller than the improvement for the entire scene (14.4%).

Herne, Init    Herne, V θ    Husum, Init    Husum, V θ
75.9 %         68.9 %        92.6 %         91.8 %
Table 3. OA of Husum and Herne for areas affected by a change.
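The trade-off between the OA of the entire image and the OA in changed areas can be illustrated with a small sketch. It assumes that a pixel counts as changed if its label in the outdated map differs from the reference label, which is plausible for simulated maps but is an assumption here. The function names and the toy label strings (B: build., S: sur., V: veg.) are illustrative and do not reproduce the data in tab. 3.

```python
def changed_mask(outdated_map, reference):
    """Mark pixels whose map label differs from the reference (assumed definition of change)."""
    return [m != r for m, r in zip(outdated_map, reference)]


def overall_accuracy(reference, predicted, mask=None):
    """OA over all pixels, or only over pixels where mask is True."""
    if mask is None:
        mask = [True] * len(reference)
    selected = [(r, p) for r, p, keep in zip(reference, predicted, mask) if keep]
    if not selected:
        raise ValueError("no pixels selected for evaluation")
    return sum(r == p for r, p in selected) / len(selected)


# One character per pixel; the first two pixels are a new building missing from the map.
reference = list("BBBSSVVV")
outdated_map = list("SSBSSVVV")
pred_init = list("BBBSVVSV")      # no map prior: change found, some noise elsewhere
pred_with_map = list("SBBSSVVV")  # map prior: cleaner overall, part of the change missed

mask = changed_mask(outdated_map, reference)
oa_init_all = overall_accuracy(reference, pred_init)                # 0.75
oa_init_changed = overall_accuracy(reference, pred_init, mask)      # 1.0
oa_map_all = overall_accuracy(reference, pred_with_map)             # 0.875
oa_map_changed = overall_accuracy(reference, pred_with_map, mask)   # 0.5
print(oa_init_all, oa_init_changed, oa_map_all, oa_map_changed)
```

In this toy case the map prior raises the OA of the whole scene while lowering the OA inside the changed area, which is the qualitative pattern reported for Herne and Husum.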

CONCLUSION AND FUTURE WORK
In this paper we presented an iterative method for supervised classification under label noise that makes use of the existing map both for training and in the classification process. No manual effort for the generation of training data was required. In both the training and the classification procedure, we considered the fact that changes in land cover usually appear in clusters. In training, this was achieved by using a weight for each training sample in order to reduce the impact of samples in larger areas of change. By adding the labels of the map to the CRF as weighted observations, our method includes the map information for pixels which are unlikely to correspond to changes. Thus, new objects can be found without the additional map information, while pixels probably not affected by label noise can take advantage of this prior information.
We tested our method using datasets with different properties and varying degrees of label noise. Due to our re-weighting scheme for training samples, the method can also deal with larger amounts of noise, but the improvement brought about by this strategy was smaller than expected. The inclusion of the map information in the CRF has a considerably larger positive effect, largely due to a better classification of pixels near object boundaries. The actual changes are detected nearly as well as without considering the map in classification, although very small changed objects might not be detected. These observations could be made independently of the GSD of the images. A major limitation of the method is that each cluster in feature space must still contain enough correct training samples for it to work. If the results of the base classifier in the initialization step are sufficiently good, considering the map in the classification can improve the results considerably.
In our future work we want to extend our model with images from other epochs, so that not only the map but also other image data can help to improve the classification result. Additionally, we want to extend our experiments to data with real changes to see how our method works under more realistic circumstances in terms of the extent of change, level of detail or number of classes.

Figure 1. Graph structure of the expanded CRF: C: unknown labels, C: observed labels, x: image data.

3.3.2 Iterative re-training and classification: By comparing the current label image with the outdated map, areas of potential change can be detected. This information is used to update the weight gn of each training sample, and the label noise robust training of the logistic regression classifier is repeated using the updated weights. The way in which the weights are updated is explained in Section 3.3.3. Training will result in new values for the parameters w of the association potentials of the CRF.
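The comparison of the label image with the outdated map can be sketched as follows. This is only an illustration under stated assumptions: disagreeing pixels are grouped into 4-connected clusters, and clusters of at least u pixels receive a reduced weight, while all other pixels keep weight 1. The actual update rule is the one defined in Section 3.3.3 and is not reproduced here; the binary weighting, the parameter low_weight and the function names are hypothetical.

```python
from collections import deque


def disagreement_clusters(labels, outdated_map):
    """Group pixels where the label image and the map differ into 4-connected clusters."""
    rows, cols = len(labels), len(labels[0])
    seen = [[False] * cols for _ in range(rows)]
    clusters = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or labels[r][c] == outdated_map[r][c]:
                continue
            seen[r][c] = True
            queue, cluster = deque([(r, c)]), []
            while queue:  # breadth-first search over disagreeing neighbours
                y, x = queue.popleft()
                cluster.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols and not seen[ny][nx]
                            and labels[ny][nx] != outdated_map[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            clusters.append(cluster)
    return clusters


def update_weights(labels, outdated_map, u, low_weight=0.0):
    """Hypothetical rule: down-weight training samples in clusters of at least u pixels."""
    weights = [[1.0] * len(row) for row in labels]
    for cluster in disagreement_clusters(labels, outdated_map):
        if len(cluster) >= u:
            for y, x in cluster:
                weights[y][x] = low_weight
    return weights


# Toy 3 x 4 scene: a 2 x 2 block of new building pixels and one isolated misclassification.
labels = [list("BBSS"), list("BBSS"), list("SSSV")]
outdated_map = [list("SSSS"), list("SSSS"), list("SSSS")]
clusters = disagreement_clusters(labels, outdated_map)
weights = update_weights(labels, outdated_map, u=3)
print(sorted(len(c) for c in clusters))
print(weights)
```

With u = 3 the building block is treated as a potential change, whereas the isolated pixel keeps its full weight, mirroring the observation that regions smaller than u are not handled as changes.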
Figure 2. Example for the identification of potential areas of change. Black / gray: changed / unchanged pixels. Red rectangle in (b): a cluster that corresponds to a shadow.

Figure 3. Overall accuracy for the four variants in Vaihingen.
Because small buildings are often not included in maps, buildings smaller than 16 m²
Figure 6. Average completeness over all patches in Vaihingen.

Table 1. Completeness and correctness on a per-object basis for buildings; mean of all areas in % [standard deviation in %].
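The per-object evaluation summarised in Table 1 could be computed along the following lines. The sketch treats each 4-connected component of building pixels as one object, ignores objects smaller than 16 m² and counts an object as matched if more than 70% of its area overlaps with building pixels of the other mask. The 4-neighbourhood, the symmetric use of the same rule for completeness, and all names (connected_components, object_rate, gsd_m) are assumptions made for illustration.

```python
from collections import deque


def connected_components(mask):
    """Return the 4-connected components of True pixels as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, component = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(component)
    return components


def object_rate(objects, other, gsd_m, min_area_m2=16.0, min_overlap=0.7):
    """Fraction of sufficiently large objects covered by `other` to more than min_overlap.

    Returns None if no object reaches the minimum area.
    """
    pixel_area = gsd_m ** 2
    kept = matched = 0
    for component in connected_components(objects):
        if len(component) * pixel_area < min_area_m2:
            continue  # small buildings are excluded from the evaluation
        kept += 1
        overlap = sum(1 for y, x in component if other[y][x]) / len(component)
        if overlap > min_overlap:
            matched += 1
    return matched / kept if kept else None


def to_mask(rows):
    """Convert strings of 0/1 characters into a boolean building mask."""
    return [[ch == "1" for ch in row] for row in rows]


predicted = to_mask(["1111000", "1111000", "0000011", "0000011", "1000000"])
reference = to_mask(["1111000", "1110000", "0000001", "0000001", "0000000"])

# Assumed ground sampling distance of 2 m, i.e. 4 m² per pixel.
correctness = object_rate(predicted, reference, gsd_m=2.0)   # detected objects that are correct
completeness = object_rate(reference, predicted, gsd_m=2.0)  # reference objects that are found
print(correctness, completeness)
```

In the toy scene one of the two sufficiently large detected buildings overlaps the reference by more than 70%, and the only reference building above the size limit is fully covered by the prediction.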