Archetypal Analysis for Sparse Representation-based Hyperspectral Sub-pixel Quantification

The estimation of land cover fractions from remote sensing images is a frequently used indicator of the environmental quality. This paper focuses on the quantification of land cover fractions in an urban area of Berlin, Germany, using simulated hyperspectral EnMAP data with a spatial resolution of 30m$\times$30m. We use constrained sparse representation, where each pixel with unknown surface characteristics is expressed by a weighted linear combination of elementary spectra with known land cover class. We automatically determine the elementary spectra from image reference data using archetypal analysis by simplex volume maximization, and combine it with reversible jump Markov chain Monte Carlo method. In our experiments, the estimation of the automatically derived elementary spectra is compared to the estimation obtained by a manually designed spectral library by means of reconstruction error, mean absolute error of the fraction estimates, sum of fractions, $R^2$, and the number of used elementary spectra. The experiments show that a collection of archetypes can be an adequate and efficient alternative to the manually designed spectral library with respect to the mentioned criteria.


I. INTRODUCTION
The estimation of the degree of imperviousness as an indicator of environmental quality is subject of current research towards a time and cost efficient monitoring of urban areas [1]. Due to increasing land consumption in cities in the recent years, which has negative effects on the natural water cycle, the monitoring of land use in those areas is important [2].
Remote sensing data, such as imaging spectroscopy, builds a valuable basis to comprehensively map urban areas and quantify the imperviousness based on the spectral information (e.g., [3], [4]). Especially, hyperspectral imagery is a suitable source for mapping of such areas, because it offers a high spectral separability of different materials. However, generally, the temporal and spatial resolution is limited in comparison to sensors with lower spectral resolution. These limitations are partially overcome with the launch of missions such as Environmental Mapping and Analysis Program (EnMAP), which increases the availability of hyperspectral data and the temporal resolution [5]. Nevertheless, due to its spatial resolution, the provided data is mainly characterized by spectrally mixed pixels, which demands sophisticated sub-pixel quantification approaches in order to estimate the fraction of various land cover classes in each pixel.
In this context, several approaches have been developed comprising regression approaches [6], [7], probabilistic classification methods [8], [9], [10], and the usage of spectral libraries for spectral mixture analysis [11], [12]. An overview of a wide variety of unmixing approaches can, for example, be found in [13]. While the latter approach needs a spectral library containing elementary spectra of known materials, the first two approaches also require mixed spectra for learning an appropriate model. These mixed spectra can be derived from the image using information about known mixed pixels, or from synthetically mixed pixel, as it has been presented in [8] and [6].
When using spectral libraries, a critical step is the extraction of the elementary spectra. A manual extraction is time-consuming and requires human expert-knowledge and therefore, automatic extraction techniques have been an active field of research during the past decade (e.g., [13], [14]). Most of the algorithms rely on the assumption that the elementary spectra lie on a convex hull or a convex polytope enclosing the data distribution (e.g., [15], [16]). Based on this assumption all data samples can be reconstructed by a non-negative linear combination of the elementary spectra. A promising approach from this group is the so-called archetypal analysis, which searches for extreme points (also known as archetypes) in the data distribution (e.g., [17], [18], [19], [20]). Archetypal analysis has already been successfully applied in the field of sport analytics [21], plant phenotyping [22] or text analysis [23]. A valuable extension to archetypal analysis is presented by [24], in which extreme points are extracted in the kernel space, enabling an efficient nonlinear unmixing. Besides the actual determination of elementary spectra, other challenges exist which need to be tackled: The number of elementary spectra is unknown beforehand and thus, a suitable amount of spectra needs to be extracted to make the set representative enough, but also small enough to keep the sub-pixel quantification robust and efficient. Moreover, many extraction techniques depend on the initialization and thus, a strategy needs to be defined to ensure a stable result (e.g., [25], [26]).
In this paper, we address the challenge of automatically finding a representative set of elementary spectra, including the automatic determination of the number of elements, for sub- pixel quantification. We perform the sub-pixel quantification using a freely available simulated EnMAP scene of an urban area in Berlin, Germany [27] 1 , aiming at the estimation of a fraction map containing the classes impervious surface, vegetation, soil and water. In order to determine the class fractions, we use sparse representation with non-negativity, L 0sparsity and sum-to-one constraint. We exploit archetypal analysis to extract elementary spectra in a fully automatic and unsupervised way. Moreover, we perform archetypal analysis by simplex volume maximization (SiVM), which states an efficient selection method [28], [29].
The main contribution of this work is the combination of archetypal analysis with reversible jump Markov chain Monte Carlo method (rjMCMC, [30]) to obtain a spectral library of high representational power, yet a small number of elementary spectra. Using this approach, we are able to determine the best set of spectral library elements regarding a pre-defined criteria, as well as the number of elements. Moreover, the approach of [24] is applied to select the archetypes in kernel space, resulting in archetypes lying on the concave hull of the data distribution in the original feature space. Our experiments confirm that these archetypes are more suitable for sub-pixel quantification than archetypes extracted from the convex hull exclusively. To illustrate the usefulness of our proposed approach, we analyze various kinds of automatically derived spectral library and compare them to a manually designed spectral library. Our presented approach is flexible regarding the chosen estimation technique, such that the constrained sparse representation can be replaced by other approaches commonly used in the unmixing community ( [13]), or other constraints such as sparsity induced by L 1 -norm ( [31]). As such, archetypal analysis can be replaced by, e.g., endmember extraction methods presented in [13], and deliver the input for the rjMCMC method.

II. DATA
Our studies are performed using the Berlin-Urban-Gradient dataset [27], illustrated in Fig. 1. The dataset consists of two hyperspectral images of different spatial resolution, two simulated EnMAP scenes of different spectral resolution, a manually designed spectral library, reference land cover information and reference fractions for evaluation, which are explained in more detail in the following paragraphs.  1) HyMap Data: The dataset contains two hyperspectral images, one with a spatial resolution of 3.6m and one with a spatial resolution of 9m, whereby we use the higher resolution image in our work. Both images were acquired with the Hyperspectral Mapper (HyMap), and serve as basis for the extraction of the archetypes. It offers a high variety of urban land use and land cover patterns in the study site Southwest of Berlin, Germany. The external conditions were a cloudless sky at solar noon and a minimal possible altitude (maximal altitude according to the second lower resolution image). The HyMap image consists of 126 spectral bands, however, 15 noisy bands were removed resulting in 111 bands used for this work. Moreover, other preprocessing steps were performed, as encompassed system correction [32], atmospheric correction and parametric geocoding [33]. The observed wavelengths range from 0.45µm to 2.5µm, showing a high spectral information diversity, which enables a detailed analysis of the urban structure. In addition to the 3.6m-resolution HyMap image, a manually derived reference image containing 112.690 reference pixel is provided with valuable spectra, each labeled with one of the four land cover classes: impervious surface, vegetation, soil and water. The reference information is manually obtained by using digital orthophotos and cadastral data. For our experiments, the set of elementary spectra is extracted from this data to obtain the class membership of the elementary spectra, however, the extraction can be performed on unlabeled data with limited human user interaction.
2) Simulated EnMAP Data: EnMAP is German hyperspectral satellite mission, which will start not earlier than 2018, with a focus on Earth environmental observations in a global scale. Based on the HyMap data, an EnMAP scene was simulated using EnMAP end-to-end simulation tool (EeteS, [34]) with two different spectral resolutions (111 and 244 bands). Just as the HyMap data, both EnMAP images have a spectral resolution ranging from 0.45µm to 2.5µm, where we use 111 bands for a better comparability. The spatial resolution of 30m is lower than the resolution of the HyMap scene, and thus the mixing of land cover classes are more apparent. For evaluation purposes, 1495 EnMAP pixels were obtained from the simulation tool, containing the fractions of the land cover classes ranging from 0 to 100%.
3) Manually Designed Spectral Library: The spectral library is a manually designed collection of 75 pure spectra obtained from the HyMap image. The spectra contain different land cover classes with 39 impervious surface spectra (different types of roof, pavement, tartan, pool water), 31 vegetation spectra (grass, tree), 4 soil spectra (uncovered ground, sand) and one natural water spectrum. All together, it is a balanced library for urban structures with 39 impervious and 36 pervious spectra. All spectra in the library can be assigned to a hierarchical urban classification scheme, which was developed by [35]. First, an initial collection of 300 spectra was selected by expert knowledge. Afterwards a two-step filtering was performed, in which the variability between the spectra is maximized in consideration of spectral variability of materials, brightness and shading effects. Moreover, in an iterative process those spectra for the final subset were selected that best describes the spectral diversity in a specific neighborhood. More details can be found in [27].

III. METHODS
The following section describes the sub-pixel quantification using sparse representation and archetypal analysis. Archetypal analysis determines the extreme points of the data distribution, which are used within the sparse representation approach to estimate the fractions of land cover classes in each pixel. We have given a (M × N )-dimensional data matrix X , in which N is the number of M -dimensional reference pixels X = [x 1 , . . . , x N ] with given land cover class c n . Moreover, we have given a test set of 1495 data samples T X = [x 1 , . . . , x T ] with known land cover fractions T f t for evaluation purposes, as presented in Sec. II-2.

A. Sparse Representation
In order to determine the sub-pixel fractions, we use sparse representation with non-negative least squares optimization. In terms of sparse coding, a sample x is represented by a weighted linear combination of a few elements taken from a (M × D)-dimensional dictionary D, such that x = Dα + γ with γ 2 being the reconstruction error. The dictionary D = [x 1 , . . . , x d , . . . , x D ] contains elementary spectra, such that this approach is identical to the linear mixing model. The coefficient vector, comprising the activations, is given by α. The activations are interpreted as class fractions for sub-pixel quantification. The optimization problem for the determination of optimalα is given byα where the terms in (2) are the non-negativity constraint, the sum-to-one constraint, and the sparsity constraint. Generally, non-negativity alone leads to a sparse solution, however, in order to ensure a strict fulfillment of the sparsity constraint we use a backward selection procedure in which the activations with the smallest values are set to zero in a greedy manner.

B. Archetypal Analysis
Archetypal analysis is a suitable method to determine the elements of D, where each archetype serves as one dictionary element. The extraction of the archetypes, collected in a (M × K)dimensional matrix of A = [a 1 , . . . , a k , . . . , a K ], k = 1, . . . , K, is carried out by SiVM, which is an efficient method to determine the archetypes of the data distribution. This approach is aiming on an approximation of the convex hull, where all archetypes are located on it.
In order to find the first archetype, the approach is initialized with a random or pre-defined vector a 0 . The pixel with maximal distance to a 0 , is defined as first archetype a 1 , defined by with d(·, ·) being the Euclidean distance function between the spectral features of the archetype a 0 and the pixel x n . Further archetypes are specified sequentially, such that the volume of the simplex becomes maximized with each additional archetype. Since the volume operation is too computational intense, instead the archetypes are selected to have maximum distance to all previously detected ones, using The stopping criterion is generally chosen to be the number of archetypes. We further use the approach of [24] and transform X into kernel space using a Gaussian radial basis function kernel with a hyperparameter σ describing the width of the Gaussian kernel. In this way, archetypes are selected which lie on the the concave hull rather than the convex hull.
The disadvantage of archetypal analysis is that the final set depends on the initialization point, and as a result, there is no unique solution to the final set. Especially, if the number of archetypes in the dictionary is low, various solutions lead to significantly different accuracies. Moreover, the number of archetypes is generally not know beforehand, and depends on the number of informative dimensions and the variability of the data.

C. Reversible Jump Markov Chain Monte Carlo Method
To overcome this problem, we propose to use an optimization procedure to find the best set of archetypes from a large set of pre-selected ones, called the initial set. Under the assumption that a suitable archetypal set is able to represent the test set T X with a low reconstruction error, our task is to find the set of archetypes D = A which minimizes the energy where γ is obtained by using the current set of archetypes as dictionary D. The energy U is a complex function with rough landscape and unknown dimensionality due to the unknown number of archetypes. Therefore, we optimize with rjMCMC coupled with simulated annealing to find the global optimum. Introducing the temperature parameter R, the optimizer is given by While MCMC is dedicated to sample from complex functions, simulated annealing allows to make a point estimate of its global optimum. Using simulated annealing we create a Markov chain, such that the samples explore the whole state space in the beginning and gradually concentrate around the global optimum of the energy function U. In this way we avoid trapping into local optima, as it is usually the case for greedy algorithms. We use the so-called birth an death algorithm [36] to sample from a restricted sample space of possible sets of archetypes, which turns out to be a special type of Green's rjMCMC sampler [30]. The sample space is restricted by a Poisson prior on the expected number of archetypes, as presented in [37]. We further introduce an upper bound on the the number of selected archetypes.

A. Experimental Setup
In our experiments, the simulated EnMAP data is reconstructed by sparse representation with the before mentioned constraints using the following libraries: 1) the manually designed spectral library (LibCom), 2) an optimized set of the manually designed spectral library LibRed (using rjMCMC), 3) a library containing 40 extracted archetypes in the original feature space initialized by the mean vector (AA-M-Lin), 4) a library containing 75 extracted archetypes in the kernel space initialized by the mean vector (AA-M), 5) a library containing 75 extracted archetypes in the kernel space initialized by a random vector (AA-Rand), 6) a library containing an accumulated set of multiple single sets of archetypes, obtained in the kernel space with random initializations (AA-Full), and 7) an optimized set of AA-Full denoted by AA-Opt (using rjMCMC). We compare our results to the regression and classification methods presented in [8], namely support vector machine (SVM), import vector machine (IVM), support vector regression (SVR), and multi-output support vector regression (MSVR). The aim of the experiments is to show the suitability of the spectral libraries for sub-pixel quantification. Moreover, the goal is to show that the set of automatically derived spectral libraries using archetypal analysis and rjMCMC achieve similar results than the manually designed spectral library. The sub-pixel quantification is evaluated with the given reference mixing fractions. Additionally, the number of used archetypes is evaluated and discussed. We run all experiments including a random component 10 times and report the average result and standard deviation.
As pre-processing step, the dataset is outlier-cleaned using the local outlier factor approach [38], using 10 neighbors and a quantile of 0.95 as threshold on the pairwise Euclidean distances. The width of the Gaussian kernel σ is chosen to be 0.5 of the average standard deviation over all bands. In order to choose a suitable sparsity value W , we use Elbow method to analyze the reconstruction error of the test set obtained by the manually designed library as well as the automatically derived library containing 75 archetypes. Fig. 3 shows that the reconstruction error will not significantly decrease for W > 7, and thus we choose this value as upper bound for the sparsity constraint. For rjMCMC we choose the prior on the expected number of archetypes to be 75 when using all labeled data, and 40 when using the manually designed spectral library. The upper limit of archetypes is chosen to be 150.

B. Evaluation
We use several statistics for evaluation. First, the reconstruction error of the sparse representation is provided, indicating the ability of the elementary spectra to represent the EnMAP pixels. Moreover, the result of the sparse representation is evaluated by a comparison between reference and actual estimated fraction coefficients. This is done by means of the overall and class-wise mean absolute error (MAE) between reference and actual coefficients.
with T f t being the reference fractions, I f t being the estimated fractions, and T being the number of evaluated pixels. We obtain I f t by summing up all coefficients belonging to elementary spectra of the same class. The smaller the MAE, the better the reconstruction. Moreover, the root mean square error (RMSE) is provided, which is defined by Besides this, we provide the coefficient of determination (R 2 ) for each class. It is calculated as the squared correlation coefficient between T f t and I f t . Table II presents the results of the EnMAP sub-pixel quantification. In the first block, the results of [8] are shown for comparability with regression and classification results. The middle block shows the results obtained by the full and reduced manually designed spectral library. The bottom part of the table presents the results obtained by archetypal analysis.

C. Results and Discussion
Overall, the MAE values show that all spectral libraries which are obtained from the full manually designed library X lib achieve similar and satisfactory results. It can be seen that in comparison to the results of the regression and classification approaches presented in [8], sparse representation with the mentioned constraints show equivalent results. Both spectral libraries have a small average MAE < 9%, which is also obtained by IVM and MSVR.
Remarkably, LibCom and LibRed show similar results, however, with LibRed having much fewer elements. LibCom has slightly lower class-wise MAE for the classes impervious surface, vegetation and soil, where LibRed obtains better results for water. This indicates that the presented rjMCMC method is able to find a suitable subset of elementary spectra having nearly the same representational power than the manually designed library with all elements. Moreover, it can be stated from this result that the full spectral library contains redundant elementary spectra which can be discarded for sub-pixel quantification. These findings are underlined by Fig. 4, which shows the scatter plots representing the 1495 class-wise fraction estimates opposed to the reference fractions. It can be observed that soil and water have a high proportion of 0% reference fractions and lesser >0% fractions than impervious surface and vegetation. The impervious surface scatter plot show that for LibCom the two lines intersect nearly at the 15% point on the x-axis. This means that the estimated fractions are usually too small for all pixels which have a degree of impervious surfaces over 15%, and otherwise too high for smaller values than the intersection point. In comparison to LibRed, the reference values are also mostly underestimated, but slightly better than LibCom. The soil scatter plots have large discrepancies between the true 1:1 line (dashed line) and the estimated least-squares regression line (solid line). Besides this, there is a high amount of 0% fractions, which are estimated with up to 40% when using the LibCom spectra and 30% fractions when using LibRed spectra. Furthermore, all pixels with soil-fractions are clearly underestimated equally. This is also the case for the regression and classification approaches presented in [8]. We assume that the small number of elementary spectra is insufficient for the spectral diversity of soil pixels. Finally, the described observation with high estimated 0% fractions also occurs in the Water class with values over 50%. Figure 5 illustrates the frequency of the usage of each spectrum for reconstruction in order to determine the fractions of 1493 EnMAP pixels. The results obtained by LibCom are shown on the left and the results for LibRed on the right side. Moreover, the lower bright part of each bar indicates the sum over all fractions from which the total proportion of each land cover class in the study area can be derived. LibCom shows that there are significant elementary spectra, e.g., 18, 19 (impervious surface), 45, 48, 52, 54 (vegetation) and 75 (water), which show a high fraction in the reconstruction. In the other side, some elementary spectra in this plot are infrequently used. It can be observed that there is generally no dependence between the height of a bar, i.e. the number of non-zero estimated fractions for the spectra, and its overall fraction's sum for all pixels. A striking example is spectra 19. Spectrum 19 in the LibCom library is used for more than 500 pixel reconstructions, but its fraction's sum is lower in comparison to, e.g., spectrum 18, which is used more infrequently. However, LibRed has a more balanced composition of elementary spectra with fewer bar heights, which are close to zero. Especially noticeable are the differences in the bar for the class water between LibCom and LibRed. A water spectrum has the special characteristic that there are only small reflectance values over all bands, so that it acts almost as a linear factor. Because of this, it is well suited to support all reconstructions, resulting in a high overall fractions' sum. Its value in the LibCom is high with a share in over 800 reconstructions, in comparison to LibRed which shows a bar height of just over 600 and a lower fractions' sum. We assume the reason for this is the smaller number of elementary spectra, where oftentimes a reconstruction with water spectra is helpful regardless of the presence of water in the pixel.
The results in Tab. II obtained from various spectral libraries based on the labeled data set X are more variable, indicating their dependency of a proper choice of hyperparameters such as the number of archetypes and the selection approach. For example, the number of selected archetypes in set AA-M-Lin is 40, which is smaller in comparison to AA-M. Both sets are initialized with the mean vector of all samples in X , where the first set is extracted in the original feature space and the latter one in the kernel space. We observed that a selection of more than 40 archetypes in the set AA-M-Lin means that some archetypes lie in the center of the dataset X , since they have the maximal distance to the previously selected archetypes. The set AA-M-Lin consists of 25 impervious surface spectra, 12 vegetation spectra, 2 soil spectra and one water spectrum. The total number of selected archetypes in AA-M is chosen to be 75, as for the manually designed library LibCom. For all classes except the water class, the results obtained by AA-M are worse than these ones obtained by AA-M-Lin. Also the results for the randomly initialized archetypal sets AA-Rand show only slightly better results. Moreover, we observed that oftentimes the selected archetypal set with random initialization does not contain all land cover classes. Thus, as indicated by the results, a single set on selected archetypes is not suitable enough for an accurate sub-pixel quantification. Therefore, in order to create a higher diversity of archetypes, various initializations are used for SiVM and accumulated into a larger set. The set AA-Full contains 142 different archetypes, however, the high amount of elementary spectra results in high MAE values. The best result based on X is obtained by AA-Opt, which is the reduced AA-Full set. This set has the best overall MAE and the classwise MAE are in most cases comparable to the approaches based on X lib . The final number of elements in set AA-Opt is 81.7 on average, which is similar to the number of elements in the manually designed spectral library. Fig. 4 underlines these findings. For impervious surface, vegetation and water the results are similar for all three libraries. However, also for soil the set AA-Opt has large discrepancy between the true 1:1 line (dashed line) and the estimated least-squares regression line (solid line), which is more distinct than for LibCom and LibRed.

V. CONCLUSION
The quantification of sub-pixel fractions in remote sensing images is a relevant task to assess the environmental quality, for example, in urban areas. This paper presents a sub-pixel quantification of the urban area of Berlin, Germany, into four land cover classes impervious surface, vegetation, soil and water, using a manually designed spectral library and various kinds of automatically derived spectral libraries. We perform the estimation of the fractions by using sparse representation with non-negativity, sparsity, and sum-to-one constraints. We use archetypal analysis by simplex volume maximization to automatically derive the elements for a library, and apply reversible jump Markov chain Monte Carlo method to find a small, yet representative set of suitable elementary spectra. The archetypes are extracted from HyMap data with given reference information for all land cover classes. As our experiments suggest, the extracted archetypes are suitable to serve as spectral library for sub-pixel quantification.
Moreover, in contrast to a manually designed library, the automatically derived spectral library can be easily extracted with no or limited human user interaction, and the library is specifically adapted to the current image characteristics. In case no reference data is given, the interpretation can be done in a fast way by assigning land cover classes to the extracted archetypes using expert knowledge. The presented approach is flexible regarding the chosen estimation technique for the sub-pixel fractions, and can also be combined with spatial information, which may further increase the approximation ability and accuracy of the fraction estimates.