URBAN MATERIAL CLASSIFICATION USING SPECTRAL AND TEXTURAL FEATURES RETRIEVED FROM AUTOENCODERS

Classification of urban materials using remote sensing data, in particular hyperspectral data, is common practice. Spectral libraries can be utilized to train a classifier since they provide spectral features about selected urban materials. However, urban materials can have similar spectral characteristic features due to high inter-class correlation, which can lead to misclassification. Spectral libraries rarely provide imagery of their samples, which rules out classifying urban materials with additional textural information. Thus, this paper conducts material classification comparing the benefits of using close-range acquired spectral and textural features. The spectral features consist of either the original spectra, a PCA-based encoding or the compressed spectral representation of the original spectra retrieved using a deep autoencoder. The textural features are generated using a deep denoising convolutional autoencoder. The spectral and textural features are gathered from the recently published spectral library KLUM. Three classifiers are used: the two well-established Random Forest and Support Vector Machine classifiers in addition to a Histogram-based Gradient Boosting Classification Tree. The achieved overall accuracy was within the range of 70-80% with a standard deviation between 2-10% across all classification approaches. This indicates that the number of samples is still insufficient for some of the material classes for this classification task. Nonetheless, the classification results indicate that the spectral features are more important for assigning material labels than the textural features.


INTRODUCTION
Assessing the materials in the urban environment has in recent times increased in importance for several reasons and applications. For example, this information is useful for researchers and city planners who deal with city simulations or models where knowledge about the existing materials is important. This can include three-dimensional building models as given in CityGML (Kolbe et al., 2005) and the extended dataset called Buildings from OpenStreetMap (OpenStreetMap contributors, 2017). Furthermore, knowledge about the material of a cultural heritage object contributes to preservation and conservation of either buildings (Sánchez and Quirós, 2017;Yuan et al., 2020) or statues (Grilli and Remondino, 2019). Further knowledge about urban materials can be an indicator on how to tackle and handle the urban heat island effect which has an increasing occurrence in cities (Ward et al., 2016;Santamouris et al., 2011).
As observed in previous studies on urban material classification, the characteristic spectral features of different urban material classes can be similar, which makes it challenging to distinguish them from each other (Ilehag et al., 2017b;Ouerghemmi et al., 2017;Deshpande et al., 2019). This is partly due to the high inter-class correlation which leads to misclassification. Furthermore, when different spectral libraries only provide the spectra and no imagery of their samples, it can be challenging to ensure that the different spectral libraries contain materials that have been labeled in a consistent manner (Fairbarn Jr, 2013;Ilehag et al., 2019). Likewise, this can again lead to misclassification as the different materials might have the same labels. Few spectral libraries contain imagery of the samples (Kotthaus et al., 2014;Kokaly et al., 2017;Ilehag et al., 2019) which can provide additional information for material classification. Combining spectral and textural features from satellite images for classification of e.g. tree species (Ferreira et al., 2019), agriculture (Mirzapour and Ghassemian, 2015;Ding et al., 2019) and urban areas (Yuan et al., 2013) has shown great potential. Thus, as urban materials are challenging to separate on a spectral level due to the similarity of the spectral characteristic features, added textural information could be beneficial. Assessment of materials for close-range approaches, such as scene analysis via unmanned aerial vehicles (UAV), could likewise benefit from combining spectral and textural features (Ilehag et al., 2017a). Imagery of material samples acquired from a close distance, where distinct textural features of the material can be detected, could contribute to improved material distinction. One way to extract textural features is to use an autoencoder (AE) (Kramer, 1991).
An AE is an unsupervised machine learning method for compressing and decompressing data to retrieve important dimensionality-reduced features, which can include textural features (Das and Walia, 2019). The retrieved features can be used as input for classification tasks and have shown great potential (Geng et al., 2015;Li et al., 2016).
In this paper, we aim to classify urban materials based on spectral and textural features using the close-range acquired spectral library KLUM (Ilehag et al., 2019). In particular, we intend to determine the benefits of using either spectral or textural features, or the combination of both. The spectral features are either expressed as the original spectra, the encodings based on the first few principal components or the compressed spectral features acquired from a deep AE (DAE) (Vincent et al., 2010). The textural features are retrieved from a deep denoising convolutional AE (CAE) (Masci et al., 2011). This paper is structured as follows. We briefly summarize the related literature about AEs and the approaches for urban material classification in Section 2. The dataset used, the spectral library KLUM, is presented in Section 3. The proposed methodology is presented in Section 4, which consists of creating AEs for extraction of the compressed representation and classification. The classification results and further analyses of the usefulness of AEs are presented in Section 5. Lastly, final remarks and suggestions for future work are provided in Section 6.

RELATED WORK
To provide an overview of related work, the architecture of an AE and its applications are presented in Subsection 2.1. Spectral and textural approaches to urban material classification are presented in Subsection 2.2.

Autoencoders
An AE is a data compression method that can be used to learn and extract important features in an unsupervised manner (Kramer, 1991). An AE consists of a compression function, an encoder, that reduces the dimensionality of the dataset and of a decompression function, a decoder, that restores the compressed dataset into its original form. The architecture of a typical AE can be seen in Figure 1. As visualized, the input layer is compressed using a number of hidden layers which the user has chosen for the particular task in mind. The outcome of the encoder is the compressed representation, which is a dimensionality-reduced representation of the original input. To restore the original input into the reconstructed input, that is, the output layer, the compressed representation is decompressed using the decoder. An AE is built for specific sets of data, that is, it is only able to compress and decompress datasets comparable to what it was trained for. For example, if being trained on images of animals, it would perform poorly on images of cars. An AE requires both training and testing data to assess the quality of the encoder and the decoder, and can perform well on smaller datasets (Feng et al., 2019).
Depending on the dataset and the purpose in mind, different autoencoding architectures are available. We present here those relevant for this paper. A DAE (Vincent et al., 2010) is an AE that contains several hidden layers connected with each other, meaning that the output from one hidden layer is the input of the next hidden layer. A CAE (Masci et al., 2011) utilizes convolutional layers to extract features and is most commonly applied to imagery as it discovers localized repeated features over the input domain.
AEs can be used for different kinds of tasks, one being denoising of corrupted data, such as signals (Vincent et al., 2008;Vincent et al., 2010). A denoising AE is trained on both the corrupted and uncorrupted data. Random noise is intentionally added to the uncorrupted data to create the corrupted dataset. Adding noise to the uncorrupted dataset forces the AE to extract the most important features (Vincent et al., 2010), which makes it more robust. The compressed representation can be used as an input for classification tasks, such as land-cover classification (Li et al., 2016) or classification of a SAR image (Geng et al., 2015). As the compressed representation is a dimensionality-reduced representation, it can be a useful approach for high-dimensional data, e.g. when dealing with hyperspectral data (Lan et al., 2019;Lin et al., 2013), or for extracting important features from images (Das and Walia, 2019).
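The denoising principle described above can be illustrated with a minimal sketch. It is not the architecture used in this paper: it assumes scikit-learn's `MLPRegressor` as a stand-in for a dense AE, synthetic data, and illustrative layer sizes and noise level. The network is trained to map noisy input to the clean target, and the compressed representation is read from the bottleneck layer.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic "clean" data scaled to [0, 1]: 200 samples, 20 features, 3 latent factors.
latent = rng.random((200, 3))
mixing = rng.random((3, 20))
x_clean = latent @ mixing
x_clean = (x_clean - x_clean.min()) / (x_clean.max() - x_clean.min())

# Corrupt the input with Gaussian noise (noise level is an illustrative assumption).
x_noisy = x_clean + rng.normal(0.0, 0.05, size=x_clean.shape)

# Encoder 20 -> 16 -> 4 and decoder 4 -> 16 -> 20; MLPRegressor has a linear output layer.
autoencoder = MLPRegressor(hidden_layer_sizes=(16, 4, 16), activation="relu",
                           solver="adam", max_iter=500, random_state=0)
autoencoder.fit(x_noisy, x_clean)  # noisy input, clean reconstruction target


def encode(model, x, n_encoder_layers=2):
    """Run only the encoder layers and return the bottleneck activations."""
    hidden = x
    for weights, bias in zip(model.coefs_[:n_encoder_layers],
                             model.intercepts_[:n_encoder_layers]):
        hidden = np.maximum(0.0, hidden @ weights + bias)  # ReLU activation
    return hidden


compressed = encode(autoencoder, x_noisy)      # shape (200, 4)
reconstructed = autoencoder.predict(x_noisy)   # shape (200, 20)
```

The bottleneck vector `compressed` plays the role of the compressed representation that is later passed to a classifier.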

Urban material datasets and classification approaches
The utilization of spectral libraries is a common procedure while dealing with material classification based on purely spectral features. They contain a large variation of material samples which can be used for training a classifier due to their unique spectral characteristic features. As libraries are compiled in different countries across the world, the provided samples can vary due to the regional differences. Spectral libraries are therefore generated to represent a particular region. Publicly available spectral libraries such as the Santa Barbara spectral library (Herold et al., 2004), ASTER (Baldridge et al., 2009), SLUM (Kotthaus et al., 2014), USGS spectral library version 7 (Kokaly et al., 2017) and KLUM (Ilehag et al., 2019) enable the possibility to extract spectral features for further studies. Since spectral libraries aim to provide only spectra of a variation of materials, imagery corresponding to the given samples is usually not provided.
In contrast to such spectral libraries, other material libraries have been presented where the focus is set on material classification based on textural properties. Respective material libraries include the CUReT Database (Dana et al., 1999) and the KTH-TIPS Database (Hayman et al., 2004) that contain RGB images of a single representative per material class captured under different conditions (e.g. with respect to scale, illumination and viewpoint) in a controlled setting. Since these databases only contain a single material sample per class and thus do not take into account intra-class variations, an extension has been presented with the KTH-TIPS2 Database (Caputo et al., 2005). It contains further material classes and different samples of the same material class that have been acquired under different viewing and illumination conditions. A similar concept for data acquisition has been applied for acquiring the UBO2014 Database (Weinmann et al., 2014). However, one of the main limitations of such databases is the fact that they have been acquired in a lab environment in a controlled setting, thus not appropriately taking into account the complexity of real-world environment conditions, as e.g. illumination conditions strongly influence the appearance of materials. This has for instance been addressed with the Flickr Material Database (Sharan et al., 2009), the OpenSurfaces dataset (Bell et al., 2013) or the Materials in Context (MINC) Database (Bell et al., 2015). These contain images of materials acquired under uncontrolled viewing and illumination conditions beyond lab environments, thus accounting for a large intra-class variation of material samples regarding their appearance in complex real-world scenarios.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume V-1-2020, 2020 XXIV ISPRS Congress (2020 edition)
Further work is focused on the use of material representations digitized in lab environments to synthesize the respective material appearance variations as seen in real-world environments under different viewing and illumination conditions (Weinmann et al., 2014).
The combination of textural and spectral features for material classification is not as well-established as only using spectral features. However, due to hyperspectral data having a high dimension, classification can be challenging, especially if there is a lack of training data available. Thus, texture can provide additional information, for example for classification of agricultural land using three-dimensional wavelet texture features (Qian et al., 2012), or with a Gray-Level Co-Occurrence Matrix to assess either building materials (Lerma et al., 2000) or meat quality (Yang et al., 2018). The texture can be described using different approaches and the algorithm is chosen to suit the needs of the task.

DATASET
For this paper, we used the publicly available spectral library KLUM (Ilehag et al., 2019) that was acquired in-situ in 2018 in the city of Karlsruhe, Germany. The spectral samples were acquired with the high-resolution spectroradiometer ASD FieldSpec 4 Hi-Res in the spectral range of 350-2500 nm. The spectroradiometer has three sensors which cover the ranges of 350-1000 nm, 1001-1800 nm and 1801-2500 nm respectively. A spectral sampling of 1.4 nm and a spectral resolution of 3 nm are used in the spectral range of 350-1000 nm, while a spectral sampling of 1.1 nm and a spectral resolution of 8 nm are used in the remaining spectral range. The spectroradiometer has in total 2151 channels and a wavelength accuracy of 0.5 nm.
KLUM mainly contains building facade materials, consisting of 181 spectral samples. KLUM is split into 13 material classes and 33 material subclasses. The subclasses are split based on the color, the surface attributes (coating, structure and texture) and the state of the sample (e.g. new or old). The spectral library also provides images of the acquired samples.
For this paper, we choose a total of 61 samples from the six material classes Asphalt, Ceramic, Conglomerate, Limestone, Plaster and Sandstone (see Figure 2 for selected examples). These samples provide us with 606 spectra, each sample being represented by eight to ten continuously acquired spectra. The six classes are selected firstly to focus on material classes with similar spectral features but with different textural features, such as Asphalt and Conglomerate. Secondly, we want to limit the number of contaminated samples (e.g. painted surface) as these features can be observed in this spectral range (Ilehag et al., 2019). Lastly, we include material classes which have distinct visible textural features, such as Sandstone, to investigate the impact of the textural information.
Due to this paper focusing on both spectral and textural information for material classification, we extend the imagery of KLUM by acquiring further high-resolution images of each sample. The images are acquired with a Nikon D810, a single-lens digital camera with 36.6 megapixels and a sensor size of 35.9 mm x 24 mm. The acquired images have a pixel-resolution of 6144 x 4080 pixels. For each sample, three to four images from different distances were acquired, ranging between 0.

PROPOSED METHOD
To classify the given samples, a framework consisting of two major steps is utilized. We firstly address the feature extraction by separately deriving the compressed representation of both the spectral information and the textural information using a DAE and a CAE respectively (visualized as the workflow in Figure 3). Secondly, we perform the material classification using all extracted textural and spectral features as input to the supervised classifiers. For the assessment of the DAE, we perform a comparison with the alternative spectral representations by firstly using the original spectra and secondly employing the Principal Component Analysis (PCA) (Tipping and Bishop, 1999) on the original spectra.
Figure 3. Data processing flow for one material sample, either for the spectral features, the textural features or the combination using the original spectra.

Compressed feature representation with autoencoders
To assess the performance of the encoder, a decoder, which decodes the compressed representation, and a loss function, that measures the information loss between the compressed and decompressed representation, are needed. The AE structures differ depending on the input, as it is either spectral data or imagery.
Some precautions and procedures remain the same for both types of AEs. Firstly, the input data and the compressed representation are normalized to the closed interval [0, 1]. Secondly, we use 20% of the dataset as testing data, 10% for validation and the remaining 70% for training. Lastly, due to the KLUM dataset containing a class imbalance, we ensure that each class is represented in both the training and testing datasets.
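A minimal sketch of these shared steps is given below. It assumes scikit-learn's `train_test_split` with stratification, synthetic data, and a global min-max normalization; whether the normalization in this paper is applied globally or per feature is not specified, so that choice is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 8))     # synthetic stand-in for the input data
labels = np.repeat(np.arange(5), 20)     # five hypothetical classes


def min_max_normalize(x):
    """Scale all values to the closed interval [0, 1] (global scaling, assumed)."""
    lowest, highest = x.min(), x.max()
    return (x - lowest) / (highest - lowest)


features = min_max_normalize(features)

# Hold out 20% for testing; stratification keeps every class in both parts.
x_rest, x_test, y_rest, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

# 10% of the full dataset corresponds to 12.5% of the remaining 80%.
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=0)
```

With 100 synthetic samples this yields 70 training, 10 validation and 20 testing samples.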
The utilized models and the necessary pre-processing procedures are further explained in the following sections, firstly the DAE for the spectral compression and secondly the CAE for the textural compression.
Deep autoencoder for spectral information
As the dataset was acquired in-situ, some of the spectral channels are removed due to atmospheric effects, resulting in spectral gaps. To counter this, we split the spectral range into four individual sections to reduce the impact of any undesired atmospheric feature. Thus, four spectral bands are defined: 350-949 nm, 1021-1339 nm, 1451-1779 nm and 1971-2299 nm. The bands are not equally sized and each individual DAE is therefore structured differently.
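The split into four spectral sections can be sketched as follows, assuming that the 2151 channels are sampled at 1 nm steps from 350 nm to 2500 nm and that the band limits are inclusive. The function name and the random spectrum are illustrative only.

```python
import numpy as np

# 2151 channels from 350 nm to 2500 nm at an assumed 1 nm spacing.
wavelengths = np.arange(350, 2501)

# Band limits in nm as listed in the text (treated as inclusive).
bands_nm = [(350, 949), (1021, 1339), (1451, 1779), (1971, 2299)]


def split_into_bands(spectrum, wavelengths, bands_nm):
    """Return one sub-spectrum per band, dropping channels in the gaps."""
    return [spectrum[(wavelengths >= low) & (wavelengths <= high)]
            for low, high in bands_nm]


spectrum = np.random.default_rng(0).random(wavelengths.size)  # synthetic spectrum
sections = split_into_bands(spectrum, wavelengths, bands_nm)
band_sizes = [section.size for section in sections]
total_channels = sum(band_sizes)
```

Under these assumptions the sections contain 600, 319, 329 and 329 channels, which sum to 1577 and match the number of spectral features reported for the original spectra.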
As the KLUM dataset contains several spectra of the same sample, we take measures to ensure that the training and testing data do not contain spectra from the same sample. That is, all spectra of a sample are present either in the training or in the testing dataset.
Four DAEs are used in this paper and they are constructed by stacking either three or five hidden layers, depending on the size of the input spectra (see Figure 4 for the DAE structure for the first spectral range). We utilize densely connected layers and the rectified linear unit (ReLU) activation, defined as y = max(0, x), for each hidden layer except for the last output layer where we use a linear activation, defined as y = x. Additionally, we add a sparsity constraint of $10^{-5}$ to prevent the hidden layers from learning an approximation of PCA, since this is not desired in this paper. To train the DAE, we use the stochastic optimizer Adam (Kingma and Ba, 2014) and the mean squared error to assess the loss. For each of the four DAEs, we retrieve a compressed representation as a vector of its corresponding spectral section. Over the complete range, we obtain a compressed representation with 32 elements (each DAE yields a vector of size eight).

Deep denoising convolutional autoencoder for textural information
To extract the textural features from the extended imagery, we utilize a CAE. We perform a few image pre-processing procedures to improve the ability to encode and decode the images. Firstly, as the acquired imagery has a high pixel resolution of 6144 x 4080 pixels, we scale down the pixel resolution by a factor of 2, as such a high level of detail is neither desired nor necessary. Secondly, we extract six randomly located patches of 256 x 256 pixels from each image to increase the variation of the dataset. Thirdly, the patches are transformed into grey-scale using Y = 0.2125·R + 0.7154·G + 0.0721·B, as we do not require information from three color channels to retrieve textural features. Lastly, we add normally distributed noise to increase robustness.
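The four image pre-processing steps can be sketched with NumPy as follows. The downscaling method (2 x 2 block averaging), the noise standard deviation and the small synthetic image are assumptions made for illustration; only the patch size, the number of patches and the grey-scale weights are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)


def downscale_by_two(image):
    """Halve the resolution by averaging 2 x 2 pixel blocks (assumed method)."""
    height, width, channels = image.shape
    h2, w2 = height // 2 * 2, width // 2 * 2
    blocks = image[:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2, channels)
    return blocks.mean(axis=(1, 3))


def random_patches(image, n_patches=6, size=256):
    """Cut n_patches randomly located size x size patches from the image."""
    height, width = image.shape[:2]
    patches = []
    for _ in range(n_patches):
        top = rng.integers(0, height - size + 1)
        left = rng.integers(0, width - size + 1)
        patches.append(image[top:top + size, left:left + size])
    return np.stack(patches)


def to_grey(rgb):
    """Weighted sum of the colour channels: Y = 0.2125 R + 0.7154 G + 0.0721 B."""
    return rgb @ np.array([0.2125, 0.7154, 0.0721])


def add_noise(patches, sigma=0.05):
    """Add normally distributed noise and clip back to [0, 1]; sigma is assumed."""
    return np.clip(patches + rng.normal(0.0, sigma, patches.shape), 0.0, 1.0)


image = rng.random((1024, 1536, 3))   # small synthetic stand-in for a photograph
small = downscale_by_two(image)       # (512, 768, 3)
patches = random_patches(small)       # (6, 256, 256, 3)
grey = to_grey(patches)               # (6, 256, 256)
noisy = add_noise(grey)               # noisy CAE input; grey is the clean target
```

In a denoising setup, `noisy` would serve as the network input and `grey` as the reconstruction target.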
The structure used for the CAE consists firstly of the encoder and secondly of the decoder. The structure of the encoder can be seen in Figure 5 and consists of the combination of a 2D convolutional layer with a linear activation, a batch normalization and an additional ReLU activation layer. The last layer in the encoder consists of reshaping the final encoder output into the compressed representation, a vector with 32 elements. The decoder consists of the inverse stacks to decode the compressed representation into the same shape as the encoded input but with a sigmoid activation, defined as $y = \frac{1}{1 + e^{-x}}$, to retrieve a value between 0 and 1. We train the CAE with the stochastic optimizer Nadam (Dozat, 2016) and use the mean squared error to assess the loss. Figure 6 displays the distribution of extracted compressed textural features for two samples. As the compressed textural representations differ between the two examples, it indicates that we have retrieved distinct textural features which can be used for material class assignment.

Classification
To assess the material classes, we perform supervised classification using three different classifiers which are available in the Python toolbox scikit-learn (Pedregosa et al., 2011): Random Forest (RF) (Breiman, 2001), Histogram-based Gradient Boosting Classification Tree (HGB) (Pedregosa et al., 2011) and Support Vector Machine (SVM) (Wu et al., 2004). The RF classifier consists of an ensemble of randomly trained decision trees, where each decision tree uses a random subset of the training data for training, which results in a set of decision trees that are different from each other. The parameters of the RF are set to 100 decision trees and a maximum tree depth of 100. The HGB classifier, which is available as an experimental approach in the toolbox scikit-learn, is implemented based on the classifier LightGBM (Ke et al., 2017), a gradient boosting framework that uses a tree-based approach. HGB bases the tree splits on the potential gain and splits the samples into integer-valued bins (histogram bins), which reduces the number of splitting points. Here, we set the maximum number of iterations to 100, the maximum depth to 10, the loss function to categorical cross-entropy and the learning rate to 0.5. The SVM classifier bases its classification on the one-vs-one scheme, that is, it separates two classes of interest at a time in a high-dimensional space and considers the distance between the nearest features of the two classes (Wu et al., 2004). We use a radial basis function kernel for the separation between the classes as a simple linear separation would not be enough for high-dimensional data. Furthermore, the regularization parameter is set to 100, the kernel coefficient for the radial basis function to 1/(num(samples) · var(dataset)) and the regularization parameter for each class is weighted by num(samples)/(num(classes) · bincount(classes)) to account for class imbalances.
The material classification is carried out for three kinds of input data: either only spectral features, only textural features that describe the image patches or a combination of both. We use the CAE features to represent the image patches while the spectral features are represented as either the compressed representation that we received from the DAE, the original spectra or the PCA-based encodings of the original spectra. All in all, we define seven different scenarios of input features. We can therefore assess both the contribution of combining spectral and textural features, and the usefulness of the DAE.
Some precautions and procedures are the same for all classification tasks. Firstly, we ensure once again that the training and testing data do not contain spectra and image patches from the same sample. Secondly, we take care that each class is present in both the training and the testing datasets due to the class imbalance. We utilize a stratified k-fold cross-validation approach to preserve the class distribution for training and testing the classifiers. A k-fold cross-validation approach allows us to perform the classification several times using different splits, which reduces the chance of overfitting during training. We use k = 4, which is based on the number of samples in the smallest class to make sure that each split contains enough samples.
Lastly, to evaluate the performance of the classifiers, we consider different measures. We determine the overall accuracy OA, which indicates the overall performance of the classifier, and the κ-value, which indicates how well classes can be separated from each other, in addition to the average recall score $\bar{R}$, the average precision score $\bar{P}$ and the average F1-score across all classes. As we use a k-fold cross-validation, we provide the mean evaluation measures in addition to the standard deviation σ.
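These measures can be computed with scikit-learn as sketched below. The sketch assumes that "average" refers to the unweighted (macro) mean across classes, which is not stated explicitly in the text; the two label vectors are toy examples standing in for two cross-validation folds.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score)


def evaluate(y_true, y_pred):
    """Overall accuracy, kappa and macro-averaged recall, precision and F1."""
    return {
        "OA": accuracy_score(y_true, y_pred),
        "kappa": cohen_kappa_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "precision": precision_score(y_true, y_pred, average="macro",
                                     zero_division=0),
        "F1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }


def summarize(fold_scores):
    """Mean and standard deviation of every measure across the folds."""
    return {name: (float(np.mean([s[name] for s in fold_scores])),
                   float(np.std([s[name] for s in fold_scores])))
            for name in fold_scores[0]}


# Two toy folds: one with a single misclassification, one without errors.
fold_scores = [
    evaluate([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]),
    evaluate([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]),
]
summary = summarize(fold_scores)
```

For the first toy fold, five of six labels are correct, giving OA = 5/6 and κ = 0.75.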
Classification based on spectral features
Hyperspectral data contains redundant information due to its broad spectral range. We therefore apply the PCA algorithm (Tipping and Bishop, 1999) from the Python toolbox scikit-learn (Pedregosa et al., 2011) on the original spectra to reduce the spectral dimensionality and retrieve the first few principal components that cover 99.99% of the variability of the given data.
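In scikit-learn, a variance threshold can be passed directly to `PCA` as a fraction, as in the following sketch. The synthetic low-rank data stands in for the spectra, and `svd_solver="full"` is set explicitly because the fractional `n_components` option is documented for that solver.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic "spectra": 100 samples with 50 channels driven by 3 latent factors.
latent = rng.normal(size=(100, 3))
mixing = rng.normal(size=(3, 50))
spectra = latent @ mixing + rng.normal(scale=1e-6, size=(100, 50))

# Keep as many components as needed to explain 99.99% of the variance.
pca = PCA(n_components=0.9999, svd_solver="full")
encoding = pca.fit_transform(spectra)

n_kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
```

The number of retained components (`n_kept`) is then determined by the data rather than fixed in advance.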
For the spectral classification, we either use the original spectra, the PCA-based encoding or the compressed representation of the original spectra from the four DAEs. As the compressed representations of the original spectra were split into four, we merge those in the corresponding spectral order. The number of spectral features is 1577 for the original spectra, 10 for the PCA-based encoding and 32 for the merged DAEs.
Classification based on textural features
A total of 32 textural features are retrieved from the compressed representations of the image patches and are used as input for the textural classification. The imagery database contains images acquired from different acquisition distances (0.2-1.2 m) and the extracted features represent information from different detail levels. To retrieve uniform textural features, only the images acquired from 0.2 m are further used for the classification task.
Classification based on spectral and textural features
To combine the spectral and textural features for each sample, we assign each image patch a corresponding set of spectral features randomly chosen from the available spectra for the same sample. We create three different spectral and textural combinations. They consist of the compressed textural features from an image patch combined with either the original spectra, the first few principal components or the compressed spectral representation, from the same sample.
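The pairing of image patches with randomly chosen spectra of the same sample can be sketched as follows. The function name, the ordering of the concatenated features and the synthetic arrays are illustrative assumptions; the spectral features here simply encode the sample ID so that the pairing can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)


def combine_features(texture_feats, patch_sample_ids, spectral_feats,
                     spectrum_sample_ids):
    """Pair every patch with one random spectrum of the same material sample."""
    combined = []
    for texture, sample_id in zip(texture_feats, patch_sample_ids):
        candidates = np.flatnonzero(spectrum_sample_ids == sample_id)
        chosen = rng.choice(candidates)
        combined.append(np.concatenate([spectral_feats[chosen], texture]))
    return np.stack(combined)


# Synthetic layout: 3 samples, 6 patches and 10 spectra per sample, 32 features each.
patch_sample_ids = np.repeat(np.arange(3), 6)
spectrum_sample_ids = np.repeat(np.arange(3), 10)
texture_feats = rng.random((18, 32))
spectral_feats = np.tile(spectrum_sample_ids[:, None].astype(float), (1, 32))

combined = combine_features(texture_feats, patch_sample_ids,
                            spectral_feats, spectrum_sample_ids)
```

Each row of `combined` holds 32 spectral features followed by 32 textural features of one patch.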

RESULTS AND DISCUSSION
In comparison to the PCA method, a DAE requires both training and testing data to retrieve a compressed spectral representation. Depending on the size of the dataset and the structure of the DAE, the compressed spectral representation requires significantly more computing time and resources than the standard PCA computation. In Figure 7, we can observe the ability of the DAE to encode and decode the original spectra in the spectral range of 350-949 nm. A difference can be noted when comparing the original spectra with their decompressed counterparts since AEs are lossy. The general spectral characteristics remain the same but smaller deviations can be observed. Furthermore, the compressed spectral representations of the two samples are clearly distinct, which indicates that they can be utilized for material distinction.
The material classification results using the three classifiers reveal, in general, an OA within the range of 70-80% as seen in Table 1. However, as spectral characteristic features are similar for urban materials and an inter-class correlation exists, an OA around this range is expected. Furthermore, the compressed spectral representation received from the DAEs does not contribute to substantially better classification results than the standard PCA approach. In general, we receive a rather large σ across all classification approaches, which could indicate that we have a small dataset for this challenging classification task.
The combination of spectral and textural features leads to unexpected results, as the OA does not significantly increase in comparison to only using spectral features. On the contrary, it remains around the same interval. This hints that the spectral features are more important for material classification. This is further supported by observing the most important features according to the RF classifier for these combinations. We receive an indication that about 75% of the most important features are spectral features.
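Such a share can be estimated from the impurity-based feature importances of a fitted Random Forest, as in the sketch below. How many features count as "most important" is not specified in the text, so the cut-off `top_n` is an assumption, and the synthetic data is constructed so that only the spectral block is informative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_spectral, n_textural = 32, 32
labels = np.repeat(np.arange(3), 40)

# Synthetic features: the spectral block depends on the class, the textural one does not.
spectral = rng.normal(size=(120, n_spectral)) + labels[:, None]
textural = rng.normal(size=(120, n_textural))
features = np.hstack([spectral, textural])   # spectral columns first

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(features, labels)


def spectral_share(importances, n_spectral, top_n=20):
    """Fraction of the top_n most important features that are spectral."""
    top = np.argsort(importances)[::-1][:top_n]
    return float(np.mean(top < n_spectral))


share = spectral_share(forest.feature_importances_, n_spectral)
```

The value of `share` depends on the chosen cut-off and on the data, so it should be read as an indication rather than an exact measure.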
The difference between the seven scenarios is minor as the evaluation scores are close to each other. The noticeable difference is, however, a generally larger variation across all scenarios. This implies either that the spectral and textural features are not comparable or that the assignment of spectral and textural feature pairs is not suitable since they contradict each other. Nonetheless, this does indicate that adding the textural features to a spectral dataset does not improve the distinction of classes with similar spectral features, such as Asphalt and Conglomerate.
As we perform a stratified k-fold cross-validation approach, we retrieve several confusion matrices for each classifier. The confusion matrices for one cross-validation cycle for SVM can be seen in Figure 8. Here, the cycle with the highest average overall accuracy OA was chosen for comparison of the confusion matrices between the scenarios relying on original spectra, textural features and the combination of the two. The class distributions across these scenarios are similar, except for the scenario with textural features (which is likewise reflected in the evaluation scores in Table 1). There is some confusion between the classes Asphalt and Conglomerate in addition to Ceramic, Sandstone and Plaster. The textural and spectral features of Limestone appear more distinct as the class has fewer wrongly predicted samples.
In general, the occasionally occurring large σ, the lower evaluation scores and an overall accuracy OA within the range of 70-80% across all classification approaches may be attributed to the limited size of the dataset. Furthermore, the dataset contains a certain class imbalance which becomes more prominent during training of either an AE or a classifier. If a class itself contains a distinct spectral or textural variation, this variation cannot be learned from splits containing few samples. A rigorous balancing of all classes would further decrease either the number of samples or the number of considered classes. The conclusions drawn from this paper are specific to this setup. Future work should address the acquisition of a larger dataset including more variation of both spectral and textural features. Nonetheless, this paper suggests that the combination of spectral and textural features does not significantly improve the possibility of distinguishing materials with similar spectral features.

CONCLUSION AND OUTLOOK
In this paper, we have classified urban materials by utilizing a close-range acquired spectral library and using either spectral features, textural features or a combination of both. We have used two types of AEs to retrieve a compressed spectral representation of the original spectra and a compressed textural representation of the imagery. To assess the compressed spectral representation and its ability as input for classification, we also used the original spectra and a PCA-based encoding. The summary in Table 1 does not show a clear trend whether PCA-based or DAE-based features provide better classification results. While for the RF case the DAE features perform better than those from PCA, there are no consistent results for the HGB and SVM classifiers. Additionally, more steps are required to retrieve the compressed spectral features, which makes this more time- and resource-consuming than retrieving the principal components. Three classification approaches were used: RF, HGB and SVM. The classification approaches perform similarly and a clear ranking is not evident. In general, the achieved OA implies that material classification is heavily based on the spectral features as the retrieved results were worse when using only textural features. By analyzing the most important RF features for the combination of spectral and textural features, we receive a hint that the spectral features are more important. The acquired images that were used for the textural study can be downloaded from the dataset 'KLUM extended imagery' (https://github.com/rebeccailehag/KLUM_extended_imagery) for further analysis.
The combination of spectral and textural features needs to be further explored. On the one hand, a larger dataset needs to be acquired which includes more spectral samples and corresponding imagery. Such a dataset should likewise cover a larger variation of both spectral and textural features to account for variance and thus more representatively cover a wide range of urban materials. On the other hand, a comparison between CAEs and classical texture analysis approaches could be of interest to retrieve different kinds of textural features and to see if the combined approach provides improved classification results.

Bell, S., Upchurch, P., Snavely, N. and Bala, K., 2015. Material recognition in the wild with the Materials in Context Database. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3479-3487.