HYPERSPECTRAL IMAGE CLASSIFICATION WITH LOCALIZED SPECTRAL FILTERING-BASED GRAPH ATTENTION NETWORK

Graph-based deep learning has proven to be a promising approach with a clear advantage in learning from graph data and modeling spatial topological relations between features. In particular, graph attention networks (GATs) efficiently process graph-structured hyperspectral data by leveraging masked self-attention layers to address the known shortcomings of earlier frameworks based on graph convolutions or their approximations. In this study, we propose a novel approach that combines localized spectral filtering and GAT for the hyperspectral image classification task. First, we apply unsupervised t-SNE (t-distributed stochastic neighbor embedding) manifold learning-based feature dimensionality reduction to create localized hyperspectral data cubes. Then, these feature cubes, combined with localized adjacency matrices, are fed into a shallow graph attention network in a supervised learning manner. Finally, by reducing the possible redundancy of spectral information and enhancing the expression of local spatial-spectral information, we obtain credible classification results in distinguishing diverse land covers. Experiments on two real hyperspectral data sets, Indian Pines-A (IA) and Huanghekou (HH), demonstrate that the presented approach offers promising classification performance: the GAT using t-SNE outperforms the GAT using PCA (principal component analysis). The results also confirm the importance of combining spatial and spectral information for hyperspectral image classification.


INTRODUCTION
Hyperspectral imaging is particularly useful for per-pixel thematic classification, which mines the unique spectral signatures of land surface materials to obtain attribute maps of diverse land covers. Hyperspectral image (HSI) classification assigns each three-dimensional (3-D) pixel cube to one of a set of ground-truth classes and has been a very active research area in recent years (Deng et al., 2018; Pu et al., 2021a). Recently, graph-based deep learning, represented by graph convolutional networks (GCNs), has received increasing attention for describing class boundaries and modeling topological relations among samples in the irregular graph domains converted from hyperspectral data (Pu et al., 2021b).
Graphs are a universal representation of non-Euclidean structured data and can encode complex geometric structures (Chung and Graham, 1997). Graph-based representations can be used to model a variety of problems and domains (Avelar et al., 2020). In this context, recent advances in graph-based deep learning have attracted growing attention from the scientific community. Graph-based deep learning methods have contributed substantially to intelligent information extraction in hyperspectral remote sensing (Yang et al., 2018; Pu et al., 2021b). In particular, converting HSIs from regular grids (also called image domains) into irregular graph domains makes it possible to exploit the strengths of graph-based deep learning in delineating class boundaries and modeling feature relationships (Pu et al., 2021b). Graph neural networks (GNNs) are a class of graph-based deep learning methods designed to perform inference on data described by graphs (Wu et al., 2019). Graph convolutional networks (GCNs) have received increasing attention for quantifying nonlinear or topological features in irregular graphs converted from hyperspectral data (Pu et al., 2021b). Typically, a simple GCN consists of message-passing, pooling, and fully-connected layers. The crucial component is the message-passing layer, which computes and updates the representation of each node in the graph by leveraging local information from its neighbors (see Figure 1).
Graph-based convolutional neural networks (that is, a kind of GCN) have distinctive characteristics (Wan et al., 2020): they can competently process irregular regions in non-Euclidean (non-grid) graph data structures, and can even incorporate multiple graph inputs that are dynamically updated and refined with multi-scale neighborhoods. However, GCN-based methods may have difficulty aggregating newly joined nodes and may therefore fail to capture the global and contextual information of the graph (Ding et al., 2021). In this regard, attention mechanisms focus on the most relevant parts of the graph-based input to make effective decisions, which also allows variable-sized inputs to be handled.
Graph attention networks (GATs) are neural network architectures that operate on graph-structured data, leveraging masked self-attention layers to address the known shortcomings of prior methods based on graph convolutions or their approximations (Veličković et al., 2017). Many related works on HSI classification have focused on graph-based semi-supervised learning methods, in which the graph input is built on the full graph (that is, one image corresponds to one full graph) and the labeled and unlabeled nodes are combined by means of a graph Laplacian regularizer (Pu et al., 2021b). In that setting, the unlabeled nodes are completely observed while training or testing a deep learning model, whereas the standard formulation of the semi-supervised learning paradigm requires the independent and identically distributed (i.i.d.) assumption between the labeled and unlabeled nodes (Hamilton et al., 2020).
In this study, we also follow the abovementioned supervised setting to train a GAT. Continuing the previous works on spectral graph-based convolutional neural networks (CNNs) for HSI classification (Defferrard et al., 2016;Pu et al., 2021b;Pu et al., 2022), the main contributions of the presented study are summarized below.

a. We introduced a graph attention-based deep learning architecture to perform node classification of the graph-structured hyperspectral data.
b. We employed unsupervised t-SNE (t-distributed stochastic neighbor embedding) manifold learning-based feature dimensionality reduction to collect the patch-based feature cubes and to create localized graph adjacency matrices, and subsequently used the usual supervised setting to fit the presented graph-based deep learning model.
c. We used two graph attention layers to learn the spatially local graph representation and to represent the localized topological patterns of the graph node and its neighboring nodes.
The rest of this paper is organized as follows. The technical details of the proposed approach are presented in Section 2. Next, we analyze the experimental results and discuss the derived findings in Section 3. Finally, some concluding remarks are given in Section 4.

t-SNE
High-dimensional data such as HSIs do not always have good separability. Because feature dimensionality reduction methods based on t-SNE (t-distributed stochastic neighbor embedding) manifold learning are known to have more favorable properties than those based on principal component analysis (PCA), we adopted t-SNE to reduce the dimensionality of the hyperspectral data in this study (see Figure 2). t-SNE is widely regarded as an effective dimensionality reduction and visualization method for preprocessing high-dimensional data. In essence, t-SNE is an embedding model that maps data from a high-dimensional space to a low-dimensional space while retaining the local characteristics of the data (Van der Maaten and Hinton, 2008). For the HSI data $X$ with $C$ spectral bands, $h$ denotes the height and $w$ the width. t-SNE transforms the affinities between data points into conditional probabilities, and the similarity of data points in the original space is represented by a Gaussian joint distribution. The t-SNE algorithm thus converts $X$ into a low-dimensional matrix with $D$ components, where $C$ is the number of bands and $D$ is the number of components, subject to $D < C$. The optimal dimensionality reduction result is obtained by minimizing the Kullback-Leibler (KL) divergence between the joint probability distributions of the original space and the embedded space. t-SNE focuses on local data structure by identifying patterns based on the similarity of data points with multiple features. Intuitively, after the t-SNE transformation, the resulting components become more separable in the low-dimensional space. The KL divergence thus serves as the loss function, which is minimized by gradient descent.
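To make the KL objective concrete, the following is a minimal sketch in plain Python, not the implementation used in this study. It assumes a single fixed Gaussian bandwidth `sigma` for all points (real t-SNE tunes one bandwidth per point from a perplexity value) and only evaluates the loss for a given embedding; it does not run the optimization. All function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian_joint(points, sigma=1.0):
    """Symmetrized Gaussian affinities P in the original space.

    Assumption: one shared sigma instead of a perplexity-based search.
    """
    n = len(points)
    cond = []
    for i in range(n):
        w = [0.0 if j == i
             else math.exp(-sq_dist(points[i], points[j]) / (2.0 * sigma ** 2))
             for j in range(n)]
        total = sum(w)
        cond.append([v / total for v in w])  # conditional p(j | i)
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]


def student_t_joint(embedding):
    """Student-t (one degree of freedom) affinities Q in the embedded space."""
    n = len(embedding)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def kl_divergence(P, Q):
    """KL(P || Q), skipping zero entries of P (the diagonal)."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0.0)


# Two tight clusters in a 3-band toy "spectrum" space.
pixels = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
P = gaussian_joint(pixels)
loss_kept = kl_divergence(P, student_t_joint([(0.0,), (0.1,), (10.0,), (10.1,)]))
loss_mixed = kl_divergence(P, student_t_joint([(0.0,), (10.0,), (0.1,), (10.1,)]))
```

In this toy case, the 1-D embedding that keeps each cluster together yields a lower loss than the one that mixes the clusters, which is the behavior the objective is meant to reward.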

Localized spectral filtering
The formulation of CNNs in the context of spectral graph theory provides the necessary mathematical background and efficient numerical schemes to design fast localized spectral filters on graphs (Pu et al., 2021b). Importantly, this technique offers the same linear computational complexity and constant learning complexity as classical CNNs, while being applicable to any graph structure (Defferrard et al., 2016). The graph $G$ is then described by a binary adjacency matrix $A$. The algorithm for constructing a localized adjacency matrix (see Figure 3) can be summarized as follows.
Firstly, the 3-D data matrix is given as input. Thirdly, we compute the weighted K-nearest-neighbor graph $G$ for the points in $M$, with $K$ the number of neighbors, using Eq. (1) above, and return the resulting adjacency matrix $A$. Finally, the HSI cubes are created, which completes the data preparation for GNNs.
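The K-neighbor graph construction can be illustrated with a small sketch. It is an assumption-laden example rather than the procedure used in this study: it uses Euclidean distance between feature vectors, produces a binary (unweighted) matrix, and symmetrizes the result so that an edge exists if either node selects the other. The function name `knn_adjacency` is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_adjacency(features, k):
    """Binary adjacency matrix of the symmetric K-nearest-neighbor graph.

    features: list of n feature vectors (one per pixel in the local patch).
    k: number of neighbors selected for each node.
    """
    n = len(features)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Sort all other nodes by distance to node i and keep the k closest.
        ranked = sorted((sq_dist(features[i], features[j]), j)
                        for j in range(n) if j != i)
        for _, j in ranked[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # assumption: undirected graph
    return adj


# Four pixels with one reduced feature each; two natural pairs.
A = knn_adjacency([(0.0,), (1.0,), (10.0,), (11.0,)], k=1)
```

Because of the symmetrization, a node can end up with more than `k` neighbors; whether that matches the original construction is not stated in the text.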

Graph attention network
Graph attention networks (GATs) introduce attention-based deep learning architectures to perform node classification on graph-structured data (Veličković et al., 2017). By learning the importance weight of each neighboring node with respect to the node being classified, the graph attention mechanism assigns greater weights to the more important nodes, so that global and contextual information can be learned from the graph (Ding et al., 2021). The graph convolution output $h_i^{l}$ of each node can be expressed as follows:

$$h_i^{l} = \sigma\left(\sum_{j=1}^{N_i} \alpha_{ij} W h_j^{l-1}\right)$$

where $\sigma$ denotes the activation function, $N_i$ is the size of the neighboring set of node $i$, and $\alpha_{ij}$ is the learned attention weight. The attention coefficient $\alpha_{ij}$ is generated dynamically, depending only on the local neighborhood, and ranks the neighbors by their importance, which makes the model more flexible with respect to the specific input sample. $W$ is a filtering matrix for dimensionality reduction and feature extraction of HSI data.
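As a rough illustration of attention-weighted neighborhood aggregation, the sketch below implements a single-head graph attention layer in plain Python. It follows the general formulation of Veličković et al. (2017), with a LeakyReLU score on concatenated projected features and a softmax restricted to each node's neighbors. Several details are assumptions made for the example: a self-loop is added for every node, the LeakyReLU slope is 0.2, and `W` and `a` are passed in as fixed values rather than learned.

```python
import math


def matvec(W, h):
    """Multiply matrix W (F_out x F_in) by vector h (length F_in)."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def gat_layer(H, adj, W, a):
    """One single-head graph attention layer.

    H: n node feature vectors; adj: n x n 0/1 adjacency matrix;
    W: F_out x F_in projection; a: attention vector of length 2 * F_out.
    Returns the new node features and the attention weights per node.
    """
    n = len(H)
    Wh = [matvec(W, h) for h in H]
    out, attn = [], []
    for i in range(n):
        # Masked attention: only neighbors (plus the node itself) compete.
        nbrs = [j for j in range(n) if adj[i][j] or j == i]
        scores = [leaky_relu(sum(c * v for c, v in zip(a, Wh[i] + Wh[j])))
                  for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # stable softmax
        norm = sum(exps)
        alpha = [e / norm for e in exps]
        agg = [sum(al * Wh[j][f] for al, j in zip(alpha, nbrs))
               for f in range(len(W))]
        out.append([elu(v) for v in agg])
        attn.append(dict(zip(nbrs, alpha)))
    return out, attn


# Path graph 0 - 1 - 2 with one feature per node.
H = [[1.0], [3.0], [5.0]]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
out, attn = gat_layer(H, adj, W=[[1.0]], a=[0.0, 0.0])
```

With a zero attention vector, all neighbors receive equal weight and the layer reduces to a plain neighborhood average followed by ELU; a non-zero `a` lets the weights differ per neighbor.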
Consequently, important neighbors are emphasized in the summation (Sha et al., 2020). The proposed GAT-based HSI classification model is shown schematically in Figure 4. The main idea of graph attention-based HSI classification is to refine the graph signal in the spatial-spectral domain via a localized spectral filter, implemented by a GAT composed of two graph attention layers and two fully-connected (FC) layers. After the multilayer perceptron, the output $Z \in \mathbb{R}^{N \times M}$ for the whole graph is obtained, and only the labeled nodes are used for supervision in the cross-entropy loss:

$$\mathcal{L} = -\sum_{s \in \mathcal{Y}_L} \sum_{m=1}^{M} y_{sm} \ln Z_{sm}$$

where $\mathcal{Y}_L$ is the set of labeled training nodes, $y_{sm}$ is the true label of the training data, and $M$ is the number of object classes. A particularity of graph-based learning methods is that the predictions for labeled and unlabeled samples are updated simultaneously until a stable state is reached, so no separate test process is required. Note that "ELU" rather than "ReLU" is used as the activation function for the GNN; the first FC layer has nb_classes*10 units (that is, 10 times the number of classes), while the second FC layer has nb_classes units (that is, the number of classes).
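The idea that only labeled nodes contribute to the cross-entropy loss can be sketched as follows. This is an illustrative example under stated assumptions: the network output is taken as raw scores that are passed through a softmax, labels are integer class indices, and the loss is summed (not averaged) over the labeled nodes. The names `masked_cross_entropy` and `labeled_idx` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one node's class scores."""
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def masked_cross_entropy(scores, labels, labeled_idx):
    """Cross-entropy summed over labeled nodes only.

    scores: N x M class scores for all graph nodes.
    labels: class index per node (entries of unlabeled nodes are ignored).
    labeled_idx: indices of the nodes that carry a training label.
    """
    loss = 0.0
    for s in labeled_idx:
        probs = softmax(scores[s])
        loss -= math.log(probs[labels[s]])
    return loss


# Three nodes, two classes; only node 0 is labeled.
scores = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
labels = [0, None, None]
loss = masked_cross_entropy(scores, labels, labeled_idx=[0])
```

Unlabeled nodes still receive predictions in the forward pass, but they are excluded from the sum, so changing their scores leaves the loss unchanged.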

Environmental setting and data sets
The cloud-based experimental platform used in this study is the "Planetary Computer" built on the Microsoft Azure cloud, a development environment that provides access to its data resources and application programming interfaces (APIs) through open-source tools and allows users to easily scale the experimental analysis using Azure computing power. We used an independent virtual machine with a 4-core CPU, 28 GB of RAM, and an NVIDIA Tesla T4 GPU (graphics processing unit) with 16 GB of graphics memory. Such virtual machines are well suited to deploying artificial intelligence (AI) services, responding to user-generated requests in real time, or using NVIDIA's GRID driver and virtual GPU technology for interactive graphics and visualization workloads.
The Indian Pines (IP) scene was gathered by the 224-band AVIRIS (airborne visible/infrared imaging spectrometer) sensor in the wavelength range of 400 to 2500 nm at a 20 m spatial resolution (that is, 20 meters/pixel, abbreviated as 20 m/p) over northwestern Indiana. The Indian Pines-A (IA) data set (see Figure 5) is a subset of the IP data set; it consists of 86 × 69 pixels and contains 200 spectral reflectance bands after removing the bands covering the water absorption region. The Huanghekou (HH) data set was released by Jiao et al. (2019). It includes 21 classes, and its dominant landscape types comprise water bodies, grassland, forest land, buildings, bare land, etc., which differ from the traditional land cover types. Specifically, the data were acquired by a GF-5 AHSI (advanced hyperspectral imager) sensor with a spatial resolution of 30 m and a spectral range covering VNIR (visible and near-infrared: 390-1029 nm) and SWIR (short-wave infrared: 1005-2513 nm). There are 150 bands in the VNIR range (before excluding band 1) and 180 bands in the SWIR range (before excluding bands 42-53, 96-115, 119-121, 172-173, 175-180, etc.), so 285 bands remain after removing the bad bands from the total of 330 bands. The spectral resolution is 5 nm for VNIR and 10 nm for SWIR. The spatial size of the HH data set is 1185 rows by 1342 columns, and it was captured on November 1, 2018. Note that the ground-truth samples of HH are much more sparsely distributed than those of the IA data set, which is why the scene contains a large amount of white space.

Parameters and training
To facilitate parameter analysis, all parameters are divided into three groups (see Table 1). The first group contains the parameters for training a network, the second group relates to sampling a data set and preparing data for training, and the third group involves extra control parameters for each instance of the experiments. For the classification experiments, the training set of each hyperspectral data set contained 35 samples per category, and 5 independent random runs were performed in total. Training ran for 200 epochs. Note that the listed values are those finally adopted and are not necessarily optimal for achieving the best classification performance.
(a) HH with PCA (the 2nd run) (b) HH with t-SNE (the 1st run)
Figure 6. Accuracy and loss curves of fitting and validating the designed GAT model using the HH data set.

In this study, a fixed number of samples was used for each category (that is, every category has an equal number of training samples). The validation set was drawn in the same way so that it has the same size as the training set. Finally, the training and validation sets were removed from all samples, and the remaining samples were treated as the test set. If a category does not contain enough samples to meet this requirement, its existing samples are simply duplicated until the required number is reached. The robustness of graph-based deep learning models mainly depends on fine-tuning the hyperparameters through iterative optimization.
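The sampling scheme can be sketched in a few lines of Python. This is one possible reading of the description, not the code used in this study: the function name `split_per_class`, the seeded shuffle, and the exact way copies are made for small classes are all assumptions. In particular, when a class has fewer than twice the requested number of samples, the copies may place the same sample in both the training and validation sets.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_per_class, seed=0):
    """Split sample indices into train / validation / test sets.

    Each class contributes n_per_class training and n_per_class
    validation samples; the remaining samples form the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, cls in enumerate(labels):
        by_class[cls].append(idx)

    train, val, test = [], [], []
    for cls in sorted(by_class):
        idx = by_class[cls][:]
        rng.shuffle(idx)
        pool = idx[:]
        # Assumption: classes that are too small are padded with copies.
        while len(pool) < 2 * n_per_class:
            pool.extend(idx)
        cls_train = pool[:n_per_class]
        cls_val = pool[n_per_class:2 * n_per_class]
        used = set(cls_train) | set(cls_val)
        train.extend(cls_train)
        val.extend(cls_val)
        test.extend(i for i in idx if i not in used)
    return train, val, test


# Class 0 has 10 labeled pixels, class 1 only 3.
labels = [0] * 10 + [1] * 3
train, val, test = split_per_class(labels, n_per_class=2)
```

In the experiments described here, `n_per_class` would correspond to 35 and the split would be repeated for each of the 5 random runs, for example with a different seed per run.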
The most representative way to assess this is to plot the accuracy and loss curves during training and validation of the designed graph-based deep learning models. These curves reflect several aspects of a model's robustness: whether it finally converges, whether the convergence is smooth, and roughly how many epochs are required for convergence. Therefore, the accuracy and loss curves of fitting and validating the designed GAT model for 200 epochs on the HH data set are illustrated in this study. Figure 6 shows that the GAT model converges relatively well with both feature dimensionality reduction algorithms (that is, PCA and t-SNE). Note that, for the HH data set, the 2nd run using PCA and the 1st run using t-SNE gave the best classification results among the 5 random runs.

Classification results
(a) IA with PCA (the 5th run)
Experiments on two real hyperspectral data sets (the first, IA, is a small data set, and the second, HH, is a large one) show that the proposed approach achieves promising classification performance in terms of three accuracy metrics: overall accuracy (OA), average accuracy (AA), and Kappa (K). This indicates that combining t-SNE with local spectral graph convolution filtering can indeed improve the ability of the graph-based deep learning model to discriminate the category of every pixel, and confirms the importance of combining spatial and spectral information for hyperspectral remote sensing image classification. The probability maps are essentially heat maps with scale normalization. They show that the graph-based deep learning models produce weaker predictions for pixels located on the boundaries between categories.
(a) IA with PCA (the 5th run) (b) IA with t-SNE (the 1st run)
Figure 8. Predictive maximum probability maps of the presented experiments using the IA data set.
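A maximum probability map of this kind can be derived from per-pixel class probabilities as in the short sketch below. It assumes the classifier output has already been normalized (for example by a softmax) and arranged as a rows × columns grid of probability vectors; the function name `max_probability_map` is hypothetical.

```python
def max_probability_map(prob_cube):
    """Return (confidence map, label map) from per-pixel class probabilities.

    prob_cube[r][c] is the list of class probabilities for pixel (r, c).
    """
    confidence = [[max(p) for p in row] for row in prob_cube]
    # Index of the largest probability; ties resolve to the lowest class index.
    label = [[max(range(len(p)), key=p.__getitem__) for p in row]
             for row in prob_cube]
    return confidence, label


# A 2 x 2 toy scene with two classes; the right column mimics boundary pixels.
probs = [[[0.9, 0.1], [0.5, 0.5]],
         [[0.2, 0.8], [0.45, 0.55]]]
confidence, label = max_probability_map(probs)
```

Pixels whose maximum probability is close to the uniform value (0.5 for two classes) correspond to the weak predictions that tend to appear along class boundaries.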
From the classification maps in Figure 7 (a)-(c), the probability maps in Figure 8, and the confusion matrices in Figure 9, it can be observed that, for the experiments on the IA data set, most misclassifications occur between class 1 (corn-notill) and class 3 (soybean-notill), between class 1 (corn-notill) and class 4 (soybean-mintill), and between class 3 (soybean-notill) and class 4 (soybean-mintill). At the same time, class 2 (grass-trees) remains well separated from the other categories, which indicates small intra-class differences. Because the confusion matrices are provided here, the per-class accuracies are not reported separately in the classification accuracy statistics; in a sense, the two are equivalent. In Figure 9, the class names are as follows: class 1 (corn-notill), class 2 (grass-trees), class 3 (soybean-notill), and class 4 (soybean-mintill).
(a) IA with PCA (the 5th run) (b) IA with t-SNE (the 1st run)
Figure 9. Confusion matrices of the presented experiments using the IA data set.
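The three accuracy metrics used in this section (OA, AA, and Kappa) can all be computed from a confusion matrix. The sketch below uses the standard definitions and assumes that rows hold the true classes and columns the predicted classes; the function name and the toy matrix are illustrative and are not results from this study.

```python
def accuracy_metrics(cm):
    """Overall accuracy, average accuracy, and Cohen's kappa.

    cm[i][j]: number of test samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(map(sum, cm))
    correct = sum(cm[i][i] for i in range(n_classes))
    oa = correct / total

    # Average of the per-class accuracies (classes without samples skipped).
    per_class = [cm[i][i] / sum(cm[i]) for i in range(n_classes) if sum(cm[i]) > 0]
    aa = sum(per_class) / len(per_class)

    # Expected agreement by chance from row and column marginals.
    pe = sum(sum(cm[i]) * sum(row[i] for row in cm)
             for i in range(n_classes)) / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa


# Toy two-class confusion matrix (made-up counts).
oa, aa, kappa = accuracy_metrics([[8, 2], [1, 9]])
```

The per-class accuracies are the diagonal entries divided by the row sums, which is why reporting the confusion matrices carries the same information as listing them separately.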
Because data sets are often not produced under strict conditions, the labeled samples may be few in number or sparsely distributed in geographic space, and the labeled ground-truth areas may even have undergone some generalization. That is, there may be mislabeling at the boundary of, or inside, a certain category. The reliability of the graph-based deep learning model will therefore inevitably be adversely affected. In other words, the difference or variance within an individual class might be relatively small. The reason is that, on the one hand, there might be some dirty (or wrong) labels from the labeling procedure; on the other hand, the definition of categories might not consider differences in materials, or the boundaries between categories might have been generalized. These two factors may decrease inter-class separability. In addition, strictly speaking, some categories might be defined as mixtures of several classes, or large intra-class differences might lead to misclassification to some extent. Table 2 also shows that introducing t-SNE dimensionality reduction improves the feature learning and label prediction of GNNs, because t-SNE can effectively improve the separability among classes in the feature space. Such performance is not guaranteed in deep learning and may be affected by parameter settings and changes in the network structure. Likewise, the number of categories, class separability, and scene complexity all introduce uncertainty when measuring the final classification performance. In terms of running time, because the number of components after feature dimensionality reduction of the raw HSI data is the same, the computational costs are approximately equal for the same data set.
Note that the derived evaluation metrics (that is, OA, AA, and K) are based on the test samples rather than the whole image. In addition, the recorded training and test times do not include the processing time of the t-SNE or PCA transformation, which is why fitting a GAT using t-SNE can appear faster than fitting one using PCA.

CONCLUSIONS
Graph-based deep learning models have been widely used in HSI classification and have attracted increasing attention due to their strong expressive power. In particular, emerging graph representation learning methods and graph neural networks perform well in processing and analyzing graph-structured data. In this study, we proposed a novel approach that combines localized spectral filtering and t-SNE manifold learning with a graph attention mechanism. Although the new HH data set has poor separability between categories, better classification performance could still be obtained. This indicates that class separability might not be a critical issue, and that spatial texture or contextual information might reduce the sensitivity of the classification to spectral differences among land surface materials. Because t-SNE-based feature learning involves a large number of iterative calculations and therefore incurs a huge computational cost, we adopted GPU acceleration to derive the feature data, saved them to storage, and then loaded and resampled them, which effectively avoids excessive memory consumption. In conclusion, with the emergence of more and more intelligent algorithms for HSI information extraction, graph-based deep learning models such as GATs are expected to drive future HSI classification research.