DEEP FEATURE EXTRACTION BASED ON DYNAMIC GRAPH CONVOLUTIONAL NETWORKS FOR ACCELERATED HYPERSPECTRAL IMAGE CLASSIFICATION

Deep learning has achieved impressive results in hyperspectral image (HSI) classification. Supervised convolutional neural networks (CNNs) and semi-supervised graph neural networks (GNNs) are the two main network frameworks. However, 1) supervised CNNs suffer from high time complexity as the number of network layers grows; 2) semi-supervised GNNs suffer from high space complexity due to the computation of adjacency relations. In this paper, a novel dynamic graph convolutional HSI classification method, called dynamic graph convolutional networks (DGCNet), is proposed. We first obtain two classification features by applying flattening and global average pooling to the output of the convolution layers, which fully exploits the spatial-spectral information contained in the hyperspectral data. Then, a dynamic graph convolution module is applied to extract the intrinsic structural information of each patch. Finally, the HSI is classified based on spatial, spectral and structural features. DGCNet uses three branches to process multiple features of the HSI in parallel and is trained in a supervised manner. In addition, DropBlock and label smoothing regularization are applied to further improve the generalization capability of the model. Comparative experiments show that the proposed algorithm is comparable in accuracy with state-of-the-art supervised learning models while being significantly faster.


INTRODUCTION
Hyperspectral imaging technology provides detailed spectral information by sampling hundreds of narrow contiguous spectral bands from the visible region (0.4-0.7 μm) to the short-wave infrared (SWIR) region (0.7-2.4 μm) (Ahmad et al., 2021). Applications based on hyperspectral images (HSIs) have been widely developed, such as agricultural assessment and environmental inspection (Mahesh et al., 2015; Sabbah, 2012). However, the high dimensionality of HSIs fundamentally changes the topological relationships between data units: for example, the data distribution becomes sparse and the spatial centroid lies outside the hypersphere (Jimenez and Landgrebe, 1998). As a result, the usual data processing methods designed for color images cannot be directly applied to HSIs with the desired performance (Rasti et al., 2020). Therefore, a series of processing algorithms, such as target detection, change detection, and unmixing methods, have been developed for HSIs. Among them, the most basic one is HSI classification (HSIC) (Chang, 2021; Chen et al., 2016; Liu et al., 2019; Zhang et al., 2018).
The complex high-dimensional features of hyperspectral images are a challenge for traditional manual feature extraction and classification methods. Over the past few years, a series of deep learning (DL) models have been developed for HSIC. For instance, Chen et al. (2014) were the first to explore the stacked autoencoder (SAE) framework for hyperspectral classification. In that work, however, the SAE can only extract higher-level features from one-dimensional data, ignoring the two-dimensional spatial-spectral joint features. Unlike the SAE, a CNN uses fixed-size image patches for deep feature extraction (Liang and Li, 2016) and can therefore maintain the integrity of spatial information. In (Chen et al., 2016), a spatial-spectral joint 3D-CNN HSIC network was proposed that adopts an end-to-end training model, i.e., it no longer relies on complex data processing. Similarly, (Li et al., 2017) further investigated 3D CNNs for spatial-spectral classification using smaller input image patches of HSIs. However, the classification accuracy of these models decreases as the network becomes deeper.
Moreover, on the one hand, the above methods usually have a high computational complexity and require a lot of time for training (Wang et al., 2018). On the other hand, limited labeled samples are a common issue in the practice of HSI classification, and semi-supervised learning based on graph convolutional networks (GCNs) was found to be a good solution (He et al., 2021). By using GNNs, it is possible to mine the intrinsic connections between different vertices and make full use of the rich spatial and spectral information of the HSI (Kipf and Welling, 2017). For example, in (Dong et al., 2022), a graph attention network, which is a modification of the GCN, was proposed to combine pixel-level and superpixel-level HSI representations and extract salient features of the HSI. To explore the spatial information of HSIs, a context-aware GCN and a multiscale GCN were proposed in (Wan et al., 2019) and (Wan et al., 2020), respectively. Nevertheless, since HSIs are so large, even with superpixel segmentation a semi-supervised GNN still requires a massive amount of computation, which limits its applicability.
To address the problems of the above DL models, namely 1) the high time complexity of CNNs as the number of layers increases and 2) the high space complexity of semi-supervised GNNs, we propose the dynamic graph convolutional networks (DGCNet), as shown in Figure 1. An HSI usually consists of hundreds of very narrow spectral bands. It contains a lot of redundant spectral information, which significantly increases the processing time of HSIs. Therefore, principal component analysis (PCA) is first employed to extract the most informative components of the HSI. Then, two convolutional layers are employed to extract shallow features. Next, three branches extract the spatial-spectral features and the structural features of the HSI. A dynamic GCN, which is a novel network architecture for HSIC, is proposed to maximize the exploitation of the global structural information and further boost performance. The module can be integrated into the head of any DL-based classification framework and trained jointly with it. Finally, we adopt a fusion mechanism to make full use of the three features of the network. In experiments on the rather challenging Indian Pines dataset, we obtain >98% overall accuracy. We show that combining multiple different features is vital for good HSI classification and that the resulting model outperforms standard CNN-based methods by a large margin.
Figure 1. Given an input HSI X, our method first uses two convolution blocks to extract shallow features. Next, the deep spatial-spectral features are extracted by the flattening and global average pooling (GAP) branches. The structural feature is extracted by the dynamic graph convolutional module (Sec. 3.1 & 3.2). Finally, the spatial, spectral and structural features are concatenated and classified by a softmax function to obtain the label Y (Sec. 3.3).

Graph Convolutional Networks
Graph data are commonly found in the real world. However, the development of DL based on graph data is relatively slow due to the sequence disorder and dimensional variability of graph data (Zhou et al., 2020). Bruna et al. (Bruna et al., 2014) first proposed the concept of graph convolution based on traditional CNNs, which enabled the expansion of DL from Euclidean domain to non-Euclidean domain. Since then, graph-based DL methods have flourished along both the spatial domain and spectral domain.
Spectral approaches implement convolution operations on topological graphs with the help of graph theory. Unfortunately, the computational complexity of the graph Fourier transform, which must be used, is $O(n^2)$. Therefore, ChebNet was proposed in (Defferrard et al., 2017), which defines the filter as a Chebyshev polynomial of the diagonal matrix of eigenvalues, to reduce the computational complexity.
Subsequently, the first-order ChebNet (Kipf and Welling, 2017) was proposed to further reduce the computational complexity of GCNs; it is comparable to the top semi-supervised methods in terms of computational efficiency and accuracy. A large number of GCN variants have since been developed on this basis (Zhuang and Ma, 2018; Xu et al., 2019).
Spatial approaches define the convolution operation directly on the connectivity of each node, essentially by continuously aggregating the neighbor information of nodes. Duvenaud et al. (2015) proposed a model where all nodes in a neighborhood share the same kernel weights. GraphSage (Hamilton et al., 2017) implements inductive training and testing by sampling and aggregating the information of neighboring nodes, which allows graph convolution to be easily extended to large graphs. Graph attention networks (Veličković et al., 2018) use the attention mechanism to determine the importance of each neighbor node to the central node when aggregating the neighbor information of a node, avoiding complex matrix operations.

Regularization
Deep neural networks (DNNs) often face severe over-fitting problems, i.e., models show large gaps in accuracy between the training and test sets. Many regularization strategies have been devised to reduce testing errors (He et al., 2019). Dropout (Krizhevsky et al., 2017) temporarily drops units from the network with a certain probability during training; it can be considered a practical bagging method for ensembling a large number of DNNs. Ghiasi et al. (2018) proposed a structured form of Dropout: DropBlock removes contiguous units and forces the remaining units to learn more semantic information, enhancing the generalization ability of the model. SpatialDropout (Tompson et al., 2015) deactivates whole channels of the feature layer in order to avoid an overall change of semantic information. FractalNet (Larsson et al., 2017) proposes a regularization strategy for multi-branch structures, which randomly removes branches in the network to reduce over-fitting.
Training directly on hard labels often results in over-confident models. Boosting labels can efficiently alleviate the over-fitting problem and improve the accuracy and robustness of DNNs. For example, Bootstrapping (Reed et al., 2015) improves the robustness of models by smoothing hard labels in two ways, Bootsoft and Boothard. Inception v3 proposes a label smoothing strategy that updates label vectors with a uniform distribution to avoid over-confidence in the correct labels. Xie et al. (2016) present DisturbLabel, which randomly replaces a part of the labels in a mini-batch with incorrect values. Li et al. use two networks to embed images and labels into a latent space separately and learn the relationship between them by a deep distance measure for correcting the network.

METHOD
Let X denote the hyperspectral data, characterized by the length, width and height of the HSI, and let C denote the total number of categories. Given the dataset, HSIC determines the category to which each pixel belongs by means of a classification model. In this work, we propose the DGCNet, which mainly consists of graph convolutional layers and convolutional layers, divided into three branches. In the following sections, we describe the framework in detail.

Vertex Encoder
As mentioned before, a GCN aggregates the information of neighboring nodes to achieve feature extraction. However, due to the incompatibility of data representations between different network architectures, it is not straightforward to integrate a GCN and a CNN directly. The feature map produced by the CNN therefore first goes through the vertex encoder module to obtain a set of vertex features, each of which describes the content related to a specific label in the input feature map. As shown in Figure 2, this can be expressed as:
$$\mathbf{v}_c = \sum_{i}\sum_{j} m^{c}_{i,j}\,\mathbf{x}'_{i,j}$$
where $m^{c}_{i,j}$ and $\mathbf{x}'_{i,j}$ are the weight of the $c$-th activation map and the feature vector of the feature map at $(i, j)$, respectively. The vertex information generated in this way selectively aggregates class-specific relevant features for subsequent processing.
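As an illustration of the aggregation described above, the following minimal sketch forms one vertex per class as a weighted sum of the feature vectors at all spatial positions, using the weights of the corresponding activation map. The function name `encode_vertices`, the nested-list data layout and the toy values are assumptions made for this example only; how the activation maps themselves are produced is not covered.

```python
def encode_vertices(feature_map, activation_maps):
    """Aggregate class-specific vertices from a feature map.

    feature_map:     H x W x D nested lists (one D-dim feature vector per position).
    activation_maps: C x H x W nested lists (one weight map per class).
    Returns C vertices, each a D-dim list (hypothetical layout for illustration).
    """
    depth = len(feature_map[0][0])
    vertices = []
    for weights in activation_maps:
        vertex = [0.0] * depth
        for i, row in enumerate(feature_map):
            for j, feature in enumerate(row):
                w = weights[i][j]
                # Weighted accumulation of the feature vector at position (i, j).
                for d in range(depth):
                    vertex[d] += w * feature[d]
        vertices.append(vertex)
    return vertices


# Toy 2 x 2 feature map with 2-dim features and two class activation maps.
feature_map = [[[1.0, 0.0], [0.0, 2.0]],
               [[3.0, 1.0], [1.0, 1.0]]]
activation_maps = [
    [[1.0, 0.0], [0.0, 0.0]],   # class 0 attends to position (0, 0) only
    [[0.0, 0.5], [0.0, 0.5]],   # class 1 averages positions (0, 1) and (1, 1)
]
vertices = encode_vertices(feature_map, activation_maps)
```

Each vertex has the same dimensionality as the feature vectors, so the set of vertices can be passed to a graph convolutional layer as a C x D matrix.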

Graph Convolutional Layer
With the vertex representations $V$ obtained in the previous section, we develop a novel dynamic GCN to adaptively extract the structural information of the HSI. Unlike existing semi-supervised GCN-based HSIC methods, the dynamic GCN can be trained in a supervised manner and generates discriminative vectors for the final classification. Specifically, our dynamic GCN consists of two graph convolutional layers, as shown in Figure 2. The adjacency matrix $A_s$ is obtained from the encoded $V$, unlike that of the first layer, which is fixed after training. $A_s$ is dynamically updated by the self-attention mechanism as the input features change. Thus, a different $A_s$ is generated for each patch, which greatly enhances the expressiveness of the model and reduces the risk of over-fitting. In the formulation of this process, $A$ denotes the adjacency matrix, $W_s$ represents the learnable parameters, and the activation function is the Leaky ReLU. Overall, the DGCNet can capture content-aware category dependencies with the help of the dynamic graph convolutional layer.
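To make the idea of an input-dependent adjacency matrix concrete, the sketch below builds the adjacency from the vertices themselves with a simple self-attention step (row-wise softmax of pairwise dot products) and then applies one graph convolution followed by a Leaky ReLU. This is only a sketch under stated assumptions: the dot-product attention, the negative slope of 0.01, and the names `dynamic_graph_conv`, `matmul` and `weight` are illustrative choices, not the exact formulation used in DGCNet.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_relu(x, slope=0.01):
    # slope=0.01 is an assumed value for illustration.
    return x if x > 0 else slope * x


def dynamic_graph_conv(vertices, weight):
    """One graph convolution whose adjacency depends on the input vertices.

    vertices: C x D nested lists, weight: D x D_out nested lists (assumed shapes).
    """
    transposed = [list(col) for col in zip(*vertices)]
    # Assumed self-attention: pairwise similarities, normalized per row.
    adjacency = [softmax(row) for row in matmul(vertices, transposed)]
    hidden = matmul(matmul(adjacency, vertices), weight)
    return adjacency, [[leaky_relu(h) for h in row] for row in hidden]


vertices = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, -1.0], [0.5, 0.5]]
adjacency, output = dynamic_graph_conv(vertices, weight)
```

Because the adjacency is recomputed from the vertices on every call, two different patches yield two different graphs, which is the property the dynamic layer relies on.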

DropBlock Regularization
DropBlock is a simple regularization method similar to Dropout. Assume that Figure 3 (a) is a feature map. As shown in Figure 3 (b), white indicates dropped neurons, while light blue indicates normal neurons. Dropout randomly drops non-contiguous neurons. However, since neighboring neurons usually have similar features, the network can still learn the same information from the vicinity of a dropped activation unit. Figure 3 (c) shows the random contiguous regions dropped by DropBlock. When a whole adjacent area is dropped, the network is forced to focus on learning other features to achieve correct classification and thus exhibits better generalization. DropBlock has two parameters: the drop block size s and the drop probability. Together, these two parameters control the percentage of semantic information that is discarded.
Figure 3. Schematic diagram of the DropBlock module randomly discarding semantic information. A feature map (a) is given. If light blue squares are used to indicate the feature values contained in the feature map and white squares are used to indicate the deactivated feature values, (b) represents the feature map obtained using random deactivation and (c) represents the feature map obtained using semantic deactivation.
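The difference between dropping isolated units and dropping contiguous regions can be sketched as follows. In this minimal example each valid position becomes the center of a dropped s x s block with the given probability, and the surviving activations are rescaled. Treating the drop probability directly as the per-position seed rate, requiring an odd block size, and the function names `dropblock_mask` and `apply_dropblock` are assumptions made for illustration.

```python
import random


def dropblock_mask(height, width, block_size, drop_prob, rng):
    """Return a 0/1 mask in which contiguous block_size x block_size areas are zeroed.

    Assumes an odd block_size so that blocks are centered on the sampled position.
    """
    mask = [[1] * width for _ in range(height)]
    half = block_size // 2
    # Only sample centers whose block fits entirely inside the feature map.
    for i in range(half, height - half):
        for j in range(half, width - half):
            if rng.random() < drop_prob:
                for di in range(-half, half + 1):
                    for dj in range(-half, half + 1):
                        mask[i + di][j + dj] = 0
    return mask


def apply_dropblock(feature_map, mask):
    """Zero the masked units and rescale the rest to keep the mean activation."""
    kept = sum(sum(row) for row in mask)
    total = len(mask) * len(mask[0])
    scale = total / kept if kept else 0.0
    return [[v * m * scale for v, m in zip(frow, mrow)]
            for frow, mrow in zip(feature_map, mask)]


rng = random.Random(0)
mask = dropblock_mask(19, 19, block_size=3, drop_prob=0.01, rng=rng)
features = [[1.0] * 19 for _ in range(19)]
dropped = apply_dropblock(features, mask)
```

With a small patch, even a low drop probability removes a noticeable share of the map, since every sampled center discards a full block.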

Label Smoothing Regularization
HSIC usually employs a cross-entropy loss function. This, however, can cause two problems. First, over-fitting: the neural network drives itself to learn in the direction that maximizes the difference between the correct and incorrect labels. Second, learning incorrect labels: manual labeling inevitably produces errors, and calculating the cross-entropy loss on wrong labels can lead to optimization in the wrong direction.
Label smoothing regularization adds noise through soft one-hot labels, which reduces the weight of the true class in computing the loss function and ultimately suppresses over-fitting. Consider a prior distribution of labels $u(t)$, independent of the training instance $x \in X$, and a label smoothing coefficient $\epsilon$. For a training instance with true label $y \in Y$, the label distribution $q(t \mid x)$ after label smoothing becomes:
$$q(t \mid x) = (1 - \epsilon)\, y_t + \epsilon\, u(t), \qquad u(t) = \frac{1}{C}$$
where $C$ denotes the total number of categories. The result is a mixture of the original ground-truth distribution $y_t$ and the fixed distribution $u(t)$, with weights $1 - \epsilon$ and $\epsilon$, respectively.
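As a small worked example of this mixture, the sketch below smooths a one-hot label with a uniform prior and evaluates the cross-entropy against a predicted distribution. The function names and the example label are hypothetical; the values of 16 classes and a coefficient of 0.01 are the settings reported for the Indian Pines experiments.

```python
import math


def smooth_labels(label, num_classes, epsilon):
    """Mix a one-hot label with a uniform prior: weights (1 - epsilon) and epsilon."""
    uniform = 1.0 / num_classes
    return [(1.0 - epsilon) * (1.0 if t == label else 0.0) + epsilon * uniform
            for t in range(num_classes)]


def cross_entropy(target, probs):
    """Cross-entropy between a (soft) target distribution and predicted probabilities."""
    return -sum(q * math.log(p) for q, p in zip(target, probs) if q > 0.0)


# 16 classes and a smoothing coefficient of 0.01; label 2 is an arbitrary example.
target = smooth_labels(label=2, num_classes=16, epsilon=0.01)
uniform_prediction = [1.0 / 16] * 16
loss = cross_entropy(target, uniform_prediction)
```

With these settings the true class keeps a target of 0.99 + 0.01/16 and every other class receives 0.01/16, so the target still sums to one while never demanding a fully confident prediction.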

Feature Fusion and Classification
To fully explore the information contained in the HSI, we use three branches to extract HSI features from different perspectives. As shown in Figure 1, the features from the three branches are concatenated along the spectral dimension before classification. To optimize the proposed model, we use the cross-entropy loss function.
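The fusion step can be sketched as a concatenation of the three branch outputs followed by a softmax classifier and a cross-entropy loss. The single linear layer, the toy feature values and the names `fuse_and_classify`, `weights` and `bias` are assumptions for illustration; the actual classification head is defined by the network structure of DGCNet.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(spatial, spectral, structural, weights, bias):
    """Concatenate the three branch features and apply an assumed linear layer + softmax."""
    fused = list(spatial) + list(spectral) + list(structural)
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return fused, softmax(logits)


def cross_entropy(probs, label):
    """Cross-entropy loss for a single sample with a hard label."""
    return -math.log(probs[label])


# Toy branch outputs (2 + 1 + 2 features) and a hypothetical 3-class linear head.
fused, probs = fuse_and_classify(
    spatial=[0.2, 0.4], spectral=[1.0], structural=[0.5, 0.1],
    weights=[[1.0, 0.0, 0.5, 0.0, 0.0],
             [0.0, 1.0, 0.0, 1.0, 0.0],
             [0.0, 0.0, 0.0, 0.0, 1.0]],
    bias=[0.0, 0.0, 0.0],
)
loss = cross_entropy(probs, label=1)
```

Because the branches are only joined at this point, each one can be computed in parallel up to the concatenation.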

EXPERIMENTS
In this section, seven supervised classification methods are compared with the proposed DGCNet, including a machine-learning benchmark algorithm, RBF-SVM, and six CNN-based methods. The approaches included in the comparison are summarized as follows.
1. SVM-RBF: An SVM with an RBF kernel, implemented using the scikit-learn package.
4. SSRN: SSRN is a CNN-based spectral-spatial residual network.
5. DFFN: We follow the architecture used in (Song et al., 2018). DFFN is a 2D-CNN that fuses the outputs of different hierarchical layers with the help of a residual structure.
6. HybridSN: HybridSN (Roy et al., 2020) improves the attention of the model to salient information by adding a spatial attention module and a channel attention module.
7. SPRN: SPRN (Zhang et al., 2021) splits the input spectral bands into several non-overlapping contiguous sub-bands and uses cascaded parallel improved residual blocks to extract spectral-spatial features from these sub-bands.
To make a fair comparison, the spatial patch size and the number of spectral dimensions are set to the same values for all DL-based methods, while the SVM takes the serialized original data as input. Using spatial patches incorporates information about the neighbors of the target pixel, which, due to spatial autocorrelation, is beneficial for improving the accuracy of the CNNs. In this work, a classical benchmark dataset, namely the Indian Pines dataset, is used to verify the effectiveness of the proposed algorithm. For each category with more than 100 samples, 100 samples are selected as the training set; otherwise, 15 samples are selected. The remaining samples are used for testing. The division of the data set is shown in Table 1.
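The per-class sampling rule described above can be sketched as follows: classes with more than 100 labeled samples contribute 100 training samples, all others contribute 15, and the rest is kept for testing. The function name `split_indices`, the random shuffling and the toy label list are assumptions for illustration and do not reproduce the exact split in Table 1.

```python
import random


def split_indices(labels, rng, many=100, few=15, threshold=100):
    """Split sample indices into train/test sets with a per-class sampling rule."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)  # random selection within each class (assumed)
        count = many if len(indices) > threshold else few
        train.extend(indices[:count])
        test.extend(indices[count:])
    return train, test


# Toy example: one large class (150 samples) and one small class (40 samples).
labels = [0] * 150 + [1] * 40
train, test = split_indices(labels, random.Random(0))
```

Under this rule small classes are represented by far fewer training samples than large ones, which is one reason the average accuracy is reported alongside the overall accuracy.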

Experimental Settings
The network structure of DGCNet is shown in Table 2. All activation functions in DGCNet are leaky ReLU, and the network is trained using the Adam optimizer with label smoothing. The number of training iterations is 100 and the batch size is 128. The initial learning rate is 0.001, which is reduced to one-tenth at one-half and at five-sixths of the total number of iterations. For DropBlock, we set the drop block size s to 3 and the drop probability to 0.01, because the patch size is relatively small compared with the conventional RGB image size. For label smoothing regularization, since the Indian Pines dataset has 16 classes, we use $u(t) = 1/16$ and a label smoothing coefficient of 0.01. The spatial patch size is set to $19 \times 19$ and the spectral dimension is reduced to 32. All parameter settings in this paper were obtained by referring to existing work and by trial and error. Some of the parameter analyses are given below. Our experiments are implemented with Python-3.8.5 and PyTorch-1.8.1. The environment consists of an i7-10700K CPU with 32 GB of memory and an NVIDIA GTX-2060 graphical processing unit (GPU) with 6 GB of memory.
To alleviate the influence of random factors, all experiments were repeated five times, and the mean values of the overall accuracy (OA), the average accuracy (AA) and the Kappa coefficient are taken as the evaluation indices. Specifically, OA is the ratio of the number of correctly classified test samples to the total number of test samples, while AA is the average of the per-class accuracies over all tested categories. The Kappa coefficient is a statistic widely used to measure agreement in multi-class classification tasks.
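The step-wise learning rate schedule can be written compactly as below. This sketch assumes that the factor of one-tenth is applied cumulatively at each milestone and that the five-sixths milestone is rounded down to iteration 83; the function name `learning_rate` is hypothetical.

```python
def learning_rate(iteration, total=100, base_lr=0.001, factor=0.1):
    """Step schedule: multiply the rate by `factor` at 1/2 and 5/6 of training."""
    milestones = (total // 2, (5 * total) // 6)  # (50, 83) for 100 iterations
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * factor ** drops


schedule = [learning_rate(i) for i in range(100)]
```

In PyTorch, which the experiments use, `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[50, 83]` and `gamma=0.1` would express the same assumed schedule.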
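For reference, the three evaluation indices can be computed from a confusion matrix as in the following sketch. It assumes that rows index the true class and columns the predicted class; the function name and the toy matrix are illustrative only.

```python
def classification_metrics(confusion):
    """Compute OA, AA and Cohen's Kappa from a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j (assumed).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    oa = correct / total

    # Per-class accuracy, skipping classes without test samples.
    per_class = [confusion[i][i] / sum(confusion[i])
                 for i in range(n) if sum(confusion[i]) > 0]
    aa = sum(per_class) / len(per_class)

    # Agreement expected by chance from the row and column marginals.
    expected = sum(sum(confusion[i]) * sum(row[i] for row in confusion)
                   for i in range(n)) / (total * total)
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa


oa, aa, kappa = classification_metrics([[45, 5], [10, 40]])
```

For this toy matrix the overall and average accuracy are both 0.85 and the Kappa coefficient is 0.7, since half of the observed agreement would be expected by chance.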

Parameter Analysis
Deep learning models often contain a very large number of hyperparameters that can significantly affect the performance of a model. A large amount of research has been done on the setting of hyperparameters, but their selection is usually done by trial and error. In this section, the hyperparameters patch size and data dimension are analysed to give an intuitive demonstration of hyperparameter selection for DGCNet.
The patch size controls how much data is input to the network. As the patch size increases, more neighboring information of the target pixel is passed into the DGCNet. In HSIs, neighboring pixels usually exhibit similar spectral values. DGCNet can extract this similar information well and use the information of neighboring pixels to assist classification. As shown in Figure 4, we set the patch size to 15, 17, 19, 21, 23 and 25 to investigate the effect of different patch sizes on the classification results. Interestingly, the effect of the patch size is not very significant: the accuracy of DGCNet on the Indian Pines dataset fluctuates between 97% and 98.5%. This is probably because the dynamic graph convolution module adapts to different sizes of input data. The best performance is achieved with a patch size of 19, which is therefore used in the experiments.
HSIs contain hundreds of spectral dimensions, and each pixel forms a spectral curve. However, this representation is highly redundant, which both increases the cost of storing hyperspectral images and significantly increases their processing time. In this paper, PCA, a simple dimensionality reduction method, is used to extract the principal components of the hyperspectral image, which saves processing time without degrading the classification accuracy. Figure 5 shows the variation of classification accuracy and training time as the dimensionality of the hyperspectral image increases. As the dimension increases, the training time increases linearly from 20 s to 45 s, while the accuracy keeps fluctuating between 97.5% and 98.5%. This may be due to the powerful representation ability of deep learning, which makes the model insensitive to the input dimension. Therefore, as a compromise between accuracy and training time, we choose to reduce the hyperspectral data to 32 dimensions.
Table 3 shows the results of the quantitative evaluation on the Indian Pines dataset, including per-class accuracy, OA, AA, Kappa, training time and inference time. The proposed DGCNet outperforms all the other approaches in most cases. Specifically, the SVM obtains much lower accuracy than the other approaches, showing the importance of spatial information. The accuracy of the 3D-CNN is about 2%-4% lower than that of the other spatial-spectral methods, whose advantage can be attributed to the positive impact of their architecture and specific module design. HybridSN achieves a high AA, which may be due to its fine-grained information extraction and attention mechanism. Furthermore, since the intrinsic structure and spatial-spectral information of the HSI are fully utilized, the classification accuracies obtained by DGCNet are higher than those of the other methods, and the three branches give it the best robustness.

Visualisation
To give a more visual impression, Figure 6 shows the classification maps of all these approaches in a single experiment on the Indian Pines dataset. It can be seen from Figure 6 that the classification maps generated by the DL-based methods are smoother than that of the SVM, whose classification map has serious salt-and-pepper noise. The DGCNet result is both smooth and more consistent with the ground-truth map than those of all compared methods. These observations validate the superiority of our method and the rationality of integrating GCN and CNN modules.

CONCLUSIONS
This study proposes a solution to HSIC that integrates a dynamic GCN with a CNN. The framework uses three branches to process the HSI in parallel and can learn from the global information, achieving faster training and inference. To alleviate the over-fitting caused by the small-sample problem in HSIC, scheduled DropBlock is applied to learn more generalizable features, and label smoothing is applied to reduce the interference of mixed pixels on the classification performance. By combining the advantages of the dynamic GCN and the CNN, the proposed DGCNet can learn features from spatial-spectral and structural information simultaneously. Experiments on the Indian Pines benchmark dataset show that the proposed framework obtains competitive results at a faster speed compared with six other state-of-the-art methods.
In addition, there are several directions to be explored in future work. One is to conduct more experiments, both on further datasets and on the ablation of the model itself, to explore the properties of the DGCNet. Another is to use a disjoint data partitioning approach, which tests the generalization capability of the model. A further possible direction is to formally quantify the contribution of the different feature transformations to the classification results.