SCENE CLASSIFICATION BASED ON THE INTRINSIC MEAN OF LIE GROUP

Remote sensing scene classification aims to identify semantic objects with similar characteristics in high-resolution images. Although existing methods achieve satisfactory performance, the features they use for classification modeling are still limited to vector representations in a Euclidean space. As a result, their models do not robustly reflect the essential characteristics of a scene, which makes it hard to raise classification accuracy further. In this study, we propose a novel scene classification method based on the intrinsic mean on a Lie Group manifold. By introducing Lie Group machine learning into scene classification, the new method uses the geodesic distance on the Lie Group manifold instead of the Euclidean distance, which solves the problem that distances between samples in a non-Euclidean space cannot be computed directly with the Euclidean distance. Experiments show that our method performs well on two public and challenging remote sensing scene datasets, UC Merced and SIRI-WHU.


INTRODUCTION
Remote sensing scene classification refers to distinguishing semantic objects with similar scene characteristics across multiple image categories and assigning them to scene types accordingly. In other words, the remote sensing images in a database are classified according to certain dominant features, which makes the extraction of image features a key to scene classification. Many classical methods for extracting image features have been developed, and they fall mainly into three categories: (1) extracting feature descriptors directly from images, using methods such as the Scale-Invariant Feature Transform (SIFT) (Dellinger et al., 2014, Sedaghat, Ebadi, 2015), Histogram of Oriented Gradients (HOG) (Qi et al., 2015, Kaâniche, Bremond, 2012), and Local Binary Pattern (LBP) (Ren et al., 2016, Xiao et al., 2018); (2) extracting continuous features on the basis of image blocks, using the Bag-of-Visual-Words model (BoVW) (Zhao et al., 2014, Zhu et al., 2016) and sparse matrices (Ye et al., 2014, Zhang et al., 2018); and (3) extracting features by training deep learning models (Yang, Newsam, 2010, Cheriyadat, 2013, Yang et al., 2015, Huang, Yan, 2015).
Each of the three approaches has its advantages and disadvantages. The first is simple and easy to implement, but it captures very little information about the semantic characteristics of the scene. The second achieves higher classification accuracy than the first, but its processing pipeline is more complicated. The third has been developed in recent years; it does not require manually designed feature descriptors, and it classifies scenes very well if the network is well trained. Nevertheless, a deep network model needs a large amount of training data, usually takes a long time to train, and relies on high-end hardware.
In addition to the above feature extraction methods, the construction of scene classifiers is also very important. For example, clustering is a powerful tool for data classification. It partitions a data set into groups so that samples in the same cluster are as similar as possible and samples in different clusters are as dissimilar as possible (Aggarwal, 2014, Kaufman, Rousseeuw, 2009). Clustering methods mainly include probability model-based approaches and non-parametric approaches. Among non-parametric methods, partitional methods are the most commonly used. The k-means algorithm, for example, is one of the earliest and best-known methods and has proved efficient in remote sensing image classification (Pelleg et al., 2000, MacQueen et al., 1967, Jain, 2010, Kanungo et al., 2002).
* Corresponding author (G. Zhu)
Although the k-means model has been widely used in remote sensing classification, it still has shortcomings. First, the k-means method requires the value of K to be set in advance from previous experience, and this value largely determines the quality of the subsequent classification results. Second, and more importantly, the model relies on the hypothesis that all training samples are distributed in a classic Euclidean space. However, when the samples lie in a manifold space, the Euclidean distance cannot accurately reflect the real distance between samples, as shown in Figure 1.
To solve these problems, this study proposes a novel remote sensing scene classification method based on the intrinsic mean of Lie Groups. In a Lie Group, means are of two types: intrinsic and extrinsic. Because the intrinsic mean better reflects what the samples of a category have in common, an unknown sample will be closer to the intrinsic mean of its own category than to those of the other categories, and it is therefore considered most likely to belong to that category. To this end, the sample set is mapped to the Lie Group manifold space category by category, and the intrinsic mean of each category in the Lie Group manifold space is calculated. For an unknown sample, only the geodesic distance from the sample to the intrinsic mean of each category is calculated, and the sample is assigned to the category whose intrinsic mean is at the shortest distance.
The major contributions of this paper are as follows:
1. We propose a strategy to classify Lie Group samples in the Lie Group manifold space, and implement the intrinsic mean classification algorithm within the Lie Group according to this strategy.
2. We demonstrate that the intrinsic mean within a Lie Group better reflects what the samples of a category have in common. When the samples are projected onto the manifold space of the Lie Group, the approach has clear advantages such as few parameters and ease of interpretation.
3. Because Lie Group samples do not belong to a vector space, the Euclidean distance cannot be used as a classification metric. Therefore, we propose a novel distance calculation based on geodesics in the manifold space, which better reflects the spatial distance between samples and gives a better classification result.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume V-3-2020, 2020 XXIV ISPRS Congress (2020 edition)
The rest of this paper is organized as follows. Section 2 presents the proposed scene classification method based on the intrinsic mean of Lie Group. In Section 3, the experimental results are provided. Finally, the conclusion is given in Section 4.

Definition of Matrix Lie Group
Let $M_{m,n}(K)$ be the set of $m \times n$ matrices whose elements belong to $K$, where $K$ is a (commutative) field; we write $M_n(K)$ for the square case $m = n$. In most cases, $K = \mathbb{R}$ (the real numbers) or $K = \mathbb{C}$ (the complex numbers). In this study, $K$ is restricted to $\mathbb{R}$. $A_{ij}$ or $a_{ij}$ denotes the element in the $i$-th row and $j$-th column of an $m \times n$ matrix $A$, so $A = [a_{ij}]$. In matrix theory, the determinant of a square matrix is a mapping $\det : M_n(K) \to K$ with the following properties:
1. for any $A, B \in M_n(K)$, $\det(AB) = \det A \, \det B$;
2. $\det(I_n) = 1$.

A matrix $A \in M_n(K)$ is invertible if and only if $\det A \neq 0$.

There exist two very important matrix groups, namely the general linear group
$$GL_n(K) = \{ A \in M_n(K) : \det A \neq 0 \}$$
and the special linear group
$$SL_n(K) = \{ A \in M_n(K) : \det A = 1 \}.$$
The sets $GL_n(K)$ and $SL_n(K)$ each form a group under matrix multiplication. Further, $SL_n(K)$ is a subgroup of $GL_n(K)$. In this study, the matrix groups applied later are all subgroups of $GL_n(\mathbb{R})$.
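The group properties above can be checked numerically. The following is a minimal sketch for the $2 \times 2$ real case only; the helper names (`det2`, `matmul2`, `in_gl2`, `in_sl2`) and the tolerance are illustrative assumptions, not part of the original method.

```python
def det2(a):
    """Determinant of a 2x2 matrix given as nested lists."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def in_gl2(a, tol=1e-12):
    """Membership in GL_2(R): the determinant is nonzero (up to tol)."""
    return abs(det2(a)) > tol


def in_sl2(a, tol=1e-12):
    """Membership in SL_2(R): the determinant equals one (up to tol)."""
    return abs(det2(a) - 1.0) <= tol


A = [[2.0, 1.0], [1.0, 1.0]]   # det = 1, so A lies in SL_2(R)
B = [[3.0, 0.0], [0.0, 0.5]]   # det = 1.5, in GL_2(R) but not in SL_2(R)
AB = matmul2(A, B)             # det(AB) should equal det(A) * det(B)
```

Because the determinant is multiplicative, the product of two matrices with nonzero determinant again has nonzero determinant, which is why $GL_n$ is closed under multiplication.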

Intrinsic mean of Lie Group for remote sensing image scene classification
The flowchart of the proposed intrinsic-mean method for remote sensing scene classification is shown in Figure 2. The classification process is divided into three steps: 1) mapping the sample data set to a Lie Group manifold space, 2) calculating the intrinsic mean of each category in the Lie Group manifold space, and 3) calculating the geodesic distance between the unknown sample and the intrinsic mean of each category, from which the category of the unknown sample is finally determined.

Mapping operations
The training sample set is mapped to the Lie Group manifold space category by category, which yields the Lie Group sample data set on the manifold. Accordingly, we get $\{M_{ij} \mid i = 1, 2, \cdots, c;\ j = 1, 2, \cdots, n_i\}$, where $M_{ij}$ represents the $j$-th sample of the $i$-th category in the training set, $n_i$ represents the number of training samples in the $i$-th category, and $c$ is the total number of categories. The sample obtained by mapping $M_{ij}$ to the Lie Group manifold space is denoted $x_{ij}$.

Calculating intrinsic mean in the sample of Lie Group
As shown in Figure 1-(a), the intrinsic mean $\mu$ lies on the manifold $S^1$. The distance from each data point to $\mu$ is the geodesic distance (arc length) on $S^1$, and the intrinsic mean obtained is also on $S^1$. Figure 1-(b) shows the mean obtained by direct calculation with the Euclidean distance. The $\mu$ obtained in this way is called the extrinsic mean, and it is clearly not on the manifold, as $\mu$ in Figure 1-(b) shows.

The mean of a set of $n$ data points $\{x_1, x_2, \cdots, x_n\} \subseteq \mathbb{R}^d$, where each sample $x_i$ can be a vector or a matrix, minimizes the sum of squared Euclidean distances to the points, that is, $\mu$ can be expressed as:
$$\mu = \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} \| x - x_i \|^2 \tag{4}$$
In general, a $d$-dimensional manifold $M^d$ is not a vector space, so the Euclidean distance cannot be used to represent the distance between two points on the manifold. Therefore equation (4) cannot be applied to $M^d$ directly. One remedy is to embed $M^d$ in a Euclidean space through a mapping $\Phi : M^d \to \mathbb{R}^d$ and to compute the mean $\mu_\Phi$ of the embedded points there. At the same time, we need to define another projection mapping $\pi : \mathbb{R}^d \to M^d$, $\mu = \pi(\mu_\Phi)$, which maps $\mu_\Phi$ back to $M^d$ and eliminates $\mu_\Phi$. This gives:
$$\mu = \pi\left( \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} \| x - \Phi(x_i) \|^2 \right) \tag{5}$$
In practical applications, it is difficult to find two suitable mappings at the same time, and there need not be a single $\Phi$ that embeds all of $M^d$ in $\mathbb{R}^d$. A more reasonable solution is to use the Riemannian distance on $M^d$ instead of the function $\Phi$ in equation (5), so that the mean $x \in M^d$ minimizes the sum of squared geodesic distances to the points $\{x_i\}_{i=1}^{n} \subseteq M^d$. On a compact manifold, the intrinsic mean of a collection of $n$ data points $\{x_i\}_{i=1}^{n}$ can be defined as:
$$\mu = \arg\min_{x \in M^d} \sum_{i=1}^{n} d(x, x_i)^2 \tag{6}$$
where $d(\cdot, \cdot)$ represents the geodesic distance between the two points in parentheses.
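The difference between the two means on $S^1$ can be illustrated with a small numerical sketch. It assumes points on the unit circle parameterized by angles in radians, and the function names (`extrinsic_mean`, `intrinsic_mean`, `wrap`) and the sample angles are hypothetical choices for illustration. The extrinsic mean averages the points as vectors in the plane, while the intrinsic mean is found by a simple fixed-point iteration on the angle.

```python
import math


def extrinsic_mean(angles):
    """Average the points as vectors in R^2 (generally lands inside the circle)."""
    n = len(angles)
    x = sum(math.cos(a) for a in angles) / n
    y = sum(math.sin(a) for a in angles) / n
    return x, y


def wrap(a):
    """Wrap an angle difference into the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def intrinsic_mean(angles, iters=50):
    """Minimize the sum of squared arc lengths by fixed-point iteration."""
    mu = angles[0]
    for _ in range(iters):
        # Average the signed arc lengths from mu to every point, then move mu.
        mu = wrap(mu + sum(wrap(a - mu) for a in angles) / len(angles))
    return mu


angles = [0.2, 1.0, 1.8]                 # three sample points on S^1
ex = extrinsic_mean(angles)
mu = intrinsic_mean(angles)
radius_of_extrinsic = math.hypot(*ex)    # distance of the extrinsic mean from the origin
```

For these angles the extrinsic mean has radius below 1, so it is not a point of $S^1$, whereas the intrinsic mean is an angle and therefore stays on the circle by construction.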
The distance between any two points $x$ and $y$ in the Lie Group is given by $d(x, y) = \| \log(x^{-1} y) \|$. Substituting it into equation (6) gives:
$$\mu = \arg\min_{x \in G} \sum_{i=1}^{n} \| \log(x^{-1} x_i) \|^2 \tag{8}$$
where $G$ represents the Lie Group manifold.
According to the first-order BCH (Baker-Campbell-Hausdorff) formula, the term $\log(x^{-1} x_i)$ on the right side of equation (8) can be approximated as:
$$\log(x^{-1} x_i) \approx \log(x_i) - \log(x) \tag{9}$$
where $\log(x)$ and $\log(x_i)$ are the Lie algebra elements of the points $x$ and $x_i$ in the Lie Group, respectively. More accurate results can be obtained by keeping the first $n$ orders of the expansion, where $n$ can be chosen experimentally. The higher-order terms are expensive to compute but contribute very little to the result, so they can be ignored. Substituting the first-order approximation, the intrinsic mean equation becomes:
$$\mu = \arg\min_{x \in G} \sum_{i=1}^{n} \| \log(x_i) - \log(x) \|^2 \tag{10}$$
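As a worked step that the text does not spell out, the first-order objective is an ordinary least-squares problem in the Lie algebra, so its minimizer has a closed form. The shorthand $X = \log(x)$, $X_i = \log(x_i)$ and the function $g$ are introduced here only for this derivation.

```latex
\begin{aligned}
g(X) &= \sum_{i=1}^{n} \| X_i - X \|^2, \\
\nabla g(X) &= -2 \sum_{i=1}^{n} (X_i - X) = 0
  \;\Longrightarrow\; X = \frac{1}{n} \sum_{i=1}^{n} X_i, \\
\mu &\approx \exp\!\left( \frac{1}{n} \sum_{i=1}^{n} \log(x_i) \right).
\end{aligned}
```

On the commutative group $\mathbb{R}^{+}$ under multiplication the approximation is exact, and the expression reduces to the geometric mean $\left( \prod_{i=1}^{n} x_i \right)^{1/n}$.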

Algorithm
Find the $x \in G$ that satisfies equations (9) and (10). In (Fletcher et al., 2003), a gradient descent procedure for solving $\mu$ is given, with the objective function $f(x) = \frac{1}{2n} \sum_{i=1}^{n} d(x, x_i)^2$. The gradient of $f$ is $\nabla f(x) = -\frac{1}{n} \sum_{i=1}^{n} \log(x^{-1} x_i)$. If the estimate of the intrinsic mean at iteration $k$ is $\mu_k$, the update for iteration $k + 1$ is $\mu_{k+1} = \mu_k \exp\left( \frac{\tau}{n} \sum_{i=1}^{n} \log(\mu_k^{-1} x_i) \right)$, where $\tau$ is the step length. The specific algorithm for solving the intrinsic mean in a Lie Group is given in Algorithm 1.
Algorithm 1 The intrinsic mean algorithm for c categories
Input: $\{x_{ij}\} \subseteq G$, $i = 1, 2, \cdots, c$; $j = 1, 2, \cdots, n_i$, where $x_{ij}$ represents the $j$-th sample in the $i$-th class distributed on the Lie Group $G$, $n_i$ represents the number of training samples in the $i$-th category, and there are $c$ classes in total.
Output: $\mu_i$, $i = 1, 2, \cdots, c$, the intrinsic mean for each category.
    i = 1
    Do
        µ = µ0 (initial estimate), k = 0
        Do
            ∆µ = (τ / n_i) Σ_j log(µ^{-1} x_ij)
            µ = µ exp(∆µ)
            k = k + 1
        While ||∆µ|| > ξ and k < MaxIters
        µ_i = µ
        i = i + 1
    While i ≤ c

Gradient descent converges only locally, so the $\mu$ it finds is not necessarily the global optimum. In practice, better results can be obtained by adjusting the initial estimate $\mu_0$ and the step $\tau$. The choice of $\tau$ depends on the manifold structure of the Lie Group $G$. It is demonstrated in (Buss, Fillmore, 2001) that $\tau = 1$ is appropriate for spherical manifolds. When the manifold structure of $G$ is a vector space, gradient descent with $\tau = 1$ is equivalent to the linear average; for $\mathbb{R}^{+}$, $\mathbb{R}^{d}$, and $SO(3)$, $\tau = 1$ is equivalent to the geometric average, and Algorithm 1 can converge in a single iteration. For Lie Groups in general, if Algorithm 1 does not converge with $\tau = 1$, $\tau$ can be set to a smaller positive number. As is characteristic of gradient descent, a value of $\tau$ that is too large may overshoot the extremum, while a value that is too small makes convergence too slow.
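The gradient descent iteration can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses unit quaternions (a Lie group that double-covers $SO(3)$) rather than the matrix groups of this study, takes the first sample as the initial estimate $\mu_0$, and all function names (`q_mul`, `q_inv`, `q_exp`, `q_log`, `intrinsic_mean`) and sample values are hypothetical.

```python
import math


def q_mul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def q_inv(q):
    """Inverse of a unit quaternion, which is its conjugate."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def q_exp(v):
    """Exponential map: Lie algebra vector (3-tuple) -> unit quaternion."""
    theta = math.sqrt(sum(c * c for c in v))
    if theta < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(theta) / theta
    return (math.cos(theta), s * v[0], s * v[1], s * v[2])


def q_log(q):
    """Logarithm map: unit quaternion -> Lie algebra vector (3-tuple)."""
    w, x, y, z = q
    if w < 0:  # q and -q encode the same rotation; use the shorter path
        w, x, y, z = -w, -x, -y, -z
    n = math.sqrt(x * x + y * y + z * z)
    if n < 1e-12:
        return (0.0, 0.0, 0.0)
    theta = math.atan2(n, w)
    return (theta * x / n, theta * y / n, theta * z / n)


def intrinsic_mean(samples, tau=1.0, xi=1e-10, max_iters=100):
    """Gradient descent: mu <- mu * exp(tau/n * sum_i log(mu^-1 x_i))."""
    mu = samples[0]  # assumed initial estimate mu_0
    n = len(samples)
    for _ in range(max_iters):
        logs = [q_log(q_mul(q_inv(mu), x)) for x in samples]
        delta = tuple(tau * sum(l[k] for l in logs) / n for k in range(3))
        mu = q_mul(mu, q_exp(delta))
        if math.sqrt(sum(c * c for c in delta)) <= xi:
            break
    return mu


samples = [
    q_exp((0.3, 0.0, 0.0)),
    q_exp((0.0, 0.2, 0.0)),
    q_exp((0.0, 0.0, 0.1)),
    q_exp((0.1, 0.1, 0.0)),
]
mu = intrinsic_mean(samples)
```

Calling `intrinsic_mean` once per category, for instance `{label: intrinsic_mean(xs) for label, xs in data.items()}` with a hypothetical `data` dictionary, would mirror the outer loop over the $c$ classes.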

Classification
Because the intrinsic mean better reflects what the samples of a category have in common, if an unknown sample is closer to the intrinsic mean of one category than to those of the other categories, it can be concluded that the unknown sample most likely belongs to that closest category. Therefore, a Lie-mean algorithm is designed based on the intrinsic mean: it compares the manifold distances between the unknown sample and the intrinsic means of the trained Lie Group samples, and assigns the unknown sample to the category with the shortest distance. The specific discriminant equation is:
$$\text{class}(x) = \arg\min_{i \in \{1, \cdots, c\}} d(x, \mu_i) = \arg\min_{i \in \{1, \cdots, c\}} \| \log(\mu_i^{-1} x) \|$$
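The nearest-intrinsic-mean rule can be illustrated on the simplest manifold, the circle $S^1$. The sketch below assumes samples and class means given as angles in radians; the class labels, the mean values, and the function names (`geodesic_distance`, `classify`) are hypothetical. For contrast it also computes a naive decision that differences the angle coordinates directly and ignores the manifold structure.

```python
import math


def geodesic_distance(a, b):
    """Arc length between two points on the unit circle given by angles."""
    return abs(math.atan2(math.sin(a - b), math.cos(a - b)))


def classify(x, class_means):
    """Return the label whose intrinsic mean is geodesically closest to x."""
    return min(class_means, key=lambda label: geodesic_distance(x, class_means[label]))


class_means = {"A": 0.05, "B": 5.0}   # hypothetical per-class intrinsic means
x = 6.2                                # unknown sample, just short of a full turn

label = classify(x, class_means)
# Naive alternative: difference of raw angle coordinates, ignoring wrap-around.
naive = min(class_means, key=lambda k: abs(x - class_means[k]))
```

Here the geodesic rule assigns the sample to class A, because 6.2 rad is only about 0.13 rad away from 0.05 rad along the circle, while the naive coordinate difference picks class B.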

Experiment setup
To evaluate the performance of the proposed method, we selected two datasets, i.e., the UC Merced dataset and the SIRI-WHU dataset. Each dataset was randomly split into 75% for training and 25% for testing. The experiment on each dataset was repeated five times, and the average classification accuracy was recorded.
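The evaluation protocol can be sketched as a repeated random hold-out. This is an assumption-laden illustration: the text does not say whether the split is stratified by class, so a plain shuffle is used, and the names `split_75_25`, `repeated_holdout`, and the `evaluate` callback (which should train on the first list and return the accuracy on the second) are hypothetical.

```python
import random


def split_75_25(items, rng):
    """Shuffle a copy of items and split it into 75% train and 25% test."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(round(0.75 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def repeated_holdout(items, evaluate, repeats=5, seed=0):
    """Average the score returned by evaluate(train, test) over several random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train, test = split_75_25(items, rng)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

With the 2,100 UC Merced images this split gives 1,575 training and 525 test images per run, and with the 2,400 SIRI-WHU images it gives 1,800 and 600.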

Experiment on UC Merced dataset
The 21-class UCM land-use dataset (Yang, Newsam, 2012) was extracted from large optical images provided by the U.S. Geological Survey. The UCM dataset covers all typical regions of the United States and includes 21 scene categories, each with 100 scene images. Each scene image consists of 256 × 256 pixels with a spatial resolution of 1 foot per pixel, as shown in Figure 3.
The scene classification results are reported in Table 1. Our results are compared with those of previous scene classification methods, namely BoVW, pLSA, LDA, SPM+SIFT, SIFT+SC (Cheriyadat, 2013), SPCK++ (Yang, Newsam, 2011), S-UTF (Zhang et al., 2014), GBRCN, SAL-PTM, CCNN, SRSCNN-NV (Liu et al., 2016), and SRSCNN. For brevity, CCNN denotes a CNN without random-scale stretching and voting, and SRSCNN-NV denotes a CNN with random-scale stretching but without voting. Table 1 shows that our proposed method performs best among all compared methods, with an overall classification accuracy of 96.71%, which is on average 24.66% and 1.61% higher than BoVW and SRSCNN, respectively. Figure 4 shows the confusion matrix for the UC Merced dataset, which indicates that the proportion of scenes misclassified between categories is relatively small.
From Figure 4, we can see that our proposed method identifies most of the scene categories well, the exceptions being dense residential, freeway, and medium residential. The major confusion is between medium residential and dense residential, because the intrinsic means of the Lie Group samples of these two categories are very close.
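For readers who want to relate a confusion matrix to the reported overall accuracy, the following sketch shows the arithmetic. The 3-class matrix `toy` contains made-up numbers purely for illustration and is not taken from Table 1 or Figure 4; the function names are likewise hypothetical.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_accuracy(cm):
    """Diagonal entry divided by the row sum, for each true class."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


# Toy confusion matrix (rows: true class, columns: predicted class).
toy = [
    [24, 1, 0],
    [3, 21, 1],
    [0, 0, 25],
]
oa = overall_accuracy(toy)
pca = per_class_accuracy(toy)
```

A low per-class value together with a large off-diagonal entry, as in the second row of the toy matrix, is the pattern that signals confusion between two specific categories.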

Experiment on SIRI-WHU dataset
The SIRI-WHU dataset (Zhao et al., 2015) consists of 2,400 scene sample images mainly collected from China (see Figure 5). The dataset includes 12 classes, and each scene class contains 200 images with a spatial resolution of 2 meters and a size of 200 × 200 pixels. The twelve categories cover agriculture, commerce, ports, idle land, industry, grassland, overpasses, parks, ponds, residential buildings, rivers, and water.
Table 2 compares the scene classification results of some common methods on the SIRI-WHU dataset. Compared with the state-of-the-art methods, the proposed method has the highest accuracy, with an overall classification accuracy of 97.16%, which is on average 0.28%, 2.06%, 3.74%, and 3.76% higher than VGG-VD16 (Hu et al., 2015), VGG-M(IFK) (Hu et al., 2015), Caffe features (Penatti et al., 2015), and SRSCNN, respectively. The proposed method is also competitive in terms of its low computational cost. Figure 6 shows the confusion matrix for the SIRI-WHU dataset. As with the confusion matrix derived from the UC Merced results, it shows that the method recognizes the correct scenes with a very low misclassification rate.

From Figure 6, the classification accuracy of our proposed method reaches 97%. Among the misclassified scenes, industrial scenes are not classified well, and some river scene images are misclassified as water, most likely because river scene images contain a large amount of water.

Comparison between two results
The UC Merced dataset contains 21 categories, and the SIRI-WHU dataset contains 12 categories. For the UC Merced dataset, the average accuracy of our method exceeds 96%, while for the SIRI-WHU dataset, the average accuracy exceeds 97% and the classification accuracy of individual categories reaches 98%, which is a clear advantage. From the experiments, we find that the classification accuracy is higher when the categories in a dataset differ strongly from one another and lower when they do not. For example, in the UC Merced dataset, the classification accuracy of dense residential and medium residential is lower than that of other categories, and the two are easily confused. After further calculation and analysis of the intrinsic means of these two categories, we found that the difference between them is very small, although both differ considerably from the other categories, which is the root cause of the confusion. In addition, the proposed method is designed for the Lie Group manifold space. Therefore, only when the data samples meet the requirements of the manifold space does the method show its advantages and achieve higher classification accuracy.

CONCLUSION
In this study, we proposed a high-resolution remote sensing image scene classification method based on the intrinsic mean of the Lie Group. Compared with existing methods, which have complex network structures, many parameters, and complex calculations, this method has the advantages of high accuracy, good computing performance, and low feature dimensionality. In addition, the method solves the problem that distances between samples in a non-Euclidean space cannot be calculated with the Euclidean distance. Finally, the experiments on UCM and the Google Dataset of SIRI-WHU demonstrate the effectiveness of the proposed method.
In the future, we will continue to study the scene classification method based on Lie Group machine learning to further improve the accuracy of the classification and maintain high computational performance.