ON-BOARD GCPS MATCHING WITH IMPROVED TRIPLET LOSS FUNCTION

Intelligent remote sensing satellite systems are an important direction for solving the problem of on-board intelligent processing, enabling real-time on-board processing of important targets. The accuracy of geometric positioning information is the basis for subsequent intelligent processing. Therefore, this paper corrects the positioning information by on-board GCPs (Ground Control Points) matching. Considering the limited storage and computing performance of satellites, this paper designs a lightweight GCPs deep feature extraction convolutional neural network based on MobileNetV2 as the feature extraction model, and trains this network with an improved triplet loss function. The Songshan calibration field images constructed by Wuhan University were used as the GCPs images, and 30,399 image patches were extracted and embedded as the GCPs feature library. The GCPs library is 15.3M in size, and the lightweight depth feature extraction model is 9.83M, so both can be pre-stored on the satellite for positioning by on-board GCPs matching. In addition, this paper tested feature extraction performance on an embedded device, the Nvidia Jetson Xavier, which simulates the performance of the device on the satellite. In the Xavier 30W maximum power consumption mode, a single frame takes 0.005 seconds; in the 15W mode, 0.009 seconds; and in the 10W mode, 0.018 seconds, which can meet the performance requirements on the satellite. The experiments in this paper also show that the positioning accuracy is within 30 meters. The work done in this paper will be experimented on the upcoming Luojia-3-01 intelligent remote sensing satellite.


INTRODUCTION
The dramatic increase in satellite data not only provides a rich source for subsequent processing and services, but also puts pressure on satellite-ground data transmission links and on ground processing and storage systems. Especially in time-sensitive applications, the images taken on-board cannot be provided to users in real time. Intelligent satellites can extract and distribute effective information on-board in real time. Therefore, there is an urgent need to research on-board processing technology.
Since the 1990s, intelligent remote sensing satellite on-board processing technology has been studied by researchers. Table 1 (Application of on-board processing of remote sensing satellites) lists the on-board processing of recent years (Hayden et al., 2004; Straight et al., 2010). DSP (Digital Signal Processing) and FPGA (Field Programmable Gate Array) chips were the main processors in these satellites. However, with the rapid development of software and hardware in recent years, the ARM + GPU architecture has gradually been tried for on-board processing. The upcoming Luojia-3-01 satellite (Wang Mi, 2019), jointly designed by DFH Satellite Co. and Wuhan University, will support this mode. Compared with FPGA and DSP, the ARM + GPU mode has better portability and developability, making it easy to transplant ground processing algorithms to the satellite. However, limitations of storage space and performance are always the bottleneck of on-board processing. Therefore, remote sensing image processing algorithms need to be developed for this new architecture. Control point matching, as the basis of subsequent high-precision processing, needs more in-depth research.

RELATED WORKS
Since AlexNet (Krizhevsky et al., 2012) achieved great success in image processing with deep convolutional networks, deep learning methods have been widely used in the field of image processing. They have also been applied successfully to remote sensing image processing (Ma et al., 2019), such as classification (Cai et al., 2018; Gong et al., 2017; Hamida et al., 2018), object detection (Dong et al., 2019; Vetrivel et al., 2018; Yu et al., 2016) and segmentation (Kemker et al., 2018). A Siamese structure was used for matching two sets of patches (Zagoruyko and Komodakis, 2015) by training the distances between matched and non-matched pairs respectively, which provided a new idea for image matching with deep learning. The triplet loss function (Schroff et al., 2015) trains the matched and non-matched pairs at once for face recognition. The triplet loss function was later improved for better performance (Cheng et al., 2016), which shows that the triplet structure has good scalability. Compared with the triplet structure, the Siamese structure works toward a secondary objective: driving the distance between matched pairs as close to 0 as possible (Vo and Hays, 2016). For remote sensing images, images from different locations may cover the same area, and such images are likely to be considered to come from the same location, so this second objective is useful for GCPs matching. Therefore, this paper proposes a new improved triplet loss function for on-board remote sensing image matching that combines the advantages of the Siamese and triplet structures. In addition, as deep neural networks grow deeper, their space and performance requirements gradually increase, which causes great difficulties for embedded and mobile devices. For example, devices on an intelligent satellite cannot meet the needs of most deep networks.
Therefore, many lightweight networks have been proposed for embedded devices with limited performance (Howard et al., 2019; Howard et al., 2017; Ma et al., 2018; Sandler et al., 2018). MobileNet is one of the better performers among lightweight deep convolutional networks. In summary, due to the limited performance and storage space of on-board equipment, it is difficult to store a traditional GCPs library and to match with complex deep convolutional networks on-board. Therefore, this paper uses a lightweight feature extraction model trained with an improved triplet loss function to embed GCPs image patches into a d-dimensional space, stored on-board as the GCPs library, and extracts features of images taken on-board with the same model to match against the GCPs library for positioning. The work will be experimented on the Luojia-3-01 intelligent remote sensing satellite.

METHODS
In this section, the method of this article is introduced. The first part is the overall framework, the second part is the lightweight feature extraction network, the third part is the improved triplet loss function, and the fourth part is the on-board positioning for satellite based on GCPs matching.

The Overall Framework
As shown in Figure 1, image patches are transformed into the feature space by a shared feature extraction model, which is optimized with the improved triplet loss function. Namely, this paper strives for an embedding f(x) from an image x into a feature space ℝ^d, such that the squared distance between a GCPs image and a target image of the same position, independent of imaging conditions, is small, whereas the squared distance between a pair of GCPs images from different positions is large. The GCPs images are embedded into the feature space as the GCPs library. Ref (reference: a GCPs image) and Pos (positive: an image from a different source at the same location as the GCPs) are images of different sources at corresponding positions, and Neg (negative) is an image at a different position. Ref and Pos form a positive pair, and Ref and Neg form a negative pair. This paper also defines a second negative pair, formed by Pos and Neg. The traditional triplet loss function is optimized to make the feature distance of the positive pair smaller than that of the negative pair. The improved triplet loss function is optimized to make the feature distance of the positive pair approach 0, and to make the feature distances of the negative pair and the second negative pair larger than that of the positive pair. Because a positive pair consists of images from different sources at the same location, their two features should be as similar as possible, while a negative pair belongs to different locations, so their features should be different. In conclusion, as shown in Figure 2, after learning with the improved triplet loss function, image features extracted by the shared network at the same location tend to be similar, and image features at different locations tend to be different.

Lightweight Feature Extraction Network
Traditional FPGA and DSP processing cores are difficult to meet the needs of on-board intelligent processing. In recent years, embedded devices based on ARM + GPU structures have been tried for on-board intelligent processing, such as the upcoming launch of Luojia-3-01, which has on-board processing capabilities with ARM + GPU structures. Deep convolutional networks have proven to be powerful in the field of image processing, but they consume too much computing resources.
There are also rich requirements on embedded mobile terminals, so many lightweight frameworks have been proposed (Howard et al., 2019; Howard et al., 2017; Ma et al., 2018; Sandler et al., 2018), and the MobileNets (Howard et al., 2019; Howard et al., 2017; Sandler et al., 2018) are among the best of them. In this paper, MobileNetV2 (Sandler et al., 2018) was selected as the basic feature extraction network. The key to MobileNetV2's light weight is the depthwise separable convolution. It is a form of factorized convolution which factorizes a standard convolution into a depthwise convolution and a 1 × 1 convolution called a pointwise convolution (Howard et al., 2017) (feature extraction model in Figure 1). In addition, it adopts the idea of ResNet residuals (He et al., 2016; Xie et al., 2016) in the form of inverted residuals (Sandler et al., 2018) to improve accuracy. Therefore, the accuracy can still meet the requirements in the process of lightweighting.
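The computational saving from the depthwise separable factorization can be checked with simple arithmetic. The layer shape below (3 × 3 kernel, 32 → 64 channels, 112 × 112 feature map) is purely illustrative and not taken from the MobileNetV2 configuration used in this paper:

```python
def conv_costs(dk, m, n, df):
    """Multiply-accumulate counts for one layer (after Howard et al., 2017):
    a standard convolution vs. its depthwise separable factorization.
    dk: kernel size, m: input channels, n: output channels, df: output size."""
    standard = dk * dk * m * n * df * df                 # full convolution
    separable = dk * dk * m * df * df + m * n * df * df  # depthwise + 1x1 pointwise
    return standard, separable

# Illustrative layer: 3x3 kernel, 32 -> 64 channels, 112x112 feature map.
std, sep = conv_costs(3, 32, 64, 112)
ratio = sep / std  # equals 1/n + 1/dk**2, here about 0.13 (roughly 8x cheaper)
```

The cost ratio depends only on the number of output channels and the kernel size, which is why the saving grows with wider layers.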
In order to prevent overfitting, weaken the unimportant feature variables, and extract important feature variables, this paper adds the ReLU6 and L2 regularization layers before the output of MobileNetV2.
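As a sketch, the output head described above can be interpreted as ReLU6 followed by L2 normalization of the embedding vector; the exact layer composition is an assumption here, since L2 normalization onto the unit hypersphere is the usual companion of triplet losses (Schroff et al., 2015):

```python
import numpy as np

def embed_head(features, eps=1e-12):
    """Sketch of the assumed output head: ReLU6 followed by L2 normalization,
    so each embedding lies on the unit hypersphere. The layer composition is
    an assumption; the paper states only 'ReLU6 and L2 regularization'."""
    x = np.clip(features, 0.0, 6.0)            # ReLU6: clamp activations to [0, 6]
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)          # L2-normalize each embedding vector
```

With unit-norm embeddings, the squared Euclidean distance between two vectors is bounded in [0, 4], which makes fixed margins such as α and β meaningful.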

Improved Triplet Loss Function and Training on Ground
The triplet loss function was proposed in FaceNet (Schroff et al., 2015) and achieved good results in face recognition. It can also be used in image matching (Vo and Hays, 2016). The embedding is represented by f(x) ∈ ℝ^d; it embeds an image x into a d-dimensional hypersphere Euclidean space. The triplet loss enforces, for every triplet,

‖f(Ref) − f(Pos)‖² + α < ‖f(Ref) − f(Neg)‖²,

where Pos is the image from a different source at the same location as the GCPs, Neg is an image at a different location, and α is a margin enforced between positive and negative pairs. This makes the Ref image closer to the image from the same location than to any image from a different location. However, for remote sensing images, images from different locations may cover the same area, and such images are likely to be considered to come from the same location. Therefore, it is necessary to judge the best match among these images. This paper adds a term to the triplet loss function to minimize the distance between Ref and Pos and make it approach 0:

max(‖f(Ref) − f(Pos)‖² − β, 0),

where β is a margin that adjusts the minimum similarity. At the same time, in order to push Neg farther away from both Ref and Pos, this paper also adds a term that makes the distance between Pos and Neg larger than the distance between Ref and Pos. This is visualized in Figure 2. Finally, the improved triplet loss function is:

L = Σ [ max(‖f(Ref) − f(Pos)‖² − ‖f(Ref) − f(Neg)‖² + α, 0)
      + max(‖f(Ref) − f(Pos)‖² − β, 0)
      + max(‖f(Ref) − f(Pos)‖² − ‖f(Pos) − f(Neg)‖² + α, 0) ].

In this paper, α = 0.5 and β = 0.4.
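A minimal NumPy sketch of the improved loss follows, assuming the hinge (max with 0) form and a batch-level mean; the three terms mirror the classic triplet constraint, the positive-distance margin, and the second-negative constraint described above:

```python
import numpy as np

def improved_triplet_loss(f_ref, f_pos, f_neg, alpha=0.5, beta=0.4):
    """Sketch of the improved triplet loss (hinge form assumed).

    f_ref, f_pos, f_neg: (N, d) arrays holding the embeddings of the
    reference (GCPs), positive, and negative patches of N triplets.
    """
    d_rp = np.sum((f_ref - f_pos) ** 2, axis=1)  # positive-pair distance
    d_rn = np.sum((f_ref - f_neg) ** 2, axis=1)  # negative-pair distance
    d_pn = np.sum((f_pos - f_neg) ** 2, axis=1)  # second-negative-pair distance
    t1 = np.maximum(d_rp - d_rn + alpha, 0.0)    # classic triplet term
    t2 = np.maximum(d_rp - beta, 0.0)            # drive positive distance toward 0
    t3 = np.maximum(d_rp - d_pn + alpha, 0.0)    # push Neg away from Pos as well
    return float(np.mean(t1 + t2 + t3))
```

A well-separated triplet (identical Ref and Pos, distant Neg) yields zero loss, while a collapsed triplet is penalized by both margin terms.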

On-board Positioning Based on GCPs Matching
The GCPs image patches can be embedded into a d-dimensional hypersphere Euclidean space as the GCPs library, which saves a large amount of hard disk space on the satellite. The d-dimensional depth GCPs library and the lightweight feature extraction model are stored on the satellite. For an image to be matched, the depth features of the region are extracted and compared with the GCPs. The specific calculation flowchart is shown in Figure 3, where (x, y) are all the GCPs in the image area and s is the search step for different epochs.
Input: image to be positioned. Figure 3. The calculation flowchart of on-board positioning based on GCPs matching. The specific algorithm is also described in pseudo code in Table 2, where R_search is the set of searched offsets for each epoch and (x, y) is an image patch within R_search. The fourth step in Table 2 normalizes the direction and scale to the GCPs image patch. In order to balance search range and search accuracy, multiple epochs can be performed with different steps and search ranges, with the step of epoch n given by s_n = max(R_{n+1}) · s_{n+1}. For example, a total of three epochs are computed in this paper: in the first epoch, R_1 ∈ {−4 : 4}, s_1 = 50; in the second epoch, R_2 ∈ {−5 : 5}, s_2 = 10; and in the third epoch, R_3 ∈ {−10 : 10}, s_3 = 1.

Algorithm 1. On-board positioning based on GCPs matching
Input: image to be positioned
Output: positioned image
1. Find GCPs (x, y) in the image area
2. Initialize R_search as the search range and s as the step
3. …
…  Embed image patch into the d-dimensional space
…  End for
12. Image positioned by CorrectPointPairs
Table 2. On-board positioning based on GCPs matching
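The coarse-to-fine search behind Algorithm 1 can be sketched as below; `distance_fn` is a hypothetical callable that scores a candidate offset by the mean feature distance between the shifted patches and the GCPs library, and the epoch ranges and steps are the three pairs given above:

```python
def search_offset(distance_fn, epochs=((4, 50), (5, 10), (10, 1))):
    """Coarse-to-fine offset search (sketch of Algorithm 1's epochs).

    distance_fn(dx, dy) is a hypothetical callable returning the mean feature
    distance between the shifted image patches and the GCPs library at the
    candidate offset (dx, dy)."""
    best = (0, 0)
    for r, step in epochs:
        cx, cy = best
        # Scan a (2r+1) x (2r+1) grid of offsets around the current best.
        candidates = [(cx + i * step, cy + j * step)
                      for i in range(-r, r + 1) for j in range(-r, r + 1)]
        best = min(candidates, key=lambda c: distance_fn(*c))
    return best
```

Because each epoch's step times its range covers the previous step (s_n = max(R_{n+1}) · s_{n+1}), the finer epochs always reach every offset between two coarse candidates.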

Experiments Data
The positioning accuracy of the GCPs images has an important influence on the results of GCPs matching. This paper selects the China (Songshan) satellite remote sensing calibration field data as the GCPs images. This test field was jointly constructed by the China Resources Satellite Application Center, Wuhan University and the PLA Information Engineering University. The area of the test field is 9,000 square kilometres, located between Zhengzhou and Luoyang, with an east-west length of 105 km and a north-south length of 80 km. Its high-precision DOM (Digital Orthophoto Map) and DEM (Digital Elevation Model) were produced by aerial photogrammetry with more than 400 uniformly distributed high-precision GCPs. The ground resolution of the DOM is 0.4 meters, and the scale of the DEM is 1:5,000. The DOM data of the Songshan test site is shown in Figure 4. In this paper, Google (https://www.google.com/) and ArcGIS (https://www.arcgis.com/) images of the same area are used as training, testing, and validation data, respectively. In order to facilitate on-board matching, all images in this paper are panchromatic.

Feature Extraction Experiment with Improved Triplet Loss Function
The depth feature extraction model is the basis for subsequent localization through matching, and its accuracy directly affects the positioning accuracy. The depth feature extraction embeds GCPs image patches into the d-dimensional space as the GCPs library, which is stored on-board to meet the storage limit. As shown in Figure 4, this paper extracts evenly distributed SIFT feature points on the DOM as GCPs and, taking each GCP as the center, crops a 255 × 255 image as the GCPs image patch. Google images were selected as training images, and ArcGIS images were used as test and verification images. After removing image patches with obvious changes, 30,399 image patches were finally extracted. 30,399 image patches were also extracted from the Google and ArcGIS images at the same positions, and the orientation and scale of these patches were normalized to the GCPs images based on projection information. Each GCPs image patch and the Google or ArcGIS image patch at the same position were set as a positive pair, whose depth features need to be similar. In contrast, a negative pair consists of a GCPs image patch and the least similar image patch at another position, and a second negative pair is formed by the image patches from the positive and negative pairs excluding the GCPs image patch. The pairs consisting of GCPs and Google image patches constituted the training data set, while the GCPs and ArcGIS image patches were divided into test and verification data sets at 113.08°E: the area west of 113.08°E is the test data set, and the area east of 113.08°E is the verification data set. The light blue box in Figure 4 is the matching experiment area, which is located in the verification area. In this paper, a GCPs image patch was set to 255 × 255 pixels, and d was set to 128 because of its best trade-off between accuracy and efficiency (Schroff et al., 2015). Therefore, each GCPs image is compressed by a factor of almost 500.
The dataset was trained on an Nvidia RTX 2080 Ti, and the size of the feature extraction model is 9.83M. This model was used to embed the 30,399 GCPs image patches into the 128-dimensional feature space as the GCPs library, with a size of 15.3M. The model and the GCPs feature library were used in the subsequent experiments of on-board positioning by GCPs matching.
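As a rough consistency check, and assuming each embedding is stored as 128 float32 values (the storage format is not stated in the paper), the library size works out to the same order as the reported 15.3M:

```python
# Back-of-the-envelope size of the GCPs feature library, assuming each of the
# 30,399 patches is stored as a 128-dimensional float32 vector.
n_patches, dim, bytes_per_value = 30399, 128, 4
library_bytes = n_patches * dim * bytes_per_value
library_mb = library_bytes / 1e6  # roughly 15.6 MB, close to the reported 15.3M
```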

On-board Positioning Experiment Based on GCPs Matching
The pre-trained feature extraction model and the GCPs feature library extracted by it can be pre-loaded on the satellite before launch or transmitted via the satellite-ground transmission link. An image captured on the satellite is embedded into the d-dimensional space through the feature extraction model, and the d-dimensional feature is matched in the GCPs feature library. The matched points are used for further positioning. The light blue box in Figure 4 is the matching experiment area, which is located in the verification area. The ArcGIS image in this area participated in neither training nor testing, so it can be used to evaluate not only the positioning accuracy but also the generalization ability of the model. The GCPs, Google and ArcGIS images all have projection information because they are orthoimages, so they have accurate positions before any positioning. Therefore, this paper shifts the Google and ArcGIS images by a (dx, dy) offset and then uses Algorithm 1 to match and position them. The experiments were run on the on-board simulation device Nvidia Xavier as well as on an Nvidia RTX 2060 and an Nvidia RTX 2080 Ti. The offsets were set to (163 m, 152 m), (175 m, 152 m) and (175 m, 175 m). The images to be tested were the Google and ArcGIS experiment area images shown in the light blue box in Figure 4.
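The on-board lookup can be sketched as a nearest-neighbour search in the feature library. The function below is a simplification: it uses squared Euclidean distance, and treating the 0.7 positive/negative threshold from the accuracy analysis as a match acceptance cut-off is an assumption of this sketch:

```python
import numpy as np

def match_gcps(query, library, threshold=0.7):
    """Sketch: nearest-neighbour lookup of one query embedding in the GCPs
    feature library by squared Euclidean distance. Using the 0.7
    positive/negative separation value as an acceptance threshold is an
    assumption of this sketch, not a stated part of the paper's algorithm."""
    d = np.sum((library - query) ** 2, axis=1)  # distance to every GCPs feature
    idx = int(np.argmin(d))
    if d[idx] < threshold:
        return idx, float(d[idx])               # matched GCPs index and distance
    return None, float(d[idx])                  # no GCPs feature close enough
```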

Accuracy of Feature Extraction with Improved Triplet Loss Function
This section discusses and analyzes the accuracy of training with the improved triplet loss function. The data in Table 3 show that the test and validation results are consistent with the training results, which indicates that the trained model has good generalization ability. The image patches for each GCP are trained, and although the newly captured image patches are not trained, the test and validation datasets reach the same accuracy. This is similar to face recognition: if a face does not appear in the training dataset, it can never be recognized.
Table 3. Accuracy of feature extraction
The threshold between positive and negative pairs is 0.7, which can be determined from the density distribution map. The rate at which the positive-pair distance is smaller than the negative-pair distance is 0.973; the probability that the positive-pair distance is less than 0.7 is 0.938; the probability that the negative-pair distance is greater than 0.7 is 0.900; and the probability that both conditions hold is 0.844. These results show that the model has a good ability to distinguish positive and negative image pairs. In addition, the average positive-pair distance is 0.373. As intended by the improved triplet loss function, the positive-pair distance is driven toward 0, indicating that the depth features of image patches from different sources in the same area tend to be consistent.
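The verification rates above can be reproduced from arrays of per-pair feature distances; the sketch below uses small hypothetical distance arrays, not the paper's data:

```python
import numpy as np

def pair_separation_stats(d_pos, d_neg, threshold=0.7):
    """Sketch of the Table 3 style rates, computed from per-pair feature
    distances (d_pos and d_neg are hypothetical arrays of equal length)."""
    d_pos, d_neg = np.asarray(d_pos), np.asarray(d_neg)
    return {
        "pos_lt_neg": float(np.mean(d_pos < d_neg)),      # positive closer than negative
        "pos_lt_thr": float(np.mean(d_pos < threshold)),  # positive below threshold
        "neg_gt_thr": float(np.mean(d_neg > threshold)),  # negative above threshold
        "both": float(np.mean((d_pos < threshold) & (d_neg > threshold))),
    }
```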

Accuracy of Positioning
This section discusses and analyzes the results of the on-board positioning experiment. As can be seen in Figure 5 and Figure 6, this paper shifts the Google and ArcGIS images of the verification area shown in Figure 4 by (163 m, 152 m) and uses Algorithm 1 (Table 2) for GCPs matching. Because the ArcGIS image here is in neither the training nor the test dataset, it reflects the generalization ability and robustness of the model. Figure 5 and Figure 6 show the results of the first epoch of automatic matching with R_1 ∈ {−4 : 4}, s_1 = 50. The algorithm matches the correct points from the GCPs library on the images to be matched as the basis for subsequent positioning. As Figure 5 and Figure 6 show, the matched points are evenly distributed on the image, and the matching accuracy is visually good. The ArcGIS image did not participate in training or testing but can still be accurately matched with the GCPs; therefore, in the future, new images taken on the satellite can also be matched on-board against the GCPs library for positioning. As shown in Table 5, this paper calculates and analyzes the positioning accuracy of the three epochs in Algorithm 1. This paper also adds experiments with different offsets and a comparison with traditional SIFT features. The 128-dimensional SIFT features extracted from the GCPs images were pre-loaded on-board; the SIFT features extracted from the target image obtained on-board were matched with the pre-loaded features in the same area, and the RANSAC algorithm was used to eliminate mismatched points. The best results are shown in bold in Table 5. The results show that the positioning accuracy of this algorithm is within 30 meters. The matching accuracy of the first epoch has a greater impact on the accuracy of subsequent matching; however, its ability to match small offsets is insufficient.
Because it is difficult to distinguish the deep features of a micro-offset image from those of the GCPs, the matching ability for micro-offset images needs to be further improved with a larger training dataset. The matching accuracy of the conventional SIFT algorithm was generally lower than that of the algorithm in this paper. In addition, the SIFT algorithm struggles to match images that differ greatly from the control point images, such as infrared images, whereas the algorithm in this paper provides good robustness by increasing the range of the training set.

Efficiency Analysis
This paper evaluates the efficiency of the Nvidia Xavier, which simulates on-board devices, as well as other devices such as the Nvidia RTX 2060 and RTX 2080 Ti. Table 4. Efficiency analysis. As can be seen in Table 4, the number of frames to be processed differed for different search ranges, but the processing time per frame was the same; for the Xavier, the per-frame time also differed between power modes. Within a reasonable search range, the efficiency meets the needs of on-board processing. Table 5. Positioning accuracy.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume V-2-2020, 2020 XXIV ISPRS Congress (2020 edition)
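Under the assumption that each candidate patch costs one feature extraction ("frame"), the total search time per power mode for the three epochs of Algorithm 1 can be estimated as follows; this is an illustrative estimate, not a figure from Table 4:

```python
# Back-of-the-envelope search cost, assuming one feature extraction per
# candidate offset and the three epoch ranges of Algorithm 1.
per_frame_s = {"30W": 0.005, "15W": 0.009, "10W": 0.018}  # Jetson Xavier modes
patches = sum((2 * r + 1) ** 2 for r in (4, 5, 10))        # 81 + 121 + 441 = 643
total_s = {mode: patches * t for mode, t in per_frame_s.items()}
```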

CONCLUSION
In this paper, on-board image positioning is achieved through on-board GCPs matching. This paper proposes an improved triplet loss function and trains a lightweight feature extraction model with it. The model embeds GCPs image patches into a 128-dimensional feature space as the GCPs library. The sizes of the extraction model and the GCPs library are 9.83M and 15.3M respectively, so the model and library can be stored on-board the satellite. This paper also designs an algorithm for on-board positioning through GCPs matching. The images taken on-board are positioned by GCPs matching in real time, and the positioning accuracy is within 30 meters. In subsequent studies, the positioning accuracy will be further improved under this framework, and the work done in this paper will be experimented on the upcoming Luojia-3-01 intelligent remote sensing satellite.