GEOMETRIC AND NON-LINEAR RADIOMETRIC DISTORTION ROBUST MULTIMODAL IMAGE MATCHING VIA EXPLOITING DEEP FEATURE MAPS

Image matching is a fundamental problem in multimodal image fusion. Most recent studies focus only on the non-linear radiometric distortion between coarsely registered multimodal images. The global geometric distortion between images must be eliminated using prior information (e.g. direct geo-referencing information and ground sample distance) before these methods can be used to find correspondences. However, such prior information is not always available or accurate enough. In that case, users have to select ground control points manually to register the images; otherwise, these methods fail. To overcome this problem, we propose a robust deep learning-based multimodal image matching method that deals with geometric and non-linear radiometric distortion simultaneously by exploiting deep feature maps. We observe that some deep feature maps have similar grayscale distributions, and that correspondences can be found from these maps using traditional geometric distortion robust matching methods even when significant non-linear radiometric differences exist between the original images. Therefore, we need to consider only geometric distortion when dealing with deep feature maps, and only non-linear radiometric distortion when measuring patch similarity. The experimental results demonstrate that the proposed method performs better than state-of-the-art matching methods on multimodal images with both geometric and non-linear radiometric distortion.


INTRODUCTION
Multimodal images reflect different characteristics of the observed objects because the imaging mechanisms of the sensors differ. Making full use of the complementary advantages of multimodal images can help image interpretation. Image matching, which seeks correspondences in overlapping image regions, is a fundamental task in the processing and application of multimodal images (Kong et al., 2019; Sedagha and Mohammadi, 2019; Zhang et al., 2019). Because multimodal images are often captured from different platforms, sensors and viewpoints and at different times, there are significant geometric and non-linear radiometric distortions between them (see Figure 1), which make reliable image matching very difficult. Many methods have been proposed to achieve reliable image matching. According to their matching strategy, they can be classified into two main categories: area-based methods and feature-based methods (Gruen, 2012; Chen et al., 2017).
For a pair of multimodal images (one reference image and one target image), an area-based method sets a window on the reference image and a corresponding search area on the target image, and takes the most similar window in the search area as the matched window. The central points of the two windows are regarded as a pair of matches (Gruen, 2012). The two key issues of area-based methods are determining a search area of appropriate size and constructing a reliable and robust similarity measure. The search area should contain the corresponding window region but should not be too large. Existing methods usually use direct geo-referencing information or perform a coarse registration step to roughly eliminate the global geometric distortion between images, and then determine a search area of appropriate size. Among similarity measures, normalized cross correlation (NCC) and mutual information (MI) are robust to radiometric changes to some extent (Chen et al., 2003; Hel-Or et al., 2014). However, they still adapt poorly to the non-linear radiometric difference between multimodal images. To improve the matching performance, some methods based on the phase congruency model (Kovesi, 1999), such as HOPC and CFOG (Ye et al., 2016, 2019), have been proposed. However, these methods share the common problem of other area-based methods: they adapt poorly to image geometric distortion. For example, in the HOPC and CFOG methods, image rotation and translation should be coarsely eliminated by direct geo-referencing, and image scale change should be removed based on the ground sample distance (GSD). Therefore, they fail when the prior information from physical sensor models, navigation devices and GSD is unknown or not accurate enough. Another way of eliminating geometric distortion is to find some ground control points (GCPs) to estimate the geometric transformation between images. However, it is difficult to extract reliable GCPs automatically under geometric and non-linear radiometric distortion by using traditional matching methods. Thus, users have to select GCPs manually in many cases, which limits the widespread application of this kind of method.
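The window search used by area-based methods can be illustrated with a minimal sketch. It uses NCC as the similarity measure and an exhaustive search over a small search area; the function names `ncc` and `best_match` are hypothetical and the code is not taken from any of the cited methods.

```python
import math

def ncc(a, b):
    """Normalized cross correlation of two equally sized windows (lists of rows)."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    # A flat window has zero variance; treat it as uncorrelated.
    return num / den if den > 0 else 0.0

def best_match(template, search_area):
    """Slide the template over the search area; return (row, col, score) of the best NCC."""
    h, w = len(template), len(template[0])
    best = (0, 0, -2.0)  # NCC lies in [-1, 1], so -2 is below any real score
    for r in range(len(search_area) - h + 1):
        for c in range(len(search_area[0]) - w + 1):
            window = [row[c:c + w] for row in search_area[r:r + h]]
            score = ncc(template, window)
            if score > best[2]:
                best = (r, c, score)
    return best
```

Because NCC subtracts the mean and divides by the standard deviation, a linear change of brightness and contrast leaves the score unchanged. A non-linear radiometric change does not have this property, which is the limitation discussed above.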
Feature-based methods are more robust to image geometric distortion than area-based methods because geometric distortion is considered in the design of the feature description algorithm. Feature-based matching usually includes three steps: feature detection, description and matching. First, feature detectors are adopted to extract features from the reference image and the target image. Interest points, lines and regions are the three most commonly used features. In this paper, feature means interest point unless otherwise stated. Then, a robust feature descriptor is constructed to describe the features. Finally, features are matched based on descriptor similarity. According to how feature detection, description and similarity measurement are carried out, feature-based methods can be further subdivided into handcrafted methods and deep learning-based methods. Among the handcrafted methods, the scale invariant feature transform (SIFT) method (Lowe, 2004) is a milestone work and is widely used in many fields, including photogrammetry and remote sensing. Following the success of SIFT, many handcrafted methods, such as speeded up robust features (SURF) (Bay et al., 2008), oriented FAST and rotated BRIEF (ORB) (Rublee et al., 2011) and Affine-SIFT (ASIFT) (Morel and Yu, 2009), have been proposed to improve time efficiency and robustness to image geometric distortion. To overcome the problems caused by non-linear radiometric difference, some methods based on local self-similarity and phase congruency have been proposed (Huang et al., 2011). However, similar to area-based methods, these methods improve the robustness to non-linear radiometric distortion but are not robust to image geometric distortion.
Recently, with the rapid improvement of computer hardware and software, deep learning has attracted more and more attention and has been introduced into the field of image matching (Zagoruyko and Komodakis, 2015; Altwaijry et al., 2016; Melekhov et al., 2017; He et al., 2018). A common idea of deep learning-based matching methods is as follows: a deep convolutional neural network with two weight-sharing branches is constructed and trained on positive and negative samples by minimizing the distance between the deep features of positive samples and maximizing the distance between the deep features of negative samples. Studies have shown that deep learning-based matching methods are robust to non-linear radiometric distortion between images (He et al., 2019; Quan et al., 2019). However, most existing deep learning-based matching methods focus only on the non-linear radiometric difference between images. As with area-based methods, the geometric distortion between images should be coarsely corrected before matching.
As discussed above, existing matching methods are not robust enough to geometric and non-linear radiometric distortion simultaneously, which limits their practical application. To overcome this problem, we propose a multimodal image matching method that is robust to geometric and non-linear radiometric distortion, built in the framework of convolutional neural networks by exploiting deep feature maps. Firstly, a Siamese-type neural network containing convolutional layers and fully connected layers is designed. For presentation purposes, this network is denoted FSNet (fully connected Siamese-type neural network). FSNet extracts deep features and performs feature similarity measurement. A training dataset that takes negative sample distance into account is collected to train FSNet. Secondly, the convolutional layers of FSNet are extracted to form another network, denoted CSNet (a Siamese-type neural network that has only convolutional layers). CSNet is used to produce deep feature maps for multimodal images of any size without known geo-referencing information and GSD. Then, a geometric transformation is estimated by exploiting the deep feature maps to eliminate the geometric distortion between the input multimodal images. After that, interest points are detected on the reference image and on the target image from which the geometric distortion has been eliminated. Image patches corresponding to the interest points are generated and input into FSNet to find matches. Finally, inliers are identified in the matching result with the RANSAC method and mapped back into the original images. The main contributions of this paper are as follows.
1) This paper points out that the main reason deep learning-based methods can match multimodal images with non-linear radiometric difference is that the non-linear radiometric difference between some deep feature map pairs generated from the last convolutional layer has been alleviated or eliminated. Based on this observation, a multimodal image matching method robust to geometric and non-linear radiometric distortion is proposed. The proposed matching framework is flexible and can be combined with other image matching neural networks in addition to the Siamese-type neural network used in this paper.
2) We find that some non-corresponding image patches with small spatial distances to the corresponding image patch have high similarity because they have large overlapping areas, which leads to mismatches around corresponding points. To overcome this problem, a negative sample generation strategy that takes the distance between non-corresponding and corresponding patches into account is proposed in this paper.
The remainder of this paper is organized as follows. Section 2 presents a Siamese-type neural network and analyses the feasibility of exploiting deep feature maps to deal with the geometric and non-linear radiometric differences between multimodal images. Section 3 describes the proposed multimodal image matching method in detail. The experimental results, along with the method of training dataset generation, matching performance analysis and discussion, are presented in Section 4. The final section concludes this paper and points out possible further improvements that can be made.

Network architecture
The Siamese-type neural network has proved to be an outstanding architecture for computer vision tasks in recent years. It has been widely used in fields such as target tracking and similarity discrimination of images and texts. Because Siamese-type neural networks perform well for multimodal images without significant geometric distortion, a Siamese-type neural network is designed as the base model of the proposed matching method. This network is denoted FSNet in this paper, as shown in Figure 2. FSNet contains convolutional and fully connected layers but no pooling layer, because pooling layers may make it hard for the network to locate the correct match accurately and thus affect the matching performance. The convolutional layers extracted from the trained FSNet form a new network, denoted CSNet. To achieve optimal performance, we tested architectures with different numbers of layers and different convolution kernel sizes. We found that the network with five convolutional layers (with convolution kernel sizes of 3×3, 5×5, 5×5, 5×5 and 5×5) and two fully connected layers performed well.
Figure 2. Architecture of the networks used in this paper.

Loss function
From the perspective of metric learning, image patch matching can be transformed into a binary classification task. A commonly used objective function in classification is the cross entropy loss function (Miller et al., 1993). Specifically, the Sigmoid cross entropy loss function (Han and Moraga, 1995) is used in this paper. Given a triplet input consisting of two image patches and their label, the Sigmoid cross entropy function is used for both clustering the positive samples and separating the negative samples.
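As a point of reference, the generic Sigmoid cross entropy for one patch pair can be sketched as follows. This is the standard textbook form written in a numerically stable way, not necessarily the exact formulation of this paper; the names `logit` (the raw similarity score output by the network) and `label` (1 for a matching pair, 0 for a non-matching pair) are assumptions made for illustration.

```python
import math

def sigmoid_cross_entropy(logit, label):
    """Sigmoid cross entropy for one patch pair.

    Equivalent to -label*log(s) - (1-label)*log(1-s) with s = sigmoid(logit),
    rearranged so that exp() is never called with a large positive argument.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def mean_loss(logits, labels):
    """Average loss over a batch of patch pairs."""
    return sum(sigmoid_cross_entropy(z, y) for z, y in zip(logits, labels)) / len(logits)
```

A confident correct prediction (large positive logit for a matching pair, large negative logit for a non-matching pair) gives a loss close to zero, so minimizing the loss pulls positive samples together and pushes negative samples apart.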

Deep feature maps analysis
From the architecture of the network shown in Figure 2, if FSNet can recognize positive samples correctly, we can infer that some of the feature maps output by the two branches of CSNet are similar, that is, there is no significant non-linear radiometric distortion between the corresponding feature maps. This is because the fully connected layers in FSNet mainly play the role of dimensionality reduction and similarity measurement. Figure 3 gives an example that supports this conjecture. In Figure 3, the input images are the blue band image and the near-infrared band image of a Landsat-8 image. The images shown in the grids are the deep feature maps generated by CSNet, arranged in the order of the neurons in the last layer of CSNet. It can be seen that although there is significant non-linear radiometric difference between the input images, some deep feature maps, marked with dotted boxes of the same color, have similar appearance. The non-linear radiometric difference has been alleviated or eliminated. This conjecture can be extended to images with both geometric and non-linear radiometric distortion: if a pair of multimodal images of any size with geometric and non-linear radiometric distortion is input into CSNet, the non-linear radiometric difference between some deep feature maps generated from corresponding neurons in the last convolutional layer will be alleviated or eliminated. Under such circumstances, traditional geometric distortion robust matching methods (e.g. SIFT, SURF or ASIFT) can be adopted to find correct matches from these pairs of feature maps, and the global geometric distortion between the original images can then be eliminated by a coarse registration step. On the basis of this conjecture, a multimodal image matching method that is robust to geometric and non-linear radiometric distortion is proposed in this paper (see Section 3).

ROBUST MULTIMODAL IMAGE MATCHING VIA EXPLOITING DEEP FEATURE MAPS
Our goal is to match multimodal images with geometric and non-linear radiometric distortion. In this study, we deal with the geometric distortion and the non-linear radiometric distortion in turn, but the influence of the latter is considered when dealing with the former and vice versa. The flowchart of the proposed robust multimodal image matching method is shown in Figure 4.

Deep feature map generation based on CSNet:
According to the analysis in Section 2, the reference image and the target image should be input to CSNet to generate deep feature maps. Because remote sensing images, especially satellite images, are large, it is very inefficient to input them directly into CSNet. To overcome this problem, the input images are first down-sampled and then input into the convolutional layers for processing. The down-sampling rate of the reference image and the target image should be kept the same, so as to maintain the fixed scale relationship between the original images. Therefore, the proposed down-sampling method, given as Equation (3), uses a scale factor together with a size factor that determines the size of the down-sampled images. The size factor is empirically set as 600 in this paper.
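The constraint that both images share one down-sampling rate can be illustrated with a short sketch. The exact form of Equation (3) is not reproduced here, so the rule below is purely an assumption made for illustration: the common scale is chosen so that the longest side among the two images becomes `size_factor` pixels, and images that are already small are left unchanged. The function names are hypothetical.

```python
def shared_scale(ref_size, tgt_size, size_factor=600):
    """Return one scale for both images (assumed rule, not the paper's Equation (3)).

    ref_size and tgt_size are (width, height) tuples in pixels.
    """
    longest_side = max(ref_size + tgt_size)  # longest side over both images
    return min(1.0, size_factor / longest_side)

def downsampled_sizes(ref_size, tgt_size, size_factor=600):
    """Apply the same scale to both images so their relative scale is preserved."""
    s = shared_scale(ref_size, tgt_size, size_factor)

    def scale_size(size):
        return tuple(max(1, round(d * s)) for d in size)

    return s, scale_size(ref_size), scale_size(tgt_size)
```

For example, a 6000×4000 reference image and a 3000×2000 target image would both be scaled by 0.1 under this assumed rule, so the 2:1 size ratio between them is unchanged after down-sampling.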

BoF-based deep feature map retrieval:
After down-sampling, the multi-layer convolutional operation is performed to obtain deep feature maps, from which the most similar feature map pairs can be found. The most straightforward way to find the most similar deep feature map pair is to perform feature matching on every pair of deep feature maps and to use the matching result of the pair with the most matches for coarse registration. However, such a strategy is inefficient and unreliable. To overcome this problem, we use a Bag-of-Features-based (BoF-based) image retrieval method (Philbin et al., 2007) to measure the similarity of each pair of deep feature maps, and select the three pairs with the highest similarity for feature matching. Because the non-linear radiometric distortion has been significantly relieved between similar deep feature maps, the SIFT method is adopted for feature matching.
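The selection of the three most similar feature map pairs can be sketched as follows. The sketch assumes that a BoF histogram (visual word counts) has already been computed for every feature map, and it uses plain cosine similarity between the histograms of corresponding maps; the retrieval method of Philbin et al. (2007) is more elaborate, so this is a simplified stand-in. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two BoF histograms of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def top_pairs(ref_hists, tgt_hists, k=3):
    """Return the indices of the k most similar corresponding feature map pairs.

    ref_hists[i] and tgt_hists[i] describe the maps produced by the same neuron
    in the two branches of the network.
    """
    scored = [(cosine(r, t), i) for i, (r, t) in enumerate(zip(ref_hists, tgt_hists))]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [i for _, i in scored[:k]]
```

Only the selected pairs are passed on to SIFT matching, so the expensive feature matching step runs three times rather than once per feature map.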

Image coarse registration:
After feature matching, the three generated sets of matches are merged into one group. The RANSAC algorithm is performed on the group to eliminate outliers. Then, a projective transformation is fitted to the remaining matches, and the original target image is transformed into the coordinate system of the original reference image.

Interest point matching based on FSNet
Through the process described in subsection 3.1, the geometric distortion between multimodal images has been roughly eliminated. There is only non-linear radiometric difference between corresponding image patches on the reference image and the registered target image. Thus, we use the FSNet to match this kind of patches.
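In order to produce image patches for similarity measurement, the Harris detector (Harris and Stephens, 1988) is used to detect interest points on the reference image and the registered target image. Image patches corresponding to the interest points are generated and input into FSNet to find matches. When all interest points on the reference image have been processed, the RANSAC algorithm is adopted to eliminate outliers, and the inliers are mapped back to the original image to give the final matches.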
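Mapping an inlier from the registered target image back to the original target image amounts to applying the inverse of the projective transformation estimated during coarse registration. A minimal sketch, assuming the transformation is available as a 3×3 matrix `H` that maps original target coordinates to reference coordinates (the function names are hypothetical):

```python
def apply_homography(H, pt):
    """Apply a 3x3 projective transformation to a 2D point (x, y)."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def invert_3x3(H):
    """Invert a 3x3 matrix using the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = H
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("transformation is not invertible")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def to_original_target(H, registered_pt):
    """Map a matched point on the registered target image back to the original target image."""
    return apply_homography(invert_3x3(H), registered_pt)
```

The matched point on the reference image needs no correction, because the reference image is not resampled during coarse registration.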

EXPERIMENTAL RESULTS AND ANALYSIS
In order to demonstrate the effectiveness of the proposed method, we compare it with both handcrafted and deep learning-based methods. Among the handcrafted methods, the most popular SIFT method (Lowe, 2004) and the multimodal image matching method CFOG (Ye et al., 2019) are selected. Among the deep learning-based methods, the FSNet described in Section 2 is selected for comparison. Because FSNet is the base network of the proposed method, a better performance of the proposed method than FSNet would show that the proposed matching strategy is effective. As in the proposed method, all three compared methods are followed by a RANSAC step to eliminate outliers.

Datasets
Three types of multimodal image pairs, namely visible-to-infrared, optical-to-SAR and optical-to-LiDAR, are used in our experiments. There is significant non-linear radiometric distortion between the images of each pair. To evaluate the robustness of the proposed method to geometric distortion, scale and rotation changes are added to the images manually. In total, six image pairs are formed. The datasets are shown in Figure 5, where (e) and (f) are the optical-to-LiDAR image pairs. Scale changes exist between the images in pairs 1, 3 and 5, and both rotation and scale changes exist between the images in pairs 2, 4 and 6.

Evaluation criteria
In our experiments, two widely used indicators, number of correct matches (NCM) and matching precision (MP), are adopted to evaluate the performance of the proposed matching method. MP is computed as Equation (4).

$$\mathrm{MP} = \frac{\mathrm{NCM}}{\mathrm{NTM}} \qquad (4)$$
where NTM is the number of total matches. To count NCM, we manually selected some evenly distributed GCPs to fit a projective transformation for each image pair. Then, a localization error is computed for each match on the basis of this transformation. If the localization error is smaller than a threshold (2 pixels in this paper), the match is regarded as a correct match.
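The evaluation described above can be sketched as follows. The sketch assumes that the projective transformation fitted from the GCPs is given as a 3×3 matrix `H` mapping reference coordinates to target coordinates; the direction of the mapping and the function names are assumptions made for illustration.

```python
import math

def project(H, pt):
    """Apply a 3x3 projective transformation to a 2D point (x, y)."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def evaluate(matches, H, threshold=2.0):
    """Return (NCM, NTM, MP) for a list of ((x_ref, y_ref), (x_tgt, y_tgt)) matches."""
    ncm = 0
    for ref_pt, tgt_pt in matches:
        px, py = project(H, ref_pt)
        error = math.hypot(px - tgt_pt[0], py - tgt_pt[1])  # localization error in pixels
        if error < threshold:
            ncm += 1
    ntm = len(matches)
    mp = ncm / ntm if ntm else 0.0
    return ncm, ntm, mp
```

A match whose localization error equals the threshold exactly is counted as false here, following the "smaller than" wording of the criterion.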

Training dataset:
A training dataset consisting of 150000 pairs of matching patches (positive samples) and 150000 pairs of non-matching patches (negative samples) is generated from multimodal images, including Google Earth images, ZY3 satellite images, Landsat-8 satellite images, TerraSAR-X satellite images and elevation rendering images of LiDAR point clouds. Multiple land cover types, such as buildings, rivers, roads and farmlands in urban and rural areas, are contained in the training dataset.
In the generation of negative samples, a strategy that considers the distance between the centre points of non-matching patches is proposed to overcome the false matching problem caused by neighbouring points. As shown in Figure 6, the red patches form a positive sample pair. Around the corresponding patch (red box) on the target image, eight patches (blue boxes) that are r pixels from the corresponding patch are extracted. One of the eight patches is selected randomly to form a negative pair with the reference patch. We call this kind of negative sample a DNS (distance-based negative sample).
Figure 6. Negative sample generation by considering the distance between the centre points of matching and non-matching patches.
In the training sample generation, if n pairs of positive samples are collected, there will be n pairs of DNS. However, we randomly select only n/2 pairs from all DNS. Another n/2 pairs of negative samples are produced randomly from the whole images without considering sample distance. This kind of negative sample is denoted RNS (random negative sample). Therefore, our training dataset contains n pairs of positive samples, n/2 pairs of DNS and n/2 pairs of RNS (n = 150000 in this study).
Figure 7. Examples of training samples. (a) shows samples, including a pair of DNS, generated from visible-to-infrared images. (b) and (c) show samples generated from optical-to-SAR and optical-to-LiDAR images, respectively.
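The negative sample strategy can be sketched in a few lines. The layout of the eight DNS candidates is an assumption: they are taken here at offsets of r pixels along the horizontal, vertical and diagonal directions, which is one plausible reading of Figure 6. The function names and the use of patch centre coordinates instead of pixel data are also illustrative choices.

```python
import random

def dns_centres(cx, cy, r):
    """Eight candidate centres around the corresponding patch centre (assumed layout)."""
    return [(cx + dx * r, cy + dy * r)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

def build_negatives(positives, r, image_size, rng):
    """Build n/2 DNS and n/2 RNS from n positive pairs of (ref_centre, tgt_centre)."""
    n = len(positives)
    # One DNS per positive pair: pick one of the eight neighbours at random.
    dns_all = [(ref, rng.choice(dns_centres(tgt[0], tgt[1], r)), "DNS")
               for ref, tgt in positives]
    dns = rng.sample(dns_all, n // 2)
    # RNS: a random location anywhere on the target image, ignoring distance.
    width, height = image_size
    rns = []
    for ref, tgt in rng.sample(positives, n - n // 2):
        candidate = tgt
        while candidate == tgt:  # avoid re-creating the positive pair by chance
            candidate = (rng.randrange(width), rng.randrange(height))
        rns.append((ref, candidate, "RNS"))
    return dns + rns
```

Mixing the two kinds of negatives keeps the easy, far-away negatives in the training set while adding hard negatives that overlap the true patch, which is what discourages mismatches next to the correct point.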

Network training:
We trained FSNet on an NVIDIA GTX 1080Ti GPU with TensorFlow. A batch size of 32 is used in each iteration, and all the patches are resized to 97×97 pixels. The training is optimized with the Momentum Optimizer in TensorFlow. The initial learning rate and the momentum are set as 0.001 and 0.9, respectively. The training is terminated when the average loss value is less than 0.001. Figure 8 shows the convergence process of the training.
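To make the optimizer settings concrete, the update rule of classical momentum and the loss-based stopping criterion can be sketched on a toy problem. The update form (accumulate the gradient, then step along the accumulator) is the usual one for momentum optimizers; the quadratic toy loss, the step cap and the function names are assumptions for illustration and have nothing to do with the actual FSNet training.

```python
def momentum_step(w, v, grad, lr=0.001, momentum=0.9):
    """One momentum update: accumulate the gradient, then move the weight."""
    v = momentum * v + grad
    w = w - lr * v
    return w, v

def train_toy(w=1.0, lr=0.001, momentum=0.9, tol=0.001, max_steps=10000):
    """Minimize the toy loss 0.5 * w**2 until the loss drops below tol."""
    v = 0.0
    for step in range(max_steps):
        loss = 0.5 * w * w
        if loss < tol:  # stopping rule analogous to the loss threshold above
            return w, loss, step
        w, v = momentum_step(w, v, w, lr, momentum)  # gradient of 0.5*w^2 is w
    return w, 0.5 * w * w, max_steps
```

With a momentum of 0.9, the accumulator approaches ten times a constant gradient, so the effective step size is roughly ten times the nominal learning rate of 0.001.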

Multimodal image matching results
The statistical results, including NCM and NTM, are displayed in Table 1. It can be seen directly from Table 1 that the proposed method performs best on all image pairs. The descriptor computation in the SIFT method includes a normalization step, which makes it robust to illumination variation. However, the normalization is not robust to non-linear illumination difference. Therefore, although it is scale and rotation invariant, the SIFT method does not perform well on multimodal images with non-linear radiometric distortion.
The CFOG method is designed specifically for multimodal image matching. Some prior information, including direct geo-referencing information and GSD, must be available to correct the global geometric distortion between images for the method to work. If the prior information is unavailable, GCPs have to be selected manually to achieve coarse registration. In our experiments, however, we have neither prior information about the images nor enough GCPs to overcome the problem caused by geometric distortion. Therefore, the CFOG method failed on all image pairs with scale and rotation changes.
The FSNet method is a kind of deep learning-based method. Because geometric distortion has not been considered in the patch generation before similarity measurement, the patches of corresponding points are inconsistent. Thus, the features extracted from the patches are dissimilar and finally cannot be matched.
Compared with the aforementioned methods, the proposed method considers both non-linear radiometric difference and geometric distortion. On the one hand, we exploit the deep feature maps to estimate the global geometric transformation between multimodal images and register the images automatically. The coarse registration works without any information other than the two input images. After that, patches are produced from the coarsely registered images. Therefore, the proposed method is robust to image geometric distortion. On the other hand, we use a deep neural network trained on a dataset containing multimodal image patches with non-linear radiometric difference to extract features and measure similarity. The feature extraction and similarity measurement are robust to non-linear radiometric distortion. Therefore, the proposed method performs best on all image pairs. Figure 9 presents the matching results of the proposed method on all image pairs. Matches are linked with red lines.
In addition, we can see from Table 1 that the proposed method performs better on image pairs 1 and 2 than on image pairs 3-6. There are two main reasons. First, visible-to-infrared training samples are easier to produce than optical-to-SAR and optical-to-LiDAR training samples. Therefore, the number of visible-to-infrared samples in our training dataset is larger than that of the other two kinds of samples, which makes the trained neural network perform better on visible-to-infrared image pairs. Second, in the evaluation, it is easier to select accurate GCPs from visible-to-infrared image pairs than from optical-to-SAR and optical-to-LiDAR image pairs. Therefore, the estimated image transformation used to distinguish correct and false matches is more accurate for visible-to-infrared image pairs, whereas some correct matches on optical-to-SAR and optical-to-LiDAR image pairs may be wrongly counted as false matches.

CONCLUSION
In this study, a multimodal image matching method that is robust to both geometric and non-linear radiometric distortion is proposed. We observed that the non-linear radiometric distortion between some deep feature maps generated from the last convolutional layer has been eliminated or relieved. On the basis of this observation, we analyzed the deep feature maps and designed a framework to overcome the geometric distortion between multimodal images. In this process, the geometric transformation can be estimated by using a traditional feature matching method. With the help of deep feature maps, we do not need to construct a feature descriptor that is robust to both geometric and non-linear radiometric distortion. The experimental results demonstrate that the proposed method performs better than other state-of-the-art multimodal image matching methods. The proposed method can be used to match multimodal images without any prior information or manually selected GCPs. A possible direction for future work is to increase the number of training samples so that the proposed method works well on multimodal images from different areas.