ROBUST REGISTRATION FOR OPTICAL AND SAR IMAGES BASED ON SPATIAL GEOMETRIC CONSTRAINT AND STRUCTURE FEATURES

Automatic registration of optical and synthetic aperture radar (SAR) images is a challenging task due to the significant geometric deformation and radiometric differences between the two images. To address this issue, this paper proposes an automatic registration method for optical and SAR images based on a spatial geometric constraint and structure features. First, the Harris detector with a block strategy is used to extract evenly distributed feature points in the images, and a local geometric correction is performed by using the Rational Function Model, which eliminates the rotation and scale differences between the optical and SAR images. Second, the orientated gradient information of the images is used to construct a geometric structural feature descriptor. The feature descriptor is then transformed into the frequency domain, and three-dimensional (3-D) phase correlation is used as the similarity metric to achieve correspondences through a template matching scheme. Finally, mismatches are eliminated based on the spatial geometric constraint relationship between the images, followed by a geometric correction process to achieve image registration. Experimental results with multiple high-resolution optical and SAR images show that the proposed method achieves reliable registration accuracy and outperforms state-of-the-art methods.


INTRODUCTION
During the last decades, remote sensing systems and imaging techniques have undergone rapid development, and various multimodal remote sensing images (such as synthetic aperture radar (SAR), infrared, and visible images) can be acquired. These remote sensing images are widely used in many applications, including change detection (Bruzzone and Bovolo, 2013), image fusion (Schmitt et al., 2017), agriculture and forestry monitoring (Rhee et al., 2010), and land-use analysis (Tuia et al., 2016). The combined utilization of multimodal remote sensing images can overcome some limitations of single-modal images. SAR and optical images are acquired by different sensors under different imaging mechanisms, i.e., the active SAR sensor and the passive optical sensor. SAR images are capable of delivering information day and night and under all-weather conditions because of their high penetration, but they suffer from serious inherent speckle noise. Optical images have good interpretability, but they are easily affected by atmospheric conditions such as clouds and fog (Sui et al., 2015). Since the two types of images provide complementary information, it is necessary to integrate SAR and optical images for subsequent remote sensing applications. Therefore, automatic registration of optical and SAR images remains an active research topic.
In general, automatic image registration methods can be classified into two categories: intensity-based and feature-based methods (Zitova and Flusser, 2003). Intensity-based methods estimate the optimal transformation parameters between images by control points (CPs), which are detected by using some similarity metric. Commonly adopted similarity metrics include mutual information (MI) (Suri and Reinartz, 2010) and normalized cross correlation (NCC). NCC is sensitive to the nonlinear radiometric differences between optical and SAR images, which limits its application to optical and SAR image registration. Compared with NCC, MI can adapt to nonlinear radiometric differences between images, but it has the disadvantage of a large computational cost. Feature-based methods estimate the geometric transformation by matching distinct features between images. These features include points, lines and contours (Ma et al., 2017; Li et al., 1995). Recently, local invariant features, such as the scale-invariant feature transform (SIFT) and speeded up robust features (SURF), have been widely used for remote sensing image registration due to their invariance to image scale and rotation changes (Sedaghat et al., 2011; Patel et al., 2016). Although these features are robust to illumination changes and geometric distortions between optical images, they are vulnerable to the nonlinear radiometric differences between optical and SAR images (Dare and Dowman, 2001).
* Corresponding author: Yuanxin Ye, yeyuanxin@swjtu.edu.cn
Recently, Ye et al. proposed a robust feature descriptor based on the structural properties of images, named the histogram of orientated phase congruency (HOPC) (Ye et al., 2017). Subsequently, Ye et al. proposed a novel pixel-wise feature representation built from the orientated gradients of images, named the channel features of orientated gradients (CFOG) (Ye et al., 2019). These feature descriptors show a robust ability to handle the significant nonlinear radiometric differences between optical and SAR images. Nevertheless, they are high-dimensional features with a large data volume, and they are not invariant to scale and rotation differences. In fact, scale and rotation differences between optical and SAR images are also factors that affect registration performance.
Currently, deep learning-based methods have been applied to multimodal remote sensing image registration. Merkle et al. used a Siamese network to learn the spatial shift between optical and SAR image patches (Merkle et al., 2017). Girard et al. proposed a deep learning method that improves registration performance through multi-task learning (Girard et al., 2018). Subsequently, Bürgmann et al. designed a descriptor learned by a deep neural network named HardNet, which can automatically detect CPs between SAR and optical images (Bürgmann et al., 2019). In general, deep learning-based optical and SAR image registration has become an active research direction in recent years. However, one drawback of the current methods is their dependence on large training datasets, and building such datasets is a tedious task.
In order to address the above-mentioned issues, this paper proposes a robust and automatic optical-to-SAR remote sensing image registration method based on a spatial geometric constraint and structural features. First, local scale and rotation differences between optical and SAR images are eliminated by using the Rational Function Model (RFM) of the remote sensing images. Second, the matching region is predicted based on the spatial geometric constraint, and a structural feature descriptor is built by using the orientated gradient information of the images. Then, a fast matching similarity metric based on the structural features and three-dimensional (3-D) phase correlation is established in the frequency domain, and a template matching scheme is employed to detect CPs between optical and SAR images. Finally, mismatches are removed based on the spatial geometric constraint relationship, and the transformation parameters for image registration are obtained by using the correct CPs.

METHODOLOGY
The flowchart of the proposed method is shown in Figure 1, which consists of: 1) feature point extraction; 2) local geometric correction; 3) structural feature extraction; 4) similarity metric based on 3-D phase correlation; 5) mismatch elimination and geometric transformation.

Feature Point Extraction
In order to extract evenly distributed feature points in the optical image, we use the block-based Harris corner detector (Ye and Shan, 2014). First, the optical image is divided into M×M non-overlapping blocks. Then, the Harris value of each pixel in every block is computed and ranked from largest to smallest, and the point with the largest value is selected as the feature point of that block. Figure 2 shows the feature points extracted by the traditional Harris corner detector and by the block-based Harris corner detector. It can be clearly observed that the block-based Harris corner detector makes the feature points evenly distributed over the image.
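The block strategy above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the corner constant k, window size and block count are illustrative defaults:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def harris_response(img, k=0.06, win=3):
    """Harris corner response R = det(M) - k * trace(M)^2 for a grayscale image."""
    gy, gx = np.gradient(img.astype(float))
    def box(a):
        # mean filter of the gradient products over a win x win neighbourhood
        pad = win // 2
        ap = np.pad(a, pad, mode='edge')
        return sliding_window_view(ap, (win, win)).mean(axis=(-2, -1))
    sxx, syy, sxy = box(gx * gx), box(gy * gy), box(gx * gy)
    return sxx * syy - sxy * sxy - k * (sxx + syy) ** 2

def block_harris_points(img, m=4):
    """Keep the single strongest Harris point in each of m x m non-overlapping
    blocks, so that the feature points are evenly distributed over the image."""
    r = harris_response(img)
    h, w = img.shape
    bh, bw = h // m, w // m
    points = []
    for i in range(m):
        for j in range(m):
            block = r[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            y, x = np.unravel_index(np.argmax(block), block.shape)
            points.append((i * bh + y, j * bw + x))
    return points
```

Because exactly one point is kept per block, a plain maximum over the whole image cannot concentrate all features in one textured region, which is the motivation for the block strategy.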

Local Geometric Correction
In order to eliminate the rotation and scale differences between the optical and SAR images, a local geometric correction is performed for an image patch centered on a feature point by using the RFM (Grodecki and Dial, 2003; Wang et al., 2017). The RFM relates the 3-D object-space ground coordinates (Lat, Lon, h) to the 2-D image-space coordinates (Line, Sample) through ratios of cubic polynomials. In order to improve numerical stability, the object coordinates and image coordinates are normalized to [-1, +1]. The RFM is expressed as follows:

r = P1(X, Y, Z) / P2(X, Y, Z)
c = P3(X, Y, Z) / P4(X, Y, Z)

where (r, c) are the normalized image coordinates, (X, Y, Z) are the normalized ground coordinates, and P1, P2, P3 and P4 are third-order polynomials of (X, Y, Z).
In the process of local geometric correction, an image patch centered on a feature point is first determined in the optical image, and a digital elevation model (DEM) is used as the elevation datum. Then, the RFM is used to perform the geometric rectification of the image patch. Figure 3 shows this process. We can see that the local correction can effectively eliminate the rotation and scale differences between optical and SAR images.
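The ratio-of-cubic-polynomials projection used by the RFM can be sketched as follows. This is a hedged illustration: the 20-term monomial ordering and the `rpc` dictionary keys are assumptions for the sketch, since real RPC files follow sensor-specific coefficient conventions:

```python
import numpy as np

def cubic_poly(coef, X, Y, Z):
    """Evaluate a third-order polynomial in (X, Y, Z) with 20 coefficients.
    NOTE: the term ordering below is illustrative; actual RPC products
    define their own fixed ordering."""
    terms = np.array([
        1.0, X, Y, Z,
        X * Y, X * Z, Y * Z, X * X, Y * Y, Z * Z,
        X * Y * Z, X ** 3, X * Y * Y, X * Z * Z, X * X * Y,
        Y ** 3, Y * Z * Z, X * X * Z, Y * Y * Z, Z ** 3,
    ])
    return float(np.dot(coef, terms))

def rfm_project(rpc, X, Y, Z):
    """Project normalized ground coordinates (X, Y, Z) to normalized image
    coordinates (r, c) via r = P1/P2, c = P3/P4 (hypothetical dict keys)."""
    r = cubic_poly(rpc['p1'], X, Y, Z) / cubic_poly(rpc['p2'], X, Y, Z)
    c = cubic_poly(rpc['p3'], X, Y, Z) / cubic_poly(rpc['p4'], X, Y, Z)
    return r, c
```

In the local correction step, each pixel of the patch is projected to the ground with the DEM elevation and re-sampled, so the corrected patch shares the SAR image's scale and orientation.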

Structural Feature Extraction
As mentioned above, the RFM can remove obvious geometric distortions (such as rotation and scale differences) between optical and SAR images. Accordingly, the main difficulty of image registration becomes the significant nonlinear radiometric differences between the optical and SAR images. Considering the similar geometric structural properties between them, we extract the structural properties of the images by using a robust structural feature descriptor named CFOG. The CFOG descriptor is an extension of the histogram of orientated gradients (HOG) (Dalal and Triggs, 2005); it captures the geometric structure features of images and is robust to nonlinear radiometric differences between them. Figure 4 shows the construction process of CFOG, which mainly includes two parts: orientated gradient channels and 3-D Gaussian convolution.

Orientated Gradient Channel:
Given an input template image I, we first compute m orientated gradient channels of the image, denoted g_o, 1 ≤ o ≤ m. The value g_o(x, y) is the gradient magnitude of quantized orientation o at (x, y) when that magnitude is larger than zero; otherwise its value is zero. The orientated gradient channel is defined as:

g_o = ⌊∂I/∂o⌋

where I is the image, o is the orientation of the derivative, and ⌊·⌋ denotes that the enclosed quantity is equal to itself when its value is positive, and zero otherwise. In practice, it is not necessary to compute the derivative separately for each orientation; the horizontal and vertical gradients (named g_x and g_y, respectively) can be used to compute the magnitude of each orientated gradient channel as follows:

g_θ = abs(cos θ · g_x + sin θ · g_y)

where θ is the quantized gradient orientation, and g_x and g_y are computed by the filters [-1, 0, 1] and [-1, 0, 1]^T with two 1-D convolutions, respectively. abs represents the absolute value, which can handle the intensity inversion between multimodal remote sensing images by folding the gradient orientation into the range [0°, 180°) (Yi et al., 2008).
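The channel computation described above can be sketched as follows (a minimal NumPy sketch under the stated conventions; the default number of orientations is illustrative):

```python
import numpy as np

def orientated_gradient_channels(img, n_orient=9):
    """Build orientated gradient channels from the horizontal and vertical
    gradients. The abs() folds orientations into [0, 180) so that gradient
    reversal between modalities (e.g. optical vs. SAR) gives the same response."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # [-1, 0, 1] filter (horizontal)
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # [-1, 0, 1]^T filter (vertical)
    channels = np.empty(img.shape + (n_orient,))
    for i in range(n_orient):
        theta = i * np.pi / n_orient         # quantized orientations in [0, pi)
        channels[..., i] = np.abs(np.cos(theta) * gx + np.sin(theta) * gy)
    return channels
```

Each pixel thus carries an m-element vector of non-negative orientation responses, which is the raw material for the 3-D smoothing step below.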

3-D Gaussian Convolution:
Once the orientated gradient channels are generated, the convolved feature channels g_o^c can be obtained by a convolution with a 3-D Gaussian-like kernel G_σ, as shown below:

g_o^c = G_σ ∗ g_o

where σ represents the standard deviation of the Gaussian convolution kernel. Strictly speaking, the kernel is not a 3-D Gaussian function in 3-D space, but a two-dimensional Gaussian kernel in the X- and Y-directions combined with a kernel of [1, 2, 1] in the gradient orientation direction (hereinafter referred to as the Z-direction). The gradients are smoothed by the convolution in the Z-direction, which reduces the influence of orientation distortions caused by local geometric deformation and radiometric variation between optical and SAR images. Finally, g_o^c is normalized to form the orientated gradient channel features. Figure 5 shows the two visualized 3-D CFOG representations of an optical and a SAR image, which are highly similar despite the significant radiometric differences between the images.
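The smoothing and normalization step can be sketched as follows. This is an illustrative sketch, not the reference implementation: σ is an assumed value, and the boundary handling along the orientation (Z) axis is simplified to zero padding rather than a cyclic wrap:

```python
import numpy as np

def cfog_smooth(channels, sigma=0.8):
    """Smooth a (H, W, n_orient) channel volume with a 2-D Gaussian in X/Y and
    a [1, 2, 1] kernel along the orientation (Z) axis, then L2-normalize the
    per-pixel feature vectors."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    g = np.exp(-t * t / (2 * sigma * sigma))
    g /= g.sum()

    def conv_axis(a, k, axis):
        # 1-D convolution applied along one axis (separable filtering)
        return np.apply_along_axis(lambda v: np.convolve(v, k, mode='same'), axis, a)

    out = conv_axis(channels, g, axis=0)                        # Gaussian in Y
    out = conv_axis(out, g, axis=1)                             # Gaussian in X
    out = conv_axis(out, np.array([1.0, 2.0, 1.0]) / 4, axis=2) # smooth orientations
    norm = np.sqrt((out * out).sum(axis=2, keepdims=True)) + 1e-10
    return out / norm
```

The Z-direction smoothing spreads each response into neighbouring orientation bins, which is what makes the descriptor tolerant of small orientation distortions between the two modalities.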

Similarity Metric Based on 3-D Phase Correlation
Since CFOG is a 3-D structural feature descriptor with a large data volume, image matching using traditional similarity metrics (such as NCC) in the spatial domain is time consuming. Accordingly, we build a fast similarity metric based on CFOG in the frequency domain. According to the shift property of the Fourier transform, a shift between two functions in the spatial domain corresponds to a linear phase difference in the frequency domain. 3-D phase correlation is used as the similarity metric because the CFOG descriptor is a 3-D feature volume. Let f1(x, y, z) and f2(x, y, z) be the feature volumes of the optical and SAR images, respectively. Since the scale and rotation differences have been eliminated by the local geometric correction, there is only a shift relationship between the two feature volumes, i.e., f1(x, y, z) = f2(x − x0, y − y0, z). According to the shift property of the Fourier transform, this relationship can be written as:

F1(u, v, w) = F2(u, v, w) · exp(−2πi(u x0 + v y0))

where F1 and F2 are the 3-D Fourier transforms of f1 and f2. Then, by computing the normalized cross-power spectrum, we have:

Q(u, v, w) = F1(u, v, w) F2(u, v, w)* / |F1(u, v, w) F2(u, v, w)*| = exp(−2πi(u x0 + v y0))

where * denotes the complex conjugate; the phase term is independent of w because there is no shift along the z-direction. Noting that the inverse Fourier transform of exp(−2πi(u x0 + v y0)) is a Dirichlet function centered at (x0, y0), we can determine (x0, y0) by finding the pulse peak of the inverse Fourier transform (Figure 6). Therefore, we can obtain the translation shift between the two feature volumes.
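The peak-finding procedure described above can be sketched as follows (a minimal NumPy sketch; it assumes a periodic shift between same-sized volumes, whereas the actual method applies this inside a template-matching search region):

```python
import numpy as np

def phase_correlation_3d(f1, f2):
    """Recover the integer shift (y0, x0) such that f1 is f2 shifted by
    (y0, x0) in the image plane, via the peak of the inverse FFT of the
    normalized cross-power spectrum of two (H, W, n_orient) volumes."""
    F1 = np.fft.fftn(f1)
    F2 = np.fft.fftn(f2)
    cross = F1 * np.conj(F2)
    q = cross / (np.abs(cross) + 1e-12)          # keep only the phase
    corr = np.real(np.fft.ifftn(q)).sum(axis=2)  # collapse the orientation axis
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape
    # Indices past the midpoint correspond to negative (wrapped) shifts.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return dy, dx
```

Because only the phase of the cross-power spectrum is kept, the correlation surface is a sharp pulse at the true offset rather than a broad correlation lobe, which is what makes the metric both fast and discriminative.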

EXPERIMENTS
In this section, we validate the performance of the proposed method on three sets of optical and SAR images (see Figure 7) by comparing it with two state-of-the-art intensity-based methods (i.e., NCC and MI). In order to make a fair comparison, both NCC and MI are applied to image registration with the local geometric correction process described in Section 2.2. The details of the test sets are listed in Table 1. In our experiments, the SAR images are used as the reference images because they have been ortho-rectified, and the optical images are used as the sensed images.

Evaluation Criteria and Parameter Setting
We analyse the performance of the proposed method by four indices: the number of correct matches (NCM), the correct matching ratio (CMR), the root-mean-square error (RMSE), and the running time (Fan et al., 2018). CMR denotes the matching precision, which can be written as CMR = NCM/NM, where NCM and NM denote the number of correct matches and the total number of matches, respectively. RMSE is calculated from manually selected check points between the images. In the template matching, the sensed image was first divided into 25 × 20 non-overlapping blocks. Then one Harris feature point was extracted in each block, for a total of 500 feature points. The template window size and search region size are set to 121×121 and 200×200 pixels, respectively. The experiments are performed on a computer with an Intel(R) Xeon(R) E5-1603 2.8-GHz processor and 8.0 GB of physical memory, using Microsoft Visual C++ with the Geospatial Data Abstraction Library (GDAL).
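The two accuracy indices can be computed as follows (a small sketch; the check-point pair format is an assumption for illustration):

```python
import math

def cmr(ncm, nm):
    """Correct matching ratio: correct matches over total matches."""
    return ncm / nm

def rmse(check_points):
    """Root-mean-square error over check points given as pairs
    ((x_ref, y_ref), (x_transformed, y_transformed)) after applying
    the estimated transformation to the sensed-image points."""
    sq = [(xr - xs) ** 2 + (yr - ys) ** 2
          for (xr, yr), (xs, ys) in check_points]
    return math.sqrt(sum(sq) / len(sq))
```

For example, 25 correct matches out of 100 candidates give a CMR of 0.25, and a single check point displaced by (3, 4) pixels gives an RMSE of 5 pixels.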

Result and Analysis
The registration results of the three datasets obtained by the proposed method are shown in Figures 8, 9 and 10, and Table 2 lists the detailed performance of the different methods on the three datasets. It can be clearly observed that NCC fails in the matching of all three image datasets, with the highest CMR being only 23.22%. The main reason is that NCC is vulnerable to the nonlinear radiometric differences between SAR and optical images (Ye and Shan, 2014). Compared with NCC, the performance of MI improves to some extent. Although its RMSEs and CMRs are acceptable, its running times exceed 4000 seconds, which is quite time consuming. From Table 2, it is obvious that the proposed method outperforms the two methods on the three datasets in terms of NCM, CMR, RMSE and running time. This is because the proposed method builds the descriptor from the structural features of the SAR and optical images, which makes it more robust to nonlinear radiometric differences. Moreover, the proposed method speeds up image matching by using 3-D phase correlation in the frequency domain. On the other hand, due to the different characteristics of the datasets, the CMRs of dataset 2 and dataset 3 decline compared with dataset 1. The reason may be that the temporal difference of dataset 2 is fourteen months, which has a significant impact on the NM and RMSE due to changes of the ground objects. Since the images of dataset 3 contain large mountainous areas, the significant local distortions caused by large terrain relief not only make image matching difficult, but also prevent the RFM from accurately fitting the local distortions, resulting in degraded registration performance. In summary, the proposed method performs better than MI and NCC for the registration of SAR and optical images.

CONCLUSIONS
In order to address the difficulties caused by the significant nonlinear radiometric differences and geometric deformation between optical and SAR images, this paper proposes an automatic registration method based on a spatial geometric constraint and structure features. In the proposed method, a local geometric correction is carried out by using the RFM, and a geometric structural feature representation is constructed by using the CFOG descriptor. Then, the feature descriptor is transformed into the frequency domain, and 3-D phase correlation is used as the similarity metric for image matching. We have evaluated the proposed method on multiple high-resolution optical and SAR images in various scenarios. Compared with the state-of-the-art intensity-based methods (i.e., MI and NCC), the proposed method presents more satisfactory performance in terms of NCM, CMR and computational efficiency. This indicates that the proposed method can effectively handle the radiometric differences and geometric deformation between optical and SAR images.
Since the proposed method is based on a spatial geometric constraint relationship and the structural features of images, some limitations should be considered. A precondition of the proposed method is that the images require a local geometric correction based on the RFM, which eliminates the rotation and scale differences between optical and SAR images. In addition, the registration performance of the proposed method may degrade when the images lack distinct geometric features.