AUTOMATIC MRF-BASED REGISTRATION OF HIGH RESOLUTION SATELLITE VIDEO DATA

In this paper we propose a deformable registration framework for high resolution satellite video data, able to automatically and accurately co-register satellite video frames and/or register them to a reference map/image. The proposed approach performs non-rigid registration formulated as a Markov Random Fields (MRF) model, while efficient linear programming is employed for minimizing the resulting cost function. The developed approach has been applied and validated on satellite video sequences from Skybox Imaging and compared with a rigid, descriptor-based registration method. Regarding computational performance, both the MRF-based and the descriptor-based methods were quite efficient, with the former converging in a few minutes and the latter in a few seconds. Regarding registration accuracy, the proposed MRF-based method significantly outperformed the descriptor-based one in all the performed experiments.


INTRODUCTION
Currently, the remote sensing community is expecting during the coming years a paradigm shift from sparse multi-temporal to every-day monitoring of the entire planet, mainly through microsatellites at a spatial resolution of a few meters or centimeters (in the raster world), but also through other cutting-edge technology including hyperspectral sensors and UAVs. Moreover, apart from the standard imaging products, video streaming from earth observation satellites significantly expands the variety of applications that can be addressed.
In particular, high resolution satellite video sequences [Murthy et al., 2014, d'Angelo et al., 2014, Kopsiaftis and Karantzalos, 2015] have become available and enrich the existing geospatial data and products. Skybox Imaging 1 and Urthecast 2 are already providing high resolution video datasets with a spatial/temporal resolution of approximately 1 meter and 30 frames per second. However, due to the continuous movement of the satellite platform, the acquired frames are not registered to each other. Moreover, in order to combine and fuse information from other geospatial data and imagery for any application or analysis, their registration to a local/national geo-reference system is required. Therefore, the automated co-registration of video frames and/or their registration to a reference image/map is still an open issue.
The problem of image registration has been heavily studied and numerous approaches have been proposed [Zitova and Flusser, 2003, Sotiras et al., 2013]. The methods fall into two main categories depending on the employed model, i.e., rigid and non-rigid (deformable) ones. The first category consists of descriptor-based methods, which automatically detect and match points in the pair of images and then define a global transformation to register them. A variety of descriptors, such as SIFT [Lowe, 2004], ASIFT [Morel and Yu, 2009], SURF [Bay et al., 2008], DAISY [Tola et al., 2010], FREAK [Alahi et al., 2012], etc., have been employed for a plethora of applications like face recognition, object identification, motion tracking and satellite imagery. Under such a framework, one million satellite RGB images were registered by Planet Labs3 in just one day [Price, 2015]. The second category contains non-linear registration methods. A similarity function is used to calculate the similarity of each pixel (from the first image) to a neighbourhood of pixels in the other image and to find the best displacement which recovers the geometry. Methods of this kind have been widely used in computer vision and medical imaging [Sotiras et al., 2013], while recently they were validated on very high resolution satellite data [Karantzalos et al., 2014], delivering high accuracy rates for both optical and multimodal data.

Figure 1: The developed methodology manages to co-register the acquired video frames. Unregistered frames (left), registered frames after the application of the developed method (right). Data are from Skybox Imaging (Terra Bella).
In this paper, an MRF-based registration framework is proposed for the co-registration of satellite video frames and/or their registration to a reference map/image (Figure 1). In particular, the developed method calculates a deformation map, while certain similarity functions (e.g., normalised cross correlation, mutual information, sum of absolute differences, etc.) were employed for calculating the displacement of every pixel. An energy formulation through an MRF model was defined and its minimization was performed using linear programming. The methodology was applied and validated on Skybox Imaging data and certain corresponding reference images (Table 1). Experimental results were compared with the ones obtained from a descriptor-based technique [Price, 2015] which is based on a rigid registration framework using the STAR [Agrawal et al., 2008] and FREAK [Alahi et al., 2012] algorithms for establishing and matching correspondences. These correspondences were used for defining the homography transformation parameters and registering the pair of images. Both methods have been quantitatively and qualitatively evaluated based on manually collected ground control points (GCPs).

Image Registration
Let us denote a pair of images, with It : Ω → R^2 the reference/target image and Is : Ω → R^2 the source image to be registered. The goal of registration is to define a transformation T : Ω → R^2 which projects the source onto the target in the image pair.
For the rigid registration, the displacement of each pixel in the image is calculated using the same transformation parameters.
On the other hand, for non-rigid registration the displacement of every pixel is calculated independently, subject only to certain local smoothness constraints defined by the model. Regarding the co-registration of satellite video frames, in our experiments the reference image corresponds to the first frame of the video sequence.

Rigid, descriptor-based registration
The most commonly used approach is based on rigid registration [Le Moigne et al., 2011, Vakalopoulou and Karantzalos, 2014, Price, 2015] and calculates a global transformation for each image pair. The framework has four main components: i) the keypoint detector, which detects and stores the position of every keypoint in each image; ii) the keypoint descriptor, which encodes the characteristics of the keypoints so that they can be compared; iii) the matcher, which matches the keypoints between the source and target images; and finally, iv) the image transformation method, which calculates the parameters of the transformation based on the established correspondences.
For the evaluation of the proposed MRF-based approach, the rigid registration method employed here is based on the recently proposed approach in [Price, 2015], including: as keypoint detector, the Star Detector (STAR), based on Center Surround Extremas (CenSurE) [Agrawal et al., 2008]; as keypoint descriptor, the Fast Retina Keypoint algorithm (FREAK) [Alahi et al., 2012]; and as matcher, the brute-force matcher (BFMatcher). Last but not least, the transformation used to register the source image to the target/reference was a homography.
Generally speaking, the STAR algorithm detects numerous keypoints in each frame. Since consecutive frames do not change much, many correspondences between the two frames were established. In order to reduce the outliers, the RANSAC [Fischler and Bolles, 1981] algorithm was used with a reprojection threshold of one pixel. Additionally, false correspondences were removed using a filter that allowed only matches below a specified threshold to participate in the transformation. The threshold was set to a fraction of the maximum distance between the matches. In all our experiments, only those matches with a distance less than or equal to 65 percent of the maximum distance participated in the formulation of the transformation.
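The 65-percent distance filter described above can be sketched as follows. This is a minimal NumPy illustration with names of our own choosing; in practice the distances would come from the `.distance` attribute of OpenCV BFMatcher match objects.

```python
import numpy as np

def filter_matches(distances, frac=0.65):
    """Keep only matches whose descriptor distance is at most `frac`
    times the maximum distance among all putative matches, as in the text.
    Returns a boolean mask over the matches."""
    distances = np.asarray(distances, dtype=float)
    threshold = frac * distances.max()
    return distances <= threshold

# Hypothetical distances; 0.9 and 1.0 exceed 0.65 * 1.0 and are discarded.
mask = filter_matches([0.2, 0.5, 0.9, 1.0, 0.6])
```

In a full pipeline this filter would be applied before RANSAC, so that the homography is estimated only from the most reliable correspondences.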
The homography parameters are defined by minimizing the following reprojection error (Equation 2):

E(H) = Σi [ (x'i − (h11 xi + h12 yi + h13)/(h31 xi + h32 yi + h33))^2 + (y'i − (h21 xi + h22 yi + h23)/(h31 xi + h32 yi + h33))^2 ]    (2)

where h11, h12, h13, h21, h22, h23, h31, h32, h33 are the homography parameters, xi, yi are the coordinates of keypoint i in the reference image and x'i, y'i are the coordinates of keypoint i in the source image.
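As a self-contained illustration of homography estimation from keypoint correspondences, the sketch below uses the classic Direct Linear Transform (DLT), which minimises an algebraic error; OpenCV's findHomography, used in the paper's pipeline, refines such an estimate towards the geometric error of Equation 2. Function names here are ours.

```python
import numpy as np

def estimate_homography(src_pts, dst_pts):
    """Direct Linear Transform: estimate the 3x3 homography H mapping
    src_pts to dst_pts (at least 4 non-collinear correspondences).
    The solution is the null vector of the stacked constraint matrix."""
    A = []
    for (x, y), (xp, yp) in zip(src_pts, dst_pts):
        A.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        A.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]          # normalise so h33 = 1

def apply_homography(H, pt):
    """Project a 2D point through H using homogeneous coordinates."""
    v = H @ np.array([pt[0], pt[1], 1.0])
    return v[:2] / v[2]
```

For a pure translation, e.g. four corners of the unit square shifted by (2, 3), the recovered H is (up to numerical precision) the identity with the translation in its last column.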

The proposed MRF-based satellite video registration framework
The proposed approach is based on deformable registration using different similarity metrics. An MRF model was defined and the solution minimizes the following energy function (Equation 3) [Glocker et al., 2011]. The label space for the model contains all the possible displacements (d^1, ..., d^n), so that a label lp corresponds to a displacement d^lp. A graph was superimposed on the target frame, and each node was connected to a neighbourhood of pixels using an interpolation function η(.). The total energy was formulated as below:

E(l) = Σ_{p∈G} Vp(lp) + λ Σ_{p∈G} Σ_{q∈N(p)} Vpq(lp, lq)    (3)

where p, q are nodes in the graph G, N(p) is the neighbourhood of node p, Vp is the unary term, Vpq is the pairwise term and λ is the weight which controls the contribution of the pairwise term in the energy minimization.
The unary term is formulated as follows:

Vp(lp) = ∫_Ω η(|x − p|) ρ(It(x), Is(x + d^lp)) dx

where ρ() is the similarity function used (normalised cross correlation, mutual information, etc.) and η is the interpolation function, which connects the pixels with the nodes of the grid (and vice versa) with a weight proportional to their distance. A typical example of such an interpolation function is cubic B-splines, which is the one employed here.
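The role of the unary term can be illustrated with a minimal NumPy sketch: for one grid node, every candidate displacement label is scored by the normalised cross correlation between the target patch around the node and the correspondingly displaced source patch. This is an illustrative simplification with names of our own; the η() weighting over the node's support and boundary handling are omitted.

```python
import numpy as np

def ncc(a, b):
    """Normalised cross correlation between two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def best_displacement(target, source, node, labels, r=2):
    """Score every candidate displacement label (dy, dx) for one node and
    return the label whose displaced source patch best matches the target
    patch around the node (highest NCC)."""
    y, x = node
    ref = target[y - r:y + r + 1, x - r:x + r + 1]
    scores = []
    for dy, dx in labels:
        patch = source[y + dy - r:y + dy + r + 1, x + dx - r:x + dx + r + 1]
        scores.append(ncc(ref, patch))
    return labels[int(np.argmax(scores))]
```

When the source is an exactly shifted copy of the target, the label equal to the true shift attains NCC = 1 and is selected; in the full MRF model this per-node evidence is balanced against the pairwise smoothness term.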
The pairwise term Vpq penalises neighbouring nodes with different displacement labels, in proportion to the difference of their displacements:

Vpq(lp, lq) = |d^lp − d^lq|

Table 1: The satellite video datasets that were employed for the validation of the developed registration framework.

IMPLEMENTATION
The formulation follows a multiscale approach with respect to both the image and the graph, meaning that the energy was calculated at different levels of the grid and of the image. Concerning the grid levels, a sparse grid was initially implemented and, as the grid level increased, the grid became denser and denser. At each level a number of iterations was performed in order to calculate the minimum energy. At the different grid levels the source image was transformed and updated, so that at the next level it was closer to the target one. This way the label space for the displacements was also changing at each grid level, getting closer to the optimal. Finally, for the different image levels a subsampling of the image was performed for lower computational complexity.
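The coarse-to-fine loop described above can be sketched as follows. This is a structural skeleton only, with names of our own: `register_level` stands in for one MRF optimisation pass over the current grid (here a no-op placeholder), since the actual optimisation relies on the linear-programming solver.

```python
import numpy as np

def multiscale_register(target, source, image_levels=2, grid_levels=3,
                        iterations=5, register_level=None):
    """Skeleton of the multiscale scheme: iterate over image levels
    (coarse to fine, via subsampling) and, within each, over grid levels
    (sparse to dense), running several optimisation iterations per level.
    `register_level` is a hypothetical hook returning a displacement field;
    the default no-op returns zeros, so the source is returned unchanged."""
    if register_level is None:
        register_level = lambda t, s, level, it: np.zeros(t.shape + (2,))
    warped = source.copy()
    for img_level in range(image_levels - 1, -1, -1):
        step = 2 ** img_level                    # subsample for less computation
        t, s = target[::step, ::step], warped[::step, ::step]
        for grid_level in range(grid_levels):    # sparse -> dense control grid
            for it in range(iterations):
                field = register_level(t, s, grid_level, it)
            # in the full method the source is warped/updated here, and the
            # label space is rescaled (e.g. by 0.8) for the next grid level
    return warped
```

The key design point is that each level only needs to recover the residual displacement left by the previous, coarser level, which keeps the label space per level small.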
For the Burj Khalifa4 Skybox video dataset the set of parameters was defined as follows. The node distance was set to 10 pixels, the grid levels to 3 and the image levels to 2, with 5 iterations at each level. The label space at each grid level changed to 0.8 times that of the previous one. Normalized Cross Correlation (NCC) was used as the similarity function, which, according to the literature, performed better than other functions [Karantzalos et al., 2014] for the registration of remote sensing data. Finally, the lambda parameter was set to 40, the sampling steps to 25 and cubic B-splines were used as the interpolation function. All the parameters were tuned via grid search.
Using the above set of parameters, a co-registration within smaller groups was initially performed and then all groups were registered to the first frame. In particular, three groups with a lower number of frames, and thus smaller displacements, were formed, i.e., one every 300 frames. The registration of each group was performed using as target image the 1st, 300th and 600th frame, respectively. Then all groups were registered to the first frame.
For the two Las Vegas Skybox video datasets, the configuration consisted, as in the previous case, of a node distance of 10 pixels, 3 grid levels and 2 image levels. Moreover, the number of iterations was set to 15, the sampling steps to 65, lambda was set to 15 and the label space to 0.67 times the previous one at each grid level. The similarity function and the interpolation method were the same as for the Burj Khalifa sequence. Again the registration was performed first in groups; in particular, for the Las Vegas5 dataset the grouping was every 300 frames and for the Las Vegas-night6 video dataset every 150 frames.

EXPERIMENTAL RESULTS AND EVALUATION
The proposed MRF-based methodology was evaluated both qualitatively and quantitatively. For the quantitative evaluation a number of manually collected GCPs were selected. It is important to note that, for the descriptor-based approach, a set of fixed parameters did not perform well for all the video frames, since even the smallest shift between the frames affected the keypoint detection and, consequently, the registration accuracy. For this reason, the parameters were tuned for each pair of frames using grid search. This was the main drawback of the descriptor-based framework: even though the multithreaded implementation in OpenCV [Culjak et al., 2012] requires two to three seconds per image pair, the tuning of the parameters required significantly more time.
The experiments included satellite video sequences of Burj Khalifa, Las Vegas and Las Vegas Night (Table 1) from Skybox Imaging. The main challenges for the registration of the video datasets were the relatively tall buildings, their shadows and any moving objects (e.g., airplanes). In particular, the different angles of the sun and of the satellite acquisition affect the geometry of terrain objects and of their corresponding shadows.
For the quantitative evaluation, the results after the application of both registration methods are presented in Table 2. In all cases the proposed MRF-based approach outperformed the descriptor-based one and managed to register all the different frames with a mean displacement error of less than 1.5 pixels. These errors correspond to the overall registration error across all frames, since they were calculated between the first and last frame of the video dataset. The higher registration errors resulting from the descriptor-based approach, along with the fact that these errors were not equally distributed over the image plane, indicated a significantly lower performance than that of the proposed MRF-based approach.
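The mean displacement error reported in Tables 2 and 3 can be computed from the GCPs as the mean Euclidean distance between corresponding point positions; a minimal sketch (names ours):

```python
import numpy as np

def mean_displacement_error(gcp_ref, gcp_reg):
    """Mean Euclidean displacement, in pixels, between manually collected
    GCP positions in the reference frame (gcp_ref) and the corresponding
    positions in the registered frame (gcp_reg)."""
    d = np.asarray(gcp_ref, dtype=float) - np.asarray(gcp_reg, dtype=float)
    return float(np.mean(np.linalg.norm(d, axis=1)))

# Two hypothetical GCPs off by 1 px and 2 px give a mean error of 1.5 px.
err = mean_displacement_error([(10, 10), (20, 20)], [(11, 10), (20, 22)])
```

Per-axis errors, as reported in the paper's tables, would instead average the absolute differences of the x and y coordinates separately.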
Moreover, the registration of the Burj Khalifa dataset to a Google Earth image mosaic was performed using the proposed MRF-based approach. Quantitative results are quite promising, with mean displacement errors of less than 1.6 pixels (Table 3).
For the qualitative evaluation, different checkerboard visualisations are presented in Figures 2, 3, 4 and 5, along with zoom-ins at selected sub-regions. Each checkerboard visualisation is a blend of the first and last frame of the unregistered and registered datasets. After a closer look at the areas marked in red, one can observe that the unregistered data possessed large initial displacements. In particular, in Figure 2 one can observe quite large displacements between the different frames, with significant spatial discontinuities in roads, bridges and buildings (e.g., inside the red circles). The MRF-based registration recovered the geometry and managed to register the video frames accurately.
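A checkerboard blend of the kind used in these figures alternates tiles from the two frames, so that any residual misalignment appears as discontinuities at the tile borders; a minimal sketch (names ours):

```python
import numpy as np

def checkerboard(img_a, img_b, tile=50):
    """Blend two equally sized frames into a checkerboard mosaic:
    even tiles come from img_a, odd tiles from img_b. Works for both
    grayscale (H, W) and colour (H, W, C) arrays."""
    h, w = img_a.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    mask = (yy // tile + xx // tile) % 2 == 0
    if img_a.ndim == 3:
        mask = mask[..., None]               # broadcast over channels
    return np.where(mask, img_a, img_b)
```

Applied to the first and last frame of a sequence before and after registration, misaligned edges in the "before" mosaic should disappear in the "after" one.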

Mean Displacement Errors (in pixels)

In all cases the developed approach managed to register the satellite video frames with a mean displacement error of less than 1.5 pixels. As expected, the image regions with the most mis-registration errors were those with significant relief displacements, i.e., with very tall man-made objects, buildings and skyscrapers.
Results for two other datasets are presented with a chessboard visualisation in Figures 3 and 4. Once again, a closer look demonstrates the robustness of the proposed approach towards recovering the scene's (frame's) geometry. Moreover, Figures 2 and 4 depict the same region at different acquisition times. Even though the satellite video dataset in Figure 4 was acquired during the night, the proposed MRF-based method performed significantly well, resulting in an overall mean displacement error of less than one pixel along both axes (Table 2).
In order to qualitatively compare the results of the proposed MRF-based approach with the descriptor-based one, results on the same datasets after the application of the descriptor-based method are presented in Figure 5. Although a large number of correspondences had been established, the rigid nature of the transformation could not recover the scene's geometry adequately.

CONCLUSION
In this paper an MRF-based registration approach was developed for the accurate co-registration of satellite video frames, as well as for the registration of the video dataset to a reference map/image. The method was applied and validated on satellite video data from Skybox Imaging and compared with a standard descriptor-based registration framework. Experimental results indicate the great potential of the proposed approach, which managed to recover the geometry in all cases with registration errors of less than 1.5 pixels along both the x and y axes.

Figure 2: Chessboard visualizations from the Las Vegas Skybox dataset. Frames from the unregistered dataset (a) and frames after the registration process (b) are shown in the first two rows. Zoom-in areas are shown in the third row for the unregistered (c) and registered (d) frames.

Figure 3: Chessboard visualization from the Burj Khalifa video dataset. Unregistered (left) and registered (right) data before and after the application of the proposed methodology.

Figure 4: Chessboard visualization from the Las Vegas-night video dataset. Unregistered (left) and registered (right) data before and after the application of the proposed methodology.

Table 2: Quantitative evaluation results after the application of the proposed MRF-based registration method.

Table 3: Quantitative evaluation results after the registration of the Burj Khalifa satellite video dataset to an image mosaic acquired from Google Earth.