GUIDED FEATURE MATCHING FOR MULTI-EPOCH HISTORICAL IMAGE BLOCKS POSE ESTIMATION

Historical aerial imagery plays an important role in providing unique information about the evolution of our landscapes. It possesses many positive qualities, such as high spatial resolution, stereoscopic configuration and short time intervals between acquisitions. Self-calibration remains a main bottleneck in unlocking the intrinsic value of historical imagery, as it involves certain underdeveloped research points such as the detection of inter-epoch tie-points. In this research, we present a novel algorithm for detecting inter-epoch tie-points in historical images that does not rely on any auxiliary data. Using SIFT-detected keypoints, we perform matching across epochs by interchangeably estimating and imposing two mathematical models on the points: first a 2D spatial similarity, then a 3D spatial similarity. We import GCPs to quantitatively evaluate our results with DEMs of Differences (DoD) in an absolute reference frame, and compare our method with two other methods that use either the traditional SIFT pipeline or a few virtual GCPs. The experiments show that far more correct inter-epoch tie-points can be extracted with our guided technique. Qualitative and quantitative results are reported.


INTRODUCTION
Historical (i.e. analogue) aerial imagery plays an important role in providing unique information about the evolution of our landscapes. Such imagery has been acquired by many countries all over the world and can be traced back to the beginning of the 20th century (Cowley, Stichelbaut, 2012). In many cases high spatial resolution stereoscopic acquisitions were performed (Giordano, Mallet, 2019), allowing for 3D restitution of territories. Coupled with high temporal sampling, the imagery also becomes suitable for long-term environmental studies. As of now, many historical image collections have been digitized (Giordano, Mallet, 2019; USGS, 2019; IGN, 2019) and a number of attempts have been made to process the images with the state-of-the-art automated methods developed in the photogrammetry and computer vision communities over the last twenty years (Ayoub et al., 2009; Bakker, Lane, 2017; Chen, Tseng, 2016; Ellis et al., 2006; Feurer, Vinatier, 2018).
Apart from the imagery itself, corresponding metadata is sometimes archived too. For instance, the camera focal length and the physical size of the images are commonly found, whereas flight plans, camera calibration certificates or orientations are rarely available. Because the images were scanned, and prior to scanning they may have been kept in unsuitable conditions, a self-calibration is always necessary.
Historical images are self-calibrated in a bundle block adjustment (BBA) routine, using tie-points and Ground Control Points (GCPs) as input observations. When dealing with single-epoch datasets, tie-points are efficiently recovered with modern extraction algorithms, such as SIFT (Lowe, 2004), while GCPs originate from (i) field surveys (Micheletti et al., 2015; Walstra et al., 2004; Cardenal et al., 2006), (ii) existing orthophotos and DEM (Nurminen et al., 2015; Ellis et al., 2006; Fox, Cziferszky, 2008), (iii) recent satellite images (Ellis et al., 2006; Ford, 2013). In many scenarios none of this auxiliary data is at our disposal. When processing multi-epoch image blocks, ideally a joint self-calibrating bundle adjustment is carried out, including inter-epoch image observations. However, extracting inter-epoch tie-points remains challenging because the scene changes over time, and the images are often of low radiometric quality, violating the brightness constancy rule that all modern extraction algorithms rely upon.
In this work we propose a new approach to detecting inter-epoch tie-points. Using SIFT-detected keypoints, we perform matching across epochs by interchangeably estimating and imposing two mathematical models on the points: first a 2D spatial similarity, then a 3D spatial similarity. We do not rely on any auxiliary data, making the approach generic. We adopt GCPs only to quantitatively evaluate our results. In the following we briefly describe existing feature matching methods, as well as recent approaches to the georeferencing of historical images. In Section 3 we introduce our methodology, and in Section 4 the experiments as well as results are given. The terms features and tie-points are considered synonymous and used interchangeably throughout this publication.

RELATED WORK

Feature Extraction
Feature extraction refers to finding discriminative structures in images, such as corners, blobs and edges, and is followed by a matching step. According to whether machine learning techniques are applied, image features can be categorized as hand-crafted or learned.

Hand-crafted Features
In the early stage, Moravec detected corner features by measuring the sum-of-squared-differences (SSD) obtained by applying a small shift in a number of directions to the patch around a candidate feature (Moravec, 1980). Building on this, Harris computes an approximation to the second derivative of the SSD with respect to the shift (Harris, Stephens, 1988). Since both Moravec and Harris are sensitive to changes in image scale, algorithms invariant to scale and affine transformations based on Harris were presented (Mikolajczyk, Schmid, 2004). Beyond corner features, SIFT (Scale-invariant feature transform) (Lowe, 2004) detects blob features in scale-space and constitutes an entire pipeline including detection and description. It uses a difference-of-Gaussian function to identify potential feature points that are invariant to scale and orientation. SIFT is a milestone among hand-crafted features, and remains comparable with machine learning alternatives. RootSIFT (Arandjelović, Zisserman, 2012) uses a square-root (Hellinger) kernel instead of the standard Euclidean distance to measure the similarity between SIFT descriptors, which leads to a dramatic performance boost. Similar to SIFT, SURF (Bay et al., 2006) resorts to integral images and Haar filters to extract blob features in a computationally efficient way. DAISY (Tola et al., 2009) is a local image descriptor which uses convolutions of gradients in specific directions with several Gaussian filters, making it very efficient to extract dense descriptors. KAZE (Alcantarilla et al., 2012) detects and describes multi-scale 2D features in nonlinear scale spaces. AKAZE (Alcantarilla et al., 2013) is an accelerated version of KAZE.
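The RootSIFT conversion mentioned above is simple to reproduce: since the Hellinger kernel on L1-normalized histograms equals the Euclidean inner product of their element-wise square roots, existing SIFT descriptors can be converted in place. A minimal NumPy sketch (function name and epsilon value are our own):

```python
import numpy as np

def root_sift(descriptors, eps=1e-7):
    """Convert SIFT descriptors to RootSIFT (Arandjelovic & Zisserman, 2012).

    L1-normalize each descriptor, then take the element-wise square root,
    so that Euclidean distance on the result corresponds to the Hellinger
    kernel on the original descriptors.
    """
    d = np.asarray(descriptors, dtype=np.float64)
    d = d / (np.abs(d).sum(axis=1, keepdims=True) + eps)  # L1 normalization
    return np.sqrt(d)
```

After this transform the converted descriptors are approximately L2-normalized, so any standard Euclidean matcher can be used unchanged.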

Learned Features
With the rise of machine learning, learned features have shown their feasibility in the image matching problem when enough ground truth data is available. FAST (Rosten, Drummond, 2006) uses a decision tree to speed up the process of finding corner features. LIFT (Learned Invariant Feature Transform) is a deep network architecture that implements a full pipeline including detection, orientation estimation and feature description. It is based on the previous work TILDE (Verdie et al., 2015), the method of (Moo Yi et al., 2016) and DeepDesc (Simo-Serra et al., 2015). SuperPoint (DeTone et al., 2018) is a self-supervised, fully-convolutional model that operates on full-sized images and jointly computes pixel-level feature point locations and associated descriptors in one forward pass. LF-Net (Ono et al., 2018) is a deep architecture that embeds the entire feature extraction pipeline and can be trained end-to-end with just a collection of images. Finally, D2-Net (Dusmanu et al., 2019) is a single convolutional neural network that works as a feature detector and a dense feature descriptor simultaneously.
Even though learned features demonstrate better performance than hand-crafted features on certain benchmarks, this does not necessarily imply better performance in subsequent processing steps. For example, in the context of Structure from Motion (SfM), finding additional correspondences for image pairs where SIFT already provides enough matches does not necessarily result in more accurate or complete reconstructions (Schonberger et al., 2017).

Geo-referencing of Historical Images
Pose estimation recovers the intrinsic and extrinsic parameters of an image and is classically solved with SfM algorithms (Snavely et al., 2006; Pierrot-Deseilligny, Cléry, 2012; Schonberger, Frahm, 2016). If the image poses are to be known in a reference coordinate system, i.e. georeferenced, a 7-parameter transformation followed by an a posteriori bundle block adjustment must be carried out.
Unlike in modern images, where the image coordinate system coincides with the planimetric camera coordinate system, in historical images this correspondence is not maintained due to the scanning procedure. To account for this, an additional 2D transformation is estimated in the course of the processing (McGlone, 2013). (Giordano et al., 2018) demonstrates the importance of the self-calibration of historical images and its impact on 3D accuracy. Poorly modelled intrinsic parameters result in the known dome-like deformations, which the authors eliminate with the help of automatic GCPs from existing orthophotos.
Many cases reported in the literature calculate the image poses with SfM using features matched exclusively within the same epoch (Nurminen et al., 2015;Cardenal et al., 2006;Fox, Cziferszky, 2008;Walstra et al., 2004). In multi-epoch scenarios, the individual epochs should be defined in a common frame, be it the frame of a reference epoch or an absolute reference frame (i.e., a projection coordinate frame). Control points derived from recent orthophotos and DEM (Nurminen et al., 2015;Ellis et al., 2006;Fox, Cziferszky, 2008) or GPS survey (Micheletti et al., 2015;Walstra et al., 2004;Cardenal et al., 2006) serve to transform the individual epochs to the common frame. Alternatively, a coarse flight plan may provide an approximate co-registration (Giordano et al., 2018).
To increase relative accuracy between several epochs, (Cardenal et al., 2006; Korpela, 2006; Micheletti et al., 2015) manually insert inter-epoch tie-points. (Giordano et al., 2018) extracts tie-points between analogue images and recent images based on the method of (Aubry et al., 2014). (Feurer, Vinatier, 2018) joins multi-epoch images in a single SfM block based on a SIFT-like algorithm (Lowe, 2004; Semyonov, 2011) by making the assumption that a sufficient number of feature points remain invariant across each time period. Identifying permanent points over a long time-span has not reached maturity and research in this domain remains scarce. Most of the mentioned approaches rely on manual effort, auxiliary input or limiting hypotheses.

METHODOLOGY
Figure 1 exhibits the workflow of the algorithm proposed in this paper. It can be divided into two main parts: intra-epoch processing and inter-epoch processing. The images are resampled to the geometry of the fiducial marks prior to processing. For the sake of simplicity we only exhibit the processing flow of two epochs; however, it can easily be extended to more epochs. In the remainder of this section, both intra-epoch and inter-epoch processing are explained in detail.

Intra-epoch Processing
We process each single-epoch dataset individually as follows:

Inter-epoch Processing
Using images of all epochs, we now extract inter-epoch tie-points. We propose the following 3-step guided matching strategy to increase the robustness of the automated tie-points. Get tentative inter-epoch tie-points: extracted with SIFT and matched based on mutual nearest neighbours (i.e., Euclidean distance) between respective descriptors. To increase computational efficiency and discard noise, features are extracted on images down-sampled by a factor of 1/6. No ratio test (Lowe, 2004) was applied, since very few tie-points were otherwise retained. This obviously results in a higher number of false matches, but as these are only our tentative matches, we accept that. When choosing which feature extraction algorithm to employ, we hesitated between SIFT, the state-of-the-art algorithm for hand-crafted features, and D2-Net, the state-of-the-art in learned features. We chose SIFT based on a conducted experiment (see Section 4.3) where it outperformed D2-Net.
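The mutual nearest-neighbour criterion used for the tentative matches can be sketched as follows, assuming descriptors are given as NumPy row matrices (function name is illustrative):

```python
import numpy as np

def mutual_nn_matches(desc1, desc2):
    """Mutual nearest-neighbour matching on descriptor matrices.

    desc1: (n1, d), desc2: (n2, d). Returns index pairs (i, j) such that
    desc2[j] is the Euclidean nearest neighbour of desc1[i] and vice versa.
    """
    # Pairwise squared Euclidean distances via the expansion of ||a - b||^2.
    d2 = (np.square(desc1).sum(1)[:, None]
          + np.square(desc2).sum(1)[None, :]
          - 2.0 * desc1 @ desc2.T)
    nn12 = d2.argmin(axis=1)       # best match in image 2 for each feature of image 1
    nn21 = d2.argmin(axis=0)       # best match in image 1 for each feature of image 2
    ids1 = np.arange(len(desc1))
    keep = nn21[nn12] == ids1      # keep only mutually consistent pairs
    return np.stack([ids1[keep], nn12[keep]], axis=1)
```

One-sided nearest-neighbour matching would retain many spurious pairs; the mutual check is a cheap symmetry filter that needs no distance threshold.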
Get enhanced inter-epoch tie-points: out of the set of all tentative matches we now want to identify and eliminate the false matches. We introduce a 2D similarity transformation to describe the relation between feature points in a pair of inter-epoch images:

x′ = λ(x cos θ − y sin θ) + ∆x
y′ = λ(x sin θ + y cos θ) + ∆y (1)

This assumption is satisfactory for nadir images, which are our primary focus. The calculation is embedded in a RANSAC framework, and at each iteration the 2D similarity parameters (λ, θ, ∆x, ∆y) are computed from a minimal sample of points in the inter-epoch images. All points falling within N pixels of the prediction are considered inliers. In our experiments, N was set to 5% of the image diagonal, and the number of RANSAC iterations to 1000.
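A minimal sketch of this RANSAC step, encoding the 2D similarity as a complex multiplication s = λe^{iθ} plus a translation t, so that a two-point minimal sample determines all four parameters (function name and defaults are illustrative):

```python
import numpy as np

def ransac_similarity_2d(pts1, pts2, thresh, iters=1000, rng=None):
    """RANSAC fit of a 2D similarity q = s*p + t with p, q, s, t complex.

    pts1, pts2: (n, 2) corresponding points; thresh: inlier radius in
    pixels (e.g. 5% of the image diagonal). Returns (s, t, inlier_mask).
    """
    rng = np.random.default_rng(rng)
    p = pts1[:, 0] + 1j * pts1[:, 1]   # complex encoding of the plane
    q = pts2[:, 0] + 1j * pts2[:, 1]
    best_s, best_t, best_in = 1.0 + 0j, 0.0 + 0j, np.zeros(len(p), bool)
    for _ in range(iters):
        i, j = rng.choice(len(p), size=2, replace=False)
        if p[i] == p[j]:
            continue                    # degenerate minimal sample
        s = (q[j] - q[i]) / (p[j] - p[i])   # encodes scale and rotation
        t = q[i] - s * p[i]
        inliers = np.abs(s * p + t - q) < thresh
        if inliers.sum() > best_in.sum():
            best_s, best_t, best_in = s, t, inliers
    return best_s, best_t, best_in
```

The scale and rotation can then be read off as λ = |s| and θ = arg(s).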
Get final inter-epoch tie-points: in the preceding step we got rid of a substantial number of outliers; however, we believe not all outliers could be identified. To obtain the final tie-points we introduce another filter, this time based on a 3D similarity transformation:

P2 = λ · R · P1 + T (2)

where P1 are the 3D coordinates of a feature point in the reference frame of the intra-epoch reconstruction (e.g., epoch 1) and P2 are its respective coordinates in epoch 2. The 3D coordinates are retrieved from the individual depth maps computed within the intra-epoch processing step. We then estimate the scale λ, the translation vector T and the rotation matrix R in a RANSAC routine. We set the number of RANSAC iterations to 1000, and consider tie-points within 10 m of their predicted position as inliers.
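Within each RANSAC iteration the 3D similarity itself can be estimated in closed form from the sampled correspondences, e.g. with the SVD-based least-squares method of Umeyama (1991); a sketch under that assumption (the paper does not specify which estimator is used):

```python
import numpy as np

def similarity_3d(P1, P2):
    """Least-squares 3D similarity P2 ~ lam * R @ P1 + T (Umeyama, 1991).

    P1, P2: (n, 3) corresponding 3D points, n >= 3 and non-collinear.
    Returns (lam, R, T) with R a proper rotation matrix.
    """
    c1, c2 = P1.mean(0), P2.mean(0)
    X, Y = P1 - c1, P2 - c2                   # centered coordinates
    U, S, Vt = np.linalg.svd(Y.T @ X)         # cross-covariance matrix
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))  # guard against reflections
    R = U @ D @ Vt
    lam = (S * np.diag(D)).sum() / np.square(X).sum()
    T = c2 - lam * R @ c1
    return lam, R, T
```

Applying the fitted model to all candidate tie-points and thresholding the residual norm (10 m in the paper) then yields the inlier set.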

Combined Processing
With the help of the 3D similarity transformation we can move our multi-epoch acquisitions to a common reference frame. Finally, we perform a BBA to refine all camera poses and camera calibrations. The intra- and inter-epoch features are the observations in our BBA. To analyze the results at a metric scale, the BBA is followed by another spatial similarity transformation that brings the arbitrary reference frame tied to a selected acquisition to a global reference frame. In our experiments, we derived this transformation thanks to several manually identified points in a recent orthophoto and a digital surface model over the area. If precise poses for one of the epochs were known, we fixed their parameters during the BBA and skipped the final spatial similarity transformation.

EXPERIMENTS

Test Site and Data
The test site is a 420 km² rectangular area located in Pezenas in the Occitanie region of southern France. The area is mainly covered with vegetation and contains several sparsely populated urban zones. We have at our disposal three sets of images acquired in June 1971, June 1981 and 2015 (cf. Table 1). The 2015 images were acquired with IGN's digital metric camera (Souchon et al., 2010). The area exhibited changes in scene appearance due to land-use changes over the 44-year period. Our ground truth (GT)

Evaluation Method
As presented in Figure 1, we perform the BBA in a relative coordinate system tied to a selected epoch. Subsequently, for evaluation purposes, we transform the relative results to the reference frame of our GT with the help of several manually measured GCPs. We evaluate the results by computing DSM differences (DoD) between the GT DSM and a DSM resulting from our computations in the respective years (cf. Figure 5). We compare our method (Ours) to two other approaches:
- SIFT_intra-inter: intra- and inter-epoch tie-points are extracted and matched with SIFT;
- SIFT_intra-VGCP_inter: intra-epoch tie-points are extracted with SIFT but no inter-epoch tie-points are present; the individual epochs are co-registered with the help of virtual GCPs (i.e., points triangulated in one epoch and used as GCPs in another).
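The per-DoD statistics reported later (average, standard deviation, absolute average) reduce to masked grid arithmetic; a sketch assuming both DSMs are co-registered arrays with NaN marking nodata cells (function name is our own):

```python
import numpy as np

def dod_stats(dsm_test, dsm_ref):
    """Statistics of a DSM of differences (DoD = dsm_test - dsm_ref).

    Both inputs are co-registered 2D arrays with NaN as nodata.
    Returns (mean, std, mean_abs) over cells valid in both grids.
    """
    diff = dsm_test - dsm_ref
    d = diff[np.isfinite(diff)]   # NaN in either grid drops the cell
    return d.mean(), d.std(), np.abs(d).mean()
```

A systematic bias such as a dome artifact shows up as a non-zero mean combined with a spatially structured residual, which is why the spatial DoD maps are inspected alongside these scalar statistics.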

Results
Comparison between D2-Net and SIFT. To decide which feature extraction algorithm to use in our guided matching, we performed an experiment based on D2-Net and SIFT. D2-Net is employed with the off-the-shelf trained model provided by the authors, and we use their multiscale detection version to achieve better performance under scale changes. We always apply the ratio test for SIFT, and for D2-Net we test scenarios with and without the ratio test. We used an inter-epoch image pair from 1971 and 1981. The images have a scale difference of factor 1.8 and are rotated by 180° with respect to each other. D2-Net failed at finding corresponding features, probably due to its sensitivity to severe rotation. Meanwhile, SIFT correctly identified several matching points, as shown in Figure 2. To be able to compare the two feature extractors, we removed the scale difference and rotated one image to the orientation of the second image. The matching results are presented in Figure 3. D2-Net detects and matches more points; however, their precision is inferior to that of SIFT (see Figure 3, bottom).
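The ratio test applied here keeps a match only when its nearest-neighbour distance is clearly smaller than the distance to the second-nearest candidate (Lowe, 2004); a NumPy sketch, with the 0.8 threshold chosen for illustration:

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Lowe's ratio test on descriptor row matrices.

    Keeps match (i, j) only if the distance from desc1[i] to its best
    candidate desc2[j] is below `ratio` times the distance to the
    second-best candidate. desc2 must contain at least two descriptors.
    """
    d2 = (np.square(desc1).sum(1)[:, None]
          + np.square(desc2).sum(1)[None, :]
          - 2.0 * desc1 @ desc2.T)
    order = np.argsort(d2, axis=1)
    rows = np.arange(len(desc1))
    dist_best = np.sqrt(np.maximum(d2[rows, order[:, 0]], 0.0))
    dist_second = np.sqrt(np.maximum(d2[rows, order[:, 1]], 0.0))
    keep = dist_best < ratio * dist_second   # ambiguous matches are dropped
    return np.stack([rows[keep], order[keep, 0]], axis=1)
```

Dropping this filter, as done for the tentative inter-epoch matches, trades precision for recall, which is why the subsequent 2D and 3D similarity filters are needed.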
As mentioned in (Dusmanu et al., 2019), D2-Net detection is based on higher-level information, which inherently leads to more robust but less accurate keypoints. At the same time, D2-Net features are also detected at the lower resolution of the CNN feature maps (one fourth of the input resolution). SIFT was the preferred algorithm because, even though it finds fewer points, they are much more reliable than those of D2-Net, and they are robust to image rotation.
Inter-epoch matching and BBA. In the BBA, all poses corresponding to 1971 and 1981 were considered as free parameters. Since the total number of intra-epoch tie-points is far larger than that of the inter-epoch tie-points (cf. Table 2), we set relative weights to balance their effect. We adopted the Fraser model (Fraser, 1997) to calibrate the cameras, and allowed image-dependent affine parameters. The remaining parameters were shared among all images. The poses of the 2015 acquisition were accurately known a priori and were therefore kept fixed during the combined BBA.
Comparison between three methods. Figure 4 illustrates the inter-epoch tie-points extracted with Ours and SIFT_intra-inter between 1971 and 1981, 1971 and 2015, as well as 1981 and 2015. Our guided matching strategy detected far more points than the classical SIFT pipeline in SIFT_intra-inter (cf. Table 2), especially for the most challenging scenario (i.e. the one with the longest time gap, between 1971 and 2015). Figure 5 shows the computed DoDs, and Table 3 reports the averages and standard deviations computed on the Z-coordinate differences. A dome artifact is present in the DoDs corresponding to the methods SIFT_intra-VGCP_inter and SIFT_intra-inter. This kind of systematic error is known to originate from poorly modelled camera internal parameters (Giordano et al., 2018). In Ours, given the reasonably dense inter-epoch features and the calibrated 2015 acquisition, we are able to effectively remove this discrepancy. Even though our proposed method performs best of all, we still observe some systematic errors, e.g. the stripe-like pattern that coincides with the acquisition geometry, and the increasing differences towards the edge of the image block in 1971, where few inter-epoch tie-points were found.

Table 3. Average µ, standard deviation σ, and absolute average |µ| of the Z-coordinate differences in the 7 DoDs in Figure 5.

CONCLUSION
A new approach to detecting inter-epoch tie-points in historical images has been proposed. Our validation dataset consisted of three epochs: 1971, 1981 and 2015. The proposed method outperformed the classical SIFT and D2-Net extraction algorithms by detecting many more reliable points, which resulted in better pose estimation outcomes. The approach is generic, as no auxiliary data is required, and it is independent of limiting initial assumptions. We also performed an experiment showing that (i) SIFT often provides fewer tie-points than D2-Net, but they are more precise; and (ii) D2-Net is sensitive to image rotation. Future work might concentrate on further improving the robustness of the proposed method in order to handle more challenging conditions, such as larger temporal gaps. This could, for example, be done by automatically searching for a more suitable scale factor for down-sampling the input images before tentative matching, or by resorting to line features when necessary.