DEEP LEARNING BASED FEATURE MATCHING AND ITS APPLICATION IN IMAGE ORIENTATION

Matching images containing large viewpoint and viewing direction changes, resulting in large perspective differences, is still a very challenging problem. Affine shape estimation, orientation assignment and feature description algorithms based on detected hand crafted features have been shown to be error prone. In this paper, affine shape estimation, orientation assignment and description of local features are achieved through deep learning. Those three modules are trained based on loss functions optimizing the matching performance of input patch pairs. The trained descriptors are first evaluated on the Brown dataset (Brown et al., 2011), a standard descriptor performance benchmark. The whole pipeline is then tested on images of small blocks acquired with an aerial penta camera, to compute image orientation. The results show that learned features perform significantly better than alternatives based on hand crafted features.


INTRODUCTION
Feature based image matching aims at finding correspondences among images and is a fundamental research issue in photogrammetry and computer vision. The related pipeline is composed of four steps, namely feature detection, feature orientation, feature description and high dimensional descriptor matching. Distinctive features are obtained during feature detection and localization across scale. After the assignment of a principal direction to each detection, all features are corrected accordingly to remove the rotation difference. Afterwards, a support window surrounding each detected feature with a size proportional to the detected scale is chosen and is used to extract a high dimensional vector, i.e. the feature descriptor, to represent the detected features. Optionally, the affine shape of detected features can also be estimated to compensate potential affine distortions, in particular when facing large changes in viewpoint and viewing direction.
The key challenge of feature based image matching frameworks is to ensure invariance against complex geometric and radiometric changes between images. For instance, when two images are taken from distinctively different viewpoints, the local appearance of image patches surrounding detected features can differ significantly due to the geometric setup. Radiometric changes can result from varying illumination, differences in imaging bands and non-Lambertian reflection properties. As indicated in (Aanæs et al., 2012), the invariance of hand crafted detectors and descriptors decreases sharply for images containing 3D scenes when viewpoint and viewing direction changes increase.
However, feature based image matching algorithms can be designed to be invariant against certain geometric and radiometric changes. For instance, the well known SIFT operator (Lowe, 2004) is rotation and scale invariant to a certain degree. This invariance can be extended to a reasonable level of affine transformation between images, see e.g., the Hessian-Affine detector (Mikolajczyk et al., 2005). The feature support window, normally corrected according to the estimated orientation and optional affine shape parameters, is then mapped to a high dimensional vector to represent the underlying feature.
Many successful detectors and descriptors based on handcrafted features were developed in the past, but achieved only limited success for large viewpoint changes. Therefore, acquiring local feature descriptors using deep neural models has been attracting more attention recently, because the latter have been shown to have advantages with respect to discriminability, e.g. (Lenc, Vedaldi, 2016, Tian et al., 2017, Mishchuk et al., 2017). In this paper, a new feature based image matching framework based on deep neural networks, including affine shape estimation, feature orientation and description, is presented, and results for the image orientation of small blocks of oblique aerial images are reported and analysed.

RELATED WORK
Feature based image matching has been studied for many decades. The central idea of detecting features is to find points or blobs that are distinctively different from neighbouring pixels. Following this idea, the determinant or trace of the Hessian matrix computed from second order Gaussian smoothed image derivatives is often used to measure how distinctive the underlying feature is. This analysis can be extended to scale space by incorporating images convolved by derivatives of Gaussian kernels of different width. In this way, local extrema in both image x, y and in the scale dimension are detected as features. The trace of the Hessian is approximated by the Difference of Gaussians in SIFT (Lowe, 2004) and the determinant of the Hessian is used in the scale invariant Hessian detector which is a part of the SURF algorithm (Bay et al., 2008).
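The two Hessian-based distinctiveness measures mentioned above can be illustrated with a minimal sketch. It assumes that the input image has already been smoothed with a Gaussian of the desired scale and approximates the derivatives by central differences; the function name `hessian_responses` and the synthetic blob are illustrative choices and are not taken from any of the cited detectors.

```python
import numpy as np

def hessian_responses(image):
    """Return (determinant, trace) of the Hessian at every pixel.

    Assumption: `image` is already Gaussian smoothed at the scale of
    interest; derivatives are approximated by central differences.
    """
    image = np.asarray(image, dtype=float)
    d_row, d_col = np.gradient(image)      # first derivatives
    d_rr, d_rc = np.gradient(d_row)        # second derivatives of d_row
    _, d_cc = np.gradient(d_col)           # second derivative along columns
    det = d_rr * d_cc - d_rc ** 2
    trace = d_rr + d_cc
    return det, trace

# Synthetic Gaussian blob centred in a 41 x 41 image (illustrative data).
rows, cols = np.mgrid[-20:21, -20:21]
blob = np.exp(-(rows ** 2 + cols ** 2) / (2.0 * 5.0 ** 2))
det, trace = hessian_responses(blob)
peak = np.unravel_index(np.argmax(det), det.shape)
```

For the bright blob, the determinant response peaks at the blob centre and the trace is negative there. A scale space detector would repeat this for several smoothing widths and search for extrema over position and scale.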
After features are detected, the rotation of a feature can be estimated by calculating a principal direction using the gradient orientations calculated in a local window surrounding the detected feature. Gradient orientation bins are used to find the principal direction in SIFT (Lowe, 2004), while the Haar wavelet responses in horizontal and vertical direction are used to assign a principal direction to a detected feature in SURF (Bay et al., 2008). In (Moo Yi et al., 2016), the orientation is estimated by a deep convolutional neural network (CNN), showing significantly better performance than the aforementioned methods based on hand crafted features.
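As a rough illustration of a gradient orientation histogram, the sketch below assigns a principal direction to a patch. It is a simplification under stated assumptions: 36 bins, magnitude weighting only, and no Gaussian weighting or peak interpolation as used in SIFT; `principal_direction` is a hypothetical helper name.

```python
import numpy as np

def principal_direction(patch, n_bins=36):
    """Centre (in degrees) of the dominant gradient orientation bin.

    Angles are measured from the column axis towards the row axis.
    """
    patch = np.asarray(patch, dtype=float)
    d_row, d_col = np.gradient(patch)
    magnitude = np.hypot(d_row, d_col)
    angle = np.degrees(np.arctan2(d_row, d_col)) % 360.0
    hist, edges = np.histogram(angle, bins=n_bins, range=(0.0, 360.0),
                               weights=magnitude)
    best = int(np.argmax(hist))
    return 0.5 * (edges[best] + edges[best + 1])

# Intensity ramp growing along the columns: all gradients point at 0 degrees.
ramp = np.tile(np.arange(32, dtype=float), (32, 1))
direction = principal_direction(ramp)
```

With 10 degree bins, the ramp falls into the first bin, so the returned bin centre is 5 degrees.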
Detecting local features in scale space and assigning them an orientation is basically equivalent to normalizing translation, rotation and scaling of local features before description. However, this transformation is not sufficient to model the geometric transformations between local image patches in case of large changes in viewpoint and viewing direction between images. Perspective changes, which for small windows can be compensated by an affine transformation, should also be estimated and taken into account before feature description. Compared to feature detection, fewer works have been published in this direction.
In (Mikolajczyk et al., 2005), the second moment matrix $M$ is used to measure the level of isotropy of a feature. The patch surrounding a feature is normalized by multiplying the patch with $M^{-1/2}$ (restricting the largest eigenvalue of $M^{-1/2}$ to 1) and then the second moment matrix of the normalized patch is iteratively calculated and normalized with $M^{-1/2}$, until the two eigenvalues of $M$ for the normalized patch are close enough to each other. After each iteration, the spatial localization of the maximum value of the feature response function is re-detected. As a result, the affine transformation between two image patches is removed and only a rotation remains. However, this algorithm is not stable when the tilt between patches is large. Instead, affine shape estimation based on a deep neural network is proposed in (Mishkin et al., 2018), where it is estimated by minimizing the distance between the matched descriptors. In addition, ASIFT (Morel, Yu, 2009) simulates the input image with different versions of affine transformations and then the DoG features and SIFT descriptors detected in each transformed image are combined for descriptor matching; this means ASIFT is computationally expensive.
Once orientation and affine shape are assigned to a detected feature, a small patch surrounding the feature is corrected to obtain the feature support window, which is then fed into a description module to obtain the descriptor. Although the pixel grey values in the feature support window can be used as a simple form of descriptor, this is often too sensitive to remaining local deformations in the patch and not discriminative enough. As discussed in (Brown et al., 2011), a feature descriptor is a composition of transformation, aggregation, normalization and optional dimension reduction. The transformation step magnifies some signal, e.g., gradients in SIFT (Lowe, 2004) and Haar wavelet responses in SURF (Bay et al., 2008). Afterwards, the transformation response is aggregated by computing the mean value over a grid or the histogram of the local response. After that, features are normalized to improve their invariance against radiometric transformations. Also, dimension reduction algorithms, e.g. Principal Component Analysis, can be used to further decrease the dimension of the output descriptor. Based on these steps, a descriptor learning method is proposed in (Winder, Brown, 2007, Brown et al., 2011) to optimize the configuration of different steps in building feature descriptors, resulting in a notable performance improvement compared to hand-crafted descriptors. Descriptor learning methods use image patch pairs as input, derive descriptions using some initial model, and then obtain a similarity score for the descriptors of the patch pairs. The model is then optimized using a loss function to increase the similarity of matched feature pairs and decrease the similarity of unmatched pairs. Following this approach, boosting is used to learn weak features that can best discriminate matched and unmatched pairs in (Trzcinski et al., 2012, Trzcinski et al., 2015, Chen et al., 2014).
More recently, many researchers extract the descriptor with a CNN, e.g., (Zagoruyko, Komodakis, 2015, Han et al., 2015, Simo-Serra et al., 2015, Chen et al., 2016, Kumar et al., 2016, Balntas et al., 2016, Tian et al., 2017, Mishchuk et al., 2017). Also, additional constraints for regularization (Zhang et al., 2017, Luo et al., 2018) were added for descriptor learning. Not surprisingly, the deliberate choice to enlarge the amount of training data has also increased the matching performance of learned descriptors, as suggested in (Mitra et al., 2018).
In this paper we suggest a feature matching pipeline based on CNNs in order to derive image orientation parameters for blocks of aerial penta cameras. These images typically have viewing directions with differences amounting to 45 degrees (nadir vs. oblique view) and 90 degrees (one oblique view vs. the other). The work most related to ours is (Mishkin et al., 2018). In that work the affine shape of local features is estimated by deep learning through optimizing the similarity of pairs of input patches simulated by affine transformations. We use the same approach and the same network architecture, but train the different modules from scratch. (Mishkin et al., 2018) differs from our work in mainly two further aspects: 1) the affine shape estimation part is trained based on a different form of loss; and 2) our algorithm is applied and tested on real image orientation tasks, instead of only on image matching benchmarks. The second aspect forms the main contribution of this paper.

THE FEATURE MATCHING PIPELINE
In our method the three steps of affine shape estimation, orientation assignment and description of local image patches are all learned based on a CNN architecture with detected feature pairs as input, which serve as training data (for details on the detection see section 3.4). Subsequently, features of individual images are detected using classical methods, followed by applying the trained CNN modules to obtain descriptors. Finally matching is carried out. An overview of the training phase is illustrated in Fig 1. The training data is a series of image patches with known matching relationship ("matched" or "unmatched"), thus matched and unmatched pairs can be sampled from the training data. These training patches only have small orientation and affine shape differences. First, the descriptor part is trained based on the sampled dataset. Then, the affine shape and orientation modules are trained separately (and independently of each other) based on the descriptor learned in the previous step and on sampled pairs of patch data which are now rotated and distorted based on simulation (note that in contrast to training, during inference, the sequence of first applying the affine correction and then carrying out orientation assignment does matter, see section 3.4 and figure 6 for details). The affine shape and orientation are not solved simultaneously, because preliminary experiments of solving the two parts in one step did not converge.

Descriptor Module
To train descriptors, matched and unmatched patch pairs are used. The patches are support windows of features. After applying a CNN to those patches, related descriptors are obtained. As the pair of patches is either matched or unmatched, the corresponding descriptor distances are used to build the loss function. In this section, the framework for descriptor learning is discussed first, followed by the generation of training pairs and the design of the loss function. Additionally, the employed hardest mining strategy (Mishchuk et al., 2017) and data augmentation are discussed.

Descriptor Training Architecture:
The descriptor is trained based on a Siamese CNN, which is illustrated in figure 2. The two branches of the CNN share the same weights and each branch of the descriptor network is used to extract descriptors from an input image patch. Details of the descriptor network used to generate descriptors are provided in table 1. This network is identical to the one used in (Mishchuk et al., 2017) and (Mishkin et al., 2018), and was originally proposed by (Tian et al., 2017). Through a series of convolution layers, a 32×32 pixel single channel image patch is transformed and compressed into a 128 dimensional descriptor, which is then scaled to unit length.

Generation of training pairs:
Following (Mishkin et al., 2018), the training data is composed of image patches. For each patch it is known which other patches are correct matches (this is realised via a 3D point index for each patch, thus matched patches are characterized by the same 3D point index). First for each mini-batch, 3D points are sampled without replacement from the training data and then, two different patches associated with the same 3D point index are randomly selected to form a pair of positive patches for each selected 3D point.
During training we also need counter examples, i.e. non-matching pairs: for a patch $p_i^1$, patches associated with a different 3D point index belong to the unmatched patches (see Figure 3). The use of the 3D index thus ensures that these pairs, when sampled randomly from the training data, do not by chance contain correct matches. Obviously, the number of possible unmatched pairs is much higher than that of the matched pairs. For properly training our network, we need an equal number of matched and unmatched pairs. We thus have to select the unmatched pairs to be used in training from the larger set. For this selection step we employ the hardest mining strategy (Mishchuk et al., 2017).
Hardest mining of unmatched pairs: According to this strategy, the unmatched pairs are required to be the most difficult ones for each pair of patches. In this way the network best learns how to differentiate between matched and unmatched samples. The corresponding Euclidean distance $d_i^h$ is defined as:

$$d_i^h = \min\left(\min_{j \neq i} d(x_i^1, x_j^1),\ \min_{j \neq i} d(x_i^1, x_j^2)\right) \qquad (1)$$

where $x_i^1$ = the descriptor for the $i$th patch in the patch set 1, i.e. the set passed through branch 1. The terms $d(x_i^1, x_j^1)$ and $d(x_i^1, x_j^2)$ compute the hardest negative samples by finding the unmatched pair with the smallest distance. The selection of the hardest unmatched pair is illustrated in figure 3. Through seeking the hardest samples anew in each training epoch, the network "sees" many more unmatched training pairs than matched ones, which corresponds to the fact that matching is a problem where a much larger number of negative pairs than positive ones is typically compared for real matching applications. After mining, a triplet containing a pair of matched patches and a "most difficult" negative patch is obtained for each training patch passed to the first branch of the network.
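The mining step can be sketched for one mini-batch as follows. This is a minimal NumPy illustration, not the training code of the paper: it assumes that row `i` of `x1` and row `i` of `x2` hold the descriptors of a matched pair (same 3D point index) and that all other rows in the batch belong to different 3D points; the function names are hypothetical.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between all rows of a and all rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def hardest_distances(x1, x2):
    """Smallest distance from each x1[i] to any unmatched descriptor.

    x1[i] and x2[i] form a matched pair; every other row of x1 or x2
    is treated as unmatched (assumption: one pair per 3D point).
    """
    n = x1.shape[0]
    d11 = pairwise_distances(x1, x1)
    d12 = pairwise_distances(x1, x2)
    same_point = np.eye(n, dtype=bool)   # entries sharing the 3D index
    d11[same_point] = np.inf
    d12[same_point] = np.inf
    return np.minimum(d11.min(axis=1), d12.min(axis=1))

# Three unit-length toy descriptors per branch (illustrative data).
x1 = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
x2 = x1.copy()
hardest = hardest_distances(x1, x2)
```

In this toy batch every descriptor has its closest unmatched neighbour at a distance of the square root of 2.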

Loss function:
The triplet margin based loss function (Hoffer, Ailon, 2015) is used as the loss function for descriptor training. For $n$ sampled triplets, the loss is defined as:

$$L = \frac{1}{n}\sum_{i=1}^{n} \max\left(0,\ m + d(x_i^1, x_i^2) - d_i^h\right) \qquad (2)$$

where $d(x_i^1, x_i^2)$ is the Euclidean distance between the $i$th matched pair and $d_i^h$ is the hardest distance computed for the $i$th triplet as described above. $m$ is a pre-defined margin between the distance of matched and unmatched pairs. Considering all the descriptors are normalized, the maximum distance between any two descriptors is 2. In this paper, $m$ is set to be 1, as suggested in (Mishchuk et al., 2017).

Figure 3. Hardest mining. The patch $p_i^1$ is compared with all the patches with a different 3D index from the patch sets 1 and 2. The pair with the smallest distance is picked for calculating the loss for unmatched pairs. Arrows: comparison between patches.
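A triplet margin loss of this kind can be written in a few lines. The sketch below assumes that the matched distances and the hardest unmatched distances have already been computed for a mini-batch and that the loss is averaged over the batch; the averaging and the function name are assumptions made for illustration.

```python
import numpy as np

def triplet_margin_loss(d_matched, d_hardest, margin=1.0):
    """Mean of max(0, margin + d_matched - d_hardest) over a mini-batch."""
    d_matched = np.asarray(d_matched, dtype=float)
    d_hardest = np.asarray(d_hardest, dtype=float)
    return float(np.mean(np.maximum(0.0, margin + d_matched - d_hardest)))

# Two toy triplets: the first already satisfies the margin, the second does not.
loss = triplet_margin_loss([0.2, 0.5], [1.5, 0.9])
```

The first triplet contributes zero because its unmatched distance exceeds the matched distance by more than the margin, while the second contributes 0.6, giving a mean of 0.3.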

Data augmentation:
In order to increase the number of matched pairs during training, the available ones are augmented by flipping or rotating them by a value randomly chosen from the set {90°, 180°, 270°}.
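A possible form of this augmentation is sketched below. Several details are assumptions that the text does not specify: the same transformation is applied to both patches of a matched pair, the flip is horizontal, and flip and rotation are chosen with equal probability. `augment_pair` is a hypothetical helper.

```python
import numpy as np

def augment_pair(patch_a, patch_b, rng):
    """Apply one random flip or 90-degree-step rotation to both patches."""
    if rng.random() < 0.5:
        return np.fliplr(patch_a), np.fliplr(patch_b)
    k = int(rng.integers(1, 4))          # 1, 2 or 3 quarter turns
    return np.rot90(patch_a, k), np.rot90(patch_b, k)

rng = np.random.default_rng(0)
patch = np.arange(32 * 32, dtype=float).reshape(32, 32)
aug_a, aug_b = augment_pair(patch, patch.copy(), rng)
```

Because only flips and quarter turns are used, the patch size is preserved and no interpolation is needed.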

Affine Shape Estimation Module
In this section, the learning architecture for affine shape estimation, affine shape parametrization and the corresponding loss function are discussed.

Training Architecture:
Similar to the descriptor network, the affine estimation network has two branches and shared weights to handle patch pairs, which are sampled according to the method described in section 3.1.2. However, here the input patch pairs are first distorted by simulation using an affine transformation and are then fed into the affine estimation network. Note that both patches of a pair are distorted, and the shape correction is then computed for both of them. Through the affine estimation network the underlying affine transformation parameters are estimated and the patches are then re-sampled using the inverse transformation. Then, the resampled patches are fed into the descriptor network to obtain descriptors and to calculate the descriptor distance based loss. The whole architecture is illustrated in figure 4.
Affine Shape Parametrization: Similar to (Mishkin et al., 2018) and (Perd'och et al., 2009), the affine transformation applied to each patch individually is decomposed into the following form

$$A = \lambda\, R(\alpha)\, A' = \lambda\, R(\alpha) \begin{pmatrix} a_{11} + 1 & 0 \\ a_{21} & a_{22} + 1 \end{pmatrix}$$

where
$\lambda$ = detected feature scale, kept constant during estimation of affine shape parameters
$a_{11}, a_{21}, a_{22}$ = residual form of affine shape parameters, computed during affine shape estimation
$\alpha$ = feature rotation angle, also kept constant during estimation of affine shape parameters

The rotation matrix $R(\alpha)$ with angle $\alpha$ will be discussed in section 3.3; the other matrix contains the affine shape parameters. Setting $a_{12} = 0$ enables the affine shape estimation to preserve the direction orthogonal to the x-axis for an image patch, because the affine shape matrix always has one eigenvector equal to $(0, 1)^T$.
The affine shape estimation network (we again use the same network as (Mishkin et al., 2018)) is used to estimate the affine matrix elements for each input patch. This network has a similar structure as the descriptor network, delivering residual affine shape parameters $a_{11}, a_{21}, a_{22}$. To fix the overall scale of features we subsequently divide them by the determinant of the affine shape matrix, $(a_{11} + 1)(a_{22} + 1)$. During training, all affine shape parameters are randomly sampled according to a uniform distribution, and for a matching pair the rotation angle $\alpha$ is set to the same angle for both patches.
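The composition of scale, rotation and affine shape can be sketched as follows. The sketch follows the description above (lower triangular shape matrix in residual form, division by the product of its diagonal entries); the sign convention of the rotation matrix and the function names are assumptions.

```python
import numpy as np

def shape_matrix(a11, a21, a22):
    """Lower triangular affine shape matrix built from residual parameters."""
    shape = np.array([[a11 + 1.0, 0.0],
                      [a21, a22 + 1.0]])
    # Normalisation as described in the text: divide by (a11 + 1) * (a22 + 1).
    return shape / ((a11 + 1.0) * (a22 + 1.0))

def affine_matrix(scale, alpha, a11, a21, a22):
    """Compose scale * R(alpha) * shape (rotation sign convention assumed)."""
    rotation = np.array([[np.cos(alpha), np.sin(alpha)],
                         [-np.sin(alpha), np.cos(alpha)]])
    return scale * rotation @ shape_matrix(a11, a21, a22)

# With zero residuals and zero rotation only the scale remains.
identity_case = affine_matrix(2.0, 0.0, 0.0, 0.0, 0.0)
# The direction (0, 1) is preserved by the shape matrix.
mapped = shape_matrix(0.2, 0.3, -0.1) @ np.array([0.0, 1.0])
```

The second example shows the effect of the zero upper right element: the vector (0, 1) is only scaled, never tilted.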

Loss function:
The hardest loss is used for the estimation of the affine shape matrix elements.
The $k$ hardest unmatched pairs are used for each matched pair (here $k > 1$; for descriptor learning $k = 1$).
$m$ is the margin defined in equation (2). This larger value of $k$ reduces the risk that the hardest unmatched sample lies between the matched features in descriptor space, while still relying on difficult samples. Based on experimental evaluation we set $k$ to 3 in this paper. This mining procedure is different from (Mishkin et al., 2018), who set $k = 1$ in the loss function and eliminate the effect of the hardest unmatched samples during back-propagation by setting the respective gradients to zero.
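One way to use several hardest unmatched samples per matched pair is sketched below. The way the individual terms are combined is an assumption (here a plain mean over all pairs and all selected negatives), as are the function name and the symbol `k` for the number of hardest samples.

```python
import numpy as np

def k_hardest_triplet_loss(d_matched, d_unmatched, k=3, margin=1.0):
    """Triplet margin loss averaged over the k hardest unmatched distances.

    d_matched:   shape (n,), distance of each matched pair.
    d_unmatched: shape (n, c), distances from each anchor to c unmatched
                 descriptors, with c >= k.
    """
    d_matched = np.asarray(d_matched, dtype=float)
    d_unmatched = np.asarray(d_unmatched, dtype=float)
    hardest = np.sort(d_unmatched, axis=1)[:, :k]      # k smallest per row
    terms = np.maximum(0.0, margin + d_matched[:, None] - hardest)
    return float(terms.mean())

d_pos = [0.5]
d_neg = [[2.0, 1.0, 1.2, 1.8]]
loss_k1 = k_hardest_triplet_loss(d_pos, d_neg, k=1)
loss_k3 = k_hardest_triplet_loss(d_pos, d_neg, k=3)
```

With one hardest sample the toy loss is 0.5; with three it drops to 0.8/3, because less difficult negatives also enter the average.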

Orientation Assignment Module
As feature pairs can not only exhibit affine distortion but also be rotated in an arbitrary way, there still exists the (unknown) rotation angle $\alpha$ for each patch. In this section we describe how we estimate $\alpha$.

Training Architecture:
The feature orientation network is again similar to the descriptor network and has two branches with shared weights to handle patch pairs, which are sampled according to the method described in section 3.1.2. Here, both input patches are first rotated by simulation and then fed into the network. Both patches of a pair are rotated by an angle sampled independently of each other, and the orientation angle is then computed for both of them. The architecture for orientation learning is shown in figure 5.

Figure 5. Orientation estimation network architecture. After rotation simulation, the rotation angles of the simulated patches are estimated and then the patches are resampled using the estimated rotation angles. The resampled patches are subsequently used as input to the pre-trained descriptor network to obtain the descriptors $x^1$, $x^2$ and the loss.

Data Augmentation:
A uniformly distributed random rotation in a range of $[0, 2\pi)$ is applied to each patch of the input patch pair. Also, a random translation in a range of [−2, 2] pixels and random scaling in a range of [0.9, 1.1] are applied to simulate the localization and scale determination noise in the feature detection stage, as suggested in (Mishkin et al., 2018).
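The simulated perturbation can be sketched as a random similarity transformation. The parameter ranges are taken from the text; expressing the result as a 2 x 3 matrix, sampling the two translation components independently, and the function name are assumptions for illustration.

```python
import numpy as np

def sample_similarity(rng):
    """Sample one simulated transform as a 2 x 3 matrix [s * R | t]."""
    angle = rng.uniform(0.0, 2.0 * np.pi)     # rotation in [0, 2*pi)
    shift = rng.uniform(-2.0, 2.0, size=2)    # translation in pixels
    scale = rng.uniform(0.9, 1.1)             # relative scale
    c, s = np.cos(angle), np.sin(angle)
    linear = scale * np.array([[c, -s], [s, c]])
    return np.hstack([linear, shift[:, None]]), angle

transform, angle = sample_similarity(np.random.default_rng(1))
```

Such a matrix can be passed to any affine resampling routine to generate the rotated, shifted and rescaled training patch.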

Training Loss:
In this case, the loss is only based on the distance of matched feature pairs. It is defined as

$$L = \frac{1}{n}\sum_{i=1}^{n} d(x_i^1, x_i^2)$$

where $d(x_i^1, x_i^2)$ is the distance of the $i$th pair. When patches are rotated, the contents of the feature support window remain the same, and assuming there exists no rotated repetitive texture, incorporating hardest negative samples is not necessary.

Feature Description using Trained Models
Once the three aforementioned modules are learned, they are integrated into a feature and descriptor extraction pipeline which outputs detected features and their descriptors for an input image.
The whole process is shown in Fig. 6. First, the Hessian matrix determinants of each pixel in the input image are calculated for each sample scale of the input image in scale-space. Then, the local extrema of the Hessian determinant are detected in scale-space, followed by the refinement of image coordinates and characteristic scale. Subsequently, a patch is re-sampled to 32 × 32 pixels around the detected feature position with a range proportional to the characteristic scale. This patch is regarded as input for the affine shape network to predict its estimated affine shape, which is used in the next step to compensate affine distortion of local patches. In the following step, the orientation network is applied to the patch corrected for affine distortion in order to estimate rotation. The patch is then further corrected by the estimated rotation angle, and the related patch forms the feature support window for the local feature. The trained descriptor network is then applied to this support window and a descriptor for the underlying local feature is derived. Note that whenever resampling is necessary, all related parameters are combined first and only a single resampling step is carried out.

EXPERIMENTS AND RESULTS
First, the learned descriptor is evaluated using the Brown dataset (Brown et al., 2011) to reflect its performance for classification of patch pairs. Second, the whole pipeline is used to extract features and descriptors for small image blocks taken from an aerial penta-camera system including nadir and oblique images with significant changes in viewing direction. The quality of image orientation parameters and the computed 3D points after bundle adjustment using the matched features as input are used to assess the results.

Experimental Datasets
Brown Dataset: The Brown dataset (Brown et al., 2011) is used to train the descriptor network and to test its performance. This dataset was generated from a multi-view image collection containing a large number of images of community photo collections (Goesele et al., 2007). Through structure from motion and dense multi-view stereo matching (Snavely et al., 2008), matches were retrieved, which are considered as ground truth in the following. The dataset is composed of three different subsets: Notre Dame, Liberty and Yosemite. In each subset, a test set containing equal numbers of matched and unmatched pairs is also provided.

Footnote: For a feature $f$ in image $I_1$, a small grid surrounding $f$ in $I_1$ is extracted and transferred to image $I_2$ through the depth map estimated between the stereo image pair $I_1$, $I_2$. The transferred grid is then used to estimate the scale and pixel localization for the transferred feature point in $I_2$. If the estimated scale and pixel localization of the transferred feature point are close to the scale and localization of a feature point $f'$ in $I_2$, then $f$ and $f'$ are judged as a ground truth match.

Dataset for Image Orientation
Five different image blocks, each containing a mix of nadir and oblique images, are used in this experiment. Blocks 1 to 3 were acquired using a penta system with Canon EOS-1Ds Mark II cameras, while blocks 4 and 5 are part of the ISPRS/EuroSDR benchmark for multiplatform photogrammetry (Nex et al., 2015), in which the IGI penta camera system was used. The image dimension is 4994 x 3328 pixels for blocks 1 to 3 and 8176 x 6132 pixels for blocks 4 and 5. Details of the blocks are given in

Training of Network
All three modules were trained using a mini-batch size of $n = 1024$ with 20 epochs, where one epoch represents the number of iterations in which all training samples are used once. For the descriptor, the pairs were regenerated after each training epoch. The employed optimizer was standard gradient descent with momentum; the learning rate $lr$ was set to 0.001 and decayed linearly with a step of $lr/\#iter$, where $\#iter$ is the number of iteration steps. The training of the descriptor for the Brown dataset used 10 million pairs. The networks of descriptor, affine shape and orientation estimation for image orientation were all trained based on the complete Brown dataset. For descriptor training, 30 million pairs were used, while for the other two modules, 10 million pairs were employed. Note that there is a domain gap between the training data and the image orientation task data, because the Brown dataset is composed of close-range terrestrial images and the orientation task uses aerial images.
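A linear decay of this kind can be sketched as follows, assuming that the learning rate is reduced by the same amount after every iteration and reaches zero after the last one; `linear_decay` is a hypothetical helper and not part of any training framework.

```python
def linear_decay(initial_lr, step, total_steps):
    """Learning rate after `step` updates.

    Each update lowers the rate by initial_lr / total_steps.
    """
    return initial_lr * (1.0 - step / total_steps)

start = linear_decay(0.001, 0, 100)
halfway = linear_decay(0.001, 50, 100)
end = linear_decay(0.001, 100, 100)
```

With an initial rate of 0.001, the rate is 0.0005 halfway through training and zero at the end.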

Brown Dataset:
The descriptor network is trained using one subset and then tested on the other two (in all permutations); therefore, six different combinations of training and test subsets are obtained. The distances of descriptors for pre-defined patch pairs, containing an equal number of matched and unmatched pairs, are computed and a varying threshold is applied to obtain the curve of False Positive Rate (FPR) versus True Positive Rate (TPR). The FPR (in %) at 95% TPR is reported as our evaluation criterion.
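The evaluation criterion can be computed as in the following sketch. It assumes that a pair is classified as a match when its descriptor distance does not exceed the threshold and that the threshold is the smallest one accepting at least 95% of the matched pairs; the function name and the toy data are illustrative.

```python
import numpy as np

def fpr_at_tpr(distances, is_match, tpr=0.95):
    """False positive rate (in %) at the distance threshold reaching `tpr`."""
    distances = np.asarray(distances, dtype=float)
    is_match = np.asarray(is_match, dtype=bool)
    matched = np.sort(distances[is_match])
    # Smallest threshold that accepts at least `tpr` of the matched pairs.
    index = int(np.ceil(tpr * matched.size)) - 1
    threshold = matched[index]
    false_positives = np.count_nonzero(distances[~is_match] <= threshold)
    return 100.0 * false_positives / np.count_nonzero(~is_match)

# 20 matched distances 0.05, 0.10, ..., 1.00 and 4 unmatched distances.
matched_d = np.arange(1, 21) / 20.0
unmatched_d = np.array([0.5, 0.9, 1.5, 2.0])
distances = np.concatenate([matched_d, unmatched_d])
labels = np.concatenate([np.ones(20, dtype=bool), np.zeros(4, dtype=bool)])
fpr95 = fpr_at_tpr(distances, labels)
```

In the toy data the threshold is 0.95, so two of the four unmatched pairs are accepted and the result is 50%.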
ISPRS/EuroSDR benchmark website: http://www2.isprs.org/commissions/comm1/icwg15b/benchmark_main.html

Image Orientation:
After feature detection and matching, a bundle adjustment is carried out for the five blocks to obtain image orientation parameters. A number of quality measures are recorded as evaluation criteria (see below for details). In this experiment, three different combinations of affine shape estimation, orientation and feature description algorithms are compared in order to evaluate the contribution of the three different modules in comparison to published work. Note that for all three combinations the same set of detected features is provided as input. The three variants formed by different combinations are:
• hbss: the variant uses Baumberg iteration (Baumberg, 2000) for affine shape, gradient statistical orientation assignment as in SIFT (Lowe, 2004) and the SIFT descriptor (Lowe, 2004).
• hbsl: same affine shape and orientation solution as for hbss, and the descriptor learnt as explained in this paper.
• hlll: the method explained in this paper, i.e. affine shape, orientation and descriptor are all learned.

Determination of image orientation
For this task the following steps are conducted.
• Detection of features for all images, see section 3.4. In this step, a fixed number of Hessian features with potentially high repeatability against viewpoint changes is detected in scale space. We use 5000 features per image in blocks 1 to 3 and 12000 features per image for blocks 4 and 5.
• Estimation of the affine shape and orientation of local features and computation of the descriptors using the trained network, as explained in section 3.4.
• Nearest neighbour threshold matching (Lowe, 2004) to obtain initial matches for each pair of images in a block. The maximum ratio allowed between the distances to the nearest and second nearest matching features is set to 0.67.
• Structure from motion (SfM) software COLMAP (Schönberger, Frahm, 2016) to obtain initial orientation parameters. Note that we ignore matching points that only appear on two images in this step.
• Transformation of the initial orientation results obtained from the different pipelines into a common coordinate system for further comparison. To achieve this goal, one image is selected as origin of the common coordinate system, setting both its projection centre and rotation angles to zero; a second image is selected to define the scale: the length of the baseline between the two projection centres is set equal to 1.
• For each 3D point, depending on which images its rays come from, four cases are distinguished:
  - A) from only one viewing direction (nadir or oblique) (only_nad_or_obl);
  - B) from the nadir and one oblique camera (nad_obl);
  - C) from the nadir and at least two different oblique cameras (obl_nad_obl);
  - D) from different oblique cameras (obl_obl).
In general, the difficulty of matching increases from case A to case D, as the change of viewpoint and viewing direction becomes larger accordingly.
• Precision of the 3D point coordinates obtained from error propagation. A higher precision indicates a better quality of the whole image orientation pipeline.
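The nearest neighbour matching step listed above can be illustrated by a brute force ratio test. This is a minimal sketch, assuming Euclidean descriptor distances, at least two candidate descriptors in the second image, and no mutual consistency check; the function name and the toy descriptors are illustrative.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, max_ratio=0.67):
    """Return (i, j) pairs where desc_a[i] matches desc_b[j].

    A match is kept if the nearest distance is at most max_ratio times
    the second nearest distance. desc_b needs at least two rows.
    """
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    order = np.argsort(dist, axis=1)
    matches = []
    for i in range(dist.shape[0]):
        nearest, second = order[i, 0], order[i, 1]
        if dist[i, nearest] <= max_ratio * dist[i, second]:
            matches.append((i, int(nearest)))
    return matches

desc_a = np.array([[1.0, 0.0], [0.0, 1.0]])
desc_b = np.array([[0.99, 0.01], [0.6, 0.8], [-0.6, 0.8]])
matches = ratio_test_matches(desc_a, desc_b)
```

The first descriptor has one clearly closest candidate and is matched, while the second has two equally distant candidates and is rejected as ambiguous.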

Brown Dataset:
As mentioned before, we tested six different combinations of the three subsets: in all combinations one subset was used for training, the other two subsets as test data. The results are shown in Table 3. The comparison to the work of (Mishchuk et al., 2017), called HardNet, and to SIFT (Lowe, 2004) is also reported.
For the descriptor evaluation on the Brown dataset our model achieves a quality comparable to HardNet, which is the basis of our work, and which can be considered as state-of-the-art in the field: for three cases our result is better, for three cases it is worse. This is a good pre-requisite for the determination of image orientation parameters reported in the next section.

Results for image orientation:
The results for the image orientation experiment are reported in table 4 for the three different variants hbss, hbsl and hlll. The table contains the details of the bundle adjustment results. The distribution of the number of rays intersecting at the 3D points is illustrated in figure 7. The image orientation results indicate that the use of a learned descriptor improves the performance compared to hand crafted features (with SIFT as the example), see table 4. The numbers for hbsl are significantly better than those for hbss. In particular, the learned descriptor leads to a more complete image block, more matches and thus more 3D points, more observations per image, and a better 3D coordinate precision, while the mean reprojection error stays approximately constant.
The incorporation of learned affine and orientation parameters further improves the results. From the distribution of the number of views per 3D point in figure 7 it can be observed that our completely learned variant hlll results in a larger number of multiple ray 3D points. Also, the number of 3D points in the cases B (nad_obl), C (obl_nad_obl) and D (obl_obl) is higher for our pipeline than for the other two. Partly due to the fact that 3D points in our pipeline are observed in a larger number of views, the precision of reconstructed 3D points is also higher, which is more obvious in blocks 4 and 5.
Based on the above observations, the learned affine shape, orientation and descriptor modules provide more complete and more accurate results for image orientation than the other tested pipelines for the challenging task of dealing with images containing large viewpoint and large viewing direction changes.

CONCLUSION AND FUTURE WORK
In this paper, a feature based image matching framework making use of deep learning is proposed. The affine shape estimation, orientation assignment and description of local features are all learned using CNN. The method provides state-of-the-art feature descriptors. Tests for image orientation of small blocks of penta cameras reveal that the proposed method achieves a better matching performance than more traditional methods.
Currently, we do not use canonical descriptions for the image patches, such as referring the orientation to the main gradient direction as is the case in SIFT. Consequently, multiple solutions exist for image matching in terms of orientation and affine transformation, and our solution contains an over-parametrisation of the feature correspondence problem. While we still obtain very good results, we plan to introduce constraints for the predicted transformation parameters to make the estimation more stable in the future. Also, we plan to use larger photogrammetric blocks in our evaluation. Furthermore, we plan to explore the possibility to integrate the three steps into one single network and to evaluate the limitations of learning our network on one dataset and then transferring the results to different sets of images.

Figure 7. Results for the three pipelines (see text for explanation of the abbreviations). Figures (a) to (e) indicate the distribution of the number of multiple rays per object point for blocks 1 to 5. Figures (f) to (j) indicate the distribution of involved cameras for the five blocks, where A="only_nad_or_obl", B="nad_obl", C="obl_nad_obl" and D="obl_obl".