ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci., V-2-2020, 357–364, 2020
https://doi.org/10.5194/isprs-annals-V-2-2020-357-2020

  03 Aug 2020


SELF-SUPERVISED LEARNING FOR MONOCULAR DEPTH ESTIMATION FROM AERIAL IMAGERY

M. Hermann (1,2,3), B. Ruf (1,2,3), M. Weinmann (2), and S. Hinz (2)
  • (1) Fraunhofer IOSB, Karlsruhe, Germany
  • (2) Institute of Photogrammetry and Remote Sensing, KIT, Karlsruhe, Germany
  • (3) Fraunhofer Center for Machine Learning

Keywords: Monocular Depth Estimation, Self-Supervised Learning, Deep Learning, Convolutional Neural Networks, Self-Improving, Online Processing, Oblique Aerial Imagery

Abstract. Supervised-learning-based methods for monocular depth estimation usually require large amounts of extensively annotated training data. In the case of aerial imagery, this ground truth is particularly difficult to acquire. Therefore, in this paper, we present a method for self-supervised learning for monocular depth estimation from aerial imagery that does not require annotated training data. For this, we use only an image sequence from a single moving camera and learn to estimate depth and pose information simultaneously. By sharing the weights between pose and depth estimation, we achieve a relatively small model, which favors real-time application. We evaluate our approach on three diverse datasets and compare the results to conventional methods that estimate depth maps based on multi-view geometry. We achieve an accuracy δ1.25 of up to 93.5 %. In addition, we pay particular attention to the generalization of a trained model to unknown data and to the self-improving capabilities of our approach. We conclude that, even though the results of monocular depth estimation are inferior to those achieved by conventional methods, they are well suited to provide a good initialization for methods that rely on image matching, or to provide estimates in regions where image matching fails, e.g. occluded or texture-less regions.
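
The accuracy figure quoted in the abstract is a threshold accuracy. As commonly defined in the depth estimation literature, it is the fraction of pixels whose ratio between predicted and reference depth, max(d/d*, d*/d), stays below 1.25. The following is a minimal sketch of that common definition in plain Python, not the authors' evaluation code; the function name `threshold_accuracy`, the flat-list input, and the rule of skipping non-positive depths are assumptions made for illustration.

```python
def threshold_accuracy(pred, ref, threshold=1.25):
    """Fraction of valid pixels with max(pred/ref, ref/pred) < threshold.

    pred, ref: equally long sequences of depth values (hypothetical flat
    layout; a real depth map would be flattened first).
    Pixels with a non-positive depth in either input are skipped
    (assumption: such values mark missing measurements).
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")

    # Keep only pixels where both depths are usable.
    valid = [(p, r) for p, r in zip(pred, ref) if p > 0 and r > 0]
    if not valid:
        raise ValueError("no valid pixels to evaluate")

    # A pixel counts as correct if the symmetric depth ratio is small enough.
    hits = sum(1 for p, r in valid if max(p / r, r / p) < threshold)
    return hits / len(valid)


if __name__ == "__main__":
    predicted = [10.0, 12.0, 30.0, 8.0]
    reference = [10.0, 10.0, 20.0, 0.0]  # last pixel has no reference depth
    print(threshold_accuracy(predicted, reference))
```

In this toy input, the ratios of the three valid pixels are 1.0, 1.2 and 1.5, so two of three fall below the threshold. The ratio is symmetric, which means over- and underestimation by the same factor are penalized equally.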