End-to-end depth from motion with stabilized monocular videos

We propose a depth map inference system for monocular videos, based on a novel navigation dataset that mimics aerial footage from a gimbal-stabilized monocular camera in rigid scenes. Unlike in most navigation datasets, the lack of rotation yields an easier structure from motion problem, which can be leveraged for tasks such as depth inference and obstacle avoidance. We also propose an architecture for end-to-end depth inference with a fully convolutional network. Results show that, although tied to the camera's intrinsic parameters, the problem is locally solvable and leads to good quality depth prediction.


INTRODUCTION
Scene understanding from vision is a core problem for autonomous vehicles and for UAVs in particular. In this paper we are specifically interested in computing the depth of each pixel from a pair of consecutive images captured by a camera. We assume our camera's velocity (and thus its displacement between two frames) is known, as most UAV flight systems include a speed estimator, which resolves the scale ambiguity.
Solving this problem could benefit depth-based sense and avoid algorithms on lightweight embedded systems that only have a monocular camera and cannot directly provide an RGB-D image. Such devices could then go without heavy or power-hungry dedicated sensors such as ToF cameras, LiDAR or infrared emitters/receivers (Hitomi et al., 2015), which would greatly lower autonomy. In addition, some of these sensors cannot operate under sunlight (e.g. IR and ToF), and most RGB-D sensors suffer from range limitations and can be inefficient when long-range trajectory planning is needed (Hadsell et al., 2009). The faster a UAV flies, the longer the range needed to efficiently avoid obstacles. Unlike RGB-D sensors, depth from motion is robust to high speeds since it is normalized by the displacement between two frames. Given the difficulty of the task, several learning approaches have been proposed to solve it.
A large number of datasets have been developed to provide supervised learning and validation for fundamental vision tasks, such as optical flow (Geiger et al., 2012, Dosovitskiy et al., 2015, Weinzaepfel et al., 2013), stereo disparity and even 3D scene flow (Menze and Geiger, 2015, N. Mayer et al., 2016). These different measures can help figure out scene structure and camera motion, but they remain low-level in terms of abstraction. End-to-end learning of a quantity with high semantic value, such as three-dimensional geometry, may be hard for a totally unrestricted monocular camera movement.
We focus on RGB-D datasets that would allow supervised learning of depth, with RGB pairs (preferably with the corresponding displacement) as the input and D as the desired output. Existing RGB-D datasets for learning depth from motion offer either unrestricted ego-motion (Firman, 2016, Sturm et al., 2012) or simple stereo vision, equivalent to lateral movement (Geiger et al., 2012, Scharstein and Szeliski, 2002).
Figure 1. Camera stabilization can be done via a) a mechanical gimbal or b) dynamic cropping from a fish-eye camera, for drones, or c) hand-held cameras.
We thus propose a new dataset, described in Part 3, which aims to bridge the two by assuming that rotation is canceled in the footage, which then contains only random translations.
This assumption of videos without rotation appears realistic for two reasons: 1. Hardware rotation compensation is mainly a solved problem, even for consumer products, with IMU-stabilized cameras on consumer drones or hand-held steady-cams (Fig 1). 2. This movement is somewhat related to human vision and the vestibulo-ocular reflex (VOR) (De Nó, 1933). The orientation of our eyes is not driven by head rotation: the inner ear, among other biological sensors, lets us compensate for parasitic rotation when looking in a particular direction.
This assumption dramatically simplifies the link between optical flow and depth and allows much simpler computation.
The main benefit is the dimensionality of the camera movement, reduced from 6 (translation and rotation) to 3 (translation only). However, as discussed in Part 4, depth is not computed as simply as with stereo vision and requires computing higher abstractions to avoid a possible indeterminate form, especially for forward movements.
Using the proposed dataset, we then show that depth can be learned as an end-to-end problem, just like other usual Deep Learning problems. With a trained artificial neural network, we achieve much better depth accuracy than flow-based methods, and we are confident this can be efficiently leveraged for sense and avoid algorithms.

Monocular vision based sense and avoid
Sense and avoid problems are mostly approached using a dedicated sensor for 3D analysis. However, some work has tried to leverage optical flow from a monocular camera (Souhila and Karim, 2007, Zingg et al., 2010). These works highlight the difficulty of estimating depth solely with flow, especially when the camera is pointed toward the direction of movement. One can note that rotation compensation was already used with a fish-eye camera in order to have a more direct link between flow and depth. Another work (Coombs et al., 1998) also demonstrated that basic obstacle avoidance could be achieved in cluttered environments such as a closed room.
Some interesting work concerning obstacle avoidance from a monocular camera (LeCun et al., 2005, Hadsell et al., 2009, Michels et al., 2005) showed that single frame analysis can be more efficient than depth from stereo for path planning. However, these works were not applied to UAVs, for which depth cannot be directly deduced from distance to the horizon, because obstacles and paths are now three-dimensional. More recently, Giusti et al. (2016) showed that a monocular system can be trained to follow a hiking path. But once again, only 2D movement is addressed, asking a UAV going forward to change its yaw based on the likelihood of following a traced path.

Depth inference
Deep Learning and Convolutional Neural Networks have recently been widely used for numerous kinds of vision problems, such as classification (Krizhevsky et al., 2012) and hand-written digit recognition (LeCun et al., 1998).
Depth from vision is one of the problems studied with neural networks, and has been addressed not only with image pairs, but also with single images (Eigen et al., 2014). Depth inference from stereo has also been widely studied (Luo et al., 2016, Zbontar and LeCun, 2015), and not necessarily in a supervised way (Konda and Memisevic, 2013, Garg et al., 2016).
Current state-of-the-art methods for depth from monocular views tend to use motion, and especially structure from motion, and most algorithms do not rely on deep learning (Cadena et al., 2016, Mur-Artal and Tardos, 2016, Klein and Murray, 2007). Prior knowledge w.r.t. the scene is used to infer a sparse depth map whose density usually grows over time. These techniques, also called SLAM, are typically used with unstructured movement, produce very sparse point-cloud based 3D maps and require heavy computation to keep track of the scene structure and align newly detected 3D points to the existing ones. SLAM is not widely used for obstacle avoidance, but rather for off-line 3D scanning. Our goal is to compute a dense, quality depth map (where every point has a valid depth) using only two images, and without prior knowledge of the scene and movement, apart from the lack of rotation and the scale factor.
Table 2. Dataset parameters

Navigation datasets
As discussed earlier, numerous datasets exist with depth ground truth, but to our knowledge, no dataset proposes only translational movement. Some provide IMU data along with frames (Smith et al., 2009), which could be used to compensate rotation, but their small size only allows us to use them as a validation set.

STILL BOX DATASET
For our dataset we used the rendering software Blender to generate an arbitrary number of random rigid scenes, composed of basic 3D primitives (cubes, spheres, cones and tori) randomly textured from an image set scraped from Flickr (see Fig 2).
These objects are randomly placed and sized in the scene, so that they are mostly in front of the camera, with possible variations including objects behind the camera, or even the camera inside an object. Scenes in which the camera goes through objects are discarded. To add difficulty we also applied uniform textures on a proportion of the primitives. Each primitive thus has a uniform probability (corresponding to the texture ratio) of being textured from a color ramp and not from a photograph.
Walls are added at large distances, as if the camera were inside a box (hence the name). The camera moves at a fixed speed, but in a random direction (uniform distribution), which is constant for each scene. It can be anything from forward/backward movement to lateral movement (which is then equivalent to stereo vision). Tables 1 and 2 show a summary of our scene parameters. They can be changed at will, and are stored in a metadata

Why not disparity ?
Flow estimation and disparity (which is essentially the magnitude of optical flow vectors) are problems for which many very convincing methods exist (Ilg et al., 2016, Kendall et al., 2017). Knowing depth and displacement in our dataset, we could easily get disparity and train a network for it using existing methods. We consider a picture with (u, v) coordinates, and optical center P0.
Definition 1. Disparity is defined by the norm of a flow vector: $\text{disparity}(P) = \|\text{flow}(P)\|$.
Definition 2. Focus of Expansion is defined by the point FOE where each flow vector $\text{flow}(P) = (du, dv)^T$ of a point $P = (u, v)^T$ is headed from. Note that this property is true only when considering no rotation and a rigid scene. One can note that for a pure translation, FOE is the projection of the displacement vector onto the image plane.
Theorem 1. For a random rotation-less displacement of norm V of a pinhole camera, with a focal length of f, depth is an explicit function of disparity, focus of expansion FOE and optical center P0.
This result is in a useful form for limit values. Lateral movement corresponds to $\|FOE\| \to +\infty$, and depth then reduces to the stereo relation $\text{depth}(P) = V f / \text{disparity}(P)$. When approaching FOE, knowing depth is a bounded positive value, we can deduce: $\text{disparity}(P) \underset{P \to FOE}{\propto} \|P - FOE\|$. The limit of disparity in this case is 0, and depth depends on its inverse. As a consequence, small errors on disparity estimation will result in diverging values of depth near the focus of expansion, which corresponds to the direction the camera is moving toward; this is clearly problematic for depth-based obstacle avoidance.
Given the random direction of our camera's displacement, computing depth from disparity is therefore much harder than for a classic stereo rig. To tackle this problem, we decided to set up an end-to-end learning workflow, by training a neural network to explicitly predict the depth of every pixel in the scene, from an image pair with constant displacement value V .
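The relation between depth, disparity and FOE can be checked numerically. The sketch below is a re-derivation under an assumed convention, not the paper's exact statement: P is a pixel in the first frame, the camera then translates by (t_x, t_y, t_z) with t_z > 0, and the recovered depth is the one in the first frame. Under these assumptions depth = t_z (‖P − FOE‖ / disparity + 1), with t_z = V f / sqrt(f² + ‖FOE − P0‖²); other frame conventions change the constant term. The function names and all numeric values are hypothetical.

```python
import math


def project(point, f, p0):
    """Pinhole projection of a 3D point (X, Y, Z) to pixel coordinates (u, v)."""
    x, y, z = point
    return (p0[0] + f * x / z, p0[1] + f * y / z)


def depth_from_disparity(p, disparity, foe, p0, f, v):
    """Depth (first frame) of pixel p, assuming a rotation-less move with t_z > 0."""
    # Forward component of the displacement, recovered from FOE and speed norm V.
    t_z = v * f / math.sqrt(f ** 2 + math.dist(foe, p0) ** 2)
    # The inverse of disparity appears here: it diverges when p approaches FOE.
    return t_z * (math.dist(p, foe) / disparity + 1.0)


# Hypothetical camera: focal length of 32 px, optical center in the middle of a 64x64 frame.
f, p0 = 32.0, (32.0, 32.0)
translation = (0.3, -0.2, 1.0)              # camera displacement between the two frames
v = math.sqrt(sum(c * c for c in translation))
foe = project(translation, f, p0)           # FOE is the projection of the displacement

point = (2.0, 1.0, 20.0)                    # static 3D point, true depth of 20
moved = tuple(a - b for a, b in zip(point, translation))
p_first = project(point, f, p0)             # pixel in the first frame
p_second = project(moved, f, p0)            # same point seen after the displacement
disparity = math.dist(p_first, p_second)    # norm of the flow vector

recovered = depth_from_disparity(p_first, disparity, foe, p0, f, v)
print(recovered)
```

The recovered value matches the true depth of the synthetic point up to floating-point error. Because disparity tends to 0 near FOE while it is used through its inverse, a small disparity error there has a large effect on the estimate.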

Dataset augmentation
We store data as 10-frame videos, with each frame paired with its ground truth depth, which allows us to set the distance distribution a posteriori by varying the temporal shift between two frames. If we use a baseline shift of 3 frames, we can e.g. assume a depth three times as great for two consecutive frames (shift of 1). In addition, we can also consider negative shifts, which only reverse the displacement direction without changing the speed value compared to the opposite shift. Given a fixed dataset size, this gives us more evenly distributed depth values to learn from, and also de-correlates images from depth, preventing overfitting during training, which would otherwise result in a scene recognition algorithm that performs poorly on a validation set.
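The shift-based augmentation can be illustrated with a short sketch. It is not the dataset's actual loading code: the function name `target_depth`, the list-of-lists depth format and the default values are assumptions, with the 100 m ceiling and the zero-shift convention taken from the training description.

```python
def target_depth(depth_map, shift, baseline_shift=3, max_depth=100.0):
    """Rescale a ground-truth depth map for a frame pair separated by `shift` frames.

    A smaller displacement than the baseline makes the scene look farther away,
    so depth is multiplied by baseline_shift / |shift| and clamped to max_depth.
    """
    if shift == 0:
        # No displacement at all: treated as "infinitely far", i.e. the maximum depth.
        return [[max_depth for _ in row] for row in depth_map]
    scale = baseline_shift / abs(shift)  # a negative shift only reverses direction
    return [[min(d * scale, max_depth) for d in row] for row in depth_map]


depth = [[10.0, 50.0]]
print(target_depth(depth, 1))    # consecutive frames: depth three times as great, clamped
print(target_depth(depth, -3))   # opposite of the baseline shift: same depth values
print(target_depth(depth, 0))    # identical frames: maximum depth everywhere
```

Sampling the shift at random for each training pair is one way to obtain the more evenly distributed depth values mentioned above.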

Depth Inference training
Our network, called DepthNet and broadly inspired by FlowNetS (Dosovitskiy et al., 2015), is described in Fig 3. FlowNetS was initially used for flow inference. The main idea behind this network is that upsampled feature maps are concatenated with corresponding earlier convolution outputs. Higher semantic information is then associated with information more closely linked to pixels (since it went through fewer strided convolutions), which is then used for reconstruction. Training uses a multi-scale L1 reconstruction loss:
$$\text{Loss} = \sum_{s \in \text{scales}} \gamma_s \frac{1}{H_s W_s} \sum_{i,j} \left| \text{output}_s(i,j) - \text{depth}_s(i,j) \right| \quad (1)$$
where $\text{output}_s$ is the network output at scale s and
• $\gamma_s$ is the weight of the scale, arbitrarily chosen as $W_s$ in our experiments.
• $(H_s, W_s) = (\frac{1}{2^n} H, \frac{1}{2^n} W)$ are the height and width of the output.
• $\text{depth}_s$ is the scaled depth ground truth, using average pooling.
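A multi-scale mean L1 criterion of this kind can be sketched in a few lines. The following is a minimal pure-Python illustration, not the authors' training code: it assumes square power-of-two scale factors, ground truth downscaled by average pooling, and a scale weight equal to the output width W_s. The function names `average_pool` and `multiscale_l1_loss` are hypothetical.

```python
def average_pool(image, factor):
    """Downscale a 2D list by averaging non-overlapping factor x factor blocks."""
    height, width = len(image), len(image[0])
    pooled = []
    for i in range(height // factor):
        row = []
        for j in range(width // factor):
            block = [image[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def multiscale_l1_loss(outputs, depth):
    """Weighted sum over scales of the mean absolute error against pooled ground truth.

    outputs: list of 2D predictions, one per scale; depth: full-resolution ground truth.
    """
    total = 0.0
    for output in outputs:
        h_s, w_s = len(output), len(output[0])
        depth_s = average_pool(depth, len(depth) // h_s)  # scaled ground truth
        mean_l1 = sum(abs(output[i][j] - depth_s[i][j])
                      for i in range(h_s) for j in range(w_s)) / (h_s * w_s)
        total += w_s * mean_l1  # assumption: scale weight gamma_s = W_s
    return total


depth = [[1.0, 3.0], [5.0, 7.0]]
outputs = [[[1.0, 3.0], [5.0, 8.0]],  # full-resolution prediction, one pixel off by 1
           [[3.0]]]                   # half-resolution prediction, pooled truth is 4
loss = multiscale_l1_loss(outputs, depth)
print(loss)
```

Weighting each scale by its width gives the finer outputs more influence than the coarse ones, which only carry a rough layout of the scene.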
As said earlier, we apply data augmentation to the dataset using different shifts, along with classic methods such as flips and rotations. We also clamp depth to a maximum of 100 m, and provide sample pairs without shift, assuming their depth is 100 m everywhere. One can notice that although the network is still fully convolutional, feature map sizes go down to 1x1 and then behave exactly like a fully connected layer, which can serve to implicitly figure out motion direction and spread this information across the outputs. The second noticeable fact is that near FOE (see the results for centered FOE, i.e. perfect forward movement) the network has no problem inferring depth, which means that it uses neighboring disparity and interpolates when no other information is available.
Table 3. Quantitative results for depth inference networks. FlowNetS is modified with 1-channel outputs (instead of 2 for flow) and trained from scratch for depth with Still Box.
This can be interpreted as 3D shape identification, along with their magnification: pixels belonging to the same shape are deemed to have close and continuous depth values, resulting in an FOE-independent depth inference.

From 64px to 512px Depth inference
One could think that a fully convolutional network such as ours cannot solve depth extraction for pictures larger than 64x64. The main idea is that in a fully convolutional network, the same operation is applied to each pixel. For disparity, this makes sense because the problem is essentially similarity from different picture shifts: wherever we are in the picture, the operation is the same. For depth inference when FOE is not diverging (forward movement is non-negligible), the result from Theorem 1 apparently shows that once you know the FOE, the operation to apply depends on your distance from it and from the optical center P0. The only possible strategy for a fully convolutional network would be to compute the position in the frame as well and to apply the compensating scaling to the output.
This problem then seems very difficult, if not impossible, for a network as simple as ours, and if we run the training directly on 512x512 images, the network fails to converge to better results than with 64x64 images (while a better resolution would help get more precision). However, if we take the converged network and fine-tune it on 512x512 images, we get much better results. Fig 6 shows training results for mean L1 reconstruction error, and shows that our deemed-impossible problem seems to be easily solved with multi-scale fine-tuning. As Table 3 shows, the best results are obtained with multiple fine-tuning steps, with intermediate scales of 64, 128, 256, and finally 512 pixels. Subscript values indicate fine-tuning processes. FlowNetS performs better than DepthNet, but by a fairly light margin, while being 5 times heavier and most of the time much slower, as shown in Table 4.
Figure 6. Some results on 512x512 images, same color code as for 64x64 input.
Some results on real image inputs. Top is from Bebop drone footage, bottom is from a gimbal-stabilized smartphone video.
Fig 7 shows qualitative results from our validation set, and from real-condition drone footage, on which we were careful to avoid camera rotation. These results did not benefit from any fine-tuning on real footage, indicating that our Still Box Dataset, although not realistic in its scene structures and rendering, appears to be sufficient for learning to produce decent depth maps in real conditions.

Quality measurement
As our network leverages the reduced dimensionality of our dataset due to its lack of rotation, it is hard to compare our method to anything else. Disparity estimation is equivalent to a lateral translation, which our network has been trained on, and could be used for comparison with other algorithms, but this reduced context seems unfair to our network when compared with methods designed especially for it.
Other datasets provide ego-motion with 6 DOF, on which our network has not been trained and is certain to give poor results. On the other hand, we could test some SLAM methods, but they work better when applied to long image sequences and not only image pairs. In short, our method sets the state of the art, but for a very particular problem that we hope will gain interest with time.

UAV NAVIGATION USE-CASE
We trained depth inference from a moving camera, assuming its velocity is always the same. When running during flight, such a system can easily deduce the real depth map from the drone speed Vt, knowing that the training speed was V0 (here 9 m.s⁻¹): since depth from motion is proportional to the displacement between two frames, the network output is scaled by Vt/V0.
One of the drawbacks of this learning method is that the f value (which is focal length divided by sensor size per pixel) of our camera must be the same as the one used in training. Our dataset creation framework however allows us to change this value very easily for training. One must also be sure to have pinhole-equivalent frames, as during training.
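Recovering metric depth from the drone speed amounts to a single scaling, sketched below under the assumption that depth from motion is proportional to the displacement between the two frames. The function name `rescale_depth` and the list-of-lists depth format are hypothetical; only the 9 m/s training speed comes from the text.

```python
TRAINING_SPEED = 9.0  # V0 in m/s, the speed used when generating the training set


def rescale_depth(network_output, current_speed, training_speed=TRAINING_SPEED):
    """Convert a depth map expressed at the training speed into real depth at Vt."""
    ratio = current_speed / training_speed  # Vt / V0
    return [[d * ratio for d in row] for row in network_output]


# A drone flying at half the training speed sees every object at half the predicted depth.
predicted = [[10.0, 100.0]]
print(rescale_depth(predicted, 4.5))
```

The same ratio also scales the 100 m ceiling of the network, so a slower drone gets a shorter but finer effective range.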

Multiple shifts inference
Depending on the depth distribution of the ground truth depth map, it may be useful to adjust the frame shift. For example, when flying high above the ground, detecting and avoiding big structures requires knowing precise distance values that are outside the typical range of any RGB-D sensor. The logical strategy would then be to increase the temporal shift between the frame pairs provided to DepthNet as inputs.
More generally, one must ensure a well distributed depth map from 0 to 100 m to get high quality depth inference. This problem can be solved with two (among other) solutions:
• Deduce the optimal shift ∆t from the distribution of the previous inference, e.g. by comparing E0 with E(depth), where E0 is 50 m (because our network outputs from 0 to 100 m) and E(depth) is the mean of the previous output, i.e. $E(\text{depth}) = \frac{1}{HW} \sum_{i,j} \text{DepthNet}(\text{frame}_t, \text{frame}_{t-\Delta t})_{i,j}$.
• Use batch inference to compute depth with multiple shifts ∆t,i. As shown in Table 4, a batch size greater than 1 can be used to some extent (especially for low resolution) to efficiently compute multiple depth maps.
These multiple depth maps can then be either combined to construct a high quality depth map, or used separately to run two different obstacle avoidance algorithms, e.g. one dedicated to long-range path planning (with a high value of ∆t,i) and the other to reactive, short-range collision avoidance (with a low ∆t,i). While the high-shift depth map will display closer areas at zero distance but farther regions with precision, the low-shift one will set far regions to infinity (or 100 m for DepthNet) but render closer regions with high resolution, as flow is lowered compared to a high shift, and potentially brought within the range the network has been trained on.
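The first strategy, choosing the next shift from the previous inference, could look like the sketch below. The update rule is an assumption, not taken from the paper: since the network output is inversely proportional to the displacement, multiplying the shift by E(depth)/E0 should bring the next mean output close to E0. The function names and the shift bounds are hypothetical.

```python
def mean_depth(depth_map):
    """Mean of a 2D depth map given as a list of rows."""
    values = [d for row in depth_map for d in row]
    return sum(values) / len(values)


def next_shift(depth_map, current_shift, target_mean=50.0, min_shift=1, max_shift=10):
    """Propose the frame shift for the next inference from the previous output.

    Assumed rule: new shift = current shift * E(depth) / E0, rounded and bounded.
    """
    ratio = mean_depth(depth_map) / target_mean
    proposed = round(current_shift * ratio)
    return max(min_shift, min(max_shift, proposed))


# Output saturated at 100 m with a shift of 3: the scene is far, so double the shift.
print(next_shift([[100.0, 100.0]], 3))
# Mean output of 25 m with a shift of 4: the scene is close, so halve the shift.
print(next_shift([[20.0, 30.0]], 4))
```

Bounding the shift keeps the pair within the frames actually buffered on board; the upper bound used here is arbitrary.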

CONCLUSION AND FUTURE WORK
We propose a novel way of computing dense depth maps from motion, along with a very comprehensive dataset for stabilized footage analysis. This algorithm can then be used for depth-based sense and avoid algorithms in a very flexible way, in order to cover all kinds of path planning, from collision avoidance to long-range obstacle bypassing.
Future work includes the implementation of such a path planning algorithm, and the construction of a real-condition fine-tuning dataset, using UAV footage and a preliminary thorough offline 3D scan. This would allow us to measure the quality of our network quantitatively on real footage, and not only subjectively as is the case for now.
We also believe that our network can be extended to reinforcement learning applications that will potentially result in a complete end-to-end sense and avoid neural network for monocular cameras.