JOINT 3D ESTIMATION OF VEHICLES AND SCENE FLOW

ABSTRACT: Three-dimensional reconstruction of dynamic scenes is an important prerequisite for applications like mobile robotics or autonomous driving. While much progress has been made in recent years, imaging conditions in natural outdoor environments are still very challenging for current reconstruction and recognition methods. In this paper, we propose a novel unified approach which reasons jointly about 3D scene flow as well as the pose, shape and motion of vehicles in the scene. Towards this goal, we incorporate a deformable CAD model into a slanted-plane conditional random field for scene flow estimation and enforce shape consistency between the rendered 3D models and the parameters of all superpixels in the image. The association of superpixels to objects is established by an index variable which implicitly enables model selection. We evaluate our approach on the challenging KITTI scene flow dataset in terms of object and scene flow estimation. Our results provide a proof of concept and demonstrate the usefulness of our method.


INTRODUCTION
3D reconstruction of dynamic scenes is an important building block of many applications in mobile robotics and autonomous driving. In the context of highly dynamic environments, the robust identification and reconstruction of individually moving objects are fundamental tasks as they enable safe autonomous navigation of mobile platforms and precise interaction with surrounding objects. In image sequences, motion cues are amongst the most powerful features for separating foreground objects from the background. While approaches for monocular optical flow estimation have matured since the seminal work of (Horn and Schunck, 1980) 35 years ago, they still struggle with real-world conditions such as non-Lambertian surfaces, variable illumination conditions, untextured surfaces and large displacements. Apart from more sophisticated regularizers, stereo provides a valuable source of information as it can be used to further constrain the problem. Furthermore, depth information allows for a more meaningful parametrization of the problem in 3D object space. Recent algorithms for scene flow estimation leverage this fact (Vogel et al., 2013, Vogel et al., 2014) and provide promising segmentations of the images into individually moving objects (Menze and Geiger, 2015).
In this paper, we build upon the method of (Menze and Geiger, 2015) but go one step further: Instead of simply decomposing the scene into a set of individually moving regions which share a common rigid motion, we decompose the scene into 3D objects and, in addition to the rigid motion, also model their pose and shape in 3D. Towards this goal, we incorporate a deformable 3D model of vehicles into the scene flow estimation process. More specifically, we exploit the eigenspace-based representation of (Zia et al., 2011) which has previously been used in the context of pose estimation from a single image. Given two stereo pairs as input, our model jointly infers the number of vehicles, their shape and pose parameters, as well as a dense 3D scene flow field. The problem is formalized as energy minimization on a conditional random field encouraging projected object hypotheses to agree with the estimated motion and depth. A representative result is shown in Fig. 1, which depicts scene flow estimates projected to disparity and optical flow as well as the result of model-based reconstruction. The remainder of this paper is structured as follows. We first provide a brief summary of related work in Section 2 and a detailed formal description of the proposed method in Section 3. In Section 4, we present results for dynamic scenes on the novel KITTI scene flow dataset proposed by (Menze and Geiger, 2015). We conclude the paper in Section 5.

RELATED WORK
In this section, we provide a brief overview of the state of the art in scene flow estimation as well as related work on integrating 3D models into reconstruction.
Scene flow estimation was first addressed by (Vedula et al., 1999, Vedula et al., 2005), who define scene flow as a flow field describing the 3D motion at every point in the scene. As in classical optical flow estimation (Horn and Schunck, 1980), the problem is often formulated in a coarse-to-fine variational setting (Basha et al., 2013, Huguet and Devernay, 2007, Pons et al., 2007, Valgaerts et al., 2010, Wedel et al., 2011, Vogel et al., 2011) and local regularizers are leveraged to encourage smoothness in depth and motion. As in optical flow estimation, this approach eventually fails to recover large displacements of small objects. Following recent developments in optical flow (Yamaguchi et al., 2013, Nir et al., 2008, Wulff and Black, 2014, Sun et al., 2013) and stereo (Yamaguchi et al., 2014, Bleyer et al., 2011, Bleyer et al., 2012), Vogel et al. (Vogel et al., 2013, Vogel et al., 2014) proposed a slanted-plane model which assigns each pixel to an image segment and each segment to one of several rigidly moving 3D plane proposals, thus casting the task as a discrete optimization problem. Fusion moves are leveraged for solving binary subproblems with quadratic pseudo-boolean optimization (QPBO) (Rother et al., 2007). Their approach yields promising results on challenging outdoor scenes as provided by the KITTI stereo and optical flow benchmarks (Geiger et al., 2012). More recently, (Menze and Geiger, 2015) noticed that many structures in the visual world move rigidly and thus decompose the scene into a small number of rigidly moving objects and the background. They jointly estimate the segmentation as well as the motion of the objects and the 3D geometry of the scene. In addition to segmenting the objects according to their motion (which does not guarantee that instances are separated), in this paper, we propose to also estimate their shape and pose parameters. Thus, we infer a parametrized reconstruction of all moving vehicles in the 3D scene jointly with the 3D scene flow itself.
3D models have a long history in supporting 3D reconstruction from images (Szeliski, 2011). Pioneering work, e.g. by (Debevec et al., 1996), made use of shape primitives to support photogrammetric modelling of buildings. While modelling generic objects, like buildings, is a very challenging task, there are tractable approaches to formalizing the geometry of objects with moderate intra-class variability, like faces and cars. A notable example is the active shape model (ASM) proposed by (Cootes et al., 1995) where principal component analysis of a set of annotated training examples yields the most important deformations between similar shapes. (Bao et al., 2013) compute a mean shape of the observed object class along with a set of discrete anchor points. Using HOG features, they adapt the mean shape to a newly observed instance of the object by registering the anchor points. (Güney and Geiger, 2015) leverage semantic information to sample CAD shapes with an application to binocular stereo matching. (Dame et al., 2013) use an object detector to infer the initial pose and shape parameters for an object model which they then optimize in a variational SLAM framework. Recently, (Prisacariu et al., 2013) proposed an efficient way to compress prior information from CAD models with complex shape variations using Gaussian Process Latent Variable Models. (Zia et al., 2013a, Zia et al., 2013b, Zia et al., 2015) revisited the idea of the ASM and applied it to a set of manually annotated CAD models to derive detailed 3D geometric object class representations. While they tackle the problem of object recognition and pose estimation from single images, in this paper, we make use of such models in the context of 3D scene flow estimation.

METHOD
Our aim is to jointly estimate optimal scene flow parameters for each pixel in a reference image and a parametrized reconstruction of individually moving vehicles as shown in Fig. 2. The proposed algorithm works on the classical scene flow input consisting of two consecutive stereo image pairs of calibrated cameras. We define the first image from the left camera as the reference view. Following the state of the art, we approximate 3D scene geometry with a set of planar segments which are derived from superpixels in the reference view (Yamaguchi et al., 2013). Like (Menze and Geiger, 2015), we assume a finite number of rigidly moving objects in the scene. It is important to note that using this formulation, the background can be considered as yet another object. The only difference is that we do not estimate a 3D model for the background component.
Figure 2. Data and Shape Terms. Each superpixel $s_i$ in the reference view is matched to corresponding image patches in the three remaining views. Its shape and motion are encouraged to agree with the jointly estimated 3D object model.
In this section, we first give a formal definition of our model and the constituting energy terms for data, shape and smoothness. Then, the employed active shape model and the inference algorithm are explained in detail.

Problem Statement
Let $S$ and $O$ denote the set of superpixels and objects, respectively. Each superpixel $s_i \in S$ is associated with a region $R_i$ in the image and a random variable $(n_i, l_i)^T$, where $n_i \in \mathbb{R}^3$ describes a plane in 3D ($n_i^T x = 1$ for points $x \in \mathbb{R}^3$ on the plane) and $l_i \in \{1, \ldots, |O|\}$ is a label assigning the superpixel to an object. Each object $o_k \in O$ is associated with a random variable $(\xi_k, \gamma_k, R_k, t_k)^T$ comprising its state. $\xi_k \in \mathbb{R}^3$ determines the pose, i.e. the position (2D coordinates in the ground plane) and the orientation of the object in terms of its heading angle. $\gamma_k \in \mathbb{R}^2$ contains the parameters determining the shape of the 3D model. $R_k \in SO(3)$ and $t_k \in \mathbb{R}^3$ describe the rigid body motion of object $o_k$ in 3D, i.e. the rotation and translation relating the poses of the object at subsequent time steps. Each superpixel $s_i$ is associated with an object via $l_i$. Thus, the superpixel inherits the rigid motion parameters of the respective object, $(R_{l_i}, t_{l_i}) \in SE(3)$. In combination with the plane parameters $n_i$, this fully determines the 3D scene flow at each pixel inside the superpixel.
Given the left and right input images of two consecutive stereo frames at $t_0$ and $t_1$, our goal is to infer the 3D geometry, i.e. the plane parameters $n_i$ of each superpixel and its object label $l_i$, together with the rigid body motion, the pose and the shape parameters of each object. We specify our model as a conditional random field (CRF) in terms of an energy function over $s$ and $o$ (Eq. 1), which combines a data term $\varphi(\cdot)$ and a shape term $\kappa(\cdot)$ for each superpixel with a smoothness term $\psi(\cdot)$ for each pair of adjacent superpixels. Here, $s = \{s_i | i \in S\}$, $o = \{o_k | k \in O\}$, and $i \sim j$ denotes the set of adjacent superpixels in $S$. We use the same data term $\varphi(\cdot)$ and the same smoothness term $\psi(\cdot)$ as proposed in (Menze and Geiger, 2015), and add a shape term $\kappa(\cdot)$ to model the pose and shape of the objects in 3D. To make the paper self-contained, we will briefly review the data term before we provide the formal description of the novel shape term.
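Written out, an energy with this structure would take the following form. This is a reconstruction from the surrounding definitions (data and shape terms per superpixel, a smoothness term over adjacent superpixels) and not a verbatim copy of Eq. 1; the name $E$ and the subscripts on the potentials are assumed notation.

```latex
E(s, o) = \sum_{i \in S} \big[ \varphi_i(s_i, o) + \kappa_i(s_i, o) \big]
        + \sum_{i \sim j} \psi_{ij}(s_i, s_j)
```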

Data Term
Data fidelity of corresponding image points is enforced with respect to all four input images in a combined data term depending on shape and motion. Since both entities are encoded in different random variables, the data term is defined as a pairwise potential between superpixels and objects, in which $l_i$ assigns superpixel $i$ to a specific object and $[\cdot]$ denotes the Iverson bracket, which returns 1 if the condition in square brackets is satisfied and 0 otherwise. Thus, the actual data term $D_i(n, o)$ is only evaluated with respect to the selected object.
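The selection by the Iverson bracket can be written compactly as follows. This is a sketch reconstructed from the description above, assuming that the potential sums over all objects $k \in O$; the subscript notation is an assumption.

```latex
\varphi_i(s_i, o) = \sum_{k \in O} [l_i = k] \cdot D_i(n_i, o_k)
```

Exactly one bracket is non-zero for each superpixel, so only the cost with respect to the assigned object contributes.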
It comprises three components, a stereo, an optical flow and a cross term, which relate the reference view (left image at $t_0$) to the three remaining images, as depicted in Fig. 2. Note that this term depends on the plane parameters $n$ of the superpixel and the rigid motion parameters of the object $o$. Each sub-term sums matching costs $C$ of all pixels $p$ inside the region $R$ of superpixel $i$. As we assume that the geometry within a superpixel can be approximated by a local plane, we are able to warp pixels from the reference view to the other images using homographies computed from $n$ and $o$. The superscript $x \in \{\text{stereo}, \text{flow}, \text{cross}\}$ of $D^x_i$ indicates which image is compared to the reference view. Without loss of generality, the camera calibration matrix $K \in \mathbb{R}^{3 \times 3}$ is assumed to be the same for both cameras. The matching cost $C_x(p, q)$ is a dissimilarity measure between a pixel at location $p \in \mathbb{R}^2$ in the reference image and a pixel at location $q \in \mathbb{R}^2$ in the target image.
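The plane-induced warp can be illustrated with a minimal sketch. It assumes the convention $x' = R x + t$ for the mapping into the target camera frame, so that on the plane $n^T x = 1$ the warp becomes $K (R + t n^T) K^{-1}$; the sign of the $t n^T$ term depends on this convention and may differ from the one used in the original formulation. All function names and numeric values below are hypothetical.

```python
def matmul3(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def invert_calibration(K):
    """Closed-form inverse of K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def plane_homography(K, R, t, n):
    """Homography induced by the plane n^T x = 1 under x' = R x + t.

    On the plane, t = t (n^T x), hence x' = (R + t n^T) x.
    """
    M = [[R[i][j] + t[i] * n[j] for j in range(3)] for i in range(3)]
    return matmul3(matmul3(K, M), invert_calibration(K))


def warp_pixel(H, p):
    """Apply H to the pixel p = (u, v) in homogeneous coordinates."""
    q = [H[i][0] * p[0] + H[i][1] * p[1] + H[i][2] for i in range(3)]
    return (q[0] / q[2], q[1] / q[2])


def project(K, X):
    """Pinhole projection of the 3D point X."""
    q = [sum(K[i][j] * X[j] for j in range(3)) for i in range(3)]
    return (q[0] / q[2], q[1] / q[2])


# Hypothetical example: fronto-parallel plane at depth 10, pure translation.
K = [[700.0, 0.0, 600.0], [0.0, 700.0, 180.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.5, 0.0, -1.0]
n = [0.0, 0.0, 0.1]
X = [1.0, 0.5, 10.0]                      # lies on the plane: n^T X = 1
p = project(K, X)
X_moved = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
q_direct = project(K, X_moved)            # move the point, then project
q_warped = warp_pixel(plane_homography(K, R, t, n), p)  # warp the pixel
```

Both routes yield the same target pixel, which is what allows matching costs to be evaluated per superpixel without explicit 3D points.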
In this work, we evaluate two types of features and define $C_x(p, q)$ as the weighted sum of matching costs based on dense Census features (Zabih and Woodfill, 1994) and sparse disparity and optical flow observations. The dense matching cost is computed as the truncated Hamming distance between Census features. Pixels leaving the target image are penalized with a truncation value. As precomputed disparity estimates (Hirschmüller, 2008) and optical flow features (Geiger et al., 2011) are not available for every pixel, we calculate $C^{sparse}_x$ only at locations for which observations exist. More specifically, we define $C^{sparse}_x$ as the robust $\ell_2$ distance between the warped pixel $\pi_x(p)$ and the expected pixel $q$, where $\rho_{\tau_i}(x) = \min(|x|, \tau_i)$ denotes the robust truncated penalty function with threshold $\tau_i$ and $\pi_x(p)$ denotes the pixel $p$, warped according to the set of sparse feature correspondences. $\Pi_x$ is the set of pixels in the reference image for which correspondences have been established. For more details, we refer the reader to (Menze and Geiger, 2015).
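The two cost types can be sketched as follows. This is a minimal illustration, assuming Census descriptors are stored as integers (bit strings) and that a missing sparse observation is signalled by `None`; the weights `theta_dense`, `theta_sparse` and the truncation values are hypothetical placeholders, not the settings used in the experiments.

```python
import math


def truncated_hamming(census_p, census_q, max_cost):
    """Hamming distance between two Census bit strings (ints), truncated."""
    return min(bin(census_p ^ census_q).count("1"), max_cost)


def rho(x, tau):
    """Robust truncated penalty rho_tau(x) = min(|x|, tau)."""
    return min(abs(x), tau)


def sparse_cost(warped_p, q, tau):
    """Truncated l2 distance between the feature-warped pixel and q."""
    return rho(math.hypot(warped_p[0] - q[0], warped_p[1] - q[1]), tau)


def matching_cost(census_p, census_q, warped_p, q,
                  theta_dense=1.0, theta_sparse=1.0,
                  max_hamming=16, tau=4.0):
    """Weighted sum of the dense and the sparse matching cost.

    warped_p is None where no sparse observation exists for the pixel,
    in which case only the dense cost contributes.
    """
    cost = theta_dense * truncated_hamming(census_p, census_q, max_hamming)
    if warped_p is not None:
        cost += theta_sparse * sparse_cost(warped_p, q, tau)
    return cost
```

The truncation keeps a single gross mismatch from dominating the sum over all pixels of a superpixel.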

Shape and Pose Consistency Term
Our novel shape consistency term enforces consistency between the 3D plane of superpixel $s_i$ and the pose and shape of the referenced object. Similarly to the data term, we can take advantage of the fact that this term decomposes into computationally tractable pairwise potentials between superpixels and objects (Eq. 3). Its first part, $S_i(n_i, o_k)$, enforces consistency between the shape of object $o_k$ and the 3D plane described by $n_i$. In analogy with the data term, shape consistency is evaluated with respect to the object associated with the superpixel via $l_i$. The penalty function $S_i$ distinguishes two cases: $C^{bg}$ denotes a constant penalty for superpixels associated with the background, and $C^{obj}_i(n, o)$, used for all other superpixels, denotes the sum of the truncated absolute differences between the 3D model of object $o_k$ projected to a disparity map (see Section 3.5) and the disparities induced by the 3D plane $n_i$. Differences are computed for all pixels inside $R_i$ which coincide with the projection of $o_k$. Remaining, uncovered pixels are penalized with a multiple of $C^{bg}$. Note that in contrast to the data term $D_i$, this term evaluates the consistency between the deformed shape model and the reconstructed superpixels.
The second part of Eq. 3 is the occlusion penalty $O_{ik}$. It penalizes a possible overlap between parts of a foreground model and superpixels that are assigned to a different object, as determined by the arguments of the leading Iverson bracket. The overlap penalty itself is chosen to be proportional to the overlap of the projected model of object $o_k$ with the superpixel $s_i$. This term is crucial to prevent object models from exceeding the true object boundaries.
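The foreground penalty $C^{obj}_i$ can be sketched as follows. The sketch assumes a rectified stereo pair with focal length `f`, principal point `(cx, cy)` and baseline `baseline`, and it represents the rendered model as a dictionary from pixels to disparities; these names, the truncation `tau` and the factor `uncovered_factor` are hypothetical. The plane-induced disparity follows from $n^T x = 1$: a pixel back-projects to $x = Z r$, so $1/Z = n^T r$.

```python
def plane_disparity(n, p, f, cx, cy, baseline):
    """Disparity induced at pixel p by the plane n^T x = 1.

    A pixel back-projects to x = Z * r with r = ((u-cx)/f, (v-cy)/f, 1),
    so n^T x = 1 gives 1/Z = n^T r and d = f * baseline / Z.
    """
    r = ((p[0] - cx) / f, (p[1] - cy) / f, 1.0)
    inv_depth = n[0] * r[0] + n[1] * r[1] + n[2] * r[2]
    return f * baseline * inv_depth


def object_shape_cost(region, model_disparity, n, camera,
                      tau, c_bg, uncovered_factor):
    """Sum of truncated absolute disparity differences over a superpixel.

    region:          iterable of pixels (u, v) inside the superpixel
    model_disparity: dict mapping pixels covered by the rendered model
                     to the model disparity at that pixel
    Pixels not covered by the model pay uncovered_factor * c_bg.
    """
    cost = 0.0
    for p in region:
        if p in model_disparity:
            diff = model_disparity[p] - plane_disparity(n, p, **camera)
            cost += min(abs(diff), tau)
        else:
            cost += uncovered_factor * c_bg
    return cost
```

For example, with a plane at depth 10, `f = 700` and a baseline of 0.5, every pixel has a plane disparity of 35, so a model pixel rendered at disparity 40 contributes the truncated difference `min(5, tau)`.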

Smoothness Term
To encourage smooth surface shape and orientation as well as compact objects, a smoothness potential consisting of three weighted terms is defined on the CRF. The weights $\theta$ control the influence of the three constituting terms. First, regularization of depth is achieved by penalizing different disparity values $d$ at shared boundary pixels $B_{ij}$. Second, the orientation of neighboring planes is encouraged to be similar by evaluating the difference of plane normals $n$. Finally, coherence of the assigned object indices is enforced by an orientation-sensitive Potts model. The weight $w(\cdot, \cdot)$ in the coherence term prefers motion boundaries that coincide with folds in 3D. Here, $\lambda$ is the shape parameter of the penalty function, which is normalized by the number of shared boundary pixels $|B_{ij}|$. As the smoothness term is adopted unchanged, we refer the reader to (Menze and Geiger, 2015) for its exact definition.
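To make the structure of the three terms concrete, the following sketch shows one possible instantiation. The functional forms are assumptions chosen to match the verbal description (truncated penalties, a Potts term whose weight decays with the mean squared disparity difference at the boundary and with the angle between the normals); they are not taken from the original equations, and all names and parameters are hypothetical.

```python
import math


def cos_between(n_i, n_j):
    """Absolute cosine of the angle between two plane normals."""
    dot = sum(a * b for a, b in zip(n_i, n_j))
    norm = math.sqrt(sum(a * a for a in n_i)) * math.sqrt(sum(b * b for b in n_j))
    return abs(dot) / norm


def depth_term(d_i, d_j, tau):
    """Truncated disparity differences summed over shared boundary pixels."""
    return sum(min(abs(a - b), tau) for a, b in zip(d_i, d_j))


def orientation_term(n_i, n_j, tau):
    """Truncated dissimilarity of the two plane normals."""
    return min(1.0 - cos_between(n_i, n_j), tau)


def boundary_weight(n_i, n_j, d_i, d_j, lam):
    """Assumed weight: small at depth discontinuities and at folds."""
    mean_sq = sum((a - b) ** 2 for a, b in zip(d_i, d_j)) / len(d_i)
    return math.exp(-lam * mean_sq) * cos_between(n_i, n_j)


def coherence_term(l_i, l_j, weight):
    """Potts model: pay the weight only if the object labels differ."""
    return weight if l_i != l_j else 0.0
```

Under this assumed weight, a label change between two coplanar, continuous superpixels costs the full weight, whereas a label change across a fold or a depth jump is cheap.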

3D Object Model
For encoding prior knowledge about the objects $\{o_k | k \in O\}$ and in order to restrict the high-dimensional space of possible shapes, we follow (Zia et al., 2013b) and use their 3D active shape model.
In particular, we apply principal component analysis to a set of characteristic keypoints on manually annotated 3D CAD models. This results in a mean model over vertices as well as the directions of the most dominant deformations between the samples in the training set. In our CRF, the shape parameters $\gamma_k$ of object $o_k$ are optimized for consistency with the jointly estimated superpixels. The deformed vertex positions $v$ are specified by a linear sub-space model, i.e. the vertex mean $m$ plus a linear combination of the eigenvectors $e_i$ with the entries of $\gamma_k$ as coefficients, where $e_i$ denotes the $i$-th eigenvector scaled by the corresponding standard deviation (the square root of its eigenvalue). We define a triangular mesh for the vertices $v(\gamma_k)$, transform it according to the object pose $\xi_k$ and render a virtual disparity map¹ for the reference image in order to calculate the shape consistency term in Section 3.3.
Fig. 3 depicts the mean shape in the center and deformed versions of the model on the left and right, illustrating the range of different layouts covered by the first two principal components. While the first principal component accounts mostly for the size of the object, the second component determines its general shape. We limit our model to the first two principal components as we found this to be an appropriate tradeoff between model complexity and the quality of the approximation.
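The linear sub-space model and the subsequent pose transformation can be sketched as follows. The sketch assumes flattened vertex coordinates `[x0, y0, z0, x1, ...]`, a ground plane spanned by the x and z axes and a heading angle that rotates about the vertical y axis; this axis convention, the function names and the toy values are assumptions made for illustration.

```python
import math


def deform(mean, eigenvectors, gamma):
    """v(gamma) = m + sum_i gamma_i * e_i on flattened vertex coordinates.

    The eigenvectors are assumed to be pre-scaled by their standard
    deviations, so gamma is expressed in units of standard deviations.
    """
    v = list(mean)
    for g, e in zip(gamma, eigenvectors):
        for j in range(len(v)):
            v[j] += g * e[j]
    return v


def place(vertices, xi):
    """Rotate by the heading about the y axis, then shift in the x-z plane.

    xi = (x0, z0, heading) holds the 2D ground-plane position and heading.
    """
    x0, z0, heading = xi
    c, s = math.cos(heading), math.sin(heading)
    out = []
    for j in range(0, len(vertices), 3):
        x, y, z = vertices[j:j + 3]
        out += [c * x + s * z + x0, y, -s * x + c * z + z0]
    return out


# Toy model with two vertices and two deformation modes (hypothetical values).
mean = [0.0, 0.0, 0.0, 2.0, 0.0, 0.0]
modes = [[-0.5, 0.0, 0.0, 0.5, 0.0, 0.0],   # first mode: overall length
         [0.0, 0.1, 0.0, 0.0, 0.1, 0.0]]    # second mode: height
stretched = deform(mean, modes, [2.0, 0.0])
placed = place(stretched, (5.0, 10.0, 0.0))
```

Setting `gamma` to zero reproduces the mean shape, which corresponds to the initialization used in the experiments.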

Inference
Due to the inherent combinatorial complexity and the mixed discrete-continuous variables, optimizing the CRF specified in Eq. 1 with respect to all superpixels and objects is an NP-hard problem. To minimize the energy, we iteratively and adaptively discretize the domains of the continuous variables in the outer loop of a max-product particle belief propagation (MP-PBP) framework (Trinh and McAllester, 2009, Pacheco et al., 2014).
In the inner loop, we employ sequential tree-reweighted message passing (TRW-S) (Kolmogorov, 2006) to infer an approximate solution given the current set of particles.
¹ http://www.cvlibs.net/software/librender/
To keep the computational burden tractable, we perform informed sampling of pose and shape parameters. In each iteration of the outer loop, we draw 50 particles, jointly sampling pose and shape from normal distributions centered at the preceding MAP solution. The respective standard deviations are iteratively reduced.
To prune the proposals, the shape consistency term, Eq. 3, is evaluated for each particle with respect to the superpixels' MAP solution of the previous iteration. Only the best particle is kept and introduced into the optimization of Eq. 1.
In our implementation, we further use 10 shape particles for each superpixel, 5 particles for object motion, and 10 iterations of MP-PBP. All motion particles and half of the superpixel plane particles are drawn from a normal distribution centered at the MAP solution of the last iteration. The remaining plane particles are proposed using the plane parameters from spatially neighboring superpixels.
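The informed sampling and pruning of pose and shape particles can be sketched as follows. This is a simplified stand-in, not the full procedure: the CRF inference with MP-PBP and TRW-S is replaced by directly adopting the best particle as the next state, `cost` is any callable playing the role of the shape consistency term, and the decay factor of the standard deviations is a hypothetical value.

```python
import random


def propose(map_state, sigmas, n_particles, rng):
    """Draw particles from independent normals centred at the MAP state."""
    return [[rng.gauss(m, s) for m, s in zip(map_state, sigmas)]
            for _ in range(n_particles)]


def best_particle(particles, cost):
    """Prune the proposals: keep only the particle with the lowest cost."""
    return min(particles, key=cost)


def refine(initial, sigmas, cost, iterations=10, n_particles=50,
           decay=0.8, seed=0):
    """Outer loop: sample around the current state, keep the best, shrink.

    Simplification: the best particle directly becomes the next state,
    whereas the full method passes it on to the CRF optimization.
    """
    rng = random.Random(seed)
    state, sigmas = list(initial), list(sigmas)
    for _ in range(iterations):
        particles = propose(state, sigmas, n_particles, rng)
        state = best_particle(particles, cost)
        sigmas = [s * decay for s in sigmas]   # reduce the search radius
    return state
```

A state vector here would hold the three pose and two shape parameters of one object, e.g. `refine([0.0] * 5, [1.0] * 5, cost)`.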

EXPERIMENTAL RESULTS
To demonstrate the value of our approach, we process challenging scenes from the scene flow dataset proposed by (Menze and Geiger, 2015). As we evaluate additional metrics regarding the quality of the estimated objects, we use a set of representative training images for which ground truth information is publicly available. The observations evaluated in the data term comprise densely computed differences of Census features and additional sparse features. We use optical flow from feature point correspondences (Geiger et al., 2011) and precomputed disparity maps using semi-global matching (SGM) (Hirschmüller, 2008). Sparse cross features, connecting the reference view with the right image at $t_1$, are computed by combining the optical flow matches with valid disparities from the SGM maps. We initialize all superpixel boundaries and their shape parameters using the StereoSLIC algorithm (Yamaguchi et al., 2013) with a parameter setting that yields approximately 1000 superpixels for the used input images.
One typical oversegmentation of a car is depicted in Fig. 4. While most of the outline is faithfully recovered, shadows can lead to bleeding artifacts.
Table 1. This table shows the benefits of integrating the proposed object model, evaluated for all results shown in the paper. We specify the percentage of outliers with respect to disparity estimates in the subsequent stereo pairs (D1, D2), optical flow in the reference frame (Fl) and the complete scene flow vectors (SF). See text for details.
Rigid body motions are initialized by greedily extracting motion estimates from sparse scene flow vectors (Geiger et al., 2011) as follows: We iteratively estimate rigid body motions using the 3-point RANSAC algorithm on clusters of similar motion vectors and choose promising subsets with a large number of inliers using non-maxima suppression. The mean positions and the moving direction of the best hypotheses are used as initial values for the object pose parameters $\xi$. This leads to spurious object hypotheses, as evidenced by Fig. 5, which are pruned during inference because no superpixels are assigned to them. In our experiments, $\gamma$ comprises two shape parameters controlling the two most significant principal components of the ASM. We initialize each object with the mean shape of the model by setting its shape parameters $\gamma$ to zero. To compute the shape consistency term in Eq. 3, we use OpenGL to render all object proposals and compare the resulting disparity maps to those induced by the shape particles of each superpixel. In our non-optimized implementation, inference takes more than one minute on a single core; thus, the method is not yet applicable to scenarios with real-time constraints.
Qualitative Results: Fig. 5 and Fig. 6 illustrate resulting disparity, optical flow and wire-frame renderings of the object models superimposed on the respective reference views of eight representative scenes. The top part of each sub-figure depicts the layout after initialization as described above. In most cases, the shapes do not match the observed cars and there are some significant positional offsets. In addition, there are spurious objects initialized due to wrong object hypotheses. The lower part shows our reconstruction results after optimizing Eq. 1. Objects which are not referred to by any of the superpixels are considered absent and thus not drawn. For all examples shown in Fig. 5, the model position is successfully aligned with the observed object and the shape of the model is faithfully adapted to the depicted cars. Spurious hypotheses are removed, demonstrating the intrinsic model selection capabilities of our approach. Sub-figures (b,c) of Fig. 6 contain successfully reconstructed cars in the foreground. Some of the spurious objects are removed while others remain in the final result. This is due to strong erroneous motion cues in the respective image patches contradicting the estimated background motion. Note that for visualization we only render fully visible faces of the CAD models. The last sub-figure (d) of Fig. 6 shows a failure case of the approach: Here, object hypotheses with many inliers occur in the very challenging regions next to the road. The numbers in the sub-captions specify the intersection-over-union (IOU) of the estimated object shape with respect to ground truth at initialization and after optimization as explained in the next paragraph.
Shape Adaptation: To quantify the improvement gained by optimizing the model parameters, we evaluate the intersection-over-union criterion which is frequently used for evaluating segmentation and object detection in the literature. In particular, we compare the ground truth mask of the annotated objects to the mask of the projected 3D model as inferred by our method. We discard objects without successful initialization and report the intersection-over-union averaged over all detected cars (Table 2).
Scene Flow Error: The quantitative effect of incorporating the 3D object model is shown in Table 1, which specifies the mean percentage of outliers for all eight examples shown in Fig. 5 and Fig. 6 using the evaluation metrics proposed in (Menze and Geiger, 2015), i.e., a pixel is considered an outlier if the error of the estimated disparity (D1, D2) or optical flow (Fl) exceeds 3 pixels as well as 5% of its true value. As a baseline, we optimize Eq. 1 without the shape consistency term $\kappa$ and sample only motion particles for the objects instead, corresponding to the method of (Menze and Geiger, 2015). In contrast, our full model ("Ours") also optimizes shape and pose parameters of the 3D model as described in Section 3. Table 1 shows that the performance for background regions (bg) slightly decreases in all categories while there is a significant improvement of 5 percentage points for the foreground objects (fg) and moderately improved results for the combined scene flow metric (bg&fg).
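Both evaluation criteria are simple to state in code. The sketch below assumes masks are given as sets of pixel coordinates and shows the outlier test for scalar quantities such as disparities; for optical flow, the error and the true value would be the end-point error and the magnitude of the true flow vector. The function names are hypothetical.

```python
def intersection_over_union(mask_a, mask_b):
    """IoU of two masks given as sets of (u, v) pixel coordinates."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)


def is_outlier(estimate, truth, abs_thresh=3.0, rel_thresh=0.05):
    """Outlier if the error exceeds 3 px and also 5% of the true value."""
    error = abs(estimate - truth)
    return error > abs_thresh and error > rel_thresh * abs(truth)


def outlier_percentage(estimates, truths):
    """Percentage of outliers among paired estimates and ground truth."""
    flags = [is_outlier(e, t) for e, t in zip(estimates, truths)]
    return 100.0 * sum(flags) / len(flags)
```

Requiring both thresholds to be exceeded means that an error of 4 pixels on a true disparity of 100 pixels is not counted as an outlier, since it stays below 5% of the true value.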

CONCLUSIONS
We extended the scene flow algorithm of (Menze and Geiger, 2015) by a deformable 3D object model to jointly recover the 3D scene flow as well as the 3D geometry of all vehicles in the scene. Our results show that the estimation of only 5 model parameters yields accurate parametric reconstructions for a range of different cars. In the future, we plan to incorporate additional observations of a class-specific object detector as well as to estimate motion over multiple frames in order to improve completeness of the retained objects and to further increase robustness against spurious outliers.

Figure 1. Result of Model-based Reconstruction. The green wire-frame representation is superimposed on the inferred disparity (top) and optical flow map (bottom).
Figure 5. Qualitative Results. Each sub-figure shows our results at initialization (top) and after optimization (bottom). The reference view is superimposed with the color-coded disparity (left) and optical flow map (right). Object models are depicted as green and blue wire-frames. The numbers in the sub-captions specify the value of the intersection-over-union criterion at initialization and after optimization.

Table 2. Model Coverage. Intersection-over-union (IOU), averaged over all foreground objects before and after optimization.