Jacob H. White and Randal W. Beard,
Abstract—This paper introduces a new algorithm for estimating the relative pose of a moving camera using consecutive frames of a video sequence. State-of-the-art algorithms for calculating the relative pose between two images use matching features to estimate the essential matrix. The essential matrix is then decomposed into the relative rotation and normalized translation between frames. To be robust to noise and feature match outliers, these methods generate a large number of essential matrix hypotheses from randomly selected minimal subsets of feature pairs, and then score these hypotheses on all feature pairs. Alternatively, the algorithm introduced in this paper calculates relative pose hypotheses by directly optimizing the rotation and normalized translation between frames, rather than calculating the essential matrix and then performing the decomposition. The resulting algorithm improves computation time by an order of magnitude. If an inertial measurement unit (IMU) is available, it is used to seed the optimizer. In addition, we reuse the best hypothesis at each iteration to seed the optimizer, thereby reducing the number of relative pose hypotheses that must be generated and scored. These advantages greatly speed up performance and enable the algorithm to run in real time on low-cost embedded hardware. We show the application of our algorithm to visual multi-target tracking (MTT) in the presence of parallax and demonstrate its real-time performance on a 640 × 480 video sequence captured on a UAV. Video results are available at https://youtu.be/HhK-p2hXNnU.
I. Introduction
ESTIMATING camera motion from a video sequence has many applications in robotics, including target tracking, visual odometry, and 3D scene reconstruction. These applications often require on-board processing of the video sequence in real time and thereby impose size, weight, and power (SWAP) constraints on the computing platform.
One method to estimate motion from a video sequence is to calculate the essential matrix between consecutive frames. The essential matrix relates the homogeneous image coordinates between frames using the epipolar constraint. After the essential matrix has been determined, it can be decomposed into a rotation and a normalized translation to determine the relative motion of the camera between frames. In order to be robust to noise and feature mismatches, the essential matrix is typically estimated by generating a large number of hypotheses from five-point minimal subsets of matching features, and selecting the best hypothesis using either random sample consensus (RANSAC) [1] or least median of squares (LMedS) [2]. When using RANSAC, the hypotheses are scored by counting the number of inlier points from the entire set. When using LMedS, the hypotheses are scored by calculating the median error.
State-of-the-art methods calculate essential matrix hypotheses directly from each five-point minimal subset. One of the best known methods is Nister's algorithm [3]. Nister showed that for five matching points, there are potentially ten essential matrices that satisfy the constraints, each corresponding to a real root of a tenth-order polynomial generated from the data. There are many open-source implementations of Nister's five-point algorithm, including OpenCV's findEssentialMat function [4]. However, constructing, solving, and extracting the essential matrix from this tenth-order polynomial is complex and can be computationally expensive. Furthermore, since each minimal subset produces up to ten hypotheses, it can be time consuming to score them.
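For reference, the following is a minimal sketch of this standard estimate-then-decompose pipeline using OpenCV's findEssentialMat and recoverPose (the baseline compared against later in the paper); the intrinsics and the synthetic scene are placeholder values for illustration only.

```python
import cv2
import numpy as np

# Camera intrinsics (placeholder values).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

# Synthetic scene: random 3D points in front of the camera, viewed from two poses.
rng = np.random.default_rng(0)
X = rng.uniform([-2, -2, 4], [2, 2, 8], size=(100, 3))      # world points
R_true, _ = cv2.Rodrigues(np.array([0.0, 0.05, 0.0]))       # small rotation between frames
t_true = np.array([0.2, 0.0, 0.0])                          # camera translation

pts1, _ = cv2.projectPoints(X, np.zeros(3), np.zeros(3), K, None)
pts2, _ = cv2.projectPoints(X, cv2.Rodrigues(R_true)[0], t_true, K, None)
pts1, pts2 = pts1.reshape(-1, 2), pts2.reshape(-1, 2)

# Estimate the essential matrix with RANSAC (cv2.LMEDS is also supported),
# then decompose it; recoverPose applies the cheirality check internally.
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                  prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
```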
As an alternative to directly calculating essential matrix solutions, some authors [5]–[9] propose solving for the essential matrix using non-linear optimization algorithms such as Gauss-Newton (GN) and Levenberg-Marquardt (LM). Since the essential matrix has nine entries but only five degrees of freedom, the optimization is performed on the five-dimensional essential matrix manifold. There are a number of ways to define the essential matrix manifold. Some authors define the manifold using a rotation and a translation unit vector, which are elements of SO(3) and S^2, respectively [5], [6]. Others define the manifold using two elements of SO(3) [8], [9].
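A minimal numerical illustration of the SO(3) × S^2 parameterization, assuming the common convention E = [t]_x R; the function names are illustrative, not taken from any of the cited implementations.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def skew(v):
    """Return the 3x3 skew-symmetric matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def essential_from_pose(rotvec, t):
    """Build E = [t]_x R from a rotation vector (so(3)) and a translation
    direction t constrained to the unit sphere S^2."""
    R = Rotation.from_rotvec(rotvec).as_matrix()
    t = t / np.linalg.norm(t)            # enforce the S^2 constraint
    return skew(t) @ R

def epipolar_residuals(E, p1, p2):
    """Algebraic epipolar residuals p2^T E p1 for homogeneous image points
    p1, p2 of shape (N, 3)."""
    return np.einsum('ni,ij,nj->n', p2, E, p1)
```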
A third method of optimizing on a manifold is described in [7]. This approach, called LM Basis, calculates the four essential matrix basis vectors in the nullspace of the essential matrix equation using SVD. The four coefficients of these basis matrices are solved for on the S^3 manifold. In contrast to the previously described methods, which operate on five-dimensional manifolds, this method uses a three-dimensional manifold.
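As a sketch of how such a four-dimensional basis can be obtained, each of the five matches contributes one row of a 5 × 9 epipolar data matrix, and the right singular vectors associated with the four smallest singular values span its nullspace; the exact construction used in [7] may differ.

```python
import numpy as np

def essential_basis(p1, p2):
    """Given five homogeneous matches p1, p2 (each 5x3), build the 5x9
    epipolar data matrix and return four 3x3 matrices spanning its nullspace."""
    A = np.array([np.kron(q2, q1) for q1, q2 in zip(p1, p2)])   # 5x9, rows p2^T (x) p1^T
    _, _, Vt = np.linalg.svd(A)
    return [Vt[i].reshape(3, 3) for i in range(5, 9)]           # last four right singular vectors
```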
During each iteration of the optimization algorithm, the optimizer step is solved for in terms of the three or five degrees of freedom along the manifold. The computational requirements of the resulting scheme are significantly less than Nister's five-point algorithm. However, one weakness of optimization-based solvers is that they only find one of the ten possible essential matrices at a time. Finding all solutions requires additional optimization runs with different initialization points. The optimization method is also sensitive to initial conditions, which can cause the optimizer to fail to produce a valid solution. For example, GN may diverge if the initial guess is too far from the true solution. LM can be used to prevent increases in the cost function, but may still fail to converge. Because of the need to run the optimizer multiple times from different initial conditions, these existing optimization-based solvers might not necessarily be faster than the direct essential matrix solvers if the same level of accuracy is desired. However, not all ten possible solutions are needed to achieve accuracy comparable to direct essential matrix solvers if the best solution can be found the first time.
After the essential matrix is found, it must then be decomposed into a rotation and normalized translation. Given an essential matrix, there are four possible rotation-translation pairs [10]. The correct rotation-translation pair is typically determined using the Cheirality check, which ensures that matching features are in front of both cameras. However, the Cheirality check is sensitive to noise in the image and frequently returns the wrong decomposition.
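The sketch below illustrates this check: the essential matrix is decomposed into its two candidate rotations, each is paired with both signs of the translation, and a single match is triangulated to test for positive depth in both cameras. The helper name and conventions are assumptions, not the paper's implementation.

```python
import cv2
import numpy as np

def cheirality_select(E, p1, p2, K):
    """Pick the (R, t) pair for which a triangulated point lies in front of
    both cameras. p1, p2 are matched pixel locations as float arrays of shape (2,)."""
    R1, R2, t = cv2.decomposeEssentialMat(E)
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])            # first camera at the origin
    best = None
    for R, tt in [(R1, t), (R1, -t), (R2, t), (R2, -t)]:          # the four candidate pairs
        P1 = K @ np.hstack([R, tt.reshape(3, 1)])                 # second camera
        Xh = cv2.triangulatePoints(P0, P1, p1.reshape(2, 1), p2.reshape(2, 1))
        X = Xh[:3] / Xh[3]                                        # dehomogenize
        depth1 = float(X[2])                                      # depth in camera 1
        depth2 = float((R @ X + tt.reshape(3, 1))[2])             # depth in camera 2
        if depth1 > 0 and depth2 > 0:
            best = (R, tt)
    return best
```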
The main contribution of this paper is a novel optimization-based algorithm that directly solves for the relative pose using the epipolar constraint in the cost function. If an inertial measurement unit (IMU) is available, then it is used to seed the optimization algorithm at the next time step. When an IMU is not available, since the rotation and translation between consecutive video frames are similar to those of nearby frames, we use the relative pose estimate from the previous time step to initialize the optimization at the current time step. At each iteration, we use the current best hypothesis to seed the LM algorithm.
We show that this approach significantly reduces the number of hypotheses that must be generated and scored to estimate the pose, thus allowing real-time execution of the algorithm on a Jetson TX2 processor.
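As a rough illustration of the idea only (not the authors' Algorithm 2, which is developed in Section III), the sketch below refines a single five-point hypothesis with a Levenberg-Marquardt solver over a 5-dof parameterization: a rotation vector plus two spherical angles for the unit translation, starting from a seed such as the IMU prediction or the best pose so far. The parameterization and helper names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]], float)

def unit_from_angles(theta, phi):
    """Translation direction on S^2 parameterized by two angles."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def residuals(x, p1, p2):
    """Algebraic epipolar residuals for a 5-dof pose x = (rotvec, theta, phi);
    p1, p2 are homogeneous image points of shape (5, 3)."""
    R = Rotation.from_rotvec(x[:3]).as_matrix()
    E = skew(unit_from_angles(x[3], x[4])) @ R
    return np.einsum('ni,ij,nj->n', p2, E, p1)

def solve_pose_hypothesis(p1, p2, seed):
    """Refine one relative-pose hypothesis from five matches, starting at the
    seed (e.g. an IMU prediction or the best hypothesis found so far)."""
    sol = least_squares(residuals, seed, args=(p1, p2), method='lm')
    return sol.x
```

With this kind of seeding, only one hypothesis per minimal subset needs to be generated and scored, which is the property the results in Section V exploit.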
The remainder of the paper is organized as follows. The problem is formally stated in Section II. The new pose estimation algorithm is developed in Section III. Application of the algorithm to target tracking in the presence of parallax is described in Section IV. Simulation and flight results on a quadrotor UAV are presented in Section V, and conclusions are given in Section VI.
One application of relative pose estimation is motion detection and tracking in the presence of parallax. Motion detection is a valuable source of information in target tracking applications. It can be used to track objects without any prior knowledge about their appearance, in contrast to many trackers that are designed to track specific classes of objects.
There are many successful image-based background subtraction techniques in the literature that work on stationary cameras. In order for image differencing techniques to work on a moving camera, the images must first be aligned using a homography. While this works well for planar scenes, if there is parallax, artifacts will appear in the difference image. If the parallax is small enough in comparison to the movement of objects in the scene, the effects of parallax can be reduced using simple morphological operations and gradient suppression [17].
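A minimal sketch of this homography-based alignment and differencing using standard OpenCV calls; the threshold and kernel size are illustrative.

```python
import cv2
import numpy as np

def aligned_difference(prev_gray, cur_gray, pts_prev, pts_cur):
    """Warp the previous frame onto the current one with a homography and
    take an absolute difference; parallax shows up as residual artifacts."""
    H, _ = cv2.findHomography(pts_prev, pts_cur, cv2.RANSAC, 3.0)
    warped = cv2.warpPerspective(prev_gray, H, (cur_gray.shape[1], cur_gray.shape[0]))
    diff = cv2.absdiff(cur_gray, warped)
    # Simple morphological cleanup, in the spirit of [17], to suppress small parallax artifacts.
    diff = cv2.morphologyEx(diff, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    return diff
```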
In the presence of strong parallax, however, a better motion model that accounts for depth variation must be used. There are several methods in the literature that use a geometric model to describe the motion of tracked points in the scene over time. For example, [18] uses orthographic projections to segment moving objects, which works well if the camera is far from the scene or has a narrow field of view. Another approach maintains an affinity matrix and uses principal component analysis to segment moving objects [19]. Another approach uses multiple-frame geometric constraints [20], [21]. However, all of these methods can be computationally prohibitive. In contrast, the technique proposed in this paper exploits the two-frame epipolar constraint and is therefore computationally simple, enabling real-time performance.
A. Motion Detection Algorithm
Given two consecutive frames, with point correspondences detected in each frame, the objective is to determine which points are from stationary objects and which are from moving objects. In this section we assume that the relative pose between the camera frames has been calculated using Algorithm 2. The goal is to design a detector that returns 1 if the feature pair (p_i^{k1}, p_i^{k2}) is sourced from a moving object and 0 if it is sourced from a stationary background object. The output of the motion detector is used as an input to a tracking algorithm that produces target tracks as described in Section IV-B.
The essential matrix relates points in one image to the other image via the epipolar constraint. In other words, the essential matrix maps a point in one image to a line in the other image. The location where the point appears along this line in the other image depends on the feature's depth relative to the camera. As the camera translates, points that are closer to the camera will appear to move more than points that are far away. This effect is known as parallax.
There are two degrees of freedom for the apparent motion of each point in the image plane. One of these degrees of freedom can be explained by the epipolar constraint if the real-world point is stationary. However, motion along this degree of freedom can also be explained by object motion in the world frame. Hence, the source of any movement along this degree of freedom is ambiguous without additional information. The second degree of freedom for apparent motion of points in the image plane is perpendicular to the epipolar line. Thus, the only possible source of motion along this degree of freedom is movement in the real-world frame.
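A simplified sketch of this decomposition in normalized image coordinates, where the apparent motion is taken as p2 − p1 and split into components along and perpendicular to the epipolar line; the paper's exact velocity definitions may differ.

```python
import numpy as np

def motion_components(E, p1, p2):
    """Split apparent point motion into components along and perpendicular to the
    epipolar line of p1 in the second image. p1, p2 are homogeneous (3,) points.
    Returns (parallax_velocity, perpendicular_velocity)."""
    line = E @ p1                                 # epipolar line (a, b, c) in image 2
    n = np.linalg.norm(line[:2])
    perpendicular = float(line @ p2) / n          # signed distance of p2 from the line
    t_dir = np.array([-line[1], line[0]]) / n     # unit vector along the line
    parallax = float((p2[:2] - p1[:2]) @ t_dir)   # apparent motion along the line
    return parallax, perpendicular
```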
To streamline the notation, we will drop the i subscript in the following discussion. The epipolar line in frame F_{k2} corresponding to the point p^{k1} is given by
The performance of the pose estimation algorithm will be demonstrated in simulation and with real flight tests. In the simulation study outlined in Section V-A we know the true pose, and so we are able to assess the accuracy of pose estimation. In the flight test outlined in Section V-B, we apply pose estimation to motion detection and tracking as described in Section IV.
Fig. 3. Screenshot of the holodeck video sequence.
The error over time for the OpenCV Nister/LMedS polynomial solver [4], LM Basis [7], and Algorithm 2 is shown in Fig. 4. All three algorithms give low error for the UAV trajectory. Notice how the rotation error seems to be proportional to the total rotation, while the translation error becomes very large as the true translation approaches zero.
Fig. 4. Incremental rotation and translation error over the entire video sequence. True incremental translation and rotation are also shown.
In order to provide a fair comparison of initialization schemes for Algorithm 2 with the OpenCV five-point polynomial solver, the IMU was not used in this section for initialization. Instead, we compare four schemes: random initialization, where both the rotation and translation are selected randomly; random recursive initialization, where the first LM optimization at each time step is initialized randomly, but subsequent LM optimizations at that time step are seeded with the best pose hypothesis found so far; prior initialization, where the seed is the best pose from the previous time step; and prior recursive initialization, where the first LM optimization at each time step is seeded with the best pose from the previous time step, and subsequent LM optimizations at that time step are seeded with the best pose hypothesis found so far. All results use LMedS for outlier rejection. The LM optimization is repeated 100 times with five matching features at each iteration. The mean error across the entire video sequence is plotted in Fig. 5.
Fig. 5. Comparison of LM seeding methods.
This result shows the importance of initializing the optimizer with a prior. The random initialization method performs the worst out of all four methods, while initializing the optimizer with a prior from the previous time step or with the best LMedS hypothesis so far from the current time step significantly reduces the error. IMU prediction will further improve these results. After 100 iterations, the LMedS error for the initialization methods that use prior information is comparable to the OpenCV five-point polynomial solver, despite the fact that only one hypothesis is generated per subset instead of an average of about four hypotheses. LM Basis also generates hypotheses from random seeds, resulting in higher error than the methods that initialize from a prior. Note that while LM Basis generates up to ten hypotheses per subset, it removes duplicates and hypotheses that do not appear to converge, resulting in an average of only 1.82 hypotheses per subset. Dropping these hypotheses early may increase the error at a given iteration count, but results in a lower error when measured against computation time.
Fig. 6 shows the error of Algorithm 2 compared to the OpenCV five-point algorithm, but with the x-axis changed to be time instead of number of iterations. When under a time constraint, Algorithm 2 significantly outperforms the OpenCV solver [4] and LM Basis [7].
Fig. 6. Sampson error over time.
Fig. 7. Average rotation and translation errors for RANSAC and LMedS.
For outlier rejection, RANSAC and LMedS were also compared. For RANSAC, the algorithm was tested with 19 different thresholds. For LMedS, the algorithm was run once, because there is no threshold parameter to tune. For each run, the average true rotation and translation error over the entire video sequence was calculated. As shown in Fig. 7, LMedS performs well without requiring a threshold. However, in order for RANSAC to perform as well as LMedS, the threshold must be tuned to within an order of magnitude of the optimal threshold.
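Schematically, the two scoring rules differ only in whether a threshold is needed; the sketch below assumes per-point squared Sampson errors have already been computed for a hypothesis.

```python
import numpy as np

def lmeds_score(sq_errors):
    """LMedS: a lower median of the squared errors is better; no threshold needed."""
    return np.median(sq_errors)

def ransac_score(sq_errors, threshold):
    """RANSAC: more inliers (squared error below a hand-tuned threshold) is better."""
    return np.count_nonzero(sq_errors < threshold ** 2)
```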
Table I shows the results of the rotation and translation disambiguation algorithms. The first row within each group shows a baseline comparison, where no method was used for pose disambiguation. The baseline method gives poor results. However, it is worth noting that Algorithm 2 returns the correct rotation even without any form of pose disambiguation. This is likely because it is seeded at the first frame with the identity rotation, and in every frame thereafter the best hypothesis from the previous iteration is used as the seed to the optimizer. The second row in each group shows the results when the Cheirality check was used to determine the best of the four possible rotation/translation pairs. The translation direction is often correct, but the rotation is correct only about half of the time. The third row in each group shows the results of using the matrix trace to determine which rotation is correct and the Cheirality check to determine the correct translation direction. This third pose disambiguation method consistently outperforms the other methods.
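One plausible reading of the trace test, offered here as an assumption rather than a statement of the paper's exact rule, is to prefer the candidate rotation with the larger trace, i.e. the smaller rotation angle, since consecutive frames rotate very little.

```python
import numpy as np

def pick_rotation_by_trace(R1, R2):
    """Between the two candidate rotations from the decomposition, prefer the one
    with the larger trace, i.e. the smaller rotation angle
    (cos(angle) = (trace(R) - 1) / 2), assuming small inter-frame rotation."""
    return R1 if np.trace(R1) >= np.trace(R2) else R2
```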
The best hypothesis was refined using LM to optimize the Sampson cost over all inliers (Section III-D). The refined relative pose is kept only if it successfully reduces the LMedS error. Table II compares the average error with and without refinement. Refining the best relative pose hypothesis significantly reduces all three error metrics. LM Basis also refines the best hypothesis and successfully reduces the translational and rotational error. However, even with refinement, LM Basis has a higher error than Algorithm 2.
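A sketch of such a refinement step, reusing the 5-dof parameterization from the earlier sketch and the standard Sampson residual for the essential matrix; names and conventions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]], float)

def sampson_residuals(x, p1, p2):
    """Sampson residuals over all inlier matches; x = (rotvec, theta, phi) is the
    same 5-dof parameterization as in the earlier sketch; p1, p2 are (N, 3)."""
    R = Rotation.from_rotvec(x[:3]).as_matrix()
    t = np.array([np.sin(x[3]) * np.cos(x[4]),
                  np.sin(x[3]) * np.sin(x[4]),
                  np.cos(x[3])])
    E = skew(t) @ R
    Ep1 = (E @ p1.T).T                   # (N, 3)
    Etp2 = (E.T @ p2.T).T                # (N, 3)
    num = np.einsum('ni,ni->n', p2, Ep1)
    denom = np.sqrt(Ep1[:, 0]**2 + Ep1[:, 1]**2 + Etp2[:, 0]**2 + Etp2[:, 1]**2)
    return num / denom

def refine_pose(x_best, p1_inliers, p2_inliers):
    """Refine the best hypothesis over all inliers; keep the result only if it
    lowers the LMedS error, as described above."""
    sol = least_squares(sampson_residuals, x_best, args=(p1_inliers, p2_inliers))
    return sol.x
```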
Table III compares the computation time for relative pose estimation between the OpenCV implementation [4], LM Basis [7], and Algorithm 2. The algorithms were all implemented on a laptop with a 2.1 GHz Intel i7 CPU running Linux. The breakdown of the time required to generate each hypothesis set is shown in Fig. 8. The most time-consuming part of the OpenCV solver is finding the zeros of the tenth-order polynomial. The most time-consuming part of Algorithm 2 is the Eigen matrix solver. Note that while LM Basis and Algorithm 2 take about the same amount of time to generate hypotheses per subset, LM Basis generates up to ten hypotheses per subset, while Algorithm 2 generates only one. The faster optimization step in the LM Basis algorithm is likely due to solving simpler equations, which require inverting a 3 × 3 matrix instead of a 5 × 5 matrix.
TABLE I Pose Disambiguation Comparison
TABLE II Relative Pose Refinement
TABLE III Computation Time
Fig. 8. Time required to generate each hypothesis set.
B. Motion Detection and Tracking-Flight Video
The motion detection algorithm was tested on a moving-camera video sequence taken from a multi-rotor UAV. Fig. 9 shows the results of the motion detection algorithm. Note that the stationary points have zero perpendicular velocity and a positive parallax velocity, while the moving points have a non-zero perpendicular velocity component. Fig. 10 shows the results of tracking these moving points using R-RANSAC [22].
Fig. 9. Video motion detection results. Each point position (left) and its corresponding net velocity (right) are plotted. Points with a net perpendicular velocity greater than one pixel are classified as moving points (red), while points with a velocity below this threshold are classified as stationary points (blue).
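The classification rule implied by the caption reduces to a simple threshold, sketched below assuming per-point perpendicular velocities in pixels are already available (for example from a decomposition like the one sketched in Section IV-A).

```python
import numpy as np

def classify_moving(perpendicular_velocities, threshold_px=1.0):
    """Label a point as moving (True) when its net perpendicular velocity
    exceeds one pixel, matching the threshold used in Fig. 9."""
    return np.abs(np.asarray(perpendicular_velocities)) > threshold_px
```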
The computation times of the motion detection and tracking algorithm are shown in Table IV. For faster processing, the video was scaled to 640 × 480. The motion detection and tracking algorithm was run on a Linux desktop computer with a 4 GHz Intel i7 CPU. On average, about 800 points are detected, tracked, and fed to Algorithm 2 each frame. Notice that the OpenCV feature detection and tracking are the most time-consuming components of the tracking algorithm and consume 70% of the total CPU usage. The complete algorithm takes 29 milliseconds per frame, which means it is capable of running in real time at 34 frames per second (FPS).
TABLE IV Motion Detection and Tracking Computation Times
Fig. 10. R-RANSAC tracks.
In this paper we have presented a relative pose estimation algorithm that solves for the rotation and translation between consecutive frames, requires at least five matching feature points per frame, and is capable of running in real time. We show the importance of seeding the LM optimizer with an initial pose estimate and demonstrate that this initial estimate significantly improves the performance of the algorithm. We have applied the algorithm to detecting motion and tracking multiple targets from a UAV and demonstrated real-time performance of this tracking algorithm on a 640 × 480 video sequence. Future work includes applications to 3D scene reconstruction and more complex tracking methods.