Luping Wang and Hui Wei
Abstract—Monocular vision-based navigation is an important capability for a home mobile robot. However, due to diverse disturbances, helping robots avoid obstacles, especially non-Manhattan obstacles, remains a major challenge. Indoor environments contain many spatial right-corners that are projected into two-dimensional projections with special geometric configurations. These projections, which consist of three lines, can enable us to estimate their position and orientation in the 3D scene. In this paper, we present a method that allows home robots to avoid non-Manhattan obstacles in indoor environments using a monocular camera. The approach first detects non-Manhattan obstacles. By analyzing geometric features and constraints, it is possible to estimate the posture difference between the orientation of the robot and a non-Manhattan obstacle. Finally, according to the convergence of the posture difference, the robot can adjust its orientation to keep pace with the pose of the detected non-Manhattan obstacle, making it possible to avoid the obstacle by itself. Based on geometric inference, the proposed approach requires no prior training or any knowledge of the camera's internal parameters, making it practical for robot navigation. Furthermore, the method is robust to errors in calibration and to image noise. We compared the corners of the estimated non-Manhattan obstacles against the ground truth. Furthermore, we evaluated the validity of the convergence of the difference between the robot orientation and the posture of the non-Manhattan obstacles. The experimental results show that our method is capable of avoiding non-Manhattan obstacles, meeting the requirements for indoor robot navigation.
With the aging population and the growing number of people with disabilities, the development of home service robots is becoming an increasingly urgent issue. Visual navigation in indoor environments has considerable value for monitoring and mission planning. However, indoor environments contain a multitude of disturbances, such as clutter and occlusion, which make avoiding obstacles, especially non-Manhattan obstacles (e.g., shelves, sofas, chairs), a difficult challenge for vision-based robots.
Compared with current methods based on other sensors (e.g., 3D laser scanners), visual navigation using a single low-cost camera has drawn increasing attention because of its advantages in cost and efficiency. Regarding the human visual system, Gibson's "visual cliff" experiment showed that depth perception is innate and does not require additional knowledge [1]. It was long believed that humans recover three-dimensional structure using binocular parallax. However, it has been shown that the human ability to estimate the depth of isolated points is extremely weak, and that we are more likely to infer the relative depths of different surfaces from the points where they join [2]. This suggests that binocular features are not essential, and that it is possible to understand scenes using only monocular images. Meanwhile, it has been reported that humans are sensitive to surfaces of different orientations, allowing us to extract surface and orientation information for understanding a scene [3]. Accordingly, it can be assumed that there are simple rules that can be used to infer 3D structure over a short period of time. Methods have been presented to understand indoor scenes based on the projections of rectangles and right angles, but non-Manhattan obstacles remain an undiscussed issue [4], [5].
In this paper, we present a method that allows non-Manhattan obstacles in an indoor environment to be understood from a single image, without prior training or internal calibration of the camera. First, straight lines are detected, and projections of spatial corners consisting of three lines are extracted. Second, through geometric inference, the non-Manhattan obstacles can be understood. Finally, through the convergence of differences in geometric features, the robot can adjust its orientation to keep pace with the posture of the non-Manhattan obstacles, allowing it to avoid such objects.
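To make the first two steps concrete, the following minimal Python sketch groups detected line segments into candidate projections of spatial right-corners, i.e., triples of segments meeting at a common junction point. It is a minimal sketch rather than the implementation itself: the segment representation (endpoint pairs in pixel coordinates), the junction radius, and the helper names are illustrative assumptions, and an upstream line-segment detector is taken as given.

import itertools
import math

# A segment is an endpoint pair: ((x1, y1), (x2, y2)), in pixel coordinates.

def meets_at_junction(a, b, c, radius=3.0):
    # True if one endpoint of each of the three segments falls inside a
    # common disc of the given pixel radius (the candidate corner junction).
    for pa, pb, pc in itertools.product(a, b, c):
        if (math.dist(pa, pb) <= radius and
                math.dist(pa, pc) <= radius and
                math.dist(pb, pc) <= radius):
            return True
    return False

def corner_candidates(segments, radius=3.0):
    # Collect triples of segments that may be the projection of a spatial
    # right-corner; the geometric constraints on the angles between the
    # three lines would further filter these candidates.
    return [triple for triple in itertools.combinations(segments, 3)
            if meets_at_junction(*triple, radius=radius)]

# Toy usage: three segments sharing the junction near (100, 100).
segs = [((100, 100), (180, 100)),   # roughly horizontal edge
        ((100, 100), (100, 30)),    # roughly vertical edge
        ((101, 99), (160, 160)),    # oblique edge of a non-Manhattan face
        ((300, 300), (340, 320))]   # unrelated segment
print(corner_candidates(segs))      # one triple: the first three segments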
Unlike data-driven methods, such as those based on deep learning, the proposed approach requires no prior training. With the use of simple geometric inference, the proposed algorithm is robust to changes in illumination and color. Even under disturbances, the method can understand non-Manhattan obstacles with neither knowledge of the camera's intrinsic parameters nor of the relation between the camera and the world, making it practical and efficient for a navigating robot. Moreover, since no other external devices are required, the method has the advantage of a lower required investment.
On classic benchmarks, our algorithm is capable of describing the details of non-Manhattan obstacles. We compared the corners estimated by the proposed approach against the corner ground truth, measuring the error as a percentage of pixels obtained by summing the Euclidean distances between estimated corners and the associated ground-truth corners. Furthermore, the experimental results demonstrate that robots can understand non-Manhattan obstacles and avoid them via the convergence of the posture difference between the robot orientation and the obstacle, meeting the requirements of indoor robot navigation.
Previous works have made impressive progress, including structure-from-motion [6]–[9] and visual SLAM [10]–[14]. From a series of visual observations, these methods build a scene model in the form of a 3D point cloud. One method combined three-dimensional point clouds with image data for semantic segmentation [15]. Nevertheless, point clouds and geometric cues retain only a fraction of the information in the original images, so aspects such as edge textures are sometimes lost.
3D structures can also be reconstructed by inferring the relationships between connected superpixels. Saxena et al. used heuristic knowledge to label each pixel of an image as grass, trees, sky, or other categories [16]. However, such methods hardly work in indoor settings with varying levels of clutter, incomplete surfaces, and limited coverage.
Furthermore, there are approaches that model geometric scene structure from a single image, including approaches for geometric label classification [17] and for finding vertical/ground fold-lines [18]. In other work [19], local image properties were linked to a classification of local surface orientation, and walls were extracted based on their junction points with the floor. However, owing to a strong dependence on precise floor segmentation, these methods may fail in indoor environments with clutter and occlusion. There has been renewed interest in 3D structures in restricted domains such as the Manhattan world [20], [21]. Based on vanishing points, one method detected rectangular surfaces aligned with the major orientations [5], but only dominant directions were discussed, and object surface information was not extracted.
Additionally, a top-down approach for understanding indoor scenes was presented by Pero et al. [22]. However, it had difficulty explaining room box edges when there were no additional objects. Although Pero's algorithm [23] can understand the 3D geometry of indoor environments, it required objects and prior knowledge such as relative dimensions, sizes, and locations. A comprehensive Bayesian generative model was also proposed to understand indoor scenes [24], but it relied on more specific and detailed geometric models and suffered greatly from hallucinated objects. Conversely, parameterized models of indoor environments were developed [25]. However, this method sampled possible spatial layout hypotheses without clutter, was prone to errors caused by occlusion, and tended to fit rooms in which walls coincided with object surfaces. Meanwhile, the relative depth order of rectangular surfaces was inferred by considering their relationships [26], [27], but this only provided depth cues for partial rectangular regions in the image, not for the entire scene.
Approaches that estimate which parts of 3D space are free and which are occupied by objects model the scene either in terms of clutter [28], [29] or bounding boxes [30], [22]. Significant work has also combined 3D geometry and semantics in the scope of outdoor scenes. Hedau proposed a method that identified beds by combining image appearance with the 3D reasoning made possible by estimating the room layout [31].
In Dasgupta's work [32], the indoor layout is estimated using a fully convolutional neural network in conjunction with an optimization algorithm, which evenly samples a grid over a feasible region to generate vanishing-point candidates. Nevertheless, the vanishing point may not lie in the feasible region in certain layout scenarios, such as a two-wall layout. Additionally, because of the iterative refinement process, optimization took approximately 30 seconds per frame, with a step size of 4 pixels for sampling lines and a grid of 200 vanishing points. Hence, the efficiency of this method cannot meet the requirements of robot navigation in an indoor environment. A method was also presented to predict room layout from a panoramic image [33]. Meanwhile, other methods using convolutional neural networks have been proposed to infer indoor scenes from a single image [34]–[38]. Since these methods take no account of non-Manhattan structures, it is difficult for them to understand non-Manhattan obstacles.
Recently, a method was presented to detect the horizontal vanishing points and the zenith vanishing point in man-made environments [39]. Another method was proposed to estimate the camera orientation and vanishing points through nonlinear Bayesian filtering in a non-Manhattan world [40]. However, it is difficult for these methods to understand non-Manhattan obstacles. In previous work, the proposed algorithm could estimate the layout of an indoor scene via projections of spatial rectangles, but it had difficulty handling non-Manhattan structures [5]. Another method can provide an understanding of indoor scenes that satisfy the Manhattan assumption [4]; however, it failed to understand non-Manhattan obstacles because structures that do not satisfy the Manhattan assumption were not discussed. Therefore, it is necessary to develop an algorithm that understands non-Manhattan obstacles for visual robot navigation using a single low-cost camera. Furthermore, such a method must offer low cost and high efficiency to meet the requirements of robot navigation.
Fig. 5. An example of avoiding an obstacle with a non-Manhattan structure.
TABLE II Turn Motion Mode in the Camera Coordinate System
Based on its understanding of the indoor scene, the robot can turn its orientation in order to keep pace with the posture of different structures (Manhattan or non-Manhattan). The turning of its orientation can be modeled as the convergence of the function Dt, as shown in Fig. 6. With a converging posture difference, the robot can adjust its orientation step by step. As Dt → 0, the robot's orientation comes into accordance with the posture of the obstacle, allowing it to avoid the obstacle by itself.
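This convergent behavior can be summarized in a toy simulation. The sketch below is illustrative only, not the actual controller: angles are in degrees, the update is a simple clamped step, and max_step stands in for the pre-limited maximum turning angle per control cycle.

def adjust_orientation(robot_yaw, obstacle_yaw, max_step=10.0, tol=0.5):
    # Turn step by step until the posture difference Dt converges to ~0.
    history = [robot_yaw]
    d_t = obstacle_yaw - robot_yaw
    while abs(d_t) > tol:
        # Each control cycle turns by at most max_step toward the obstacle pose.
        turn = max(-max_step, min(max_step, d_t))
        robot_yaw += turn
        history.append(robot_yaw)
        d_t = obstacle_yaw - robot_yaw
    return history

# E.g., starting 76.43 degrees away from the obstacle posture:
print(adjust_orientation(0.0, 76.43))  # the yaw converges after 8 turns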
We designed experiments to evaluate the robot's performance in avoiding non-Manhattan obstacles using the proposed approach. The focus of the experiments is to evaluate the algorithms underlying the execution of a real robot equipped with only one camera. The goals of the experiments are to evaluate not only the performance in detecting non-Manhattan obstacles in indoor settings, but also the robot's ability to avoid such non-Manhattan obstacles by turning its orientation via Dt.
For an input image containing many occlusions and much clutter, our method copes with the clutter without prior training. Based on the geometric constraints of spatial corners, our approach not only detects obstacles satisfying the Manhattan assumption, but can also estimate the pose of obstacles, especially non-Manhattan obstacles.
Fig. 6. The decreasing posture difference between the obstacle and the robot orientation.
We compare the obstacles estimated by our algorithm against the ground truth, measuring the corner error by summing the Euclidean distances between the estimated corners and the associated ground-truth corners. The performance on the LSUN dataset [44] is compared in Table III. Although the corner errors of methods [32], [34] appear lower, those methods only measured the error of corners belonging to the layout of the indoor setting, without the ability to understand and estimate the corners of non-Manhattan obstacles. Our method, by contrast, estimates the error of corners of non-Manhattan obstacles, which plays an important role in robot navigation, allowing the robot to avoid non-Manhattan obstacles in the indoor setting.
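For reference, the sketch below shows one plausible way to compute such a corner error; it is a reconstruction, not the exact scoring code. It assumes the estimated corners are already matched one-to-one with the ground truth, and it normalizes the summed Euclidean distances per corner by the image diagonal to obtain a percentage (the normalization choice is an assumption).

import math

def corner_error(estimated, ground_truth, image_w, image_h):
    # Percentage corner error: summed Euclidean distances between matched
    # estimated/ground-truth corner pairs, normalized per corner by the
    # image diagonal.
    if len(estimated) != len(ground_truth):
        raise ValueError("corners must be matched one-to-one")
    total = sum(math.dist(e, g) for e, g in zip(estimated, ground_truth))
    diagonal = math.hypot(image_w, image_h)
    return 100.0 * total / (len(estimated) * diagonal)

# Toy example on a 640 x 480 image with two matched corners:
est = [(120.0, 80.0), (400.0, 310.0)]
gt = [(118.0, 84.0), (396.0, 305.0)]
print(round(corner_error(est, gt, 640, 480), 2))  # ~0.68 (percent)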
TABLE III Performance on the LSUN Dataset
Experimental comparisons were conducted between our method and Wei's method [4], as shown in Fig. 7. In Fig. 7, the image sizes (height and width) for Wei's scene understanding and for our non-Manhattan obstacle detection are the same. For example, the scene understanding (fifth row, fourth column) and the non-Manhattan obstacle detection (sixth row, fourth column) are computed from the same group of line segments, so the numbers of line segments are identical. The difference is that line segments that do not satisfy the constraint on angle projections are eliminated in the scene understanding (fifth row, fourth column), resulting in fewer displayed lines. Since Wei's method only considers lines belonging to the vanishing points of the indoor layout, it is prone to failure in detecting non-Manhattan obstacles. In contrast, our method can deal with clutter and can efficiently detect details, especially non-Manhattan obstacles, without any prior training.
Experimental comparisons were also conducted between our method and Wang's method [5], as shown in Fig. 8. Since Wang's method only considers rectangular projections that belong to the vanishing points of the indoor layout, it also has difficulty detecting non-Manhattan obstacles.
Fig. 7. Experimental comparisons. (a) Input frames from the UCB dataset [26]; (b) understanding of indoor scenes by Wei's method [4]; (c) non-Manhattan obstacles estimated by our method; (d) images from the Hedau dataset [42]; (e) results estimated by Wei's method [4]; (f) non-Manhattan structures estimated by our method.
Fig. 8. Experimental comparisons. (a) Input images from the LSUN dataset [44]; (b) understanding of indoor scenes by Wang's method [5]; (c) non-Manhattan obstacles estimated by our method.
Fig. 9. An unmanned aerial vehicle with a two-megapixel fixed camera that was used for capturing video.
Here, as shown in Fig. 9, an unmanned aerial vehicle with a two-megapixel fixed camera was used to capture video. The visual information was transmitted to a computer with an Intel Core i7-6500 CPU at 2.50 GHz. Our method was then applied to efficiently identify non-Manhattan obstacles in the scene, without any prior training. Take frame Ft1 in Fig. 10 (first column) as an example; there is an obvious difference between the orientation of the robot and the pose of the non-Manhattan obstacle. Based on the equation above, Dt1 can be approximately estimated (ηt1 = 76.43). According to Table II, the robot understood the scene, identified the non-Manhattan obstacle, and turned right by a pre-limited angle (the maximum turning angle set in the robot controller).
Then, the robot captured Ft2 (Fig. 10, second column) and entered the next understanding-turn loop. For the frames shown in Fig. 10, non-Manhattan obstacles are detected, and the pose differences between the robot's orientation and the non-Manhattan obstacles are estimated in Table IV.
Through successive understanding of the non-Manhattan obstacles and orientation turning, the differences (Mx and η) undergo a convergence process, as shown in Fig. 11. In the left image of Fig. 11, the horizontal axis indicates the frame index in Table IV, and the vertical axis represents the value Mx. Meanwhile, in the right image of Fig. 11, the horizontal axis indicates the frame index in Table IV, and the vertical axis represents the value η. With the decreasing values of the differences Mx and η, the orientation of the robot is adjusted step by step to keep pace with the pose of the detected non-Manhattan obstacles, allowing it to avoid the non-Manhattan obstacles by itself.
Clearly, when facing a non-Manhattan obstacle, the pose difference between the robot orientation and the obstacle can be approximately estimated so as to determine whether, and how, to change the robot orientation in order to avoid the obstacle.
Fig. 10. Pose difference estimation. (a), (c) Input frames; (b), (d) pose differences (Mx and η) between the robot orientation and the non-Manhattan obstacles.
The current work presents an approach that enables home mobile robots to avoid non-Manhattan obstacles in indoor environments using a monocular camera. The method first detects the projections of spatial right-corners and estimates their position and orientation in the three-dimensional scene. Accordingly, it is possible to model non-Manhattan obstacles via these corner projections. Then, based on the understanding of such non-Manhattan obstacles, the difference between the robot orientation and the posture of the obstacles can be estimated via geometric features and constraints. Finally, according to this difference, the robot can determine whether and how to turn so as to keep pace with the posture of the detected non-Manhattan obstacles, making it possible to avoid such obstacles. Unlike data-driven approaches, the proposed method requires no prior training. With the use of geometric inference, the presented method is robust against changes in illumination and color. Furthermore, since it needs no knowledge of the camera's internal parameters, the algorithm is more practical for robotic navigation. In addition, using features from a monocular camera, the approach is robust to errors in calibration and to image noise. Without other external devices, this method has the advantages of lower investment and energy efficiency. The experiments measured the corner error by comparing the corners of non-Manhattan obstacles estimated by our algorithm against the ground truth. Moreover, we demonstrated the validity of obstacle avoidance via the convergence of the difference between the robot orientation and the non-Manhattan obstacle posture. The experimental results showed that our method can understand and avoid non-Manhattan obstacles, meeting the requirements of indoor robot navigation.
Fig. 11. Convergence. (a) Convergence curve of Mx; (b) convergence curve of η.
TABLE IV Difference Between the Robot Orientation and the Non-Manhattan Obstacles