Daoyong FU, Songchen HAN, BinBin LIANG, Xinyang YUAN, Wei LI
School of Aeronautics and Astronautics, Sichuan University, Chengdu 610065, China
KEYWORDS: 2D and 3D information; 6D pose regression; aircraft 6D pose estimation; end-to-end network; RGB image
Abstract: 6D pose estimation from a single RGB image is important for the safe take-off and landing of an aircraft. Due to the large scene and large depth involved, existing pose estimation methods have unsatisfactory accuracy. To achieve precise 6D pose estimation of the aircraft, an end-to-end method using an RGB image is proposed. In the proposed method, the 2D and 3D information of pre-designed keypoints of the aircraft is used as intermediate supervision, and the 6D pose of the aircraft is explored from this intermediate information. Specifically, an off-the-shelf object detector is utilized to detect the Region of Interest (RoI) of the aircraft to eliminate background distractions. The 2D projection and 3D spatial information of the pre-designed keypoints of the aircraft are predicted by the keypoint coordinate estimator (KpNet). The proposed method is trained in an end-to-end fashion. In addition, to deal with the lack of related datasets, this paper builds the Aircraft 6D Pose dataset for training and testing, which captures the take-off and landing processes of three types of aircraft from 11 views. Compared with the latest Wide-Depth-Range method on this dataset, the proposed method improves the average 3D distance of model points metric (ADD) and the 5° and 5 m metric by 86.8% and 30.1%, respectively. Furthermore, the proposed method runs in 9.30 ms, 61.0% faster than YOLO6D at 23.86 ms.
A correct pose is an important guarantee for the safe take-off and landing of an aircraft. An excessive elevation angle, excessive speed, or deviation from the runway is likely to cause an irreparable accident. The translation (x, y, z) (position) and rotation (φ, θ, γ) (attitude angles) are utilized to express the 6D pose of the aircraft. Usually, the aircraft obtains its own 6D pose through the Global Positioning System (GPS) and the inertial navigation system installed on the body. However, when these sensors fail and cannot be used, the safe take-off and landing of the aircraft faces a major challenge.1 Therefore, it is necessary to obtain the 6D pose of the aircraft during take-off and landing by other means.
With the rapid development of deep learning, many deep-learning-based solutions have been proposed in various fields, and object pose estimation is no exception. According to the strategy adopted, deep-learning-based 6D pose estimation methods are mainly divided into two categories: one-stage methods and two-stage methods. One-stage methods analyze the image and directly estimate the 6D pose of the object. Two-stage methods first predict 2D-3D correspondences, and then use the Random Sample Consensus based (RANSAC-based) Perspective-n-Point (PnP) algorithm2 to estimate the 6D pose of the object from these correspondences. Although the latter has higher accuracy than the former, it relies on the detection of 2D-3D correspondences. In addition, due to the lack of depth information, the latter yields poor accuracy in 3D translation, while some methods use RGB-D images as input for 6D pose estimation.3–4 However, for the large scene and large depth faced by the aircraft 6D pose estimation problem, acquiring a depth map is not feasible. This is mainly because the distance from the camera to the aircraft can reach hundreds of meters or even kilometers, which is beyond the effective range of a depth camera.
Compared with two-stage methods, one-stage methods pay more attention to the direct estimation of the 6D pose and reduce the dependence on object depth information. In addition, the end-to-end training strategy makes such methods easier to train. However, these methods ignore the feature learning of the object,5 which leads to poor accuracy. Inspired by the two-stage methods, some scholars have tried to replace the PnP algorithm with learned networks,6 whereas such iterative approaches heavily rely on good coordinate initialization.
To obtain high-precision aircraft 6D pose information, this paper proposes an end-to-end 2D + 3D information based aircraft 6D pose estimation method. When the keypoints of the aircraft are described in 3D space (such as the camera coordinate system), they are called 3D keypoints, whose positions are called 3D coordinates. When the keypoints are described in the 2D image (such as the pixel coordinate system), they are called 2D keypoints, whose positions are called 2D coordinates. Inspired by the high-precision estimation of the rotation matrix in two-stage methods, the proposed method adopts the strategy of keypoint acquisition followed by pose estimation, which can be regarded as intermediate supervision.7 Compared with the two-stage methods, the main difference is that the proposed method not only detects the 2D projection position of each keypoint in the image, but also predicts its 3D spatial position. The 2D projection information yields a high-precision rotation, and the 3D spatial position information helps the estimation accuracy of the translation matrix owing to the supplement of depth information. Different from previous one-stage methods, the proposed method predicts the 2D and 3D information of the keypoints, from which the 6D pose of the aircraft is explored. Due to the lack of relevant datasets about aircraft take-off and landing, this paper uses Unreal Engine 4 to reproduce aircraft take-off and landing with high fidelity and produces a 6D pose dataset of the aircraft for training and testing.
The contributions of this paper are summarized as follows: (A) This paper proposes a 2D + 3D information based end-to-end 6D pose estimation method for the aircraft. The network is guided to learn the patterns in the image in order to recover the 2D projection coordinates of the keypoints in the image and the 3D spatial coordinates of the keypoints in 3D space, and this 2D projection information and 3D spatial position information are fused to estimate the 6D pose of the aircraft. (B) The Aircraft 6D Pose dataset is built, which can be used to estimate the 6D pose of the aircraft during take-off and landing.
The rest of this paper is organized as follows: Section 2 introduces the related work, and Section 3 clarifies the proposed method, detailing how to fuse the 2D and 3D information of aircraft keypoints. Section 4 introduces the dataset built in this paper, as well as the relevant details and results of the experiments. Section 5 draws the conclusion.
According to the strategy used, 6D pose estimation methods can be divided into two categories: one-stage methods and two-stage methods. The former directly regress the 6D pose of the object by analyzing a single input RGB image. The latter first detect 2D-3D correspondences, and then use the RANSAC-based PnP algorithm to estimate the 6D pose of the object.
(1) One-stage methods. Early one-stage methods usually suffer from rotational nonlinearity,8 which can be effectively alleviated by combining them with the PnP algorithm or the RANSAC method.2 One-stage methods take images (RGB, RGB-D, or a combination of the two) as input and directly regress the pose of the 3D object. However, pose annotation of 3D objects is difficult, especially when depth images are not available. Compared with pose data, the annotation of 2D bounding boxes is relatively simple, so Yang et al.9 used a weakly supervised method10 to learn the segmentation map of the object and utilized the segmentation mask as prior knowledge for pose estimation. When performing pose estimation, their DSC-PoseNet predicts the pose of the object by comparing the similarity between the segmentation mask and the rendered visible object mask. The algorithm is even comparable to several fully supervised methods. To recover accurate pose information of a 3D model from pictures taken when the extrinsic parameters of multiple cameras are unknown, Labbé et al.11 proposed the CosyPose method, which first matches a single object across different views and uses this object to estimate the relative pose between cameras. The RANSAC method is utilized to optimize consistency across the scene. Finally, a global optimization process eliminates object poses from noisy single views. Stevšič and Hilliges6 observed that most existing methods adopt a non-iterative scheme, which does not take into account the need for high-frequency features for accurate alignment between the 3D model and the 2D image. At the same time, iterative optimization methods such as DeepIM12 still have large room for improvement. Therefore, to improve pose estimation accuracy, Stevšič and Hilliges6 used an iterative method to identify and exploit spatial information to improve the pose refinement process.
(2) Two-stage methods. Two-stage pose estimation methods based on deep learning usually use point features (or keypoints) as an intermediate representation for pose estimation. These methods assume that a deep network can accurately detect the positions of the keypoints (markers) of a three-dimensional object in a two-dimensional image, and that the keypoints in these two-dimensional images can be used to regress the pose of the three-dimensional object.13 The eight vertices of the 3D bounding box (3D bbox) of the object (sometimes together with its centroid) are used by many researchers to build bridges between 2D-3D correspondences, such as YOLO6D14 and BB8.15 There are also methods that use points on the 3D model to construct 2D-3D correspondences, such as PVNet.5 However, this type of pose estimation method usually trains on a regression surrogate (such as keypoints), which does not necessarily reflect the actual pose of the 3D object solved by the PnP algorithm. The reason is that averaging the error may cause some outlier 2D-3D correspondences to be judged as correct. At the same time, due to the lack of depth information of the object, the accuracy of these methods drops sharply for aircraft in large scenes with large depth.
Based on the above analysis, the existing methods do not make full use of the 3D information of the object keypoints. Yet generating such 3D information without a depth map as input is very helpful for estimating the 6D pose of the aircraft.
The 3D coordinates of the keypoints on the aircraft body contain both the position and the attitude information of the aircraft. To make full use of this information to estimate the 6D pose of the aircraft, this paper proposes an aircraft 6D pose estimation method based on the 2D + 3D coordinates of the keypoints, as shown in Fig. 1. The object detector (DetectNet) is used to obtain the position of the aircraft in the image, and the Region of Interest (RoI) is then sent to the keypoint coordinate estimator (KpNet) to estimate the 2D + 3D coordinates of each keypoint. Finally, this 2D and 3D information is sent to the Perspective-n-Point solver (PnPNet) to estimate the 6D pose of the aircraft, i.e., the rotation matrix (R) and translation matrix (T).
The liveries on the fuselages of aircraft differ, which makes it impossible to select keypoints using aircraft texture features. Compared with texture features, the geometric features of an aircraft are more suitable for defining keypoints. For these reasons, this paper selects the corner points and connection points on the aircraft body as the keypoints, namely, the nose of the aircraft, the left-wing tip, the right-wing tip, the right horizontal tail tip, the left horizontal tail tip, the vertical tail tip, the tail of the aircraft, the connection point between the wing and the fuselage, and the connection point between the horizontal stabilizer and the fuselage, as shown with the blue points in Fig. 2. In addition, the body coordinate system origin O is selected to represent the center of the aircraft, as shown with the red point in Fig. 2.
Fig.2 Selected keypoints of aircraft.
Since the aircraft body is a rigid structure, the relative positional relationship between keypoints is fixed; that is, when the origin of the body coordinate system and the type of the aircraft are known, the position of any point on the aircraft body is known. Thus, the 3D coordinates of each keypoint in the body coordinate system are fixed per aircraft type and can be read directly from the aircraft model, as sketched below.
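As an illustration, the body-frame keypoint table can be stored per aircraft type; the numeric values below are hypothetical placeholders, not the paper's CAD measurements. A keypoint is then mapped into the camera frame by P_c = R·P_b + T:

```python
import numpy as np

# Hypothetical body-frame keypoint table (metres); the real values come
# from the 1:1 aircraft CAD models and are fixed per aircraft type.
BODY_KEYPOINTS = {
    "A320": np.array([
        [18.8, 0.0, 0.0],    # nose
        [-3.0, -17.0, 0.0],  # left-wing tip
        [-3.0, 17.0, 0.0],   # right-wing tip
        # ... remaining keypoints defined in Section 3.1
    ]),
}

def keypoints_in_camera(aircraft_type, R, T):
    """Map body-frame keypoints into the camera frame: P_c = R @ P_b + T."""
    P_b = BODY_KEYPOINTS[aircraft_type]   # (N, 3) body-frame coordinates
    return P_b @ R.T + T.reshape(1, 3)    # (N, 3) camera-frame coordinates
```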
The rotation matrix R expresses the attitude angle of the aircraft. Common representations, such as unit quaternions,16 log quaternions,17 and Lie algebra-based vectors,18 are utilized to express the rotation matrix. However, they cause errors close to the discontinuities. Thus, this paper adopts the continuous 6D representation proposed by Zhou et al.,19 which keeps the first two columns a1 and a2 of R as a 6D vector g = [a1; a2]. The rotation matrix R = [b1 b2 b3] can be obtained easily by the following expression:

b1 = vn(a1),  b2 = vn(a2 − (b1 · a2) b1),  b3 = b1 × b2

where vn(·) represents the normalization operation. A minimal sketch of this recovery follows.
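A minimal PyTorch sketch of this recovery (the function name is ours; the formula follows Zhou et al.19):

```python
import torch

def rotation_from_6d(g):
    """Recover a rotation matrix from the continuous 6D representation.

    g: (B, 6) tensor holding the two (unnormalized) leading columns of R.
    Returns: (B, 3, 3) rotation matrices.
    """
    a1, a2 = g[:, :3], g[:, 3:]
    b1 = torch.nn.functional.normalize(a1, dim=1)             # vn(a1)
    b2 = torch.nn.functional.normalize(
        a2 - (b1 * a2).sum(dim=1, keepdim=True) * b1, dim=1)  # vn(a2 - <b1,a2> b1)
    b3 = torch.cross(b1, b2, dim=1)                           # b1 x b2
    return torch.stack((b1, b2, b3), dim=2)                   # columns b1, b2, b3
```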
The translation matrix T expresses the position of the aircraft. To regress a highly accurate aircraft position, this paper adopts the Scale-Invariant Translation Estimation (SITE) method,20 sketched below.
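A hedged sketch of the SITE idea as we understand it from Ref. 20 (the symbols dx, dy and the interface are our notation): the network predicts the offset of the projected aircraft center from the RoI center, normalized by the RoI size, together with a depth, and the full translation is recovered by back-projection:

```python
import numpy as np

def recover_translation(dx, dy, tz, box_cx, box_cy, box_size, K):
    """Scale-invariant translation recovery (a sketch after Ref. 20).

    (dx, dy): predicted offsets of the projected center from the RoI
              center, normalized by the RoI size (network outputs).
    tz:       predicted depth, here assumed already rescaled to metres.
    """
    ox = box_cx + dx * box_size          # projected center, pixel coords
    oy = box_cy + dy * box_size
    # Back-project: [tx, ty, tz]^T = tz * K^{-1} [ox, oy, 1]^T
    return tz * (np.linalg.inv(K) @ np.array([ox, oy, 1.0]))
```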
Fig.1 End-to-end aircraft 6D pose estimation method based on 2D + 3D information of keypoints.
This paper uses intermediate supervision to guide the neural network to predict the 2D + 3D coordinates of the keypoints on the aircraft body and achieves 6D pose estimation of the aircraft by analyzing the 2D + 3D information of these keypoints. Inspired by the literature,21 the aircraft pose estimation network based on the 2D + 3D information of keypoints mainly includes three modules, i.e., DetectNet, KpNet and PnPNet, which are used for object detection, 2D + 3D coordinate estimation of the keypoints, and 6D pose estimation of the aircraft, respectively. Fig. 3 shows the network architecture of the proposed method.
(1) Object detector (DetectNet). This paper uses an off-the-shelf object detector (Faster-RCNN)22 to detect the aircraft and obtain its position in the image, namely, the Region of Interest (RoI). The corresponding RoI is enlarged to obtain a square RoI, which is then scaled to a size of 256 × 256 × 3 and sent to KpNet to estimate the 2D + 3D coordinates of the keypoints, as sketched below. To improve robustness during training, this paper uniformly translates the position of the aircraft in the RoI and scales the size of the RoI by 25%. During testing, the RoI with the size of 256 × 256 is directly sent to KpNet for 2D + 3D coordinate estimation of the keypoints. The groundtruth of the 2D bounding box (2D bbox) of the aircraft in the image is obtained by manual labeling.
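A minimal sketch of this square-RoI preparation (the enlargement factor and helper name are our assumptions; boundary padding is omitted for brevity):

```python
import cv2

def crop_square_roi(image, box, enlarge=1.25, out_size=256):
    """Enlarge the detected box, pad it to a square, resize to out_size."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half = enlarge * max(x1 - x0, y1 - y0) / 2     # half of the square side
    x0, x1 = int(cx - half), int(cx + half)
    y0, y1 = int(cy - half), int(cy + half)
    roi = image[max(y0, 0):y1, max(x0, 0):x1]      # clamp to image bounds
    return cv2.resize(roi, (out_size, out_size))
```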
(2) Keypoint coordinate estimator (KpNet). ResNet-34 is utilized as the backbone of KpNet. The RoI of size 256 × 256 × 3 sent by DetectNet is first analyzed by the backbone, yielding feature maps of size 512 × 8 × 8. After three up-sampling operations, the feature maps are mapped to a size of 41 × 64 × 64 as the output of KpNet. The output of KpNet includes three parts. The first part is the predicted mask M of size 1 × 64 × 64 (the groundtruth of the mask is obtained by manual labeling); it is utilized to penalize wrongly detected points when calculating the loss, as shown in Eq. (7). The second is the predicted 2D position of all keypoints of the aircraft in the image, corresponding to the output maps Kp_2D of size 10 × 64 × 64. Kp_2D contains 10 maps, one per keypoint of the aircraft. Each element of one map gives the probability that the keypoint occurs at this position, as shown in Fig. 4(a). Kp_2D adopts the Gaussian distribution, which is widely utilized to represent the position of a keypoint in an image.23 The last part is the predicted 3D position of all keypoints in the camera coordinate system, corresponding to the output maps Kp_3D of size 30 × 64 × 64. Kp_3D contains 30 maps: the first 10 maps hold the predicted x coordinates of the 3D coordinates of the 10 keypoints, as shown in Fig. 4(b), and the next 20 maps hold the predicted y and z coordinates, respectively. Inspired by depth maps and contour lines, Kp_3D adopts binary maps, since the Gaussian distribution is not suitable for representing discontinuous 3D coordinates. Eq. (5) and Eq. (6) show how the maps represent the 2D and 3D positions of the keypoints; a sketch of the map construction follows.
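As a hedged illustration of these two representations, the ground-truth maps could be rasterized as follows (the function name, σ and the prior normalization of the 3D coordinates are our assumptions; the Kp_3D layout follows the Repre_xxyyzz arrangement discussed later):

```python
import numpy as np

def gt_maps(kp_2d, kp_3d, size=64, sigma=1.5, radius=3):
    """Rasterize ground-truth Kp_2D (Gaussian) and Kp_3D (binary) maps.

    kp_2d: (N, 2) keypoint positions in map coordinates.
    kp_3d: (N, 3) keypoint positions in the camera frame (normalized).
    """
    n = len(kp_2d)
    ys, xs = np.mgrid[:size, :size]
    kp2d_maps = np.zeros((n, size, size), np.float32)
    kp3d_maps = np.zeros((3 * n, size, size), np.float32)
    for i, (u, v) in enumerate(kp_2d):
        d2 = (xs - u) ** 2 + (ys - v) ** 2
        kp2d_maps[i] = np.exp(-d2 / (2 * sigma ** 2))   # Gaussian peak
        disk = d2 <= radius ** 2                        # binary disk
        for axis in range(3):                           # x maps, then y, then z
            kp3d_maps[i + n * axis][disk] = kp_3d[i, axis]
    return kp2d_maps, kp3d_maps
```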
Fig.3 Network architecture.
Fig.4 Output map of KpNet.
Kp_2D carries the accurate 2D projection information of the keypoints, while Kp_3D combines rough 2D projection information with accurate 3D spatial information. The 2D and 3D information in Kp_2D and Kp_3D is concatenated and then sent to PnPNet to estimate the 6D pose of the aircraft.
(3) Perspective-n-Point solver (PnPNet). Kp_2D and Kp_3D are concatenated and sent as input to PnPNet, which is composed of three convolutional layers and four fully connected layers. The last two fully connected layers regress the rotation matrix and the translation matrix, respectively. PnPNet parses the 2D projection information and 3D spatial information in Kp_2D and Kp_3D to obtain the 6D pose of the aircraft, i.e., R and T. The 6D pose of the aircraft describes the transformation between the body coordinate system and the camera coordinate system, i.e., the product of the 6D pose of the aircraft in the world coordinate system and the extrinsic parameters of the camera. A sketch of such a network follows.
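A minimal PyTorch sketch of such a PnPNet; only the three-convolution/four-fully-connected layout and the two regression heads follow the text, while the channel widths and kernel sizes are our assumptions:

```python
import torch
import torch.nn as nn

class PnPNet(nn.Module):
    """Regress R (6D representation) and T from concatenated Kp_2D/Kp_3D maps."""
    def __init__(self, in_ch=40):                  # 10 Kp_2D + 30 Kp_3D maps
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
        )                                          # 64 x 64 -> 8 x 8 spatially
        self.fc1 = nn.Linear(128 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc_rot = nn.Linear(256, 6)            # 6D rotation head
        self.fc_trans = nn.Linear(256, 3)          # translation head

    def forward(self, x):
        h = self.convs(x).flatten(1)
        h = torch.relu(self.fc2(torch.relu(self.fc1(h))))
        return self.fc_rot(h), self.fc_trans(h)
```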
(4) Loss function. The network predicts the 2D and 3D positions of the keypoints and estimates the 6D pose of the aircraft. Thus, the overall loss includes two parts, Loss_2D-3D and Loss_pose, as shown in Eq. (7) and Eq. (8). The Mean Squared Error (MSE) loss is utilized to train KpNet, as sketched below.
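A hedged sketch of how these two terms might be combined (the masking scheme and equal weighting are our assumptions; the paper's exact Eq. (7) and Eq. (8) are not reproduced here):

```python
import torch.nn.functional as F

def total_loss(pred, gt):
    """Overall loss = Loss_2D-3D + Loss_pose, all terms MSE (a sketch)."""
    m = gt["mask"]                               # (B, 1, 64, 64), broadcasts
    loss_2d3d = (F.mse_loss(pred["mask"], m)     # mask supervision
                 + F.mse_loss(m * pred["kp2d"], m * gt["kp2d"])
                 + F.mse_loss(m * pred["kp3d"], m * gt["kp3d"]))
    loss_pose = F.mse_loss(pred["R"], gt["R"]) + F.mse_loss(pred["T"], gt["T"])
    return loss_2d3d + loss_pose
```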
To overcome the lack of a dataset for estimating the 6D pose of aircraft during take-off and landing, this paper creates a dataset called the Aircraft 6D Dataset. Unreal Engine 4 is utilized to build a 1:1 model of Chengdu Shuangliu International Airport (ICAO: ZUUU) and three high-precision aircraft models, including the Airbus A320 and Airbus A350, as shown in Fig. 5. Eleven cameras are installed at Chengdu Shuangliu International Airport to capture images of aircraft taking off or landing on the runway; the camera distribution is shown in Fig. 6. The position and attitude angle data of the aircraft during take-off and landing come from the state records captured when captains use a flight simulator for recurrent training. From the 11 viewpoints (or cameras), images of the three types of aircraft during take-off or landing are recorded, and the mapping relationship between the body coordinate system and the camera coordinate system is recorded at each moment. The Aircraft 6D Dataset contains 17,460 images of size 1920 × 1000. All images are split at a ratio of 3:1 for training and testing.
PyTorch24 is utilized to train and test the proposed method, running on an Intel Core™ i7-9700 CPU with 16 GB RAM and a single RTX 2080Ti with 11 GB of memory. The initial learning rate is 10^-4, and a cosine schedule25 is adopted to adjust it. During training, the Ranger optimizer26–28 is used. The proposed method is trained for 120 epochs. The proposed method includes three modules: Faster-RCNN is used as DetectNet; the backbone of KpNet is ResNet-34, followed by three convolutional layers and three up-sampling layers; and PnPNet consists of three convolutional layers and four fully connected layers. The input and output sizes of each module are described in Section 3.3.
In this paper, three common metrics are utilized to evaluate the proposed method, i.e., the average 3D distance of model points metric (ADD),3,29 the 5° and 5 cm metric,30–31 and the 2D projection metric (2D Proj).32 Specifically, e_ADD calculates the average 3D error between the 3D coordinates of the keypoints transformed by the estimated pose (R, T) and those transformed by the groundtruth pose. Eq. (9) illustrates the ADD metric:

e_ADD = (1/m) Σ_{x∈M} ||(R x + T) − (R* x + T*)||_2    (9)

where R* and T* denote the groundtruth, and M is the set of the m keypoints. ADD judges whether this error is less than 5% of the object's diameter.
Since the size of the aircraft is 100–400 times that of the objects for which the 5 cm metric is suitable, and 5 cm is too small a threshold for the aircraft, the 5 cm metric is adjusted. In this paper, the symbols 5° and 5 m are utilized for the 5° and 5 m metric, which requires the rotation error and the translation error to be less than 5° and 5 m, respectively.
Fig.5 Three aircraft models including Airbus A320 and Airbus A350.
The 2D projection metric (e_2Dproj) calculates the distance between the 2D projection coordinates of the keypoints under the estimated pose (R, T) and the groundtruth 2D projection coordinates. Eq. (10) illustrates the 2D projection metric:

e_2Dproj = (1/m) Σ_{x∈M} ||proj(K (R x + T)) − proj(K (R* x + T*))||_2    (10)

where K is the camera intrinsic matrix and proj(·) maps homogeneous image coordinates to pixel coordinates. A sketch of the metric computation follows.
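As a concrete illustration, the three metrics can be computed along the following lines (a sketch; the function name and per-sample interface are ours, and the thresholds follow the text):

```python
import numpy as np

def evaluate_pose(R, T, R_gt, T_gt, pts, K, diameter):
    """Compute e_ADD, the 5deg & 5m test, and e_2Dproj for one sample."""
    p = pts @ R.T + T                        # (N, 3) under estimated pose
    p_gt = pts @ R_gt.T + T_gt               # (N, 3) under groundtruth pose
    e_add = np.linalg.norm(p - p_gt, axis=1).mean()
    add_ok = e_add < 0.05 * diameter         # 5% of the aircraft diameter

    cos = (np.trace(R_gt.T @ R) - 1) / 2     # geodesic rotation distance
    rot_err = np.degrees(np.arccos(np.clip(cos, -1, 1)))
    deg5_m5 = rot_err < 5 and np.linalg.norm(T - T_gt) < 5

    proj = p @ K.T                           # project both point sets
    proj = proj[:, :2] / proj[:, 2:]
    proj_gt = p_gt @ K.T
    proj_gt = proj_gt[:, :2] / proj_gt[:, 2:]
    e_proj = np.linalg.norm(proj - proj_gt, axis=1).mean()
    return e_add, add_ok, deg5_m5, e_proj
```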
To demonstrate the effectiveness of the proposed method, a performance comparison is carried out against state-of-the-art methods, namely YOLO6D,14 PVNet,5 Single-Stage,33 segmentation-driven,7 and Wide-Depth-Range.34 As shown in Table 1, where bold indicates the best performance, the 2D + 3D information based method proposed in this paper is 4.1% and 6.5% lower than Wide-Depth-Range34 in terms of the 5° and 2D Proj metrics, but on the ADD, 5°&5m, and 5 m metrics, the proposed method outperforms it by large margins of 87.0%, 30.1% and 35.6%, respectively. In addition, the proposed method is much faster than the other methods: it takes 9.30 ms, 61.0% faster than YOLO6D at 23.86 ms.

The two-stage methods (YOLO6D and PVNet) based on the PnP/RANSAC algorithm have high accuracy in estimating the rotation matrix but unsatisfactory performance in estimating the translation matrix. This is because the locations of aircraft taking off or landing within the viewing angle of a single camera vary greatly. The two-stage methods learn only the 2D projection information of the aircraft; due to the lack of depth information, such methods perform poorly on the translation matrix of the object in such a large scene. Single-Stage33 performs poorly on these metrics because it only uses multiple 1D convolutional layers to replace the PnP algorithm, which is essentially not much different from a one-stage method. Wide-Depth-Range aims to solve the 6D pose estimation problem in a large scene with large depth, but it does not make full use of 3D information. Compared with these methods, the proposed method makes full use of the 2D projection information and 3D coordinate information of the aircraft, which greatly improves the accuracy of 6D pose estimation. In addition, its simple end-to-end network greatly improves efficiency.

Fig. 7 shows the visualization results of the methods in Table 1 under the 2D Proj metric. It includes the 3 types of aircraft in 11 scenarios. The blue bbox illustrates the projection of the 3D bbox using the estimated 6D pose, and the red one represents the groundtruth projection of the 3D bbox. The circles in Fig. 7 highlight some errors. As shown in Fig. 7, YOLO6D and Single-Stage make clearly wrong estimations, while the gap among the proposed method, Wide-Depth-Range and PVNet is slight.
Table 1 Performance comparison on the Aircraft 6D dataset.
In this section, a series of ablation experiments is carried out to explore the impact of 2D and 3D information on aircraft 6D pose estimation. This section also explores the role of hyperparameters, such as the representation of the 2D and 3D information and the radius and number of the keypoints.
Table 2 shows the influence of the 2D information (Kp_2D) and 3D information (Kp_3D) on aircraft 6D pose estimation. In this set of experiments, the number of keypoints is 10, and the radius of the keypoints is r = 3 pixels.
Like most methods, the proposed method also predicts the mask of the aircraft, which can eliminate false detections caused by the background. Directly fusing the predicted mask, 2D and 3D information gives very bad results. Therefore, the translation matrix is normalized to [-1, 1], which is called T Norm. Then the impacts of the mask, 2D information and 3D information on performance are explored separately. The 2nd to 5th sets of experimental results show that 2D information improves performance more than 3D information, and that 3D information has a side effect after merging 2D and 3D information. The reason is the inaccurate estimation of the 3D coordinates of the keypoints. Therefore, all 3D coordinates of the keypoints are also normalized to [-1, 1], which is called Coord Norm. The results of the last four sets of experiments in Table 2 show that the mask, 2D and 3D information all contribute to the performance of the method after translation matrix and coordinate normalization. Finally, the configuration of the 8th set of experiments, whose settings have the highest accuracy, is adopted.
A binary map is utilized to represent the 3D coordinates of the keypoints: if a pixel is close to the 2D projection coordinate of a 3D keypoint, the value of this pixel is the value of one coordinate of that 3D keypoint; otherwise, it is set to 0. Does the arrangement of the binary maps affect the performance? Table 3 shows the answer. Based on the 8th set of experiments in Table 2, three arrangements are explored. In the first, the v-th coordinate of the i-th keypoint is stored in the (i + N × (v − 1))-th map, called Repre_xxyyzz. In the second, the v-th coordinate of the i-th keypoint is stored in the (v + (i − 1) × 3)-th map, called Repre_xyzxyz. In the last, the v-th coordinate of the i-th keypoint is stored in the v-th map, called Repre_xyz. The three index schemes are sketched below. As shown in Table 3, the difference in representation has little impact on performance. The three sets of experiments in Table 3 also show that the number of input layers does not have much influence on the 6D pose analysis of PnPNet. Finally, Repre_xxyyzz is adopted.
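For clarity, the three layouts differ only in which map stores the v-th coordinate (v = 1, 2, 3 for x, y, z) of the i-th of N keypoints; a small sketch:

```python
def map_index(i, v, N, layout):
    """0-based map index storing coordinate v of keypoint i (both 1-based)."""
    if layout == "xxyyzz":
        return (i + N * (v - 1)) - 1   # all x maps, then all y, then all z
    if layout == "xyzxyz":
        return (v + (i - 1) * 3) - 1   # x, y, z grouped per keypoint
    if layout == "xyz":
        return v - 1                   # one shared map per coordinate
```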
Table 4 illustrates the effect of the radius of the keypoint on performance, that is, the effect of the parameter σ in Eq. (5). In theory, the smaller the radius, the fewer features the keypoint contains; the larger the radius, the more features are included. The richness of the features affects the 2D and 3D information prediction of the keypoints, and it also affects the representation of G_3D(·). Therefore, this paper conducts the experiments in Table 4, gradually increasing the radius of the 2D projection point of the 3D keypoint to explore the optimal radius. As shown in Table 4, the effect of the radius is slight. The reason is that PnPNet analyzes the input 2D and 3D information and has a certain robustness to errors in this information. Finally, the selected radius of the keypoint is r = 4 pixels.
The proposed method utilizes the 2D and 3D information of the keypoints to estimate the 6D pose of the aircraft. Does the amount of 2D and 3D information then have an impact on performance? To explore whether the number of keypoints matters, the experiments in Table 5 are carried out. According to the symmetry of the aircraft, 14 points are pre-designed in this paper (see Fig. 8), of which point 7 is the origin of the body coordinate system, points 11 and 12 represent the centers of the engines, points 13 and 14 represent the positions of the landing gear, and the remaining points are described in Section 3.1. "N" in Table 5 means taking the first N points in Fig. 8. Some endpoints, connection points, and the origin of the body coordinate system are considered necessary (i.e., the first 9 keypoints), because these points make up the body of the aircraft. These 9 keypoints are regarded as the original keypoints. As shown in Table 5, as keypoints are gradually added, the accuracy of the aircraft 6D pose does not change significantly, which illustrates that increasing the number of keypoints has little effect on the estimation of the aircraft 6D pose. Finally, N = 10 is adopted.
Fig.7 Qualitative results of different methods under 2D projection metric.
Table 2 Performance of mask, 2D and 3D information w/ or w/o normalization.
Table 3 Effect of different representation of 3D information.
Table 4 Effect of the radius of the keypoint on performance.
Table 5 Effect of the number of keypoints on performance.
Fig.8 Candidate keypoints.
The accuracy of the intermediate supervision plays an important role in the final 6D pose estimation of the aircraft. Thus, we explore the accuracy of each step of the proposed method. For the first step (DetectNet), we use Average Precision (AP)22,35 as the evaluation metric, which is widely used in object detection. For the second step (KpNet), we use the mask Intersection over Union (mask IoU),36–37 which shows the overlap between the predicted mask and the real mask, to evaluate the predicted mask. The Object Keypoint Similarity (OKS) metric,38 which is widely used in keypoint detection, is adopted to evaluate the predicted Kp_2D. For the predicted Kp_3D, we adopt the same principle as the 5 m metric to calculate the regression error of each coordinate of each keypoint. Specifically, for each map in Kp_3D, a clustering method is used to find the pixels that are close to the predicted keypoint; the mean over these pixels gives the final predicted coordinate, as sketched below. The L2 norm of the difference between the predicted and real coordinate gives the regression error, which is considered acceptable if it is less than 5 m. For the third step (PnPNet), ADD, 5°&5m, 5°, 5 m, and 2D Proj are utilized as the evaluation metrics, as shown in Table 1. As shown in Table 6, the results of the intermediate supervision (2D bbox, Kp_2D, Kp_3D and mask) and the final result (6D pose) are satisfactory. The reason why the accuracy of Kp_3D is only 81.51% is that the position of the aircraft varies greatly and can reach hundreds or thousands of meters, so it is very difficult to keep the precision within 5 m. The accuracy of the mask is 80.24%, which is mainly caused by wrong predictions of pixels near the landing gear.
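A hedged sketch of this read-out (we replace the unspecified clustering step with a fixed disk around the Kp_2D peak, and the background threshold is our assumption):

```python
import numpy as np

def read_coordinate(kp3d_map, peak_uv, radius=4):
    """Average the pixels near the predicted keypoint to get one coordinate.

    kp3d_map: (64, 64) map holding one coordinate of one keypoint.
    peak_uv:  (u, v) keypoint location taken from the matching Kp_2D map.
    """
    ys, xs = np.mgrid[:kp3d_map.shape[0], :kp3d_map.shape[1]]
    near = (xs - peak_uv[0]) ** 2 + (ys - peak_uv[1]) ** 2 <= radius ** 2
    vals = kp3d_map[near]
    return vals[np.abs(vals) > 1e-6].mean()   # ignore background zeros
```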
Table 6 Performance of each step of the proposed method.
Due to the large scene and large depth, existing 6D pose estimation methods cannot obtain high-precision 6D pose information for aircraft: the two-stage methods lack depth information, and the existing one-stage methods focus too much on the regression of the final 6D pose, so the 6D pose estimation of the aircraft is inaccurate. In this paper, a 2D + 3D information based end-to-end aircraft 6D pose estimation method is proposed. The proposed method utilizes an off-the-shelf object detector to obtain the position of the aircraft in the image, and the region of interest is sent to KpNet to predict the 2D projection positions and 3D spatial positions of the aircraft keypoints. This 2D and 3D information is fused to estimate the 6D pose of the aircraft during take-off and landing. The experimental results show that the proposed method is more accurate and efficient than other methods.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This study was co-supported by the Key Research and Development Plan Project of Sichuan Province, China (No. 2022YFG0153).