Lei Fu, Wen-bin Gu, Wei Li, Ling Chen, Yong-bo Ai, Hua-lei Wang
a College of Field Engineering, Army Engineering University of PLA, Nanjing 210007, China
b 93182 Troops, People's Liberation Army, Shenyang 110000, China
c 93033 Troops, People's Liberation Army, Shenyang 110000, China
d Automobile NCO Academy, Army Military Transportation University, Bengbu 233011, China
Keywords: Aerial images; Object detection; Feature pyramid networks; Multi-scale feature fusion; Swarm UAVs
ABSTRACT In this paper, based on a bidirectional parallel multi-branch feature pyramid network (BPMFPN), a novel one-stage object detector called BPMFPN Det is proposed for real-time detection of ground multi-scale targets by swarm unmanned aerial vehicles (UAVs). First, bidirectional parallel multi-branch convolution modules are used to construct the feature pyramid to enhance the feature expression abilities of feature layers of different scales. Next, the feature pyramid is integrated into the single-stage object detection framework to ensure real-time performance. In order to validate the effectiveness of the proposed algorithm, experiments are conducted on four datasets. For the PASCAL VOC dataset, the proposed algorithm achieves a mean average precision (mAP) of 85.4 on the VOC 2007 test set. On the object detection in optical remote sensing images (DIOR) dataset, the proposed algorithm achieves 73.9 mAP. On the vehicle detection in aerial imagery (VEDAI) dataset, the detection accuracy for small land vehicle (slv) targets reaches 97.4 mAP. On the unmanned aerial vehicle detection and tracking (UAVDT) dataset, the proposed BPMFPN Det achieves an mAP of 48.75. Compared with the previous state-of-the-art methods, the results obtained by the proposed algorithm are more competitive. The experimental results demonstrate that the proposed algorithm can effectively solve the problem of real-time detection of ground multi-scale targets in aerial images of swarm UAVs.
In recent years, unmanned aerial vehicle (UAV) or drone swarm technology has attracted widespread attention. Drone swarms have been applied to intelligence reconnaissance [1], surveillance [2], communication [3], and military tasks [4,5]. In 2005, the United States Department of Defense (DOD) released the UAV Systems Roadmap 2005-2030, which set autonomous drone swarms as the ultimate development goal. In addition, many experiments in this field, including the Perdix, Elves, and Coyote drone programs, have been carried out in recent years.
The swarm UAVs need to efficiently and accurately detect targets of different scales on the ground as a precondition for their next action. In recent years, general object detection algorithms have been constantly updated on the PASCAL VOC and MS COCO detection datasets. Typical single-stage object detectors include YOLO [6], SSD [7], YOLOv3 [8], RefineDet [9], RFBNet [10], RetinaNet [11], M2Det [12], CenterNet [13], FSAF [14], CornerNet [15], FCOS [16], EfficientDet [17], and YOLOv4 [18]; typical two-stage object detectors include Faster-RCNN [19], R-FCN [20], Mask-RCNN [21], TridentDet [22], Libra-RCNN [23], and CBNet [24]. The two-stage object detectors have higher detection accuracy than the single-stage object detectors, but their speed cannot meet the requirements of real-time applications.
Although significant progress has been made in general target detection algorithms, these algorithms usually cannot deal with drone-captured sequences or images [25]. The images taken by UAVs differ from those of common natural scenes, especially in terms of scale and background complexity. Moreover, recent studies have shown that single-stage detectors can achieve high accuracy while maintaining real-time speed. Therefore, numerous studies have been conducted to improve the existing single-stage object detectors for UAV image target detection.
Rohan et al. [26] studied the use of the single-shot detector (SSD) algorithm [7] for human detection and achieved high detection accuracy. However, they considered only people at close range without analyzing the multi-scale situation. In Ref. [27], the YOLOv2 [28] network was used to detect outdoor high-voltage line facilities photographed by UAVs. The detection accuracy of the YOLOv2 method is better than that of non-deep-learning methods. However, this method is not necessarily suitable for multi-scale and multi-category target detection. Based on YOLOv3 [8], the authors of Ref. [29] enhanced the spatial information by concatenating two ResNet units with the same width and height and adding a convolution operation in the early layers. Their algorithm achieved excellent results for small scale changes, but its performance decreased for large scale changes. SlimYOLOv3 [30] reduced the number of parameters and the model volume by using convolution-layer channel pruning to cope with airborne embedded devices with limited memory and computing resources.
Although the above-mentioned algorithms achieve good detection accuracy on their respective datasets, they are not universally applicable to the problems of multi-scale and constantly changing target features in UAV aerial images. Therefore, this paper investigates the construction of a bidirectional parallel multi-branch feature pyramid network (BPMFPN) to solve the real-time detection problem of multi-scale targets in swarm UAV aerial images.
The main contributions of this paper can be summarized as follows:
• A novel feature pyramid construction method that uses bottom-up and top-down paths to extract image features is proposed. The features are extracted in parallel within each path, in contrast to the serial feature extraction methods proposed in previous studies.
• A single-stage real-time object detection algorithm named the BPMFPN Det, which is suitable for aerial images of swarm UAVs, is proposed.
• Experiments are conducted on the PASCAL VOC 2007 [31], DIOR [32], VEDAI [33], and UAVDT [34] datasets to validate the effectiveness of the proposed algorithm and to compare its performance with that of the existing state-of-the-art algorithms.
The remainder of the paper is organized as follows. Related work is presented in Section 2. The proposed method is described in Section 3. In Section 4, the experimental results on the four datasets are provided and discussed. Finally, the conclusion is presented in Section 5.
In recent years, single-stage object detectors have been widely studied due to their high detection efficiency. YOLO [6] uses a forward network to obtain the detection layer and performs classification and regression in the last detection layer. SSD [7] employs the feature pyramid structure to lay anchors on feature layers of different scales, where anchors of different scales are responsible for detecting objects of different sizes. RON [35] improves SSD by introducing the reverse connection block and also connects deep and shallow feature maps to improve detection accuracy. RFBNet [10] uses the Receptive Field Block to replace the convolution module in the original SSD network and extracts more discriminative and robust features, thus improving detection accuracy. RefineDet [9] uses the Anchor Refinement Module (ARM) and the Object Detection Module (ODM) to simulate the detection process of a two-stage detector in order to obtain the accuracy of a two-stage target detector. YOLOv3 [8] constructs multi-scale feature layers by fusing deep and shallow feature layers and obtains prior anchors by clustering. EfficientDet [17] employs a weighted bidirectional feature pyramid network (BiFPN) and a compound scaling method to adjust the resolution, depth, and width of the backbone network.
The construction of a feature pyramid plays an important role in the field of object detection. Several commonly used feature pyramid construction methods are presented in Fig. 1. The feature pyramid network (FPN) [36] uses a top-down architecture with lateral connections to build high-level semantic feature maps at all scales. PANet [37] introduces a bottom-up structure based on the FPN, which shortens the path between the bottom and top layers. In this method, the location information of the bottom layer can be transmitted to the topmost layer to enhance the feature representation ability. NAS-FPN [38] uses Neural Architecture Search (NAS) to find an irregular combination of top-down and bottom-up connections to enhance the features at all scales. EfficientDet [17] stacks multiple serial BiFPN structures to achieve multi-scale feature fusion.
Fig. 1. Several common feature pyramid construction methods.
The overall architecture of the proposed BPMFPN Det is illustrated in Fig. 2; it consists of three parts: the Backbone, the BPMFPN, and the Head. The Backbone uses CSPDarknet53 [18] as the backbone network to extract features. The BPMFPN is composed of two parts, a top-down part and a bottom-up part. The Head performs the classification and regression tasks.
The biggest difference between the proposed method and the existing methods is that the features of different scales are extracted in parallel within a single path, in contrast to the existing methods, where multiple serial convolutions or multiple stacked BiFPN structures extract the features of different scales. In Fig. 2, the double arrows represent the parallel multi-branch convolution block (PMCB) module. The Head performs class score prediction and bounding box regression on the feature layers of different scales, followed by a non-maximum suppression (NMS) operation to obtain the final results, as sketched below. Group convolution is commonly used in the PMCB module to reduce the model volume by reducing the number of model parameters. Since the proposed approach aims to explore the effectiveness of the BPMFPN structure in multi-scale feature extraction, the loss function of the original paper is used.
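As an illustration of this last step, the NMS post-processing can be sketched with torchvision's built-in operator (a minimal sketch; the boxes, scores, and the IoU threshold of 0.45 are illustrative assumptions rather than values from the paper):

```python
import torch
from torchvision.ops import nms

# Suppose the head has produced, for one image, candidate boxes in
# (x1, y1, x2, y2) format together with their class scores.
boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],
                      [100., 100., 160., 150.]])
scores = torch.tensor([0.9, 0.75, 0.8])

# Keep the highest-scoring box among heavily overlapping candidates.
keep = nms(boxes, scores, iou_threshold=0.45)  # assumed threshold
final_boxes, final_scores = boxes[keep], scores[keep]
print(keep)  # tensor([0, 2]): the second box overlaps the first and is suppressed
```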
Fig. 2. The architecture of the proposed BPMFPN Det.
This section first describes the multi-scale feature pyramid mathematically, then introduces the construction process of the BPMFPN, and finally describes the PMCB module needed to build the feature pyramid.
3.2.1. Mathematical description and construction of BPMFPN
The purpose of constructing the feature pyramid is to integrate more discriminative features. The structure of the BPMFPN is presented in Fig. 3. From the mathematical point of view, Eqs. (1)-(3) are used to build the feature pyramid.
Fig. 3. The architecture of the BPMFPN.
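The bodies of Eqs. (1)-(3) did not survive extraction; based on the construction described below, with PC denoting the PMCB module (including any resizing between adjacent levels), they can plausibly be reconstructed as:

P_5^1 = PC(P_5),  P_i^1 = PC(PC(P_{i+1}^1) · P_i),  i = 4, 3   (1)

P_3^2 = PC(P_3^1),  P_i^2 = PC(PC(P_{i-1}^2) · P_i^1),  i = 4, 5   (2)

P_i^o = P_i ⊕ P_i^2,  i = 3, 4, 5   (3)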
where · represents channel concatenation and ⊕ represents the element-wise sum; P_i^j (i ∈ {3,4,5}; j ∈ {1,2}) represents the feature layer at level i obtained after j PC convolutions relative to P_i, and P_i^o represents the final output layer of the level-i feature. The double arrow in Fig. 3 represents the PMCB module, referred to as PC for short. The PMCB module is discussed in the following section.
Eq. (1) represents the construction of the top-down structure. The PMCB module is first used to extract multi-scale features of P_5 to obtain P_5^1 and then to further extract the features. Next, the feature combined with the multi-scale feature is obtained from P_4 through concatenation. Eq. (2) represents the construction of the bottom-up structure. Eq. (3) represents the transfer connection structure. The first feature of each level is added to the last feature by element-wise sum to obtain the most discriminative feature layers, e.g., P_i^o. For the concatenation operation, take P_4^1 as an example. Suppose the feature map of PC(P_5^1) is F_{h,w,c1} and the feature map of P_4 is H_{h,w,c2}; then their concatenation denotes a three-dimensional tensor M_{h,w,c1+c2}. The height and width of the feature map are denoted as h and w, respectively, and c1 and c2 represent the channel numbers of the two feature maps. The output feature map represents the stack of the two feature maps in the channel dimension. For the sum operation, take P_4^o as an example. Suppose the feature map of P_4^o is M_{h,w,cc}; then the calculation formula is as follows:

M_{h,w,cc} = H_{h,w,c2} ⊕ K_{h,w,c3}   (4)

where K_{h,w,c3} represents the feature map of P_4^2, c3 is the channel number of P_4^2, c2 and c3 are the same as cc, and element-wise addition at the corresponding positions is used.
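In PyTorch terms, the two fusion operations amount to torch.cat along the channel dimension and a plain tensor addition (a minimal sketch with assumed channel counts):

```python
import torch

h, w = 40, 40
F = torch.randn(1, 128, h, w)   # c1 = 128, e.g. the resized PC(P_5^1)
H = torch.randn(1, 256, h, w)   # c2 = 256, e.g. P_4

# Concatenation: stack along the channel dimension -> c1 + c2 channels.
M_cat = torch.cat([F, H], dim=1)
assert M_cat.shape == (1, 128 + 256, h, w)

# Element-wise sum: requires identical shapes (c2 == c3 == cc).
K = torch.randn(1, 256, h, w)   # c3 = 256, e.g. P_4^2
M_sum = H + K
assert M_sum.shape == H.shape
```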
It is believed that the extraction of features of different scales from multiple parallel branches is conducive to the propagation of top-level semantic information to the lower levels. At the same time, the multi-scale extraction of the bottom features increases the discriminability of the top-level features. By using the transfer module, the original scale features can be preserved to a greater extent.
3.2.2. Parallel multi-branch convolution block
The earliest parallel feature extraction method was proposed in the Inception network [39]. Since then, several improved versions have been developed [40,41]. This kind of method can effectively extract multi-scale features of objects. The architecture of the PMCB module constructed in this paper is presented in Fig. 4.
The PMCB module fuses features at five different pyramid scales. From left to right (refer to Fig. 4), the first branch uses global average pooling (GAP) to extract global features. In the second to fifth branches, multi-scale feature extraction is performed using convolutions with a kernel size of 3 whose dilation rate increases gradually from branch to branch. The features of the different receptive fields are first concatenated, then passed through a BC-R, and finally added to a shortcut layer. In order to improve the expression ability, 3×1 and 1×3 convolutions are used. The channel numbers of the convolution kernels are deliberately designed so that group convolution can be used to reduce the number of parameters. Take the third branch as an example. Suppose the input feature map is I with n input channels. First, the number of channels is reduced to n/8 by a 1×1 convolution, and then the features are further extracted by 3×1 and 1×3 convolutions, respectively. Since the numbers of input and output channels are both n/8, a special group convolution, namely depthwise separable convolution [42,43], is used to reduce the number of parameters: the group parameter of the convolution is set to n/8. The settings of the other branches are similar. For more details, please refer to Ref. [44].
In Fig. 4, BCG indicates BasicConv plus group convolution, BC indicates BasicConv, and BC-R indicates BasicConv without ReLU activation. It should be pointed out that when a large-scale feature map is converted to a small-scale feature map, the BC-R module in the PMCB uses a stride of two for downsampling. When a feature map is transformed at the same scale, the stride is one, which does not affect the size of the feature map. A simplified sketch of this block is given below.
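To make the branch layout concrete, the following PyTorch sketch reproduces the structure described above. It is a simplified reconstruction under stated assumptions: the dilation rates (1, 2, 3, 5), the Conv-BN-ReLU composition of BasicConv, the 1×1 projection on the GAP branch, and the ReLU after the shortcut addition are illustrative choices, not details confirmed by the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bc(cin, cout, k=1, s=1, p=0, d=1, g=1, relu=True):
    """BasicConv: Conv + BN, optionally followed by ReLU (BC-R drops the ReLU)."""
    layers = [nn.Conv2d(cin, cout, k, s, p, dilation=d, groups=g, bias=False),
              nn.BatchNorm2d(cout)]
    if relu:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class PMCB(nn.Module):
    """Parallel multi-branch convolution block (illustrative sketch)."""
    def __init__(self, n, stride=1):
        super().__init__()
        m = n // 8
        self.gap_proj = nn.Conv2d(n, m, 1)           # branch 1: global context
        self.branches = nn.ModuleList([
            nn.Sequential(
                bc(n, m, k=1),                        # 1x1 channel reduction to n/8
                bc(m, m, k=(3, 1), p=(1, 0), g=m),    # depthwise 3x1 (groups = n/8)
                bc(m, m, k=(1, 3), p=(0, 1), g=m),    # depthwise 1x3
                bc(m, m, k=3, s=stride, p=d, d=d))    # dilated 3x3; stride 2 downsamples
            for d in (1, 2, 3, 5)])                   # assumed dilation rates
        self.fuse = bc(5 * m, n, relu=False)          # BC-R after concatenation
        self.short = None if stride == 1 else bc(n, n, k=1, s=stride, relu=False)

    def forward(self, x):
        outs = [branch(x) for branch in self.branches]
        g = self.gap_proj(F.adaptive_avg_pool2d(x, 1))
        outs.append(g.expand(-1, -1, *outs[0].shape[2:]))  # broadcast GAP features
        y = self.fuse(torch.cat(outs, dim=1))
        sc = x if self.short is None else self.short(x)
        return F.relu(y + sc)

# Same-scale transform (stride 1) vs. downsampling transition (stride 2).
x = torch.randn(1, 256, 40, 40)
print(PMCB(256)(x).shape, PMCB(256, stride=2)(x).shape)
```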
Fig. 4. The architecture of the PMCB module.
In order to comprehensively evaluate the performance of the proposed algorithm, experiments were conducted on the following four datasets: PASCAL VOC [31], DIOR [32], VEDAI [33], and UAVDT [34]. Considering the limited computing resources of the UAV platform, only images with sizes of 320, 416, and 512 were used in the evaluation process. Although larger images can provide more accurate results, our original intention is to save as many computing resources as possible while achieving real-time detection.
The hardware was a computer with an Intel Core i7-8700K CPU, an NVIDIA GTX 1080 Ti GPU, and 32 GB of memory. The software included PyTorch 1.5, CUDA 10.1, and cuDNN 7.6.5.
Each of the four datasets had its own unique settings plus a basic setting common to all four. The training parameter settings were the same in all four experiments: a multi-scale training method was used, and the cosine learning rate strategy [45] with an initial learning rate of 0.01 was adopted. The batch size was set to four. The SGD optimizer with a momentum of 0.937 and a weight decay of 0.0005 was utilized. The mosaic data augmentation method [18] was used to increase the number of positive training samples, together with random flipping and color jittering. The specific training parameters for the different datasets are given in the corresponding experiment sections.
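These hyper-parameter settings translate directly into PyTorch (a minimal sketch; `model` is a placeholder for the detector, and tying T_max to the epoch count is an assumption about how the cosine schedule was applied):

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder for the detector network

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
epochs = 120
# Cosine learning-rate decay over the whole training run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one pass over the training set with batch size 4 ...
    scheduler.step()
```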
4.1.1. Dataset introduction
The PASCAL VOC is a general-purpose object detection dataset that contains 20 types of common objects seen from conventional viewpoints. It has been widely used in the evaluation of general object detection algorithms.
4.1.2. Experimental settings and evaluation metrics
The train and validation sets of VOC2007 and VOC2012 were combined to form a training set containing 16,551 images, which was used to train the proposed BPMFPN Det. The test was carried out on the VOC 2007 test set with 4952 images. The total number of training epochs was 120. The multi-scale training sizes ranged from 320 to 512. The average precision (AP) of each category and the mAP over all categories were used as the evaluation criteria.
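Multi-scale training in this size range is commonly implemented by re-sampling the input resolution from multiples of the network stride (a sketch of the usual practice; the 32-pixel step and per-batch re-sampling are assumptions, not details given in the paper):

```python
import random

def sample_train_size(lo=320, hi=512, step=32):
    """Pick a square input size in [lo, hi] that is a multiple of `step`."""
    return random.randrange(lo, hi + 1, step)

# Each training batch is resized to one of 320, 352, ..., 512.
print([sample_train_size() for _ in range(5)])
```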
4.1.3. Experimental results
The comparison of the proposed BPMFPN Det with other detectors is given in Table 1. When the input size was 320×320, the BPMFPN Det achieved an mAP of 83.6, exceeding the result of YOLOv3 by 3.4 mAP. The proposed BPMFPN Det even exceeded some of the detectors with an input size of 512×512. For the 512×512 input, the BPMFPN Det achieved an mAP of 85.4, which surpassed all the other single-stage object detection algorithms. The results in Table 1 demonstrate that the proposed detector can obtain more accurate results with a smaller input size, which is important for saving computing resources.
Table 1. The AP and mAP results on the PASCAL VOC 2007 test set of different detectors (the per-category AP columns are omitted here). The compared detectors are SSD300* [7], DSSD321 [46], RefineDet320 [9], YOLOv3-320 [8], PMCBNet300 [44], and BPMFPN Det320 for small inputs, and SSD512* [7], MDSSD512 [47], RefineDet512 [9], YOLOv3-416 [8], PMCBNet512 [44], and BPMFPN Det416 for larger inputs, with VGG, ResNet-101, Darknet53, and CSPDarknet53 backbones. The two BPMFPN Det entries (CSPDarknet53) achieve mAPs of 83.6 and 85.4, respectively.
In terms of speed, the total time required by the proposed algorithm to process the 4952 test images was measured. The speed of the BPMFPN Det320 was 40 frames per second (FPS), and that of the BPMFPN Det416 was 39 FPS; both meet the real-time application requirement.
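Throughput figures of this kind are typically obtained by timing a full pass over the preprocessed test images (a generic sketch, not the authors' measurement harness; `model` and `images` are placeholders):

```python
import time
import torch

@torch.no_grad()
def throughput(model, images):
    """Average frames per second over a list of preprocessed image tensors."""
    model.eval()
    if torch.cuda.is_available():
        torch.cuda.synchronize()           # start from an idle GPU
    start = time.perf_counter()
    for img in images:
        _ = model(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()           # wait for queued GPU work to finish
    return len(images) / (time.perf_counter() - start)
```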
4.1.4. Ablation studies
In order to explore the effect of each component on the detection performance in the process of building the pyramid, four different experiments were conducted on the PASCAL VOC 2007 dataset; the obtained results are shown in Table 2. For a fair comparison, the same training strategy and the same input image size (320×320) were used.
Baseline: As a baseline, the original YOLOv3 was trained for 120 epochs on the PASCAL VOC dataset, which achieved an mAP of 80.2. In the following analysis, different components were gradually replaced, and finally the BPMFPN Det was constructed, in order to analyze the effectiveness of each component.
Extra bottom-up path: An additional bottom-up path was added on the basis of the original YOLOv3, as shown in Fig. 3, where the paths from P_i^1 to P_i^2 (i ∈ {3,4,5}) and from P_k^2 to P_{k+1}^2 (k ∈ {3,4}) indicate this structure. As shown in Table 2, compared with the baseline, the mAP was improved by 1.7 by adding the additional bottom-up path constructed with the PMCB.
Table 2 Effectiveness of various components on the PASCAL VOC 2007 test set.
Changed prior top-down structure: Next, the original top-down structure of the feature pyramid in YOLOv3 was replaced by the new top-down structure to form the bidirectional parallel multi-branch network structure. Comparing columns 3 and 4 in Table 2 shows that the changed structure improves the AP by 0.5, namely from 81.9 to 82.4.
Transfer connection: As shown in Fig. 3, the transfer connection from P_i to P_i^o (i ∈ {3,4,5}) was added. The detection accuracy of this model is shown in the fifth column of Table 2. Comparison with the fourth column shows that the transfer connection structure obtains an AP improvement of 0.5, from 82.4 to 82.9. The main reason for this improvement is that the fusion of the original and output features makes the feature layers more distinguishable.
CSPDarknet53: A more discriminative backbone network was used to verify the impact of network performance on the detection results. The backbone was changed from Darknet53 to CSPDarknet53, which yielded an mAP improvement of 0.7, from 82.9 to 83.6.
4.1.5. Qualitative results
Some visualizations on the PASCAL VOC 2007 test set are shown in Fig. 5.
Fig. 5. Qualitative results on the PASCAL VOC test set.
4.2.1. Dataset introduction
The DIOR dataset benchmark includes 23,463 images covering 20 object classes. The DIOR dataset contains a large number of object instances and a large range of object sizes. It includes large variations in object size because the images were obtained under different conditions, such as different weather, seasons, and image qualities. It also has high inter-class similarity and intra-class diversity. Since the original large-scale remote sensing images were cropped into small pieces, which makes them similar to aerial images from the perspective of UAVs, this paper uses the DIOR dataset to evaluate the performance of the proposed algorithm.
4.2.2. Experimental settings and evaluation metrics
In this experiment, the proposed BPMFPN Det was trained using the training and validation sets containing 11,725 images and tested on the remaining 11,738 images. Since the number of categories in the DIOR dataset is equal to that of the PASCAL VOC dataset, the same training strategy was used. Table 3 lists the 20 target categories.
4.2.3. Experimental results
The comparison of the proposed BPMFPN Det with other detectors is presented in Table 4, where it can be seen that the BPMFPN Det achieved the best results for 14 of the 20 object classes with an input size of 512×512. For classes with a large range of object size variation, such as c7, c12, c14, and c15, the proposed algorithm obtained the best results. For high intra-class diversity (e.g., chimney) and high inter-class similarity (e.g., basketball court vs. tennis court and bridge vs. dam), the proposed algorithm performed best among all the detectors at an input size of 512×512. For small targets, such as vehicles, the proposed detector obtained the best results. Moreover, the proposed algorithm reached a speed of 37 FPS.
4.2.4. Qualitative results
Some visualizations on the DIOR test set are shown in Fig. 6.
4.3.1. Dataset introduction
The VEDAI dataset is a benchmark for small vehicle target detection in aerial images, and it has the following characteristics: small targets and large changes in target orientation and light shadows. It includes nine vehicle classes: boat, car, pickup, truck, tractor, plane, van, camping car, and an 'other' class that contains all other vehicle types. Small land vehicles (slv) include the car, pickup, tractor, and van, while large land vehicles (llv) include the truck and camping car. There is an average of 5.5 vehicles per image, and they occupy about 0.7% of the total image pixels. This dataset is used to verify the performance of the proposed algorithm for small target detection.
4.3.2. Experimental settings and evaluation metrics
Since the dataset uses 10-fold cross-validation, 120 training epochs were used for each fold, and the model was then tested separately on each test set. The other training strategies are consistent with those used for the VOC dataset. Following the conventions of the VEDAI dataset, the evaluation metrics given by Eqs. (5)-(7) were used. For a given image I and a threshold t, a detection with a score higher than t was regarded as a real detection. When a real detection matched the ground truth, it was counted as a True Positive, TP(I,t); when it did not match the ground truth, it was counted as a False Positive, FP(I,t); lastly, when a target was not detected, it was counted as a False Negative, FN(I,t).
In Eqs. (6) and (7), A_T denotes the number of targets in the testing fold, and N_T denotes the number of images in the considered fold.
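The bodies of Eqs. (5)-(7) did not survive extraction; following the standard VEDAI evaluation protocol, they plausibly take the form:

Precision(t) = Σ_I TP(I, t) / (Σ_I TP(I, t) + Σ_I FP(I, t))   (5)

Recall(t) = Σ_I TP(I, t) / A_T   (6)

FPPI(t) = Σ_I FP(I, t) / N_T   (7)

where the sums run over all images I in the considered fold.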
4.3.3. Experimental results
Due to the limited computing resources, an input size of 512×512 was used in the test. The results of this experiment were compared with the best results reported in the literature. As shown in Table 5, the proposed algorithm completely surpassed the best traditional algorithm. It should be noted that the proposed algorithm achieved a processing speed of up to 38 FPS.
Table 5. Detection results in terms of mAP on the VEDAI test set for different vehicle types.
In order to analyze the detection performance of the proposed algorithm for targets of different scales, the car in the slv category and the truck in the llv category were selected as examples, and the fold with the smallest AP was chosen for analysis. The Precision-Recall (PR) and Recall-FPPI curves are presented in Fig. 7(a) and (b), respectively. As shown in Fig. 7(a), the proposed algorithm achieved a precision of at least 90% for both small-scale and large-scale targets. As shown in Fig. 7(b), even at the highest recall, the worst case (the car) incurred less than 0.3 false positives per image (FPPI). In fact, when the FPPI was 0.1, the recall was higher than 90%, indicating that the proposed detection algorithm can find more real targets with fewer detection errors.
4.3.4. Qualitative results
The detection results in some complex scenarios are presented in Fig. 8, where it can be seen that the proposed algorithm is able to obtain satisfactory detection results.
4.4.1. Dataset introduction
The UAVDT dataset contains complex scenes captured by UAVs under different weather conditions, flying altitudes, camera views, vehicle types, and occlusion levels. The training and test sets include 24,143 and 16,592 images, respectively.
4.4.2. Experimental settings and evaluation metrics
The training set was first clustered to obtain anchors suited to this dataset. The anchors obtained by clustering were as follows: (8,16), (11,27), (19,25), (15,43), (27,27), (24,66), (48,43), (57,102), and (231,171). The number of training epochs was set to 100, and the multi-scale range was 320-512. The test was conducted at an input image size of 512×512. The AP score was used to rank the different methods. When the intersection over union (IoU) between a detected box and a ground truth box was greater than the threshold of 0.7, the detection was regarded as successful.
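Anchor clustering of this kind is usually performed with k-means under a 1 - IoU distance on the ground-truth box widths and heights, in the style introduced by YOLOv2 (a sketch of that standard procedure, not the authors' exact code; the synthetic `boxes` array stands in for the UAVDT annotations):

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, treating boxes and anchors as sharing a corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = iou_wh(boxes, anchors).argmax(axis=1)   # distance = 1 - IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]     # sort by area

# boxes: an (N, 2) array of ground-truth (width, height) pairs.
boxes = np.abs(np.random.default_rng(1).normal(40, 20, (1000, 2))) + 4
print(kmeans_anchors(boxes).round(1))
```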
4.4.3. Experimental results
The PR curves of the different algorithms are presented in Fig. 9, where it can be seen that the proposed algorithm achieved an AP of 48.75%, surpassing the evaluation results of all the other algorithms on this dataset. It achieved both a higher AP (48.75% vs. 34.35%) and a higher speed (38 FPS vs. 4.65 FPS) than the two-stage detector R-FCN [20], and it achieved a higher AP (48.75% vs. 33.62%) than the one-stage detector SSD [7] at a similar speed (38 FPS vs. 41.55 FPS). It is worth noting that the proposed algorithm still had high precision when the recall was higher than 0.5.
In order to measure the performance of the proposed algorithm more comprehensively, it was tested under different attribute settings and compared with the other algorithms; the obtained results are presented in Fig. 10. As shown in Fig. 10, the proposed algorithm was superior to the other algorithms for every attribute. Specifically, for the night attribute, the BPMFPN Det surpassed the SSD [7] algorithm and achieved an AP of more than 25%. In the case of fog, the AP of the proposed algorithm was increased by 14% compared with the previous best algorithm. In these two cases, the imaging quality is poor, yet the proposed algorithm achieved good detection accuracy, better than those of the other algorithms.
Table 4. Detection results in terms of the AP and mAP on the DIOR test set. The best performance is represented by a bold black font in the original; the per-class AP columns (c1-c20) are omitted here.

Detector          Backbone        mAP
SSD [7]           VGG16           58.6
YOLOv3 [8]        Darknet-53      57.1
Faster-RCNN [19]  ResNet-101      65.1
Mask-RCNN [21]    ResNet-101      65.2
RetinaNet [11]    ResNet-101      66.1
PANet [37]        ResNet-101      66.1
CornerNet [15]    Hourglass-104   64.9
BPMFPN Det        CSPDarknet53    73.9
Fig. 6. Qualitative results on the DIOR test set.
By comparing the detection results at low, medium, and high altitudes, it can be seen that the proposed algorithm deals better with multi-scale situations. For different perspectives, even in the bird-view case with the most insufficient features, the proposed algorithm obtained an AP improvement of nearly 14% over the R-FCN algorithm [20]. Cars account for more than 90% of the instances in the whole dataset; in terms of car category detection, the proposed algorithm obtained an AP improvement of 14% over the R-FCN algorithm [20]. The BPMFPN Det has strong discriminative ability due to its special feature extraction method. In addition, the running speed of the proposed algorithm was 38 FPS.
4.4.4. Qualitative results
Some visualizations on the UAVDT test set are shown in Fig. 11.
The test results obtained by the proposed algorithm on the four different datasets demonstrate that it can achieve satisfactory results in solving the multi-scale problem of swarm UAV aerial images. The main reason is that the PMCB module can extract the multi-scale features of a target in parallel, and rich multi-scale features are crucial for solving the multi-scale problem. In the BPMFPN, the top-down structure transfers the multi-scale features from top to bottom, which is also a way of passing semantic information back; the bottom-up structure transfers the multi-scale information from bottom to top, which is more like a fusion of shallow location information and deep semantic information. Thus, the joint action of the two modules enables the extraction of more discriminative and robust features. However, on some datasets, the proposed algorithm cannot detect individual categories with high accuracy. The receptive field of the PMCB module is artificially designed, which is not optimal for the objects of such a category and results in an insufficient ability to extract representative features for it. Another reason may be that, in the data augmentation process, the proportion of positive and negative training samples for such a category is unreasonable, which can also prevent the proposed network from achieving optimal discriminability for that category.
Fig. 7. Performance curves of the two categories.
Fig. 8. Qualitative results on the VEDAI test set.
Fig. 9. Precision-Recall curves on the UAVDT test set. The legend presents the AP score and the GPU speed of each detection algorithm.
Fig. 10. Comparison of the results of different detection methods for each attribute.
Fig. 11. Qualitative results on the UAVDT test set.
In this paper, an effective feature pyramid structure is constructed to solve the problem of real-time detection of multi-scale targets in swarm UAV aerial images. A bidirectional parallel multi-branch convolution module is used to construct the feature pyramid, and based on this module, the BPMFPN Det is proposed. The proposed algorithm is verified by experiments on the PASCAL VOC 2007, DIOR, VEDAI, and UAVDT datasets. The experimental results demonstrate that the proposed algorithm is suitable for multi-scale target detection in aerial images of swarm UAVs. In the future, we intend to further reduce the number of model parameters and the model volume by using model compression technology in order to make the proposed algorithm more applicable.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.