Fan-jie Meng, Xin-qing Wang, Fa-ming Shao, Dong Wang, Yao-wei Yu, Yi Xiao
a Department of Mechanical Engineering, College of Field Engineering, Army Engineering University of PLA, Nanjing 210007, China
b College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing 210007, China
Keywords: Multi-object tracking; Deep learning; Gabor filter; Biological vision; Military application; Video processing
ABSTRACT Multi-armored target tracking (MATT) plays a crucial role in coordinated tracking and strike. Occlusion and insertion among targets, together with target scale variation, are the key problems in MATT. Most state-of-the-art multi-object tracking (MOT) works adopt the tracking-by-detection strategy, which relies on a compute-intensive sliding window or anchoring scheme in the detection module and neglects target scale variation in the tracking module. In this work, we propose a more efficient and effective spatial-temporal attention scheme to track multiple armored targets on the ground battlefield. By simulating the structure of the retina, a novel visual-attention Gabor filter branch is proposed to enhance detection. By introducing temporal information, online-learned target-specific Convolutional Neural Networks (CNNs) are adopted to address occlusion. More importantly, we built a MOT dataset for armored targets, called the Armored Target Tracking Dataset (ATTD), on which several comparative experiments with state-of-the-art methods are conducted. Experimental results show that the proposed method achieves outstanding tracking performance and meets actual application requirements.
Multi-armored target tracking (MATT) based on Unmanned Aerial Vehicles (UAVs), Armored Scout Cars (ARSVs) and other platforms is vital to grasping the dynamics of the ground battlefield. MATT is also a main research topic for fully autonomous observing and attacking equipment [1]. Among the many available sensors, the digital image sensor provides rich target features and has become a research hotspot in MATT.
Tracking requires some degree of reasoning about armored targets and establishes target correspondences between frames. Unlike general MOT, MATT on a complex ground battlefield must cope with tinier armored targets and a more complicated environment. Armored targets typically attack from a long distance, which demands that a vision-based tracking system track at greater distances. Typical combat environments for armored targets include jungle, desert and grassland, with disturbances such as dust, muzzle fire and smoke. Besides, MATT faces occlusion and insertion in the receptive field and lacks a ready-made MOT dataset for armored targets. Fig. 1 presents typical frames containing armored targets in our ATTD dataset. Compared with conventional multi-object tracking scenarios, the battlefield environment is more complex.
Most state-of-the-art MOT works adopt the tracking-by-detection strategy [2]. In the detection stage, these approaches generate target proposals using a Region Proposal Network (RPN) [3]. To increase the recall rate of target proposals, such methods adopt a compute-intensive sliding window or anchoring scheme and a very deep convolutional neural network. If this exhaustive strategy is adopted in two-step multi-object tracking, evaluating the large number of proposals becomes extremely computationally expensive. Besides, small objects are detected using features from shallow layers, which contain only low-level features and often cause false detections. In the tracking stage, it is difficult to address occlusion using only such an offline-trained method.
Considering all the difficulties in multi-armored target tracking, we propose a visual-attention Gabor filter based online multi-armored target tracking method. In the detection stage, an offline candidate recommendation module with two working modes is used to generate proposals, which is computationally efficient when no new armored target appears. Unlike the sliding window or dense anchoring scheme, we present a visual-attention Gabor filter based spatial-attention branch in the offline module, which leverages semantic features to determine suspect areas for insertion. To estimate candidates and deal with occlusion, an online candidate estimation module is proposed. In the online module, target-specific CNNs estimate the classification and occlusion of each candidate, and a novel motion model using temporal information predicts the precise position of each armored target. To address the lack of a ready-made MOT dataset for armored targets, we built an armored target tracking dataset (ATTD) through field collection and internet download. We conducted several comparative experiments on ATTD to verify the proposed method.
Our main contributions are summarized as follows. Firstly, a computationally efficient detector is proposed for the detection stage of multi-object tracking. Secondly, a visual-attention Gabor filter based spatial-attention mechanism is proposed to replace sliding windows or dense anchors. Thirdly, a novel online candidate estimation module is proposed to deal with occlusion and predict the precise position. Finally, and more importantly, an armored target tracking dataset (ATTD) is built to address the lack of a ready-made MOT dataset for armored targets.
Fig. 1. Typical frames containing armored targets in our ATTD dataset.
Armored target detection is the cornerstone of any multi-armored target tracking algorithm. Existing methods can be divided into infrared image-based methods [4-6] and visible image-based methods [1,7-9]. The key to armored target detection is to extract effective features. In the early stage, handcrafted features were used to extract armored targets, including the Histogram of Oriented Gradients (HOG) [10], the Scale-Invariant Feature Transform (SIFT) [11], and so on. Such human-engineered features are usually used in conjunction with a Support Vector Machine (SVM) [12], a shallow-learning-based discriminative classifier. However, methods based on handcrafted features are often limited to particular problems. Besides, human-engineered features may fail when the appearance of the armored target varies greatly or the target is small. Following the success of the region-based convolutional neural network (R-CNN) [3], a sequence of later armored target detection methods was developed based on R-CNN and its variants [1,13]. Sun et al. [1] propose a top-down aggregation (TPA) network for armored target detection. Wang et al. [13] detect armored targets with Faster R-CNN and a coarse image pyramid. R-CNN-based methods remarkably improve the detection accuracy for armored targets. However, the computational cost caused by sliding windows or dense anchors and the insufficient detection accuracy for small armored targets still need to be addressed.
Tracking is a fundamental task in any video processing that requires some degree of reasoning about objects of interest [14-16]. Inspired by the success of convolutional neural networks, numerous methods have been proposed to exploit hierarchical CNN features for visual tracking [17-30]. Occlusion and scale handling of multiple objects and the insertion of new objects are the core research problems of multi-object tracking [17-23]. Qi et al. [17] propose techniques that adaptively use level set segmentation and bounding box regression to obtain a tight enclosing box, and design a CNN to recognize whether the target is occluded. Qi et al. [18] propose a CNN-based tracking algorithm that hedges deep features from different CNN layers to better distinguish target objects from background clutter. Qi et al. [19] show that attribute information in the context facilitates training an effective classifier for visual tracking. At present, most state-of-the-art MOT methods [24-26] use a tracking-by-detection strategy, which is composed of a detection module and a tracking module. In Zhang S [27], a novel tracking-by-matching framework is proposed for robust tracking based on basis matching rather than point matching. Chu et al. [28] use a spatial-temporal attention mechanism to track multiple objects: a traditional RPN operating on deep convolutional features generates proposals, and multiple target-specific CNN branches estimate these proposals. Henschel et al. [29] demonstrate how to fuse two detectors into a tracking system and formulate tracking as a weighted graph labeling problem, resulting in a binary quadratic program. Zhang et al. [30] present a hybrid deformable convolutional neural network for multi-object tracking.
Focusing on multi-armored target tracking on the ground battlefield, we propose a tracking method based on the Single Shot Multi-Box Detector (SSD) [31] framework and a visual-attention Gabor correlation filter, as shown in Fig. 2. The MATT method mainly consists of two parts: (a) an offline candidate recommendation module; (b) an online candidate estimation module. Firstly, the current frame is sent into the offline module to generate ROI features. In the offline module, target proposals and ROI features are generated by two branches: a multiple-layers candidate recommendation branch and a novel visual-attention Gabor filter branch. The offline module works in two modes, used for fine detection and routine detection respectively. Meanwhile, a retina-simulating Gabor filter bank is used to enhance the features of shallow layers for small armored target detection. Then, the ROI features are sent into the online candidate estimation module. The online module includes a motion model and target-specific estimation CNNs, which are used to estimate occlusion and predict the precise position of each armored target. Finally, the positions and classifications of the armored targets are evaluated and linked across frames.
Fig. 2. Overview of our method. (a) Offline candidate recommendation module; (b) Online candidate estimation module.
Detection is the cornerstone of multi-object tracking methods based on the tracking-by-detection strategy. Offline modules based on deep convolutional neural networks have been widely applied in the detection stage of multi-object tracking. Such a strategy produces candidates from dense anchors with fixed scales and aspect ratios [32-36]. To increase detection accuracy, a large number of anchors are used on "hypercolumn" features [37]. This exhaustive strategy is too computationally expensive for tracking multiple armored targets in wide ground battlefield scenes with strict real-time requirements. Besides, most detection methods use shallow layers to generate small object proposals. However, shallow layers contain only low-level features and cause false detections of small objects. To address these problems, we use two branches to generate object proposals: a multiple-layers candidate recommendation branch and a novel visual-attention Gabor filter branch that enhances small target proposals on shallow layers. As shown in Fig. 3, our offline candidate recommendation module works in two modes. In Mode 1, the two branches work together; this mode is used for fine detection on the initial input of a video and when new targets are inserted. Mode 2 is used for non-initial frames in which no new target is inserted; it uses the visual-attention Gabor filter branch and the shallow layers of the multiple-layers candidate recommendation branch. Mode 2 is thus a part of Mode 1, used to enhance the detection of small targets on shallow layers. The insertion of new targets is determined by the visual-attention Gabor filter branch, which is described in Section 3.3.
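The switching rule between the two modes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Mode`, `select_mode` and the flag `insertion_suspected` (standing in for the decision made by the visual-attention Gabor filter branch) are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    FINE = 1     # Mode 1: both branches, all layers of the SSD-based branch
    ROUTINE = 2  # Mode 2: Gabor branch plus shallow layers only


def select_mode(frame_index: int, insertion_suspected: bool) -> Mode:
    """Choose the working mode of the offline module for one frame.

    frame_index         -- 0 for the initial input of a video
    insertion_suspected -- hypothetical flag: True when the Gabor branch
                           reports a suspect area for a newly inserted target
    """
    if frame_index == 0 or insertion_suspected:
        return Mode.FINE
    return Mode.ROUTINE
```

Under this reading, the more expensive Mode 1 runs only on the first frame and on frames where an insertion is suspected, which is where the computational saving described above would come from.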
The retina is the core of the eye's visual function; it encodes the optical signals of the visual world into series of digital pulse signals and transmits them to the brain. The retina includes five major types of neural cells: rod cells, cone cells, horizontal cells, bipolar cells and ganglion cells. The cells of the retina form a central-peripheral rivalry receptive field, which is shown in Fig. 4. Receptive fields can be divided into the center-on type (Fig. 4(a)) and the center-off type (Fig. 4(c)). The center-on type receptive field has a positive response to excitation in the center. In contrast, the center-off type receptive field has a negative response to excitation in the center. Fig. 4(b) and (d) show the variation of the response frequency with the distance from the center of the retinal receptive field. The response is strongest when the excitation boundary coincides with the boundary between the center and the edge of the receptive field.
Fig. 3. Offline candidate recommendation module.
Fig. 4. Receptive field of the retina and the response to the excitation.
The central-peripheral rivalry receptive field enhances the ability of retinal cells to detect boundaries. Riaz et al. [38,39] proposed Gabor filters to simulate the characteristics and responses of the retinal receptive field; such filters are widely used for extracting spatially local spectral features and in many pattern recognition tasks [40-45]. A traditional two-dimensional Gabor filter can be expressed as:
Fig. 5(a)-(c) shows the traditional two-dimensional Gabor filter. The traditional Gabor filter simulates only the center-on receptive field, by using a Gaussian function. In our experiments, the center-on type receptive field clearly enhanced the internal features of the target, while the center-off type clearly enhanced its edge features. To simulate the receptive field of the retina, we improve the traditional Gaussian envelope, which can be expressed as:
where $g_{On}$ and $g_{Off}$ are used to simulate the center-on and center-off types of Gaussian function. Fig. 5(d)-(f) shows the center-off type of Gaussian function, the oriented complex sinusoidal grating and the center-off type of Gabor filter.
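To make the two envelope types concrete, the sketch below builds the real part of a textbook 2-D Gabor kernel (Gaussian envelope times an oriented cosine grating) and a center-off variant. The center-off envelope used here, `1 - g_on`, is an assumption chosen only to illustrate a response that is suppressed at the center; the paper's exact expression for $g_{Off}$ is not recoverable from the text. The function name and parameter names (`sigma`, `theta`, `lam`, `gamma`, `psi`) are likewise illustrative.

```python
import math


def gabor_kernel(size=5, sigma=1.5, theta=0.0, lam=4.0, gamma=1.0,
                 psi=0.0, center_on=True):
    """Return a size x size Gabor kernel (real part) as a list of rows.

    size is assumed odd so that the kernel has a well-defined center.
    center_on=False swaps the Gaussian envelope for an inverted one
    (assumption: g_off = 1 - g_on).
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            g_on = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            envelope = g_on if center_on else 1.0 - g_on
            # Oriented sinusoidal grating (cosine = real part).
            grating = math.cos(2 * math.pi * xr / lam + psi)
            row.append(envelope * grating)
        kernel.append(row)
    return kernel


k_on = gabor_kernel(center_on=True)
k_off = gabor_kernel(center_on=False)
```

With these defaults the center-on kernel peaks at its central element while the center-off kernel is zero there, mirroring the positive and negative center responses described for Fig. 4.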
In a deep convolutional neural network, shallow layers respond to corners, textures and other edge features. In traditional detectors, these low-level features may cause false detections of small objects. In this work, we use a novel visual-attention Gabor filter branch to enhance small target proposals on shallow layers. As shown in Fig. 3, the low-level features are compressed and fed into the visual-attention Gabor filter branch, which includes an improved feature-enhancing Gabor filter bank, a target position prediction branch and a target shape prediction branch. Fig. 6 shows the compressed features and enhanced Gabor features of armored targets. The first column shows the input video frames and the second column shows the compressed VGG features of the frames. The third and fourth columns show the features enhanced by the improved center-on and center-off Gabor filters, respectively. Compared with the input features, the edge and texture features enhanced by the retina-simulating Gabor filters are more pronounced.
A smaller number of anchors that still contains all targets means fewer regression and classification computations. To avoid sliding windows and dense anchors and to save computation in candidate evaluation, we leverage spatial and temporal semantic features to predict the target position and shape at the end of the Gabor filter bank. As shown in Fig. 3, at the end of the Gabor filter bank, two branches $N_P$ and $N_S$ are used to predict the target position and shape, respectively. In branch $N_P$, the position prediction map $M_P$ can be formulated as:
where $F_E$ represents the features enhanced by the Gabor filter bank and $\omega_P$ is the set of parameters in branch $N_P$. In our work, $f_P$ is modeled as a 1×1 convolution followed by an element-wise sigmoid function. Each $M_P(i,j)$ corresponds to a location in the input image determined by $(i,j)$ and the stride $s$ of the feature map. Similarly, in branch $N_S$, two target shape prediction maps are used to predict the target area and aspect ratio, which can be expressed as:
where $f_S$ is modeled as a 1×1 convolutional layer that yields a two-channel map. To obtain a stable target width $w$ and height $h$, the following transformations are adopted:
where $s$ is the stride and $dw$ and $dh$ represent the values of the two-channel map. In branch $N_P$, a global threshold $\varepsilon_P$ is used to filter out regions where armored targets cannot exist. Fig. 7(a) shows the procedure of target bounding box prediction. After the center of the target is determined, the shape of the bounding box is predicted by Eq. (6). Fig. 7(b) shows examples of the position prediction map $M_P$ generated by branch $N_P$ and the 3-D features of $M_P$. In the position prediction map $M_P$, the 3-D features of armored targets are more prominent than the background.
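The prediction procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the cell-to-image mapping `((j + 0.5) * s, (i + 0.5) * s)` and the transform `w = s * exp(dw)`, `h = s * exp(dh)` are guesses consistent with the description of the stride $s$ and the two-channel outputs $dw$, $dh$; the paper's Eq. (6) may include further scale factors. The inputs `logits`, `dw` and `dh` stand for the raw outputs of the 1×1 convolutions, and the function name is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decode_proposals(logits, dw, dh, stride, eps_p=0.7):
    """Turn position/shape maps into (cx, cy, w, h, score) proposals.

    logits, dw, dh -- 2-D lists of equal shape (rows = i, columns = j)
    stride         -- stride s of the feature map w.r.t. the input image
    eps_p          -- global threshold on the position map M_P
    """
    proposals = []
    for i, row in enumerate(logits):
        for j, z in enumerate(row):
            score = sigmoid(z)          # M_P(i, j)
            if score < eps_p:
                continue                # region where no target can exist
            cx = (j + 0.5) * stride     # assumed cell-center mapping
            cy = (i + 0.5) * stride
            w = stride * math.exp(dw[i][j])   # assumed shape transform
            h = stride * math.exp(dh[i][j])
            proposals.append((cx, cy, w, h, score))
    return proposals


logits = [[-5.0, 3.0], [0.0, -2.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
boxes = decode_proposals(logits, zeros, zeros, stride=8)
```

Because only cells above the threshold produce a box, the number of candidates passed to later stages scales with the number of likely targets rather than with the size of the feature map, which is the saving the text attributes to avoiding dense anchors.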
Fig. 5. Center-on (a), (c) and center-off (d), (f) types of Gaussian function and Gabor filter; oriented complex sinusoidal grating (b), (e).
Fig. 6. Compressed features and enhanced Gabor features of armored targets on the ground battlefield.
Fig. 7. (a) Procedure of target bounding box prediction in the offline candidate recommendation module; (b) 3-D features of armored targets in the position prediction map.
In the detection stage, the insertion of new targets around the boundary and the disappearance of targets from the scene can be handled given a stable detection result. However, the appearance features of armored targets are polluted, and the position detected by the offline module may be inaccurate, when the targets are occluded by other targets, buildings, fire or smoke. Besides, the task of multi-armored target tracking also includes linking multiple armored targets across frames. In this work, we propose an online candidate estimation module, including a motion model and target-specific estimation CNNs, to solve these problems.
3.4.1. Target-specific estimation CNNs
An offline-trained detector often misses an occluded target or detects only part of it. Such a result clearly cannot be taken as the actual position of the target. The candidate deep features from the shared CNN are recommended by the offline module, which ignores occlusion. In this work, we propose online target-specific CNNs to estimate each candidate. We consider the ROI-pooled feature representation of the $k$-th candidate $C_k$ at frame $t$, together with the state of candidate $C_k$, which can be expressed as:
3.4.2. Motion model
The detection result of the offline module is unreliable when the appearance features of armored targets are polluted. A motion model analyzes the motion curve of the target over the history frames and predicts the position of the target in the current frame [46,47]. Most MOT applications adopt a simple linear motion model, which does not fit a fast turn, a sudden stop or a reverse drive and ignores target shape variation. To predict the accurate position and maintain the correspondence of multi-armored target labels across frames, we propose a novel motion model using temporal information. The motion model includes target position prediction and shape prediction. Given the velocity of the $(k-1)$-th candidate and that of the $k$-th candidate, the predicted center location can be expressed as:
where $O_t = [x_t, y_t]$ and $\alpha$ is the occlusion score predicted by the target-specific CNNs. The velocity of a candidate can be expressed as:
Fig. 8. The prediction results of the offline candidate recommendation module and the motion model with occlusion scores.
where $M_t$ denotes the time gap for computing velocity. Accordingly, the predicted candidate shape can be expressed as:
where $S_t = [w_t, h_t]$. Fig. 8 shows the prediction results of the offline candidate recommendation module and the motion model with occlusion scores. The blue bounding boxes are the results predicted by the motion model, and the yellow bounding boxes are those predicted by the offline candidate recommendation module. The occlusion scores are also labeled in the figure. As the armored targets become gradually occluded, the occlusion scores predicted by the online estimation module decrease gradually. The position detected by the offline module is unreliable when the appearance features of armored targets are polluted, whereas our motion model analyzes the motion curve of the armored target over the history frames and predicts the precise position of the target in the current frame.
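One way to read the role of the occlusion score $\alpha$ is as a weight between the detected position and a velocity-based extrapolation. The sketch below illustrates that reading only; the blending rule, the finite-difference velocity over a gap of `gap` frames, and all function names are assumptions, since the paper's prediction equations are not recoverable from the text.

```python
def velocity(track, gap):
    """Average per-frame displacement over the last `gap` frames.

    track -- list of (x, y) centers, oldest first; needs gap + 1 entries
    """
    (x1, y1), (x0, y0) = track[-1], track[-1 - gap]
    return ((x1 - x0) / gap, (y1 - y0) / gap)


def predict_center(track, detection, alpha, gap=1):
    """Blend the detector output with a motion extrapolation.

    alpha close to 1 -> target judged unoccluded, trust the detection.
    alpha close to 0 -> target judged occluded, trust the motion history.
    (Assumed convex combination; not the paper's exact formula.)
    """
    vx, vy = velocity(track, gap)
    px, py = track[-1][0] + vx, track[-1][1] + vy
    return (alpha * detection[0] + (1 - alpha) * px,
            alpha * detection[1] + (1 - alpha) * py)


history = [(0.0, 0.0), (2.0, 1.0), (4.0, 2.0)]
center = predict_center(history, detection=(10.0, 3.0), alpha=0.25, gap=2)
```

In this example the target has moved (2, 1) per frame, so the extrapolated center is (6, 3); with a low occlusion score of 0.25 the outlying detection at x = 10 is pulled back to x = 7. A width and height prediction could be blended the same way.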
3.5.1. Intelligent screening process for the Gabor filter bank
In the experiments, we use the ATTD dataset to train the SSD512 [31] model, which serves as the multiple-layers candidate recommendation branch. In online armored target tracking, a novel visual-attention Gabor filter branch is used to enhance small target proposals on shallow layers. Our experiments show that center-on and center-off Gabor filter banks of size 5×5 obtain better object feature sensitivity. The ratio of the two types of Gabor filter is set to 1:1. To construct reasonable Gabor parameters, a screening process based on a multi-population genetic algorithm (MPGA) [48,49] is proposed, as shown in Fig. 9.
Fig. 9. Screening process of Gabor parameters.
First, random initial parameters are set within an appropriate interval, and the two types of Gabor filter banks are constructed from these parameters using Eqs. (1)-(3); a small-scale test frame set (100 frames in total) is then constructed. The test frame set is fed into the shallow layer of the pre-trained SSD network and the resulting features are compressed. Then, the combined Gabor filter bank is convolved with the compressed features to enhance them. The obtained feature maps are converted into feature vectors through pooling and fed into a traditional SSD pre-trained with few samples to obtain the detection results for the test frames and Gabor filter banks. Next, we select the Gabor filter bank parameters with the highest detection recall rate for reproduction and generate new Gabor filter bank parameters by reproduction, crossover and mutation. By iterating this process until the termination condition is met, we obtain the optimal Gabor filter bank.
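The screening loop above can be sketched as a small multi-population genetic algorithm. Everything specific here is an assumption for illustration: `toy_fitness` is a hypothetical stand-in for "detection recall on the 100 test frames", the two-parameter encoding (a Gaussian width and a wavelength), the per-population mutation rates, the ring migration scheme and the fixed generation count are not taken from the paper.

```python
import random


def toy_fitness(params):
    # Hypothetical stand-in for the detection recall of a filter bank;
    # peaks at an arbitrary optimum (sigma=1.5, wavelength=4.0).
    sigma, lam = params
    return -((sigma - 1.5) ** 2 + (lam - 4.0) ** 2)


def screen_parameters(fitness, bounds, n_pops=3, pop_size=12,
                      generations=30, seed=0):
    """Return (best parameters, best fitness per generation)."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def crossover(a, b):
        return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

    def mutate(ind, rate):
        out = []
        for gene, (lo, hi) in zip(ind, bounds):
            if rng.random() < rate:
                step = rng.gauss(0.0, 0.1 * (hi - lo))
                gene = min(hi, max(lo, gene + step))  # stay inside bounds
            out.append(gene)
        return out

    pops = [[random_individual() for _ in range(pop_size)]
            for _ in range(n_pops)]
    rates = [0.1 + 0.2 * k for k in range(n_pops)]  # one rate per population
    history = []
    for _ in range(generations):
        for k in range(n_pops):
            # Reproduction: keep the fittest third unchanged (elitism).
            ranked = sorted(pops[k], key=fitness, reverse=True)
            elite = ranked[: pop_size // 3]
            children = [mutate(crossover(rng.choice(elite),
                                         rng.choice(elite)), rates[k])
                        for _ in range(pop_size - len(elite))]
            pops[k] = elite + children
        # Migration: each population's best replaces the worst of the next.
        bests = [max(pop, key=fitness) for pop in pops]
        for k in range(n_pops):
            target = pops[(k + 1) % n_pops]
            worst = min(range(pop_size), key=lambda i: fitness(target[i]))
            target[worst] = list(bests[k])
        history.append(max(fitness(b) for b in bests))
    best = max((ind for pop in pops for ind in pop), key=fitness)
    return best, history


best, history = screen_parameters(toy_fitness, [(0.5, 3.0), (2.0, 8.0)])
```

In the actual pipeline each fitness evaluation would involve building the filter bank, enhancing the compressed features and running the detector on the test frames, so the population sizes and generation count would be constrained by that cost.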
3.5.2. Training of the target prediction branches
In our visual-attention Gabor filter branch, two branches $N_P$ and $N_S$ are used to predict the armored target position and shape, respectively, after enhancement by the Gabor filter bank. To train the two branches, a multi-task loss is used, comprising the target position loss $L_{pos}$, the conventional classification loss $L_{cls}$ and the regression loss $L_{reg}$. The two branches are jointly optimized with the following loss:
To train the armored target position prediction branch, binary labeled maps are used as samples. In a binary labeled map, 1 represents a valid location of a target center and 0 represents an invalid location. In this work, we use ground-truth bounding boxes to generate samples. As shown in Fig. 10, let $(x_g, y_g, w_g, h_g)_n$ represent the $n$-th mapped ground-truth bounding box. The center region of the binary labeled map in this work can be expressed as:
Fig. 10. Ground-truth bounding boxes (left) and binary labeled maps (right).
The negative region is the feature map excluding the mapped ground-truth bounding boxes, which can be expressed as:
where $F$ represents the enhanced feature map. In the armored target shape prediction branch, we adopt the conventional regression loss.
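A labeled map of this kind can be generated from mapped ground-truth boxes as sketched below. The sketch rests on assumptions: the center region is taken to be the ground-truth box shrunk by a hypothetical factor `shrink`, and cells inside the box but outside the center region are marked -1 (ignored) rather than 0, a common convention that the text does not confirm. The function name is illustrative.

```python
def label_map(height, width, boxes, shrink=0.5):
    """Build a position label map for one feature map.

    boxes  -- list of (xg, yg, wg, hg): box centers and sizes, already
              mapped to feature-map coordinates
    shrink -- assumed fraction of the box that counts as center region
    Returns a height x width list of rows with 1 (center region),
    0 (negative region) and -1 (inside a box but off-center, ignored).
    """
    labels = [[0] * width for _ in range(height)]
    for xg, yg, wg, hg in boxes:
        for i in range(height):
            for j in range(width):
                # Distance from the cell center to the box center.
                dx = abs(j + 0.5 - xg)
                dy = abs(i + 0.5 - yg)
                if dx <= shrink * wg / 2 and dy <= shrink * hg / 2:
                    labels[i][j] = 1
                elif dx <= wg / 2 and dy <= hg / 2 and labels[i][j] == 0:
                    labels[i][j] = -1
    return labels


m = label_map(6, 6, [(3.0, 3.0, 4.0, 4.0)])
```

For the 4×4 box centered in a 6×6 map, the four central cells are positive, the remaining twelve cells of the box are ignored, and the twenty cells outside the box form the negative region.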
3.5.3. Training of the online candidate estimation module
In this work, the online candidate estimation module is used to estimate the occlusion and exact position of each armored target. At the initial stage of tracking, the parameters of the target-specific CNNs are random, and the networks have no estimation ability. To achieve robust performance, the online target-specific CNNs need sufficient training samples. The motion model also needs enough frames to analyze the motion curve of the armored target. Based on our experiments, we use the first $N_{init} = 0.1 N_v$ frames to train the initial online module, where $N_v$ is the number of frames in the video. For videos with fewer than 100 frames, we use the first 10 frames to complete the training of the initial online module. The online module is updated with the error back-propagation algorithm.
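The initialization rule can be written as a small helper. The function name is hypothetical, and the handling of videos shorter than 10 frames (use all of them) is an added assumption, since the text does not cover that case.

```python
def initial_training_frames(n_frames: int) -> int:
    """Number of leading frames used to train the initial online module."""
    if n_frames < 100:
        # Short videos: fixed budget of 10 frames (capped by video length,
        # which is an assumption not stated in the text).
        return min(10, n_frames)
    # N_init = 0.1 * N_v, using integer division to avoid float rounding.
    return n_frames // 10
```

For example, a 1000-frame video would use its first 100 frames for initialization, while a 50-frame video would use its first 10.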
To obtain sufficient training samples, we take the detection results of the offline candidate recommendation module as positive samples at the initial stage. Let $Sample_P^n = (x_d, y_d, w_d, h_d)_n$ denote one detection result used as a positive sample. A negative sample is constructed from the corresponding positive sample and a position offset, which can be expressed as:
where $\sigma_1$ and $\sigma_2$ are randomly selected within the intervals [0.7, 0.9] and [-0.9, -0.7], respectively.
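A possible construction of such negative samples is sketched below. How the offsets enter the formula is an assumption: here each of the horizontal and vertical shifts is drawn from one of the two intervals and scaled by the box width or height, so the negative box keeps the size of the positive one but is displaced by 70% to 90% of it. The function name is illustrative.

```python
import random


def negative_sample(positive, rng):
    """Shift a positive box (x, y, w, h) to obtain a negative sample.

    Each offset is drawn from [0.7, 0.9] or [-0.9, -0.7] (chosen at
    random) and scaled by the box size -- an assumed reading of the
    offset formula, not the paper's exact expression.
    """
    x, y, w, h = positive

    def offset():
        lo, hi = rng.choice([(0.7, 0.9), (-0.9, -0.7)])
        return rng.uniform(lo, hi)

    return (x + offset() * w, y + offset() * h, w, h)


rng = random.Random(0)
pos = (100.0, 60.0, 40.0, 20.0)
negatives = [negative_sample(pos, rng) for _ in range(100)]
```

Displacements of this size leave only a small overlap with the positive box, so the negatives contain mostly background while staying close to the target.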
After the initial stage, the online candidate estimation module is updated with the estimation results. The polluted features of corrupted samples in the bounding boxes would reduce the ability of the estimation model to separate target from background, until the candidates can no longer be evaluated. To address this conflict, we introduce a temporal attention parameter to balance the current and history samples, which can be expressed as:
For the $k$-th candidate $C_k$, the target-specific loss function at frame $t$ can be expressed as:
At present, detection datasets for general targets and vehicle targets have been published, such as VOC [50] and KITTI [51]. However, there is no ready-made multi-object tracking dataset for armored targets. To address this lack, we collected 80 video sequences of armored targets by field recording and downloading from the internet, and built a MOT dataset for armored targets called the Armored Target Tracking Dataset (ATTD). The ATTD contains a variety of complex ground battlefield scenes (such as jungle, desert, grassland and city) and complicating factors (such as armored clusters, muzzle fire, smoke and dust). In total, the videos contain 11,536 ground battlefield frames and 30,132 armored targets. All video frames are normalized to a size of 1920×770 pixels and labeled with the graphical image annotation tool LabelImg [52] in PASCAL VOC format [50]. Armored target scales in ATTD range widely, from 10×10 pixels to more than 700×700 pixels, with an emphasis on remote small armored targets. We take 50 video sequences as the training set and 30 video sequences as the testing set. Fig. 11 shows the distribution of target instance sizes in VOC2007, KITTI and ATTD. Compared with the general target and vehicle target datasets, the average target size in ATTD is significantly smaller, which is more in line with the actual needs of the ground battlefield.
Fig. 11. Distribution of target instance sizes in VOC2007, KITTI and ATTD.
In this work, a pre-trained SSD512 model is used as the backbone of the multiple-layers candidate recommendation branch. The whole framework is implemented on an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory.
To evaluate our MATT method, we adopt the recall rate to quantitatively evaluate the offline candidate recommendation module, which can be formulated as $\mathrm{Recall} = tp / (tp + fn)$,
where $tp$ refers to the number of true positives and $fn$ refers to the number of false negatives. Meanwhile, to evaluate the whole tracking method, the CLEAR MOT metrics [53] are adopted, including multiple object tracking precision (MOTP) and multiple object tracking accuracy (MOTA). The MOTP represents the total error in estimated position for matched object-hypothesis pairs over all frames, averaged by the total number of matches made, which can be expressed as:
where $fn$, $fp$ and $ids$ are the numbers of false negatives (FN), false positives (FP) and identity switches (IDS), respectively, and $g_t$ represents the number of objects present at frame $t$. The MOTA reflects all object configuration errors, including false positives, misses and mismatches, made by the multi-object tracker over all frames. Additionally, the percentages of mostly tracked targets (MT) and mostly lost targets (ML) are used as metrics in this work.
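For reference, the sketch below computes the two metrics from per-frame counts following the standard CLEAR MOT definitions [53]: MOTP is the summed matching error divided by the total number of matches, and MOTA is one minus the summed misses, false positives and identity switches divided by the total number of ground-truth objects. It is assumed, not confirmed by the text, that the paper uses these definitions unchanged; since MOTP is reported here as a percentage, the per-match quantity may be an overlap ratio rather than a distance. Function and argument names are illustrative.

```python
def motp(match_errors, match_counts):
    """MOTP = sum_t(error over matches at frame t) / sum_t(c_t).

    match_errors -- per-frame sum of errors (or overlaps) of matched pairs
    match_counts -- per-frame number of matches c_t
    """
    return sum(match_errors) / sum(match_counts)


def mota(fn, fp, ids, gt):
    """MOTA = 1 - sum_t(fn_t + fp_t + ids_t) / sum_t(g_t).

    fn, fp, ids -- per-frame false negatives, false positives, ID switches
    gt          -- per-frame number of ground-truth objects g_t
    """
    errors = sum(fn) + sum(fp) + sum(ids)
    return 1.0 - errors / sum(gt)


example_mota = mota(fn=[1, 0], fp=[0, 1], ids=[0, 0], gt=[5, 5])
example_motp = motp(match_errors=[1.0, 2.0], match_counts=[2, 2])
```

In the example, two errors over ten ground-truth objects give a MOTA of 0.8, and a total error of 3.0 over four matches gives a MOTP of 0.75.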
Fig. 12. The MOTA with different global thresholds and classification thresholds.
In our multi-armored target tracking method, a global threshold $\varepsilon_P$ is used in the offline module to filter out regions where armored targets cannot exist. Meanwhile, a classification threshold $\varepsilon_o$ is used to estimate occlusion for each candidate. To select appropriate parameters, we train our tracking framework on a small training set with varying parameter values. In particular, we randomly select 10 video sequences from the ATTD dataset; half of the frames of each video are used as training samples and the other half as test samples. We use MOTA as the joint performance indicator for $\varepsilon_P$ and $\varepsilon_o$. Fig. 12 shows the MOTA of our MATT method with different global thresholds $\varepsilon_P$ and classification thresholds $\varepsilon_o$. The result demonstrates that the algorithm achieves the highest MOTA, 84.2%, on the selected samples when $\varepsilon_P = 0.7$ and $\varepsilon_o = 0.6$. Hence, the following experiments use these values.
To address computationally expensive dense anchors and the false detections caused by low-level features, we propose a novel offline candidate recommendation module. The offline module includes two branches: a multiple-layers candidate recommendation branch and a novel visual-attention Gabor filter branch. The former generates general object proposals based on the SSD framework. The latter is used to enhance small target proposals on shallow layers. The two working modes of the offline module are:
Mode 1: Multiple-layers candidate recommendation branch + visual-attention Gabor filter branch.
Mode 2: Shallow layers of SSD + visual-attention Gabor filter branch.
To demonstrate the function of each part of our offline module, we compare the detection recall rates of the above modes with that of the following additional mode.
Mode 3: The SSD-based multiple-layers candidate recommendation branch only.
The experimental results are shown in Table 1. The result is divided into three different subsets: small armored targets (size < 32×32), medium armored targets (32×32
Table 1 The detection recall rate of different modes of the offline candidate recommendation module.
To demonstrate the effectiveness of our MATT method, we compare our framework with several state-of-the-art approaches on the ATTD dataset and the KITTI dataset. The compared methods include offline tracking methods such as SiameseCNN [54], CNNTCM [55], DCO-X [56] and LP-SSVM [57], and online tracking methods such as NOMTHM [58], STAM [28], SSP [59] and MOTBeyondPixels [60].
Results on ATTD Dataset. The comparison results on the ATTD tracking testing set are summarized in Table 2. Our MATT method achieves the best MOTA, MOTP and MT, of 85.65%, 87.55% and 80.52%, respectively. Our method achieves the second-best ML, which is 0.5% above the best. The higher MOTA and MT indicate that our offline module has a higher detection ability. The best MOTP indicates that the motion model can accurately predict the motion trajectory. The low ML indicates that our approach produces few false negatives. The shortest tracking time indicates that the two modes of our offline candidate recommendation module not only meet the recall requirement but also improve tracking efficiency. The overall running time can meet actual requirements.
Table 2 Comparison with state-of-the-art methods on ATTD tracking testing set.
Table 3 Comparison with state-of-the-art methods on KITTI tracking testing set.
Results on KITTI Dataset. To verify our method on conventional vehicle targets, we run our MATT method on the KITTI dataset. The comparison results are summarized in Table 3. Our method achieves the best MOTA, MT and ML. Compared with the second best, our method obtains increases of 5.41% and 7.35% in MOTA and MT, respectively, and a 0.52% decrease in ML. Our method achieves the second-best MOTP, of 85.55%. These results show that our method remains effective for conventional target tracking. Fig. 13 shows some qualitative results of our MATT method on the ATTD dataset. To make the tracking process more apparent, two sequential frames are merged into one image. The object trajectories are also shown in the figure.
Fig. 13. Experimental results on the ATTD dataset. The left column shows the detection results; the previous position is denoted with a blue rectangle and the current position with a red one. The right column shows the tracking results of our MATT method and the target trajectories.
Multi-Armored Target Tracking (MATT) faces the problems of a complicated environment, small target size, occlusion and insertion, as well as the absence of a ready-made multi-object tracking dataset for armored targets. In this work, we proposed a visual-attention Gabor filter based online multi-armored target tracking method and a dedicated MOT dataset, named the Armored Target Tracking Dataset (ATTD). To address computationally expensive anchors and the false detections caused by shallow layers, a novel offline candidate recommendation module works in two modes and uses a visual-attention Gabor filter branch to enhance the features of shallow layers. To address the inaccurate positions detected by the offline module when an armored target is occluded, we proposed an online candidate estimation module including a motion model and target-specific estimation CNNs. The target-specific CNNs assign each candidate an occlusion score, and the motion model uses temporal information to predict the precise position. Experimental results show that our method improves MOTA, MOTP and MT and reduces ML relative to the compared methods. The overall running time indicates that our approach can meet actual requirements.
Funding
This work was supported by the National Key Research and Development Program of China (No. 2016YFC0802904), the National Natural Science Foundation of China (No. 61671470), the Natural Science Foundation of Jiangsu Province (BK20161470), and the 62nd batch of funded projects of the China Postdoctoral Science Foundation (No. 2017M623423).
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.