Ji-qing Luo, Hu-sheng Fang, Fa-ming Shao, Yue Zhong, Xia Hua
Department of Mechanical Engineering, College of Field Engineering, Army Engineering University of PLA, Nanjing, 210007, China
Keywords: Neural architecture search; Feature enrichment; Faster R-CNN; Retinex-based image adaptive correction algorithm; K-means; UA-DETRAC
ABSTRACT It is well known that vehicle detection is an important component of object detection. However, real-world vehicle detection environments are particularly complex. Detecting vehicles of various scales in traffic scene images is comparatively difficult because vehicles are often partially obscured by green belts, roadblocks or other vehicles, and because of the influence of low-illumination weather. In this paper, we present a model based on Faster R-CNN with NAS optimization and feature enrichment to detect multi-scale vehicle targets in traffic scenes effectively. First, we propose a Retinex-based image adaptive correction algorithm (RIAC) to enhance the traffic images in the dataset, reducing the influence of shadow and illumination and improving image quality. Second, to improve the feature expression of the backbone network, we conduct Neural Architecture Search (NAS) on the Faster R-CNN backbone used for feature extraction to generate the optimal cross-layer connection, so that multi-layer features are extracted more effectively. Third, we use object feature enrichment, which combines the multi-layer feature information with the context information of the last layer after cross-layer connection, to enrich the information of vehicle targets and improve the robustness of the model for challenging targets such as small-scale and severely occluded vehicles. In the implementation, the K-means clustering algorithm is used to select suitable anchor sizes for our dataset and improve the convergence speed of the model. Our model has been trained and tested on the UA-DETRAC dataset, and the results indicate that our method achieves state-of-the-art detection performance.
Vehicle detection has always been an active topic in the field of object detection. It has long been a focus of both military and civilian domains, such as intelligent cities [1], automatic driving [2], intelligent transportation [3] and military investigation [4]. As vehicles become more and more common in modern life, the monitoring and tracking of traffic vehicles is receiving increasing attention from city managers. In traffic scenarios, the location and type of vehicles greatly affect the whole traffic environment, so accurate detection of traffic vehicles is conducive to urban traffic management.
With deep convolutional neural networks (CNNs), which have strong feature learning abilities, now widely used in computer vision, vehicle detection technology has developed greatly. However, problems remain in practical applications, especially in actual traffic scenarios. Most traditional methods only consider the identification of a single vehicle, whereas complex traffic or battlefield environments contain many vehicle targets. In real traffic images, vehicles appear at a variety of scales, from small to large, as their distance from the camera changes. In addition, because of vehicle occlusion, battlefield camouflage or bad weather, the shape, contour and color information obtained for a vehicle is often incomplete, which remains a great challenge for vehicle identification. As illustrated in Fig. 1, the difficulties faced by traffic vehicle detection include the following aspects:
Fig. 1. Examples of challenging objects for vehicle detection in traffic scenarios: (a) multi-scale and multi-target vehicles in real-time traffic, (b) partially obscured vehicles, (c) vehicles shaded by trees, (d) motion-blurred vehicles.
• Vehicle detection in complex traffic scenes is in fact a multi-target, multi-scale object detection task, which is extraordinarily challenging for deep models themselves.
• During image collection, some vehicles are inevitably obscured by other objects on the road, such as green belts, roadblocks, power poles, other vehicles or even camouflage. It is therefore extremely important to be able to detect a vehicle from the partial vehicle information available in the image.
• Current vehicle detection systems, especially surveillance-based ones, place some restrictions on light intensity when vehicle images are collected, but the images obtained in actual recognition often do not meet these lighting conditions.
• Traffic scene images usually capture moving vehicles, so the vehicle targets may be motion-blurred. Moreover, vehicle images taken in foggy or rainy weather are also blurred to some degree, which increases the difficulty of vehicle detection.
In this paper, we propose a model based on Faster R-CNN with Neural Architecture Search (NAS) optimization and feature enrichment for vehicle detection in real traffic scenarios. The detection and classification accuracy of our method exceeds that of many advanced methods. The main contributions of our approach are as follows:
• To reduce the impact of complex illumination on detection, we use a Retinex-based image adaptive correction algorithm (RIAC) to preprocess and enhance the images in the dataset, obtaining higher-quality images.
• We conduct Neural Architecture Search (NAS) on the Faster R-CNN backbone network to generate the optimal cross-layer connection for multi-scale feature extraction, improving the detection performance for multi-scale objects.
• For objects with small scale or severe occlusion, we use feature enrichment, which integrates the multi-layer feature information with the context information of the last layer to enrich the vehicle information available for detection.
The rest of our paper is organized as follows: Section 2 introduces the related work. Section 3 gives an overview of our method. The details of our traffic vehicle detection method are presented in Section 4. Section 5 discusses the experimental evaluation. Finally, we present some conclusions and identify future work.
As an important branch of object detection, vehicle detection has been studied for decades, and some methods have been applied successfully in civil and military fields. Traditional vehicle detection methods extract features based on Haar, LBP, SIFT and others, and use the corresponding descriptors to represent these features. A shallow classifier is then used for classification [5-8]. In Ref. [5], Haar features with an AdaBoost classifier were used for vehicle detection and recognition, which improved recognition performance and reduced training time. The work in Ref. [6] took shape and texture features into account by applying the HOG-LBP feature to detect vehicles, which significantly reduced the number of false positives. The SIFT algorithm is robust to changes in target scale and direction, and tolerant of changes in illumination and noise. Ref. [7] therefore applied SIFT features to the classification of moving vehicles and achieved good results. In Ref. [8], objects were divided into vehicle and non-vehicle classes using the SIFT-HOG feature and an SVM classifier, with very good detection results. Generally speaking, these methods have made some progress in vehicle detection, but extracting and using handcrafted features requires a great deal of prior knowledge, which remains a challenge.
Deep learning can extract features automatically, so it offers powerful image abstraction and automatic high-level feature representation. In recent years, thanks to its strong learning ability and portability, deep learning has been favored by most researchers and has been widely applied and developed. The detection performance of deep learning, especially of convolutional neural network (CNN) models, has improved greatly. CNN is a data-driven approach, and training a new network from scratch requires massive amounts of data. At present, the publicly available multi-scale traffic vehicle datasets are much smaller than ImageNet [9] and cannot meet the requirements for adequately training new deep networks. However, convolutional neural networks pre-trained on ImageNet, such as VggNet [10], GoogleNet [11] and ResNet [12], have strong generalization ability. Compared with traditional feature acquisition methods, a pre-trained CNN can extract much more representative features. With continued development, excellent object detection networks such as Faster R-CNN [13], YOLO [14] and SSD [15] have emerged and been applied successfully to vehicle detection through appropriate modification of parameters and algorithms [16-18]. For example, Quanfu Fan et al. [16] tuned and modified the network for specific applications and gained a significant improvement over the default performance of Faster R-CNN for vehicle detection on the KITTI dataset. Jun Sang et al. [17] improved the feature extraction ability of YOLO, which not only improved detection performance on a specific vehicle dataset but also generalized well to other vehicle datasets. Zhang, F. et al. [18] improved the SSD model to enable accurate vehicle detection in complex scenes.
However, most of the above networks use single-scale deep features (a single convolution layer, etc.); that is, each proposal is based only on the high-level features of the target, so it is difficult to improve classification performance in complex scenes. At present, multi-scale features are widely used in high-level object recognition. Details such as the edges and texture of the target exist in the low-level features. In identification tasks, multi-level feature fusion helps to identify small targets, which is exactly what vehicle recognition in real traffic images needs. Feature Pyramid Network (FPN) [19] is a commonly used multi-layer feature fusion method, and many networks with high detection accuracy build on its multi-scale expression ability. Shortly after FPN appeared, it was combined with Faster R-CNN to produce a new network named Mask R-CNN [20], with extremely high accuracy for target detection and recognition as well as instance segmentation. Similarly, RetinaNet [21] combines its ResNet backbone with FPN to greatly improve the accuracy of one-stage networks while keeping detection fast. However, using FPN as a top-down cross-layer connection can cause the network to pay too much attention to the optimization of low-level features, which sometimes reduces the detection accuracy for large-scale targets. PANet [22] therefore adds a bottom-up path to the top-down structure of FPN, which carries edges, texture and other object details from the lower layers to the higher layers, further increasing the low-level information in the higher layers. More and more attention is now being paid to cross-layer connections in network models. In addition to FPN, many excellent cross-layer connection networks have emerged. Libra R-CNN [23], based on FPN, rescales the features from all layers to the middle layer by interpolation or max pooling for fusion and refinement, and finally adds the result back to the original features to enhance them. ZigZagNet [24] iteratively fuses feature maps between top-down and bottom-up networks to aggregate multi-scale context information. From PANet to ZigZagNet, and from strategically connecting certain layers to densely connecting feature layers, designing cross-layer connections requires enormous prior knowledge, which makes it extremely challenging to select the connections that produce optimal detection results for a given class of targets. Neural Architecture Search (NAS) [25] has emerged as a way to adjust such design parameters automatically. NAS adjusts not the training hyperparameters but the network architecture hyperparameters: how many layers the network has, which operators each layer uses, the kernel size in each convolution layer, and so on. NAS-FPN [26], built on the RetinaNet framework, uses accuracy as the reward for updating a recurrent neural network (RNN) controller and balances accuracy against speed over repeated searches until the best applicable cross-layer connected network is found. Following the idea of NAS-FPN, we apply NAS to optimize the feature extraction backbone of the classical Faster R-CNN, generating the optimal cross-layer connection network, and then use the multi-layer features effectively through feature enrichment to achieve more accurate detection of traffic vehicles.
Our optimized Faster R-CNN traffic vehicle detection method is illustrated in Fig. 2. First, we use the Retinex-based image adaptive correction (RIAC) algorithm to preprocess and enhance the images of the dataset, obtaining higher-quality images. Next, features are extracted from the enhanced image by our optimized Faster R-CNN backbone network. Finally, we use object feature enrichment to combine multi-layer information and context information for detection. Our network structure is divided into three parts. The first part is the feature extraction network that we propose. We use ResNet-101 as the backbone network to extract multi-level features, and use the optimal cross-layer strategy generated by NAS to obtain the final multi-scale feature expression. The second part is the improved R-CNN subnetwork. The feature enrichment strategy makes comprehensive use of the multi-level features and the context information of the last layer: the region proposals of the multi-scale feature layers and of the context feature layer are converted into RoI features of the same scale through RoI Align. We then obtain the final RoI feature from these RoIs through concatenation, scale adjustment and dimension reduction. The third part is the network head. The final classification score and bounding box are obtained by passing the generated RoI feature through two fully connected layers, achieving high-precision detection of vehicles in traffic scenes.
Detection tasks in computer vision are usually carried out on high-quality images, but images captured in real scenes may not meet this requirement. In real traffic scenarios, uneven outdoor illumination or shadows cast by tall buildings or trees can result in poor image quality, as shown in Fig. 3(a), (b). If the illumination is too strong or too weak, the vehicle information cannot be highlighted well, and shadows in the environment may even cause misjudgment, which affects the accuracy of traffic vehicle detection. We therefore use the Retinex-based image adaptive correction (RIAC) algorithm to enhance the images in the dataset, thereby improving image quality.
Retinex theory holds that an image can be expressed as the product of an ambient brightness function and an object reflection function [27], as follows:
$$S(x,y) = L(x,y)\cdot R(x,y) \tag{1}$$
where S(x,y) is the image to be enhanced, L(x,y) is the ambient brightness function, R(x,y) is the object reflection function, and (x,y) are the coordinates of an image pixel. When image enhancement is performed, if L(x,y) can be estimated from the image S(x,y), then the model can recover the enhanced image R(x,y).
Fig. 2. The overall framework of our model.
Fig. 3. Comparison of the raw and enhanced images. Comparing (a) with (c) and (b) with (d) shows that the image quality is improved and the details of the vehicle targets are enhanced.
The basic idea of Retinex image enhancement is that if the illumination changes slowly across the image, we can assume that the ambient brightness function is the low-frequency part of the image. Because the shape and reflectivity of object surfaces differ, the object reflection function is the high-frequency part of the image, that is, the part that carries image details such as texture and edges [28]. We can therefore treat the ambient brightness function as multiplicative noise and convert it into additive noise through a log-domain transformation. A filter is then applied to the image to separate out the illumination information carried by the ambient brightness function, reducing or even removing its influence while retaining the details of the image, which yields the enhanced image. Unlike traditional linear and nonlinear methods, which can only enhance one particular kind of image feature, Retinex balances dynamic range compression, edge enhancement and color constancy, so it can adaptively enhance different types of images. Images in datasets or captured by cameras are generally RGB images. However, when the three channels of an RGB image are processed directly, it is difficult to ensure that each channel is enhanced in the same proportion, which causes halation and similar artifacts and results in color distortion. Moreover, enhancing all three channels is computationally expensive [29]. We therefore transform the RGB image into the HSV color space, which better reflects human color perception, and then process the image [30]. Retinex theory also applies to the luminosity channel. We transform the V channel of the HSV space [31] into the logarithmic domain and apply the multi-scale Retinex (MSR) algorithm to extract R(x,y). Generally, we select three scales: high, medium and low. Their ambient brightness functions are as follows:
Then the output image (or object reflection function) is defined as:
where N is the number of Gaussian functions, here 3, * denotes the convolution operation, and G_i(x,y) is the Gaussian function at the i-th scale, whose mathematical expression is defined as follows:
where c represents the Gaussian surround scale, which is selected according to the image size in actual processing, and λ is the normalization factor, whose value satisfies the following condition:
where ω_i is the weight corresponding to the i-th scale. Because a large-scale Gaussian filter maintains good hue but blurs the details of the image, while a small-scale Gaussian filter preserves edge information well, the enhancement achieved by MSR depends on the choice of weights. To balance brightness and detail preservation during image enhancement, our implementation sets each of the three weights ω_i to 1/3.
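The multi-scale Retinex step described above can be sketched in a few lines. This is a minimal pure-Python illustration on a small luminance grid, assuming the usual MSR form R = Σ ω_i [log S − log(S * G_i)] with G_i(x,y) = λ·exp(−(x²+y²)/c_i²) and λ chosen so that each kernel sums to 1. The function names, the surround scales (1, 2, 4), the kernel radius and the replicated-border handling are illustrative assumptions, not values taken from the paper.

```python
import math

def gaussian_kernel(c, radius):
    """Gaussian surround lam * exp(-(x^2 + y^2) / c^2), normalised to sum to 1."""
    raw = [[math.exp(-(x * x + y * y) / (c * c))
            for x in range(-radius, radius + 1)]
           for y in range(-radius, radius + 1)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]

def convolve(img, kernel):
    """2-D convolution with replicated borders (an assumption of this sketch)."""
    h, w = len(img), len(img[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx] * kernel[dy + r][dx + r]
            out[y][x] = acc
    return out

def multi_scale_retinex(v, scales=(1.0, 2.0, 4.0), radius=3):
    """Log-domain reflectance estimate for a luminance channel v (all values > 0)."""
    weight = 1.0 / len(scales)  # equal weights: 1/3 each for three scales
    h, w = len(v), len(v[0])
    out = [[0.0] * w for _ in range(h)]
    for c in scales:
        blurred = convolve(v, gaussian_kernel(c, radius))  # illumination estimate
        for y in range(h):
            for x in range(w):
                out[y][x] += weight * (math.log(v[y][x]) - math.log(blurred[y][x]))
    return out
```

On a uniformly lit grid the estimate is zero everywhere, since the blurred image equals the input; a pixel brighter than its surroundings receives a positive value. A practical implementation would use a vectorised image library instead of nested loops.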
In most cases, the multi-scale Retinex algorithm makes the HSV image brighter during enhancement, which is beneficial for low-illumination or shadowed images but sometimes causes over-enhancement of normally illuminated images. To make our algorithm applicable to the entire dataset, we use a two-dimensional gamma function [32] to adjust the illumination adaptively, reducing the brightness of over-illuminated areas and increasing the brightness of under-illuminated areas. The correction process is as follows:
where γ is the luminance adjustment parameter, α is taken as 0.5, m represents the median of the incident component, and O(x,y) represents the luminance value of the corrected image. Finally, we merge the corrected brightness component back into the HSV image and map the result to the RGB color space to obtain the enhanced image.
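The adaptive correction can be illustrated with a short sketch. The correction formula itself is not reproduced in the text, so the code below assumes the commonly used two-dimensional gamma form O = 255·(F/255)^γ with γ = α^((m − I)/m), where F is the input luminance and I the estimated illumination at the same pixel. That form, the 0 to 255 value range and the function name `gamma_correct` are assumptions made for illustration only.

```python
import statistics

def gamma_correct(v, illum, alpha=0.5):
    """Adaptive gamma correction of a luminance channel.

    v:     luminance values in [0, 255], as a list of rows
    illum: estimated illumination (incident component), same shape, values > 0
    Assumed form: O = 255 * (F / 255) ** gamma, gamma = alpha ** ((m - I) / m).
    """
    m = statistics.median(p for row in illum for p in row)  # median illumination
    out = []
    for v_row, l_row in zip(v, illum):
        row = []
        for f, l in zip(v_row, l_row):
            gamma = alpha ** ((m - l) / m)  # < 1 brightens, > 1 darkens
            row.append(255.0 * (f / 255.0) ** gamma)
        out.append(row)
    return out
```

Under this assumed form, pixels whose illumination is below the median get γ < 1 and are brightened, pixels above the median get γ > 1 and are darkened, and pixels at the median are left unchanged.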
As shown in Fig. 3(b), (d), the details of the vehicle target in the low-light upper left corner of the image are significantly improved compared with the original image, while the well-lit parts of the image are left essentially unchanged. In addition, because the two-dimensional gamma function effectively balances the brightness of the image, the clarity of the shaded vehicle target is also improved, as shown in Fig. 3(a), (c).
Traditional Faster R-CNN obtains the regions of interest from the final feature layer generated by the backbone network, and then obtains the detection result [9]. Although the final feature layer represents the more global, high-level features of the image, it does not represent small targets or fine-grained features well. In real-time traffic vehicle detection, vehicles appear at various scales, from small to large, as they move from far to near in the field of view. Extracting features directly from the Faster R-CNN backbone network will ignore or under-represent the features of small-scale vehicles in the image. In feature extraction, high-level features with large receptive fields are used to detect large objects, and low-level features with more detail are used to detect small objects. High-level features undergo heavy downsampling, which loses some details. Low-level features preserve resolution but lack global semantic information, which ultimately leads to inaccurate classification or localization. Cross-layer connections between different layers retain both high-level semantic information and high-resolution features, which has a significant impact on detection performance. However, designing cross-layer connections requires extensive prior experience, and the design space is large. The number of possible connections between different scales grows exponentially with the number of network layers, so exploring them takes a great deal of time and the optimal scheme is difficult to determine manually [26]. We therefore borrow the idea of automatic NAS search over classification network architectures and define a corresponding search space [33]. Faster R-CNN then uses the optimal cross-layer connection network so that the different feature layers complement one another and more comprehensive image information is obtained. The resulting optimal cross-layer connection improves the feature expression ability of the backbone network, and ultimately improves the performance of traffic vehicle detection.
The optimized cross-layer connection structure of the backbone network is searched by an RNN controller, and the search space is shown in Fig. 4. Our search space consists of a series of cross-layer connections, such as the fusion connection of C1 and C2, the fusion connection of C1 and C3, and so on. The RNN controller uses the accuracy of the sub-models in the search space, that is, the accuracy of the different cross-layer connection structures, as the reward signal to update the architecture hyperparameters, namely which two layers each cross-layer connection joins. Through repeated trials, the RNN controller thus gradually learns how to generate better feature extraction architectures.
Network models are constructed block by block in a recurrent manner: the output of each block serves as the input of the next, and the number of blocks used in actual training is fixed in advance. We use a candidate set composed of feature layers at five different scales, {C2, C3, C4, C5, C6}, as input. Because layer C1 occupies a large amount of memory and has insufficient feature representation ability, we do not put it into the candidate set; layer C6 is obtained from layer C5 by max pooling. Pi denotes a fused feature layer. In the search space, each block is connected across layers by a fusion unit, whose structure is shown in Fig. 5. The two input feature layers are each passed through a 1×1 convolution that changes the channel number to 256, so that they can subsequently be added. One of the feature layers, Pm, determines the feature resolution after fusion, and in turn determines whether the other feature layer, Pn, is upsampled or max-pooled. Inspired by the DBL block of [34], a 3×3 convolution layer, a Batch Normalization layer and a ReLU layer are used to make the fused features more robust while improving the convergence of the model. Finally, the feature layer with the desired resolution is obtained.
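The resolution-matching and addition at the heart of the fusion unit can be sketched as follows. This is a simplified single-channel illustration: the learned 1×1 and 3×3 convolutions, Batch Normalization and ReLU are omitted, feature maps are assumed square with sizes related by powers of two, and nearest-neighbour upsampling is an assumed choice. The names `resample` and `fuse` are hypothetical.

```python
def upsample2x(fm):
    """Nearest-neighbour upsampling: every value becomes a 2x2 block."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def max_pool2x(fm):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    return [[max(fm[y][x], fm[y][x + 1], fm[y + 1][x], fm[y + 1][x + 1])
             for x in range(0, len(fm[0]), 2)]
            for y in range(0, len(fm), 2)]

def resample(fm, size):
    """Bring a square feature map to size x size by repeated 2x steps."""
    while len(fm) < size:
        fm = upsample2x(fm)
    while len(fm) > size:
        fm = max_pool2x(fm)
    if len(fm) != size:
        raise ValueError("sizes must differ by a power of two")
    return fm

def fuse(pm, pn):
    """Fuse two layers: pm fixes the output resolution, pn is resampled to match."""
    pn = resample(pn, len(pm))
    return [[a + b for a, b in zip(row_m, row_n)] for row_m, row_n in zip(pm, pn)]
```

A coarser Pn is upsampled before the element-wise addition and a finer Pn is max-pooled, which mirrors the role that Pm plays in setting the output resolution.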
Fig. 4. Schematic diagram of the search space.
Fig. 5. Structure diagram of a single fusion unit.
In our search space, the RNN controller uses three different Softmax classifiers to make three predictions for each cross-layer connected block. The RNN controller structure is shown in Fig. 6. The specific steps are as follows: (1) select one feature layer from the candidate set formed by the input feature layers as the first input for feature fusion; (2) select another feature layer from the candidate set as the second input for feature fusion (it must differ from the first, i.e. selection is without replacement); (3) determine the output resolution. The new feature layer produced by the fusion unit then replaces the feature layer with the same resolution in the candidate set and serves as an input for subsequent blocks. After N fusions, once the feature layer at every resolution in the candidate set has been replaced so that the whole set consists of new layers, or once a given accuracy is reached, the search stops and the five final output feature layers {P2, P3, P4, P5, P6} are obtained.
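The three prediction steps and the candidate-set update can be sketched as a simple sampling loop. In this illustration a seeded random sampler stands in for the RNN controller and its Softmax classifiers, no accuracy reward is computed, and only the "every layer replaced" stopping rule is modelled. The tuple encoding of layers and the function names are hypothetical.

```python
import random

def sample_block(candidates, rng):
    """One block: three decisions mirroring steps (1)-(3)."""
    first = rng.choice(candidates)                               # step (1)
    second = rng.choice([c for c in candidates if c != first])   # step (2): no repeat
    levels = sorted({level for _, level in candidates})
    out_level = rng.choice(levels)                               # step (3)
    return first, second, out_level

def sample_architecture(levels=(2, 3, 4, 5, 6), seed=0):
    """Sample blocks until every C layer has been replaced by a fused P layer."""
    rng = random.Random(seed)
    candidates = [("C", level) for level in levels]
    blocks = []
    while any(kind == "C" for kind, _ in candidates):
        first, second, out_level = sample_block(candidates, rng)
        blocks.append((first, second, out_level))
        # The fused layer replaces the candidate with the same resolution.
        candidates = [("P", level) if level == out_level else (kind, level)
                      for kind, level in candidates]
    return blocks, candidates
```

Each entry of `blocks` records which two layers were fused and at which level the result was written back. In the actual search, the controller's sampling probabilities would be updated from the detection accuracy of each sampled architecture.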
Fig. 6. Structure of the RNN controller for cyclically constructing a cross-layer connection unit.
Here, our RNN controller is trained with a reinforcement learning algorithm. Our cross-layer connection only concerns the backbone part of the network used for feature extraction. During training, we therefore add a network head composed of a classifier and a bounding box regressor after every intermediate block. The model does not need to complete forward propagation through all block sub-networks during recursive inference; it can stop at the output of any sub-network and produce a detection accuracy. The parameters of the controller RNN are optimized to maximize the expected validation accuracy of the inferred network structure. In this way, a new network that can effectively extract multi-scale image features is established and used for the subsequent training.
Humans can detect and identify targets by capturing the interactions between different objects and between objects and scenes. This enables the human visual system to recognize a large number of targets quickly, even when the targets and background are complex, and makes it remarkably tolerant of variations in texture, distortion and occlusion of the target [35]. In machine learning, the performance of a network can be improved greatly by imitating human perception and applying context information. At present, many advanced methods use only the texture, outline and similar information of the object itself for detection. However, this object-specific information is not enough to detect challenging targets accurately, such as those that are low-resolution, small-scale or severely obscured. In multi-scale traffic vehicle detection, the environment of the target is complex. There are many small-scale vehicle targets, and the surrounding environment (green belts, road barriers, power poles, etc.) and other vehicles sometimes obscure the vehicles that we want to detect. Traditional region detection based on a single target cannot achieve good results. The contextual information generated by the interaction between the vehicle and the road environment can effectively improve detection performance.
Because the vehicle target has no fixed structural relationship with its surroundings in a traffic scenario, the commonly used context information [36] above, below, left and right of the region of interest is complex to handle and does not play a key role. Instead, starting from the original region, we use the context information at an inner and an outer scale to enhance the feature representation; this simple operation captures the context information of the vehicle target. The region of interest in an image is represented by its corner coordinates as (x1, y1, x2, y2), and the inner-scale and outer-scale regions are denoted by RI and RO, respectively. These two context RoIs are defined as follows:
where w is the width of the RoI and h is the height of the RoI. ε denotes the scale factor, with ε1 = 0.8 and ε2 = 1.2. In line with human visual habits, for a target that shows only local information, the attributes of the target can be inferred from the salient part of its neighborhood and its connection with the external environment. As shown in Fig. 7, for partially occluded vehicles, we add the internal and external context information to the information of the original bounding box to improve the inference ability of the model.
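The inner and outer context regions can be generated with a short helper. The defining equations are not reproduced in the text, so this sketch assumes that RI and RO share the centre of the original RoI and scale its width and height by ε1 and ε2. The centre-preserving form, the optional clipping to the image and the function names are assumptions made for illustration.

```python
def scale_roi(roi, eps):
    """Scale a box (x1, y1, x2, y2) about its centre by the factor eps."""
    x1, y1, x2, y2 = roi
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return (cx - eps * w / 2.0, cy - eps * h / 2.0,
            cx + eps * w / 2.0, cy + eps * h / 2.0)

def clip_roi(roi, width, height):
    """Clip a box to the image bounds."""
    x1, y1, x2, y2 = roi
    return (min(max(x1, 0), width), min(max(y1, 0), height),
            min(max(x2, 0), width), min(max(y2, 0), height))

def context_rois(roi, eps_inner=0.8, eps_outer=1.2, image_size=None):
    """Return (R_I, R_O) for a RoI; image_size is an optional (width, height)."""
    inner = scale_roi(roi, eps_inner)
    outer = scale_roi(roi, eps_outer)
    if image_size is not None:
        inner = clip_roi(inner, *image_size)
        outer = clip_roi(outer, *image_size)
    return inner, outer
```

For a 100×100 RoI, these assumptions give an 80×80 inner region and a 120×120 outer region around the same centre. The same helper could be used to derive context annotations from ground-truth boxes.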
Fig. 7. Feature combination of the original RoI and the internal and external context RoIs. The red bounding box represents the ground truth, and the green and blue bounding boxes represent RI and RO, respectively. The distinctive part of the vehicle can be found through RI, while the relationship between the vehicle and its environment can be found through RO.
In context feature extraction, we carry out RoI Pooling on the RoIs at the three scales to obtain fixed-scale features. After L2 normalization, concatenation and 1×1 convolution for dimension reduction of these three RoIs, we obtain a feature map that carries comprehensive object context information, named the context RoI feature. The procedure for forming context RoI features is shown in Fig. 8.
After optimizing the cross-layer connection through neural architecture search, we obtain more reliable and effective multi-layer features for classification and recognition. Given these multi-layer features, the traditional method [20] uses an empirical formula to determine, from the size of an RoI, the feature layer to which it is assigned for the next step. A large-scale RoI corresponds to a low-resolution feature layer, while a small-scale RoI corresponds to a high-resolution feature layer. However, when an RoI is assigned to a single feature layer, the other, overlooked layers also contain rich semantic information that can influence subsequent classification and bounding box regression to some extent [22]. If the features extracted from different convolution layers are fused so that they complement one another, more comprehensive image information can be obtained and better results can be achieved.
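For reference, the single-layer assignment that this paragraph contrasts with can be sketched as below. The empirical formula is not reproduced in the text; the sketch assumes it is the level-assignment rule introduced with FPN, k = ⌊k0 + log2(√(w·h)/224)⌋ with k0 = 4, clamped to the available levels. The function name and default arguments are illustrative.

```python
import math

def assign_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Pick the pyramid level P_k for a RoI of width w and height h (assumed FPN rule)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))  # clamp to the levels that exist
```

Under this rule a 224×224 RoI maps to P4, a 112×112 RoI to P3, and very small RoIs are clamped to P2, so each RoI reads features from exactly one layer. Feature enrichment instead draws on several layers for every RoI.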
There are many vehicle targets in a traffic scene. Most of them carry little information because of small scale or occlusion, so the information available to the model is weak and the detection accuracy is low. To improve the detection accuracy for vehicle targets in traffic scenes, we must therefore make better use of the extracted target features. In this paper, we propose object feature enrichment, which combines multi-layer features and context information to overcome the influence of the complex environment and achieve multi-scale vehicle recognition. The proposed feature enrichment structure is shown in Fig. 9. The scales of the feature layers {P2, P3, P4, P5} extracted from the NAS-optimized Faster R-CNN backbone network are 56×56, 28×28, 14×14 and 7×7. We obtain the context RoI features from the {P5} layer and the multi-scale features from {P2, P3, P4}. We combine the multi-scale features and the context feature through a series of operations to obtain a fixed-scale RoI feature descriptor. Finally, two fully connected layers are used for classification and bounding box regression.
To make effective use of the information in our multi-scale features, we perform the RoI Align operation on the RoIs from the cross-layer connected feature layers {P2, P3, P4} and on the context feature, respectively. Compared with traditional RoI Pooling, RoI Align reduces the impact of pixel misalignment and provides more accurate spatial localization. RoIs from different layers have different scales. Although RoIs of different scales can be combined through simple operations, doing so directly is ill-posed: RoIs from deep feature layers are easily dominated or submerged by RoIs from shallow feature layers, which gives poor results in practice. We therefore fuse the RoIs obtained from the different feature layers through L2 normalization, concatenation and 1×1 convolution, and restore the size to 512×7×7 for input to the subsequent fully connected layers. In this way, our structure combines multi-level features with contextual information to achieve object feature enrichment.
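The normalise, concatenate and reduce sequence can be illustrated at a single spatial location of the RoI feature. In this sketch each layer contributes one channel vector, a 1×1 convolution is written as a plain matrix applied over the concatenated channels, and the weight matrix stands in for learned parameters. The function names and the toy dimensions are assumptions for illustration.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a channel vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def enrich(features, weights):
    """Fuse per-layer channel vectors at one spatial location.

    features: list of channel vectors, one per feature layer (and context RoI)
    weights:  out_channels x total_in_channels matrix, standing in for a
              learned 1x1 convolution (hypothetical values in this sketch)
    """
    stacked = [v for feat in features for v in l2_normalize(feat)]  # concatenation
    return [sum(w * v for w, v in zip(row, stacked)) for row in weights]
```

Because every vector is normalised before concatenation, a layer whose activations are a hundred times larger no longer dominates the fused descriptor, which is the imbalance the paragraph above describes.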
Fig. 8. The procedure of forming context features.
Fig. 9. Feature enrichment structure.
In this section, we first introduce the traffic vehicle dataset used in the experiments and determine the anchor sizes. We then define the evaluation criteria and carry out a series of ablation experiments to verify the effectiveness and contribution of our proposed strategies. Next, we compare the performance of the proposed model with other advanced networks for multi-scale traffic vehicle detection. Finally, we show some examples of the detection results of our method.
To verify the effectiveness of our method, we chose the UA-DETRAC vehicle dataset [37] for this experiment. UA-DETRAC is a challenging multi-target vehicle detection and tracking dataset, divided into a training set and a test set. It was captured in real traffic scenes at a resolution of 960×540 pixels. It contains a range of challenging conditions, such as targets of various scales and poses, night and day, occlusion by other objects, and background clutter. It therefore provides a rich set of test scenes that is not limited to specific observation conditions, as shown in Fig. 10. The UA-DETRAC dataset comprises 100 video sequences in which 8250 vehicles are manually labeled, with 1.21 million object bounding boxes in total. The vehicles are divided into four types: van, car, bus, and others.
The statistics of a single video sequence of the UA-DETRAC dataset, shown in Fig. 11, indicate that small-scale vehicle targets account for a high proportion and that targets of various scales are present. This matches real traffic scenes and makes the dataset suitable for training our model.
The original UA-DETRAC dataset has no context RoI annotations. To take the context information of each object into account during training, we add two context annotations derived from the original label of the dataset. These two labels lie inside and outside the original label, respectively, and are proportional to it. Their scale and location are determined by Eqs. (8) and (9).
Our traffic vehicle detection method was implemented in Python 3.7 with the TensorFlow framework. The platform runs the 64-bit Windows 10 operating system with an Intel(R) Core(TM) i5-6300U CPU @ 2.40 GHz, Intel(R) HD Graphics 520, 8 GB of DDR3 memory, and a GTX 2080Ti GPU. Our algorithm runs on the GPU.
In traditional models such as Faster R-CNN, anchors are set manually. Anchor sizes are usually set with aspect ratios of {1:1, 2:1, 1:2} and scales of {128, 256, 512} [9]. In practice, to account for the small-scale targets in a dataset, the two scales {32, 64} are selectively added, or the aspect ratios are changed [38]. In experiments, anchors do not completely cover the targets, so the bounding boxes must be regressed and adjusted continuously, which is time-consuming. The disadvantage of manually designed anchors is that they are not guaranteed to suit a specific dataset. If the anchor sizes differ greatly from the target sizes, both the detection quality and the detection time of the model suffer. The design of anchor sizes is therefore important for the regression of targets of different sizes. As the output feature layers and target scales change, the anchor sizes also need to be reset. Here, we use the K-means clustering algorithm to cluster the bounding boxes of vehicle targets in the UA-DETRAC dataset to find more suitable anchor widths and heights.
The K-means clustering algorithm [39] is one of the most common clustering algorithms; it finds the most representative values of a given dataset through iterative partitioning according to a chosen distance function. Through clustering, the latent class label of each sample can be found. Traditional clustering methods use the Euclidean distance to measure the difference between sample points and cluster centers. If we applied the traditional K-means clustering algorithm directly, K anchor width-height combinations could be generated from the widths and heights of the clustered bounding boxes. However, the error grows with the size of the target bounding box. To reduce the impact of size on clustering, and inspired by the anchor generation method in YOLOv2 [40], we use the Intersection over Union (IoU) of the anchor box and the bounding box to measure the distance between each sample point and the cluster center. The IoU is defined as follows:
$$\mathrm{IoU}(bb,ab)=\frac{\mathrm{area}(bb\cap ab)}{\mathrm{area}(bb\cup ab)}$$
In the above formula, bb represents the bounding box of each vehicle target, that is, the ground truth labeled in the dataset, and ab represents the anchor box.
The purpose of K-means clustering is essentially to obtain a large IoU value and the shortest distance from the cluster center, so we use a new distance function, which is defined as follows:
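A distance with these properties, and the one used in the YOLOv2 anchor clustering that inspires this step, is one minus the IoU. The form below is an assumption based on that reference, written with the bb and ab symbols introduced above:

```latex
d(bb, ab) = 1 - \mathrm{IoU}(bb, ab)
```

Under this form, d lies in [0, 1] and shrinks as the overlap with the cluster center grows, so a large box and a small box with the same relative overlap contribute the same error.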
Experience shows that the more anchors are used, the greater the average IoU. However, more anchors also make the model more complex. To balance average IoU against model complexity, and to correspond to the four feature-layer scales we use, we set K to 12; that is, the number of cluster centers (Wi, Hi) is 12. Clustering stops when the change in the distance function d falls below a set threshold or when the maximum number of iterations is reached. In this paper, the maximum number of iterations is set to 50. We randomly select 10 of the 100 video sequences for clustering. The raw data to be clustered are visualized in Fig. 12(a), where the horizontal coordinate is the width of the object bounding box and the vertical coordinate is its height. Fig. 12(b) shows the final clustering result.
The x in the graph marks each final cluster centroid. Because the result of K-means clustering depends on the initial cluster centers, we conducted six experiments and combined their results to determine the anchor sizes. Finally, the width and height values of 12 anchor boxes are obtained, which correspond to the feature layers of different scales. The resulting anchor box sizes are listed in Table 1.
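The clustering procedure can be sketched as below. This is an illustrative sketch rather than the authors' code: it assumes the distance d = 1 − IoU, compares boxes by width and height only (as if they shared a corner), updates each center as the mean of its members, and runs on synthetic box sizes. The function names `iou_wh` and `kmeans_anchors` and the tolerance value are hypothetical; K = 12 and the cap of 50 iterations follow the text.

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between (N, 2) boxes and (K, 2) centers given as (width, height).

    Boxes are compared as if they shared the same top-left corner,
    so only their shapes matter.
    """
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0])
             * np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centers[:, 0] * centers[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_anchors(boxes, k=12, max_iter=50, tol=1e-6, seed=0):
    """Cluster (width, height) pairs with the assumed distance d = 1 - IoU."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    prev_cost = np.inf
    for _ in range(max_iter):
        dist = 1.0 - iou_wh(boxes, centers)
        labels = dist.argmin(axis=1)
        cost = dist[np.arange(len(boxes)), labels].mean()
        for j in range(k):
            members = boxes[labels == j]
            if len(members):                 # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
        if abs(prev_cost - cost) < tol:      # change in d below the threshold
            break
        prev_cost = cost
    order = np.argsort(centers[:, 0] * centers[:, 1])
    return centers[order]                    # sorted from small to large area

# Synthetic stand-in for the labeled box sizes (not UA-DETRAC data).
rng = np.random.default_rng(1)
widths = rng.uniform(10, 300, size=2000)
boxes = np.stack([widths, widths * rng.uniform(0.5, 1.2, size=2000)], axis=1)

anchors = kmeans_anchors(boxes)
mean_iou = iou_wh(boxes, anchors).max(axis=1).mean()
print(anchors.round(1))
print(f"mean best IoU: {mean_iou:.3f}")
```

Changing `seed` changes the initial centers and therefore the result, which is why several runs are combined in the text. Sorting the centers by area makes it straightforward to assign the smallest anchors to the highest-resolution feature layer.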
Table 1 Anchor box sizes for different feature layers.
Fig. 12. Data distribution diagram of objects. (a) Raw data, (b) clustering results.
Many evaluation metrics exist in the field of object detection; here we use several simple ones. We use precision and recall to evaluate the performance of our method. They are defined as follows:
$$\mathrm{Precision}=\frac{TP}{TP+FP},\qquad \mathrm{Recall}=\frac{TP}{TP+FN}$$
where TP is the number of true positives, that is, positive samples correctly predicted as positive; FN is the number of false negatives, that is, positive samples wrongly predicted as negative; and FP is the number of false positives, that is, negative samples wrongly predicted as positive. In the detection task, true positives and false positives are determined by the Intersection over Union (IoU) between the predicted box and the ground truth. In this model, the IoU threshold is set to 0.7: a prediction with IoU greater than or equal to 0.7 is counted as a true positive, and one with IoU below 0.7 as a false positive. Precision and recall evaluate performance from two different perspectives. Higher precision means that more of the samples the model predicts as vehicles are truly vehicles, and higher recall means that the model finds more of the true vehicle samples. Precision and recall are generally negatively correlated. A well-performing model keeps precision high as recall increases, whereas a weaker model must sacrifice considerable precision to improve recall. To balance recall and precision, Average Precision (AP) is used to evaluate the performance of the model on one target class. AP is the average precision as recall varies from 0 to 1, i.e. the area under the P-R curve. It can be computed as the following integral:
$$AP=\int_{0}^{1}P(R)\,\mathrm{d}R$$
where P(R) is the P-R curve; the higher the AP, the better the model performance.
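As an illustration of how these quantities are typically computed from a ranked list of detections, the sketch below builds the P-R curve and approximates the AP integral numerically. It assumes each detection has already been marked as a true or false positive (for example with the 0.7 IoU threshold above) and uses the monotone precision envelope common in PASCAL VOC-style evaluation; that choice, the function names, and the toy inputs are assumptions, not details taken from the paper.

```python
import numpy as np

def precision_recall_curve(scores, is_tp, num_gt):
    """Precision and recall after each detection, in descending score order.

    is_tp[i] is 1 if detection i matched a ground-truth box, else 0.
    num_gt is the number of ground-truth boxes, i.e. TP + FN.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / num_gt
    return precision, recall

def average_precision(precision, recall):
    """Area under the P-R curve using the non-increasing precision envelope."""
    p = np.concatenate([[0.0], precision, [0.0]])
    r = np.concatenate([[0.0], recall, [1.0]])
    p = np.maximum.accumulate(p[::-1])[::-1]   # envelope: best precision at or beyond each recall
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# Toy example: four detections, four ground-truth vehicles.
scores = [0.9, 0.8, 0.7, 0.6]
is_tp = [1, 0, 1, 1]
precision, recall = precision_recall_curve(scores, is_tp, num_gt=4)
ap = average_precision(precision, recall)
print(precision, recall, ap)
```

In the toy example one ground-truth vehicle is never detected, so recall stops at 0.75 and the remaining part of the recall axis contributes nothing to the area.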
To further examine the contribution of each proposed improvement, we carried out a series of ablation experiments. These experiments are designed to demonstrate the advantages of our approach and the effectiveness and contribution of each strategy. We validate the strategies one by one below.
In our proposed model, we use the K-means clustering algorithm, guided by the IoU of the bounding box and the anchor box, to generate anchor boxes suited to the dataset. To verify the effectiveness of this strategy for the whole model, we compare it with default anchors, which use aspect ratios of {1:1, 2:1, 1:2} and scales of {64, 128, 256, 512}. The results are shown in Table 2. The anchors generated by K-means perform better than the default anchors. Our anchor generation strategy not only improves the detection accuracy for small targets but also reduces the detection time. Therefore, we use the K-means clustering algorithm to generate anchors in our model.
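For comparison, the default anchor set used as the baseline can be enumerated as in the sketch below. It assumes the common convention that each anchor has area scale² and that the aspect ratio is height divided by width; the paper does not state which convention it uses, and the function name `default_anchors` is hypothetical.

```python
import numpy as np

def default_anchors(scales=(64, 128, 256, 512), ratios=(1.0, 2.0, 0.5)):
    """Return (width, height) pairs with area scale**2 and height/width = ratio.

    The area and ratio conventions are assumptions for this sketch.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((w, h))
    return np.array(anchors)

anchors = default_anchors()
print(anchors.shape)  # (12, 2)
```

Four scales and three ratios give 12 default anchors, the same number as the K = 12 clustered anchors, so the comparison does not change the anchor count.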
Table 2 Effect of anchor generation strategy by K-means clustering algorithm.
Our model uses the RIAC algorithm to preprocess the images of the UA-DETRAC dataset and improve image quality. Detection on higher-quality images can be expected to yield better results. In some challenging images of the dataset, most of the vehicle details are hidden in the dark. We compare our detection results on the original and enhanced datasets in Table 3. Removing the preprocessing algorithm reduces performance by 0.63%. The main reason is that the RIAC algorithm reduces the impact of shadows and illumination changes and also reduces the number of pixels misjudged between shadows and real vehicle targets. The results show that the image enhancement provided by the RIAC algorithm helps vehicle detection in complex environments.
Our strategy uses RNN controllers, which take accuracy as the reward, to combine features extracted from multiple feature layers, so that both the high-level semantic information of deep features and the fine-grained detail of shallow features are exploited for better detection accuracy. Here, we quantitatively compare the cross-layer connection strategy we obtained with widely used cross-layer connection methods; the results are shown in Table 4. Our cross-layer connection method achieves better detection results than the other strategies.
Table 3 Influence of RIAC image enhancement algorithm.
Table 4 Comparison of different cross-layer connection methods.
Our proposed object feature enrichment contains two important components, namely the multi-scale semantic layer fusion part and the multi-scale context semantics part, which together synthesize global and local context information. The target feature expression in the obtained RoI includes not only the basic feature representation of the target but also the direct context feature representation of the target and its environment and the pure appearance feature representation of the central part of the target, which makes our detection model more robust to challenging targets. We compare the impact of these two parts on the test results on the UA-DETRAC dataset, as shown in Table 5. Ours-M denotes the model that uses only multi-scale information for vehicle detection, and Ours-C denotes the model that uses only context information.
Table 5 Effect of different parts of feature enrichment.
We also compare our method with other advanced methods on the UA-DETRAC dataset; the results are shown in Table 6. Compared with the other methods, our method achieves good detection performance. In particular, it substantially alleviates the insufficient ability of Faster R-CNN to detect small-scale vehicles. The AP of Fast R-CNN for small targets in the dataset is only 14.16%, whereas our method raises it to 43.64%. Our model is also robust for occluded vehicles, and the AP for large-scale and medium-scale targets is improved as well.
Table 6 Comparison of our method with other representative methods on UA-DETRAC dataset.
Fig. 13. P-R curves of the four categories of the dataset: (a) car, (b) van, (c) bus, (d) others.
The P-R curves for the four vehicle categories are shown in Fig. 13. Our method achieves good results for the car class. The other three categories account for a small proportion of the dataset, which causes class imbalance; nevertheless, our model remains robust compared with other methods. Methods without an RPN to reduce background proposals produce some misjudgments during detection, and their results are poor.
We show some detection results on the dataset in Fig. 14. The red bounding boxes are the original labels in the dataset, and the green bounding boxes are the detected regions. Our model accurately detects multi-scale, multi-target vehicles in real traffic scenes. Some vehicles occluded by green belts are detected well, and the model is also robust for small-scale vehicles.
Fig. 14. Examples of detection results.
In this paper, a new multi-scale, multi-target vehicle detection method is proposed for vehicle detection in complex natural environments. Using the Retinex-based image adaptive correction algorithm (RIAC), we enhance the images of our dataset, improving image quality by reducing the impact of shadows and illumination. We propose a multi-layer feature extraction method that produces optimal cross-layer connections through Neural Architecture Search, which enhances the feature representation of the Faster R-CNN backbone and improves the detection of multi-scale vehicle targets. Our proposed object feature enrichment combines the multi-layer feature information with the context information of the last layer after cross-layer connection to enrich the information of each target and improve the robustness of the model for small and severely occluded targets. In the experiments, we use the K-means clustering algorithm to generate anchors suited to the dataset, which improves model convergence speed and detection efficiency to some extent. The results show that our model achieves excellent detection results for the four vehicle categories in the UA-DETRAC dataset. Compared with similar methods, our method exceeds the previous state of the art in accuracy, which demonstrates its validity. However, our model classifies vehicles into only four types. In future research, we will label vehicles for finer-grained classification. The detection speed should also be further improved to facilitate application in real scenes.
Funding
This research was funded by the National Natural Science Foundation of China (grant number: 61671470) and the Key Research and Development Program of China (grant number: 2016YFC0802900).
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.