Hining WANG, Yng LI, Yuqing FANG, Yurong LIAO, Bito JIANG, Xito ZHANG, Shuyn NI*
a Space Engineering University, Beijing 101416, China
b Beijing Institute of Remote Sensing Information, Beijing 100192, China
KEYWORDS Dense connection; Object detection; Pretraining; Remote sensing image; Train from scratch
Abstract Most current object detection algorithms use models that are pretrained on ImageNet and then fine-tuned within the detection network, which achieves good performance for general object detectors. However, in remote sensing image object detection, pretrained models differ significantly from remote sensing data, so it is meaningful to explore a train-from-scratch technique for remote sensing images. This paper proposes SRS-Net, an object detection framework trained from scratch, and describes the design of a densely connected backbone network that provides integrated hidden layer supervision for the convolution modules. Two necessary improvement principles are then proposed: studying the role of normalization in the network structure, and improving data augmentation methods for remote sensing images. To evaluate the proposed framework, we performed many ablation experiments on the DIOR, DOTA, and AS datasets. The results show that the improved backbone network, the normalization method, and the training data augmentation strategy each increase the performance of the object detection network trained from scratch. These principles compensate for the lack of pretrained models. Furthermore, we found that SRS-Net achieves performance similar to or slightly better than that of the baseline methods, and surpasses most advanced general detectors.
Object detection is the core problem of optical remote sensing image interpretation. Remote sensing image object detection refers to the use of algorithms to accurately locate objects and classify their categories in images. It is widely used in intelligent scheduling, urban planning, intelligence reconnaissance and other fields.¹ With the continuous advances of earth observation technology, the scale of high-resolution images obtained by remote sensing has increased, and the images contain more information, which brings great opportunities for progress in remote sensing image object detection and makes the observation of various objects on the ground more valuable.
In recent years, Convolutional Neural Networks (CNNs) have developed rapidly, and deep-learning-based object detection methods for remote sensing images have been proposed. To improve detection performance, researchers often use the pretraining and fine-tuning approach, applying network models pretrained on large classification databases such as ImageNet² and then fine-tuning them adaptively on downstream vision scenes, as shown in Fig. 1. First, pretraining is performed on large databases such as ImageNet. Second, adaptive fine-tuning is performed on downstream tasks such as image segmentation, object detection, and object tracking. These pretrained models serve as the initialization during network training and have been applied in various works, including image segmentation,³ object detection,⁴ and object tracking.⁵ According to past experience, the pretrained model contains rich data association information, which can provide preliminary knowledge representation and entity semantics for various downstream tasks so that the gradient of the network is stable during the learning process. These processes can be called transfer learning, or can be understood as a new semantic understanding driven by pretrained knowledge. By using the pretrained model, a general object detection framework can naturally learn the object feature information of the new dataset and then complete the updating and fine-tuning of the detection network model.
However, the information structure contained in remote sensing images is significantly different from that of natural scene images. Therefore, whether the object detection task of remote sensing images should follow the same training paradigm as that in natural scenes is a question worth studying. Previous research⁶ has found that ImageNet-based pretraining can accelerate the convergence of the model but cannot improve its final detection performance. Another study⁷ posited that if the parameters are set accurately, the self-training effect of the network can even surpass that of the traditional pretraining paradigm. In the remote sensing image object detection field, fine-tuning with models trained on large-scale classification datasets such as ImageNet therefore has certain limitations: (A) The pretrained model domain is considerably different from the object domain of remote sensing images. For example, most pretrained models are classification models trained on large-scale RGB datasets like ImageNet. It is difficult to transfer ImageNet-based models to hyperspectral or visible remote sensing images. (B) The process of model fine-tuning limits the optimization of the detection network structure. The pretrained model structure cannot be changed, so the obtained model cannot be structurally optimized. In sum, these two limitations mean that researchers need to rethink the use of pretraining and fine-tuning in remote sensing object detection.⁸
The motivation of this paper can be summarized in two aspects. First, in the process of feature extraction, to eliminate the bias between remote sensing data and the data used by pretrained models, we can train the network from zero initial values. For a specialized detector dealing with remote sensing data, it is crucial to avoid data bias in the feature extraction process. Second, considering that the data obtained by a mobile device are often extremely complex, if we use a pretrained model, the parameters of the backbone network cannot be optimized and adjusted to the dataset. Training-from-scratch enables the network to achieve adaptive learning on difficult datasets. Therefore, we propose the training-from-scratch method for remote sensing (SRS-Net), an object detection framework in remote sensing that is trained from scratch and can eliminate the limitations of pretrained models in remote sensing data applications. This paper also illustrates how a network trained from scratch can efficiently achieve adaptive learning in the remote sensing domain, compared with fine-tuning a large-scale pretrained model.
Overall, the main contributions of this paper are as follows:
(1) The model of training-from-scratch is introduced into remote sensing object detection, which provides a new idea for algorithms and application research in this field.
(2) Dense-Connection-Cross-Stage-Partial (DCCSP) is proposed to make up for the impact of the lack of pretrained models.
(3) The effects of gradient normalization and data augmentation on training-from-scratch are researched.
Fig. 1 Vision tasks under pretraining and fine-tuning paradigm.
The earliest use of pretrained models in object detection tasks was proposed in Ref. 9. A classification model trained on ImageNet was applied as the initialization of the object detection network, and the network nodes were updated by fine-tuning. This process is called model pretraining. The pretraining process of the upstream model has two characteristics. The first is that the dataset is large and prelabeled. For example, ImageNet has about 14 M images. Such datasets are huge and complete, and naturally serve as pretraining datasets. The second is that many generic features are learned into the network parameters without considering the focus of downstream tasks. Due to these characteristics, various visual downstream tasks can obtain better performance after using large-scale pretrained models.¹⁰ The transfer learning process of downstream tasks also has two characteristics. The first is that the upstream model is fine-tuned and applied to small datasets, such as the remote sensing image datasets DOTA¹¹ or DIOR.¹² The fine-tuning methods include increasing resolution or mixing regularization. The second is that the computational overhead of the fine-tuning process is small, which is suitable for fast training in general visual task scenarios. For these reasons, most existing object detectors adopt pretraining and fine-tuning mechanisms. Recent studies have developed a variety of effective pretrained models for object detection, such as ResNet,¹³ VGGNet,¹⁴ and DenseNet,¹⁵ as well as the latest transformer-based pretrained backbone networks.¹⁶,¹⁷
Current deep-learning-based remote sensing object detection algorithms can be classified into two sorts: detectors based on candidate regions and detectors based on regression.¹⁸ The candidate-region-based method first generates a series of regions that may contain the object, and then performs object classification and prediction box regression, such as in Refs. 19,20. Researchers then carry out the adaptive reconstruction of the detector for remote sensing from many aspects: enhancing the feature representation of remote sensing objects,²¹ optimizing the generation of candidate regions,²² improving the positioning accuracy of objects,²³ and so on. The regression-based method can directly regress and determine the prediction box and category of the object from multiple positions of the remote sensing image, as seen in Refs. 24–29.
Since the images of remote sensing scenes usually have a large number of pixels and object detection has high real-time requirements, it is necessary to select a network with a fast detection speed. At present, YOLOv5³⁰ reaches a desirable balance between precision and speed, so we chose it as the benchmark network for research and improvement. Specifically, first, YOLOv5 uses adaptive image scaling and adaptive anchor box calculation, which is conducive to accelerating the inference process; second, the Focus structure is used in the backbone, which can expand the number of input channels to enhance the efficiency of information transmission; third, YOLOv5 uses multiple positive sample matching to improve training efficiency.
Before pretrained models became popular, some classic object detectors were directly trained from scratch, such as in Ref. 31. Since then, based on the success of the pretraining and fine-tuning paradigm, many object detectors have been applied to remote sensing object detection scenes. Since He et al.⁶ demonstrated that the performance of a network model trained from scratch is no worse than that obtained with pretraining on the ImageNet dataset, there have been some advanced studies exploring the design guidelines of network training from scratch. Ref. 32 designed an SSD-based network trained from scratch that demonstrates a set of instructive design principles. Ref. 33 realized the recalibration of the object detection supervision signals by using the threshold mechanism and the recurrent feature pyramid, which greatly increases the speed and precision of the network trained from scratch. Ref. 34 studied the effect of batch normalization on stabilizing network training and gradient descent, and proposed Root-ResNet to help train object detectors from scratch. Ref. 35 designed a pruning framework for sparse neural networks trained from scratch and achieved good performance. Ref. 36 reported that the Vision Transformer outperformed ResNet without pretraining or strong data augmentations. Furthermore, the authors in Ref. 37 demonstrated that pretrained models obtained on large-scale datasets are not necessarily beneficial for downstream tasks.
This section describes the proposed SRS-Net framework without pretraining. This framework guides and supervises the network for more adequate training, reports the beneficial role of model normalization in training from scratch, and describes effective data augmentation in the remote sensing scene. Our goal is to eliminate the limitation that the pretrained model imposes on remote sensing data training, and to find an effective way to start training from scratch.
There are two reasons why the proposed SRS-Net framework is similar to YOLOv5. First, we aimed to design a single-stage detector that considers both detection accuracy and speed, in order to study how to compensate for the lack of prior knowledge caused by the absence of a pretrained model. Second, a complex network structure creates a huge hardware resource overhead, so it is necessary to avoid the heavy consumption of computing resources in practical applications of the proposed detector. The SRS-Net framework includes a feature extraction backbone, a feature fusion neck, and a classification and regression detection head network, as shown in Fig. 2. As the goal of SRS-Net is to train the network from scratch using small-scale remote sensing data, without the prior feature information that a pretrained model would supply, a densely connected network structure is theoretically effective in achieving this goal. In addition, studies have shown that dense structures can improve the representation ability of the network to a certain extent.¹⁵ To enable the network to achieve sufficient training, we designed SRS-Net to improve the original backbone CSP-darknet53, add dense connections, and concatenate multi-scale feature maps to retain the limited multi-scale information to the greatest extent. The improved backbone is named DCCSP. The dense structures introduced in DCCSP are used to realize feature reuse along the channel dimension. Compared with the ordinary connection method, DCCSP has two advantages: first, it enhances the feature extraction and representation ability; second, it stabilizes the gradient information of the network to avoid exploding gradients. SRS-Net uses the Path-Aggregation-Network (PANet) structure that is already built into the YOLOv5 framework as the feature fusion neck network. This network sits in the middle part of the framework and integrates and processes the features output by the backbone. In addition, SRS-Net uses a classification and regression detection head network and grid-based multi-scale anchor boxes to perform bounding box regression and object classification on feature maps of different scales.
Fig. 2 Architecture comparison of SRS-Net and YOLOv5.
As shown in Fig. 2, the training image first passes through the backbone network, DCCSP, with densely connected structures, in which multi-scale feature extraction is performed. No pretrained model or large-scale dataset is involved in this process. The figure also compares the structural difference between DCCSP and CSP. CSP is a common connection method, and its extraction and processing of features are relatively simple. Then, SRS-Net inputs the extracted effective feature layers to the feature fusion neck network so that the multi-layer features are fused through up-sampling and down-sampling. The fused feature layer can be regarded as a collection of feature points. These feature points are input to the detection head network, where the regression parameters are adjusted and the object category is judged.
Previous studies have demonstrated the role of densely connected structures in neural network training, such as DSN³⁸ and GoogLeNet.³⁹ The densely connected structures add additional branches for the hidden layers of the neural network to guide and supervise the network for more adequate training. In fact, densely connected networks can provide supervision for most convolution modules, providing integrated hidden layer supervision by introducing an adjoint objective function for each convolution module. In addition, for object detection tasks trained on small-scale remote sensing datasets, the densely connected structures can have a strong regularization effect.¹⁵ The proposed DCCSP backbone is based on the theory of the dense connection structure and adopts the dense connection method for convolution modules to realize layer-by-layer supervision of the network. In Fig. 3, we can see that the dense connection is embedded in each convolution module as a parallel feature extraction branch. In addition, some modules built into the original CSP, such as the Focus module, have been proved to be effective. These modules are therefore retained in DCCSP, and the necessary network structure is inherited.
Specifically, after the input image enters the DCCSP network, it first passes through the Focus module of the original CSP to obtain shallow features with a size of 320×320 pixels. Then, it is input to the first 3×3 ordinary convolution module to obtain the second-layer feature map with 160×160 pixels. The ordinary convolution module contains convolution layers and normalization layers, where the convolution sliding stride is set to 2. Simultaneously, the shallow feature is processed by the dense convolution module and concatenated with the second-layer feature map along the channel dimension. The concatenated feature map continues to be passed down. The dense convolution module comprises a 2×2 max-pooling layer, a 1×1 convolution layer, and a normalization layer (a discussion on normalization layers is presented in Section 3.3).
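The size bookkeeping in this step can be checked with a small sketch. It only tracks feature map shapes; the 640×640 input size, all channel counts, and the padding values are assumptions chosen so that the sizes match the 320×320 and 160×160 maps quoted above, and the function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def dccsp_stem_shapes(input_size=640, stem_ch=64, conv_ch=128, dense_ch=64):
    """Return (channels, height, width) for the first DCCSP stages.

    All channel counts and the 640-pixel input are illustrative assumptions.
    """
    # Focus module: every second pixel is sliced out, halving height and width.
    focus_size = input_size // 2
    shallow = (stem_ch, focus_size, focus_size)

    # Ordinary convolution module: 3x3 kernel, stride 2 (padding 1 assumed).
    second_size = conv_out(focus_size, kernel=3, stride=2, padding=1)
    second = (conv_ch, second_size, second_size)

    # Dense convolution module on the shallow feature:
    # 2x2 max-pooling (stride 2 assumed) followed by a 1x1 convolution.
    dense_size = conv_out(focus_size, kernel=2, stride=2, padding=0)
    dense_size = conv_out(dense_size, kernel=1, stride=1, padding=0)
    dense = (dense_ch, dense_size, dense_size)

    # Channel-wise concatenation requires identical spatial sizes.
    if dense[1:] != second[1:]:
        raise ValueError("spatial sizes must match before concatenation")
    concat = (second[0] + dense[0], second_size, second_size)
    return shallow, second, dense, concat


shallow, second, dense, concat = dccsp_stem_shapes()
print(shallow, second, dense, concat)
```

The point of the pooling branch is visible in the shapes: it brings the shallow feature down to the resolution of the second-layer map so that the two can be stacked along the channel axis.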
The DCCSP network implements the strategy of deep supervision by dense connection, which can enhance the feature representation capacity of the network. In a traditional CSP network, let the output of the $N$th convolution module be $F_N$; then $F_N$ can be expressed as follows:
where $f_N(F_{N-1})$ represents the nonlinear transformation applied by the $N$th convolution module to the output features of the $(N-1)$th layer, and $F_{N-1}$ represents the output of the $(N-1)$th convolution module. However, in the DCCSP network, the output of the $N$th dense convolution module is $F_N$, and the $N$th layer generates an enhanced feature map by concatenating and merging the features of the previous $N-1$ layers, which is expressed as follows:
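The two expressions referred to in this passage can plausibly be reconstructed from the surrounding definitions, as sketched below. The bracket notation for channel-wise concatenation and the starting index of the concatenated features are assumptions borrowed from the usual DenseNet formulation, not taken from the original equations.

```latex
\begin{align}
F_N &= f_N\left(F_{N-1}\right) \tag{1} \\
F_N &= f_N\left(\left[F_1, F_2, \ldots, F_{N-1}\right]\right) \tag{2}
\end{align}
```

Under this reading, the only difference between the two forms is the input of $f_N$: the plain CSP module sees the previous output alone, whereas the dense module sees the concatenation of all earlier outputs.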
Fig. 3 Densely connected DCCSP network.
Comparing Eq. (1) and Eq. (2), we can see that the dense connection method plays a crucial role in the backbone network, especially when the object detection network is trained from scratch without a pretrained model. The dense connections can guide the neural network through training under deep supervision. Furthermore, there are some tiny or weak objects in remote sensing images, and the dense structures retain the object information from the shallow network, which enhances the feature extraction effect of the backbone and plays a key role in improving object detection precision in remote sensing images.
A previous study⁶ experimentally showed that the role of normalization cannot be ignored when training the detection network from scratch. Valid forms of normalization include normalized parameter initialization⁴⁰ and activation normalization layers.⁴¹ When the parameters of the network are updated, a normalization layer is usually used to regulate the distribution of the learning data. As shown in Fig. 4, the normalization layer in SRS-Net is located after the convolution layer, reducing the internal covariate shift of the data. The normalization method can be expressed as follows:
where $x_i$ denotes the input feature to be normalized; $y_i$ denotes the output feature; $\gamma$ and $\beta$ denote the independent learnable parameters; and $\varepsilon$ is a small offset constant that prevents the denominator from being equal to 0. $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of the data to be normalized, respectively, which are given by
Fig. 4 Position of normalization layer in SRS-Net.
where $S_i$ represents the pixel set used for calculating the mean and standard deviation, $m$ is the size of this set, and $i$ and $k$ are the index numbers. Different normalization methods calculate the pixel set $S_i$ differently.
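The normalization and its statistics, as described by the symbol definitions above, are commonly written in the following form. This is a reconstruction under the assumption that the paper follows the standard formulation; in particular, the exact placement of $\varepsilon$ (here inside the square root) is an assumption.

```latex
\begin{align}
y_i &= \gamma \, \frac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \varepsilon}} + \beta \tag{3} \\
\mu_i &= \frac{1}{m} \sum_{k \in S_i} x_k, \qquad
\sigma_i = \sqrt{\frac{1}{m} \sum_{k \in S_i} \left(x_k - \mu_i\right)^2} \tag{4}
\end{align}
```

Everything that distinguishes one normalization method from another is contained in the choice of $S_i$; the affine parameters $\gamma$ and $\beta$ and the formula itself stay the same.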
Batch Normalization (BN)⁴² is widely used in network training and directly affects the gradient optimization process of the image classifier. In fact, when the network is trained from scratch, the distribution of the learning data is unknown, and the weights of the hidden layers need to be constructed from scratch; so BN, which is widely used in object detection networks, has a great impact on training, especially when training an object detector dedicated to remote sensing. According to the definition of BN, the $S_i$ calculation of the BN operation is as follows:
where $C$ is the channel axis, and $k_C$ and $i_C$ represent the indices of $k$ and $i$ on the channel axis, respectively. The BN operation needs to normalize the data under the same channel at the same time; that is, the mean and standard deviation of the batch learning data are calculated over the three dimensions $(N, H, W)$. This means that a trained detection network cannot represent the test data well when the training and test data have different feature distributions. Furthermore, the BN operation is affected by the batch size setting. If the batch size is small, the calculation of the mean and standard deviation will not be accurate enough. If the batch size is large, the memory consumption will significantly increase. Therefore, training a high-resolution remote sensing object detection network from scratch is more difficult than training on natural scenes, and modifying the normalization settings of the network is a key factor in avoiding the influence of the learning data distribution and batch size.
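For reference, the pixel set for BN described at the start of this paragraph is usually written as below. This is a reconstruction that assumes the notation follows the Group Normalization paper cited as Ref. 43, in which $k$ and $i$ are four-dimensional indices over $(N, C, H, W)$.

```latex
\begin{equation}
S_i = \left\{ k \mid k_C = i_C \right\} \tag{5}
\end{equation}
```

Because the only constraint is on the channel index, the set pools values from every sample in the batch, which is exactly why the statistics depend on the batch size.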
Group Normalization (GN)⁴³ is another common normalization training method. GN divides the channels into several groups and calculates the mean and standard deviation of the input features within each group. Thus, it can describe the characteristics of the data distribution more powerfully. In addition, the GN operation is independent of the batch size setting and is not constrained by it; thus, the normalization performance during training is stable. Ignoring the role of GN can give the misconception that detectors are hard to train from scratch. Given this, it is reasonable to introduce the GN operation into a remote sensing object detection network trained from scratch. According to the definition of GN, the $S_i$ calculation of the GN operation is as follows:
where $G$ is the group number setting, $C/G$ indicates the number of channels in each group, and $\lfloor \cdot \rfloor$ indicates the floor operation. $N$ is the batch axis, and $k_N$ and $i_N$ denote the indices of $k$ and $i$ on the batch axis, respectively. The GN operation divides the channels into several groups and then calculates the mean and standard deviation for each sample over the $(H, W, C/G)$ dimensions. This means that the GN operation is independent of the batch size setting and is also unaffected by the distribution state of the training data and test data, thus better optimizing the training of the object detection network from scratch. A comparison between the calculation processes in BN and GN is shown in Fig. 5. A neural network that uses GN can guide the training toward the optimal solution and is not affected by the data distribution, whereas a network that uses BN may not reach the optimal convergence state. Based on the above analysis, the normalization layer in SRS-Net uses GN operations. Subsequent experiments (detailed in Section 4.3.2) demonstrated that this modification could better train the detector from scratch.
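The batch-size argument can be illustrated with a minimal pure-Python sketch. It implements the two pixel-set rules directly as predicates on $(n, c, h, w)$ indices and shows that the GN output for one sample is the same whether it is normalized alone or inside a batch, whereas the BN output changes. The toy tensors, the choice of $G = 2$, and the simplification $\gamma = 1$, $\beta = 0$ are assumptions for illustration only.

```python
from itertools import product
from math import sqrt


def normalize(x, in_same_set, eps=1e-5):
    """Normalize a nested-list tensor x[n][c][h][w] using a pixel-set rule.

    in_same_set(i, k) is True when index k belongs to the set S_i used to
    compute the statistics for position i (gamma = 1, beta = 0 assumed).
    """
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    indices = list(product(range(N), range(C), range(H), range(W)))
    out = [[[[0.0] * W for _ in range(H)] for _ in range(C)] for _ in range(N)]
    for i in indices:
        values = [x[k[0]][k[1]][k[2]][k[3]] for k in indices if in_same_set(i, k)]
        m = len(values)
        mu = sum(values) / m
        var = sum((v - mu) ** 2 for v in values) / m
        n, c, h, w = i
        out[n][c][h][w] = (x[n][c][h][w] - mu) / sqrt(var + eps)
    return out


def bn_set(i, k):
    # BN: same channel, pooled over the batch and spatial axes (N, H, W).
    return k[1] == i[1]


def make_gn_set(num_channels, num_groups):
    per_group = num_channels // num_groups  # C / G channels per group

    def gn_set(i, k):
        # GN: same sample and same channel group, pooled over (H, W, C/G).
        return k[0] == i[0] and k[1] // per_group == i[1] // per_group

    return gn_set


def flatten(t):
    """Flatten a four-level nested list into a flat list of values."""
    return [v for n in t for c in n for h in c for v in h]


# Two toy samples with C=4, H=1, W=2 and very different value ranges.
sample_a = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
sample_b = [[[10.0, 20.0]], [[30.0, 40.0]], [[50.0, 60.0]], [[70.0, 80.0]]]

gn_set = make_gn_set(num_channels=4, num_groups=2)
gn_alone = flatten(normalize([sample_a], gn_set)[0:1])
gn_batched = flatten(normalize([sample_a, sample_b], gn_set)[0:1])
bn_alone = flatten(normalize([sample_a], bn_set)[0:1])
bn_batched = flatten(normalize([sample_a, sample_b], bn_set)[0:1])

print("GN unchanged by batch:", gn_alone == gn_batched)
print("BN unchanged by batch:", bn_alone == bn_batched)
```

The brute-force set construction is quadratic in the number of elements and is meant only to mirror the definition of $S_i$; a real implementation would reshape the tensor and reduce along the relevant axes.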
ImageNet-based pretrained models iterate over hundreds of epochs on more than one million images, and many low-level features and much shallow semantic information are extracted into models; so, the network does not need to re-learn this information in fine-tuning processes.However, when we train a remote sensing object detection network from scratch, the limited image datasets make it difficult to drive the network for efficient training.Therefore, it is necessary to increase the scale and generalization of the dataset as much as possible so that the training data can fully represent the characteristics of remote sensing objects.
To obtain enough training data, SRS-Net sequentially uses an improved Mosaic operation, random affine transformation, and other methods for data augmentation. The original Mosaic operation crops the image randomly, so it can easily cut through positive samples, especially in remote sensing images. Moreover, the original image sizes differ, and some redundant areas are produced at the edges of the image after the Mosaic operation, resulting in useless computational overhead in the network. Therefore, this section improves the Mosaic operation in line with the characteristics of object distribution in remote sensing images, as shown in Fig. 6. First, take one batch of images and randomly select nine images from it as splicing samples, covering various categories and instances. Then, crop each image according to the area of the minimum circumscribed rectangle, retaining the object information and cutting out the useless information at the edges of the image. Finally, perform image scaling, where the spliced image is scaled to a uniform size and input to the network as a new training sample. This improved Mosaic operation brings two advantages to the remote sensing object detection scene: first, it reduces the redundant area at the image edges, which can make the model converge faster; second, it is equivalent to increasing the amount of data in each training step, which improves the generalization of the training data and effectively prevents overfitting on remote sensing datasets.
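The cropping step can be sketched in a few lines. The sketch below works on box coordinates only: it finds the smallest rectangle enclosing all object boxes of one image, shifts the boxes into the cropped frame, and rescales them to a tile size. The function names, the `(x1, y1, x2, y2)` box format, and the tile size are hypothetical; selecting nine images and pasting the tiles into a grid are omitted.

```python
def min_enclosing_rect(boxes):
    """Smallest axis-aligned rectangle that contains every (x1, y1, x2, y2) box."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    return (x1, y1, x2, y2)


def crop_to_objects(image_size, boxes):
    """Return the crop rectangle and the boxes expressed in crop coordinates."""
    width, height = image_size
    x1, y1, x2, y2 = min_enclosing_rect(boxes)
    # Clamp to the image so the crop never extends past the borders.
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    shifted = [(bx1 - x1, by1 - y1, bx2 - x1, by2 - y1)
               for bx1, by1, bx2, by2 in boxes]
    return (x1, y1, x2, y2), shifted


def scale_boxes(boxes, crop_rect, tile_size):
    """Rescale boxes from the cropped region to a tile of the given size."""
    crop_w = crop_rect[2] - crop_rect[0]
    crop_h = crop_rect[3] - crop_rect[1]
    sx = tile_size[0] / crop_w
    sy = tile_size[1] / crop_h
    return [(b[0] * sx, b[1] * sy, b[2] * sx, b[3] * sy) for b in boxes]


# One 800x800 image with two objects (coordinates are made up).
boxes = [(100, 120, 200, 220), (300, 50, 380, 400)]
crop_rect, shifted = crop_to_objects((800, 800), boxes)
scaled = scale_boxes(shifted, crop_rect, tile_size=(140, 175))
print(crop_rect, shifted, scaled)
```

Because the crop is derived from the boxes themselves, no object can be cut through, which is the property that the random crop of the original Mosaic operation lacks.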
Fig. 5 Comparison of calculation process of BN and GN.
Fig. 6 Improved Mosaic operation splices and fuses images in a batch on a minimum circumscribed rectangle.
Owing to the lack of feature information learned by pretrained models, making full use of the learning data is particularly important in networks trained from scratch. The improved data augmentation method in this section effectively expands the training dataset, enabling the network to learn the object information of cross-modal semantics and enhancing the generalization ability and robustness of the object detection network.
This section defines the loss function for SRS-Net training. The loss constraint $L_{\text{total}}$ of the SRS-Net training process includes three parts: the regression loss $L_{\text{reg}}$ for adjusting the anchor box positioning, the classification loss $L_{\text{cla}}$ for judging the object category of the anchor box, and the confidence loss $L_{\text{conf}}$ for the network. One of the difficulties faced in remote sensing object detection is that it is hard to obtain accurate object boundaries through the conventional Intersection-over-Union (IoU) loss, which is disadvantageous for training from scratch. Thus, it is crucial to use a more efficient IoU loss function during SRS-Net training to increase the regression accuracy of the object bounding box.
Given the influence of remote sensing object feature sampling on the regression loss function, SRS-Net adopts the unified power regression loss function $L_{\alpha\text{-IoU}}$ of Ref. 44 as $L_{\text{reg}}$. The loss function uses the dynamic parameter $\alpha$ to modulate the IoU, and has two advantages in training remote sensing object detectors from scratch: it can adaptively learn to revise the gradient weight in line with the morphological features of the object, and it is more robust to anchor boxes containing noise. For the $k$-th anchor box that matches a real object at position $(i, j)$, the unified power regression loss $L_{\alpha\text{-IoU}}$ can be calculated as follows:
where $s$ is the scale of the anchor box; $\text{Box}_{gt}$ and $\text{Box}_{pred}$ represent the ground-truth box and the predicted box, respectively; and $\text{IoU}^{\alpha}$ represents the use of the dynamic parameter $\alpha$ to modulate the IoU calculation process. In SRS-Net, three anchor boxes are generated near each possible object, and there are three scales.
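The basic power form of this loss is easy to sketch for a single pair of boxes. The version below uses $1 - \text{IoU}^{\alpha}$, which is the simplest member of the α-IoU family; whether SRS-Net adds further penalty terms, and how the loss is summed over positions, anchors, and scales, is not recoverable from the text and is left out. The function names and the `(x1, y1, x2, y2)` box format are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def alpha_iou_loss(box_pred, box_gt, alpha=3.0):
    """Basic power IoU loss, 1 - IoU**alpha, for one predicted/ground-truth pair."""
    return 1.0 - iou(box_pred, box_gt) ** alpha


gt = (0.0, 0.0, 2.0, 2.0)
pred = (1.0, 1.0, 3.0, 3.0)
print(iou(pred, gt))                        # overlap of 1 over a union of 7
print(alpha_iou_loss(pred, gt, alpha=1.0))  # reduces to the ordinary IoU loss
print(alpha_iou_loss(pred, gt, alpha=3.0))
```

With $\alpha > 1$, the loss and its gradient are re-weighted in favour of boxes that already overlap well, which is the adaptive gradient weighting mentioned in the text.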
Suppose $(x, y, w, h)$ represents the center point coordinates and side lengths of a bounding box, $G$ is the object ground-truth box, and $P$ is the predicted bounding box. Then the regression equation for the position update of the anchor box is as follows:
where $m_x$, $m_y$, $m_w$ and $m_h$ are the relative differences between the ground-truth box $G$ and the anchor box, and $m_x^*$, $m_y^*$, $m_w^*$ and $m_h^*$ are the relative differences between the predicted box $P$ and the anchor box. The calculated offsets guide the update of the bounding box coordinates.
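One common way to define such relative differences is the R-CNN style parameterization, sketched below. This is an assumption: the text does not state which parameterization SRS-Net uses, and YOLO-family detectors often use a different one. The function names are hypothetical, and boxes are given as `(x, y, w, h)` with `(x, y)` the center.

```python
from math import exp, log


def encode_offsets(box, anchor):
    """Relative differences (m_x, m_y, m_w, m_h) between a box and an anchor.

    Assumed R-CNN style parameterization: center shifts are normalized by
    the anchor size, and size ratios are taken in log space.
    """
    x, y, w, h = box
    ax, ay, aw, ah = anchor
    return ((x - ax) / aw, (y - ay) / ah, log(w / aw), log(h / ah))


def decode_offsets(offsets, anchor):
    """Invert encode_offsets: recover a box from offsets and the anchor."""
    mx, my, mw, mh = offsets
    ax, ay, aw, ah = anchor
    return (ax + mx * aw, ay + my * ah, aw * exp(mw), ah * exp(mh))


anchor = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 60.0, 40.0, 20.0)
m = encode_offsets(ground_truth, anchor)      # regression target
recovered = decode_offsets(m, anchor)         # round trip back to the box
print(m, recovered)
```

In this scheme the network regresses the starred offsets of the predicted box toward the offsets of the ground-truth box, and decoding the prediction with the same anchor yields the updated bounding box.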
Therefore, for the $k$-th anchor box that matches the real object at position $(i, j)$, the total loss $L_{\text{total}}$ of SRS-Net training can be summarized as follows:
where the hyper-parameters $\lambda_{\text{cls}}$, $\lambda_{\text{box}}$ and $\lambda_{\text{conf}}$ are the weights that control the multi-task loss value, and the classification loss $L_{\text{cla}}$ and confidence loss $L_{\text{conf}}$ use the binary cross-entropy calculation.
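A weighted sum of the three parts can be sketched as follows. The pairing of $\lambda_{\text{box}}$ with the regression loss and $\lambda_{\text{cls}}$ with the classification loss is an assumption inferred from the subscripts, and the default weights 0.5, 0.5 and 1 are the values reported in the experimental settings; the summation over positions and anchors is omitted.

```python
from math import log


def binary_cross_entropy(target, prob, eps=1e-12):
    """Binary cross-entropy for one target in {0, 1} and one predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # keep log() finite
    return -(target * log(prob) + (1 - target) * log(1.0 - prob))


def total_loss(l_reg, l_cla, l_conf,
               lambda_box=0.5, lambda_cls=0.5, lambda_conf=1.0):
    """Weighted multi-task loss for one matched anchor box (assumed form)."""
    return lambda_box * l_reg + lambda_cls * l_cla + lambda_conf * l_conf


# Made-up values for one matched anchor.
l_reg = 0.2                                 # e.g. an alpha-IoU regression loss
l_cla = binary_cross_entropy(1, 0.8)        # true class predicted with p = 0.8
l_conf = binary_cross_entropy(1, 0.9)       # objectness predicted with p = 0.9
print(total_loss(l_reg, l_cla, l_conf))
```

The weights set the relative influence of localization, classification, and objectness on the gradient, so they act as a simple balancing knob between the three tasks.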
To objectively test and evaluate the remote sensing object detection network SRS-Net trained from scratch, we selected two public datasets, DOTA¹¹ and DIOR,¹² and the combined dataset Airplanes and Ships (AS), for experiments. Specifically, we aimed to evaluate whether the distribution of the object aspect ratio affects the performance of the anchor-box-based network model. As there are many object categories in the public datasets, and the distribution of the aspect ratio is relatively balanced, the robustness of the model to extreme aspect ratios cannot be tested on them. Therefore, we assembled a remote sensing image dataset with only airplanes and ships, which restricts the object aspect ratio to the extreme cases (the airplanes are approximately square objects, and the ships are the objects with the largest aspect ratio). This dataset is called the AS dataset.
(1) DIOR dataset. The DIOR dataset covers different imaging conditions, including weather, seasons and other environmental factors. The dataset has 23463 images of 800×800 pixels and 192472 object instances, including 11725 training and validation set images and 11738 test set images, covering 20 object categories. The dataset has three characteristics: a variety of images and object categories, a large range of object size variations, and numerous environmental factors. Owing to these particularities, researchers usually choose it for algorithm performance testing. Thus, the DIOR dataset can be used as a large-scale object detection benchmark.
(2) DOTA dataset. DOTA1.0 is also a public, large-scale remote sensing image dataset, including 2806 satellite or aerial images of about 800–4000 pixels and 188282 object instances, which are divided into 15 categories. The imaging area of the DOTA dataset has a wide field of view, complex background information, large image size, and various instance orientations, which are crucial for training a robust object detector from scratch. The ratios of the training set, validation set, and test set are 1/2, 1/6, and 1/3, respectively. In the experiments, the DOTA1.0 training set was used for network training, and the DOTA1.0 validation set was used as the comparison standard.
(3) AS dataset. AS is a combined remote sensing dataset that only contains airplane and ship images. Owing to the variety of objects in the DIOR and DOTA datasets, the average object aspect ratio of the entire dataset is around 1.3, making the shape of most bounding boxes approximately square. To test the detector performance under drastic changes in aspect ratio, we selected 4816 airplane images and 4871 ship images from Google Earth, the TGRS-HRRSD dataset,⁴⁵ and the RSOD dataset,⁴⁶ which were combined into the AS dataset. In this dataset, the aspect ratio of the airplanes is approximately 1, the aspect ratio of the ships is approximately 4, and the numbers of the two object types are roughly equal. It can therefore simulate a situation where the aspect ratio changes drastically. This is necessary when comprehensively evaluating the performance of remote sensing object detectors.
The experiments mainly verified the effect of the proposed SRS-Net network trained from scratch.As this paper mainly focuses on training the network from scratch, some popular rotation detection methods are not within the scope of this paper.SRS-Net uses a horizontal bounding box to locate objects, which can simplify the network training process and reduce the calculation of regression parameters for the model.
The experiments used YOLOv5 as the baseline network, and the baseline used pretrained weights that had been trained on the MSCOCO dataset. The dynamic parameter $\alpha$ of the modulated IoU calculation in Eq. (7) was set to 3. The hyper-parameters $\lambda_{\text{cls}}$, $\lambda_{\text{box}}$ and $\lambda_{\text{conf}}$, which control the multi-task loss value in Eq. (9), were set to 0.5, 0.5, and 1, respectively. Furthermore, the number of training epochs was set to 100 and the initial learning rate was set to 0.01. All experiments were accelerated on two workstations with NVIDIA Titan X GPUs, using the SGD optimizer for gradient descent during training and the same system environment for training and testing to guarantee a fair comparison of all detection algorithms.
To quantify the effect of detection and make a unified comparison of different networks, we used four general evaluation indicators: recall, precision, Average Precision (AP), and mean Average Precision (mAP). The relevant equations for the calculation of the indicators are as follows:
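The equations themselves do not appear in this copy of the text. Assuming the standard PASCAL VOC definitions that the paper cites, they would take roughly the following form, where P(R) denotes precision as a function of recall and N (a symbol introduced here for illustration) is the number of object categories:

```latex
\begin{align}
\text{Recall} &= \frac{TP}{TP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{AP} &= \int_{0}^{1} P(R)\,\mathrm{d}R \\
\text{mAP} &= \frac{1}{N}\sum_{i=1}^{N} \text{AP}_i
\end{align}
```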
where recall and precision are defined by TP, TN, FP, and FN, which represent true positive, true negative, false positive, and false negative samples, respectively. AP represents the average precision of a given object category and is the area under the Precision–Recall (PR) curve. mAP denotes the mean of the average precision over all categories in the dataset. These evaluation metrics are consistent with the standard PASCAL VOC47 calculation scheme. In addition, unless otherwise specified, precision and average precision values are reported as percentages for simplicity.
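As an illustration of how AP follows from ranked detections, the sketch below computes the area under the PR curve with all-point interpolation, which is one of the PASCAL VOC conventions. It is a simplified sketch, not the authors' evaluation code: it assumes that each detection has already been matched to the ground truth and flagged as a true or false positive, and the function names are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one category.

    scores: confidence of each detection.
    is_true_positive: matching flags (True/False) for each detection.
    num_ground_truth: number of ground-truth objects of this category.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    # Rank detections by descending confidence.
    ranked = sorted(zip(scores, is_true_positive), key=lambda p: p[0], reverse=True)
    recalls, precisions = [], []
    tp = fp = 0
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Add sentinels, then make precision monotonically non-increasing.
    mrec = [0.0] + recalls + [1.0]
    mpre = [0.0] + precisions + [0.0]
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


def mean_average_precision(ap_per_class):
    """mAP is the unweighted mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Three detections, two ground-truth objects; the second detection is a false positive.
ap_example = average_precision([0.9, 0.8, 0.7], [True, False, True], 2)
```

In this toy case the false positive ranked second lowers the precision at full recall to 2/3, so the AP falls below 1 even though both objects are eventually found.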
4.3.1. Study of backbone DCCSP
This section explores the impact and contribution of the improved DCCSP on SRS-Net trained from scratch without using any training and testing tricks. Here, the YOLOv5 network without a pretrained model was used as the baseline, and the BN operation was used for data normalization (to avoid the impact of the GN operation on training). Because the densely connected structure is embedded in the DCCSP, the ablation study of the DCCSP backbone demonstrates the influence of densely connected structures on training the network from scratch. Table 1 summarizes the experimental results. We compared mAP, precision, and recall on the three datasets. Here, (n), (s), (m), and (l) denote network structures with different depths and widths, from small to large.
The comparison shows that on the DIOR, DOTA, and AS datasets, the (l) model trained with the improved backbone achieves mAP values of 68.0%, 73.8%, and 95.3%, respectively, which are higher than those of the detector without DCCSP by 2.4%, 2.7%, and 3.4%, respectively. In addition, using the DCCSP backbone to train the other three sizes of network models also improves detection accuracy. The improved DCCSP backbone is thus more effective than the original backbone network, and the experiments verify the effectiveness of the dense connection structure in training. In the absence of a pretrained model, the knowledge representation ability of the network can be enhanced to a certain extent through the densely connected backbone and supervised training. We also observed that detection performance improved as the number of layers of the selected backbone increased, which suggests that increasing the number of network layers enhances the nonlinear fitting ability.
Table 1 Ablation studies on backbone DCCSP under three datasets.
4.3.2. Studies for normalization
Eq.(5) and Eq.(6) are two normalization methods for deep networks, which guide the process of gradient optimization and weight construction.However, the scope of the two normalization applications is different.Owing to the particularity of the remote sensing scene, it is even more difficult to train the network from scratch.To eliminate the impact of lacking pretrained models on remote sensing object detection, we performed an ablation analysis on the normalization methods using the DIOR dataset.In fact, BN and GN operations have different degrees of sensitivity to the distribution of data; so, it is necessary to explore an adaptive normalization method when training from scratch.According to different batch sizes, four sets of comparison experiments of normalization operations were set up for SRS-Net, as shown in Table 2.
Table 2 Comparison of experiment results of different normalization operations.
Fig.7 Qualitative inference results of the proposed SRS-Net on DIOR dataset.
Table 3 Quantitative evaluation results of proposed SRS-Net and 12 advanced methods on DIOR dataset.
When the batch size is 4, the mAP of SRS-Net on the DIOR test set is 66.3% with the BN operation but 68.2% with the GN operation. In contrast, when the batch size is 64, the BN operation has a mAP comparable to that of GN. The BN operation is therefore clearly influenced by batch size, with its mAP varying by 1.7%, whereas the mAP of the GN operation varies by only 0.2% with batch size. This variation caused by normalization needs to be avoided in remote sensing object detection. In addition, when training an object detection network from scratch, the lack of prior feature information from a pretrained model makes it critical to adopt a larger batch size to stabilize the gradient descent process. Therefore, using the GN operation can better characterize the distribution of training and testing data and obtain better object detection performance.
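The different batch-size sensitivity of the two operations can be seen in a toy implementation. The sketch below follows the standard definitions of batch normalization and group normalization, without learnable scale and shift and with spatial dimensions omitted for brevity; it is an illustrative simplification and is not taken from Eq. (5) and Eq. (6) of the paper. All function names are hypothetical.

```python
EPS = 1e-5  # small constant for numerical stability


def _normalize(values):
    """Shift to zero mean and scale to unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + EPS) ** 0.5 for v in values]


def batch_norm(batch):
    """batch[n][c]: statistics are computed per channel ACROSS the batch."""
    num_channels = len(batch[0])
    columns = [_normalize([sample[c] for sample in batch]) for c in range(num_channels)]
    return [[columns[c][n] for c in range(num_channels)] for n in range(len(batch))]


def group_norm(batch, num_groups):
    """batch[n][c]: statistics are computed per sample WITHIN each channel group."""
    num_channels = len(batch[0])
    if num_channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    size = num_channels // num_groups
    out = []
    for sample in batch:
        normed = []
        for g in range(num_groups):
            normed.extend(_normalize(sample[g * size:(g + 1) * size]))
        out.append(normed)
    return out


# The same sample placed in two different batches.
sample = [1.0, 2.0, 3.0, 4.0]
batch_a = [sample, [10.0, 20.0, 30.0, 40.0]]
batch_b = [sample, [0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]]

gn_a, gn_b = group_norm(batch_a, 2)[0], group_norm(batch_b, 2)[0]
bn_a, bn_b = batch_norm(batch_a)[0], batch_norm(batch_b)[0]
```

Here `gn_a` and `gn_b` are identical because group normalization never looks at other samples, while `bn_a` and `bn_b` differ because the batch statistics change with batch composition and size.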
4.4.1. Results comparison on DIOR dataset
Qualitative inference results of the proposed SRS-Net trained from scratch on the DIOR dataset are shown in Fig. 7. Various categories of objects at different scales are depicted, as well as scenes with dense or sparse object arrangement. It can be seen that SRS-Net detects the vast majority of object instances accurately and assigns them the correct categories. Note that although the designed detector uses horizontal boxes, the object boundaries it determines are clear. Additionally, this paper focuses on training detectors from scratch, so rotating the bounding box is not within the scope of the discussion. For some scenes with dense object distributions, such as ships and storage tanks, SRS-Net is essentially not affected by the object distribution and shape. When the object size is small, for instance small vehicles, the object location detected by SRS-Net is still accurate. This is competitive with other detection algorithms. Overall, the detection results inferred by SRS-Net show good visual quality.
Table 3 presents the quantitative evaluation results for the designed SRS-Net and 12 advanced algorithms on the DIOR dataset. ‘*’ indicates multi-scale training and testing. Bold numbers indicate the maximum value of the category. As these methods were not specifically designed for embedded devices, only the inference speed on a Titan X GPU was evaluated. In addition, we reimplemented some methods, so the results may be slightly different from those reported in the original papers. Among them, the DSOD and ScratchDet detectors do not use pretrained models; they are also networks trained from scratch. The remaining 10 advanced detectors use ImageNet pretrained models for training and fine-tuning. The comparison shows that the mAP value of SRS-Net is 68.4%, which is comparable to the performance of YOLOv5 based on a pretrained model. With the introduction of multi-scale training and testing, SRS-Net has a mAP of 70.6%, surpassing the performance of all pretraining-based detectors. In challenging categories such as DF, ETS, and TC, SRS-Net also has the best detection results. As a one-stage method, SRS-Net has obvious advantages on remote sensing datasets compared with two-stage methods and the Transformer framework. Although SRS-Net does not achieve optimal performance in all categories, it has better overall detection ability and performance comparable to or better than that of algorithms based on pretrained models. Thus, SRS-Net is well suited to the remote sensing object detection field, effectively eliminates the impact of lacking pretrained models, and achieves satisfactory results.
Fig.8 Qualitative inference results of SRS-Net on DOTA dataset.
Fig.9 Relationship between precision and recall curves of SRS-Net and other six popular algorithms on DOTA.
In addition, Table 3 shows the comparison of inference results of different methods. The size of images input to the network is 800 × 800 pixels. The inference speed of SRS-Net on the DIOR dataset is 35.6 fps, which is the best among the compared methods. Furthermore, the size of the SRS-Net model is only 93.0 Mb, which is smaller than that of most detectors using pretrained models. We believe that such results are due to the dense backbone structure and stable gradient descent. In contrast, some pretraining-based methods do not have an advantage in inference speed and model size. We speculate that this is because of the large number of parameters and floating-point operations of these models. Thus, the proposed network has a better model structure and can avoid loading huge parameters in memory, which also realizes the domain adaptation learning of data.
Table 4 Quantitative evaluation results of proposed SRS-Net and 12 advanced methods on DOTA dataset.
Table 5 Quantitative evaluation results of proposed SRS-Net and 7 advanced methods on AS dataset.
4.4.2. Results comparison on DOTA dataset
To discuss this further, we used the DOTA dataset to perform a comparative evaluation of each algorithm. Fig. 8 shows SRS-Net’s qualitative inference results on the DOTA dataset. SRS-Net clearly achieves satisfactory detection results on large-scale remote sensing datasets. In some challenging scenes, such as ships with diverse orientations or densely distributed vehicles, SRS-Net is almost unaffected by the scene complexity. In addition, SRS-Net also performs well in multi-scale object detection. For example, when a roundabout and a small vehicle exist simultaneously, SRS-Net can learn multi-scale features between objects and perform multi-scale prediction. Fig. 9 shows the precision and recall curves of SRS-Net and six other popular algorithms. There is a trade-off between precision and recall, and a detector with good performance can maintain a high recall rate while maintaining a high precision. Overall, the proposed SRS-Net achieves good performance in most categories.
Table 4 shows the results on the DOTA dataset, including AP and mAP values for 15 categories. The comparison shows that the mAP of the proposed method is 74.1%, which is comparable to the performance of YOLOv5 with a pretrained model. Interestingly, the AP values of some classes using SRS-Net are slightly lower than those of other detectors, while the average detection performance of SRS-Net is slightly higher than or comparable to that of the other detectors. This indicates that the proposed detector has better overall detection performance across more data categories and that the gradient convergence in the training process is more stable. We attribute this to the dense structures in the backbone and the adjustment of normalization. The SRS-Net, DSOD, and ScratchDet detectors are all trained from scratch, but SRS-Net has large advantages on remote sensing datasets, indicating that the adaptation of SRS-Net to remote sensing object detection is effective. As SRS-Net is trained from scratch and lacks many of the primary features and much of the prior information in the neural network layers, it is reasonable that it performs slightly below YOLOv5 when multi-scale training and testing are not used, and it can still outperform other representative detection algorithms. Even with this shortcoming, this work still provides inspiration for the field of remote sensing, namely, how to train a higher-precision object detection network from scratch and explore its design principles. This is crucial and novel in many complex image domains.
Fig.11 Visualization of loss value and mAP change curves during training.
4.4.3. Results comparison on AS dataset
To comprehensively evaluate the performance of SRS-Net, this section uses the AS dataset to simulate a situation where the aspect ratio changes drastically. Compared with the DOTA and DIOR datasets, the object categories of the AS dataset are simple, but the aspect ratio changes greatly, which is a challenge for high-precision detection of remote sensing objects. Fig. 10 shows the qualitative inference results of SRS-Net. It can be seen that the detection of both types of objects is accurate, indicating that SRS-Net is robust to aspect ratio changes. In addition, the proposed method also shows good detection results when the illumination is insufficient. The detection of the airplane by SRS-Net is still accurate even when it is hidden in a poorly illuminated image.
Table 5 shows the comparison of the quantitative evaluation results with 7 advanced detection methods on the AS dataset. Note that SRS-Net can achieve the same performance as YOLOv5 using the pretrained model, indicating that the dense connection strategy of the backbone and the gradient optimization are effective design principles for training object detectors from scratch. In addition, SRS-Net surpasses the performance of most classical object detectors based on pretrained models and has higher detection accuracy. Because SRS-Net is constrained by the training data and the lack of prior knowledge, it cannot radically improve on the results of existing object detection algorithms. Despite this, SRS-Net contributes a useful idea to the remote sensing object detection field, that is, training from scratch after some necessary modifications.
To compare the training efficiency of the two training modes (pretraining-based and from scratch), as well as the relationship between training effect and data scale, we used the DOTA and DIOR datasets for experiments, and one-third of the images were randomly selected for training. Fig. 11 shows the loss and mAP curves for the training process.
Three conclusions can be drawn from the loss and mAP curves during training. First, methods based on pretraining converge faster during training. In contrast, the trained-from-scratch methods take longer to reach convergence and a steady state. This is because using a pretrained model is equivalent to giving the network initial values so that the network can quickly enter the fine-tuning stage. Second, when the network reaches a steady state, the method of training from scratch is more accurate. This is because training from scratch can better realize the domain adaptation learning of the data and can learn the feature information of remote sensing data better than the pretrained model. Third, because of the lack of prior information in the network parameters, the loss value of the training process is larger than that of the pretraining method, so overfitting can be avoided when training from scratch.
In addition, it is undeniable that from the perspective of training efficiency, the method of training from scratch also has some disadvantages, which can be summarized in the following two aspects. First, when the quantity of data is below a certain scale, the accuracy of the training-from-scratch method decreases significantly. This is because training from scratch is more dependent on data. When the scale of training data is small, the feature representation learned by the network is also weaker. Second, networks trained from scratch take longer to reach convergence, which means that it may take longer than with pretraining methods to train a similarly effective detector. However, judging from the test results in Sections 4.3 and 4.4, these disadvantages do not seem to affect performance in applications. We believe that this is due to the combined effect of a dense network structure, stable gradients, and sufficient data.
Remote sensing object detection is of great significance for the development of the remote sensing field and has been widely used in a large number of scenarios. This paper explored the network design principles of remote sensing object detectors so that they can be trained from scratch. We proposed a densely connected object detector, SRS-Net, a simple and effective network for training from scratch. In this work, a densely connected DCCSP backbone was also designed to provide supervision for most convolution modules. The role of normalization in the network structure was investigated, and data augmentation was improved and customized for remote sensing scenes. Because it is pretraining-free, SRS-Net can eliminate the limitations of pretrained models in the domain of remote sensing data applications. SRS-Net demonstrated a performance comparable to or better than popular detectors such as YOLOv5, CenterNet, Swin Transformer, Mask R-CNN, and SSD on the DIOR, DOTA, and AS datasets. As training from scratch is scalable and flexible, this paper provides a new perspective on object detection training in remote sensing, which is particularly important for exploring the role of learning data and training methods in deep neural networks. Our future work will consider lightweight improvements of training from scratch to support its deployment in embedded devices.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This research was supported by the Natural Science Foundation of China (No. 61906213).
Chinese Journal of Aeronautics, 2023, Issue 8