Seung-Yeon Hwang and Jeong-Joon Kim
1 Department of Computer Engineering, Anyang University, Anyang-si, 14028, Korea
2 Department of ICT Convergence Engineering, Anyang University, Anyang-si, 14028, Korea
Abstract: Artificial intelligence, which has recently emerged with the rapid development of information technology, is drawing attention as a tool for solving various problems demanded by society and industry. In particular, convolutional neural networks (CNNs), a type of deep learning technology, are highlighted in computer vision fields such as image classification, recognition, and object tracking. Training these CNN models requires a large amount of data, and a lack of data can lead to performance degradation caused by overfitting. As CNN architecture development and optimization studies have become active, ensemble techniques have emerged that perform image classification by combining features extracted from multiple CNN models. In this study, data augmentation and contour image extraction were performed to overcome the data shortage problem. In addition, we propose a hierarchical ensemble technique to achieve high image classification accuracy even when trained on a small amount of data. First, we trained pretrained VGGNet, GoogLeNet, ResNet, DenseNet, and EfficientNet models on the UC-Merced land use dataset and the contour images extracted from each image. We then applied the hierarchical ensemble technique to all possible arrangements of these models. These experiments were performed with training dataset proportions of 30%, 50%, and 70%, resulting in a performance improvement of up to 4.68% over the average accuracy of all individual models.
Keywords: Image classification; deep learning; CNNs; hierarchical ensemble; UC-Merced land use dataset; contour image
Recently, demand for and use cases of big data have been increasing in various industries, and artificial intelligence, owing to its rapid development, has drawn attention as a tool to solve various problems demanded by society and industry. In particular, research on land-use classification in the remote sensing field is actively underway, and related high-resolution spatial image data are readily available. In addition, innovative advances in computer vision technology create a variety of opportunities to improve remote sensing image recognition and classification performance, facilitating the development of new approaches. The classification of land use in remote sensing images is critical for analyzing, managing, and monitoring how people use land. Furthermore, deep learning-based classification techniques are needed to analyze environmental pollution, natural disasters, and land changes caused by human development, taking into account the complex geographical characteristics of the images and their diversity under different weather conditions.
Traditional image recognition techniques have used methods such as SIFT [1], HOG [2], and BoVW [3] to extract numerous features from images and search for the locations of objects using the extracted features. However, these methods are sensitive to shape transformations and lighting conditions, so their classification accuracy on new images is low.
CNNs, which have recently been studied in various ways, were first introduced in [4], where filtering techniques were applied to artificial neural networks to process images more effectively; the CNN architecture currently used in the deep learning field was later proposed in [5]. AlexNet [6], which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) held in 2012, is still widely used in image recognition and classification, object detection, and tracking, and shows excellent performance [7]. VGGNet [8], GoogLeNet [9], ResNet [10], DenseNet [11], and EfficientNet [12] have more recently achieved superior performance in image recognition and classification.
Additionally, in recent years, ensemble techniques have emerged that complement the performance of a single CNN architecture by harmoniously combining the predictions of multiple well-designed deep learning architectures when performing image classification. Several models are trained, and a combination of their predicted results is used to obtain more accurate predictions. Research on applying deep learning ensemble techniques in various fields has been actively conducted [13-17].
In this study, we used transfer learning strategies to overcome overfitting and improve the convergence rate of the models. Using a pretrained CNN architecture is the preferred method in several existing studies [18-21]. The UC-Merced land use dataset [22] was used to evaluate the experimental performance. Six scenarios were set according to whether contour images were included in training and the ratio of training data, and each scenario was trained individually on five pretrained CNN architectures with excellent performance. We then performed hierarchical ensembles for all arrangements in which the five trained models can be deployed. The method proposed in this study is named a hierarchical ensemble because the output values of the previous layer are used as inputs for the next layer. The three contributions of this study are summarized as follows.
• We demonstrate that learning contour images is an important factor in improving the performance of CNN architectures for remote sensing image classification.
• We propose a hierarchical ensemble technique to complement the poor performance of a single model and achieve higher image classification accuracy.
• Through contour image learning and hierarchical ensembles, we achieve excellent image classification performance even when CNN models are trained with a small amount of data.
The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the dataset used in this study and the experimental method in detail. Section 4 presents the experimental results and analyses. Section 5 concludes this paper and suggests directions for future research.
As shown in Fig. 1, a CNN uses convolutional layers to extract feature maps from input images and pooling layers to convert local regions of each feature map into a single representative scalar value. The derived feature maps are then used to perform the classification of the image.
CNNs can generally be formulated as shown in Eq. (1), where the $j$-th output feature map of layer $\ell$ is obtained by convolving a selection of input maps with learned kernels, adding a bias, and applying a nonlinearity $f$:

$$x_j^{\ell} = f\left(\sum_{i \in M_j} x_i^{\ell-1} * k_{ij}^{\ell} + b_j^{\ell}\right) \tag{1}$$

Here, $M_j$ represents the selection of input maps. If output maps $j$ and $k$ both sum over input map $i$, then the kernels applied to map $i$ are different for output maps $j$ and $k$ [23]. The CNNs used in this study are described as follows.
VGGNet is a model developed by the University of Oxford Visual Geometry Group that won second place in the 2014 ILSVRC competition. The VGG research team proposed six models to compare the performance of neural networks according to depth. Among them, VGG16, with 13 convolutional layers and three fully connected layers, and VGG19, with 16 convolutional layers and three fully connected layers, are the most widely used. In this work, we used VGG19; the architectures of VGG16 and VGG19 are shown in Figs. 2 and 3.
Figure 2: VGG16 architecture
Figure 3: VGG19 architecture
GoogLeNet is a neural network developed by Google that won the 2014 ILSVRC competition, beating VGGNet. Although it consists of 22 layers, deeper than VGG19, it uses 1 × 1 convolutional layers to dramatically reduce the computation performed on the feature maps. In addition, unlike conventional neural networks that use fully connected layers for image classification, it uses global average pooling, removing the large number of weights that such classification layers would otherwise require. Furthermore, unlike existing neural networks that perform convolutional operations with a single kernel size in each layer, GoogLeNet introduces an inception module that passes the results of convolutions with kernels of different sizes to the next layer, allowing it to extract more diverse features. Fig. 4 shows a visualization of the structure of the inception module in GoogLeNet.
Figure 4: Inception module in GoogLeNet
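As an illustration of the structure in Fig. 4, the following is a minimal sketch of an inception-style module using the Keras functional API; the filter counts are illustrative values, not those of any particular GoogLeNet stage.

```python
# Minimal sketch of an inception-style module (illustrative filter counts).
from tensorflow.keras import layers

def inception_module(x, f1=64, f3_reduce=96, f3=128, f5_reduce=16, f5=32, fpool=32):
    # Branch 1: 1x1 convolution
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # Branch 2: 1x1 reduction followed by 3x3 convolution
    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)
    # Branch 3: 1x1 reduction followed by 5x5 convolution
    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)
    # Branch 4: 3x3 max pooling followed by 1x1 projection
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(fpool, 1, padding="same", activation="relu")(b4)
    # Concatenate all branches along the channel axis
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])
```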
ResNet is a neural network developed by Microsoft that won the ILSVRC competition in 2015. Because simply designing deeper neural networks does not guarantee better performance, ResNet introduced a residual block with a shortcut connection that adds the input of the block to its output. ResNet variants with 18, 34, 50, 101, and 152 layers have been proposed, and ResNet-152, with 152 layers, has the best performance. In this work, we use the 101-layer ResNet-101 model; the residual block and ResNet architecture are visualized in Figs. 5 and 6.
Figure 5: Residual block
Figure 6: ResNet architecture
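For reference, a minimal sketch of the residual block in Fig. 5, assuming Keras; it assumes the input already has the same number of channels as the block so that an identity shortcut can be used.

```python
# Minimal sketch of a basic residual block with an identity shortcut.
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                   # identity shortcut (input of the block)
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Add the shortcut (input value) to the block output, then apply the activation
    y = layers.Add()([shortcut, y])
    return layers.Activation("relu")(y)
```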
Unlike residual blocks, DenseNet does not add input values to output values. Instead, it uses dense blocks that concatenate the input and output values. Because DenseNet connects the feature maps of all layers, the information flow is improved, preventing information loss as feature maps pass through the layers of the network, and the vanishing gradient problem is also reduced. DenseNet variants composed of 121, 169, 201, and 264 layers have been proposed; in this study, DenseNet-121, composed of 121 layers, was used. Fig. 7 shows a visualization of the dense block.
Figure 7: Dense block
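For reference, a minimal sketch of the dense block in Fig. 7, assuming Keras; the number of layers and the growth rate are illustrative values.

```python
# Minimal sketch of a dense block: each layer's output is concatenated
# with all previous feature maps instead of being added to them.
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        # Concatenate the new feature maps with all previous ones
        x = layers.Concatenate(axis=-1)([x, y])
    return x
```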
Many previous CNN architectures have evolved by scaling models to achieve higher accuracy within limited resources. ResNet is a representative model of depth scaling, which increases the number of layers, while representative models of width scaling, which increases the number of kernels, include MobileNet [24] and ShuffleNet [25]. Before EfficientNet appeared, there were no guidelines for selecting an appropriate scaling technique, and it was difficult to verify such choices experimentally. EfficientNet enables efficient and effective model scaling by using a compound scaling technique that scales the model by balancing the depth, width, and resolution of the CNN. Eight EfficientNet models were proposed according to the scaling method; in this study, EfficientNetB0, the baseline model of EfficientNet, is used. Fig. 8 shows a schematic comparing the model scaling techniques.
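For reference, the compound scaling rule introduced in the EfficientNet paper [12] can be summarized as follows, where $\phi$ is a user-chosen compound coefficient and $\alpha$, $\beta$, and $\gamma$ are constants determined by a small grid search:

$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}, \qquad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,\; \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1$$

Here, $d$, $w$, and $r$ denote the depth, width, and resolution scaling factors, respectively.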
In this section, we first describe the dataset obtained for this experiment and then describe the scenarios and specific steps of each experiment.
The dataset used in this study, the UC-Merced land use dataset, is a remote sensing image dataset for land use. It consists of 21 classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tank, and tennis court. It contains a total of 2100 images, with 100 images per class, and most of the images have a resolution of 256 × 256 pixels. Therefore, in this study, the 44 images with different dimensions were resized to 256 × 256 pixels. Fig. 9 shows sample images for the 21 classes.
Figure 8: Model scaling methods
Figure 9: Sample images of the UC-Merced land use dataset for 21 classes
In addition, to obtain more training data, each image was rotated, shifted, zoomed, and flipped, and the contour image for each image was extracted to increase the size of the training dataset. Algorithms for edge extraction include Sobel [26], Prewitt [27], Roberts [28], and Canny edge detection [29]. In this study, the Canny edge detection algorithm was used. Fig. 10 shows the five steps of Canny edge detection and the images extracted at each step. First, because edge detection on the raw input image is very sensitive to noise, the noise is removed using a Gaussian kernel. In the second step, the intensity and direction of the edges are detected by calculating the gradient to find the points at which pixel values change rapidly. The image extracted at this stage still contains non-edge parts; therefore, in step 3, non-maximum suppression is performed to remove them. In step 4, maximum and minimum thresholds are set to identify strong and weak edge pixels. Finally, in step 5, for pixels between the maximum and minimum thresholds, the connectivity of each pixel determines whether it is an edge.
Figure 10: Detailed steps for contour extraction using Canny edge detection
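As a concrete illustration of the augmentation and contour extraction described above, the following is a minimal sketch assuming OpenCV and Keras; the Gaussian kernel size, Canny thresholds, and augmentation ranges are illustrative values, not necessarily those used in this study.

```python
# Minimal sketch of contour-image extraction (Canny) and data augmentation.
import cv2
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def extract_contour(image_path, low_thr=100, high_thr=200):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Step 1: suppress noise with a Gaussian kernel
    blurred = cv2.GaussianBlur(img, (5, 5), 0)
    # Steps 2-5: gradient computation, non-maximum suppression, and
    # double-threshold hysteresis are performed inside cv2.Canny
    return cv2.Canny(blurred, low_thr, high_thr)

# Augmentation: rotation, shift, zoom, and flips (mirroring)
augmenter = ImageDataGenerator(rotation_range=30,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               zoom_range=0.2,
                               horizontal_flip=True,
                               vertical_flip=True)
```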
Fig. 11 shows sample contour images extracted from the original images; this edge information can also be used efficiently for GIS data analysis.
Figure 11: Sample contour images from the UC-Merced land use dataset for 21 classes
The proposed method consists of five steps. First, for the training of VGGNet, GoogLeNet, ResNet, DenseNet, and EfficientNet, six scenarios were set according to whether contour images were included in training and the ratio of training data; detailed information on each scenario is presented in Tab. 1. When the training of each model was completed for each scenario, the hierarchical ensemble was performed by appropriately arranging the five models. The number of arrangements of the five models in one scenario was 120 (5! permutations), giving a total of 720 cases across the six scenarios. The steps of the proposed study are presented in Tab. 2.
Table 1: Six scenarios for the proposed method
Table 2: Five steps of the proposed method
In this study, to overcome overfitting and reduce the training time, models pretrained on the ImageNet dataset were used via transfer learning, and each model was trained on a GPU. In addition, to measure the performance improvement of hierarchical ensembles according to the ratio of training data, the number of training epochs for each model was set to 50 and the learning rate was fixed at 0.0001. The Adam [30] optimizer was used to enable each model to converge quickly. Fig. 12 schematically shows the method proposed in this study, and the procedure of the hierarchical ensemble is shown in Algorithm 1.
Algorithm 1: Hierarchical ensemble method
1. Input: images in test dataset I, labels in test dataset L, CNN model list M
2. cnt ← initialize number of tests to 0
3. averageAcc ← initialize test accuracy to 0
4. For i = 0 to n − 1 (n = the number of images in I)
5.   A ← average of the prediction results of M[0] and M[1] for I[i]
6.   B ← average of the prediction results of M[1] and M[2] for I[i]
7.   C ← average of the prediction results of M[2] and M[3] for I[i]
8.   D ← average of the prediction results of M[3] and M[4] for I[i]
9.   E ← average of A and B
10.  F ← average of B and C
11.  G ← average of C and D
12.  H ← average of E and F
13.  I′ ← average of F and G
14.  J ← average of H and I′
15.  averageAcc ← averageAcc + accuracy of the final prediction J against label L[i]
16.  cnt ← cnt + 1
17. End For
18. averageAcc ← averageAcc / cnt
19. Output: average accuracy of the hierarchical ensemble
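A minimal Python sketch of Algorithm 1, vectorized over the whole test set, is shown below; it assumes each trained model exposes a Keras-style predict() that returns class-probability vectors.

```python
# Minimal sketch of the hierarchical ensemble in Algorithm 1.
# models: the five trained CNN models in their deployed order (M[0] ... M[4]).
# images: test images; labels: integer class labels of the test images.
import numpy as np

def hierarchical_ensemble_accuracy(models, images, labels):
    p = [m.predict(images) for m in models]          # per-model class probabilities
    # Layer 1: averages of adjacent model pairs
    a, b, c, d = (p[0] + p[1]) / 2, (p[1] + p[2]) / 2, (p[2] + p[3]) / 2, (p[3] + p[4]) / 2
    # Layer 2
    e, f, g = (a + b) / 2, (b + c) / 2, (c + d) / 2
    # Layer 3
    h, i_ = (e + f) / 2, (f + g) / 2
    # Layer 4: final averaged prediction
    j = (h + i_) / 2
    # Average accuracy of the final prediction over the test set
    return np.mean(np.argmax(j, axis=1) == labels)
```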
Figure 12: Diagram of the proposed method
The experiments in this study were performed on a 64-bit Windows 10 Education operating system and were implemented using the Python programming language and the Keras library. The PC used in the experiments was equipped with an NVIDIA GeForce RTX 2070 SUPER 8 GB GPU, an AMD Ryzen 7 3700X 8-core CPU, and 32 GB of RAM.
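As an illustration of this setup, the following is a minimal sketch of the transfer-learning configuration for one of the five models, assuming the Keras applications API; the classifier head (global average pooling followed by a 21-way softmax) is an assumption for illustration, not necessarily the exact head used in this study.

```python
# Minimal sketch of the transfer-learning setup: an ImageNet-pretrained backbone,
# Adam with a fixed learning rate of 0.0001, and 50 training epochs.
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG19

base = VGG19(weights="imagenet", include_top=False, input_shape=(256, 256, 3))
x = layers.GlobalAveragePooling2D()(base.output)
outputs = layers.Dense(21, activation="softmax")(x)   # 21 UC-Merced classes
model = models.Model(base.input, outputs)

model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=50,
#           validation_data=(val_images, val_labels))
```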
Fig. 13 shows the classification accuracy of each model trained with contour images among the six scenarios proposed in this study, and Fig. 14 shows the classification accuracy of each model trained without contour images. For the individual models, those trained with contour images achieved higher accuracy, and the higher the proportion of training data, the higher the accuracy. Therefore, we confirmed that learning contour images is significant for improving the classification performance of the models. In addition, the ratio of training data affected the performance of each model. When contour images were learned, GoogLeNet performed best when the proportion of training data was 70%, whereas it ranked third when it was 30%. ResNet ranked third when the proportion of training data was 70%, whereas it performed best when it was 30%. VGGNet performed worst when the ratio of training data was 70% or 50%, whereas at 30% its classification accuracy was slightly better than that of EfficientNet. Overall, every model performed best when the ratio of training data was high, but no single model was consistently superior across all ratios.
Figure 13: Classification accuracy of each model trained on datasets containing contour images
The hierarchical ensemble architecture in Fig. 12 shows that the model at the center has the greatest influence. Therefore, the best-performing model is placed at the center. The second-best-performing model is then placed in the second position and the third-best model in the fourth position. Finally, the two worst-performing models are placed in the first and fifth positions. We judged that performing hierarchical ensembles with this arrangement would yield the best performance. The results of performing the hierarchical ensemble over all arrangements in which the trained models can be deployed demonstrate the validity of this hypothesis. The experimental method for enumerating all arrangements is presented in Algorithm 2.
Figure 14: Classification accuracy of each model trained on the original datasets
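The placement heuristic described above can be sketched as follows; the function assumes each trained model is paired with its individual classification accuracy.

```python
# Minimal sketch of the placement heuristic: best model at the center (3rd position),
# 2nd best at the 2nd position, 3rd best at the 4th position, and the two weakest
# models at the 1st and 5th positions.
def arrange_models(models_with_acc):
    # models_with_acc: list of (model, accuracy) pairs for the five trained models
    ranked = [m for m, _ in sorted(models_with_acc, key=lambda t: t[1], reverse=True)]
    best, second, third, fourth, fifth = ranked
    return [fourth, second, best, third, fifth]
```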
Algorithm 2: Hierarchical ensembles over all arrangements
1. Input: images in test dataset I, labels in test dataset L, CNN model list M
2. For each permutation of M
3.   Hierarchical ensemble (I, L, M) of Algorithm 1
4. End For
5. Sort the cases in descending order of accuracy
6. Get the case with the highest accuracy
7. Output: best case of the hierarchical ensemble
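A minimal sketch of Algorithm 2 is shown below, reusing the hierarchical_ensemble_accuracy function from the Algorithm 1 sketch above.

```python
# Minimal sketch of Algorithm 2: evaluate the hierarchical ensemble for every
# permutation of the five models and keep the best-performing arrangement.
from itertools import permutations

def best_hierarchical_ensemble(models, images, labels):
    results = []
    for arrangement in permutations(models):              # 5! = 120 arrangements
        acc = hierarchical_ensemble_accuracy(list(arrangement), images, labels)
        results.append((acc, arrangement))
    results.sort(key=lambda t: t[0], reverse=True)        # sort descending by accuracy
    return results[0]                                       # case with the highest accuracy
```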
Tab. 3 shows the results of the hierarchical ensemble experiment for each scenario, and Fig. 15 shows a visual representation of these results. As a result of performing the hierarchical ensemble, the performance improved in all scenarios, and higher accuracy was achieved when contour images were learned than when they were not. Comparing the average accuracy of all models per scenario with the accuracy of the hierarchical ensemble, the accuracy improvement was not significant when the ratio of training data was high; however, the lowest ratio of training data (Scenario 6) achieved a performance improvement of up to 4.68%. Therefore, the problem of insufficient data can be mitigated by using the hierarchical ensemble when the data required for deep learning model training are scarce.
Tab. 4 presents the performance comparison of the basic and hierarchical ensembles. Fig. 16 shows the performance comparison of the basic ensemble and the hierarchical ensemble when contour images are trained, and Fig. 17 visualizes the comparison when contour images are not trained. The basic ensemble trains the same five CNN models as the hierarchical ensemble. To perform the basic ensemble, we derived the predicted values of each model and used the average of these predictions to calculate the classification accuracy. Although the basic ensemble also yielded some performance improvements, we found that hierarchical ensembles had an average performance improvement of 1.84% over basic ensembles and achieved an improvement of up to 2.41% when the training data had the lowest ratio (Scenario 6).
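For comparison, the basic ensemble described above can be sketched as a simple unweighted average of the five models' class-probability outputs.

```python
# Minimal sketch of the basic ensemble: equally weighted average of the
# class-probability predictions of all five models.
import numpy as np

def basic_ensemble_accuracy(models, images, labels):
    avg_pred = np.mean([m.predict(images) for m in models], axis=0)
    return np.mean(np.argmax(avg_pred, axis=1) == labels)
```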
Table 3: Hierarchical ensemble results
Figure 15: Visualization of hierarchical ensemble results
Table 4: Performance comparison of basic and hierarchical ensembles
Figure 16: Performance comparison of basic and hierarchical ensembles trained on datasets containing contour images
Figure 17: Performance comparison of basic and hierarchical ensembles trained on the original datasets
In this study, six scenarios were set according to the ratio of training data and whether contour images were trained, and the land use data were individually trained on well-known pretrained neural networks: VGGNet, GoogLeNet, ResNet, DenseNet, and EfficientNet. A hierarchical ensemble was performed for all arrangements in which the trained models could be deployed. As a result, we not only confirmed that contour images are a significant factor in the performance of deep learning models but also achieved high classification accuracy with small amounts of training data through hierarchical ensembles. Because the purpose of the proposed method is to demonstrate the importance of contour image learning and to improve classification performance through hierarchical ensembles, further optimization studies, such as hyperparameter tuning of each CNN model, are required to achieve the best accuracy. In future studies, we expect to identify and extract the most significant features among those extracted from each model and to further improve model training speed and accuracy through combinations of multiple models.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding this study.