Jiawei NIU, Zhunga LIU, Quan PAN, Yanbo YANG, Yang LI
Department of Automation, Northwestern Polytechnical University, Xi’an 710072, China
KEYWORDS: Classification; Generative adversarial network; Imbalanced data; Optimization; Over-sampling
Abstract: Imbalanced data classification is an important research topic in real-world applications, such as fault diagnosis in an aircraft manufacturing system. The over-sampling method is often used to solve this problem; it generates samples according to the distance between minority data. However, traditional over-sampling methods may change the original data distribution, which is harmful to classification performance. In this paper, we propose a new method called Conditional Self-Attention Generative Adversarial Network with Differential Evolution (CSAGAN-DE) for imbalanced data classification. The new method aims to improve the classification performance on minority data by enhancing the quality of the generated minority data. In CSAGAN-DE, the minority data are fed into a self-attention generative adversarial network to approximate the data distribution and create new data for the minority class. Then, the differential evolution algorithm is employed to automatically determine the number of generated minority data so as to achieve satisfactory classification performance. Several experiments are conducted to evaluate the performance of the new CSAGAN-DE method. The results show that the new method can efficiently improve classification performance compared with other related methods.
1. Introduction

In pattern classification [1-4], different data types may have an imbalanced distribution in some applications. A class with fewer data in the dataset is known as a positive class or minority class, whereas a class with more examples is called a negative class or majority class. In addition, the minority class is often more important than the majority class, and the cost of misclassifying it is higher in reality. Therefore, the minority data receive more attention in terms of classification accuracy. Imbalanced data classification is encountered in many applications, such as disease diagnosis [5], face recognition [6], risk management [7], and fault diagnosis [8]. Consequently, improving the classification accuracy on imbalanced data has attracted extensive interest from many researchers.
The imbalanced data classification framework includes three categories: the data preprocessing level [9-12], the feature selection level [13], and the classification algorithm level [14,15].
At the data preprocessing level, the imbalance ratio is reduced by changing the training data distribution to solve the problem of sparse data. This level consists of three basic techniques: the under-sampling method, the over-sampling method, and the hybrid-sampling method. As shown in previous work [16], each method has unique advantages in dealing with imbalanced data; to make full use of the complementary information among different methods, they can be combined appropriately at the decision level. In Ref. [17], a pairwise learning-based BPA determination model is proposed: it divides the original complex problem into many pairs of simple binary subproblems to achieve better decision boundaries in complicated scenarios. A Complex Evidential Quantum Dynamical (CEQD) model [18] has been proposed to explain and predict interference effects on human decision-making behavior. At the feature selection level, features with good discriminating performance are chosen to improve the classification accuracy of the minority class. At the classification algorithm level, general classification methods are modified to incorporate the characteristics of imbalanced data and thus solve the imbalance problem.
In this work, the classification problem is addressed at the data preprocessing level. The over-sampling technique is used to redistribute the data to reduce the imbalance ratio of the dataset. In an over-sampling method, examples are generated for the minority class to balance the sample ratio. A simple method known as the Random Over-Sampling (ROS) technique generates instances by randomly copying minority data to balance the data distribution. However, it may increase the risk of overfitting, because it duplicates samples repeatedly. In Ref. [19], new artificial minority instances are generated along the line segments joining minority data. However, this may create noisy data because the boundaries between the majority and minority classes are not clear.
Various modifications of the Synthetic Minority Over-sampling Technique (SMOTE) have been proposed to deal with the imbalance problem. Safe-Level SMOTE [20] applies an adjusted weight degree to determine the location of the generated data. In Borderline-SMOTE [21], the minority samples are divided into three sets: Noise, Danger, and Safe. For each object in the minority class, the nearest neighbors are computed over the whole dataset. The Noise set contains the objects whose nearest neighbors are all majority instances; the Danger set contains the objects for which more than half of the nearest neighbors are majority data; the Safe set contains the objects whose nearest neighbors are mostly minority data. Only the Danger set is used by SMOTE to generate minority data, as the sketch below illustrates. In Ref. [22], the hard-to-learn informative minority data are recognized, and their weights are assigned according to the Euclidean distance from the nearest majority data; the minority examples are then generated using a clustering approach from the weighted informative minority data.
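To make the three-set partition concrete, here is a minimal sketch (our own illustration, not the original authors' code) that labels each minority sample by counting majority-class nearest neighbors with scikit-learn; the function name, k value, and thresholds are assumptions consistent with the description above:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def borderline_sets(X_min, X_maj, k=5):
    """Split minority samples into Noise / Danger / Safe sets (Borderline-SMOTE style)."""
    X_all = np.vstack([X_min, X_maj])
    y_all = np.hstack([np.ones(len(X_min)), np.zeros(len(X_maj))])  # 1 = minority
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    sets = {"noise": [], "danger": [], "safe": []}
    for i, x in enumerate(X_min):
        # first neighbor is the sample itself (assuming no duplicate points), so drop it
        idx = nn.kneighbors(x.reshape(1, -1), return_distance=False)[0][1:]
        n_maj = int((y_all[idx] == 0).sum())
        if n_maj == k:              # all neighbors are majority
            sets["noise"].append(i)
        elif n_maj > k // 2:        # more than half are majority
            sets["danger"].append(i)
        else:                       # mostly minority neighbors
            sets["safe"].append(i)
    return sets
```

Only the indices collected in `sets["danger"]` would then be passed to SMOTE for generation.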
The Kernel-ADASYN method [23] constructs an adaptive over-sampling distribution estimated with kernel density estimation, weighting different minority data according to their difficulty level when creating new minority data. In Ref. [24], a K-Nearest Neighbors-based (KNN-based) minority data generation method is proposed: synthetic data are created randomly in the directions of a sample's k nearest neighbors, and a weight is assigned to each neighbor direction to avoid over-generalization, so that safer neighbor directions are more likely to be chosen for generation. A Deep Over-Sampling (DOS) method has been proposed by Ando and Huang [25] to improve the classification accuracy of the minority class by resampling minority data in the feature space. In Ref. [26], a conditional generative adversarial network is used to approximate the true data distribution and generate minority data, which can effectively mitigate the imbalance characteristic of the data.
However, these over-sampling methods have some limitations. For instance, the ROS technique repeats data randomly, which may cause overfitting. The SMOTE method generates samples based on the distance between minority data, which changes the original data distribution. To deal with these issues, we propose a method based on the Self-Attention Generative Adversarial Network (SAGAN) to learn the data distribution. Unlike the standard GAN model, which uses local spatial points to generate data, the SAGAN model creates data using complementary information at all feature locations. The learned model produces synthetic data for the minority class (i.e., the generator serves as an over-sampling method).
Moreover, traditional over-sampling methods attempt to generate as many samples as there are majority data, enforcing a 1:1 ratio between the minority and majority classes. However, this may impose some restrictions: some studies have pointed out that good classification results can be achieved with a 1:1 ratio, but this ratio is not necessarily the best [27,28]. To overcome this limitation of a fixed or random sampling ratio, the Differential Evolution (DE) algorithm is used to automatically determine the sampling ratio in the over-sampling technique, yielding stable and satisfactory classification performance.
Our research concentrates on binary imbalanced data classification, because a multi-class problem can usually be treated as multiple binary classification problems. Improving the classification accuracy of binary imbalanced data is thus the main focus in solving imbalanced data classification problems.
The remainder of this paper is organized as follows. Section 2 presents our proposed method in detail. In Section 3, practical applications are shown to test the performance of our method, and conclusions are provided in Section 4.
2. Proposed method

Traditional over-sampling techniques generate data according to the distance between minority data, which may change the original data distribution and cover up the real one. Since the self-attention generative adversarial network concentrates on the global features of the data, the distribution learned by the network can approximate the real data distribution; although it is not the real distribution, samples can be generated well from it. Furthermore, most over-sampling methods create data with random sampling ratios, which causes unstable and undesirable classification results. The differential evolution algorithm is a target optimization method based on genetic evolution that searches for the optimal solution in the data space. Overall, this study addresses these problems as follows: a new over-sampling method (CSAGAN) is proposed to improve the generation quality of sampling methods by learning the data distribution, and a differential evolution technique is utilized to determine the over-sampling ratio automatically, which overcomes the randomness limitation of sampling methods and improves the classification performance on imbalanced data.
The method consists of the following procedures. Initially, the minority data are sent to the self-attention generative adversarial network to learn the distribution, and the optimized model is obtained after training the network. Additionally, the F1-score of the minority class is set as the objective function value of the differential evolution algorithm; this value is recalculated after the mutation, crossover, and selection operators. Finally, the optimal objective function value and the corresponding sampling ratio are obtained when the loop terminates.
2.1. Conditional self-attention generative adversarial network

In this section, a new method named Conditional Self-Attention Generative Adversarial Network (CSAGAN) is proposed to generate minority data from the learned distribution. We begin with a brief introduction to the Conditional Generative Adversarial Network (CGAN) and the Self-Attention Generative Adversarial Network (SAGAN).
The generative adversarial network (GAN) [29] was proposed by Goodfellow et al. and is inspired by a two-player game. A generative model G captures the data distribution and produces fake instances from noise z. A discriminative model D estimates the probability that an instance came from the real dataset rather than from G. However, a limitation of GAN is that the modes of the generated data cannot be controlled. If some supervised information is added to the model, it can direct the data generation process. The Conditional Generative Adversarial Network (CGAN) [30] extends the GAN framework: additional information c is fed into both the generator and the discriminator. Since c can be any kind of auxiliary information, we set it as the class label.
In the CGAN framework, the generator and the discriminator are multilayer perceptrons, and the loss function is defined by

$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim P_{\mathrm{data}}}[\log D(x|c)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z|c)))] \qquad (1)$$

where $x \in X$ represents the real data sampled from the data distribution $P_{\mathrm{data}}$, and $z \in Z$ is a random noise vector sampled from the noise distribution $P_z$. The expectation $\mathbb{E}$ is taken to find the best value. The additional information c is a class label coming from the training dataset in our research. G(z|c) denotes the data produced by the generator, D(x|c) is the probability that the input data are real, and D(G(z|c)) is the probability the discriminator assigns to the generated data being real.
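As a hedged illustration of how the conditioning in Eq. (1) is typically realized, the following generic PyTorch sketch concatenates an embedded class label with the noise (for G) and with the input (for D); the layer sizes and class names are placeholders, not the authors' architecture:

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """G(z|c): the class label c is embedded and concatenated with the noise z."""
    def __init__(self, noise_dim=128, n_classes=2, img_dim=32 * 32):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + n_classes, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh(),
        )

    def forward(self, z, c):
        return self.net(torch.cat([z, self.embed(c)], dim=1))

class CondDiscriminator(nn.Module):
    """D(x|c): the flattened input x and the label c are joined in a hidden representation."""
    def __init__(self, n_classes=2, img_dim=32 * 32):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(img_dim + n_classes, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x, c):
        return self.net(torch.cat([x, self.embed(c)], dim=1))
```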
Convolutional neural networks [31] are used by generative models to deal with image datasets. However, the size of each convolution kernel is small, usually no more than 5. A limitation of these methods is that the receptive field depends on the kernel size, so only a small local region contributes to each output when computing the convolution. For example, GAN can generate simple geometric images such as sea and mountains, but it may have trouble producing images with fine, specific geometry, such as horses or sheep.
While the receptive field can be increased by multi-layer convolution and pooling operations, multiple layers of mapping are then needed to capture a feature, and the receptive field of the convolution kernel at any particular layer is still not sufficient by itself. If global information could be obtained in some layers, complementary information would be carried to the subsequent layers, so capturing long-distance features is beneficial. However, enlarging the kernels so that the network captures more of the image at once would sacrifice the computational efficiency provided by small filters. In summary, obtaining an adequate receptive field while maintaining computational efficiency is a significant problem.
SAGAN [32] adds a self-attention mechanism into the GAN framework to balance efficiency against an expanded receptive field. The self-attention module is complementary to convolution on feature maps: it uses complementary features at all feature locations rather than fixed local areas, so global geometric features can be obtained directly by calculating the relationship between any two pixels in the image. The structure of the self-attention module is shown in Fig. 1.
The convolutional feature map x is obtained from a hidden layer of the convolutional neural network. Two feature spaces $f = W_f x$ and $g = W_g x$ are then computed with 1×1 convolution kernels, and the attention is calculated as

$$\gamma_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N}\exp(s_{ij})}, \quad s_{ij} = f(x_i)^{\mathrm{T}} g(x_j) \qquad (2)$$

where $\gamma_{j,i}$ denotes the attention paid to the ith location when synthesizing the jth location, and N is the number of feature locations.

The attention map is obtained by this softmax normalization. Moreover, another 1×1 convolution kernel gives a feature space $h = W_h x$. The output of the attention layer is obtained from the attention map $\gamma_{j,i}$ and the feature space h as follows:

$$o_j = \sum_{i=1}^{N} \gamma_{j,i}\, h(x_i) \qquad (3)$$

In the above equations, the three weight matrices $W_f$, $W_g$, $W_h$, implemented as 1×1 convolution kernels, are learned during training. The final output $y_i$ is then obtained by combining the output of the attention layer $o_i$ with the input $x_i$:

$$y_i = \delta o_i + x_i \qquad (4)$$

where δ is a learnable scalar that we initialize to zero, allowing the model to start from simple local features and gradually learn to weight the global features.
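A minimal PyTorch sketch of the attention module in Eqs. (2)-(4) follows; the channel-reduction factor of 8 follows common SAGAN implementations and is an assumption here, not a detail given in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """SAGAN-style self-attention over a convolutional feature map."""
    def __init__(self, in_channels):
        super().__init__()
        # 1x1 convolutions implementing the weight matrices W_f, W_g, W_h
        self.f = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)
        self.g = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)
        self.h = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        # learnable scalar delta from Eq. (4), initialized to zero
        self.delta = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, height, width = x.size()
        n = height * width
        f = self.f(x).view(b, -1, n)   # B x C' x N
        g = self.g(x).view(b, -1, n)   # B x C' x N
        h = self.h(x).view(b, c, n)    # B x C  x N
        # s_ij = f(x_i)^T g(x_j); softmax over i gives the attention map of Eq. (2)
        attn = F.softmax(torch.bmm(f.transpose(1, 2), g), dim=1)  # B x N x N
        # o_j = sum_i gamma_{j,i} h(x_i), Eq. (3)
        o = torch.bmm(h, attn).view(b, c, height, width)
        # y_i = delta * o_i + x_i, Eq. (4)
        return self.delta * o + x
```

Initializing `delta` at zero means the block starts as an identity mapping and the attention contribution grows only as training finds it useful.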
The self-attention network is good at modeling multi-level dependencies across image regions, and it is applied to both the generator and the discriminator in SAGAN. The generator can then produce images in which the details at each location are carefully coordinated with details in distant parts of the image, while the discriminator can enforce complicated geometric constraints on the global image. The loss function is the hinge adversarial loss

$$L_D = -\mathbb{E}_{x \sim P_{\mathrm{data}}}[\min(0, -1 + D(x))] - \mathbb{E}_{z \sim P_z}[\min(0, -1 - D(G(z)))]$$
$$L_G = -\mathbb{E}_{z \sim P_z}[D(G(z))] \qquad (5)$$
Fig. 1 Structure of the self-attention module.
If the generator and the discriminator are conditioned on auxiliary information c, as in the CGAN formulation, then SAGAN can be extended to the Conditional SAGAN (CSAGAN). The additional information c (the class label) and the noise z are concatenated into the same representation for the generator, and the input data x and the additional information c are combined in a joint hidden representation for the discriminator.
Formally, the objective function of the CSAGAN framework is modified to

$$L_D = -\mathbb{E}_{x \sim P_{\mathrm{data}}}[\min(0, -1 + D(x|c))] - \mathbb{E}_{z \sim P_z}[\min(0, -1 - D(G(z|c)))]$$
$$L_G = -\mathbb{E}_{z \sim P_z}[D(G(z|c))] \qquad (6)$$
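A hedged sketch of the conditional hinge losses in Eq. (6), assuming the discriminator outputs raw (pre-activation) scores for batches of real data with labels and generated data:

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    # L_D: E[max(0, 1 - D(x|c))] + E[max(0, 1 + D(G(z|c)|c))],
    # which equals the -E[min(0, ...)] form of Eq. (6)
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    # L_G = -E[D(G(z|c)|c)]
    return -d_fake.mean()
```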
2.2. Automatic determination of over-sampling ratio

Most over-sampling methods generate as many minority data as there are majority data, which may cause unstable classification accuracy. To obtain excellent and stable classification performance, we propose a method that automatically determines the number of generated data.
The main idea of the over-sampling method is to change the data distribution by generating minority data until a 1:1 ratio with the majority data is reached. The new, balanced dataset is then fed into the classifier for learning, which can efficiently improve the classification accuracy of standard classification methods.
Definition 1 (Over-sampling Ratio). An over-sampling ratio is the proportion of minority data generated by the over-sampling method relative to the original imbalanced dataset.
In order to acquire satisfactory and stable classification performance with a standard classification algorithm, we propose an automatic determination method for the over-sampling ratio based on the differential evolution algorithm.
The Differential Evolution (DE) algorithm was proposed by Rainer Storn and Kenneth Price to deal with the Chebyshev polynomial fitting problem, and it has been widely applied since then [33]. DE is an efficient global optimization algorithm for finding the optimal solution in a multi-dimensional space. The set of possible solutions to the problem is regarded as a population, and each solution is an individual in the population. Our goal is to optimize the value of the fitness function over these individuals.
The basic procedure of our method is as follows. First, an initial population (a group of over-sampling ratios) is generated randomly, and individuals (over-sampling ratios) are selected from this population. Second, the mutation and crossover operators produce a new individual. Finally, the newly generated individual is compared with the corresponding individual in the population: if the fitness value (F1-score) of the new individual is better than that of the current individual, the current individual is replaced by the new one in the next generation; otherwise, the current individual is retained. In this way, good individuals are kept to search for the optimal solution. The primary process of automatically determining the over-sampling ratio based on the DE algorithm is shown in Fig. 2.
(1) Population Initialization: At the 0th generation, N over-sampling ratios are generated randomly in the solution space:

$$X_i(0) = X^{\mathrm{L}} + \mathrm{rand}(0,1) \cdot (X^{\mathrm{U}} - X^{\mathrm{L}}), \quad i = 1,2,\ldots,N \qquad (7)$$

where rand(0,1) is a uniformly distributed random value ranging from 0 to 1, and $X^{\mathrm{U}}$ and $X^{\mathrm{L}}$ are the upper and lower bounds of the over-sampling ratio.
Then, the initialized over-sampling ratios are evaluated by calculating the F1-score values of the minority data.
Next, the termination condition is checked; if it is not satisfied, the method continues with the following loop operations.
(2) Mutation: At the tth generation, three over-sampling ratios $X_{n_1}(t)$, $X_{n_2}(t)$, $X_{n_3}(t)$, with $n_1, n_2, n_3 \in [1, N]$ and $n_1 \neq n_2 \neq n_3 \neq i$, are randomly chosen from the population of over-sampling ratios. The mutated over-sampling ratio is produced by

$$H_i(t) = X_{n_1}(t) + MU \cdot (X_{n_2}(t) - X_{n_3}(t)) \qquad (8)$$

where MU is the mutation probability, acting as the scaling factor of the difference between individuals.

Fig. 2 Basic process of automatically determining the over-sampling ratio.

(3) Crossover: A trial over-sampling ratio is obtained by crossing the mutated ratio with the original one:

$$V_i(t) = \begin{cases} H_i(t), & \mathrm{rand}(0,1) \le CR \\ X_i(t), & \text{otherwise} \end{cases} \qquad (9)$$

where CR ∈ (0,1) represents the crossover probability; CR = 0.8 is chosen in our experiment.
(4) Selection: The better over-sampling ratio is selected between the trial one and the original one. The over-sampling ratios are evaluated by calculating their F1-score values, and the one with the larger value is kept as the optimal one:

$$X_i(t+1) = \begin{cases} V_i(t), & F_1(V_i(t)) > F_1(X_i(t)) \\ X_i(t), & \text{otherwise} \end{cases} \qquad (10)$$

Hence $X_i(t+1)$ is no worse than $X_i(t)$ for each over-sampling ratio after the mutation, crossover, and selection operations. A minimal sketch of this loop is given below.
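The loop of Eqs. (7)-(10) can be sketched as follows. Here `f1_of_ratio` stands for the full train-and-evaluate step (generate minority data at the given ratio with the learned CSAGAN, train the base classifier, return the minority F1-score) and is an assumed callback, as are the bound values:

```python
import numpy as np

def de_over_sampling_ratio(f1_of_ratio, n=10, t_max=30, mu=0.5, cr=0.8,
                           x_low=0.1, x_up=1.0, seed=0):
    """Minimal DE loop over scalar over-sampling ratios (a sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    # (1) initialization: X_i(0) = X_L + rand(0,1) * (X_U - X_L), Eq. (7)
    pop = x_low + rng.random(n) * (x_up - x_low)
    fitness = np.array([f1_of_ratio(x) for x in pop])
    for _ in range(t_max):
        for i in range(n):
            # (2) mutation: three distinct ratios other than i, Eq. (8)
            n1, n2, n3 = rng.choice([j for j in range(n) if j != i], 3, replace=False)
            h = np.clip(pop[n1] + mu * (pop[n2] - pop[n3]), x_low, x_up)
            # (3) crossover: take the mutant with probability CR, Eq. (9)
            u = h if rng.random() <= cr else pop[i]
            # (4) selection: keep whichever ratio yields the higher F1-score, Eq. (10)
            fu = f1_of_ratio(u)
            if fu > fitness[i]:
                pop[i], fitness[i] = u, fu
    best = fitness.argmax()
    return pop[best], fitness[best]
```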
Finally, the optimal over-sampling ratio and the corresponding F1-score value are output when the termination condition is met. For convenience of implementation, the CSAGAN-DE algorithm is outlined in Algorithm 1, and the flowchart is shown in Fig. 3.
Algorithm 1. CSAGAN-DE for imbalanced data classification.
Require: X: imbalanced dataset; Z: random noise vector; X^U: upper bound of the over-sampling ratio; X^L: lower bound of the over-sampling ratio; N: number of over-sampling ratios; T: maximum number of evolutions; MU: mutation probability; CR: crossover probability.
Step 1. Feed the imbalanced dataset into the CSAGAN to learn the data distribution by Eq. (6).
Step 2. t ← 0 (initialization).
Step 3. Generate a set of over-sampling ratios by Eq. (7).
Step 4. Select an over-sampling ratio from the set by Eq. (9).
Step 5. Generate the minority data from the learned CSAGAN model using the selected over-sampling ratio.
Step 6. Calculate the F1-score value of the minority data.
Step 7. While t < T, repeat the mutation, crossover, and selection operations of Eqs. (8)-(10), update the optimal F1-score value and over-sampling ratio, and set t ← t + 1.

Fig. 3 Basic process of the self-attention generative adversarial network with differential evolution approach.

2.3. Parameter optimization

Our proposed method involves three hyper-parameters: the population size N, the mutation probability MU, and the crossover probability CR.

The diversity of the population and the convergence rate are influenced by the parameter N. Increasing N helps the method converge to the optimal solution, but it also increases the running time; decreasing N improves the convergence speed but makes it easy to fall into local optima. To test the influence of N on the classification performance of our proposed method, single F1-score values for different values of N ∈ [5,10] are displayed in Fig. 4, where the x axis represents the value of N and the y axis the single F1-score of the minority data. We find that the F1-score values are not very sensitive to N, because multiple iterations guarantee the diversity of the population. We set N = 10 as the default value in the experiments.

Fig. 4 Single F1-score values with different population sizes.

The step size of individual differences in the population is determined by the mutation probability MU, which is usually set within [0.4, 0.9]. The search space of the method and the diversity of the population can be enlarged by increasing MU, but the convergence rate weakens when MU is too large; reducing MU improves the convergence speed but carries the risk of local optima. The global and local search abilities are balanced by the crossover probability CR, which is also usually set within [0.4, 0.9]. The diversity of the population is enhanced as CR increases, but the convergence speed of the method is reduced. We performed heuristic tests on the proposed method, shown in Fig. 5, where the x axis represents the values of MU and CR and the y axis the single F1-score of the minority data. The method usually performs well with MU = 0.5 and CR = 0.8.

Fig. 5 Single F1-score values with different mutation probabilities and crossover probabilities.

3. Experiment applications

3.1. Data sampling methods

In this section, we evaluate the performance of the proposed method on the binary classification of imbalanced data by comparing it with several other data sampling methods: the over-sampling methods [34] SMOTE [19] and Borderline-SMOTE [35]; the under-sampling methods [36] ClusterCentroids [37] and NearMiss [38]; and the hybrid-sampling methods [39] SmoteTomek [40] and SmoteEnn [41].

SMOTE [19] creates minority data by interpolating between preexisting minority samples: pairs of minority data are linked, and new data are generated on the connecting line in some proportion. Borderline-SMOTE [35] is an improvement of SMOTE in which the minority samples are divided into different types and only the samples on the class boundary are used to create new data. A sketch of applying these two over-sampling baselines follows.
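The following sketch uses the imbalanced-learn library, which provides the baseline samplers compared in this section; the toy data shapes and parameter values are assumptions for illustration:

```python
import numpy as np
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

# Toy imbalanced data: 900 majority (label 0) and 100 minority (label 1) samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (900, 2)), rng.normal(3, 1, (100, 2))])
y = np.hstack([np.zeros(900), np.ones(100)])

# SMOTE interpolates between a minority sample and one of its k nearest minority neighbors.
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

# Borderline-SMOTE restricts generation to minority samples in the Danger set.
X_bl, y_bl = BorderlineSMOTE(kind="borderline-1", random_state=0).fit_resample(X, y)

print(np.bincount(y_sm.astype(int)), np.bincount(y_bl.astype(int)))  # both balanced to 900/900
```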
For the under-sampling methods, NearMiss [38] selects a representative subset of the majority data for training; the majority data are chosen by heuristic rules controlled by its version parameter. ClusterCentroids [37] uses K-means to cluster the samples of each kind, and the cluster centers are treated as new data to reduce the majority data.

For the hybrid-sampling methods, SmoteTomek [40] uses the over-sampling approach SMOTE to generate data and the under-sampling method Tomek links to eliminate noise. SmoteEnn [41] uses the under-sampling method Edited Nearest Neighbours to clean the noise after over-sampling the data.

3.2. Base classifiers

In this experiment, the base classifiers can be chosen according to the practical application. The K-Nearest Neighbor classifier (KNN) [42,43], Support Vector Machine classifier (SVM) [44], Random Forest classifier (RF) [45], Decision Tree classifier (DT) [46], and Logistic Regression classifier (LR) [47] are used to show the effectiveness of the proposed method.

3.3. Benchmark datasets

Four image datasets, MNIST [48], Fashion-MNIST [49], Amazon [50], and HRSC2016 [51], are selected to test the performance of our new CSAGAN-DE method. Basic information on these four image datasets is shown in Table 1. The MNIST dataset consists of handwritten digits from different people in ten categories. The Fashion-MNIST dataset involves ten types of fashion products. The Office+Caltech10 dataset comprises four different domains, from which we select the Amazon part with ten categories. The HRSC2016 dataset is a remote sensing image dataset consisting of ships at sea and close inshore. These datasets are converted into binary datasets by randomly selecting one class as the minority data and combining the remaining classes as the majority samples, giving datasets with different Imbalance Ratios (IRs). In this work, all images are processed into 32×32 grayscale images to facilitate the comparison of experimental results.

Table 1 Basic information of the imbalanced datasets.

3.4. Parameter setting

The performance of CSAGAN-DE is compared with those of conventional data sampling methods such as SMOTE and with generation approaches based on GAN. The parameter settings of the data sampling methods are shown in Table 2. The generation methods (CGAN and CSAGAN) mostly share the same parameters: both use convolutional neural networks, with the dimension of the noise vector set to 128, the batch size to 16, and the number of epochs to 300. The base filter number is set to 32 and the learning rate to 0.0001. Unlike CGAN, CSAGAN uses spectral normalization in the generator and the discriminator; accordingly, the Adam optimizer is set with β1 = 0 and β2 = 0.9 for training, and the gradient penalty coefficient λ is set to 10.

Table 2 Parameter settings of the data sampling methods.

3.5. Performance evaluation metrics

The recall [52] and F1-score [53], which are commonly used in imbalanced data classification studies, are adopted to evaluate the classification performance on the minority class. The recall is the proportion of positive samples predicted correctly, and the F1-score is the harmonic mean of the precision and recall. The performance evaluation metrics are given by

$$\mathrm{recall} = \frac{TP}{TP + FN}, \quad \mathrm{precision} = \frac{TP}{TP + FP}, \quad F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where TP (True Positive) is the number of positive samples predicted as positive, FP (False Positive) the number of negative samples predicted as positive, TN (True Negative) the number of negative samples predicted as negative, and FN (False Negative) the number of positive samples predicted as negative.
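As a small self-contained check of these formulas (our own helper functions, not from the paper):

```python
def recall(tp, fn):
    # proportion of actual positives that are predicted positive
    return tp / (tp + fn)

def precision(tp, fp):
    # proportion of predicted positives that are actually positive
    return tp / (tp + fp)

def f1_score(tp, fp, fn):
    # harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# e.g. 40 of 50 minority samples found, with 10 false alarms:
print(recall(40, 10), precision(40, 10), round(f1_score(40, 10, 10), 3))  # 0.8 0.8 0.8
```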
3.6. Discussion

Unlike conventional over-sampling methods, which use distances between data to generate new samples, our proposed method learns the feature distribution with a neural network to generate data. The self-attention mechanism is added to the feature extraction to avoid the local receptive field of convolutional networks. In this way, the network can focus on global information efficiently and generate high-quality data. The generation results for the Fashion-MNIST, MNIST, Amazon, and HRSC2016 datasets are shown in Fig. 6, which demonstrates that our proposed CSAGAN-DE method can generate minority images of multiple styles.

Fig. 6 Image generation on different datasets.

Traditional over-sampling methods use a random sampling ratio to generate new data, so the classification performance is unstable. The differential evolution algorithm is utilized to automatically determine the best over-sampling ratio, which yields satisfactory and stable classification performance. Our study sets the single F1-score of the minority data as the objective function, and the optimized over-sampling ratio is obtained by maximizing this objective function. The results in Fig. 7 show that adding the differential evolution algorithm to optimize the over-sampling ratio achieves higher classification performance than the general methods in most cases.

The results in Tables 3-6 show the F1-scores of the minority data for eight data sampling approaches and the no-sampling (None) method on the Fashion-MNIST, MNIST, Amazon, and HRSC2016 datasets, with the maximum value marked in boldface. Moreover, the F1-score values and corresponding over-sampling ratios are presented in Table 7. The higher the F1-score, the better the classification performance of the conventional classification method. The CSAGAN-DE method clearly exhibits better performance than the other data sampling approaches and the method without sampling for most standard classifiers.

Table 4 F1-score of the MNIST dataset.
Table 5 F1-score of the Amazon dataset.
Table 6 F1-score of the HRSC2016 dataset.
Table 7 F1-score values of the CSAGAN-DE method with corresponding over-sampling ratios.

We also calculated the single recall and macro recall for the different data sampling methods. The results in Figs. 8 and 9 show that our method also achieves higher performance on these evaluation metrics in most cases.

In the SMOTE and Borderline-SMOTE methods, the distance relationship between samples is used to increase the minority data, which may change the original data distribution. In the NearMiss and ClusterCentroids approaches, majority data are deleted to balance the dataset, but this discards important information. Although the SmoteTomek and SmoteEnn methods combine over-sampling and under-sampling algorithms, they may not fully improve the classification performance. The CGAN method can learn the data distribution and then generate new samples; however, its convolutional network only concentrates on local information, limiting the diversity of the generated examples.
The self-attention mechanism added to our proposed CSAGAN-DE method can learn the data distribution while paying more attention to global information, so the learned model can generate high-quality and diverse data. In addition, we introduce the differential evolution algorithm to determine the over-sampling ratio automatically; this ratio controls the number of minority data generated. The F1-score of the minority data is set as the objective function, and the over-sampling ratios are treated as individuals. The over-sampling ratios undergo the mutation and crossover operations during the algorithm loop and are optimized by maximizing the objective function (the F1-score). This is why the CSAGAN-DE method outperforms the other methods.

Fig. 8 Single recall on different datasets.
Fig. 9 Macro recall on different datasets.

4. Conclusions

We have proposed a new method to generate minority data for imbalanced data classification. The presented CSAGAN-DE method generates minority data by learning the data distribution and automatically determines the over-sampling ratio to obtain stable classification performance. In CSAGAN-DE, the self-attention generative adversarial network learns global information from the dataset to generate minority data; the network thereby pays more attention to the critical information in the data, so high-quality samples can be generated. In addition, the differential evolution algorithm is employed to optimize the over-sampling ratio automatically by maximizing the objective function. The classification performance of the new CSAGAN-DE method has been tested with five base classifiers (KNN, SVM, RF, DT, LR) on four different datasets. The experimental results show that CSAGAN-DE can efficiently improve the F1-score of the minority data compared with other related data sampling approaches. However, when the number of minority data in the imbalanced dataset is too small, the CSAGAN model cannot capture the feature information of the minority data well; in such cases, the quality of the generated images may be low, which does not help improve the classification performance. In future work, we will attempt to develop a new deep-learning-based model to solve the imbalanced data classification problem with scarce minority data, and we will extend the application of our method to other scenarios such as SAR target recognition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was partially supported by the Aeronautical Science Foundation of China (No. 201920007001) and the National Natural Science Foundation of China (Nos. U20B2067, 61790552 and 61790554).