
    Oversampling Method Based on Gaussian Distribution and K-Means Clustering

Computers, Materials & Continua, 2021, Issue 10

Masoud Muhammed Hassan, Adel Sabry Eesa*, Ahmed Jameel Mohammed and Wahab Kh. Arabo

1 Department of Computer Science, University of Zakho, Duhok, 42001, Kurdistan Region, Iraq

2 Department of Information Technology, Duhok Polytechnic University, Duhok, 42001, Kurdistan Region, Iraq

Abstract: Learning from imbalanced data is one of the most challenging problems in binary classification, and it has gained increasing importance in recent years. When the class distribution is imbalanced, classical machine learning algorithms tend to move strongly towards the majority class and disregard the minority. The accuracy may therefore be high, but the model cannot recognize data instances in the minority class well enough to classify them, leading to many misclassifications. Different methods have been proposed in the literature to handle the imbalance problem, but most are complicated and tend to simulate unnecessary noise. In this paper, we propose a simple oversampling method based on the Multivariate Gaussian distribution and K-means clustering, called GK-Means. The new method aims to avoid generating noise and to control imbalances both between and within classes. Various experiments have been carried out with six classifiers and four oversampling methods. Experimental results on different imbalanced datasets show that the proposed GK-Means outperforms other oversampling methods and improves classification performance as measured by F1-score and Accuracy.

Keywords: Class imbalance; oversampling; Gaussian; multivariate distribution; k-means clustering

    1 Introduction

Machine learning methods aim to discover information from known empirical data, build suitable models of the phenomenon, and eventually use the model and the information learned to make predictions for future unknown data [1]. There are two main types of machine learning, namely supervised and unsupervised. In supervised learning, the dependent target class (outcome) is known in advance, and the main task of learning is to explore the relationship between the other independent features and the target class. In unsupervised learning, by contrast, the input data do not have any dependent target class, and the learning method is used to explore hidden patterns in the data [2].

Many different classification algorithms exist in the literature, such as Decision Tree (DT), Support Vector Machines (SVM), K-Nearest Neighbor (KNN), Naive Bayes (NB), Logistic Regression (LR), and Artificial Neural Networks (ANN). Many of these algorithms usually give high classification accuracy. However, such high prediction accuracy cannot be achieved in every problem [3]. One of the most important reasons for this is class imbalance in the distribution of data. The problem of imbalanced datasets occurs when the number of instances in one class (called the majority) is significantly higher than the number of instances in the other class (called the minority). In such problems, the data in the minority classes cannot be represented sufficiently to extract enough information compared to the majority classes. Furthermore, the prediction error is usually very high for imbalanced datasets. Traditional classification methods usually assume a relatively balanced class distribution in order to minimize the overall error rate on the entire training set. However, since the minority classes contribute little to the classification task, the classifier does not perform well on imbalanced data. To illustrate this problem, suppose a given dataset contains 100 samples, 90 labeled as class A and 10 as class B. Next, suppose a specific classifier predicted all data as class A. In this case, it would achieve an accuracy of 90% even though the classifier was not able to correctly classify any sample in class B. From this example, it is obvious how a classifier trained on imbalanced data might produce wrong predictions and hence wrong decisions [4].
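This accuracy paradox is easy to reproduce; the following is a minimal sketch using scikit-learn's metrics, with the 90/10 split and the all-majority "classifier" taken from the example above:

```python
from sklearn.metrics import accuracy_score, f1_score

# 90 samples of class A (0) and 10 of class B (1), as in the example above.
y_true = [0] * 90 + [1] * 10
# A degenerate "classifier" that predicts the majority class for every sample.
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))             # 0.9 -- looks good
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0 -- no class-B sample recognized
```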

In general, the approaches to handling the class imbalance problem are divided into two main types: algorithm-based and data-based approaches. Algorithm-based approaches aim to modify existing machine learning algorithms, or develop new ones, to address the class imbalance problem and improve classification performance. On the other hand, data-based approaches usually change the distribution of the imbalanced dataset to stabilize the class imbalance, and then give the resulting balanced dataset to the classifier to increase the detection rate of the minority class. An example of the data-based approaches is resampling, a popular solution that balances the class distribution of the dataset. Resampling methods are in turn categorized into two groups: over-sampling of the minority class and under-sampling of the majority class [5].

In this paper, we propose a new oversampling method for imbalanced data based on a Multivariate Gaussian distribution and K-means clustering (GK-means) to improve the prediction accuracy of all classes, particularly minority classes. The proposed method aims to achieve a balance between all classes by changing the distributions of the minority classes so that the size of their samples does not differ much from the majority classes. This is performed by generating new samples for the minority classes using the GK-means algorithm. The motivation behind proposing the GK-means method is to provide a simple method for oversampling. The reason for using the Gaussian distribution is to produce synthetic data that share an approximately similar probability distribution with the original minority data, while the reason for using the K-means clustering algorithm is to further divide the minority class into clusters that have the same properties and to extract more information from each cluster to find hidden patterns or groupings in the data.

The rest of the paper is organized as follows. Section 2 reviews some recent approaches to the class imbalance problem in general and oversampling in particular. Section 3 provides a brief background on current oversampling methods, along with general background on classification algorithms and model performance metrics. Section 4 presents the proposed method in detail. Section 5 exhibits the experimental results with discussions. Finally, Section 6 concludes the study and suggests some future work.

    2 Related Works

Different approaches exist in the literature for dealing with the class imbalance problem. This section reviews recent methods and techniques used for balancing data.

Kamalov [6] proposed a technique called Kernel Density Estimation (KDE) for oversampling imbalanced datasets. Here, the Gaussian function was used as the kernel in KDE due to its wide popularity in the literature. Experimental results showed that the proposed method can provide higher performance when compared to other related methods. The SVM, KNN and MLP algorithms were used for evaluating the classification performance on 14 various datasets.

Last et al. [7] used the K-means clustering algorithm with the SMOTE method for oversampling to cope with the imbalance problem. Their proposed method follows three main steps: clustering, filtering, and oversampling. Clustering was used to divide the dataset into groups based on the value of K using the K-means algorithm. Filtering was then used to select clusters for oversampling based on the minority class samples. Finally, the SMOTE method was applied for oversampling to balance the dataset. They claimed that their proposed method improved classification performance. They used the LR, KNN and Gradient Boosting Machine (GBM) algorithms in their investigation on 71 different datasets.

Santos et al. [8] applied the K-means clustering algorithm to deal with the noise problem before applying any synthetic sampling. First, they used the Gap statistic method to automatically select the number of clusters K, and hence used the K-means method to select centroids automatically instead of selecting them randomly. Second, they used SMOTE for oversampling the clusters. LR and ANN were used for evaluating the classification performance on a hepatocellular carcinoma (HCC) dataset.

Pereira [9] used the K-means clustering algorithm with several different oversampling methods: the random method, SMOTE, Borderline SMOTE, and G-SMOTE. The KNN, LR and DT algorithms were used for evaluating the classification performance on 13 datasets, and the Gradient Boosting Classifier (GBC) was used as an ensemble technique for classification. Their results showed that when the K-means algorithm was combined with an oversampling method, the performance of the models improved significantly.

Aditsania et al. [10] used the Adaptive Synthetic Sampling (ADASYN) method for oversampling with the ANN Backpropagation algorithm to examine the efficiency of the new method. Their experimental results showed that the ADASYN method can increase classification performance.

Mustafa et al. [11] proposed a new method for balancing data called FD_SMOTE (Farther Distance Based on SMOTE). Their method combines Principal Component Analysis (PCA) with the SMOTE method. The FD_SMOTE approach can increase the performance of classification models when used on a variety of biomedical data. The ANN, KNN, SVM, Bagging, Random Forest and Naïve Bayes algorithms were used for checking the classification performance on 3 medical datasets.

Zhang et al. [12] proposed a method for oversampling imbalanced datasets based on the Gaussian distribution. Their approach generates samples with the expected mean and variance of the minority class distribution. The NB, DT and ANN algorithms were used to examine the classification performance on 8 datasets. Their experimental results showed that the Gaussian-based method provided higher performance on most of the datasets used when compared with the SMOTE and Random methods.

From the literature survey discussed above, we have seen that the most popular method used for oversampling imbalanced datasets is SMOTE. However, this method has some drawbacks. First, it generates noise when creating synthetic data because it does not consider neighboring examples from other classes. Second, it is only beneficial with low-dimensional data and is not effective for high-dimensional data [7]. These two issues were addressed in the literature as follows: to solve the noise problem, the K-means algorithm is used, and feature selection algorithms are applied to address the high-dimensionality issue. Another disadvantage of the SMOTE method is that it is time-consuming, especially on large datasets, because it uses the nearest-neighbor samples during oversampling [13,14].

The main drawback of the ADASYN method is that it does not take care of noisy instances, and thus it is sensitive to outliers in the dataset. The problem with the Gaussian oversampling in [12] is that the Gaussian distribution was fitted only to the whole minority class, which is not suitable for the sample space of the class. This is because the mean can easily be affected by excessive outliers: if the data contain some very high or low values, the mean will not be a suitable central measurement for representing the data.

Having addressed the shortcomings of the currently available methods, we propose a new oversampling method, which is our main contribution in this study. The proposed method combines the Gaussian distribution with the K-means clustering algorithm. The minority class will be divided into K clusters. For each new cluster, a local empirical distribution will be defined as Gaussian, and the parameters of the underlying distribution will be estimated from the data in each cluster. Then the required number of samples will be generated for each cluster until the data are balanced. Before presenting the new method in Section 4, we briefly review the existing methods in the next section.

    3 Background

    This section reviews some existing well-known resampling methods and the model evaluation metrics.

    3.1 Resampling Methods

Resampling is one of the basic approaches to class imbalance learning. It is a data-level technique that directly adjusts the distribution of the data. Resampling either over-samples by adding new samples to the minority class, or under-samples by removing existing samples from the majority class [15]. In other words, over-sampling increases the number of samples in the minority class until the imbalance disappears, while under-sampling removes some instances of the majority class until the dataset is relatively stable. These methods directly replace the previous distributions of the majority and minority classes [16]. Three over-sampling methods are used in this paper for comparison with our proposed method. These methods are:

SMOTE: The Synthetic Minority Oversampling Technique is one of the most common and effective oversampling methods in many application domains [17,18]. It creates synthetic samples by analyzing the data of the existing minority class. The SMOTE method creates a synthetic sample as a linear combination of two samples from the minority class (X_i and X_j) as follows:

X_{new} = X_i + \alpha \, (X_j - X_i)

where X_{new} is a new artificial instance. For the new sample of the minority class (X_{new}), a sample X_i is selected randomly from the minority class. Then X_j is randomly chosen among the five minority-class nearest neighbors of X_i based on the Euclidean distance. \alpha is a random float number in the range [0,1]; a new synthetic sample of the minority class is thereby created [13].
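A minimal sketch of this interpolation step (illustrative only; the function name and the use of NumPy are ours, and a full SMOTE implementation repeats this step for every synthetic sample required):

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sample(X_min: np.ndarray, k: int = 5) -> np.ndarray:
    """Create one synthetic minority sample as a linear combination of a
    randomly chosen minority sample and one of its k nearest neighbors."""
    i = rng.integers(len(X_min))
    xi = X_min[i]
    # Euclidean distances from xi to all other minority samples.
    d = np.linalg.norm(X_min - xi, axis=1)
    d[i] = np.inf                     # exclude xi itself
    neighbors = np.argsort(d)[:k]     # indices of the k nearest neighbors
    xj = X_min[rng.choice(neighbors)]
    alpha = rng.random()              # random float in [0, 1)
    return xi + alpha * (xj - xi)     # X_new = X_i + alpha * (X_j - X_i)
```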

ADASYN: The Adaptive Synthetic Sampling approach is another oversampling method based on SMOTE. The main idea of this method is to produce minority class samples according to their distribution. ADASYN is similar to SMOTE and derived from it, but has one important difference: it biases the sampling distribution (that is, the probability of selecting any specific point for pairing) towards points that do not lie in homogeneous neighborhoods [19]. For more information about this method, readers are referred to [20].
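Both SMOTE and ADASYN are available in the imbalanced-learn package; a usage sketch, assuming imbalanced-learn is installed and using a synthetic 90/10 dataset purely for illustration:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

# A synthetic imbalanced binary dataset (roughly 90% / 10%) for illustration.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Each resampler returns a rebalanced copy of the data.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
X_adasyn, y_adasyn = ADASYN(random_state=0).fit_resample(X, y)
```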

GD: Gaussian Distribution sampling is another oversampling method, based on the normal distribution, which was used by [12] for dealing with imbalanced data. GD-Sampling generates samples based on the mean and variance of the minority class distribution without knowing the real distribution of the data, and the features are assumed to be independent of each other. The main advantage of GD-Sampling is that it creates synthetic data that share an almost similar probability distribution with the original minority data [12].
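A sketch of this kind of sampling under the stated independence assumption, with each feature drawn from its own univariate Gaussian whose parameters are estimated from the minority class (the function name is ours):

```python
import numpy as np

def gd_oversample(X_min: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic samples; features are treated as independent,
    each following N(mu_j, sigma_j^2) estimated from the minority class."""
    rng = np.random.default_rng(seed)
    mu = X_min.mean(axis=0)
    sigma = X_min.std(axis=0)
    # loc/scale broadcast per feature across the (n_new, n_features) output.
    return rng.normal(loc=mu, scale=sigma, size=(n_new, X_min.shape[1]))
```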

    3.2 Clustering

Clustering is one of the unsupervised learning approaches, which means that no information is given beforehand [21]. In general, it deals with finding a structure in a collection of uncategorized data [21]. Clustering is the problem of grouping a set of objects such that objects in the same cluster are more similar to each other than to objects in other clusters: similarity is high within a cluster and low between clusters. In this paper, K-means clustering is used, which is briefly described below.

K-means Clustering Algorithm: This is one of the most popular unsupervised learning methods [22], and is based on dividing samples into K sets. The essence of the algorithm is the idea that similar data are grouped in the same cluster. Similarity in the algorithm is determined by the distance between data points: if the distance between data points is low, similarity is high, and vice versa. K-means clustering is based on the Euclidean distance [22,23], with the following steps (a minimal sketch follows the list):

(a) The number of clusters, K, is selected.

(b) The cluster centers are initialized randomly.

(c) Samples are assigned to clusters according to their distances from the centers.

(d) New centers are determined according to the assignment made (i.e., the old centers shift to new centers).

(e) Steps (c) and (d) are repeated until stability is reached [24].
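A minimal NumPy sketch of these steps (illustrative only; empty-cluster handling is omitted):

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Minimal k-means following steps (a)-(e): random centers, Euclidean
    assignment, center update, repeat until centers stabilize."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step (b)
    for _ in range(n_iter):
        # Step (c): assign each sample to its nearest center (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step (d): shift each center to the mean of its assigned samples
        # (empty clusters are not handled in this sketch).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # step (e): stable
            break
        centers = new_centers
    return labels, centers
```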

    3.3 Probability Distribution

The numerical breakdown of the observation values held about any event is called a distribution [25]. The probability realization of these observation values is called the probability distribution [26]. Any probability distribution is either discrete or continuous. A discrete distribution means that the random variable can assume one of a countable number of values, whereas a continuous distribution means that the random variable can assume one of an uncountable (infinite) number of different values.

Gaussian Distribution: The Gaussian (or Normal) distribution is a continuous probability distribution. It is the most important distribution and very popular because the events observed in our daily lives are usually assumed to follow it. The Gaussian distribution is crucial and often used in nature, the pure sciences and the social sciences to represent real-valued random variables. There are many examples of random variables that follow this distribution, such as monthly returns of an investment, weights of manufactured products, heights, lengths, and IQ test results. It is also used as a basic distribution in statistical inference [12]. The general formula for its probability density function is as follows [27]:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

where \mu is the mean and \sigma^2 is the variance.
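As a quick check, the density can be evaluated from this formula directly or via scipy.stats, with matching results:

```python
import numpy as np
from scipy.stats import norm

mu, sigma, x = 0.0, 1.0, 1.5
manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
assert np.isclose(manual, norm.pdf(x, loc=mu, scale=sigma))
```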

    3.4 Model Evaluation Metrics

Once a machine learning algorithm has been applied, the next step is to measure the performance of the algorithm to see how efficient the model is for prediction. Different evaluation metrics are used to measure the performance of the classifiers. Accuracy is a classic measure that calculates the difference between the predicted model output and the actual class label. In general, high accuracy means a better classification model, but other evaluation metrics, such as Recall, Precision, and F1-score, are also important under class imbalance [29]. Tab. 1 shows these evaluation metrics, which are derived from the confusion matrix.
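With TP, TN, FP and FN denoting the true positives, true negatives, false positives and false negatives of the confusion matrix, the standard definitions of these metrics are:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\text{Precision} = \frac{TP}{TP + FP}, \quad
\text{Recall} = \frac{TP}{TP + FN}, \quad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}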

4 The Proposed Method (GK-Means)

Unlike other oversampling methods, which take the whole minority class as one group of data to be oversampled, our method further divides the data into different groups (clusters) based on the underlying groups that exist in the minority class. Our method aims to generate more realistic synthetic data using the Gaussian distribution in conjunction with K-means clustering to rebalance skewed datasets.

The main difference between the newly suggested method (GK-means) and previous methods is that GK-means is able to extract hidden patterns that exist in the minority class before creating any synthetic data. In other words, our method can detect unseen patterns in the original data and handle them by fitting an empirical distribution to each pattern, hence taking these different patterns into account in the oversampling process. Fig. 1 describes the main steps of the proposed method (GK-means), which consists of the following steps (a compact code sketch is given after the last step):

(a) Choose the minority class from the dataset to be oversampled. The minority class is selected for further investigation so that the whole dataset can be balanced prior to applying any classification algorithm.

(b) Pass the minority class to the K-means clustering algorithm to be divided into K groups (clusters). The proposed method further divides the minority class into different groups (clusters) based on the underlying groups and hidden features that exist in the minority class.

(c) Calculate \mu and \sigma^2 for each feature in each cluster. In order to sample from each group in the minority class (obtained from step (b)), we assume that each group is normally distributed. The parameters of the underlying distribution are then calculated using Eqs. (3) and (4), i.e., the sample mean and variance of each feature j over the n samples of the cluster:

\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij} \quad (3), \qquad
\sigma_j^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \mu_j)^2 \quad (4)

(d) Create a vector of means (\mu) and a variance-covariance matrix (\Sigma) for each cluster, as follows:

\mu = [\mu_1, \mu_2, \ldots, \mu_d]^T, \qquad \Sigma = \text{the } d \times d \text{ variance-covariance matrix of the features}

where, in the uncorrelated-features case, \Sigma reduces to a diagonal matrix with the variances \sigma_1^2, \ldots, \sigma_d^2 on its diagonal.

(e) Choose the ratio size that determines the number of samples to create. A number of candidate ratio sizes were generated randomly and used to tune the ratios of the minority classes numerically; each resulting model was assessed on the test dataset for classification, and the best-performing ratio was chosen.

(f) Identify the required number of samples in each cluster. Based on the ratio size chosen in the previous step, the number of samples required for balancing the data in each cluster is identified.

(g) Use the Gaussian distribution to create synthetic samples in each cluster using its underlying (\mu) and (\Sigma). In order to preserve the underlying distribution of the data in each cluster of the minority class, random samples are simulated from the Multivariate Gaussian distribution using the parameters defined in step (d).

(h) Combine the new synthetic data with the raw dataset. After randomly generating all the required samples for balancing the data in each cluster, the synthetic data are combined with the raw data to create a new balanced dataset.
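The following is a compact sketch of steps (a) through (h), assuming the uncorrelated-features variant (diagonal \Sigma) and scikit-learn's KMeans; the function name and defaults are ours, and the ratio search of step (e) is reduced to a fixed ratio argument for brevity:

```python
import numpy as np
from sklearn.cluster import KMeans

def gk_means_oversample(X, y, minority_label, ratio=1.0, k=2, seed=0):
    """GK-means-style oversampling sketch: cluster the minority class, fit a
    Gaussian per cluster, and draw synthetic samples from each cluster in
    proportion to its size. X and y are NumPy arrays."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]                                  # step (a)
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=seed).fit_predict(X_min)           # step (b)

    n_new = int(ratio * len(X_min))                                 # steps (e)-(f)
    synthetic = []
    for j in range(k):
        cluster = X_min[labels == j]
        n_j = round(n_new * len(cluster) / len(X_min))              # proportional share
        mu = cluster.mean(axis=0)                                   # step (c)
        cov = np.diag(cluster.var(axis=0))                          # step (d), diagonal Sigma
        synthetic.append(rng.multivariate_normal(mu, cov, size=n_j))  # step (g)

    X_syn = np.vstack(synthetic)                                    # step (h)
    X_bal = np.vstack([X, X_syn])
    y_bal = np.concatenate([y, np.full(len(X_syn), minority_label)])
    return X_bal, y_bal
```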

    5 Experimental Results and Discussions

In order to check the performance of the proposed method (GK-means), we applied it with different classification algorithms on six imbalanced datasets from different application areas. General characteristics of the six datasets are shown in Tab. 2. ZADA is a newly created dataset collected from the Shaker laboratory in Zakho city, Iraq. The other five datasets were collected from www.kaggle.com and are freely available. To further assess the efficiency of the proposed method, three other well-known oversampling methods, namely SMOTE, ADASYN and GD, were also applied and the results compared. All the experiments in this section were based on an oversampling ratio of 100% for each dataset. The six classifiers used in our investigation are KNN, DT, NB, LR, SVM, and ANN. Furthermore, for evaluating the performance of each method, two common metrics (Accuracy and F1) were used and compared. As the K-means clustering algorithm was used to cluster the minority class, different numbers of K were investigated, and based on trial-and-error tests we chose K = 2 for all the experiments that follow. Regarding the other parameters of the proposed method, namely (\mu_i and \sigma_i^2) for each group in each dataset, we estimated these parameters from the underlying distribution of each cluster i. For implementing and validating the performance of each method, a 10-fold cross-validation technique was used to guarantee the randomness of the experiments as well as to avoid any modeling issues with underfitting or overfitting. As an experimental setup, we used Python 3.8 for the implementations, with different libraries related to our methods, such as NumPy, Scikit-learn, Keras, and TensorFlow. All experiments were carried out on a MacBook with a Core i7 and 8 GB RAM.
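A sketch of this evaluation protocol, assuming scikit-learn; the oversampler is applied only to each fold's training split so that no synthetic samples leak into the test fold (all function names are ours):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

def evaluate(X, y, oversample, minority_label, n_splits=10, seed=0):
    """10-fold cross-validation; `oversample` is any (X, y) -> (X, y) resampler."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs, f1s = [], []
    for train, test in skf.split(X, y):
        # Oversample the training split only, then fit and score.
        X_tr, y_tr = oversample(X[train], y[train])
        clf = KNeighborsClassifier().fit(X_tr, y_tr)
        pred = clf.predict(X[test])
        accs.append(accuracy_score(y[test], pred))
        f1s.append(f1_score(y[test], pred, pos_label=minority_label))
    return np.mean(accs), np.mean(f1s)
```

For example, evaluate(X, y, lambda Xt, yt: gk_means_oversample(Xt, yt, minority_label=1), minority_label=1) would score the GK-means sketch given earlier.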

5.1 Experiment 1: Impact of K-Means Clustering

In order to show the impact of clustering on the minority class, K-means clustering was applied on the ZADA dataset to split the minority class into 3 clusters. Results in Tab. 3 show that when clustering the minority class, for some features there is a significant difference between the estimated parameters (mean and variance) from one cluster to another compared to the overall parameters of the minority class. For example, for the Cholesterol feature, the mean of the clusters varies from 169 to 281, and the variance also varies from 14.12 to 23.06. This indicates that even within the same (minority) class there can be additional groupings of the data, and this should be taken into account when oversampling. Therefore, clustering the minority class is crucial before applying an oversampling method.

5.2 Experiment 2: Testing the Performance of GK-Means with Correlated and Uncorrelated Features

In this experiment, the proposed GK-means oversampling method is applied on several datasets to create synthetic data in order to handle the class imbalance problem. Tabs. 4 and 5 show the performance of the six classification algorithms using GK-means on the six datasets in both cases (correlated and uncorrelated features) according to the F1 metric (see Tab. 4) and the Accuracy metric (see Tab. 5).

From the results in Tabs. 4 and 5, we can see that the performance of the classification algorithms with the GK-means method is higher when oversampling with uncorrelated features from the multivariate Gaussian distribution. According to the winning times for correlated and uncorrelated features, the uncorrelated synthetic data generated by GK-means performed better and provided higher Accuracy and F1 in almost all algorithms and datasets used.

From the experimental results discussed above, we can see that oversampling with uncorrelated features from the multivariate Gaussian distribution provided better and more accurate results. Therefore, we assume that the features are independent, and we chose the uncorrelated-features variant of the GK-means method to be compared with the SMOTE, ADASYN and GD oversampling methods, discussed next.

5.3 Experiment 3: Comparing the Performance of GK-Means with Other Oversampling Methods

Tabs. 6–11 and Figs. 2–7 show the results of our proposed method along with the three other oversampling methods on the six datasets (ZADA, Schierz_Bioassay, Vichle, Zoo-3, Bioassay and Pima) with the six classification algorithms, respectively. From these tables and figures, it can be seen clearly that classification with the original data (without resampling) obtains good performance according to the Accuracy measure but very poor results according to the F1 measure. This is because when data are imbalanced, the accuracy measure is confusing and unreliable, as it only calculates how many samples are correctly classified without taking into account the number of instances in the majority and minority classes. Unlike Accuracy, the F1 measure takes into account the number of samples correctly classified in both classes. This indicates that relying only on the Accuracy metric will be misleading and the results will not be reliable. This situation is clearly visible in all of Figs. 2–7. On the other hand, when the oversampling methods were used, the performance of the classifiers in terms of the F1 measure improved compared with the original data. This is because the number of samples in the minority class is increased and the data are balanced.

Table 1: Model evaluation metrics

Figure 1: Diagram of the proposed method (GK-Means)

Table 2: General characteristics of the six datasets

Table 3: Mean and variance of different clusters of the ZADA dataset

Table 4: Performance of the six classifiers based on GK-means for three imbalanced datasets using the F1 metric. Co.F and Uco.F indicate correlated and uncorrelated features

Table 5: Performance of the six classifiers based on GK-means for three imbalanced datasets using the Accuracy metric. Co.F and Uco.F indicate correlated and uncorrelated features

Figure 2: Accuracy and F1 measures for GK-means with the three other methods for the ZADA dataset

Figure 3: Accuracy and F1 measures for GK-means with the three other methods for the Schierz_bioassay dataset

Figure 4: Accuracy and F1 measures for GK-means with the three other methods for the Vichle dataset

Figure 5: Accuracy and F1 measures for GK-means with the three other methods for the Zoo-3 dataset

Figure 6: Accuracy and F1 measures for GK-means with the three other methods for the Bioassay dataset

Figure 7: Accuracy and F1 measures for GK-means with the three other methods for the Pima dataset

Looking at the results of the Pima dataset (Tab. 11 and Fig. 7) in more detail, we can see that the proposed method did not provide good classification results in terms of Accuracy and F1. The reason for such poor results on this dataset is that the GK-Means method is based on the assumption that the features are uncorrelated; however, Experiment 2 showed that, for this specific dataset only, the proposed method performed better when generating correlated features (see the results in Tabs. 4 and 5 for the Pima dataset). This indicates that the proposed method is sensitive to whether the data are correlated or not. Nevertheless, GK-Means is flexible enough to adapt itself to the nature of the data before applying oversampling: if the data are already correlated, GK-Means will generate correlated features from the Gaussian distribution, and vice versa.

Table 9: Classification performance of GK-means with three other methods for the Zoo-3 dataset

Table 6: Classification performance of GK-means with three other methods for the ZADA dataset

Table 7: Classification performance of GK-means with three other methods for the Schierz_bioassay dataset

Table 8: Classification performance of GK-means with three other methods for the Vichle dataset

Table 10: Classification performance of GK-means with three other methods for the Bioassay dataset

Classifier | Raw dataset ACC / F1 | SMOTE ACC / F1 | GD ACC / F1 | ADASYN ACC / F1 | GK-means ACC / F1
KNN | 97.07 / 30 | 99.44 / 99.47 | 99.49 / 99.47 | 95.27 / 95.19 | 99.55 / 99.47
DT | 94.34 / 16.66 | 96.44 / 96.28 | 97.88 / 97.83 | 97.38 / 97.35 | 98.94 / 98.99
NB | 97.07 / 30 | 99.44 / 99.47 | 78.22 / 81.06 | 69.44 / 76.74 | 99.44 / 99.47
LR | 95.45 / 0 | 99 / 99.04 | 92.16 / 92.28 | 81.11 / 80.42 | 97.47 / 99.47
SVM | 95.45 / 0 | 99.44 / 99.47 | 97.94 / 97.99 | 83.22 / 84.24 | 99.44 / 99.47
ANN | 96.36 / 23.33 | 99 / 99.04 | 93.66 / 93.61 | 91.66 / 91.21 | 99.49 / 99.52
Winning times | 0 / 0 | 3 / 3 | 0 / 1 | 0 / 0 | 5 / 6

Table 11: Classification performance of GK-means with three other methods for the Pima dataset

Table 12: Overall winning times for all six datasets

Tab. 12 summarizes the performance, in terms of winning times, of the six classifiers using each of the four oversampling methods on each dataset. The first value, 2/6, in Tab. 12 means that when applying the SMOTE method on the ZADA dataset, it performs better than the other methods for two classifiers out of six, while the terms (3*) and (1*) mean that at least two methods obtained the same result with the same classifiers. From the results in Tab. 12, it can be seen clearly that the proposed method (GK-Means) performed better than all other oversampling methods on all datasets except the Pima dataset, where the SMOTE method performed better. The second-best performance was obtained with the SMOTE method, while the worst results were obtained when the ADASYN method was used. In total, across the six datasets and the six classification algorithms, the proposed method performed better than the other oversampling methods 16 times, shared the same results with the SMOTE method 3 times, and shared one result with the GD method. Therefore, our proposed method significantly improved the performance of the classification algorithms in terms of the Accuracy and F1 criteria when compared with the other three methods. This indicates that the GK-Means method can be used as a new oversampling method for handling the class imbalance problem.

From the experimental results above, we conclude that our methodology is sensitive to the pattern of the data. For example, if the features in a dataset are already uncorrelated, then the GK-Means method will perform better with uncorrelated features generated from the Gaussian distribution, and vice versa. In addition, we observed that even when the number of samples in the minority class is very low, the new method performed very well. To investigate this, we returned to the data and realised that the samples were significantly different from each other, which helps K-Means clustering to better split the minority class into different clusters. Thus, the samples generated from each cluster will better represent the original data, because the number of generated samples is based on the size of each cluster. For instance, if one cluster has 20% of the samples and the other has 80%, the final generated samples will follow the same percentages: clusters one and two will generate 20% and 80% of the synthetic samples, respectively.
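This proportional allocation amounts to one line of arithmetic; a sketch with the hypothetical 20/80 split used above:

```python
import numpy as np

cluster_sizes = np.array([20, 80])   # e.g. 20% and 80% of the minority samples
n_new = 100                          # total synthetic samples required
per_cluster = np.round(n_new * cluster_sizes / cluster_sizes.sum()).astype(int)
print(per_cluster)                   # [20 80]
```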

    6 Conclusions

In this study, a new oversampling method called GK-means was proposed based on the Multivariate Gaussian distribution and the K-means clustering algorithm. The main advantages of the GK-means method are its simplicity, efficiency, and flexibility in generating synthetic samples for the minority class. The performance of the new method was compared with three standard benchmark methods, SMOTE, ADASYN, and GD, using six well-known classification algorithms. Experimental results on different datasets show that the GK-means method demonstrates very good efficiency in solving the class imbalance problem and is able to outperform other standard oversampling methods. The GK-means method provided equal or better results than SMOTE, ADASYN, and GD in terms of the Accuracy and F1 metrics. We conclude that GK-means can be used as a potential tool to solve the class imbalance problem.

This study is only a starting point for further investigations of the proposed GK-means method. Future work may thus concentrate on applying this method to other real-world problems. One could also replace K-means with other clustering methods to further investigate the existing clusters in the minority class. The components of GK-means (the Gaussian distribution and K-means clustering) are freely available in many programming languages, which allows researchers and practitioners to easily implement and utilize the proposed method in their preferred environment. Additionally, finding ideal values of the hyperparameter (k) is yet to be studied; this could be another direction for future work.

The newly created ZADA dataset was used for the first time and has not been used in any other machine learning applications. As future work, one could use our dataset for other machine learning applications, such as regression and clustering, to further extract the hidden patterns of the data. For example, it can be used for regression to find the relationship between the independent features in the ZADA dataset with a view to predicting the Fasting_Blood_Sugar feature. Moreover, since the ZADA dataset was labeled manually by diabetes experts, one could use different clustering algorithms, such as Mean-Shift and Agglomerative Hierarchical Clustering, to label the data automatically for further investigation.

Funding Statement: The author(s) received no specific funding for this study.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
