Ayat Alrosan, Waleed Alomoush, Mohammed Alswaitti, Khalid Alissa, Shahnorbanun Sahran, Sharif Naser Makhadmeh and Kamal Alieyan
1 Deanship of Information and Communication Technology, Imam Abdulrahman bin Faisal University, Dammam, Saudi Arabia
2 Computer Department, Imam Abdulrahman bin Faisal University, Dammam, Saudi Arabia
3 School of Electrical and Computer Engineering, Department of Information and Communication Technology, Xiamen University Malaysia, Sepang, 43900, Malaysia
4 Department of Computer Science, College of Computer Science and Information Technology, Imam Abdulrahman bin Faisal University, Dammam, Saudi Arabia
5 Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Malaysia
6 School of Computer Sciences, Universiti Sains Malaysia, 11800, Penang, Malaysia
7 Faculty of Computer Sciences and Informatics, Amman Arab University, Amman, Jordan
Abstract: Fuzzy C-means (FCM) is an unsupervised machine-learning clustering method. The main issues plaguing this algorithm are that the number of clusters within a particular dataset is unknown and that the algorithm is sensitive to the initialization of the cluster centres. The artificial bee colony (ABC) is a swarm algorithm that iteratively improves the quality of its members' solutions using particular kinds of randomness. However, ABC has some weaknesses, such as balancing exploration and exploitation. To improve the exploration process within the ABC algorithm, the mean artificial bee colony (MeanABC) modifies the search equation so that it depends on the mean of the previous and global best solutions. Furthermore, to solve the main issues of FCM, an automatic clustering algorithm based on the mean artificial bee colony, called AC-MeanABC, is proposed. It uses the MeanABC capability of balancing exploration and exploitation, and its capacity to explore the positive and negative directions of the search space, to find the best number of clusters and the best centroid values. A few benchmark datasets and a set of natural images were used to evaluate the effectiveness of AC-MeanABC. The experimental findings are encouraging and indicate considerable improvements over other state-of-the-art approaches in the same domain.
Keywords: Artificial bee colony; automatic clustering; natural images; validity index; number of clusters
Clustering of data is a statistical approach used for managing large data volumes. It is a multivariate analytical approach that identifies patterns and relationships existing among data. Through data clustering, a user can divide data into relatively homogeneous groups; by reorganizing these groups, the user may be able to utilize the original data volume efficiently. Clustering accuracy is crucial because, according to [1-3], clustered data that do not accurately represent the original data will have adverse consequences. In other words, data clustering operates on a set of patterns or points represented as vectors of attributes, measurement methods, or objects that exist in a multidimensional feature space [4,5].
Clustering algorithms are of two types, namely, partitional clustering, which generates a single partition of the data, and hierarchical clustering, which generates a nested sequence of partitions [4,5]. As highlighted by [5,6], there are two types of partitional clustering, namely, hard (crisp) and fuzzy (soft). Every data point in crisp clustering belongs to one cluster only. In fuzzy clustering, by contrast, a data point may belong to more than one cluster at the same time, according to its degree of fuzzy membership. Between these two types (crisp and fuzzy), fuzzy clustering appears more suitable for datasets that exhibit ill-defined boundaries between regions or clusters.
Over the last few years, clustering methods have been demonstrated to be effective, particularly in categorization tasks requiring semi- or full automation [7,8]. In the standard approach, clustering groups identical data points according to similarity and distance: similarity increases as distance decreases, so points within the same cluster (intra-cluster) should lie close together, while points in different clusters (inter-cluster) should lie far apart.
Clustering has many useful features, making it well known in many fields such as image segmentation, pattern recognition, machine learning, and data mining [3,9]. One of these is the partitional method of clustering [2]. It is free from issues such as static behaviour, where the data elements of one cluster cannot be moved to another cluster; this type of clustering also avoids the inability to separate overlapping clusters, which is common in hierarchical clustering. Among fuzzy clustering techniques, the FCM algorithm [10] is a useful approach that has been applied in many domains because it handles ambiguity robustly and can retain rather more information than hard segmentation techniques [11,12].
However, the fuzzy clustering-based approach still has significant weaknesses; for instance, it cannot operate automatically without prior knowledge of the number of clusters and the centroid locations. Determining the number of clusters within a particular dataset is a significant challenge, and the unavailability of experts, operators, or any prior knowledge compounds it. Accordingly, many researchers have worked in recent years on clustering methods that find the appropriate number of clusters within a particular dataset without experts and operators.
Metaheuristic optimization search algorithms, such as the artificial bee colony (ABC) [13], bees algorithm (BA) [14,15], harmony search (HS) [16,17], firefly algorithm (FA) [18,19], and cuckoo search (CS) [20,21], have shown success in many fields such as clustering [11,22], image processing [23], scheduling [24] and other real-world applications [25]. Moreover, a fuzzy clustering approach based on metaheuristic optimization search algorithms is considered a suitable choice for solving problems related to fuzzy clustering [3,26]. Some of these methods are discussed below.
A fuzzy clustering FCM based on a modified ABC, called MoABC-FCM, was proposed in Ouadfel et al. [27]. Additionally, Hancer et al. [28] presented an image segmentation approach using the ABC algorithm to detect tumours in MRI images. In Alrosan et al. [29], the ABC algorithm and fuzzy c-means were combined into ABC-FCM; its performance was evaluated on two types of MRI brain images, namely simulated brain data and actual MRI images. An automatic clustering method based on an improved binary ABC (IDisABC) was proposed by [30]. In Kuo et al. [3], an automatic customer segmentation method based on an improved artificial bee colony and fuzzy clustering was proposed. Further, Alrosan et al. [23] proposed a new image clustering method based on an improved ABC, called MeanABC-FCM; this method was applied to MRI images, and the outcomes show promising results in this field.
A fuzzy automatic clustering method named AFDE for the image segmentation problem was proposed by [31]. Relevantly, Alia et al. [32] proposed a novel automatic fuzzy clustering algorithm that hybridizes harmony search (HS) with FCM to produce an automatic segmentation method called DCHS. Further, an automatic hard clustering algorithm called DCPSO was proposed by [33-35], while a fuzzy genetic algorithm with variable string length, FVGA, which automatically finds a fitting number of clusters together with the matching fuzzy clustering outcome, was proposed by [36]. A fuzzy clustering algorithm combining a firefly optimizer with the FCM algorithm, named FFCM [38], was proposed in Alomoush et al. [37]. In Agbaje et al. [26], an automatic data clustering method was proposed based on an improved FA algorithm combined with the particle swarm optimization (PSO) algorithm.
This paper proposes an automatic fuzzy clustering approach based on the mean artificial bee colony, called AC-MeanABC. The proposed method uses the MeanABC algorithm to determine the appropriate number of clusters and the initial locations of the centroids. The method modifies the search equation to depend on the mean of the previous and global best solutions, so as to reach the best balance between exploration and exploitation. The rest of the paper is organized as follows. The type-2 fuzzy set is described in Section 2, and related background on the employed techniques is presented in Sections 3 and 4. The proposed method is presented in Section 5, Section 6 presents the experiments and results, and Section 7 concludes the paper.
FCM is an unsupervised method that partitions objects based on attribute similarity, tending to increase the similarity of entities within the same set and decrease the similarity of entities in different sets [2,39]. The objective function defined by FCM is reformulated using the type-2 fuzzy set as in Eq. (1) below.
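For reference, the standard (type-1) FCM objective that the type-2 reformulation in Eq. (1) builds on, J_m = Σ_i Σ_k u_ik^m ||x_i − c_k||², can be sketched as follows; the function and variable names are illustrative, not the paper's implementation.

```python
import numpy as np

def fcm_objective(X, centers, m=2.0):
    """Standard FCM objective J_m = sum_i sum_k u_ik^m * ||x_i - c_k||^2,
    with memberships computed in closed form from the distances."""
    # squared Euclidean distances between every point and every centre
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)  # avoid division by zero for points on a centre
    # closed-form membership update: u_ik = 1 / sum_l (d_ik / d_il)^(2/(m-1))
    inv = d2 ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)
    return (U ** m * d2).sum(), U
```

Each row of the membership matrix U sums to one, and a point lying close to one centre receives a membership near 1 for that cluster, which is the fuzzy behaviour described above.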
The standard ABC algorithm was proposed in Karaboga et al. [13] and was inspired by the life of a bee swarm. The bees in ABC are divided into three kinds: the employed bees focus on seeking food sources and transmitting their search information to the onlookers; the onlooker bees perform further searches to achieve better solutions, depending on the quality of the discovered food sources; and the scout bees abandon exhausted sources and search for new ones at random. A food source in the swarm is a possible solution, and its amount of nectar determines its fitness value. The ABC algorithm randomly initializes the food sources as in Eq. (5).
where i = 1, ..., SN, SN is the number of solutions (food sources), j = 1, 2, ..., D, D is the number of parameters (the dimensionality), and x_j^min and x_j^max are the lower and upper bounds of dimension j, respectively. The updated food source is computed as in Eq. (6).
where k ∈ {1, 2, ..., SN} and j ∈ {1, 2, ..., D} are randomly chosen indexes with k ≠ i, and φ_ij is a random value from [−1, 1]. To balance exploration and exploitation over candidate food sources, the fitness fit(x_i) of a food source (solution) is obtained from the objective function value f(x_i), as computed by Eq. (7).
Each onlooker then selects a solution with a probability proportional to its quality; that is, the selection probability correlates with the fitness value of the food source shared by the employed bees. The fitness-based probability is given in Eq. (8).
In this regard, when a solution does not improve over a number of trials, the "limit" is reached, and the corresponding employed bee becomes a scout bee and abandons its solution. Eq. (5) then shows how the scout bees start new searches and generate new random solutions; this process continues until the termination criterion is reached.
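The full loop described in Eqs. (5)-(8), including the limit/scout mechanism, can be sketched as below. This is a minimal generic ABC for a continuous objective, with illustrative names and parameter values, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def abc_minimize(f, lo, hi, sn=20, limit=30, mcn=200):
    d = len(lo)
    X = lo + rng.random((sn, d)) * (hi - lo)  # Eq. (5): random initialisation
    trials = np.zeros(sn)

    def fitness(x):
        fx = f(x)
        # Eq. (7): fit = 1/(1+f) if f >= 0, else 1 + |f|
        return 1.0 / (1.0 + fx) if fx >= 0 else 1.0 + abs(fx)

    fit = np.array([fitness(x) for x in X])
    for _ in range(mcn):
        probs = fit / fit.sum()  # Eq. (8): onlooker selection probabilities
        for phase in ("employed", "onlooker"):
            for i in range(sn):
                if phase == "onlooker" and rng.random() > probs[i]:
                    continue
                k = rng.choice([s for s in range(sn) if s != i])
                j = rng.integers(d)
                v = X[i].copy()  # Eq. (6): perturb one dimension towards/away from x_k
                v[j] += rng.uniform(-1.0, 1.0) * (X[i, j] - X[k, j])
                v = np.clip(v, lo, hi)
                fv = fitness(v)
                if fv > fit[i]:  # greedy selection keeps the better source
                    X[i], fit[i], trials[i] = v, fv, 0
                else:
                    trials[i] += 1
        worn = trials >= limit  # scout phase: abandon and re-initialise via Eq. (5)
        if worn.any():
            X[worn] = lo + rng.random((int(worn.sum()), d)) * (hi - lo)
            fit[worn] = [fitness(x) for x in X[worn]]
            trials[worn] = 0
    return X[fit.argmax()]

# illustrative run on a 2-D sphere function
best = abc_minimize(lambda x: float((x ** 2).sum()),
                    np.array([-5.0, -5.0]), np.array([5.0, 5.0]))
```

Higher fitness corresponds to lower objective value, so returning the source with maximal fitness yields the best minimizer found.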
The ABC has some disadvantages in terms of unbalanced search behaviour, a limitation it shares with other optimization algorithms. The authors used an improved ABC based on the mean global best (MeanABC) [23] to overcome these disadvantages. In Alrosan et al. [23], an improved ABC algorithm called MeanABC was proposed; this algorithm presents a unique modified search formula for the employed bee phase, based on knowledge about the best candidates, to produce a new neighbourhood food source as in Eq. (9).
where the first expression, x_i,j, is the present position, and the second expression is the variation between the current position and a neighbouring position, with φ_ij a random value from [−1, 1]. Both expressions are identical to the traditional ABC algorithm. The third expression is the modified search term in MeanABC. Its first part moves the bee's present position towards the mean of the global best position in its positive direction (+gbest) and the bee's own best position (pbest), while its second part moves the present position towards the mean of the bee's own best position (pbest) and the global best position in its negative direction (−gbest).
In the above, ψ is a random value in [0, C], and C is a positive constant. The value of C plays a critical role in balancing exploration and exploitation over the candidate food sources [38]. Fig. 1 shows the MeanABC pseudo-code.
Figure 1: The pseudo-code of the MeanABC algorithm
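The employed-bee move of Eq. (9) can be sketched as below, under one plausible reading of the third term (pulling the bee towards the mean of pbest and ±gbest); the function name, the random sign choice, and the default C are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_abc_move(x_i, x_k, pbest, gbest, C=1.5):
    """Candidate position per Eq. (9) (one plausible reading): the standard
    ABC neighbourhood step plus a pull towards the mean of the bee's own
    best (pbest) and the global best in its positive or negative direction
    (+gbest / -gbest)."""
    phi = rng.uniform(-1.0, 1.0, size=x_i.shape)  # phi_ij in [-1, 1]
    psi = rng.uniform(0.0, C, size=x_i.shape)     # psi in [0, C], C a positive constant
    sign = 1.0 if rng.random() < 0.5 else -1.0    # explore the +gbest or -gbest direction
    return x_i + phi * (x_i - x_k) + psi * ((pbest + sign * gbest) / 2.0 - x_i)
```

When the current position, neighbour, pbest and gbest all coincide, every term vanishes and the bee stays put, which matches the fixed-point behaviour one would expect from Eq. (9).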
AC-MeanABC is an unsupervised data clustering method that employs the ability of MeanABC to seek the best solution, which comprises the number of clusters and the centroids' locations. The pseudo-code for the AC-MeanABC algorithm is given in Fig. 2. The steps of AC-MeanABC are described as follows.
Figure 2: The pseudo-code of the AC-MeanABC algorithm
The AC-MeanABC algorithm begins by initializing the Np food sources. Each food source represents a vector that encodes the number of clusters and the centroids. The length of each food-source vector can change based on the randomly generated number of clusters for that vector, and the range of the number of clusters depends on the dataset. Each vector encodes a number of clusters, denoted 'ClustNo', which lies between the maximum number of clusters (ClusMaxNum) and the minimum number of clusters (ClustMinNum). The food-source initialization sets the number of clusters and the centroid values in MeanABC as in Eq. (10).
The values of ClusMaxNum and ClustMinNum are chosen based on the images used for segmentation. In this regard, the length of a vector is allowed to change if ClustNo changes; additionally, each vector in Np may grow up to the maximum number of clusters (ClusMaxNum). Thus, as a matrix representation, a food source consists of a d-dimensional space and a vector of length (ClusMaxNum, d), where d is the dimensionality of the centroid locations. If the vector's number of clusters is less than ClusMaxNum, the vector holds centroids in random locations, while the rest of the unused vector elements are marked with the don't-care '#' sign [32,40]. To clarify the don't-care elements, let d = 2 and ClusMaxNum = 6, meaning that the feature space is 2-dimensional (d = 2) and the maximum number of clusters is ClusMaxNum = 6. Suppose that one of the Np vectors has only 3 candidate cluster centres, such as Np = {(113, 9), (80, 4), (42, 5)}; then these three centres are placed in the Np vector while the rest of the elements are labelled with the don't-care '#' sign: Np = {113 9 # # # # 80 4 # # 42 5}. To determine the best solutions, the fitness function J(c1, c2, ..., cn) in Eq. (1) is used, and the maximum value of J(c1, c2, ..., cn) is considered the best solution.
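The variable-length encoding with don't-care slots can be sketched as follows, reusing the example above (d = 2, ClusMaxNum = 6, centres (113, 9), (80, 4), (42, 5)). For simplicity this sketch places active centroids in contiguous slots rather than the random locations the paper describes, and NaN stands in for the '#' marker; the helper names are illustrative.

```python
import numpy as np

DONT_CARE = np.nan  # stands in for the '#' (don't-care) marker

def encode_food_source(centres, clus_max_num, d):
    """Fixed-length vector of size clus_max_num * d: active centroids first
    (contiguous here for simplicity; random slots in the paper), with the
    unused slots marked as don't-care."""
    vec = np.full(clus_max_num * d, DONT_CARE)
    for idx, c in enumerate(centres):
        vec[idx * d:(idx + 1) * d] = c
    return vec

def decode_food_source(vec, d):
    """Recover the active centroids, skipping don't-care slots."""
    cells = vec.reshape(-1, d)
    return [tuple(c) for c in cells if not np.isnan(c).any()]
```

Decoding simply drops the don't-care cells, so a vector with 3 active centres out of a maximum of 6 round-trips back to its 3 centroids.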
This phase of the AC-MeanABC clustering algorithm finds the best Np vector with the optimal number of clusters and centroid locations. To update the Np vector values, the new food-source position is calculated using Eq. (9), presented earlier in Section 4.
Here, i = [1, 2, ..., Np], j = [1, 2, ..., Cn], x_j,k is a randomly chosen jth parameter of the kth individual, where k is one of the Np food sources and k ≠ i. If any parameter of v_i exceeds its predetermined boundaries, it is adjusted to fit the appropriate range. The scale factor φ_ij is a random number in [−1, 1], while ψ is a uniform random number from [0, C], where C is a nonnegative constant. Here, pbest represents the previous best value of each encoded vector, namely Np = (C1, C2, C3, C4, ..., Cn), and gbest represents the previous best value over all encoded vectors i = [1, 2, ..., Np]. After generating the new candidate encoded vector, its fitness value is calculated as in Eq. (7), say fit_j. If fit_j is better than fit_i, the new solution replaces the old one through greedy selection.

5.3 Onlooker Bees Phase

In the onlooker bee stage, each onlooker selects a food source with a probability determined by the amount of nectar (fit) of the food sources shared by the employed bees; the probability is determined as in Eq. (8). When the probability for fit_i of the encoded vector is high, the food source is selected; otherwise, the vector is treated as a don't-care '#' solution. After the employed bee phase, once a food source has been chosen, a neighbouring source v_i is calculated by Eq. (9) and its f(x) is evaluated. Then, greedy selection is applied between v_i and x_i.

5.4 Scout Bees Phase

When a solution does not change over repeated trials and the "limit" is reached, the employed bee becomes a scout bee and abandons its solution. The scout bees then begin new random searches and solutions via Eq. (10). These steps are iterated until the termination criterion is reached.
5.5 Solutions Evaluation and Clustering

The degree of appropriateness or goodness of each MeanABC solution is measured by its fitness value. Each data point is allocated to one or more clusters using the fuzzy membership values, which are determined by the fuzzy membership equation. Consequently, a cluster validity index is used as an indicator of the appropriateness of the clustering. Such an index is typically exercised to establish the quality of different solutions obtained using different settings of a particular clustering algorithm (or of solutions given by different algorithms). In this paper, the cluster validity index VI is used to assess the appropriateness of the clustering [35]. The VI index is presented in Eq. (11), where s is a constant number and N(μ, σ) is a Gaussian distribution as in Eq. (12), with mean μ and standard deviation σ, and cn is the number of clusters. Eqs. (13) and (14) represent the intra-cluster and inter-cluster measures, where Np is the number of patterns in a dataset, Zp is a pattern in a cluster, and mk and cn are the kth cluster centroid and the number of clusters, respectively.

6 Experiments and Results

The AC-MeanABC fuzzy clustering algorithm is conducted as a fully automatic data clustering method to find the number of clusters in datasets. The experiments and results are presented in three parts. Firstly, the AC-MeanABC parameter settings and the most appropriate values are chosen. Secondly, AC-MeanABC is run on 11 benchmark clustering datasets selected from the UCI databases: Iris, Ecoli, Wisconsin, Wine, Dermatology, Glass, Aggregation, R15, D31, Libras movement and Wholesaler customers. Some details about these datasets are given in Tab. 1. In the third part, AC-MeanABC is run on five natural images obtained from the Berkeley dataset [30,41]: Lena, Jet, MorroBay, Mandril, and Pepper. All natural images have size 512×512 except the Pepper image, which is 256×256.
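The validity index described above can be sketched as below, assuming the Turi-style form VI = (cn · N(μ, σ) + 1) · intra / inter used by the cited DCPSO work; the exact role of the constant s in Eq. (11) is not reproduced here, and μ and σ follow the per-dataset settings of Tab. 3. Names and defaults are illustrative.

```python
import numpy as np

def gaussian(x, mu, sigma):
    # N(mu, sigma) density evaluated at x, as in Eq. (12)
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def vi_index(X, centres, mu=2.0, sigma=1.0):
    """Turi-style validity index (smaller is better), sketching Eqs. (11)-(14):
    VI = (cn * N(mu, sigma) + 1) * intra / inter."""
    cn = len(centres)
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Eq. (13): mean squared distance of each pattern to its cluster centre
    intra = d2[np.arange(len(X)), labels].sum() / len(X)
    # Eq. (14): minimum squared distance between any pair of cluster centres
    cd2 = ((centres[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    inter = cd2[np.triu_indices(cn, k=1)].min()
    return (cn * gaussian(cn, mu, sigma) + 1.0) * intra / inter
```

Compact, well-separated clusterings shrink intra and grow inter, so better solutions get lower VI values, matching the minimization reported in the result tables below.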
Table 1: Clustering datasets details

6.1 AC-MeanABC Parameters Setting

To obtain the best outcomes from any clustering optimization algorithm, the suitable selection of parameters is critical because these parameters are essential to the algorithm's performance and accuracy. The parameters of AC-MeanABC (i.e., population size SN, limit, maximum cycle number (MCN), nonnegative constant parameter (C), and termination criterion TC) have to be determined. The values of these parameters were set as in Tab. 2, based on related works [3,30]. Tab. 3 lists other parameters relevant to the dataset, such as the maximum number of clusters, Np, μ and σ [3,30].

Table 2: AC-MeanABC parameters settings

6.2 Benchmark Clustering Datasets

In this experiment, AC-MeanABC was performed on 11 benchmark clustering datasets chosen from the UCI database. The benchmark datasets were divided into two parts: part 1 included Glass, Aggregation, R15, D31, Libras movement and Wholesaler customers, while part 2 included Iris, Ecoli, Wisconsin, Wine and Dermatology.

6.2.1 Part 1

In this part, the AC-MeanABC outcomes were compared with the standard ABC algorithm and other related works such as iABC, AKCBCO, AKC-MEPSO, DCPG, DCPSO and DCGA [3,42,43]. The parameter settings for the AC-MeanABC algorithm were selected as in Tabs. 2 and 3, based on the same parameter settings in [3]. Accordingly, Tabs. 4 and 5 illustrate these benchmark data clusterings, the results of the VI index, and their optimal cluster numbers (# OC). Tab. 4 shows that AC-MeanABC outperforms other related works in minimizing the VI index in most of the outcomes, while iABC outperforms AC-MeanABC on R15 and D31. Additionally, Tab. 5 indicates that AC-MeanABC exceeds other state-of-the-art methods in finding the optimal number of clusters for Glass, Aggregation, R15, Libras movement and Wholesaler customers, while iABC outperforms AC-MeanABC on D31.
Table 3: Other parameters settings

Table 4: Benchmark data clustering part 1 outcomes of VI index

In Tab. 6, the Friedman test was carried out and the results were used to compare the performance of AC-MeanABC and the other related works. As shown in Tab. 6, the proposed algorithm outperforms all comparable methods. The box-whisker plot for the results of the clustering algorithms in Tab. 4 is displayed in Fig. 3.

6.2.2 Part 2

In this part, the AC-MeanABC outcomes were compared with other related works such as the discrete binary artificial bee colony DisABC, GA-based clustering algorithms, the improved discrete binary artificial bee colony IDisABC [28] and dynamic clustering based on particle swarm optimization DCPSO [33]. The parameter settings for the AC-MeanABC algorithm were selected as in Tabs. 2 and 3, based on the same parameter settings in [28].

Table 5: Benchmark data clustering part 1 outcomes of cluster numbers

Table 6: Mean ranks obtained by the Friedman test for the clustering algorithm outcomes in Tab. 4

Figure 3: Box-whisker plot of the outcomes of five clustering algorithms in Tab. 4

Tabs. 7 and 8 illustrate these benchmark data clusterings, the results of the VI index, and their optimal cluster numbers (# OC). Tab. 7 shows that AC-MeanABC outperforms other related works in minimizing the VI index in all outcomes. Additionally, Tab. 8 indicates that AC-MeanABC exceeds other state-of-the-art methods in finding the optimal number of clusters for the Ecoli and Wine datasets. In the remaining three cases, all approaches reached the optimal number of clusters on the Iris dataset. Further, AC-MeanABC, IDisABC, and DisABC obtained the same outcomes on the Wisconsin dataset, with a preference for AC-MeanABC in the standard deviation value. On the Dermatology dataset, DCPSO and AC-MeanABC are very close in finding the best number of clusters, but the DCPSO standard deviation value is much higher than that of AC-MeanABC.
Table 7: Benchmark data clustering part 2 outcomes of VI index

Table 8: Benchmark data clustering part 2 outcomes of cluster numbers

To show the significance of the improvement of AC-MeanABC, the Friedman test was performed and the results are shown in Tab. 9. As clearly shown, the mean ranking value of AC-MeanABC is better than those of IDisABC, DisABC, DCPSO and GA, where the best value is the minimum value and is represented in bold. Also, Fig. 4 presents the box-whisker plot for the results of the five clustering algorithms in Tab. 7.

Table 9: Mean ranks obtained by the Friedman test for the five clustering algorithm outcomes in Tab. 7

Figure 4: Box-whisker plot of the outcomes of five clustering algorithms in Tab. 7

6.2.3 Natural Images

In this experiment, five natural images were obtained from the Berkeley segmentation dataset [30,41]: the Lena, Jet plane, MorroBay, Mandril, and Pepper images. The AC-MeanABC outcomes were compared with other related works such as the discrete binary artificial bee colony DisABC, GA-based clustering algorithms, the improved discrete binary ABC called IDisABC [30] and the dynamic clustering based PSO called DCPSO [35]. The parameter settings for the AC-MeanABC algorithm were selected based on the same parameter settings in [30], as in Tabs. 2 and 3.
Tabs. 10 and 11 illustrate these natural images, the VI index results, and the corresponding cluster numbers from the ground-truth images (# AC). Tab. 10 shows that AC-MeanABC outperforms other related works in minimizing the VI index in all cases. Tab. 11 shows the number of clusters determined automatically by the AC-MeanABC algorithm and other state-of-the-art methods: AC-MeanABC outperforms other related works in finding the optimal number of clusters for the Lena, Jet plane, MorroBay, and Pepper images. In contrast, the IDisABC approach outperforms AC-MeanABC and the other approaches in finding the optimal number of clusters for the Mandrill image, although AC-MeanABC and IDisABC are very close in the result for the Mandrill image. Figs. 5a, 6a, 7a, 8a and 9a show the original MorroBay, Lena, Mandrill, Jet plane and Pepper images, while Figs. 5b, 6b, 7b, 8b and 9b show the corresponding images clustered by AC-MeanABC.

Table 10: Natural image clustering outcomes of VI index

Table 11: Obtained number of clusters for natural images by the clustering approaches

Figure 5: (a) Original MorroBay and (b) segmented MorroBay by AC-MeanABC

Figure 6: (a) Original Lena and (b) segmented Lena by AC-MeanABC

In Tab. 12, the Friedman test was carried out to compare the performance of the proposed algorithm with the other related works. As shown, AC-MeanABC outperforms IDisABC, DisABC, DCPSO and GA; the best value, considered the minimum value of all clustering algorithms, is presented in bold. The box-whisker plot for the results of the five clustering algorithms in Tab. 10 is displayed in Fig. 10.
Figure 7: (a) Original Mandrill and (b) segmented Mandrill by AC-MeanABC

Figure 8: (a) Original Jet plane and (b) segmented Jet plane by AC-MeanABC

Figure 9: (a) Original Pepper and (b) segmented Pepper by AC-MeanABC

Table 12: Mean ranks obtained by the Friedman test for the five clustering algorithm outcomes in Tab. 10

Figure 10: Box-whisker plot of the outcomes of five clustering algorithms in Tab. 10

7 Conclusion

In this paper, an automatic fuzzy clustering method based on the MeanABC search method, called AC-MeanABC, was proposed to solve the challenges of determining the number of clusters (regions) and the cluster centroids. The AC-MeanABC clustering method uses the capability of the MeanABC algorithm to explore the search space in positive and negative directions to find the near-optimal number of clusters and centroid values. The experiments and results were obtained using 11 benchmark datasets and 5 natural images. These experiments compared AC-MeanABC with other clustering methods such as iABC, ABC, AKCBCO, AKC-MEPSO, DCPG, DCGA, IDisABC, DisABC, DCPSO, and GA. In conclusion, the clustering results of AC-MeanABC are better than those of the state-of-the-art techniques in determining the optimal number of clusters and the value of the validity index VI.

Acknowledgement: Thanks to our families and colleagues who supported us morally.

Funding Statement: This research was supported by the Research Management Center, Xiamen University Malaysia under XMUM Research Program Cycle 4 (Grant No: XMUMRF/2019-C4/IECE/0012).

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
Computers, Materials & Continua, 2021, Issue 8