R. D. Phillips, M. S. Hossain, L. T. Watson, R. H. Wynne, and Naren Ramakrishnan
Clustering is an unsupervised process that models locality of data samples in attribute space to identify groupings: samples within a group are closer to each other than to samples from other groups. To assess whether the discovered clusters are meaningful, a typical procedure is to see if the groupings capture other categorical information not originally used during clustering. For instance, in microarray bioinformatics, data samples correspond to genes and their expression vectors, clusters capture locality in expression space, and they are evaluated to see if genes within a cluster share common biological function/annotations. (Observe that the functional annotations are not used during clustering.) In text mining, data samples correspond to documents and their text vectors, clusters capture locality in term space, and are evaluated for their correspondence with a priori domain information such as topics. In remote sensing, data samples correspond to pixels in an image, clusters capture locality of pixel intensities, and are evaluated for their correspondence with land cover classifications.
All of the above applications are essentially determining whether locality in one space preserves correspondence with information in another space, also referred to as the cluster assumption [Chapelle, Schölkopf, and Zien (2006)]. While cluster evaluation is typically conducted as a distinct post-processing stage after mining, recently developed clustering formulations blur this boundary. For instance, in Wagstaff, Cardie, Rogers, and Schrödl (2001), locality information is used along with background knowledge to influence the clustering. Such background knowledge takes the form of constraints, some of which dictate that certain samples should appear in the same cluster, while others specify that two samples should be in different clusters. Similarly, in Tishby, Pereira, and Bialek (1999), clusters are designed using an objective function that balances compression of the primary random variable against preservation of mutual information with an auxiliary variable. With the advent of semisupervised clustering [Chapelle, Schölkopf, and Zien (2006)], more ways to integrate labeled and unlabeled information are rapidly being proposed.
The design of both classical and newer clustering algorithms is predicated on the ability to evaluate clusters for enrichment and to use this information to drive the refinement and subsequent discovery of clusters. However, classical statistical enrichment procedures (e.g., those based on the hypergeometric distribution [Ewens and Grant (2001)]) assume a hard clustering formulation. The focus here is on soft clusters, where the groupings are defined by portions of individual samples. This paper presents a new statistical test to enrich soft clusters and demonstrates its application to several datasets.
Clustering can be used to analyze and discover relationships in large datasets. Strictly unsupervised clustering is used in the absence of information about target clusters and variables of interest; however, clustering can be partially supervised or guided when additional information regarding target clusters is available.
Clustering by itself does not correspond to classification, the process by which class labels are assigned to individual data elements, but clustering can be a useful tool in the classification of large datasets. When clusters are used to organize similar elements in a dataset, class labels can be assigned to entire clusters, allowing individual elements within a cluster to be assigned that class label. Because samples or elements in a particular cluster are similar or “close,” they are assumed likely to share a class label, which is known as the cluster assumption. Assigning labels to a modest number of clusters is less time intensive than assigning labels to many individual samples, so if the cluster assumption holds, clustering is an efficient and powerful tool in classification. Unfortunately, this cluster assumption does not hold in all cases, as there is no rule that dictates that “close” samples must share a label. Finally, the descriptions of clustering, semisupervised clustering, and cluster evaluation given above assume a specific type of clustering where clusters are collections of individual elements, known as hard or crisp clustering. Alternatively, clusters can be defined by portions of individual samples, known as soft clustering. Soft cluster evaluation becomes less intuitive as clusters will no longer “contain” individual samples, and clusters cannot be composed primarily of samples belonging to one class in the same sense. The following subsections define hard and soft clustering and classification.
Hard clustering produces clusters, each of which is a collection of individual samples. Let the $i$th sample be denoted by $x^{(i)} \in \mathbb{R}^b$, where $i = 1, \ldots, n$. A cluster is typically represented by a prototype, such as the mean of the samples contained in the cluster; let the $j$th cluster prototype be $U^{(j)} \in \mathbb{R}^b$, where $j = 1, \ldots, K$. All clusters taken together form a partition of the data, defined by a partition matrix $w$ with $w_{ij} = 1$ indicating that the $i$th sample belongs to the $j$th cluster, $w_{ij} = 0$ otherwise, and $\sum_{j=1}^{K} w_{ij} = 1$ for all $i$. Each sample is a member of exactly one cluster.
A classic example of a simple hard clustering method is the $K$-means clustering algorithm, which locates a local minimum point of the objective function
$$\sum_{i=1}^{n} \sum_{j=1}^{K} w_{ij}\, \rho_{ij}, \tag{1}$$
where $\rho_{ij} = \|x^{(i)} - U^{(j)}\|_2^2$ [MacQueen (1967)]. In this case, $\rho_{ij}$ is a measure of dissimilarity or distance between the $i$th sample and the $j$th cluster. The $K$-means clustering algorithm attempts to find the ideal partition that minimizes the sum of squared distances between each sample and the prototype of the cluster to which the sample belongs. The algorithm for $K$-means requires $K$ initial cluster prototypes and iteratively assigns each sample to the closest cluster using
$$w_{ij} = \begin{cases} 1, & j = \arg\min_{1 \le k \le K} \rho_{ik}, \\ 0, & \text{otherwise}, \end{cases}$$
for each $i$, followed by the cluster prototype (mean) recalculation
$$U^{(j)} = \frac{\sum_{i=1}^{n} w_{ij}\, x^{(i)}}{\sum_{i=1}^{n} w_{ij}}, \qquad j = 1, \ldots, K,$$
once $w$ has been calculated. This process, guaranteed to terminate in a finite number of iterations, continues until no further improvement is possible, terminating at a local minimum point of (1).
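For concreteness, a minimal NumPy sketch of this iteration is given below (this is not the authors' implementation; the function name and the random initialization scheme are illustrative):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Hard K-means: alternate nearest-prototype assignment and mean updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = X[rng.choice(n, size=K, replace=False)].copy()    # initial prototypes
    prev = None
    for _ in range(max_iter):
        # rho[i, j]: squared Euclidean distance from sample i to prototype j
        rho = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
        labels = rho.argmin(axis=1)                        # hard assignment w_ij
        if prev is not None and np.array_equal(labels, prev):
            break                                          # no change: local minimum of (1)
        prev = labels
        for j in range(K):
            members = X[labels == j]
            if len(members) > 0:
                U[j] = members.mean(axis=0)                # prototype = cluster mean
    return labels, U
```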
In hard clusters, such as those produced by $K$-means, the collection of samples that belong to a particular cluster can be evaluated to determine a cluster's eligibility to perform classification. The class memberships of the labeled samples in a particular cluster can be modeled using discrete random variables generated from binomial, multinomial, or hypergeometric distributions, for example. These random variables form the basis of statistical tests used to evaluate clusters for classification. For example, let $V_{ic}$ be a Bernoulli random variable where success ($V_{ic} = 1$) indicates the $i$th labeled sample is labeled with the $c$th class. The number of labeled samples labeled with the $c$th class in a particular cluster would be a binomial random variable $V_{c,j} = \sum_{i \in I_j} V_{ic}$, where $I_j$ is the index set of labeled samples belonging to the $j$th cluster. This binomial random variable can be used as the basis for a statistical hypothesis test to determine if the number of samples labeled with the $c$th class (as opposed to all other classes) in the $j$th cluster is significant. In practice, the $c$th class that would be tested would be the class that is most represented in the $j$th cluster, or mathematically, $c = \arg\max_{1 \le c \le C} V_{c,j}$ for a particular $j$, where $C$ is the number of classes.
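A small sketch of such a binomial enrichment test is shown below. It assumes the null success probability is the overall fraction of class-$c$ samples among all labeled samples, which is one reasonable choice but is not fixed by the text above, and it uses SciPy's binomial survival function to obtain the one-sided $p$-value:

```python
import numpy as np
from scipy.stats import binom

def hard_cluster_enrichment(labels_in_cluster, all_labels, c, alpha=0.05):
    """Binomial test: is class c over-represented among the labeled samples of a cluster?

    labels_in_cluster : class labels of the labeled samples assigned to the cluster
    all_labels        : class labels of all labeled samples (defines the null rate, an assumption)
    """
    labels_in_cluster = np.asarray(labels_in_cluster)
    n_j = len(labels_in_cluster)                            # labeled samples in the cluster
    v_cj = int(np.sum(labels_in_cluster == c))              # successes V_{c,j}
    p0 = float(np.mean(np.asarray(all_labels) == c))        # null probability of class c
    # P(V_{c,j} >= v_cj) under Binomial(n_j, p0); sf(k) = P(X > k), hence v_cj - 1
    p_value = binom.sf(v_cj - 1, n_j, p0)
    return p_value, p_value <= alpha
```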
Soft clusters, instead of containing a collection of individual samples, contain portions of individual samples. Another view of soft clustering is that each sample has a probability of belonging to a particular cluster. Soft clustering has an advantage over hard clustering in that a sample is not simply assigned to the closest cluster; information about relationships to other clusters is preserved as well. Furthermore, these continuous assignments are less constrained than discrete assignments, resulting in a less constrained objective function. As in hard clustering, $w_{ij}$ indicates cluster membership, but instead of being either zero or one, $w_{ij} \in (0, 1)$, and as in hard clustering, $\sum_{j=1}^{K} w_{ij} = 1$ for all $i$. Some versions of fuzzy clustering do not impose this requirement, but those nonprobabilistic methods will not be considered here.
An example of a soft clustering method analogous to $K$-means is fuzzy $K$-means, which locates a local minimum point of the objective function
$$\sum_{i=1}^{n} \sum_{j=1}^{K} w_{ij}^{p}\, \rho_{ij},$$
where $\rho_{ij}$ is still the squared Euclidean distance between $x^{(i)}$ and $U^{(j)}$ and $p > 1$ [Bezdek (1980)]. The algorithm that minimizes this objective function is similar to that of $K$-means in that it first calculates
$$w_{ij} = \left( \sum_{k=1}^{K} \left( \frac{\rho_{ij}}{\rho_{ik}} \right)^{1/(p-1)} \right)^{-1}$$
for all $i$ and $j$, followed by calculating updated cluster prototypes
$$U^{(j)} = \frac{\sum_{i=1}^{n} w_{ij}^{p}\, x^{(i)}}{\sum_{i=1}^{n} w_{ij}^{p}}, \qquad j = 1, \ldots, K.$$
The cluster prototype is a weighted average. This iteration (recalculation of the weights followed by recalculation of cluster prototypes, followed by recalculation of the weights, etc.) is guaranteed to converge (with these definitions of $\rho_{ij}$, $U^{(j)}$, and $w_{ij}$) for $p > 1$ [Bezdek (1980)].
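A minimal NumPy sketch of these two alternating updates (the standard fuzzy $c$-means updates; variable names are illustrative) is:

```python
import numpy as np

def fuzzy_kmeans(X, K, p=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy K-means: alternate soft weight updates and weighted-mean prototype updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = X[rng.choice(n, size=K, replace=False)].copy()     # initial prototypes
    for _ in range(max_iter):
        # squared Euclidean distances; small constant avoids division by zero
        rho = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # w_ij proportional to rho_ij^{-1/(p-1)}, normalized so each row sums to 1
        inv = rho ** (-1.0 / (p - 1.0))
        W = inv / inv.sum(axis=1, keepdims=True)
        # prototype update: weighted average with weights w_ij^p
        Wp = W ** p
        U_new = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
        converged = np.max(np.abs(U_new - U)) < tol
        U = U_new
        if converged:
            break
    return W, U
```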
Evaluation of soft clusters requires taking cluster weights into account when examining class memberships of the labeled samples. Each labeled sample will have some positive membership in each cluster, and a new type of evaluation will be necessary to directly evaluate soft clusters. Soft cluster memberships could be converted to hard cluster memberships for the purpose of cluster evaluation, but if soft clustering is warranted, those soft clusters should be evaluated directly.
Hard cluster evaluation (for classification) is based on the composition of the cluster, or what type of samples make up the cluster. The question of whether a cluster should be used for classification can be answered when some of the samples within the cluster have labels and there are a sufficient number of samples to draw statistical conclusions. Because soft clusters no longer “contain” samples, the more important question is whether the relative magnitudes of memberships between samples of a particular class and the cluster are significantly different. In other words, if the magnitudes of cluster memberships for samples of a particular class appear to be significantly higher than the memberships for other classes, then the cluster is demonstrating characteristics of that class. With hard clusters, a cluster is pure if only one class is contained in the cluster; no samples labeled with another class are present in the cluster. This is impossible in soft clustering, as all types of samples will have positive memberships in all clusters, and in practice, these memberships, although possibly small, will be nonnegligible.
Just as hard clusters that are ideal for classification contain only one class, soft clusters that are ideal for classification will be representative of just one class. The goal in using soft clustering for classification is to assign a class label to an entire cluster (the same goal as for hard clusters), but just as each sample has a soft membership in a particular cluster, each sample will have a soft membership in a class. The samples demonstrate characteristics of multiple classes, justifying soft classification, but the clusters (logical groupings of similar data) should not contain or represent multiple classes. The goal of this work is to associate a soft cluster with one particular class if that class is clearly dominant within the cluster. Probability will determine how clearly a particular cluster is composed of one class, and if this probability passes a predetermined threshold test, the cluster will be associated with a class.
The statistical tests used to evaluate clusters in this paper are statistical hypothesis tests, where a null hypothesis is proposed. If observed evidence strongly indicates the null hypothesis should be rejected, the alternate hypothesis will be accepted. In the absence of compelling evidence to the contrary, the null hypothesis cannot be rejected.
The first hypothesis test is based on the average cluster weights in the cluster of interest, the $j$th cluster. In order to associate the $j$th cluster with the $c$th class, the average cluster weight for the $c$th class,
$$\bar{w}_{c,j} = \frac{1}{n_c} \sum_{i \in J_c} w_{ij},$$
where $n_c$ is the number of samples labeled with the $c$th class and $J_c$ is the index set of samples labeled with the $c$th class, should be statistically significantly higher than the other cluster weights for the $j$th cluster. If the weights for samples labeled with the $c$th class are higher in general than those for samples from arbitrary classes, the cluster is demonstrating a tendency toward the $c$th class and can be used to discriminate the $c$th class from other classes.
The null hypothesis is that the average cluster weight for samples from the $c$th class in the $j$th cluster is not significantly different from the average cluster weight for samples from all classes in the $j$th cluster. The alternate hypothesis is that the average weight for samples from the $c$th class in the $j$th cluster is significantly different (higher) than the average cluster weight for all samples. Note that in practice, only the class with the highest average cluster weight for the $j$th cluster would be considered. Suppose that a test statistic derived for this test is normally distributed, and is in fact a standard normal random variable $Z$. Then if the observed value is $\hat{z}$, and $P(Z \ge \hat{z}) \le \alpha$ for $0 < \alpha < 1$, the null hypothesis is rejected. The following sections derive appropriate test statistics to use in this hypothesis test.
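The decision rule itself reduces to a one-sided normal tail comparison: given an observed value $\hat{z}$, the $p$-value is $P(Z \ge \hat{z}) = 1 - \Phi(\hat{z})$. A minimal sketch using SciPy (the function name is illustrative):

```python
from scipy.stats import norm

def reject_null(z_hat, alpha=0.05):
    """One-sided test: reject H0 if P(Z >= z_hat) <= alpha for standard normal Z."""
    p_value = norm.sf(z_hat)          # survival function: 1 - Phi(z_hat)
    return p_value, p_value <= alpha

# e.g. reject_null(2.5) gives p ~ 0.0062, so H0 is rejected at alpha = 0.05
```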
Suppose a dataset $x$ contains $n$ samples $x^{(i)} \in \mathbb{R}^B$, $i = 1, \ldots, n$. For $K$ fixed cluster centers $U^{(k)} \in \mathbb{R}^B$, $k = 1, \ldots, K$, the assigned weight of the $i$th pixel to the $j$th cluster is
Theorem: Let $X^{(i)}$, $i = 1, 2, \ldots$, be $B$-dimensional random vectors having one of $Q$ distinct multivariate normal distributions. For $i = 1, 2, \ldots$ and $j = 1, \ldots, K$ define the random variables
Figure 1: Distribution of sums of weights in one soft cluster out of two.
where $K$ is the number of clusters and $U^{(k)} \in \mathbb{R}^B$ is the $k$th cluster center (considered fixed for weight calculation). Then for any $j = 1, \ldots, K$,
Remark: The assumption that the $X^{(i)}$, $i = 1, 2, \ldots$, are generated from a finite number of normal distributions is stronger than necessary. The proof in Phillips, Watson, Wynne, and Ramakrishnan (2009a) holds if the $X^{(i)}$, $i = 1, 2, \ldots$, are generated from a finite number of arbitrary distributions.
Experimental clustering results using a dataset described in Section 4 of this paper match this theoretical result, as illustrated by one experiment in Fig. 1. This illustration shows the distribution of sums of cluster weights for one particular cluster (when $K = 2$).
Starting with the normal approximation for the sum of the cluster weights, the standard normal test statistic would be
where $\mathrm{E}[W_{ij}]$ is the expected value of $W_{ij}$ and $\mathrm{Var}[W_{ij}]$ is the variance of $W_{ij}$ for the $j$th cluster. $\mathrm{E}[W_{ij}]$ and $\mathrm{Var}[W_{ij}]$ are unknown, but can be reasonably approximated using the sample mean and sample variance.
Since $z$ is generated (approximately) by the standard normal distribution, this test statistic can be used in the proposed hypothesis test.
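A sketch of a statistic of this form is given below. It assumes the statistic standardizes the class-$c$ sum of weights in the $j$th cluster using the sample mean and sample variance of the weights over all labeled samples, which is consistent with the description above but should be read as an assumption rather than the paper's exact formula:

```python
import numpy as np
from scipy.stats import norm

def naive_soft_enrichment(W, labels, c, j, alpha=0.05):
    """Sketch: standardize the class-c sum of weights in cluster j using the
    sample mean and variance of w_ij over all labeled samples (assumed form)."""
    w_j = np.asarray(W)[:, j]                   # weights of all labeled samples in cluster j
    in_c = (np.asarray(labels) == c)
    n_c = int(in_c.sum())
    sample_mean = w_j.mean()                    # estimate of E[W_ij]
    sample_var = w_j.var(ddof=1)                # estimate of Var[W_ij]
    z = (w_j[in_c].sum() - n_c * sample_mean) / np.sqrt(n_c * sample_var)
    p_value = norm.sf(z)
    return z, p_value, p_value <= alpha
```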
One potential issue with the above statistic is that the sample mean and standard deviation calculations assume the sample is identically distributed, which is specifically not the assumption in this case (clustering assumes that the data are generated from a number of distributions, where the true number of clusters is equal to the number of distributions, which is unknown a priori). A better statistic acknowledges that the data are not identically distributed, but are generated from a finite number of distributions. Since the number of distributions and the distributions themselves are unknown, the number of classes and the individual class labels, which are assumed to correspond to inherent structure of the data, are used to approximate the true mean and variance of multiple clusters. Precisely, assume that all labeled sample indices $i$ with distribution index $\psi(i) = q$ correspond to the same class label $\phi(i) = c$. If $i \in \psi^{-1}(q)$, then $i \in \phi^{-1}(c)$, but $i \in \phi^{-1}(c)$ does not imply $i \in \psi^{-1}(q)$ (more than one distribution can correspond to one class), and $J_c = \phi^{-1}(c) = \{ i \mid \phi(i) = c,\ 1 \le i \le n \}$. The above statistic requires modification to use class information. In the previous statistic,
recalling that $\mathrm{E}[W_{ij}] = a_{ij} = \alpha_{qj}$ for $i \in I_q$. Assume that when $\phi(i) = c$ and distribution index $q = \psi(i)$ corresponds to $c = \phi(i)$, then $\alpha_{qj}$ can be approximated by $\gamma_{cj}$, the mean of class $c = \phi(i)$. Ideally $\alpha_{qj}$ should be approximated directly, but there is no way to know $\psi^{-1}(q)$, so essentially $\psi^{-1}(q) \subseteq \phi^{-1}(c)$ is being approximated by $\phi^{-1}(c)$. Unfortunately, using the sample mean of the $c$th class and the $j$th cluster to approximate $\gamma_{cj}$ (and therefore $\alpha_{qj}$) breaks down, because the sample mean of the $c$th class and the $j$th cluster is both the random variable on the left side and the approximation of the expected value on the right side of the minus sign. This is illustrated below. Approximating $\gamma_{cj}$ (and $\alpha_{qj}$) with the sample mean for the $c$th class,
Thus this test statistic does not work because the value being tested is the same as the estimated mean for the $c$th class.
In order to make use of class information to estimate distribution statistics (mean and variance), it is necessary to modify the random variable to model class labels as well as cluster memberships. Consider each labeled sample's membership in a particular class, say the $c$th class, to be a Bernoulli trial $V_{ic}$, where $V_{ic} = 1$ indicates the $i$th sample is labeled with the $c$th class, and $W_{ij}$ is defined above. Define
$$Y_{c,j} = \sum_{i=1}^{n} V_{ic}\, W_{ij},$$
where $n$ is the total number of labeled samples, as the random variable for the sum of weights for samples in the $c$th class to the $j$th cluster. The Central Limit Theorem applies to this sum of bounded random variables with finite mean and variance (see Theorem 1), and $Y_{c,j}$ is approximately normal.
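Given a labeled-sample membership matrix $W$ ($n \times K$) and integer class labels, all of the sums $Y_{c,j}$ can be computed at once as a matrix product; a small sketch (names are illustrative):

```python
import numpy as np

def class_cluster_weight_sums(W, labels, num_classes):
    """Y[c, j] = sum over labeled samples i of V_ic * W_ij.

    W      : n x K soft membership matrix for the labeled samples
    labels : length-n integer class labels assumed to lie in 0..num_classes-1
    """
    W = np.asarray(W)
    n = W.shape[0]
    V = np.zeros((n, num_classes))
    V[np.arange(n), np.asarray(labels)] = 1.0   # Bernoulli indicators V_ic
    return V.T @ W                              # C x K matrix of sums Y_{c,j}
```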
Consider now the test statistic
Fixing $j$ and $c$, assuming $W_{ij}$ and $V_{ic}$ are independent, and defining $m_q = |I_q|$, the number of indices $i$ for which $X^{(i)}$ has the $q$th distribution,
where $C$ is the number of classes. Then
Using these expressions for the mean and variance of $Y_{c,j}$, the Wald statistic for the $c$th class and $j$th cluster is
and the null hypothesis is rejected if $P(Z \ge \hat{z}) \le \alpha$.
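Once $\mathrm{E}[Y_{c,j}]$ and $\mathrm{Var}[Y_{c,j}]$ have been evaluated from the expressions above, the test is again a one-sided normal comparison. The sketch below takes those two quantities as inputs rather than re-deriving them, so it only illustrates the final standardization and decision step:

```python
import numpy as np
from scipy.stats import norm

def wald_soft_enrichment(y_cj, mean_y, var_y, alpha=0.05):
    """Wald test for the cth class and jth cluster.

    y_cj   : observed sum of weights Y_{c,j}
    mean_y : E[Y_{c,j}]   (from the expressions derived above)
    var_y  : Var[Y_{c,j}] (from the expressions derived above)
    Reject H0 if P(Z >= z) <= alpha for the standardized statistic z.
    """
    z = (y_cj - mean_y) / np.sqrt(var_y)
    p_value = norm.sf(z)
    return z, p_value, p_value <= alpha
```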
This section presents experimental results to demonstrate the functioning of the new statistical test. It is important to distinguish the nature of enrichments identified by the new test from the quality of clusters mined by a specific algorithm. The features evaluated are (i) whether the test is able to recognize clusters with partial memberships (soft assignments) as being significant, (ii) whether it leads to a higher number of assignments in soft clustering situations, and (iii) the variation in the number of enrichments as the entropy of clusters and significance levels are changed. For the purpose of this evaluation, consider the soft $k$-means algorithm, where membership probabilities at each stage of the iteration are nonzero across the clusters.
Table 1: Datasets.
Figure 2: Enrichments of synthetic data (Jaccard similarity between class labels and clusters = 1.0).
Table 1 describes the datasets used in this study; with the exception of the synthetic dataset, all are taken from the UCI KDD/ML data repository. In each case, the number of clusters to be identified is set equal to the number of natural classes present in the dataset.
Fig. 2 presents results on synthetic data involving four separable Gaussians in a two-dimensional layout. The enrichment $p$-values are also shown for all 16 combinations for the soft and hard versions of the Wald statistic as well as the hypergeometric test, which is commonly used for cluster evaluation. As can be seen, the qualitative trends are the same, so that for all stringent thresholds the results yield four clusters enriched with four different class labels.
Figure 3: Ionosphere data. (left) Soft assignments: $p$-values are derived from the Wald statistic for the $c$th class and $j$th cluster. (middle) Hard assignments: $p$-values are derived from the Wald statistic for the $c$th class and $j$th cluster. (right) Enrichment with the hypergeometric distribution. (Jaccard similarity between fuzzy $k$-means and the actual class labels: 0.5865.)
Fig. 3 presents a more complicated situation with the ionosphere dataset. This dataset involves two classes, and there are more tangible differences between the three statistical tests. Note that the Jaccard similarity between the fuzzy $k$-means clustering and the class labels is not a perfect 1. As a result, for various values of the $p$-value threshold, it is possible to get one, two, three, or four cells enriched by the Wald statistic (soft assignment), whereas the hypergeometric distribution can lead to only two or four cells enriched. The Wald statistic (hard assignment) also performs better than the hypergeometric distribution.
Fig. 4 more directly plots the number of enriched cells as the $p$-value cutoff is varied, using the vehicle dataset. The Wald statistics lead to a consistently greater number of enrichments compared to the hypergeometric test. A similar plot can be seen in Fig. 5.
A different type of evaluation is shown in Fig. 6(a), where the membership probabilities are artificially varied (from a hard membership) to impose a specified entropy on their distribution. As the entropy increases, the number of enrichments drops monotonically in the case of the Wald (soft) statistic, whereas the hypergeometric enrichment test does not account for the entropy in a smooth manner. Fig. 6(b) demonstrates the variation for a fixed value of the entropy but increasingly lax values of the $p$-value threshold. Again, the enrichments for the Wald (soft) statistic increase steadily. Similar plots for the breast tissue, steel plate faults, and glass datasets are shown in Figs. 7, 8, and 9, respectively. Finally, Fig. 10 superimposes the variation of the $p$-value cutoff and the entropy threshold to describe how the variation seen in previous plots manifests at all $p$-value thresholds, whereas the hypergeometric distribution is uniformly unable to provide a richer variety of enrichments.
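One plausible way to impose a specified entropy on a hard membership is to mix the one-hot assignment with the uniform distribution and solve for the mixing weight by bisection (entropy increases monotonically with the mixing weight). The sketch below illustrates this scheme; it is not necessarily the exact procedure used to generate the figures:

```python
import numpy as np

def impose_entropy(hard_label, K, target_entropy, iters=60):
    """Return a membership vector over K clusters whose Shannon entropy (in nats)
    approximates target_entropy, which must lie in [0, log K]."""
    one_hot = np.zeros(K)
    one_hot[hard_label] = 1.0
    uniform = np.full(K, 1.0 / K)

    def entropy(p):
        q = p[p > 0]
        return float(-(q * np.log(q)).sum())

    lo, hi = 0.0, 1.0                      # mixing weight lambda in [0, 1]
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        p = (1.0 - lam) * one_hot + lam * uniform
        if entropy(p) < target_entropy:
            lo = lam                       # more mixing needed to raise entropy
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return (1.0 - lam) * one_hot + lam * uniform
```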
Figure 5: Cardiotocography data. Number of enrichments at different $p$-value cutoffs with the three enrichment procedures (Jaccard similarity between fuzzy $k$-means and actual class labels: 0.7896).
This paper presented a new statistical test suitable for enrichment of soft clusters. It was shown how this test produces significantly more enrichments, tunable control of the number of enrichments, and smoother variation in enriched cells with entropy and $p$-value cutoffs. The method can be used as given here or embedded inside a cluster refinement algorithm for continuous evaluation and updating of clusters. Since few soft cluster enrichment methods exist, the framework here contributes a key methodology for clustering and cluster evaluation research.
Figure 6: Cardiotocography data. (top) Number of enrichments with a fixed $p$-value threshold but varying entropy. Note that the number of enrichments falls monotonically with increasing entropy. (bottom) Number of enrichments with fixed entropy and a varying $p$-value threshold. Note that the number of enrichments increases monotonically with increasing $p$-value threshold.
Figure 7: Breast tissue data. (Jaccard similarity between fuzzy $k$-means and the actual class labels: 0.7051.) (a) Number of enrichments at different $p$-value cutoffs with the three enrichment procedures. (b) Number of enrichments with a fixed $p$-value threshold but varying entropy. Note that the number of enrichments falls monotonically with increasing entropy. (c) Number of enrichments with fixed entropy and a varying $p$-value threshold. Note that the number of enrichments increases monotonically with increasing $p$-value threshold.
Figure 8: Steel plates faults data. (Jaccard similarity between fuzzy $k$-means and the actual class labels: 0.6681.) (a) Number of enrichments at different $p$-value cutoffs with the three enrichment procedures. (b) Number of enrichments with a fixed $p$-value threshold but varying entropy. Note that the number of enrichments falls monotonically with increasing entropy for the Wald statistic. (c) Number of enrichments with fixed entropy and a varying $p$-value threshold. Note that the number of enrichments increases monotonically with increasing $p$-value threshold.
Figure 9: Glass data. (a) Soft assignments: $p$-values are derived from the Wald statistic for the $c$th class and $j$th cluster. (b) Hard assignments: $p$-values are derived from the Wald statistic for the $c$th class and $j$th cluster. (c) Enrichment with the hypergeometric distribution. (Jaccard similarity between fuzzy $k$-means and the actual class labels: 0.7117.) (d) Number of enrichments at different $p$-value cutoffs with different enrichment procedures.
Figure 10: Glass data. In this example, assignments are taken directly from the class labels. The entropy is changed by modifying the membership probability of the class of every instance. (a) Number of enrichments with different $p$-value thresholds and fixed entropy. (b) The plot at left shows how the number of enrichments changes over the $p$-value thresholds and entropy. Note that the $p$-value is fixed for each of the spikes in this plot. For example, $\alpha$ remains 0.0020 in the interval between 0.0020 and 0.0028. The plot at right shows the change in the number of enrichments with entropy where the $p$-value threshold is fixed.
Acknowledgement: This work was supported in part by Department of Energy Grant DE-FG02-06ER25720 and NIGMS/NIH Grant 5-R01-GM078989.
Bezdek, J. (1974): Fuzzy mathematics in pattern classification. Ph.D. thesis, Cornell University, Ithaca, NY.
Bezdek, J. C. (1980): A convergence theorem for the fuzzy ISODATA clustering algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 2, pp. 1–8.
Chapelle, O.; Schölkopf, B.; Zien, A. (2006): Semi-Supervised Learning. MIT Press, Cambridge, MA.
Derraz, F.; Peyrodie, L.; Pinti, A.; Taleb-Ahmed, A.; Chikh, A.; Hautecoeur, P. (2010): Semi-automatic segmentation of multiple sclerosis lesion based active contours model and variational Dirichlet process. Computer Modeling in Engineering & Sciences, vol. 67, no. 2, pp. 95–118.
Ewens, W.; Grant, G. (2001): Statistical Methods in Bioinformatics. Springer.
Gnedenko, B. (1997): Theory of Probability. Gordon and Breach Science Publishers, The Netherlands, sixth edition.
Lin, Z.; Cheng, C. (2010): Creative design of multi-layer web frame structure using modified AHP and modified TRIZ clustering method. Computer Modeling in Engineering & Sciences, vol. 68, no. 1, pp. 25–54.
MacQueen, J. B. (1967): Some methods for classification and analysis of multivariate observations. In Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pp. 281–297. L. M. Le Cam and J. Neyman, editors, University of California Press.
MacQueen, J. B. (1974): Fuzzy Mathematics in Pattern Classification. Ph.D. thesis.
Musy, R. F.; Wynne, R. H.; Blinn, C. E.; Scrivani, J. A.; McRoberts, R. E. (2006): Automated forest area estimation via iterative guided spectral class rejection. Photogrammetric Engineering & Remote Sensing, vol. 72, no. 8, pp. 949–960.
Phillips, R. D.; Watson, L. T.; Wynne, R. H. (2007): Hybrid image classification and parameter selection using a shared memory parallel algorithm. Comput. Geosci., vol. 33, pp. 875–897.
Phillips, R. D.; Watson, L. T.; Wynne, R. H.; Ramakrishnan, N. (2009a): Continuous iterative guided spectral class rejection classification algorithm: Part 1. Technical report, Department of Computer Science, VPI&SU, Blacksburg, VA.
Phillips, R. D.; Watson, L. T.; Wynne, R. H.; Ramakrishnan, N. (2009b): Continuous iterative guided spectral class rejection classification algorithm: Part 2. Technical report, Department of Computer Science, VPI&SU, Blacksburg, VA.
Richards, J. A.; Jia, X. (1999): Remote Sensing Digital Image Analysis. Springer-Verlag, Berlin, third edition.
Tishby, N.; Pereira, F. C.; Bialek, W. (1999): The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368–377.
van Aardt, J. A. N.; Wynne, R. H. (2007): Examining pine spectral separability using hyperspectral data from an airborne sensor: An extension of field-based results. International Journal of Remote Sensing, vol. 28, pp. 431–436.
Wagstaff, K.; Cardie, C.; Rogers, S.; Schrödl, S. (2001): Constrained k-means clustering with background knowledge. In ICML '01, pp. 577–584.