Wenrn YANG a, Xinde LI b, Yong DENG c,d,e,*
aInstitute of Fundamental and Frontier Science, University of Electronic Science and Technology of China, Chengdu 610054, China
bKey Laboratory of Measurement and Control of CSE, School of Automation, Southeast University, Nanjing 210096, China
cSchool of Education, Shaanxi Normal University, Xi'an 710062, China
dSchool of Knowledge Science, Japan Advanced Institute of Science and Technology, Nomi, Ishikawa 923-1211, Japan
eDepartment of Management, Technology, and Economics, ETH Zurich, Zurich 8092, Switzerland
KEYWORDS: Dempster-Shafer evidence theory; Generalized evidence theory; Information fusion; K-means clustering; Open world assumption; Target recognition
Abstract: When the existing information does not contain all categories, the Generalized Evidence Theory (GET) can deal with information fusion. However, the question of how to determine the number of categories through GET is still intriguing. To address this question, a modified k-means clustering, named centers initialized clustering, is proposed, filling the gap of identification and completion of the frame of discernment. Based on this clustering method, the number of categories is determined. The initial centers selected by density keep the cluster results constant, enhancing the stability of the clustering results. Besides, constructing Generalized Basic Probability Assignment (GBPA) models in a conservative way improves the reliability of the results. The mass of the empty set in the combined GBPAs is the indicator of the number of categories. Experiments on real and artificial data sets are conducted to show the effectiveness.
Dempster-Shafer evidence theory (DS theory)1,2 has many advantages in dealing with uncertainty and unknown information, as it requires less information than probability theory.3,4 DS theory has been studied deeply over the past few years, in areas such as probabilistic data modelling,5 risk analysis,6,7 evidential reasoning,8,9 pattern recognition,10–12 decision making13,14 and the generalization of evidence.15,16
Although DS theory has been widely used, the conflict problem is still unsolved; that is, the combination results of highly conflicting evidence are counter-intuitive. To address this problem, new combination methods based on networks,17,18 gravity,19 conflict measurement20 and others21 have been proposed. However, these solutions are based on the closed world assumption, which assumes that all kinds of targets are known, and this assumption conflicts with the open world reality. In some real-world situations, target numbers are not constant. Take COVID-19 as an example: the number of strains changes all the time because of mutations, so the closed world assumption is not always reliable. Therefore, when the existing information does not contain all categories, the incompleteness of the frame of discernment is one source of conflict.
Based on the open world assumption, the Generalized Evidence Theory (GET) has been developed.15 The basic framework of the mass function is extended by viewing the empty set as the representation of unknown targets.
However, the determination of the exact number of categories is rarely studied. Recently, Liu and Deng found that the target number could be determined by the Elbow method,22 which requires manual observation. Since the empty set in GET represents the unknown, the motivation of this work is to take advantage of GET and output the accurate number of categories. Furthermore, the process of dividing unlabeled samples into several classes is called unsupervised learning and is deeply studied in the field of machine learning.23 Typical algorithms are clustering algorithms, including k-means clustering,24 density peak clustering,25 and affinity propagation clustering.26 Using clustering algorithms, it is possible to model the division of data into categories.
Therefore, the task of this paper is to complete the frame of discernment by clustering and to generate GBPAs for data with an incomplete frame. Through clustering, unlabeled data is separated into different clusters. All possible numbers of clusters are traversed, and the degree of data conflict is analyzed through GET. Since the conflict is minimized when the number of types is correct, the correct number of types can be obtained. However, there is always a difference between the clustering results and the real classification, and it is a challenge to make the clustering results as close to the real situation as possible.
In the following section, some necessary preliminaries are presented briefly. Then, the proposed clustering-based method is demonstrated. To illustrate the effectiveness, experiments on the Iris data set and a simulated Gaussian data set are conducted. Finally, conclusions are stated.
For a given question, all hypotheses Hi constitute the Frame of Discernment (FOD), denoted as Ω.
A Basic Probability Assignment (BPA) assigns each element in the power set P(Ω) a value, under the constraints of Eq. (3) and Eq. (4).
BPAs can be deemed a generalized probability distribution.5,27,28 Compared with the probability distribution, BPAs have the advantage of modeling uncertainty.29–31 Many functions have been proposed to handle BPAs, such as entropy functions,32–34 information volume,35–37 information quality measures,38 complexity analysis39,40 and uncertainty measures.41–44 For two BPAs, the combination rule in DS theory combines them into a new BPA, as represented in Eq. (5) and Eq. (6).
Information fusion is important and widely used.45,46 In addition, the inverse process of fusion, the de-combination method, has also been proposed.47
There are many mathematical models to explain evidence theory.48 For example, a geometric approach has been proposed to explore evidence theory.49 Nevertheless, an intriguing and long-unsolved problem is conflict management.50,51 A typical way is to consider the reliability of evidence.52–54 On the other hand, an incomplete frame of discernment is also an important cause of conflict. To address this issue, an open world model, called the transferable belief model, has been developed.55,56 In addition, another model to deal with the open world is GET.15 In DS evidence theory, m(∅) should always be zero. However, in GET, m(∅) is not necessarily zero, and the empty set is treated like other focal elements. The GET combination rule is shown in Eqs. (7)–(10).
K-means clustering24 classifies unlabeled data into different categories. To divide n samples into k clusters, k-means clustering assigns each sample to the nearest cluster center. Then the centers are updated with the mean of each cluster. The calculation steps are shown in Algorithm 1.
Algorithm 1 K-means clustering.
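As a reference point, the loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' exact implementation; the initial centers are passed in explicitly so that the density-based initialization described later can be plugged in:

```python
import numpy as np

def kmeans(X, centers, max_iter=100, tol=1e-6):
    """Plain k-means: assign each sample to its nearest center,
    then move each center to the mean of its cluster."""
    centers = centers.copy()
    for _ in range(max_iter):
        # Distances of every sample to every center, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # New center = mean of assigned samples (keep old center if empty)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(len(centers))])
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```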
The proposed method makes use of m(∅) to estimate the number of targets. The number of targets is traversed from 2 to N, and the corresponding m(∅) is calculated, where N is the assumed maximum number of possible species. The proposed method can be divided into two main steps. Firstly, samples are categorized into different groups based on distribution features. Secondly, GBPA models are constructed and m(∅) is taken as an indicator to determine the possible number of targets. Fig.1(a) visualizes the above steps, and the diagram of centers initialized clustering is shown in Fig.1(b).
One popular method to classify samples is k-means clustering. However, this method has a drawback: the clustering results keep changing across repeated experiments because of the randomly selected initial centers. Inspired by density peak clustering,25 our method initializes centers with samples that have high density and are far from other regions of higher density.
In our method, the density of each point is represented by Eq. (11), where dij denotes the distance between points i and j.
The higher the density, the more likely the point is a cluster center. The point with the largest density q is first chosen as an initial center and added to the center set Cse. Then, the remaining (k−1) centers are chosen by considering both their own density and the distribution of centers; that is, a newly added center should be far away from the existing centers and have high density. The distance between point i and its nearest center is denoted by γi, and the selected center is the one that maximizes q·γ. Since the choice of centers is based on the order of q·γ, the set of initial centers simply gains one point when the cluster number k increases by one. Moreover, the clustering result is constant because of the constant initial centers. The details are shown in Algorithm 2.
With Algorithm 2, the randomly selected initial centers in Algorithm 1 are replaced by centers selected based on density. This change keeps the output of clustering unchanged across repeated experiments and makes full use of the distribution features of the data sets.
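The center-selection procedure of Algorithm 2 can be sketched as follows. The Gaussian-kernel density and the cutoff distance dc are our assumptions borrowed from density peak clustering; Eq. (11) may differ in detail:

```python
import numpy as np

def init_centers(X, k, dc=1.0):
    """Density-based center initialization (sketch of Algorithm 2).
    The Gaussian-kernel density and cutoff dc are assumptions."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = np.exp(-(d / dc) ** 2).sum(axis=1)   # density q of each point
    centers = [int(rho.argmax())]              # densest point chosen first
    for _ in range(k - 1):
        gamma = d[:, centers].min(axis=1)      # distance to nearest chosen center
        score = rho * gamma                    # maximize q * gamma
        centers.append(int(score.argmax()))
    return np.array(centers)
```

Because the selection is greedy and deterministic, increasing k by one only appends a new center, matching the stability property claimed above.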
Fig.1 Diagram of the proposed method.
The most popular method to generate GBPAs is shown in Ref. 22. Since the parameters of GBPA models come entirely from the cluster results, misclassified points have a great influence on the GBPA models. In our method, GBPAs are generated in a conservative way. The details are introduced in the following.
With centers initialized clustering, samples are divided into k clusters. Usually, each cluster is used to generate a fuzzy triangle57 directly. In this article, the points whose distance to the nearest center point is greater than the ninth quantile are ignored in constructing fuzzy triangles, as shown in Fig.1(b). The generated fuzzy triangles are supposed to represent the category information based on clustering. However, the clustering result is not entirely reliable, since a boundary point between class A and class B that is closer to center A may actually belong to class B. Such misclassified points make the parameters of both fuzzy triangle A and fuzzy triangle B inaccurate. Therefore, in our method, the points far away from centers, which have a high possibility of being misclassified, are discarded and interpreted as noise.
With k fuzzy triangles {T1, T2, …, Tk}, k intersections {x1, x2, …, xk} are available. {ω1, ω2, …, ωk} are the intersections in descending order. Assuming there exists only one positive intersection ω, the degree of belonging to the proposition P is shown in Eq. (12).
Then, the generated GBPA is shown in Eq. (16) and Eq. (17). If the sum of the belonging degrees is greater than one, it is normalized directly; if it is less than one, the remainder is assigned to the empty set.
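A simplified sketch of the GBPA generation for one attribute is given below. It keeps only the core rule stated above (triangular memberships built from each cluster, normalization when the degrees sum to more than one, remainder to the empty set); the handling of the intersections ω and compound propositions in Eqs. (12)–(17) is omitted:

```python
def tri_membership(x, a, b, c):
    """Triangular membership with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def gbpa_from_triangles(x, triangles):
    """Generate a GBPA for one attribute value x.
    triangles: {class_label: (min, avg, max)} built from each cluster.
    Normalize if the belonging degrees sum to more than one; otherwise
    assign the remainder to the empty set (compound sets simplified away)."""
    m = {cls: tri_membership(x, *t) for cls, t in triangles.items()}
    s = sum(m.values())
    if s > 1:
        m = {cls: v / s for cls, v in m.items()}
        m["empty"] = 0.0
    else:
        m["empty"] = 1.0 - s
    return m
```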
Algorithm 2 Density based center finding.
Given a data set with n samples and s attributes, each attribute corresponds to a GBPA model. Therefore, a sample with s attributes generates s GBPAs m1, m2, …, ms. Each GBPA represents the degree of membership to subsets from the perspective of one attribute.
The s GBPAs are combined with Eq. (18), where ⊕ denotes the combination rule illustrated in Eq. (7). Since m(∅) is an indicator of the conflict degree, if the cluster number is larger than the real number of kinds, the increased number of misclassified samples will increase the value of m(∅).
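The pairwise combination in Eq. (18) can be sketched as follows, following our reading of the generalized combination rule in Ref. 15: m(∅) is the product of the two empty-set masses, and the remaining conjunctive mass is rescaled by (1 − m(∅))/(1 − K). The exact form of Eqs. (7)–(10) may differ:

```python
from itertools import product

def gcr(m1, m2):
    """Generalized combination rule of GET (sketch of Eqs. (7)-(10)).
    Focal elements are frozensets; frozenset() is the empty set."""
    empty = frozenset()
    K = 0.0       # total conflicting mass
    raw = {}      # unnormalized conjunctive combination
    for (B, b), (C, c) in product(m1.items(), m2.items()):
        A = B & C
        if A == empty:
            K += b * c
        else:
            raw[A] = raw.get(A, 0.0) + b * c
    m_empty = m1.get(empty, 0.0) * m2.get(empty, 0.0)
    if K == 1.0:                       # total conflict: all mass to the empty set
        return {empty: 1.0}
    m = {A: (1 - m_empty) * v / (1 - K) for A, v in raw.items()}
    m[empty] = m_empty
    return m
```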
Using the above steps, when the number of clusters k changes from 2 to a relatively large value, each k corresponds to a value of m(∅). If k is smaller than the true number of species, the empty set mass m(∅) of the generated GBPA will be larger than that of the correct clustering. If k is greater than the true number of species, the empty set mass m(∅) of the generated GBPA will decrease, but the m(∅) of the combined GBPA will increase because of the increased degree of conflict. Therefore, m(∅) is minimized when the number of species is correct.
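The overall sweep can be outlined as follows; `cluster`, `build_gbpas` and `combine` are placeholders for the components described above, and the names are our own:

```python
def predict_num_categories(X, N, cluster, build_gbpas, combine):
    """Sweep the candidate cluster number k from 2 to N and return the k
    whose combined empty-set mass m(empty) is minimal."""
    best_k, best_conflict = None, float("inf")
    for k in range(2, N + 1):
        labels = cluster(X, k)
        conflicts = []
        for x in X:
            gbpas = build_gbpas(x, X, labels)   # one GBPA per attribute
            fused = gbpas[0]
            for m in gbpas[1:]:
                fused = combine(fused, m)
            conflicts.append(fused.get(frozenset(), 0.0))
        avg = sum(conflicts) / len(conflicts)   # average combined m(empty)
        if avg < best_conflict:
            best_k, best_conflict = k, avg
    return best_k
```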
Under the closed world assumption, evidence theory has the advantage of efficiently modeling uncertainty58–60 and information fusion.61,62 It has been widely used in classification,63–65 decision-making systems,66,67 autonomous driving,68 human reliability analysis69,70 and many other engineering applications.71–74 However, real applications are often based on the open world assumption. We construct experiments on the Iris data set and a simulated Gaussian data set to show the effectiveness under the assumption of an incomplete frame of discernment.
The experiments are constructed through the following steps. Firstly, centers initialized clustering is used to separate the samples. Then, the GBPA generation and combination methods are used to produce the final m(∅). Different numbers of clusters result in different values of m(∅). Finally, the number of types is predicted based on m(∅).
The Iris data set75 consists of 150 samples with 4 attributes: sepal length, sepal width, petal length and petal width, abbreviated as SL, SW, PL and PW. There are 3 categories, setosa (a), versicolor (b) and virginica (c), with 50 samples in each category.
First of all, all samples are normalized to [0, 1].
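Assuming the usual column-wise min-max scaling (the normalization equation itself is not reproduced in this excerpt), the step can be written as:

```python
import numpy as np

def minmax_normalize(X):
    """Column-wise min-max scaling of a (samples x attributes) array to [0, 1].
    This is an assumed form; the paper's normalization may differ."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)
```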
Assuming the known categories are a and b, the FOD is {a, b}. Two center points are selected by Algorithm 2, and the randomly selected initial centers in classical k-means clustering are replaced by these two points. The clustering result is used to represent the distribution features of the data and to construct GBPA models. However, since clustering results always contain some misclassified points, a relatively conservative method is proposed: the outer points, whose distance to the nearest center point is greater than the ninth quantile, are ignored in generating the GBPAs.
Fig.2 Cluster results shown in attribute SL, SW, PL and PW.
Table 1 Specific fuzzy triangle feature information of Iris data set.
Distribution features of the Iris data are shown in Fig.2: Fig.2(a) demonstrates the data from the perspective of attributes SL and SW, and Fig.2(b) from the perspective of attributes PL and PW. The outer points, shown as gray points in the figures, are far from the centers and are likely to be misclassified. Therefore, these gray points, whose distance to the center points is larger than the ninth quantile, are discarded in constructing the GBPA models.
Then, the clustering result is used to construct fuzzy triangles. For each cluster, the maximum, average and minimum values of each attribute are used in Eq. (20). Table 1 shows the specific fuzzy triangle feature information after omitting the outer points.
The generated GBPAs are shown in Fig.3. Figs.3(a)–(d) correspond to attributes SL, SW, PL and PW respectively.
The m(∅) of the 4 attributes is calculated and combined in Table 2.
Fig.3 GBPA model for attributes SL, SW, PL and PW with 2 clusters.
Table 2 Sum of m(∅).
Then, the data is divided into 3 clusters and the above steps are repeated. The generated GBPAs are demonstrated in Fig.4. Compared with Fig.3, the increased number of clusters decreases the value of m(∅).
The number of clusters is then increased from 2 to 8, repeating the generation and combination steps. The change of m(∅) with the cluster number is illustrated in Fig.5.
As the cluster number k increases, the m(∅) of each attribute decreases overall, since the increased number of triangles occupies more space and the space labeled as unknown shrinks. However, the combined m(∅) reaches its minimum when the samples are clustered into 3 categories. The conflict degree is small when samples are classified correctly. If the number of categories is larger than the real number of kinds, the increased number of misclassified samples leads to an increased degree of conflict, manifested as an increase of the m(∅) value. Since the experimental result reaches its minimum when the cluster number is three, the complete frame of discernment is supposed to contain three elements.
Fig.4 GBPA model for attributes SL, SW, PL and PW with 3 clusters.
Fig.5 Change of m(∅) with cluster number.
In order to filter out outliers while retaining the category features, the criterion for judging a point as an outlier is that its distance from the nearest center point is greater than the ninth quantile of the distances. The ninth quantile, denoted as q = 0.9, corresponds to the results shown in Fig.5. An increase of q brings the risk of using misclassified samples, while a decrease of q increases the possibility of discarding correctly classified points and destroying category features. The experimental results for q changing from 0.85 to 0.95 are shown in Fig.6.
With different q, the overall trends of m(∅) in the combined GBPAs are all increasing. With the decrease of q, the number of outliers screened out increases; therefore, the established GBPA model loses some data features, resulting in a gradual increase of the conflict degree after fusion. Only when q ∈ {0.90, 0.91, 0.92} does m(∅) reach its minimum at a cluster number of 3. When q takes other values, the minimum points all occur at the beginning. If q is larger than 0.92, undeleted outliers cause m(∅) to keep increasing. If q is less than 0.90, the incorrectly removed points result in the loss of data features, making the conflict degree large even when the samples are separated into the right number of clusters. In conclusion, the choice of q is essential, and it is reasonable to choose q ∈ [0.90, 0.92] for the Iris data set.
Fig.6 Change of m(∅) in combined GBPAs with cluster number.
The experiments on the simulated Gaussian data set are constructed through the following steps. Firstly, the data set is generated with preset parameters. Then, centers initialized clustering is used to cluster the data. The GBPA generation and combination methods are used to produce the final m(∅). Different numbers of clusters result in different values of m(∅). Finally, the number of types is predicted based on m(∅).
Since the degree of conflict between GBPAs is revealed through the combination process, the dimension of the simulated data is set to 3, so that the generation of m(∅) involves two combination processes.
The artificial Gaussian data set consists of 4 distributions with the following characteristics:
where I is the identity matrix. Fig.7 demonstrates the generated data in the x and y dimensions.
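Data of this form can be generated along the following lines; the means, standard deviation and per-cluster sample counts below are placeholders, since the paper's exact parameters are not shown in this excerpt:

```python
import numpy as np

def make_gaussian_data(means, sigma=0.5, n_per=200, seed=0):
    """Simulated Gaussian mixture: one spherical Gaussian (covariance
    sigma^2 * I) per mean. The parameters here are placeholders, not
    the paper's preset values."""
    rng = np.random.default_rng(seed)
    return np.vstack([rng.normal(mu, sigma, (n_per, len(mu))) for mu in means])

# Example: 4 clusters in 3 dimensions, 800 samples in total
X = make_gaussian_data([(0, 0, 0), (3, 0, 0), (0, 3, 0), (0, 0, 3)])
```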
The clustering result is shown in Fig.8. Each cluster comes with a center, denoted as a black point; the outer points, labeled as gray points, are discarded in generating the GBPAs.
Fig.7 A Gaussian data set of 800 instances.
Fig.8 A Gaussian data set with centers and outer points.
Fig.9 GBPA model for attribute x.
Fig.10 Change of m(∅) with cluster number.
The GBPAs generated from the perspective of x are shown in Fig.9.
The clustering process is iterated and the corresponding m(∅) is generated. The results are shown in Fig.10. Just as for the Iris data set, the overall trend of m(∅) for each attribute is decreasing. The value of m(∅) also reaches its minimum when the cluster number k equals the real number of categories.
A clustering-based method to determine the number of categories and complete the FOD has been presented above. Firstly, data samples are clustered into several categories with initial centers chosen based on density peaks. Then, the cluster results are used to generate GBPAs after omitting the outer points. Finally, the FOD is determined by the value of m(∅) in the combined GBPAs. The effectiveness is illustrated on both a real data set and a simulated data set. However, the preset value used to define outliers has an impact on the final results. Hence, how to give a data-driven threshold requires further research.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The work was partially supported by the National Natural Science Foundation of China (No. 61973332) and the JSPS Invitational Fellowships for Research in Japan (Short-term).
CHINESE JOURNAL OF AERONAUTICS, 2023, Issue 4