
Spectroscopy and Spectral Analysis, 2016, No. 11

    Distinguishing the Rare Spectra with the Unbalanced Classification Method Based on Mutual Information

    LIU Zhong-bao1*, REN Juan-juan2, KONG Xiao3

1. School of Computer and Control Engineering, North University of China, Taiyuan 030051, China
2. Key Laboratory of Optical Astronomy, NAOC, Chinese Academy of Sciences, Beijing 100012, China
3. National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100012, China

Distinguishing rare spectra from the majority of stellar spectra is an important issue in astronomy. Because the number of rare spectra is much smaller than that of the majority, many traditional classifiers cannot work effectively: they focus only on the overall classification accuracy and pay too little attention to the rare spectra. In view of this, the relationship between the decision tree and mutual information is discussed on the basis of a summary of traditional classifiers, and a cost-free decision tree based on mutual information is proposed in this paper to improve the performance of distinguishing the rare spectra. In the experiments, we investigate the performance of the proposed method on the K-type, F-type, G-type and M-type datasets from the Sloan Digital Sky Survey (SDSS) Data Release 8. Compared with several traditional classifiers, the proposed method can complete the rare-spectra distinguishing task effectively.

    Unbalanced classification; Mutual information; Rare spectra; Decision tree

    Introduction

Stellar spectral classification is an important research area in astronomy, and researchers have proposed many effective classification methods. Wavelet transform was utilized for spectral analysis by Starck[1]; a BP neural network was used to classify spectra by Gulati[2]; a stellar spectral classification method based on principal component analysis (PCA) and neural networks was proposed by Bailer-Jones[3]; sparse representation and dictionary learning were applied to stellar spectral classification by Hernandez[4]; the Monte Carlo local outlier factor (MCLOF) algorithm was used to find unusual stellar spectra in large-scale spectral datasets[5]; a weighted frequent pattern tree was constructed by Cai[6] to mine stellar spectral association rules; locally linear embedding (LLE) was used to classify stellar spectra by Bu[7]; Gray proposed MKCLASS, an expert computer program designed to classify stellar spectra on the MK spectral classification system[8]; isometric feature mapping (ISOMAP) and the support vector machine (SVM) were utilized for stellar spectral classification by Bu[9]; the self-organizing map was applied to stellar spectral classification by Bazarghan[10]; principal component analysis and the nearest neighbor algorithm were successively used to extract spectral features and classify stellar spectra by Dan[11]; the advantages of the covering algorithm and the kernel technique were combined to classify celestial spectra by Yang[12]; a data warehouse was used to classify galaxy spectra by Sun[13]; PCA and a Bayesian algorithm were successively used for feature extraction and spectral classification by Liu[14]; and a series of stellar spectral classification methods based on manifold-based discriminant analysis (MDA)[15] were proposed by Liu[16-17].

The above spectral classification methods aim to maximize classification accuracy, and therefore they cannot deal with the unbalanced classification problem. Unbalanced data refers to datasets in which the numbers of samples per class differ widely, and this problem exists in stellar spectral classification: in rare-spectra recognition, many traditional classifiers assign the rare spectra to the majority class, so although their overall classification accuracies are high, the results do not satisfy our expectation. For this reason the unbalanced classification problem has attracted wide attention. In unbalanced classification, the rare spectra receive more attention than the majority, and the cost of misclassifying a rare spectrum is much greater; the unbalanced classification problem is therefore closely related to cost-sensitive learning. Cost-sensitive classification overcomes the shortcoming that traditional classifiers focus only on classification accuracy; it tries to reduce high-cost misclassifications by assigning different costs to the various types of misclassification. Its premise is that the misclassification costs are given in advance, so that the classification results are reliable. In practice, however, it is difficult to specify precise misclassification costs; meanwhile, the assumption of traditional classifiers that all misclassification costs are equal does not hold in unbalanced classification, and therefore the reliability cannot be guaranteed. In view of this, the relationship between the decision tree and mutual information is discussed on the basis of a summary of traditional classifiers, and a cost-free decision tree based on mutual information is then proposed to improve the classification of the rare spectra.

The rest of this paper is organized as follows. In Section 1, the unbalanced classification method based on mutual information is proposed; the experiments on the SDSS dataset are provided in Section 2; Section 3 concludes our work.

    1 The unbalanced classification method based on mutual information

    1.1 Background Knowledge

    (1) Entropy

Suppose D is the dataset and C is the class set. The entropy of D is defined as follows.

$\mathrm{entropy}(D) = -\sum_{i=1}^{|C|} \Pr(c_i) \log_2 \Pr(c_i)$  (1)

where |C| denotes the number of classes and Pr(c_i) denotes the probability of the class c_i in the dataset D.

    (2) Information gain

Suppose A_i is an attribute of D with ν values, so that D can be divided into the disjoint subsets D_1, D_2, …, D_ν. The entropy of D after splitting on A_i is defined as follows.

$\mathrm{entropy}(A_i, D) = \sum_{j=1}^{\nu} \frac{|D_j|}{|D|}\,\mathrm{entropy}(D_j)$  (2)

The information gain of the attribute A_i is

$\mathrm{gain}(D, A_i) = \mathrm{entropy}(D) - \mathrm{entropy}(A_i, D)$  (3)

    (3) Information gain ratio

The information gain prefers attributes with many values. To correct this preference, the information gain ratio is defined as

$\mathrm{gainratio}(D, A_i) = \mathrm{gain}(D, A_i) \Big/ \left( -\sum_{s=1}^{\nu} \frac{|D_s|}{|D|} \log_2 \frac{|D_s|}{|D|} \right)$  (4)

where s ranges over the possible values of the attribute A_i and D_s denotes the subset of D whose samples take the sth value of A_i.
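To make formulas (1)—(4) concrete, the following sketch computes them for discrete attributes with NumPy; the function names are ours, not part of the paper.

```python
import numpy as np

def entropy(labels):
    """Eq. (1): -sum_i Pr(c_i) * log2 Pr(c_i) over the classes in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_entropy(labels, attribute):
    """Eq. (2): entropy of D after splitting on an attribute, i.e. the
    size-weighted entropy of the subsets D_1, ..., D_v."""
    total = len(labels)
    result = 0.0
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        result += len(subset) / total * entropy(subset)
    return result

def gain(labels, attribute):
    """Eq. (3): gain(D, A_i) = entropy(D) - entropy(A_i, D)."""
    return entropy(labels) - split_entropy(labels, attribute)

def gain_ratio(labels, attribute):
    """Eq. (4): information gain divided by the split information, which is
    simply the entropy of the attribute's own value distribution."""
    split_info = entropy(attribute)
    return gain(labels, attribute) / split_info if split_info > 0 else 0.0
```

For example, gain_ratio(np.array([0, 0, 1, 1]), np.array(['a', 'a', 'b', 'b'])) returns 1.0, since the attribute separates the two classes perfectly.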

    1.2 Construction of the decision tree

The decision tree is one of the most important data mining methods; it determines which class a sample belongs to by applying classification criteria organized in a tree structure. Decision trees are easy to understand and classify quite well, and they are therefore widely used in practice. The general construction process is as follows. First, the training data are placed at the root node and a splitting attribute is selected according to the splitting criterion. Second, according to the values of the splitting attribute, the training dataset is divided into several subsets, which form the first layer below the root. Then each node of the first layer is treated as a root node and the above steps are repeated; when the termination conditions are satisfied, the construction process stops and the decision tree is complete.

The decision tree algorithm is described as follows. Suppose D denotes the dataset and A denotes the candidate attribute set.

Algorithm: Generate_decision_tree

Input: the dataset D and the candidate attribute set A

Output: decision tree T

Step 1: Create the node T;

Step 2: If all the data in D belong to the same class C, then T is a leaf node and its label is marked as the class C;

Step 3: If A = Null, or the size of the remaining data is smaller than a threshold given by the user in advance, then T is a leaf node labeled with the most popular class in D;

Step 4: Compute the information gain ratio of each attribute in A;

Step 5: Select the attribute with the maximal information gain ratio as the test attribute test_A;

Step 6: Mark the node T as test_A;

Step 7: If test_A is continuous, determine the splitting threshold of the attribute;

Step 8: For each value A_i of test_A, create a branch from the node T with the condition test_A = A_i;

Step 9: Let D_i be the subset satisfying test_A = A_i. If D_i = Null, then add a leaf node labeled with the most popular class in D; else add the node returned by Generate_decision_tree(D_i, A - test_A);

Step 10: Prune the decision tree.
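A minimal runnable sketch of the construction procedure above, restricted to discrete attributes and omitting Step 7 (continuous splits) and Step 10 (pruning); rows are dicts mapping attribute names to values, and all names are ours.

```python
import math
from collections import Counter

def _entropy(labels):
    """Eq. (1) for a plain Python list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def _gain_ratio(data, labels, attr):
    """Eq. (4): information gain of `attr` divided by its split information."""
    n = len(labels)
    values = [row[attr] for row in data]
    cond = 0.0
    for v in set(values):
        sub = [labels[i] for i, x in enumerate(values) if x == v]
        cond += len(sub) / n * _entropy(sub)          # Eq. (2)
    split_info = _entropy(values)
    gain = _entropy(labels) - cond                    # Eq. (3)
    return gain / split_info if split_info > 0 else 0.0

def generate_decision_tree(data, labels, attributes, min_size=5):
    """Steps 1-9 of Generate_decision_tree for discrete attributes."""
    if len(set(labels)) == 1:                         # Step 2: pure node -> leaf
        return labels[0]
    if not attributes or len(labels) < min_size:      # Step 3: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Steps 4-6: pick the attribute with the largest gain ratio
    test_a = max(attributes, key=lambda a: _gain_ratio(data, labels, a))
    tree = {test_a: {}}
    for v in set(row[test_a] for row in data):        # Steps 8-9: branch per value
        idx = [i for i, row in enumerate(data) if row[test_a] == v]
        tree[test_a][v] = generate_decision_tree(
            [data[i] for i in idx], [labels[i] for i in idx],
            [a for a in attributes if a != test_a], min_size)
    return tree
```

The returned tree is a nested dict; min_size plays the role of the user-given threshold in Step 3, and the empty-subset case of Step 9 cannot arise here because branches are created only for values actually present in the data.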

    1.3 Pruning methods

The purpose of pruning is to speed up classification and improve classification accuracy. There are two families of pruning methods: pre-pruning and post-pruning. Pre-pruning tries to stop the construction of the decision tree as early as possible by means of a threshold; since the threshold is difficult to determine in advance, the reliability of the resulting tree cannot be guaranteed. Post-pruning starts after the decision tree has been constructed. Generally, the error of each node in a subtree is first estimated, and if the number of errors is large, the subtree is pruned. In practice, post-pruning is superior to pre-pruning, and therefore post-pruning is applied in this paper.

    1.4 Cost-free decision tree based on mutual information

Traditional classifiers can be divided into cost-sensitive and cost-free ones. The performance of a cost-sensitive classifier relies on cost information given by the user in advance. However, it is difficult to specify costs in practice, and even when given they may be wrong; in that situation the performance of the cost-sensitive classifier cannot be guaranteed. To solve this problem, researchers have proposed cost-free classifiers, which need no cost information. In this paper, the cost-free classifier is discussed in the context of stellar spectral classification, and we propose a new decision tree that requires no cost information. For concreteness we take binary classification as an example: one class is positive and the other is negative. The method can be described as follows.

(1) The classification results of a classifier are stored in the confusion matrix shown in formula (5). The confusion matrix can be normalized and thereby transformed into a new confusion matrix with only one free variable.

$C = \begin{pmatrix} TN & FP \\ FN & TP \end{pmatrix}$  (5)

where C is the confusion matrix, TN denotes the number of correctly classified negative spectra, FP denotes the number of negative spectra misclassified as positive, FN denotes the number of positive spectra misclassified as negative, and TP denotes the number of correctly classified positive spectra.
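As a sketch of step (1), the binary confusion matrix of formula (5) and its row normalization can be computed as follows; after normalizing, each row is determined by a single error rate. The normalization shown is our reading of the step, and the names are ours.

```python
import numpy as np

def confusion_matrix(t, y):
    """Binary confusion matrix of Eq. (5): rows = actual class, cols = predicted.
    Convention: class 0 = negative, class 1 = positive."""
    tn = np.sum((t == 0) & (y == 0))
    fp = np.sum((t == 0) & (y == 1))
    fn = np.sum((t == 1) & (y == 0))
    tp = np.sum((t == 1) & (y == 1))
    return np.array([[tn, fp], [fn, tp]], dtype=float)

def normalize_rows(c):
    """Row-normalize so each row sums to 1; each row then has the form
    [1 - e, e] for a single error rate e, which is how the matrix is
    reduced to few free parameters."""
    return c / c.sum(axis=1, keepdims=True)
```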

(2) In the construction of the decision tree based on mutual information, the cost information is treated as an unknown variable and the information entropy produced by each node of the decision tree is weighted, as shown in formula (6).

$\mathrm{entropy}_{\lambda}(D) = -\sum_{i=1}^{m} \lambda_i \Pr(c_i) \log_2 \Pr(c_i)$  (6)

where D denotes the spectra, m denotes the number of classes (in this section, m = 2), and λ_i denotes the percentage of the ith class.

(3) The value of the free variable is obtained by maximizing the mutual information between the predicted label and the actual label, as shown in formula (7).

$\alpha^{*} = \arg\max_{\alpha} I(t;\, y), \quad y = g(x)$  (7)

where α is the unknown variable, t denotes the actual label, y denotes the predicted label, and g(x) denotes the inner information of the decision tree for the spectrum x.
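A sketch of step (3): estimate I(t; y) from the joint label distribution and scan candidate values of the free variable. How α enters the decision rule is our thresholding assumption (y = 1 when g(x) ≥ α); the paper only states that α is chosen to maximize the mutual information.

```python
import numpy as np

def mutual_information(t, y):
    """I(t; y) in bits, estimated from the empirical joint distribution
    of the actual labels t and predicted labels y (both 0/1)."""
    joint = np.zeros((2, 2))
    for ti, yi in zip(t, y):
        joint[ti, yi] += 1
    joint /= joint.sum()
    pt = joint.sum(axis=1, keepdims=True)   # marginal of t
    py = joint.sum(axis=0, keepdims=True)   # marginal of y
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (pt @ py)[mask]))

def best_alpha(t, g, alphas):
    """Eq. (7): keep the alpha maximizing I(t; y), where g holds the
    inner information g(x) of the tree for each spectrum."""
    scores = [(mutual_information(t, (g >= a).astype(int)), a) for a in alphas]
    return max(scores)[1]
```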

(4) The way of pruning the decision tree based on mutual information is similar to that of the traditional decision tree. The purpose of pruning here is to keep the mutual information maximal: if the mutual information increases or stays unchanged after pruning a subtree, the subtree is pruned; otherwise it is kept.
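The pruning rule in step (4) can be sketched as follows for a single branch of the nested-dict tree built earlier, reusing mutual_information from the previous sketch; a full post-pruner would apply this bottom-up over all subtrees, and the helper names are ours.

```python
from collections import Counter

def predict(tree, row):
    """Walk the nested-dict tree to a leaf label (unseen value -> class 0)."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row[attr], 0)
    return tree

def try_prune(tree, attr, value, val_data, val_t):
    """Prune one branch if mutual information does not decrease, step (4)."""
    subtree = tree[attr][value]
    if not isinstance(subtree, dict):
        return  # already a leaf
    before = mutual_information(val_t, [predict(tree, r) for r in val_data])
    # Candidate: collapse the subtree to the majority class of the validation
    # samples reaching it (a simplification of estimating per-node errors).
    reached = [t for r, t in zip(val_data, val_t) if r[attr] == value]
    leaf = Counter(reached).most_common(1)[0][0] if reached else 0
    tree[attr][value] = leaf
    after = mutual_information(val_t, [predict(tree, r) for r in val_data])
    if after < before:
        tree[attr][value] = subtree  # pruning hurt MI -> restore the subtree
```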

1.5 Algorithm description

Suppose X denotes the stellar spectral dataset, consisting of the training dataset X_train and the test dataset X_test.

Input: the stellar spectral dataset X, including the rare spectra and the majority of the stellar spectra

Output: the label of each spectrum in X_test

Step 1: Divide the dataset X into two parts, one for training and the other for testing.

Step 2: Preprocess the dataset X, e.g. noise reduction and normalization.

Step 3: Apply the decision tree based on mutual information to the training dataset X_train to obtain the classification criteria.

Step 4: Predict the label of each spectrum in X_test with the classification criteria.

Step 5: Obtain the classification accuracy by comparing the predicted and actual labels of the spectra in X_test.
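A compact sketch of Steps 1—5, using scikit-learn utilities for the split and scoring. The classifier object stands in for the mutual-information decision tree, which is not provided by any standard library, and the preprocessing shown is a simple stand-in for the paper's noise reduction and normalization.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def run_experiment(X, y, clf, train_fraction=0.5, seed=0):
    """Steps 1-5: split, preprocess, train, predict, score.
    clf is any object with fit/predict methods."""
    # Step 1: split into training and test sets (stratified, so the rare
    # class appears in both parts)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_fraction, random_state=seed, stratify=y)
    # Step 2: simple preprocessing -- scale each spectrum to unit maximum
    X_train = X_train / X_train.max(axis=1, keepdims=True)
    X_test = X_test / X_test.max(axis=1, keepdims=True)
    # Steps 3-4: learn the classification criteria, predict the test labels
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Step 5: compare predicted and actual labels
    return accuracy_score(y_test, y_pred)
```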

    2 Experimental Analysis

The performance of the proposed decision tree based on mutual information in distinguishing the rare spectra from the majority of the stellar spectra is investigated in this section. The experimental dataset is from the Sloan Digital Sky Survey (SDSS) Data Release 8[18]. It includes 4 subclasses of K-type spectra, K1-type, K3-type, K5-type and K7-type, with signal-to-noise ratios (SNRs) at the 60 level, together with F-, G- and M-type spectra selected at the SNR levels listed in Table 1; the total numbers of spectra are given in Table 1(a)—(d).

    The rare spectra and the majority of the stellar spectra are listed in Table 2(a)—(d).

    2.1 Performance on distinguishing the rare spectra

In our experiments, 30%, 40%, 50%, 60% and 70% of the above spectral datasets are respectively used for training, and the remainders are used for testing. The proposed decision tree based on mutual information is applied to investigate its performance in distinguishing the rare spectra. The experimental results are shown in Table 3(a)—(d); the first two columns of each table give the total number of training or test spectra together with its proportion, and the number of rare spectra among them.

Table 1(a) The total number of K stars with SNR 60

Table 1(b) The total number of F stars with SNR 30

Table 1(c) The total number of G stars with SNR 40

Table 1(d) The total number of M stars with SNR 30

Table 2(a) The majority vs. the rare spectra of K stars with SNR 60

Table 2(b) The majority vs. the rare spectra of F stars with SNR 30

Table 2(c) The majority vs. the rare spectra of G stars with SNR 40

Table 2(d) The majority vs. the rare spectra of M stars with SNR 40

    Table 3(a) The experimental results on the K-type dataset

    Table 3(b) The experimental results on the F-type dataset

    Table 3(c) The experimental results on the G-type dataset

    Table 3(d) The experimental results on the M-type dataset

It can be seen from Table 3(a)—(d) that as the training size rises, the classification accuracies of the proposed method in distinguishing the rare spectra show an upward tendency. Concretely, as the training proportion rises from 30% to 70%, the classification accuracies on the K-type, F-type, G-type and M-type datasets increase from 0.5450 to 0.8526, from 0.5506 to 0.7874, from 0.5053 to 0.8325 and from 0.5333 to 0.7624, respectively. In terms of average classification accuracy, the proposed method performs quite well on the above datasets, with all accuracies exceeding 60%. In general, the proposed decision tree based on mutual information can complete the rare-spectra distinguishing task on the SDSS spectral datasets.

    2.2 Comparative experiments with several traditional classifiers

In this section, we investigate the performance of the proposed method compared with several traditional classifiers, namely the support vector machine (SVM) and k nearest neighbor (kNN). 50% of each of the K-, F-, G- and M-type spectral datasets is used for training, and the remainder is used for testing. The comparative experimental results are shown in Table 4.

    Table 4 The comparative experimental results on the SDSS dataset

It can be seen from Table 4 that the performance of the proposed method in distinguishing the rare spectra is much better than that of SVM and kNN. Specifically, in terms of average performance, the average classification accuracies of the proposed method are more than 20% higher than those of SVM and kNN. It can be concluded that the main advantage of the proposed decision tree based on mutual information is its effectiveness in distinguishing the rare spectra.

    3 Conclusions

To overcome the unbalanced and cost-sensitive classification problems of traditional classifiers, a decision tree based on mutual information is proposed in this paper. First, the confusion matrix storing the classification results is normalized and transformed into a new confusion matrix with only one free variable. Second, the cost information is treated as an unknown variable and the information entropy produced by each node of the decision tree is weighted. Third, the value of the free variable is obtained by maximizing the mutual information between the predicted label and the actual label. Finally, the decision tree based on mutual information is pruned in a way similar to the traditional decision tree. Comparative experiments with several traditional classifiers on the SDSS DR8 datasets verify the effectiveness of the proposed method in distinguishing the rare spectra. Whether the proposed method is also effective on other types of spectra, such as galaxy spectra, is left for future work.

    Acknowledgement: We are very grateful to the anonymous referee for many useful comments and suggestions.

    [1] Stark P B, Herron M M, Matteson A. Applied Spectroscopy, 1993, 47(11): 1820.

    [2] Gulati R K, Gupta R, Gothoskar P. AJ, 1994, 426(1): 340.

    [3] Bailer-Jones C A L, Irwin M, Hippel T. MNRAS, 1998, 298: 361.

    [4] Hernandez R D, Barreto H P, Robles L A, et al. Experimental Astronomy, 2014, 38: 193.

    [5] Wei P, Luo A L, Li Y B, et al. MNRAS, 2013, 431: 1800.

    [6] Cai J H, Zhao X J, Sun S W, et al. RAA, 2013, 13(3): 334.

    [7] Gray R O, Corbally C J. AJ, 2014, 147(80): 1.

    [8] Bu Y D, Pan J C, Jiang B, et al. PASJ, 2013, 65: 81.

    [9] Bu Y D, Chen F Q, Pan J C. New Astronomy, 2014, 28: 35.

    [10] Bazarghan M. Astrophysics and Space Science, 2012, 337: 93.

[11] Dan D M, Hu Z Y, Zhao Y H. Spectroscopy and Spectral Analysis, 2003, 23(1): 182.

    [12] Yang J F, Wu C F, Luo A L, et al. Pattern Recognition and Artificial Intelligence, 2006, 19(3): 368.

    [13] Sun S W, Luo A L, Zhang J F. Astronomical Research and Technology, 2007, 4(3): 276.

[14] Liu R, Jin H M, Duan F Q. Spectroscopy and Spectral Analysis, 2010, 30(3): 838.

    [15] Liu Z B, Pan G Z, Zhao W J. Journal of Electronics & Information Technology, 2013, 35(9): 2047.

    [16] Liu Z B, Wang Z B, Zhao W J. Spectroscopy and Spectral Analysis, 2014, 34(1): 263.

    [17] Liu Z B, Gao Y Y, Wang J Z. Spectroscopy and Spectral Analysis, 2015, 35(1): 263.

    [18] Almeida J S, Prieto C A. AJ, 2013, 763: 1.



Foundation item: Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi (2014142); Program for the Outstanding Innovative Teams of Higher Learning Institutions of Shanxi

    10.3964/j.issn.1000-0593(2016)11-3746-06

    Received: 2015-12-14; accepted: 2016-04-10

Biography: LIU Zhong-bao (1981—), associate professor, North University of China. *Corresponding author, e-mail: liu_zhongbao@hotmail.com
