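Abstract: Building on genetic programming (GP), this paper proposes a best rule genetic algorithm (BRGA) that optimizes classification rules to obtain the best classification rule set. The algorithm can tune the relevant parameters of the classifier model and, at the cost of a moderate increase in iterations, substantially improves classification accuracy while remaining flexible and interpretable. Its performance was tested on six gene datasets. Simulation results show that, compared with methods from the literature, the proposed algorithm substantially reduces computational complexity and redundancy while maintaining high classification accuracy and stability.
Keywords: best rule genetic algorithm; microarray; genetic programming; classification rules; computational complexity
CLC number: TP391 Document code: A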
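Biomedical research shows that the pathogenesis of most human diseases, such as cancer, is fundamentally linked to genes. Microarray data are obtained by converting the images produced in sample experiments into a gene expression matrix: the rows represent genes, the columns represent class-labeled samples, and each element describes the expression level of one gene in one sample.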
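As a concrete picture of this layout, the following minimal sketch stores a toy expression matrix with genes as rows and samples as columns. The gene names, class labels, and values are invented for illustration and do not come from any dataset used in this paper; the helper names `expression_level` and `sample_profile` are likewise hypothetical.

```python
# Toy gene expression matrix: rows are genes, columns are samples.
# All names and numbers below are invented for illustration only.
genes = ["g1", "g2", "g3"]
sample_labels = ["tumor", "tumor", "normal", "normal"]
expression = [
    [5.1, 4.8, 1.2, 0.9],  # g1 across the four samples
    [0.7, 1.1, 3.9, 4.2],  # g2 across the four samples
    [2.0, 2.1, 1.9, 2.2],  # g3 across the four samples
]


def expression_level(gene, sample_index):
    """Return the expression level of `gene` in one sample (one matrix element)."""
    return expression[genes.index(gene)][sample_index]


def sample_profile(sample_index):
    """Return one column: the expression of every gene in a single sample."""
    return [row[sample_index] for row in expression]
```

In real microarray data the matrix has thousands of rows and only tens of columns, which is the imbalance behind the dimensionality problem.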
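Because the number of genes measured by microarray chip technology[1] far exceeds the number of samples, the main obstacle as dimensionality grows is the "curse of dimensionality" that arises when computing in a high-dimensional feature space. Of the large number of genes in a microarray, only a few are meaningful for sample classification and show distinctive features. Selecting feature genes before classifying samples is therefore crucial, since it directly affects the performance of the classifier that is subsequently built. Microarray classification, as a means of exploring biomarkers, has become an important topic in bioinformatics. Because there are many cancer types and potential cancer subtypes, extending tumor classification to multiple tumor classes yields datasets with more classes and very few samples, which makes the problem even more challenging.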
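Several studies report that using a genetic algorithm in the gene selection stage improves classification performance on microarray data[1-2], and genetic algorithms have been widely applied to a variety of difficult problems, including data classification[3-4]. This paper proposes a best rule genetic algorithm (Best Rule Genetic Algorithm, BRGA). It uses a classification algorithm based on genetic optimization to generate classification rules: each rule is represented as a binary vector, the rule set is initialized, the fitness function and the initial population size are set, and a certain number of best classification rules are produced through mutation. Experiments on six gene expression datasets are used to verify the performance of the algorithm.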
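Microarray data classification generally comprises two parts: 1) gene selection; 2) construction of the classifier model. For gene selection, reference [5] uses a rank-value scoring algorithm (RBS), which accounts well for the correlations among genes, greatly reduces the dimensionality of the gene matrix, and reduces computational complexity to some extent. For classifier construction, it proposes the LCR method, which forms classification rules from very few genes and improves the interpretability of the algorithm. The rule formation process nevertheless has several shortcomings: the rule formation framework in the classifier model is overly strict and prone to overfitting, and the iterative process that generates the large rule set is cumbersome and produces many redundant rules, which leads to high computational complexity and slow convergence.
Classifier construction is the core of the whole technique. Traditional microarray classification methods include weighted voting (WV)[6], k-nearest neighbors (kNN)[7], support vector machines (SVM)[8], Fisher's linear discriminant analysis (LDA)[9], artificial neural networks (ANN)[10], genetic programming (GP)[11], least squares logistic regression[12], and naive Bayes[13]. Because these methods focus solely on classification performance and cannot provide further medical or biological evidence, they tend to produce rigid classification systems that are unstable, costly, and hard to extend. Decision tree[14] and random forest[15] algorithms build classifier models from decision rules, and the rules they obtain capture, in some sense, the correlations among an organism's genes. However, small differences in the training samples can cause large changes in the tree structure, which leaves the classifier unstable, so these methods also have significant limitations.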
1 Basic idea of the BRGA method
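Building on genetic optimization, the BRGA algorithm treats the classification rule set as a population and represents each classification rule as a binary string. It computes the fitness of each rule, where a rule corresponds to a comparison relation between gene attributes. Over several generations of reproduction involving selection, crossover, and mutation, the population is optimized iteratively to obtain the best classification rules, that is, those with high fitness.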
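The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the BRGA implementation: the text does not specify the rule encoding, fitness function, or operators, so the sketch assumes that a rule compares two genes ("predict class 1 if gene i is expressed above gene j"), that each gene index takes `BITS` bits, that fitness is training accuracy, and that reproduction uses truncation selection with elitism, one-point crossover, and bit-flip mutation. All function names, parameters, and data are hypothetical.

```python
import random

BITS = 3  # assumed: number of bits used to encode each gene index


def decode(rule, n_genes):
    """Turn a binary string into two gene indices (i, j), wrapped into range."""
    i = int("".join(map(str, rule[:BITS])), 2) % n_genes
    j = int("".join(map(str, rule[BITS:])), 2) % n_genes
    return i, j


def predict(rule, sample):
    """Assumed rule form: class 1 if gene i is expressed above gene j, else 0."""
    i, j = decode(rule, len(sample))
    return 1 if sample[i] > sample[j] else 0


def fitness(rule, samples, labels):
    """Assumed fitness: fraction of training samples classified correctly."""
    hits = sum(predict(rule, s) == y for s, y in zip(samples, labels))
    return hits / len(samples)


def evolve(samples, labels, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Evolve a population of binary-encoded rules; return the best rule and its fitness."""
    rng = random.Random(seed)

    def score(rule):
        return fitness(rule, samples, labels)

    # Initial rule set: random binary strings.
    pop = [[rng.randint(0, 1) for _ in range(2 * BITS)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half unchanged (elitism).
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, 2 * BITS - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with probability p_mut.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = parents + children
    best = max(pop, key=score)
    return best, score(best)


if __name__ == "__main__":
    # Invented toy data: each sample lists the expression of four genes.
    toy_samples = [[5.0, 1.0, 2.0, 2.0], [4.0, 0.5, 2.1, 1.9],
                   [1.0, 4.0, 2.0, 2.2], [0.8, 3.5, 1.9, 2.0]]
    toy_labels = [1, 1, 0, 0]
    best_rule, best_fitness = evolve(toy_samples, toy_labels)
    print(decode(best_rule, 4), best_fitness)
```

Because the fitter half of the population is carried over unchanged, the best fitness in this sketch never decreases from one generation to the next; other selection schemes would trade that guarantee for more diversity.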
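4 Conclusion
The proposed BRGA algorithm addresses the generally slow construction of classification decision rules from microarray gene expression values. By adjusting the fitness of suitable rules and the related parameters to optimize the initial rule set, the algorithm converges quickly to the best classification rule set. Its performance was verified on six datasets, and the experimental results show that BRGA achieves high accuracy with very little classification computation time (CPU time). Owing to the limitations of the experimental conditions and of current biological knowledge, the algorithm still needs further improvement and refinement.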
References
[1] HENGPRAPROHM S,MUKVIBOONCHAI S,THAMMASANG R,et al.A GA-based classifier for microarray data classification[C]// Proceedings of 2010 International Conference on Intelligent Computing and Cognitive Informatics(ICICCI 2010).Kuala Lumpur:IEEE Computer Society,2010:199-202.
[2] OOI C H,TAN P.Genetic algorithms applied to multiclass prediction for the analysis of gene expression data[J].Bioinformatics,2003,19 (1):37-44.
[3] BANDYOPADHYAY S,MURTHY C A,PAL S K.Pattern classification with genetic algorithms[J].Pattern Recognition Letters,1995,16(8):801-808.
[4] BANDYOPADHYAY S,MURTHY C A,PAL S K.VGA-classifier: design and applications[J]. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics,2000, 30(6):890-895.
[5] GANESHKUMAR P,AMMU V,VICTOIRE T A A.Building decision rules using a novel data driven method for microarray data classification[C]//2011 International Conference on Process Automation,Control and Computing(PACC 2011).Coimbatore:IEEE Express Conference Publishing,2011:1-6.
[6] GOLUB T R,SLONIM D K,TAMAYO P,et al.Molecular classification of cancer: class discovery and class prediction by gene expression monitoring[J].Science,1999,286(5439):531-537.
[7] WU Wei, XING E P,MYERS C,et al.Evaluation of normalization methods for cDNA microarray data by kNN classification[J].BMC Bioinformatics,2005,6:191-211.
[8] YOONKYUNG L,CHEOLKOO L.Classification of multiple cancer types by multicategory support vector machines using gene expression data[J]. Bioinformatics, 2003,19(9): 1132-1139.
[9] JAEWON L,JUNGBOK L,MIRA P,et al.An extensive comparison of recent classification tools applied to microarray data[J].Computational Statistics & Data Analysis, 2005, 48(4): 869-885.
[10]KHAN J,JUN S,RINGNER M,et al.Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks[J].Nat Med, 2001,7(6):673-679.
[11]HONG J H,CHO S B. The classification of cancer based on DNA microarray data that uses diverse ensemble genetic programming[J].Artificial Intelligence in Medicine,2006,36(1):43-58.
[12]TANG E K,SUGANTHAN P N,YAO X.Gene selection algorithms for microarray data based on least square support vector machine[J]. BMC Bioinf,2006,7:95-110.
[13]JOHNSON W E,LI C,RABINOVIC A. Adjusting batch effects in microarray expression data using empirical bayes methods[J]. Biostatistics,2007,8(1):118-127.
[14]YU WANG,IGOR V T,MARK A H,et al.Gene selection from microarray data for cancer classification—a machine learning approach[J]. Computational Biology and Chemistry,2005, 29(1): 37-46.
[15]RAMON D U, SARA A A. Gene selection and classification of microarray data using random forest[J]. BMC Bioinformatics,2006,7:3.
[16]WELSH J B,SAPINOSO L M,SU A I,et al.Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer[J].Cancer Res, 2001,61: 5974-5978.
[17]ALON U,BARKAI N,NOTTERMAN D A,et al.Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays[J].PNAS, 1999, 96(12): 6745-6750.
[18]Broad Institute. Cancer program data sets[EB/OL]. [2012-01-01]. http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi.
[19]ASH A A,MICHAEL B E,ERIC R D,et al.Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling[J]. Nature, 2000, 403(4): 503-511.
[20]BHATTACHARJEE A,RICHARDS W,STAUNTON J,et al.Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses[J]. PNAS, 2001,98(24): 13790-13795.
[21]SCOTT L P,PABLO T,MICHELLE G,et al.Gene expression-based classification and outcome prediction of central nervous system embryonal tumors[EB/OL]. [2012-01-01].http://www.broadinstitute.org/mpr/CNS/.
[22]SHI Chao,CHEN Lihui.Feature dimension reduction for microarray data analysis using locally linear embedding[C]//Proceedings of 3rd Asia-Pacific Bioinformatics Conference(APBC 2005).Singapore:Imperial College Press,2005:211-217.
[23]TAN A,NAIMAN D,XU L,et al.Simple decision rules for classifying human cancers from gene expression profiles[J]. Bioinformatics, 2005, 21(20): 3896-3904.
[24]FUREY T S,CRISTIANINI N,DUFFY N,et al.Support vector machine classification and validation of cancer tissue samples using microarray data[J].Bioinformatics, 2000, 16(10): 906-914.
[25]JUNBAI W,TROND H B,INGE J,et al.Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data[J]. BMC Bioinformatics,2003, 4:60-71.