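· Techniques and Methods ·
A Method for Calculating the Concentration and Mean Length of Cell-Free Fetal DNA Fragments in Maternal Plasma
LU Jianbo1, GAO Huafang1, CAO Zongfu1, LI Qian1, CAI Ruikun1, YAN Yousheng2, MA Xu1*
1. National Research Institute for Family Planning (Beijing, 100081); 2. Gansu Provincial Maternity and Child Care Hospital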
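Objective: To establish a new mathematical model for calculating the concentration and mean length of cell-free fetal DNA (cffDNA) fragments in the peripheral blood of pregnant women. Methods: Because maternal and fetal DNA fragments in maternal peripheral blood have different length distributions, a new algorithm was developed by combining the K-means algorithm with the EM algorithm. The new method was used to analyze the DNA fragment length distribution, the cffDNA concentration, and the mean fragment length in maternal peripheral blood samples. Results: ① In the three samples, the mean lengths of maternally derived DNA fragments were 167 bp, 166 bp, and 165 bp, and the mean cffDNA lengths were 160 bp, 163 bp, and 158 bp, respectively. ② The cffDNA concentrations in maternal plasma were 10.4%, 5.5%, and 17.4%, respectively, close to the results calculated with the method based on the percentage of the Y chromosome. Conclusion: The new method proposed by our group is effective and accurate for analyzing the concentration and mean length of cffDNA in maternal peripheral blood.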
EM algorithm; prenatal diagnosis; next-generation sequencing; cell-free fetal DNA concentration; DNA concentration
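The annual incidence of birth defects in China is 4%~6% [1], which places a heavy burden on families and society. To reduce the occurrence of birth defects, prenatal screening should be carried out as early as possible. The methods in common use are amniocentesis, chorionic villus sampling, and fetal umbilical cord blood sampling, but all of them are invasive and carry some risk for both the fetus and the mother. In 1997, Lo et al. [2] found small amounts of cell-free fetal DNA (cffDNA) fragments in maternal plasma. These fragments are scarce in maternal peripheral blood; they can be detected from the 7th week of gestation, increase almost every week during the first 3 months, plateau over the following months, rise rapidly before delivery, and disappear quickly within 2 h after delivery [3-8]. These properties allow cffDNA to be detected specifically in early pregnancy without interference from previous pregnancies, which opens a new route to noninvasive prenatal testing. This paper proposes a new method for calculating the cffDNA concentration in maternal peripheral blood, together with the mean length of maternal DNA fragments and the mean length of cffDNA fragments.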
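1.1 Study procedure
Studies have shown that most cell-free DNA fragments in maternal peripheral blood are shorter than 300 bp [9-14] and that maternal and fetal fragments have different length distributions. Based on this, we combined the K-means algorithm with the EM algorithm to propose a new modeling method, and used it to analyze high-throughput sequencing data of DNA fragments from maternal peripheral blood. The cffDNA concentration and mean length obtained with this method were compared with the results of the existing method based on the percentage of the Y chromosome [15-16].
1.2 Processing of next-generation sequencing fragments from maternal peripheral blood
The mathematical model was built on simulated data, and the next-generation sequencing results from the peripheral blood of 3 pregnant women were then analyzed, including data such as sequence lengths and mutated bases. The raw sequencing data were in fastq format. The sequencing data were aligned to the reference sequence (hg19), and fragments duplicated relative to the reference sequence were removed.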
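1.3 Length statistics of cell-free DNA fragments in maternal peripheral blood
Software written by our group in R was used to compute the overall length distribution of the mixed maternal and fetal DNA fragments and to plot it. The resulting distribution was compared with the one reported in [14].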
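The length statistic described here can be sketched as follows. This is an illustration in Python, not the group's R software; the function names, the 300 bp cutoff (taken from the observation in section 1.1), and the demo lengths are assumptions made for the example.

```python
from collections import Counter


def length_distribution(lengths, max_len=300):
    """Return {length: fraction of fragments} for lengths in 1..max_len.

    max_len=300 is an assumed cutoff, not a parameter from the paper.
    """
    kept = [n for n in lengths if 1 <= n <= max_len]
    if not kept:
        raise ValueError("no fragment lengths within range")
    counts = Counter(kept)
    total = len(kept)
    return {n: counts[n] / total for n in sorted(counts)}


def mode_length(dist):
    """Fragment length with the highest relative frequency."""
    return max(dist, key=dist.get)


# Made-up fragment lengths (bp), for illustration only.
demo_lengths = [165, 166, 165, 160, 167, 165, 143, 320]
dist = length_distribution(demo_lengths)
peak = mode_length(dist)
```

Plotting `dist` as a bar or line chart gives a length distribution of the kind shown in Figure 1.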
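1.4 Calculation of initial values for maternal and fetal DNA fragments
The K-means clustering algorithm was used to compute the initial values, namely the concentration, mean, and variance of the maternal and fetal cell-free DNA fragments.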
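A minimal sketch of such an initialization, assuming two clusters over one-dimensional fragment lengths and assuming the shorter cluster is the fetal one. The function name, the quartile starting centers, and the demo data are choices made for this example and are not taken from the paper.

```python
from statistics import fmean, pvariance


def kmeans_init(lengths, iters=100):
    """Two-cluster 1-D K-means.

    Returns [(weight, mean, variance), (weight, mean, variance)],
    shorter cluster first (assumed fetal), longer cluster second.
    """
    xs = sorted(float(x) for x in lengths)
    # Assumed starting centers: lower and upper quartile positions.
    centers = [xs[len(xs) // 4], xs[(3 * len(xs)) // 4]]
    if centers[0] == centers[1]:
        centers = [xs[0], xs[-1]]
    for _ in range(iters):
        groups = ([], [])
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        if not groups[0] or not groups[1]:
            raise ValueError("data do not split into two clusters")
        new_centers = [fmean(g) for g in groups]
        if new_centers == centers:  # assignments are stable
            break
        centers = new_centers
    return [(len(g) / len(xs), fmean(g), pvariance(g)) for g in groups]


# Made-up lengths (bp), for illustration only.
init = kmeans_init([150, 152, 154, 166, 167, 168, 169])
```

The cluster weights serve as starting concentrations, and the cluster means and variances as starting values for the two normal components.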
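1.5 Building the mathematical model
For the mixed cell-free DNA fragments obtained by next-generation sequencing, the maternal cell-free DNA fragments and the cffDNA fragments are each assumed to follow a normal distribution, and the EM algorithm is used to build the mathematical model, as follows.
Assume that the lengths of cffDNA fragments in maternal peripheral blood follow a normal distribution: $X_i \sim N(\mu_1, \sigma_1^2)$, where $\mu_1$ and $\sigma_1^2$ are the mean and variance of the cffDNA fragment length.
Assume that the lengths of maternally derived DNA fragments in maternal peripheral blood follow a normal distribution: $X_i \sim N(\mu_2, \sigma_2^2)$, where $\mu_2$ and $\sigma_2^2$ are the mean and variance of the maternal cell-free DNA fragment length.
Assume that the lengths of the mixed maternal and fetal cell-free DNA fragments follow a mixture of these two normal distributions: $X_i \sim \alpha_1 N(\mu_1, \sigma_1^2) + \alpha_2 N(\mu_2, \sigma_2^2)$, where $\alpha_1$ and $\alpha_2$ are the fetal and maternal concentrations and $\alpha_1 + \alpha_2 = 1$.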
E-step:
M-step:
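The formulas for the two steps are not given here. For a two-component normal mixture, the textbook EM updates take the form below; it is an assumption that the authors' version matches them. The symbols $\gamma_{ik}$ (the posterior probability that fragment $i$ belongs to component $k$), $\varphi$ (the normal density), and $n$ (the number of fragments) are introduced for this sketch, while $\alpha_k$, $\mu_k$, and $\sigma_k^2$ are the parameters defined above.

```latex
% Mixture density
f(x_i) = \alpha_1\,\varphi(x_i;\mu_1,\sigma_1^2) + \alpha_2\,\varphi(x_i;\mu_2,\sigma_2^2),
\qquad
\varphi(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

% E-step: responsibilities under the current parameters
\gamma_{ik} = \frac{\alpha_k\,\varphi(x_i;\mu_k,\sigma_k^2)}
                   {\sum_{j=1}^{2}\alpha_j\,\varphi(x_i;\mu_j,\sigma_j^2)},
\qquad k = 1, 2

% M-step: re-estimate the parameters from the responsibilities
\alpha_k = \frac{1}{n}\sum_{i=1}^{n}\gamma_{ik},
\qquad
\mu_k = \frac{\sum_{i=1}^{n}\gamma_{ik}\,x_i}{\sum_{i=1}^{n}\gamma_{ik}},
\qquad
\sigma_k^2 = \frac{\sum_{i=1}^{n}\gamma_{ik}\,(x_i-\mu_k)^2}{\sum_{i=1}^{n}\gamma_{ik}}
```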
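The E-step and M-step are iterated until the estimates converge to fixed values. This yields the parameter values, namely the concentration, mean, and variance of the fetal and maternal cell-free DNA: $\alpha_1, \mu_1, \sigma_1^2$ and $\alpha_2, \mu_2, \sigma_2^2$.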
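The iteration can be sketched in Python as below. This is a generic two-component EM fit under the normality assumptions stated above, not the authors' program. The function names, the log-likelihood stopping rule, the variance floor `min_var`, and the synthetic data are all assumptions made for the example. The synthetic components are placed 20 bp apart so that the fit is easy to check, which is much wider than the gap reported for real samples.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_normals(xs, init, tol=1e-8, max_iter=500, min_var=1e-6):
    """Fit alpha1*N(mu1, var1) + alpha2*N(mu2, var2) by EM.

    init is [(alpha1, mu1, var1), (alpha2, mu2, var2)], for example
    starting values obtained from K-means clustering.
    """
    params = [tuple(p) for p in init]
    n = len(xs)
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: posterior probability of each component for each fragment.
        resp, ll = [], 0.0
        for x in xs:
            w = [a * normal_pdf(x, m, v) for a, m, v in params]
            s = max(sum(w), 1e-300)  # guard against underflow
            ll += math.log(s)
            resp.append([wk / s for wk in w])
        # M-step: weighted re-estimates of concentration, mean, variance.
        params = []
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nk
            params.append((nk / n, mu, max(var, min_var)))
        if abs(ll - prev_ll) < tol:  # log-likelihood has stopped changing
            break
        prev_ll = ll
    return params


# Synthetic demo data (not from the paper): 40% short, 60% long fragments.
rng = random.Random(0)
data = [rng.gauss(150.0, 5.0) for _ in range(800)]
data += [rng.gauss(170.0, 5.0) for _ in range(1200)]
fit = em_two_normals(data, init=[(0.5, 145.0, 25.0), (0.5, 175.0, 25.0)])
```

In `fit`, the first tuple holds the estimated concentration, mean, and variance of the shorter component and the second tuple those of the longer component.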
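The results of this method were compared with those of the method based on the percentage of the Y chromosome [15-16]: chrY% = 0.157F + 0.007(1 - F), where chrY% is the Y-chromosome percentage and F is the cffDNA concentration. That method requires the Y-chromosome percentage to be measured experimentally.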
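Solving the relation for F gives F = (chrY% - 0.007) / (0.157 - 0.007). The short Python sketch below implements both directions. The constant and function names are hypothetical, and chrY% is assumed to be expressed in the same units as the two coefficients.

```python
COEF_FETAL = 0.157       # coefficient of F in the relation above
COEF_BACKGROUND = 0.007  # coefficient of (1 - F) in the relation above


def chry_percent(f):
    """chrY% predicted for a cffDNA concentration f (as a fraction)."""
    return COEF_FETAL * f + COEF_BACKGROUND * (1.0 - f)


def fetal_fraction_from_chry(chry):
    """Invert the relation: cffDNA concentration from a measured chrY%."""
    return (chry - COEF_BACKGROUND) / (COEF_FETAL - COEF_BACKGROUND)


# Round trip with an arbitrary concentration of 10.5%.
recovered = fetal_fraction_from_chry(chry_percent(0.105))
```

The method applies only when a Y-chromosome signal is present, that is, in pregnancies with a male fetus.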
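2.1 Length distribution of the mixed DNA fragments
The length distributions of cell-free DNA in the peripheral blood of the 3 pregnant women are shown in Figure 1. The three plots show that the abundance of cell-free DNA fragments in maternal peripheral blood increases exponentially with length from 0 to 165 bp, peaks at 165 bp, and then decreases as fragment length increases further. We conclude that the length distribution of the mixed cell-free DNA fragments in maternal peripheral blood follows a normal distribution, which is broadly consistent with the findings of Chandrananda et al. [14].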
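2.2 Comparison of the results of the two methods
For the first maternal serum sample, the mean length of maternally derived cell-free DNA fragments was 167 bp and the mean length of cffDNA fragments was 160 bp. The cffDNA concentration was 10.4%, compared with 10.5% from the Y-chromosome percentage method. For the second sample, the mean length of maternally derived cell-free DNA fragments was 166 bp and the mean length of cffDNA fragments was 163 bp. The cffDNA concentration was 5.5%, compared with 5.6% from the Y-chromosome percentage method. For the third sample, the mean length of maternally derived cell-free DNA fragments was 165 bp and the mean length of cffDNA fragments was 158 bp. The cffDNA concentration was 17.4%, compared with 15.9% from the Y-chromosome percentage method. The fragment length distribution can therefore be used to estimate the fetal DNA concentration approximately. The calculated means are close to those reported by Chandrananda et al. [14]. The results indicate that maternal cell-free DNA fragment lengths are concentrated around 166 bp, whereas fetal cell-free DNA fragment lengths are concentrated around 160 bp.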
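The agreement between the two methods can be summarized from the figures reported above. The following Python lines compute, for each sample, the absolute difference between the two concentration estimates in percentage points. The variable names are chosen for this illustration and the values are those stated in the text.

```python
# (sample, length-based estimate %, Y-chromosome-based estimate %)
reported = [
    (1, 10.4, 10.5),
    (2, 5.5, 5.6),
    (3, 17.4, 15.9),
]

# Absolute difference in percentage points, rounded to one decimal place.
differences = {
    sample: round(abs(length_based - chry_based), 1)
    for sample, length_based, chry_based in reported
}

largest_gap_sample = max(differences, key=differences.get)
```

The first two samples differ by 0.1 percentage points and the third by 1.5, so the third sample shows the largest gap between the two methods.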
Figure 1 Length distributions of the mixed DNA fragments in the peripheral blood of the 3 pregnant women
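In 2012, the team of Professor Y. M. Lo in Hong Kong [9] proposed a method for calculating the cffDNA concentration and wrote the corresponding software (FetalQuant). Using massively parallel sequencing (MPS), they selected 20 000 single nucleotide polymorphism (SNP) loci from the dbSNP database and studied four genotype combinations: AA(AA), AA(AB), AB(AA), and AB(AB), where the genotype outside the parentheses is the mother's and the one inside is the fetus's. FetalQuant relies on massively parallel sequencing, which is costly, and it converges slowly. The new algorithm proposed by our group, based on the K-means clustering algorithm and the EM algorithm, costs less and converges quickly.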
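To date, noninvasive prenatal testing based on cffDNA in maternal peripheral blood has found many applications: fetal sex determination, identification of maternal-fetal RhD blood group incompatibility, identification of fetal aneuploidy, and the diagnosis of conditions such as Y-linked disorders, achondroplasia, myotonic dystrophy, and thalassemia. In these applications, accurate calculation of the cffDNA concentration is essential for diagnosis [7]. For example, in the prenatal diagnosis of autosomal recessive diseases, the relative concentrations of mutant and wild-type alleles in maternal peripheral blood can be used to infer whether the maternal mutation has been transmitted to the fetus [17]. The length distribution of cell-free DNA fragments in maternal plasma and the calculation of the fetal and maternal mean lengths are important for studying the characteristics of cell-free DNA in maternal plasma [14].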
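This paper focuses on the length distribution, concentration, and mean length of cffDNA fragments in maternal peripheral blood. Because maternally derived cell-free DNA fragments and cffDNA fragments are mixed in maternal peripheral blood and hard to tell apart, the concentration and mean length are difficult to calculate. Our group proposed a new method for calculating these parameters, which combines the K-means clustering algorithm with the EM algorithm. The program was written in MATLAB, and the method was studied by applying it to DNA from maternal serum samples and comparing it with the traditional method. The method exploits the different length distributions of maternal and fetal cell-free DNA fragments and builds a separate normal distribution model for the mother and for the fetus. Compared with existing calculations of the cffDNA concentration, the preliminary results of this study show that the proposed method is accurate and effective. However, because of high-throughput sequencing errors and other sources of error, a large amount of experimental data is still needed to refine the model parameters.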
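[1] Yang QW. Establishment of a method for selective amplification of cell-free fetal DNA in maternal plasma and its application in noninvasive prenatal testing for trisomy 21 [D]. Changchun: Jilin University, 2015. (in Chinese)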
[2] Lo YM, Corbetta N, Chamberlain PF, et al. Presence of fetal DNA in maternal plasma and serum[J]. Lancet, 1997, 350: 485-487.
[3] Goya R, Sun MG, Morin RD, et al. SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors[J]. Bioinformatics, 2010, 26: 730-736.
[4] Lo YM, Chan KC, Sun H, et al. Maternal plasma DNA sequencing reveals the genome-wide genetic and mutational profile of the fetus[J]. Sci Transl Med, 2010, 2(61): 61ra91.
[5] Chitty LS, Lo YM. Noninvasive Prenatal Screening for Genetic Diseases Using Massively Parallel Sequencing of Maternal Plasma DNA[J]. Cold Spring Harb Perspect Med, 2015, 5(9): a023085.
[6] Lo YM, Hjelm NM, Fidler C, et al. Prenatal diagnosis of fetal RhD status by molecular analysis of maternal plasma[J]. N Engl J Med, 1998, 339: 1734-1738.
[7] Lo YM, Lun FM, Chan KC, et al. Digital PCR for the molecular detection of fetal chromosomal aneuploidy[J]. Proc Natl Acad Sci USA, 2007, 104: 13116-13121.
[8] Lo YM, Tein MS, Lau TK, et al. Quantitative analysis of fetal DNA in maternal plasma and serum: implications for noninvasive prenatal diagnosis[J]. Am J Hum Genet, 1998, 62: 768-775.
[9] Jiang P, Chan KC, Liao GJ, et al. FetalQuant: deducing fractional fetal DNA concentration from massively parallel sequencing of DNA in maternal plasma[J]. Bioinformatics, 2012, 28(22): 2883-2890.
[10] Fan HC, Blumenfeld YJ, Chitkara U, et al. Analysis of the size distributions of fetal and maternal cell-free DNA by paired-end sequencing[J]. Clin Chem, 2010, 56(8): 1279-1286.
[11] van der Vaart M, Pretorius PJ. The origin of circulating free DNA[J]. Clin Chem, 2007, 53(12): 2215.
[12] Jiang P, Chan CW, Chan KC, et al. Lengthening and shortening of plasma DNA in hepatocellular carcinoma patients[J]. Proc Natl Acad Sci USA, 2015, 112(11): E1317-1325.
[13] Hernández-Gómez M. Non invasive prenatal test (NIPT) in maternal blood by parallel massive sequencing. Initial experience in Mexican women and literature review[J]. Ginecol Obstet Mex, 2015, 83(5): 277-288.
[14] Chandrananda D, Thorne NP, Bahlo M. High-resolution characterization of sequence signatures due to non-random cleavage of cell-free DNA[J]. BMC Med Genomics, 2015, 8: 29.
[15] Xu XP, Gan HY, Li FX, et al. A method to quantify cell-free fetal DNA fraction in maternal plasma using next generation sequencing: its application in non-invasive prenatal chromosomal aneuploidy detection[J]. PLoS One, 2016, 11(1): e0146997.
[16] Chiu RW, Akolekar R, Zheng YW, et al. Non-invasive prenatal assessment of trisomy 21 by multiplexed maternal plasma DNA sequencing: large scale validity study[J]. BMJ, 2011, 342: c7401.
[17] Lun FM, Chiu RW, Chan KC, et al. Microfluidics digital PCR reveals a higher than expected fraction of fetal DNA in maternal plasma[J]. Clin Chem, 2008, 54(10): 1664-1672.
[Editor in charge: WANG Lina]
The research on calculating concentration and mean of cell-free fetal DNA in maternal plasma
LU Jianbo1, GAO Huafang1, CAO Zongfu1, LI Qian1, CAI Ruikun1, YAN Yousheng2, MA Xu1*
1. Human Genetics Resource Center, National Research Institute for Family Planning, Beijing, 100081; 2. Gansu Provincial Maternity and Child Care Hospital.
*Corresponding author: genetic88@126.com
Objective: To propose a new mathematical model for calculating the concentration and mean fragment length of cell-free fetal DNA (cffDNA) in pregnant women's plasma. Methods: Few methods have been proposed for estimating the concentration of cffDNA in pregnant women's plasma, and most existing ones are based on single nucleotide polymorphisms (SNPs). In this study, a new method was developed that combines the K-means algorithm with the Expectation-Maximization (EM) algorithm and exploits the different length distributions of maternal and fetal DNA fragments in pregnant women's plasma. The DNA fragment length distribution, the cffDNA concentration, and the mean fragment length were analyzed with the new method. Results: ① In the three samples, the mean lengths of maternal DNA fragments were 167 bp, 166 bp, and 165 bp, and the corresponding mean lengths of fetal DNA fragments were 160 bp, 163 bp, and 158 bp, respectively. ② The fractional concentrations of cffDNA in the three samples were 10.4%, 5.5%, and 17.4%, respectively, similar to the results based on the percentage of the Y chromosome. Conclusion: Numerical experiments show that the new method, which combines the K-means algorithm with the EM algorithm, is effective and accurate for calculating the concentration and mean length of cffDNA in pregnant women's plasma.
Expectation-Maximization algorithm; Non-invasive prenatal diagnosis; Next generation sequencing; Cell-free fetal DNA; DNA concentration
10.3969/j.issn.1004-8189.2017.06.004
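National Key Research and Development Program of China (2016YFC1000307); sub-project of the National Key Research and Development Program of China (2016YFC1000307-10); General Program of the Science and Technology Innovation Fund of the National Research Institute for Family Planning (2017GJM04)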
2017-01-01
2017-03-13
*Corresponding author: genetic88@126.com