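The Whole Genome Data Analysis of Sanjiang Cattle
SONG Nana1,2, ZHONG Jincheng1,2, CHAI Zhixin1,2, WANG Qi1,2, HE Shiming3, WU Jinbo3, JIAN Shanglin4, RAN Qiang5, MENG Xin5, HU Hongchun4
(1Key Laboratory of Animal Genetics and Breeding of State Ethnic Affairs Commission and Ministry of Education, Southwest University for Nationalities, Chengdu 610041; 2Institute of Tibetan Plateau Research, Southwest University for Nationalities, Chengdu 610041; 3Animal Husbandry Science Institute of Aba Prefecture, Wenchuan 623000, Sichuan; 4Animal Husbandry Station of Aba Prefecture, Wenchuan 623000, Sichuan; 5Animal Husbandry Station of Wenchuan County, Wenchuan 623000, Sichuan)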
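【Objective】To study the genetic diversity of the Sanjiang cattle population and to examine its genetic variation at the genome level.【Method】Total genomic DNA was extracted from 50 individuals and mixed at equal concentration and equal volume to construct a pooled DNA sample. The genomic DNA was randomly sheared with a Covaris S2, DNA fragments of 500 bp were recovered by electrophoresis, and a DNA library was constructed and sequenced on an Illumina HiSeq 2000. Short reads were aligned to the bovine reference genome (UMD 3.1) with BWA to detect genomic variants in Sanjiang cattle. The resequencing data were analyzed with SAMtools, Picard-tools, GATK and Reseqtools, and SNPs and indels were annotated with the Ensembl, DAVID and dbSNP databases.【Result】Whole-genome resequencing generated 77.8 Gb of sequence data, with a sequencing depth of 25.32× and a coverage of 99.31%. In total, 778 403 444 reads and 77 840 344 400 bases were obtained, of which 673 670 505 reads (86.55%) and 67 341 451 555 bases (86.51%) mapped to the reference genome (UMD 3.1); 635 242 898 reads (81.61%) and 63 512 636 924 bases (81.59%) mapped as pairs. A total of 20 477 130 SNPs and 1 355 308 indels were identified, of which 2 147 988 SNPs (2.4%) and 90 180 indels (6.7%) were novel. Among all SNPs, 989 686 (4.83%) were homozygous and 19 487 444 (95.17%) were heterozygous, a homozygous/heterozygous ratio of 1:19.7. There were 14 800 438 transitions and 6 680 058 transversions, a transition/transversion (TS/TV) ratio of 2.215. There were 727 splice-site SNPs, 117 SNPs that changed a start codon to a non-start codon, 530 SNPs that introduced a premature stop codon, and 88 SNPs that changed a stop codon to a non-stop codon. A total of 57 621 non-synonymous and 83 797 synonymous SNPs were detected, a non-synonymous/synonymous ratio of 0.69. Non-synonymous SNPs were distributed over 9 017 genes, of which 567 matched genes previously reported for important economic traits: 471, 77, 21, 10 and 8 genes related to meat quality, disease resistance, milk production, growth and reproduction, respectively, including genes with overlapping functions. Among the indels, 693 180 (51.15%) were deletions and 662 148 (48.85%) were insertions; 161 198 (11.89%) were homozygous and 1 194 110 (88.11%) were heterozygous. Most variants were located in intergenic regions and introns. Genome-wide heterozygosity (H), nucleotide diversity (Pi) and theta W of Sanjiang cattle were 7.6×10⁻³, 0.0039 and 0.0040, respectively, indicating relatively rich genetic diversity. Tajima's D for the Sanjiang cattle population was -0.06832, which may reflect unbalanced selection within the population.【Conclusion】This study provides genomic data to support further analysis of the genetic mechanisms underlying economic traits and the conservation of the genetic diversity of the Sanjiang cattle breed.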
Sanjiang cattle; genome; next generation sequencing; SNP; indel
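【Significance】Sanjiang cattle originate from Wenchuan County in the Aba Tibetan Autonomous Prefecture of Sichuan Province. The townships of Sanjiang, Baishi, Shuimo and Yingxiu are the main production area, and the breed is also found in Li County, Maowen and neighboring counties [1]. Sanjiang cattle have a relatively long trunk, good draught performance, good meat quality, and strong disease resistance and adaptability. The breed is the product of long-term natural and artificial selection and represents a valuable gene pool. However, local economic and social development, changes in farming practice, and the 2008 Wenchuan earthquake, which restricted the spatial layout of the production area, have sharply reduced both farming scale and population size, and the breed is now on the verge of extinction. Protecting the genetic resources of Sanjiang cattle is therefore particularly important [2].【Progress】The genome contains all of the genetic information of a species, and whole-genome sequencing is the core technology for reading it, revealing the diversity and complexity of genomic information. The earliest sequencing work dates from the mid-1960s [3-5]. First-generation sequencing was established in the late 1970s, chiefly the dideoxy chain-termination method of Sanger et al. [6] and the chemical degradation method of Maxam et al. [7]. As biotechnology developed, second-generation sequencing came into wide use, that is, massively parallel DNA sequencing platforms, mainly Roche/454 FLX pyrosequencing, Illumina/Solexa sequencing by synthesis, and SOLiD sequencing by ligation. Continued progress in sequencing allows more complete and deeper analysis of genomes and makes it feasible to analyze the genomes of rare species and to carry out large-scale functional genomics on transcriptomes, expression profiles and small RNAs. Completion of the Human Genome Project opened the era of whole-genome sequencing across species. Completion of bovine genome sequencing and the HapMap project [8-9] identified a considerable number of genetic variants. Among these, single nucleotide polymorphisms (SNPs) are the most widely studied type and are used to identify genomic regions associated with phenotypic variation in cattle; whole genomes of several cattle breeds have now been sequenced [8,10-17]. Using the Illumina Genome Analyzer II platform, Eck et al. [8] detected 2.4 million SNPs in Fleckvieh cattle; using the same platform, Kawahara-Miki et al. [10] identified 6.03 million SNPs; and using SOLiD technology, Stothard et al. [11] compared genomic variation in Black Angus and Holstein bulls and identified about 7 million SNPs and 790 copy number variants. SNPs derived from whole-genome sequencing have also been carried into genome-wide association studies, which can predict economically important traits with higher accuracy and detect important information across the whole genome [18-19].【Entry point】Genome-level variation in cattle has been studied extensively in recent years, but no whole-genome study of Sanjiang cattle has been reported, and research on the genetic resources of this breed remains scarce.【Key problem】To characterize variation in Sanjiang cattle at the genome level and examine the genetic diversity of the breed, providing genomic data to support further analysis of the genetic mechanisms underlying economic traits and the conservation of the breed's genetic diversity.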
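1.1 Materials
Samples were collected in April 2015 in Sanjiang Township and Shuimo Town, Wenchuan County, Aba Prefecture, Sichuan Province, the two townships where Sanjiang cattle are mainly distributed. Individuals with a yellow coat, relatively large body size and typical breed characteristics were selected, and 50 ear tissue samples were collected, preserved in 75% ethanol, brought back to the laboratory, drained of the preservative, and stored at -80°C until use.
1.2 DNA library construction and sequencing
Genomic DNA was extracted by the phenol-chloroform method, and its purity and concentration were checked by 1.5% agarose gel electrophoresis and the A260/A280 ratio. Genomic DNA from the 50 samples was mixed at equal concentration and equal volume and randomly sheared with a Covaris S2. DNA fragments of the required length (~500 bp) were recovered by electrophoresis, adapters were ligated, and clusters were prepared. The inserts were then sequenced on an Illumina HiSeq 2000 in paired-end mode with a read length of 150 bp to obtain the sequencing data.
1.3 Quality control and filtering of sequencing data
To ensure data quality, the raw sequencing data underwent quality control and filtering before analysis, with filtering used to reduce noise in the data. The quality of the raw data was assessed from base composition and base quality values (Fig. 1). Fig. 1-a shows that the proportion of low-quality bases (Q<20) was low, and Fig. 1-b shows that the A and T curves coincide and the G and C curves coincide, indicating balanced base composition and good sequencing results suitable for further processing. Raw reads containing adapters or of low quality were filtered out to obtain high-quality clean data, on which all subsequent analyses were based. Filtering removed read pairs containing adapters; read pairs in which N bases (bases that could not be called) made up more than 10% of either read; and read pairs in which low-quality bases (quality value Q≤7) made up more than 30% of the bases in either read.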
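The three filtering rules above can be sketched as a small predicate over a read pair. This is a minimal illustration, not the pipeline actually used in the study: the function names, the in-memory read representation, and the adapter string are all assumptions, and the cutoffs are passed in by the caller rather than fixed.

```python
from typing import Sequence, Tuple

# A read is modeled as (bases, per-base Phred qualities); this layout is an
# assumption made for the sketch, not a format taken from the paper.
Read = Tuple[str, Sequence[int]]


def read_fails(read: Read, adapter: str, max_n_frac: float,
               low_q: int, max_low_q_frac: float) -> bool:
    """Return True if a single read violates any of the three filters."""
    seq, quals = read
    if not seq:
        return True
    # Rule 1: the read contains adapter sequence.
    if adapter and adapter in seq:
        return True
    # Rule 2: the fraction of uncalled (N) bases is too high.
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return True
    # Rule 3: the fraction of low-quality bases (Q <= low_q) is too high.
    low = sum(1 for q in quals if q <= low_q)
    return low / len(seq) > max_low_q_frac


def keep_pair(r1: Read, r2: Read, **cutoffs) -> bool:
    """A pair is kept only if neither mate fails a filter."""
    return not (read_fails(r1, **cutoffs) or read_fails(r2, **cutoffs))


# Illustrative cutoffs; the adapter string is a hypothetical placeholder.
CUTOFFS = dict(adapter="AGATCGGAAGAGC", max_n_frac=0.10,
               low_q=7, max_low_q_frac=0.30)
good = ("ACGTACGTAC", [30] * 10)
too_many_n = ("ANNTACGTAC", [30] * 10)
print(keep_pair(good, good, **CUTOFFS), keep_pair(good, too_many_n, **CUTOFFS))
```

Because the decision is made per pair, one failing mate discards both reads, which keeps the paired-end files synchronized for alignment.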
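Fig. 1 (a) Base quality values; (b) base composition
1.4 Genome data assembly and processing of sequencing data
Reads were aligned to the reference genome with BWA [20]. SAMtools and Picard-tools were used to summarize and preprocess the alignments (sorting, duplicate removal and so on), and the Genome Analysis Toolkit (GATK) [21] was used to detect SNPs and indels. After alignment, all sites at which the sample genotype was polymorphic relative to the reference sequence were collected and filtered to obtain a high-confidence SNP/indel set. The resulting SNPs and indels were written in VCF format and compared against the dbSNP database to identify novel SNPs and indels. Structural variants (SVs) were analyzed with BreakDancer [22], and Reseqtools [23] was used to annotate, summarize and plot the variant results.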
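The processing chain described above (align, sort, remove duplicates, call variants) can be laid out as an ordered list of commands. The sketch below only builds the command lines and does not run them. It assumes `bwa mem`, samtools 1.3 or later, a local `picard.jar`, and a GATK 3.x `GenomeAnalysisTK.jar`; the paper does not state tool versions or options, so every file name and flag choice here is an assumption. Read groups and the reference `.fai`/`.dict` files that GATK requires are omitted for brevity.

```python
from typing import List


def build_pipeline(ref: str, fq1: str, fq2: str, prefix: str,
                   threads: int = 4) -> List[List[str]]:
    """Return the command lines for a minimal resequencing workflow.

    All paths are hypothetical; nothing is executed here.
    """
    sam = f"{prefix}.sam"
    sorted_bam = f"{prefix}.sorted.bam"
    dedup_bam = f"{prefix}.dedup.bam"
    vcf = f"{prefix}.raw.vcf"
    return [
        # Index the reference once before alignment.
        ["bwa", "index", ref],
        # Paired-end alignment; bwa writes SAM to stdout, to be redirected to `sam`.
        ["bwa", "mem", "-t", str(threads), ref, fq1, fq2],
        # Coordinate-sort the alignments into BAM.
        ["samtools", "sort", "-o", sorted_bam, sam],
        # Mark PCR/optical duplicates with Picard.
        ["java", "-jar", "picard.jar", "MarkDuplicates",
         f"I={sorted_bam}", f"O={dedup_bam}", f"M={prefix}.dup_metrics.txt"],
        ["samtools", "index", dedup_bam],
        # Call SNPs and indels together with the GATK 3.x UnifiedGenotyper.
        ["java", "-jar", "GenomeAnalysisTK.jar", "-T", "UnifiedGenotyper",
         "-R", ref, "-I", dedup_bam, "-glm", "BOTH", "-o", vcf],
    ]


for cmd in build_pipeline("UMD3.1.fa", "pool_1.fq.gz", "pool_2.fq.gz", "sanjiang"):
    print(" ".join(cmd))
```

Keeping the steps as data rather than running them directly makes it easy to log or review the exact commands before anything touches the sequencing files.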
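2.1 Data output
2.1.1 Clean data  Sequencing produced 77.8 Gb of clean data for the Sanjiang cattle genome, summarized in Table 1. Using the taurine cattle genome (UMD 3.1, GCA_000003055.3) as reference, clean reads were aligned to the reference genome with BWA [20] (Table 2). Sequencing depth was 25.32× and coverage exceeded 99%, indicating high single-base accuracy. The proportions of reads and bases aligned to the reference genome were 86.55% and 86.51%, respectively, indicating that the sequenced sample is highly similar and closely related to the reference species.
2.1.2 Sequencing depth and coverage by chromosome  Sequencing depth and coverage were summarized for each chromosome. Genome-wide sequencing depth was 25.32×, highest on chromosome 14 (26.55×) and lowest on the X chromosome (21.54×). Genome coverage was 99.22%, with 97.59% coverage of the X chromosome. Fig. 2 shows that the length covered by reads is proportional to chromosome length.
2.1.3 GC content  GC content affects sequencing: regions of high or low GC content are harder to sequence, and some sequences in them cannot be determined accurately. Fig. 3 shows that coverage depth was good (25×) across the whole range of GC content, with no obvious GC bias, indicating good library construction and sequencing quality.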
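The mapping percentages quoted in this section follow directly from the read and base counts reported in the abstract. The short check below recomputes them; the variable names are ours, and the implied genome size is simply mapped bases divided by the reported depth, which is an assumption about how the depth was derived rather than something stated in the paper.

```python
# Counts as reported for the Sanjiang cattle pool.
total_reads = 778_403_444
total_bases = 77_840_344_400
mapped_reads = 673_670_505
mapped_bases = 67_341_451_555
paired_reads = 635_242_898
paired_bases = 63_512_636_924
reported_depth = 25.32


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100 * part / whole, 2)


mapping = {
    "reads mapped": pct(mapped_reads, total_reads),
    "bases mapped": pct(mapped_bases, total_bases),
    "reads mapped in pairs": pct(paired_reads, total_reads),
    "bases mapped in pairs": pct(paired_bases, total_bases),
}
for label, value in mapping.items():
    print(f"{label}: {value}%")

# Average bases per read implied by the totals.
bases_per_read = total_bases / total_reads
# Genome size implied if depth = mapped bases / genome size (an assumption).
implied_genome_size = mapped_bases / reported_depth
print(bases_per_read, round(implied_genome_size / 1e9, 2))
```

Recomputing derived figures from the raw counts in this way is a cheap consistency check when tables and text are transcribed separately.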
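Table 1 Clean data of Sanjiang cattle
Table 2 Alignment of Sanjiang cattle data to the reference genome
Fig. 2 Length of chromosomal regions covered by reads
Fig. 3 GC content and sequencing depth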
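2.2 SNP detection
SNPs in the Sanjiang cattle sample were called with the GATK UnifiedGenotyper. A total of 20 477 130 SNPs were detected, a density of about one variant per 131 bases; the distribution of these variants can be used to locate candidate genomic regions associated with economic traits. The SNPs were compared against the dbSNP database (Fig. 4), which contained 90 045 399 SNPs; 2 147 988 Sanjiang cattle SNPs had no match in the database and were therefore novel, reported as 2.4% of all SNPs. There were 989 686 homozygous SNPs (4.83%) and 19 487 444 heterozygous SNPs (95.17%), a homozygous/heterozygous ratio of 1:19.7. Intergenic SNPs numbered 15 009 500, or 73.29% of all SNPs; most SNPs were concentrated in intergenic regions and introns, with a minority in exons, splice sites and untranslated regions (Table 3). The transition (TS)/transversion (TV) ratio is an important indicator of random sequence error and a measure of SNP quality, with an empirical value of TS/TV>2.1 [24]. Sanjiang cattle had 14 800 438 transitions and 6 680 058 transversions, a TS/TV ratio of 2.215 (Fig. 4), above the empirical value, indicating that most of the identified SNPs are accurate. Among all SNPs, 1 462 altered a splice site or changed a start or stop codon: 727 splice-site SNPs, 117 SNPs that changed a start codon to a non-start codon, 530 SNPs that introduced a premature stop codon, and 88 SNPs that changed a stop codon to a non-stop codon. Their distribution across chromosomes is shown in Fig. 5.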
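The summary ratios in this section can be recomputed from the reported counts. The sketch below does so with plain arithmetic; the names are ours. For the novel SNPs it computes the share relative to both the detected SNPs and the dbSNP total, since the reported 2.4% corresponds to the latter denominator while the ratio to detected SNPs comes out near 10.5%.

```python
# Counts as reported in the text.
total_snps = 20_477_130
hom_snps = 989_686
het_snps = 19_487_444
transitions = 14_800_438
transversions = 6_680_058
novel_snps = 2_147_988
dbsnp_total = 90_045_399
nonsyn, syn = 57_621, 83_797
effects = {"splice_site": 727, "start_lost": 117,
           "stop_gained": 530, "stop_lost": 88}

ts_tv = transitions / transversions        # transition/transversion ratio
hom_pct = 100 * hom_snps / total_snps      # share of homozygous SNPs
het_per_hom = het_snps / hom_snps          # the "1:x" hom/het ratio
ns_s_ratio = nonsyn / syn                  # non-synonymous/synonymous
novel_vs_detected = 100 * novel_snps / total_snps
novel_vs_dbsnp = 100 * novel_snps / dbsnp_total

print(f"TS/TV = {ts_tv:.4f}")
print(f"homozygous = {hom_pct:.2f}%, hom:het = 1:{het_per_hom:.1f}")
print(f"nonsyn/syn = {ns_s_ratio:.2f}")
print(f"novel: {novel_vs_detected:.1f}% of detected, "
      f"{novel_vs_dbsnp:.1f}% relative to dbSNP")
print(f"splice/start/stop SNPs = {sum(effects.values())}")
```

Note that the transition and transversion counts sum to slightly more than the SNP total, which would be expected if multi-allelic sites contribute more than one substitution each; the paper does not say how such sites were counted.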
Fig. 4 SNPs (indels) compared against the dbSNP database and the transition/transversion ratio
Fig. 5 Number of SNPs of each mutation effect on each chromosome
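Many phenotypes in humans and other animals are associated with non-synonymous SNPs (nsSNPs) [25], and SNP annotation provides the basis for linking SNP sites to function. This study detected 57 621 non-synonymous and 83 797 synonymous variants, a non-synonymous/synonymous ratio of 0.69. Annotation of the nsSNPs with the Ensembl database [26] identified 9 017 genes (electronic supplementary Table 1, http://pan.baidu.com/s/1qXN18dA). Functional enrichment analysis with the DAVID database [27] was carried out on the 108 genes carrying the most nsSNPs (electronic supplementary Table 2, http://pan.baidu.com/s/1qXN18dA). The functional annotations fell into 7 categories, mainly biochemical, metabolic and immune processes (electronic supplementary Table 3, http://pan.baidu.com/s/1qXN18dA). Among the immune-related terms, GO:0006955 concerns the immune response and GO:0019882 concerns antigen processing and presentation, both important to the organism, and include several genes [gene symbols missing]. The relationship between nsSNPs and important economic traits such as meat quality, milk yield and growth rate was also analyzed. In total, 567 genes matched genes previously reported for important economic traits [10,28-34] and were annotated (electronic supplementary Table 4, http://pan.baidu.com/s/1qXN18dA): 471 genes related to meat quality, 77 to disease resistance, 21 to milk production, 10 to growth and 8 to reproduction. The 567 genes include some with overlapping functions, for example genes related to both meat quality and growth, genes related to both meat quality and disease resistance, and genes related to both milk production and disease resistance [gene symbols missing]. They also include well-studied genes with relatively clear mechanisms, such as the growth hormone, growth hormone receptor, leptin receptor and prolactin receptor genes [29,32].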
Table 3 Annotation of SNPs and indels
Fig. 6 Number of SNPs and indels on each chromosome
2.3 Indel detection
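Recent studies treat deletions and insertions longer than 50 bp as structural variants and those shorter than 50 bp as indels [35]. Indels are the most abundant type of genomic variant after SNPs. A total of 1 355 308 indels were detected in Sanjiang cattle, comprising 693 180 deletions (51.15%) and 662 148 insertions (48.85%). Comparison with the database identified 90 180 novel indels, 6.7% of all indels (Fig. 4). There were 161 198 homozygous indels (11.89%) and 1 194 110 heterozygous indels (88.11%). Indel annotation is shown in Table 3: intergenic indels were the most numerous at 982 443, or 72.49% of all indels; there were 1 545 indels in CDS regions, 4 137 in exons, and 3 606 and 240 in 3′ and 5′ untranslated regions, respectively. The numbers of SNPs and indels on each chromosome are shown in Fig. 6. Except on chromosomes 11, 12 and 13, the number of SNPs per chromosome tracked chromosome length, decreasing as chromosomes became shorter; the number of indels per chromosome likewise decreased with chromosome length. The length distributions of insertions and deletions in CDS regions and across the whole genome are shown in Figs. 7 and 8. Genome-wide, the number of insertions and deletions decreased as their length increased, whereas no such trend was seen in CDS regions. In both figures, however, insertions and deletions were concentrated at 1-10 bp, with 1-3 bp the most frequent. Based on the reads aligned as pairs, insert sizes were checked for anomalies and deletion signals were analyzed, and 1 906 structural variants were detected in total.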
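The length-based split between indels and structural variants, and the length histogram plotted in the figures, can be illustrated with a few lines of code. This sketch works on simple (REF, ALT) allele pairs as they appear in a VCF record; treating a length of exactly 50 bp as a structural variant is our assumption, since the text only distinguishes "longer than" and "shorter than" 50 bp.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_variant(ref: str, alt: str,
                     sv_min_len: int = 50) -> Tuple[str, str, int]:
    """Classify one REF/ALT pair as (scale, kind, size in bp)."""
    diff = len(alt) - len(ref)
    if diff == 0:
        # Same length: a substitution, not an insertion or deletion.
        return ("substitution", "substitution", 0)
    kind = "insertion" if diff > 0 else "deletion"
    size = abs(diff)
    scale = "indel" if size < sv_min_len else "structural_variant"
    return (scale, kind, size)


def length_histogram(variants: Iterable[Tuple[str, str]]) -> Counter:
    """Count small indels by (kind, size), skipping everything else."""
    hist: Counter = Counter()
    for ref, alt in variants:
        scale, kind, size = classify_variant(ref, alt)
        if scale == "indel":
            hist[(kind, size)] += 1
    return hist


# Toy allele pairs, invented for illustration only.
calls = [("A", "AT"), ("A", "ATG"), ("ATGC", "A"), ("C", "CT"), ("G", "T")]
print(length_histogram(calls))
```

Summing such a histogram over the 1-3 bp and 1-10 bp bins would give the concentration figures described in the text, given the real call set.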
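2.4 Genome heterozygosity and population nucleotide diversity
Heterozygosity (H) and nucleotide diversity (Pi) are indicators of the level of polymorphism (Fig. 9). After aligning reads to the reference genome, 19 487 444 heterozygous SNPs were identified in Sanjiang cattle, giving a genome-wide heterozygosity of 7.6×10⁻³, which indicates high genetic diversity in the breed. Genome-wide Pi was 0.0039, also indicating relatively rich genetic diversity.
2.5 Genome-wide Tajima's D and theta W
Tajima's D tests whether a DNA sequence has evolved according to the neutral model; a negative D may result from genetic hitchhiking [36]. Tajima's D for the Sanjiang cattle population was -0.06832 (Fig. 9), suggesting unbalanced selection in the population. Theta W is an indicator of population polymorphism: it estimates nucleotide polymorphism under the idealized model of neutral evolution at mutation-drift equilibrium, against which departures from neutrality are measured. Genome-wide theta W in Sanjiang cattle was 0.0040 (Fig. 9), indicating relatively rich genetic diversity in the population.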
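The relationship between Pi, theta W and Tajima's D can be made concrete with the standard estimators for a sample of n sequences. The sketch below uses the textbook Tajima (1989) formulas on per-site minor-allele counts. It is an illustration only: the study sequenced a DNA pool, for which pool-specific corrections are normally applied, and the paper does not describe how its values were computed. The function name and input layout are our own.

```python
import math
from typing import Sequence, Tuple


def tajimas_d(allele_counts: Sequence[int], n: int) -> Tuple[float, float, float]:
    """Return (pi, theta_w, D) for one region.

    allele_counts: for each segregating site, the number of sampled
    sequences (1..n-1) carrying one of the two alleles.
    n: number of sampled sequences.
    """
    s = len(allele_counts)  # number of segregating sites
    # Pi: mean number of pairwise differences, summed over sites.
    pi = sum(2 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = s / a1 if s else 0.0  # Watterson's estimator
    if s == 0:
        return pi, theta_w, float("nan")  # D is undefined without variation
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return pi, theta_w, d


# Toy data: 4 sequences, 3 segregating sites (invented for illustration).
print(tajimas_d([1, 1, 2], n=4))   # intermediate-frequency site present
print(tajimas_d([1, 1, 1], n=4))   # singletons only
```

An excess of rare alleles pulls Pi below theta W and makes D negative, while intermediate-frequency alleles push D positive; a value close to zero, such as the one reported here, indicates that the two estimators nearly agree.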
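Fig. 7 Length distribution of indels across the whole genome
Fig. 8 Length distribution of indels in CDS regions
Fig. 9 (a) Genome-wide heterozygosity H; (b) genome-wide polymorphism index Pi; (c) genome-wide polymorphism index theta W; (d) genome-wide Tajima's D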
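In this study, the endangered Sanjiang cattle breed was sequenced at the whole-genome level on the Illumina HiSeq 2000 platform. Because the breed is small, choosing individuals that represent it is particularly important. To avoid the influence of individual differences and to capture information from more individuals at lower cost, so as to reflect the genetic diversity of the population, DNA from 50 individuals was pooled for sequencing. Sequencing yielded 77.8 Gb of clean data, of which 86.55% of reads, 81.61% of reads in pairs, 86.51% of bases and 81.59% of bases in pairs aligned to the reference genome. Sequencing depth was 25.32× and coverage exceeded 99%, giving high single-base accuracy. Compared with earlier sequencing of taurine cattle [8,11], sequencing depth was higher, and the detected variants are sufficiently reliable [15].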
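GATK analysis identified 20 477 130 SNPs and 1 355 308 indels on the 29 autosomes and the X chromosome. The Sanjiang cattle population is small, only a little over 2 000 head, yet the number of SNPs is large, indicating relatively rich genetic diversity in the breed. Of all SNPs, 19 487 444 (95.17%) were heterozygous and 989 686 (4.83%) were homozygous, similar to the 2 234 514 (90.3%) heterozygous and 239 370 (9.7%) homozygous SNPs obtained by Shin et al. [37] from sequencing 10 Korean bulls. The homozygous/heterozygous ratio of 1:19.7 is markedly lower than in the Japanese cattle studied by Kawahara-Miki et al. [10] (1:1.2) and the Korean cattle studied by Choi et al. [12] (1:1.92). From a sequencing perspective, a homozygous SNP is a site at which all samples in the pool carry the same base, differing from the reference genome, whereas a heterozygous SNP is a site at which the pooled samples carry more than one base. The low proportion of homozygous SNPs and high heterozygosity in Sanjiang cattle may mean that variation among the 50 sequenced individuals is large. It may also be that Sanjiang cattle have undergone little selective breeding, that inbreeding is infrequent, and that gene flow with other cattle is considerable, so that the breed's own distinctive functional genes are being lost, which makes protecting the breed especially important. Indels made up about 5.3% of all variants (indels and SNPs), slightly higher than in the studies of Kawahara-Miki et al. [10] and Choi et al. [12]. The transition/transversion ratio of Sanjiang cattle was 2.21, higher than the 2.1 obtained by Choi et al. [16] for Korean cattle. Comparison with the database revealed 2 147 988 novel SNPs and 90 180 novel indels, reported as 2.4% of all SNPs and 6.7% of all indels. This is probably because, with the development of genome sequencing in recent years, many new SNPs and indels have been discovered and the database has become increasingly complete, so that the matched proportion has grown markedly and novel variants have become fewer. Most indels were short: deletions ranged from 1 to 29 bp and insertions from 1 to 44 bp, with both concentrated at 1-10 bp and 1-3 bp the most frequent, a pattern also observed in human genome data [38]. In the Sanjiang cattle data, nearly 84.7% of insertions and 79.6% of deletions were shorter than 3 bp. The numbers of SNPs and indels detected on the 29 autosomes were proportional to chromosome length, as expected. The X chromosome had the lowest mutation rate, at 4.33%; in studies of small populations, the X chromosome has a lower mutation rate than the autosomes [39].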
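Annotation of nsSNPs with the Ensembl database identified 9 017 genes, consistent with the results of Eck et al. [8] and higher than in the Japanese cattle studied by Kawahara-Miki et al. [10] and the Korean cattle studied by Lee et al. [14]. Annotation of the non-synonymous SNPs identified 567 genes related to economic traits, with 471, 77, 21, 10 and 8 genes related to meat quality, disease resistance, milk production, growth and reproduction, respectively. Sanjiang cattle are mainly used for draught work in tillage and in recent years have been moving toward dual draught and beef use. Some of the meat-quality genes found here have been reported in other cattle breeds. For example, mutations in fatty acid binding protein 4 (FABP4) are associated with the palmitoleic acid content of intramuscular fat [40], mutations in the urotensin II receptor (UTS2R) are associated with fat deposition in skeletal muscle [41], and mutations in calpain 1 (CAPN1) are associated with meat tenderness in Aberdeen Angus cattle [42]. Two further genes [gene symbols missing] have been found to be associated with beef fat content in Nellore, Holstein-Friesian, Angus, Charolais, Hereford and Simmental cattle [34,43], and the titin gene has also been identified as an important candidate gene affecting meat quality [44]. In this study, several meat-quality genes [gene symbols missing] carried many nsSNPs (>10). There were also genes related to development and disease: two genes [gene symbols missing] are associated with breeding values in Holstein cattle [45], a protein kinase gene is associated with early embryonic development [46], a Y-linked peptide repeat domain gene plays a key role in bull spermatogenesis [47], and the sex-determining region Y gene is an important candidate marker for sperm quality and fertility in bulls [48]. NRCAM, PNPLA8 and CTTNBP2 are important candidate genes for Weaver syndrome in Brown Swiss cattle [49]. Pigmentation in mammals results from the presence or absence of melanin in hair or skin. The main genes affecting pigment synthesis are tyrosinase-related protein 1, dopachrome tautomerase, a serine peptidase, melanocortin 1 receptor, tyrosinase and a signaling protein. In rodents, switching between black and yellow is caused by a receptor and its antagonist [gene symbols missing]. The MITF gene regulates the expression of three of these genes [gene symbols missing]. The coat of Sanjiang cattle is mainly yellow, followed by black and grayish black, and many color-related genes known from other cattle breeds were also found in Sanjiang cattle. For example, the CORIN gene has been associated with yellow coat color in Korean cattle [37] and may likewise be an important gene affecting yellow coat color in Sanjiang cattle. Several other genes [gene symbols missing] have been found to cause changes in hair follicle phenotype [50].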
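Whole-genome sequencing of Sanjiang cattle in this study produced 77.8 Gb of clean data and identified a large number of genetic variants, indicating that the genetic diversity of Sanjiang cattle is relatively rich and providing genomic data to support further study of the genetic characteristics of the breed.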
[1] SUN F Y, LIU J. Distribution and ecological features of Sanjiang cattle breed. Prataculture & Animal Husbandry, 2009(9): 51-52. (in Chinese)
[2] CHEN Z H, GU L, ZHONG J C. Study on the polymorphism of the Bola-DRB3 gene exon 2 in the Sanjiang Cattle by PCR-RFLP method. Journal of Southwest University for Nationalities (Natural Science Edition), 2008, 33(4): 782-787. (in Chinese)
[3] HOLLEY R W, EVERETT G A, MADISON J T, ZAMIR A. Nucleotide sequences in the yeast alanine transfer ribonucleic acid., 1965, 240(5): 2122-2128.
[4] FRESCO J R, ADAMS A, ASCIONE R, HENLEY D, LINDAHL T. Tertiary structure in transfer ribonucleic acids//Cold Spring Harbor Laboratory Press, 1966, 31: 527-537.
[5] CELANDER D W, CECH T R. Visualizing the higher order folding of a catalytic RNA molecule., 1991, 251(4992): 401-407.
[6] SANGER F, NICKLEN S, COULSON A R. DNA sequencing with chain-terminating inhibitors., 1977, 74(12): 5463-5467.
[7] MAXAM A M, GILBERT W. Sequencing end-labeled DNA with base-specific chemical cleavages., 1979, 65(1): 499-560.
[8] ECK S H, BENET-PAGÈS A, FLISIKOWSKI K, MEITINGER T, FRIES R, STROM T M. Whole genome sequencing of a single Bos taurus animal for single nucleotide polymorphism discovery., 2009, 10(8): R82.
[9] GIBBS R A, BELMONT J W, HARDENBOL P, WILLIS T D, YU F L, YANG H M, CHANG L Y, HUANG W, LIU B, SHEN Y, et al. The international HapMap project., 2003, 426(6968): 789-796.
[10] KAWAHARA-MIKI R, TSUDA K, SHIWA Y, ARAI-KICHISE Y, MATSUMOTO T, KANESAKI Y, ODA S, EBIHARA S, YAJIMA S, YOSHIKAWA H, KONO T. Whole-genome resequencing shows numerous genes with nonsynonymous SNPs in the Japanese native cattle Kuchinoshima-Ushi., 2011, 12(1): 103.
[11] STOTHARD P, CHOI J W, BASU U, SUMNER-THOMSON J M, MENG Y, LIAO X, MOORE S S. Whole genome resequencing of black Angus and Holstein cattle for SNP and CNV discovery., 2011, 12(1): 1.
[12] CHOI J W, CHUNG W H, LEE K T, LEE K T, CHOI J W, JUNG K S, CHO Y, KIM N, KIM T H. Whole genome resequencing of Heugu (Korean Black Cattle) for the genome-wide SNP discovery., 2013, 33(6): 715-722.
[13] CHOI J W, LIAO X, PARK S, JEON H J, CHUNG W H, STOTHARD P, PARK Y S, LEE J K, LEE K T, KIM S H, OH J D, KIM N, KIM T H, LEE H K, LEE S J. Massively parallel sequencing of Chikso (Korean brindle cattle) to discover genome-wide SNPs and InDels., 2013, 36(3): 203-211.
[14] LEE K T, CHUNG W H, LEE S Y, CHOI J W, KIM J, LIM D, LEE S,JANG G W, KIM B, CHOY Y H, LIAO X, STOTHARD P, MOORE S S, LEE S H, AHN S, KIM N, KIM T H. Whole-genome resequencing of Hanwoo (Korean cattle) and insight into regions of homozygosity., 2013, 14(1): 519.
[15] CHOI J W, LIAO X, STOTHARD P, CHUNG W H, JEON H J, MILLER S P, CHOI S Y, LEE J K, YANG B, LEE K T, HAN K J, KIM H C, JEONG D, OH J D, KIM N, KIM T H, LEE H K, LEE S J. Whole-genome analyses of Korean native and Holstein cattle breeds by massively parallel sequencing., 2014, 9(7): e101127.
[16] CHOI J W, CHOI B H, LEE S H, LEE S S, KIM H C, YU D, CHUNG W H, LEE K T, CHAI H H, CHO Y M, LIM D. Whole-genome resequencing analysis of hanwoo and yanbian cattle to identify genome-wide SNPs and signatures of selection., 2015, 38(5): 466.
[17] SASAKI S, WATANABE T, NISHIMURA S, SUGIMOTO Y. Genome-wide identification of copy number variation using high-density single-nucleotide polymorphism array in Japanese Black cattle., 2016, 17(1): 1.
[18] DAETWYLER H D, CAPITAN A, PAUSCH H, STOTHARD P, VAN BINSBERGEN R, BRØNDUM R F, LIAO X, DJARI A, RODRIGUEZ S C, GROHS C, ESQUERRÉ D, BOUCHEZ O, ROSSIGNOL M N, KLOPP C, ROCHA D, FRITZ S, EGGEN A, BOWMAN P J, COOTE D, CHAMBERLAIN A J, ANDERSON C, VANTASSELL C P, HULSEGGE I, GODDARD M E, GULDBRANDTSEN B, LUND M S, VEERKAMP R F, BOICHARD D A, FRIES R, HAYES B J. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle., 2014, 46(8): 858-865.
[19] QANBARI S, PAUSCH H, JANSEN S, SOMEL M, STROM T M, FRIES R, NIELSEN R, SIMIANER H. Classic selective sweeps revealed by massive sequencing in cattle., 2014, 10(2): e1004148.
[20] LI H, DURBIN R. Fast and accurate short read alignment with Burrows-Wheeler transform., 2009, 25(14): 1754-1760.
[21] MCKENNA A, HANNA M, BANKS E, SIVACHENKO A, CIBULSKIS K, KERNYTSKY A, GARIMELLA K, ALTSHULER D, GABRIEL S, DALY M, DEPRISTO M A. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data., 2010, 20(9): 1297-1303.
[22] CHEN K, WALLIS J W, MCLELLAN M D, LARSON D E, KALICKI J M, POHL C S, MCGRATH S D, WENDL M C, ZHANG Q, LOCKE D P, SHI X, FULTON R S, LEY T J, WILSON R K, DING L, MARDIS E R. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation., 2009, 6: 677-681.
[23] HE W, ZHAO S, LIU X, DONG S, LV J, LIU D, WANG J, MENG Z. ReSeqTools: an integrated toolkit for large-scale next-generation sequencing based resequencing analysis., 2013, 12(4): 6275-6283.
[24] 1000 GENOMES PROJECT CONSORTIUM, ABECASIS G R, AUTON A, BROOKS L D, DEPRISTO M A, DURBIN R M, HANDSAKER R E, KANG H M, MARTH G T, MCVEAN G A. An integrated map of genetic variation from 1,092 human genomes., 2012, 491(7422): 56-65.
[25] STENSON P D, BALL E V, MORT M, PHILLIPS A D, SHIEL J A, THOMAS N S, ABEYSINGHE S, KRAWCZAK M, COOPER D N. Human gene mutation database (HGMD®): 2003 update., 2003, 21(6): 577-581.
[26] FLICEK P, AMODE M R, BARRELL D, BEAL K, BRENT S, CARVALHO-SILVA D, CLAPHAM P, COATES G, FAIRLEY S, FITZGERALD S, GIL L, GORDON L, HENDRIX M, HOURLIER T, JOHNSON N, KÄHÄRI A K, KEEFE D, KEENAN S, KINSELLA R, KOMOROWSKA M, KOSCIELNY G, KULESHA E, LARSSON P, LONGDEN I, MCLAREN W, MUFFATO M, OVERDUIN B, PIGNATELLI M, PRITCHARD B, RIAT H S, RITCHIE G R S, RUFFIER M, SCHUSTER M, SOBRAL D, TANG Y A, TAYLOR K, TREVANION S, VANDROVCOVA J, WHITE S, WILSON M, WILDER S P, AKEN B L, BIRNEY E, CUNNINGHAM F, DUNHAM I, DURBIN R, FERNÁNDEZ-SUAREZ X M, HARROW J, HERRERO J, HUBBARD T J P, PARKER A, PROCTOR G, SPUDICH G, VOGEL J, YATES A, ZADISSA A, SEARLE S M J. Ensembl 2012., 2011: gkr991.
[27] HUANG D W, SHERMAN B T, LEMPICKI R A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources., 2009, 4(1): 44-57.
[28] HIENDLEDER S, THOMSEN H, REINSCH N, BENNEWITZ J, LEYHE-HORN B, LOOFT C, XU N, MEDJUGORAC I, RUSS I, KÜHN C, BROCKMANN G A, BLÜMEL J, BRENIG B, REINHARDT F, REENTS R, AVERDUNK G, SCHWERIN M, FÖRSTER M, KALM E, ERHARDT G. Mapping of QTL for body conformation and behavior in cattle., 2003, 94(6): 496-506.
[29] NKRUMAH J D, LI C, YU J, HANSEN C, KEISLER D H, MOORE S S. Polymorphisms in the bovine leptin promoter associated with serum leptin concentration, growth, feed intake, feeding behavior, and measures of carcass merit., 2005, 83(1): 20-28.
[30] HU Z L, FRITZ E R, REECY J M. AnimalQTLdb: a livestock QTL database tool set for positional QTL information mining and beyond., 2007, 35(suppl. 1): D604-D609.
[31] HU Z L, REECY J M. Animal QTLdb: beyond a repository. A public platform for QTL comparisons and integration with diverse types of structural genomic information., 2007, 18(1): 1-4.
[32] THOMAS M G, ENNS R M, SHIRLEY K L, GARCIA M D, GARRETT A J, SILVER G A. Associations of DNA polymorphisms in growth hormone and its transcriptional regulators with growth and carcass traits in two populations of Brangus bulls., 2007, 6(1): 222-237.
[33] BAGNATO A, SCHIAVINI F, ROSSONI A, MALTECCA C, DOLEZAL M, MEDUGORAC I, SÖLKNER J, RUSSO V, FONTANESI L, FRIEDMANN A, SOLLER M, LIPKIN E. Quantitative trait loci affecting milk yield and protein percentage in a three-country Brown Swiss population., 2008, 91(2): 767-783.
[34] FERRAZ J B S, PINTO L F B, MEIRELLES F V, ELER J P, DE REZENDE F M, OLIVEIRA E C, ALMEIDA H B, WOODWARD B, NKRUMAH D. Association of single nucleotide polymorphisms with carcass traits in Nellore cattle., 2009, 8(4): 1360-1366.
[35] ALBERS C A, LUNTER G, MACARTHUR D G, MCVEAN G, OUWEHAND W H, DURBIN R. Dindel: accurate indel calls from short-read data., 2011, 21(6): 961-973.
[36] NIELSEN R. Molecular signatures of natural selection., 2005, 39: 197-218.
[37] SHIN Y, JUNG H J, JUNG M, YOO S I, SUBRAMANIYAM S, MARKKANDAN K, KANG J M, RAI R, PARK J, KIM J J. Discovery of gene sources for economic traits in Hanwoo by whole-genome resequencing., 2016, 29(9): 1353-1362.
[38] FUJIMOTO A, NAKAGAWA H, HOSONO N, NAKANO K, ABE T, BOROEVICH K A, NAGASAKI M, YAMAGUCHI R, SHIBUYA T, KUBO M, MIYANO S, NAKAMURA Y, TSUNODA T. Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing., 2010, 42(11): 931-936.
[39] MAKOVA K D, LI W H. Strong male-driven evolution of DNA sequences in humans and apes., 2002, 416(6881): 624-626.
[40] HOASHI S, HINENOYA T, TANAKA A, OHSAKI H, SASAZAKI S, TANIGUCHI M, OYAMA K, MUKAI F, MANNEN H. Association between fatty acid compositions and genotypes of FABP4 and LXR-alpha in Japanese Black cattle., 2008, 9(1): 1.
[41] JIANG Z, MICHAL J J, TOBEY D J, WANG Z, MACNEIL M D, MAGNUSON N S. Comparative understanding of UTS2 and UTS2R genes for their involvement in type 2 diabetes mellitus., 2008, 4(2): 96-102.
[42] GILL J L, BISHOP S C, MCCORQUODALE C, WILLIAMS J L, WIENER P. Association of selected SNP with carcass and taste panel assessed meat quality traits in a commercial population of Aberdeen Angus-sired beef cattle., 2009, 41(1): 36.
[43] BUCHANAN F C, FITZSIMMONS C J, VAN KESSEL A G, THUE T D, WINKELMAN-SIM D C, SCHMUTZ S M. Association of a missense mutation in the bovine leptin gene with carcass fat content and leptin mRNA levels., 2002, 34(1): 105-116.
[44] WATANABE N, SATOH Y, FUJITA T, OHTA T, KOSE H, MURAMATSU Y, YAMAMOTO T, YAMADA T. Distribution of allele frequencies atand*between Japanese Black and four other cattle breeds with differing historical selection for marbling., 2011, 4: 10.
[45] LIEFERS S C, VEERKAMP R F, PAS M F W, DELAVAUD C, CHILLIARD Y, LENDE T. A missense mutation in the bovine leptin receptor gene is associated with leptin concentrations during late pregnancy., 2004, 35(2): 138-141.
[46] YANG Q E, OZAWA M, ZHANG K, JOHNSON S E, EALY A D. The requirement for protein kinase C delta (PRKCD) during preimplantation bovine embryo development., 2014, 28(4): 482-490.
[47] LIU Y, QIN X, SONG X Z, JIANG H Y, SHEN Y F, DURBIN K J, LIEN S, KENT M P, SODELAND M, REN Y R, ZHANG L, SODERGREN E, HAVLAK P, WORLEY K C, WEINSTOCK G M, GIBBS R A. Bos taurus genome assembly., 2009, 10(1): 1.
[48] MISHRA C, PALAI T K, SARANGI L N, PRUSTY B R, MAHARANA B R. Candidate gene markers for sperm quality and fertility in bulls., 2013, 6: 905-910.
[49] MCCLURE M, KIM E, BICKHART D, NULL D, COOPER T, COLE J, WIGGANS J, AJMONE-MARSAN P, COLLI L, SANTUS E, LIU G, SCHROEDER S, MATUKUMALLI L, TASSELL C V, SONSTEGARD T. Fine mapping for Weaver syndrome in Brown Swiss cattle and the identification of 41 concordant mutations across NRCAM, PNPLA8 and CTTNBP2., 2013, 8(3): e59251.
[50] VAN DEN BOSSCHE J, MALISSEN B, MANTOVANI A, DE BAETSELIER P, VAN GINDERACHTER J A. Regulation and function of the E-cadherin/catenin complex in cells of the monocyte-macrophage lineage and DCs., 2012, 119(7): 1623-1633.
(Responsible editor: LIN Jianfei)
The Whole Genome Data Analysis of Sanjiang Cattle
SONG Nana1,2, ZHONG Jincheng1,2, CHAI Zhixin1,2, WANG Qi1,2, HE Shiming3, WU Jinbo3, JIAN Shanglin4, RAN Qiang5, MENG Xin5, HU Hongchun4
(1Key Laboratory of Animal Genetics and Breeding of State Ethnic Affairs Commission and Ministry of Education, Southwest University for Nationalities, Chengdu 610041; 2Institute of Tibetan Plateau Research, Southwest University for Nationalities, Chengdu 610041; 3Animal Husbandry Science Institute of Aba Autonomous Prefecture, Wenchuan 623000, Sichuan; 4Animal Husbandry and Veterinary Station of Aba Autonomous Prefecture, Wenchuan 623000, Sichuan; 5Animal Husbandry and Veterinary Station of Wenchuan, Wenchuan 623000, Sichuan)
【Objective】The objective of this study was to assess the genetic diversity of the Sanjiang cattle population and to examine its genetic variation at the genome level.【Method】Genomic DNA was extracted from fifty individuals and mixed at equal concentration and equal volume to construct a pooled DNA sample. The genomic DNA was randomly sheared with a Covaris S2, DNA fragments of 500 bp were recovered by electrophoresis, and a DNA library was constructed and sequenced on an Illumina HiSeq 2000. The short reads were mapped to the bovine reference genome (UMD 3.1) with BWA to detect genomic variants in Sanjiang cattle. The resequencing data were analyzed with SAMtools, Picard-tools, GATK and Reseqtools, and the SNPs and indels were annotated with the Ensembl, DAVID and dbSNP databases.【Result】Whole-genome sequencing generated 77.8 Gb of sequence data, covering 99.31% of the reference genome at a depth of 25.32-fold. In total, 778 403 444 reads and 77 840 344 400 bases were obtained, of which 673 670 505 reads (86.55%) and 67 341 451 555 bases (86.51%) mapped to the bovine reference genome (UMD 3.1); 635 242 898 reads (81.61%) and 63 512 636 924 bases (81.59%) mapped as pairs. A total of 20 477 130 SNPs and 1 355 308 small indels were identified, of which 2 147 988 SNPs (2.4%) and 90 180 indels (6.7%) were novel. Of the SNPs, 989 686 (4.83%) were homozygous and 19 487 444 (95.17%) were heterozygous, a homozygous/heterozygous ratio of 1:19.7. There were 14 800 438 transitions and 6 680 058 transversions, a transition/transversion (TS/TV) ratio of 2.215. There were 727 splice-site SNPs, 117 SNPs that changed a start codon to a non-start codon, 530 SNPs that introduced a premature stop codon, and 88 SNPs that changed a stop codon to a non-stop codon. A total of 57 621 non-synonymous and 83 797 synonymous SNPs were detected, a ratio of 0.69. Non-synonymous SNPs were detected in 9 017 genes, of which 567 were assigned as trait-associated genes: 471, 77, 21, 10 and 8 genes related to meat quality, disease resistance, milk production, growth and reproduction, respectively, with some genes having overlapping functions. Among the indels, 693 180 (51.15%) were deletions and 662 148 (48.85%) were insertions; 161 198 (11.89%) were homozygous and 1 194 110 (88.11%) were heterozygous. Most variants were located in intergenic regions and introns. Genome-wide heterozygosity (H), nucleotide diversity (Pi) and theta W of Sanjiang cattle were 7.6×10⁻³, 0.0039 and 0.0040, respectively, indicating abundant genetic diversity. Tajima's D for the Sanjiang cattle population was -0.06832, which may reflect unbalanced selection within the population.【Conclusion】The results provide valuable genomic data for further investigation of the genetic mechanisms underlying traits of interest and for the conservation of the genetic diversity of the Sanjiang cattle breed.
Sanjiang cattle; genome; next generation sequencing; SNP; indel
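Received: 2016-06-12; accepted: 2016-11-07
Supported by the Science and Technology Department of Sichuan Province (2015JY0248) and the Central Universities Project for Serving the Development of Ethnic Regions (2015NFW01)
SONG Nana, Tel: 13688499824; E-mail: songnana28@126.com. Corresponding author: ZHONG Jincheng, E-mail: zhongjincheng518@126.com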