Li Xuan (李軒), Zhu Yan (朱艷)
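Summary: To address the challenges that ultra-large-scale sequencing data pose to processing methods and processing capacity, this project starts from the sources of the data, namely the various next-generation sequencing platforms, including Illumina/Solexa, Roche/454, AB/SOLiD and the domestically developed AG-100/200 sequencing systems. It studies the characteristics of the data, experimental design strategies and data processing methods; develops coding models and high-throughput experimental design theory and methods for next-generation sequencing; studies mathematical models and quality control methods for the data from each sequencing platform; and develops efficient processing methods and workflows for high-throughput sequencing data, together with integrative analysis methods for cross-platform data. Building on the mathematical models and quality control methods developed for next-generation sequencing technologies and sequencing data, the project establishes a theory of coding and experimental design for next-generation sequencing. These theories and methods provide important guidance for sequencing data processing and will also improve the AG system, a next-generation sequencer developed independently in China. The project will establish next-generation sequencing data processing methods, algorithms, reconfigurable software workflows and cross-platform integrative analysis methods that suit multiple platforms and multiple applications, and will develop hardware acceleration techniques for processing large volumes of sequence data. Its progress is intended to bring China's research in bioinformatics and high-throughput sequencing technology to the international forefront. In a little over a year of work, the participating teams and partner institutions pursued the project's main objectives and made notable progress, laying a good foundation for the next stage. The main results are as follows: a new technical system for decoding sequencing by synthesis was developed; a sequencing error model and algorithms for processing raw sequencing data were established; the framework of the data processing software for the AG sequencing system was built and its main modules were completed; for coding theory and methods aimed at multi-sample sequencing experiments, an optimized design method for sample barcodes was established and a dual-tag encoding scheme for high-throughput sequencing library preparation was proposed; encoding methods for pooled DNA samples based on group testing theory were studied, and a balanced encoding design algorithm for Pool-Seq experiments and a grouping design algorithm based on hypergeometric distribution calculations were proposed; data characteristics were analyzed across different high-throughput sequencing platforms and different omics applications (genome, transcriptome), and several application case studies were completed; a data processing and assembly optimization pipeline for high-throughput transcriptome sequencing (RNA-seq) data was completed; and several patent applications were filed, forming a core technology system for next-generation sequencing with Chinese independent intellectual property.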
Keywords: next-generation sequencing; technology platform; mathematical model; error analysis; experimental design; optimization
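The sample-barcode design and dual-tag library scheme mentioned in the summary can be illustrated with a small sketch. The report's actual optimization method is not described here, so the following is only an assumption-laden illustration: barcodes are chosen greedily so that any two differ in at least `min_dist` positions (Hamming distance), and each sample is then labelled with an ordered pair of tags. The function names `greedy_barcodes` and `dual_tag_layout` are hypothetical, and a practical design would also filter for GC content and homopolymer runs.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length tags differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_barcodes(length: int, min_dist: int, limit: int) -> list[str]:
    """Greedily pick up to `limit` tags with pairwise Hamming distance >= min_dist.

    Candidates are scanned in lexicographic order over A, C, G, T.
    This is an illustrative heuristic, not the method used in the report.
    """
    chosen: list[str] = []
    for cand in map("".join, product("ACGT", repeat=length)):
        if all(hamming(cand, c) >= min_dist for c in chosen):
            chosen.append(cand)
            if len(chosen) == limit:
                break
    return chosen


def dual_tag_layout(samples: list[str], tags: list[str]) -> dict[str, tuple[str, str]]:
    """Assign each sample a distinct ordered pair (tag1, tag2).

    With t tags, up to t * t samples can be labelled.
    """
    if len(samples) > len(tags) ** 2:
        raise ValueError("not enough tag pairs for the given samples")
    return dict(zip(samples, product(tags, repeat=2)))


if __name__ == "__main__":
    tags = greedy_barcodes(length=4, min_dist=3, limit=4)
    layout = dual_tag_layout([f"S{i}" for i in range(10)], tags)
    for sample, pair in layout.items():
        print(sample, *pair)
```

The appeal of pairing two tags is that the number of distinguishable samples grows with the square of the number of synthesized tags, while a minimum distance of 3 lets a single substitution error in a tag still be attributed to the correct tag.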
Abstract: The processing methods and capacity required for large-scale sequencing data, generated by next-generation sequencing platforms including Illumina/Solexa, Roche/454, AB/SOLiD and the domestic AG-100/200 systems, are a central challenge in bioinformatics today. Starting from the sources of these data, we study experimental design strategies and data processing methods, develop coding models and design methods for high-throughput experiments, and study mathematical models and quality control methods for the data from each sequencing platform. On the basis of these mathematical models and quality control methods, we establish a theory of coding and experimental design for next-generation sequencing. These methods provide guidance for sequencing data processing and will be used to improve our self-developed next-generation sequencing system, the AG system. The project will establish data processing methods, algorithms and reconfigurable software workflows that suit a variety of platforms and applications, and will develop hardware acceleration technology for large volumes of sequence data. Its progress is intended to bring bioinformatics and sequencing technology research in China to the international forefront. In a little over a year of work, the participating teams and partner institutions pursued the main objectives of the project and made notable progress, laying a good foundation for its later stages. Major results include a new technical system for decoding sequencing by synthesis, a sequencing error model and algorithms for processing raw sequencing data, and the framework of the data processing software for the AG sequencing system.
For coding theory and methods aimed at multi-sample sequencing experiments, an optimized design method for sample barcodes was established and a dual-tag encoding scheme for high-throughput sequencing library preparation was proposed. Encoding methods for pooled DNA samples based on group testing were studied, and a balanced encoding design algorithm for Pool-Seq experiments and a grouping design algorithm based on hypergeometric distribution calculations were proposed. Data from different omics applications were analyzed, an assembly optimization workflow was completed, and patent applications were filed. Together these form the core of a proprietary technology system for next-generation sequencing.
Key Words: Next-generation sequencing; Technology platform; Mathematical model; Error analysis; Experimental design; Optimization
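The abstract refers to a grouping design for pooled samples based on hypergeometric distribution calculations but does not give the algorithm. As a minimal sketch of the kind of calculation involved, assume a population of `N` samples of which `K` carry a variant, and pools of `n` samples drawn without replacement; the number of carriers in a pool is then hypergeometric. The helper names and the "at most one carrier per pool" criterion below are assumptions chosen for illustration, not the report's method.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(exactly k carriers in a pool of n drawn from N samples with K carriers)."""
    if k < max(0, n - (N - K)) or k > min(K, n):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def prob_at_most_one_carrier(N: int, K: int, n: int) -> float:
    """Probability that a pool of size n contains zero or one carrier."""
    return hypergeom_pmf(0, N, K, n) + hypergeom_pmf(1, N, K, n)


def largest_pool_size(N: int, K: int, threshold: float) -> int:
    """Largest pool size whose at-most-one-carrier probability is >= threshold.

    The probability is non-increasing in n, so the scan can stop at the
    first pool size that falls below the threshold.
    """
    best = 1
    for n in range(1, N + 1):
        if prob_at_most_one_carrier(N, K, n) >= threshold:
            best = n
        else:
            break
    return best


if __name__ == "__main__":
    print(prob_at_most_one_carrier(N=10, K=2, n=5))
    print(largest_pool_size(N=10, K=2, threshold=0.9))
```

Keeping the chance of two or more carriers per pool low is one common way to make pooled results easy to decode; other criteria, such as expected sequencing cost, would lead to a different choice of pool size.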
Full-text link (real-name registration required): http://www.nstrs.cn/xiangxiBG.aspx?id=49573&flag=1