Pengju Zhao, Xianrui Zheng, Ying Yu, Zhuocheng Hou, Chenguang Diao,Haifei Wang, Huimin Kang, Chao Ning, Junhui Li, Wen Feng, Wen Wang,George E. Liu, Bugao Li, Jacqueline Smith, Yangzom Chamba,Jian-Feng Liu,*
KEYWORDS Expression pattern;Unknown protein;Pig;Proteome;Subcellular location
Abstract A lack of the complete pig proteome has left a gap in our knowledge of the pig genome and has restricted the feasibility of using pigs as a biomedical model.In this study,we developed a tissue-based proteome map using 34 major normal pig tissues.A total of 5841 unknown protein isoforms were identified and systematically characterized, including 2225 novel protein isoforms, 669 protein isoforms from 460 genes symbolized beginning with LOC,and 2947 protein isoforms without clear NCBI annotation in the current pig reference genome.These newly identified protein isoforms were functionally annotated through profiling the pig transcriptome with high-throughput RNA sequencing of the same pig tissues, further improving the genome annotation of the corresponding protein-coding genes. Combining the well-annotated genes that have parallel expression pattern and subcellular witness, we predicted the tissue-related subcellular locations and potential functions for these unknown proteins. Finally, we mined 3081 orthologous genes for 52.7% of unknown protein isoforms across multiple species, referring to 68 KEGG pathways as well as 23 disease signaling pathways. These findings provide valuable insights and a rich resource for enhancing studies of pig genomics and biology, as well as biomedical model application to human medicine.
The domestic pig (Sus scrofa) is one of the most popular livestock species predominately raised for human consumption worldwide.Besides its socio-economic importance,pig has been generally recognized as a valuable model species for studying human biology and disease due to its striking resemblances with humans in anatomy,physiology,and genome sequence[1,2].To date, many porcine biomedical models have been created for exploring etiology,pathogenesis,and treatment of a wide range of human diseases,e.g.,Parkinson’s disease[3],obesity[4],brain disorder [5], cardiovascular, atherosclerotic disease [6], and Huntington’s disease[7].Furthermore,pigs and humans share similarities in the size of their organs,making pig organs potential candidates for porcine-to-human xenotransplantation[8,9].Recently,major efforts have been devoted to the development of tools for further enhancing the value of pigs as a biomedical model for human medicine as well as its role in meat production.Of essential significance is the completion of the assembly of the pig genome sequence(Sus scrofa11.1)in recent time.It provides researchers with a vast amount of genomic information,facilitating characterization of individual pig genome as well as genome comparison between pigs and humans.
With the progress of large-scale genome projects, such as Encyclopedia of DNA Elements [10] and Human Proteome Projects [11], many genes have been annotated at RNA and protein levels, and diverse regulatory elements across the human genome have been systematically characterized. This creates great opportunities for exploring how genetic variations underly complex human phenotypes [12]. In particular,a spate of groundbreaking studies have been succeeded in building high-resolution maps of the proteome [13–15] in a variety of human tissues and cells.Findings from these studies greatly facilitate the functional annotation of the genome at multiple-omic levels and further improve the understanding of the complexity of human phenotypes.
Compared with humans, however, studies of pig proteome are very limited [16,17]. In particular, in-depth identification and characterization of the proteome maps of the pig genome across a broad variety of pig tissues are not yet available. To date, the leading protein database UniProtKB comprised around 1419 reviewed and 34,201 unreviewed pig proteins in Swiss-Prot and TrEMBL, respectively. It is far less than the numbers of entries in Swiss-Prot (20,215 proteins) and TrEMBL (159,615 proteins) corresponding to human proteome data. Although the recent update of the Pig PeptideAtlas presented 7139 canonical protein identifications from 25 tissues and 3 body fluids [18], it is still a limited promotion to whole pig proteome research. In fact, a large number of unreviewed and PeptideAtlas-identified pig proteins are not well annotated in current genome (Sus scrofa11.1)due to lack of specific genomic locations and corresponding assembled RNA transcripts. This suggests that there are still plenty of poorly annotated proteins that have not been identified and characterized in previous pig studies. In addition, even if nearly 20% of the annotated pig proteincoding genes (PCGs) have been symbolized beginning with LOC, the orthologues and functions of these genes have not been determined,which also presents one of the key limitations of pig gene set enrichment analysis.The absence of completive maps for the pig proteome triggers a substantial bottleneck in the progress of refining pig genome annotation and even hinders systematic comparison of omics data between humans and pigs.
Therefore, considering the potential contribution to develop pig proteomic atlases, we conducted in-depth characterization of pig proteome across 34 histologically normal tissues using high-resolution mass spectrometry (MS).Accordingly, we exploited the novel proteins, poorly annotated proteins, and LOC proteins, and defined these as the pig unknown proteins. These unknown proteins were mapped to the latest pig genome (Sus scrofa11.1) for confirming their available genomic locations. Jointly profiling the proteome and transcriptome across multiple pig tissues investigated herein, we found that the majority expression of transcripts was dominated by the expression of a small proportion of PCGs and most of the newly identified protein isoforms exhibited relatively higher tissue-specific expression than the proteins encoded by the existing PCGs. We subsequently constructed the tissue-based PCG spectrum of tissueenriched, group-enriched, and ubiquitously expressed genes in the pig genome, and determined 452 unknown protein isoforms as the novel candidates of pig housekeeping genes.Accordingly, we developed a pig transcriptomic atlas and characterized the subcellular locations for these unknown protein isoforms to infer their connections with the specific functions of tissues. Finally, by systematically comparing the orthologous relationships of these unknown pig proteins with other 10 species,we further predicted the potential functions of these unknown protein isoforms to ensure their availability in future relevant studies. Findings herein will benefit studies and development of pig genome and will allow further investigation of swine gene functions and networks of particular interest to the scientific community.
We explored the pig proteome from 34 tissue samples(Figure 1A) using liquid chromatography tandem mass spectrometry(LC–MS/MS).In silicoanalyses(Figure 1B)were then conducted to construct the whole landscape of the pig proteome with a view to furthering pig biological research and human medical studies. The resulting proteome data involved a total of 21,681,643 MS/MS spectra produced from 680 LC–MS/MS runs (20 runs per tissue).
Figure 1 Overview of pig proteome-based annotation
To exploit convincing peptide evidence for all putative PCGs in the pig genome, we searched the raw MS/MS data by Mascot[19]against three pig databases,as well as the common Repository of Adventitious Proteins (cRAP) database which assumes the digestion enzyme trypsin. The three pig databases included the primary pig database of UniProt [20]for the initial search and two custom-developed databases for sequential searches of unmatched spectra,i.e., 1) RNA sequencing (RNA-seq)-basedde novoassembly transcriptomic database, including the RNA-seq data generated from the 34 tissue samples in this study,the 1.08 giga-base(Gb)data from an external public expressed sequence tag(EST)database,and the 953.57 Gb publicly available RNA-seq data(see Materials and methods); and 2) a six-frame-translated pig genome database. Those corresponding matched spectra extracted from each subset of databases were re-searched against the same database by X!Tandem [21] for further filtration, producing the final 5,082,599 peptide spectrum matches (PSMs) at 1%PSM false discovery rate (FDR). Subsequently, Scaffold (version Scaffold_4.4.5, Proteome Software, Portland, OR) was run for MS/MS-based peptide and protein identification,both of them using the local FDR criterion of 0.01.
Totally, we identified 212,154 non-redundant peptides with a median number of 8 unique peptides per protein (quality assessment of protein identification is shown in Figures S1–S14). Comparison of identified peptides with the largest pig peptide resource PeptideAtlas (http://www.peptideatlas.org/)showed that 49,144 out of 87,909 curated peptides (55.9%)were confirmed by our identification.The peptides we detected greatly outnumbered those deposited in PeptideAtlas, with a major fraction (77.0%) found to be novel. A total of 24,431 protein isoforms with median sequence coverage of 30.3%were determined by Scaffold, which corresponded to 19,914 PCGs. To ascertain whether our protein identification achieved a reasonable false positive error rate,we additionally validated 31 proteins from different proteogenomic categories.By comparing MS/MS spectra from 71 synthetic peptides with those obtained from our analyzed pig tissues, we obtained 100% of validation (Table S1; File S1).
Classifying all of 24,431 identified protein isoforms(Figure 2A)indicated that 16,738(68.5%)protein isoforms were confirmed by the UniProt protein evidence, 9204 (37.7%) protein isoforms had evidence from Pig PeptideAtlas [18], 17,781(72.8%)protein isoforms were included in NCBI protein database, and 7910 (32.4%) were supported by all of them. Of all confirmed protein isoforms, 17,781 (85.8%) protein isoforms corresponding to 11,308 PCGs were included in known NCBI annotations,669 protein isoforms corresponding to 460 PCGs were annotated in the pig genome but classified as uncharacterized LOC genes,and 2947 protein isoforms remained a lack of NCBI annotation support in the pig genome (Figure 2B).Among the rest of 3703 protein isoforms identified by MS/MS data for the first time in this study, 2225 had higher confidence (PSMs with Mascot ion score > 20 and identification probability > 20%; details in Table S2), and thus were considered as potential novel proteins.
To further enhance the annotation of PCGs for the current pig genome, we systematically characterized the features or/and genomic locations of these 5841 unknown protein isoforms detected in current study (i.e., 669 protein isoforms of LOC genes,2947 protein isoforms without genomic location annotation, and 2225 novel protein isoforms firstly identified herein). Considering only 11.4% of protein isoforms had available genomic locations, we mapped the rest of 5172 unknown protein isoforms to the pig reference genome(Sus scrofa11.1) by MAKER annotation workflow [22]. First,the low-complexity repeats of pig reference genome were soft-masked by RepeatMasker, and then the 5172 unknown protein isoforms(non-LOC genes)were aligned to the masked reference genome by BLAST [23]. Second, Exonerate [24] was run to realign and polish the exon–intron boundaries of the unknown genes with the splice-site aware alignment algorithm.A total of 4026 (77.8%) unknown protein isoforms were successfully aligned to the reference genome with > 95%sequence identity and similarity (2073 with the 100% identity and 100%similarity),including 3886 assigned to chromosomes and 140 resided on 23 unplaced scaffolds. More interestingly,we found that the proportion of novel proteins mapped in respective chromosomes was related to the improvement of genomic annotation fromSus scrofa10.2 toSus scrofa11.1 for different chromosomes (Figure 2C;R2= 0.67,P= 0.0015). This result demonstrated that these unknown proteins, especially the novel proteins, were actually ignored in the current pig genome annotation, since most of previous studies have been limited to the incomplete annotation ofSus scrofa10.2 genome and the small number of tissues investigated.
Figure 2 Characterization of unknown pig protein isoforms
Comparison of the unknown protein isoforms (n= 5841)with the well-annotated proteins (n= 17,781) revealed that a major fraction of unknown protein isoforms (2281/5841,39.1%), especially the novel protein isoforms (1020/2225,45.8%),were merely identified in a single tissue,which were far more than the well-annotated proteins. It was most likely due to the tissue-specific and low-abundance expression of these novel protein isoforms. Further analysis of the reliability for these unknown protein isoforms revealed that a major fraction of them (3529/5841, 60.4%) were regarded as abundant proteins that had more than ten spectral counts [25]. Particularly,although the novel protein isoforms were first identified in this study,almost 60.7%of them were supported by a high spectral count of > 5.
To further probe potential functions of unknown protein isoforms, we characterized the expression landscape of unknown protein isoforms by high-throughput RNA-seq of the 34 tissue samples analyzed in the LC–MS/MS assays.
Approximately 1495 million paired-end reads(376.7 Gb per tissue) were obtained through sequencing 116 strand-specific paired-end RNA libraries,of which 1230 million were mapped to the pig genome (Sus scrofa11.1) with an overall pair alignment rate of 88.3% (Table S3). As expected, a total of 2,486,239 transcripts [fragments per kilobase per million(FPKM) > 0.1 in at least one tissue] corresponding to 29,270 genes were then assembled and quantified across all tissues, which contained 5250 annotated transcripts corresponding to 3486 known noncoding genes, 7595 potentially novel alternatively spliced transcripts corresponding to 2421 known noncoding genes, 55,328 annotated transcripts corresponding to 20,401 PCGs,136,537 potentially novel alternatively spliced transcripts corresponding to 15,385 PCGs,and 2,281,529 newly assembled transcripts corresponding to 26,493 genes in the pig genome without annotation information.These findings clearly increased the average number of transcripts per gene compared with the existing gene annotation in NCBI (Human-NCBI,n= 7.27; Pig-NCBI,n= 2.75; Pig-identified,n= 6.60;Figure 3A).
On the basis of all the currently well-annotated genes (the genes annotated in NCBI), we constructed a tissue similarity map across the 34 tissue samples using hierarchical clustering based on the Pearson correlation.As shown in Figure 3B,with the exception of three obvious outliers [i.e., adult testis, pancreas, and peripheral blood mononuclear cells (PBMCs)], the data were clustered into multiple known connected groups:liver and kidney, muscular system (longissimus dorsi and heart), nervous system (retina, brain, and spinal cord), adult immune organs (spleen, salivary gland, and lymph), and bladder tissue (urinary bladder, gall bladder, and oesophagus).These results revealed the expected biology that had a similar expression profile to that of human tissues [13], reflecting the biological similarity between humans and pigs, as well as the reliability of transcripts we constructed.
Intriguingly,a total of 51.7%(3018/5841)of unknown protein isoforms were successfully confirmed by the transcripts constructed herein,which offered a detailed view of the understanding of unknown proteins. Considering that unknown protein isoforms are usually expressed at low levels,we applied zFPKM normalization method [26] to generate highconfidence estimates of gene expression.The observed zFPKM values of unknown protein isoforms ranged from -3.02 to 19.89, showing a lower average expression level(zFPKM = 2.47), especially for the novel protein isoforms(zFPKM = 2.21), than the well-annotated PCGs(zFPKM = 3.62). Besides, we also found that these unknown protein isoforms tended to be expressed in less tissues(in average 12.1 tissue samples) than the well-annotated PCGs(in average 21.3 tissue samples),and nearly 39.0%(1178/3018)of unknown protein isoforms were only identified in a single tissue (Figure 3C). The results suggested that their tissuespecific expression characteristics may be one of the factors that lead to the incomplete annotation of these unknown protein isoforms.
Screening the expression patterns of protein isoforms in each tissue, we observed that the majority expression of transcripts was dominated by the expression of a small proportion of genes in all of the investigated tissues (Table S4).Specifically, the adult pig tissues (prostate, longissimus dorsi,pancreas, gall bladder,etc.) had the least complex transcriptome, with 50% expression of the transcripts coming from a few highly expressed genes (3–8 transcripts). In contrast, the reproductive tissues (i.e., uterus, testis, and ovary), expressed more complex transcriptome, with a large number of genes expressed. Similar transcriptomic patterns have also been reported in human tissues [27]. It was surprising that 203 unknown protein isoforms were potentially associated with 148 (14.0%) highly expressed genes, suggesting that these unknown protein isoforms may play an important role in basic function among tissues or organs.
Several approaches for systematic analysis of gene expression across different tissues have indicated that gene expression patterns are usually associated with their biological functions,and genes with the similar functions are more likely to exhibit similar expression patterns [28]. Implementing the similar classification criteria for human genes [13] on the RNA-seq data generated from the multiple pig tissues herein, we classified 23,887 putative NCBI genes(including 18,377 PCGs;corresponding to well-annotated 60,578 transcripts) and 3018 unknown protein isoforms into three categories (tissueenriched, group-enriched, and ubiquitously-expressed) for exhibiting their expression features. The numbers of tissueenriched genes, group-enriched genes, and ubiquitouslyexpressed genes are also displayed as a network plot to show an overview of pig PCGs (Figure 4A).
In multicellular organisms, genes expressed in a few tissue types are thought to be tissue-enriched genes, which have tissue-specific related functions. We observed 8482 (14.0%)well-annotated transcripts (5592 genes) and 1178 (39.0%)unknown protein isoforms that have a specific expression in a particular tissue. Furthermore, 16,356 (27.0%) wellannotated transcripts (9726 genes) and 203 (6.7%) unknown protein isoforms were expressed at least 5-fold higher at the zFPKM level in one tissue than in the tissue with the second highest expression. Similar to a previous study in humans [13](Figure 4B), the largest number of tissue-enriched genes were detected in the adult testis, followed by infancy brain, adult retina, and adult brain. These results reflect that the tissues with complex biological processes usually have more tissueenriched genes, and these tissue-enriched genes are strongly associated with the functions of the corresponding tissues.This can be exemplified by theRHO(Rhodopsin) gene that was enriched in retina and was proven to play important roles in retinal pigments [29]. This demonstrates that the tissue specificity can not only confirm the biological characteristics of known genes but also predict basic function of undefined genes in pigs. Accordingly, we successfully updated 1386 tissueenriched unknown protein isoforms to further explain the functional differences among tissues.
Figure 3 The pig transcriptome in unknown protein isoforms
Figure 4 Expression landscape in pig transcriptome
Apart from the genes observed to have tissue specificity,some group-enriched genes were over-represented in the group of tissues/organs that together perform closely related functions. Accordingly, we found that a total of 1322 (2.2%)well-annotated transcripts(948 genes)and 48(1.6%)unknown protein isoforms were detected and could be grouped into seven types of tissues (Figure 4C). The largest fraction of group-enriched genes belonged to the brain tissues (996/1370,72.7%), followed by the muscular system (cardiac muscle and longissimus dorsi; 201/1370, 14.7%), adrenal and thymus glands (90/1370, 6.6%), as well as liver and gallbladder(61/1370, 4.5%). Generally, these group-enriched genes have potential roles in biological system functions,and their expression patterns are usually similar between different species. As exemplified by the group-enriched expression ofMYL3(myosin light chain 3, a known myosin component)(Figure 4D) andENC1(ectodermal-neural cortex 1, involved in mediating uptake of synaptic material) (Figure 4E), these two genes separately displayed similar expression patterns in the muscular system and brain tissue between humans and pigs. Therefore, the 48 of unknown protein isoforms will be the valuable resources for further enriching the functional and comparative genomics between pig and human.
Specifically, we identified 5656 well-annotated transcripts corresponding to 5147 (21.6%) NCBI genes expressed in all tested pig tissues. Among these genes, a variety of known‘‘housekeeping” genes, such asACTB,GAPDH,PGK1, andRPL19(Figure 4F), are usually intracellular and tend to be functionally essential to cell subsistence that are involved in metabolism, transcription, and RNA processing or translation[30].Interestingly,452(15.0%)of unknown protein isoforms were detected as the ubiquitously-expressed genes.The finding of these unknown protein isoforms offers an important supplement to pig genomic annotation. To characterize the gene set of ubiquitous expression of the unknown genes identified herein, we constructed a co-expression network heatmap that consisted of 24 blocks to assess the interactions among these ubiquitously-expressed unknown genes across all pig tissues (Figure 4G). Obviously,these unknown protein isoforms have potentially functional connections with the well-annotated genes in the same blocks(Table S5), which can be explained by those genes within modules of a co-expression network involved in similar or related pathways and biological processes[31].
Figure 5 Subcellular characterization of the unknown pig proteome
Proteins with different subcellular locations usually play different roles in physiological and pathological processes.To characterize these unknown pig proteins at the subcellular level, we performed a proteome-wide subcellular classification for all identified pig protein isoforms (n= 24,431) based on the existing prediction methods [13] (as described in Materials and methods).A major fraction(17,745/24,431, 72.6%)of pig protein isoforms were predicted to be soluble protein isoforms,followed by 21.6% (5272/24,431) of membrane protein isoforms and 5.8% (1414/24,431) of secreted protein isoforms(Figure 5A; Table S6). For an in-depth comparative analysis on PCGs (n= 17,759), we further clustered all proteins with available protein isoforms into four basic categories including 12,744 soluble proteins, 3918 membrane proteins, 1050 secreted proteins, and 47 membrane and secreted proteins(Table S7). As shown in Figure 5B, there were only 2.4% of PCGs (n= 416) with isoforms belonging to two or more categories, which is far less than the percentage of PCGs(19.3%,n=3917)with the similar type of isoforms in human[13]. It is worth noting that the novel protein isoforms(1875/2225, 84.3%) has a greater proportion of soluble proteins than the known protein isoforms (11,090/16,782,66.1%). The results hint that the solubility of soluble proteins in liquids may be one of the reasons that lead to the missing of some proteins in the current pig proteome.
More interestingly, we found that the functions of organs or tissues were also related to the subcellular locations of their expressed proteins. Ranking all of identified proteins by their zFPKM values for each tissue, we selected the top 1% to represent their main proteins. As shown in Figure 5C, a higher proportion of membrane proteins were associated with nervous tissues, such as spinal cord, brain, and retina.Moreover,muscle tissues had a higher proportion in soluble proteins,and a higher proportion of secreted proteins were highly expressed in some secretory tissues, such as liver, uterus, pancreas, gall bladder, and gut. In addition, similar to human proteome[13], proteins highly expressed in secretory tissues tended to be secreted proteins (Figure 5D). For example, proteins encoded by LOC100620249 and Progastricsin (PGC) were secreted proteins highly expressed in the stomach tissue, and the latter is a known secreted protein and constitutes a major component of the gastric mucosa.This result suggests that the LOC100620249 gene more likely has the stomach-related function,providing a valuable information for enhancing studies of pig genomics and biology.
To pursue stronger evidence and orthologous functions for these unknown pig proteins, we further aligned the sequence of each isoform against the top 10 species databases. We adopted two criteria to identify homologous sequences to the newly identified pig proteins with those of other species: 1)percentage of identity greater than 80% and 2) length of homologous sequence longer than 80% of the pig protein sequence.Consequentially,3081 out of 5841(52.7%)unknown protein isoforms were inferred to have orthologues in other species.Although 90.2%(2778/3081)of the unknown isoforms had orthologues in at least other two species, 36.5%(1125/3081) of the unknown isoforms had orthologues in 9(except chicken) or all 10 species (Figure 6A). Interestingly,43.6% (970/2225) of the novel protein isoforms had orthologues in other species, and almost 73.1% (709/970) of them were mapped in the pig genome (Sus scrofa11.1)(Figure 6B). The results indicate that the novel proteins identified herein can be considered as the reliable proteome data that significantly enhance both the pig genome annotation and the current pig protein database.
Figure 6 Orthologous proteins of unknown pig proteome across multiple species
In addition, compared with the existing orthologues in OMA browser (http://omabrowser.org) and current genome sequences, 3081 of the unknown protein isoforms enriched 12,375 novel pairwise orthologous relationships between pigs and other species (Table S8).These pairwise orthologous relationships of proteins between pigs and other species provided a feasible way to investigate the potential functions of corresponding PCGs in the pig genome if these orthologous proteins have been well studied in other species. Therefore,considering the most complete set of annotated genes in human genome, we preformed the functional gene set enrichment for human orthologous proteins of these unknown protein isoforms to speculate their potential functions. A functional Gene Ontology (GO) analysis for these unknown protein isoforms showed that most unknown protein isoforms were enriched for the GO terms of cell and intracellular parts(correctedP< 0.01), providing an important supplement to understand the biological process in pig (Table S9). Meanwhile, by further examining the functional characterization of these unknown protein isoforms,we found 68 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways represented in our unknown proteome (Table S10), mainly involving metabolic pathways (correctedP= 5.2E-20), focal adhesion (correctedP= 2.4E-09), and regulation of actin cytoskeleton (correctedP= 5.1E-07). Importantly, we detected 23 disease signaling pathways from the KEGG disease database (correctedP< 0.05) that included the metabolism, nervous system, skeletal, muscular, and skin diseases(Table S11). These findings will help us better recognize the potential functions of the unknown pig protein isoforms, and provide a valuable resource to support the pig as a biomedical model for human medicine and as donors for porcine-tohuman xenotransplantation.
Here we presented the landscape of a tissue-based proteome for pigs.Our findings not only offered the verification for 84.8%of the existing pig proteins that have been deposited in the UniprotKB(n=16,738),Pig PeptideAtlas(n=9204),and NCBI Protein database (n= 17,781), but also identified 2225 novel protein isoforms.Besides,we also detected 669 protein isoforms from uncharacterized LOC genes and 2947 protein isoforms without NCBI annotation in the current reference genome ofSus scrofa11.1.Eventually,a total of 5841 unknown protein isoforms were exploited to further optimize the annotation of PCGs for the current pig genome.
We systematically characterized unknown protein proteome for their expression features, subcellular locations, and orthologous functions, providing a valuable resource for enhancing studies of pig genomics, as well as offering the opportunities for exploring the potential functions of these unknown proteins. Our findings clearly showed that the missing protein annotation in previous studies was due to the three aspects: 1) low-quality assembly inSus scrofa10.2 genome; 2) specific features (including low expression level,tissue specificity, and greater proportion of soluble components in novel protein isoforms);and 3)inevitable errors derivded from the traditional gene prediction and annotation methods [32]. The in-depth identification and subcellular characterization of proteome using multiple tissues make it feasible to develop a tissue-based pig proteome map and facilitate studies of functional genomics and relevant research fields. We effectively improved the genome annotation for 4026 unknown protein isoforms by mapping their protein sequences to the current pig genome(Sus scrofa11.1),of which 3886 were assigned to chromosomes and 140 were resided on 23 unplaced scaffolds.
High-resolution profiling of pig transcriptome allows us to further reveal 1434 unknown protein isoforms that display a tissue-enriched (1386) or group-enriched (48) function expression pattern. In addition, 452 of unknown protein isoforms were ubiquitously expressed in 34 tissue samples,which raised 7.4% of the potential ‘‘housekeeping” gene in the pig genome.These findings provide new insight into understanding the molecular function of the respective tissue or organ. Further inferring the biological function of unknown pig proteome by human orthologous proteome,we found that these unknown protein isoforms were enriched in 68 KEGG pathways and 23 disease signaling pathways, including the pathways involved in disease of concern for human medicine,such as metabolism, nervous system, skeletal, muscular, and skin diseases. The integrated data of proteome and transcriptome in the 34 pig tissue samples herein were respectively presented in Tables S12 and S13, and 5841 unknown protein isoforms with corresponding genomic locations, expression landscapes, subcellular characterizations, orthologous proteins, and predicted functions were also summarized in Table S14. All findings herein will provide valuable insights and resources for enhancing studies of pig genomics and biomedical model application to human medicine in the future.
The tissue samples and PBMCs used for protein identification and mRNA expression analyses were collected from pigs raised in the Ninghe breeding pig farm in Tianjin, China.For purpose of generating profiles of transcriptome and proteome of all major organs and tissues of pigs, we totally collected 34 samples (i.e., 33 pooled tissues and the PBMCs)from the nine unrelated Duroc pigs,including three adult male pigs and three female pigs at 200–240 days of age, as well as three male infant pigs at 21–25 days of age.All pig tissues were histologically confirmed to be normal and healthy by an experienced pathologist. An overview of all involved tissues and cell samples is provided in Table S3.
All samples were snap frozen within 20 min after slaughter and stored in liquid nitrogen until usage. PBMCs were isolated using Ficoll-Hypaque PLUS (GE Healthcare, Beijing, China)following the manufacturer’s instructions. In brief, the whole blood was first diluted by an equal volume of phosphate buffer solution (PBS). Then, 20 ml of diluted blood was carefully added on top of 10 ml of Ficoll-Hypaque solution in a 50 ml conical tube and centrifuged at 460gfor 20 min at room temperature. After centrifugation, the middle whitish interface containing mononuclear cells was transferred to a new tube,washed by PBS, and centrifuged at 1000 r/min for 10 min twice.
Fresh frozen tissue was thawed, cut into small pieces, and extensively washed with precooled PBS. A pool of equal amounts of tissues from three unrelated pigs was homogenized and sonicated in cold lysis buffer.Extraction of 100 μg protein using protein extraction buffer(8 M urea,0.1%SDS)containing an additional 1 mM phenylmethylsulfonyl fluoride (Beyotime Biotechnology, Shanghai, China) and protease inhibitor cocktail (Roche, CA) was kept on ice for 30 min and then centrifuged at 16,000gfor 15 min at 4°C.The supernatant was collected and determined with BCA assay (Pierce,WA) and 10%–20% SDS–PAGE. The cell lysate was stored at -80 °C before LC–MS/MS analysis.
Total RNA was extracted from the pooled tissues via the Trizol method (Invitrogen, Carlsbad, CA) according to standard protocols. RNA degradation and contamination were monitored on 1% agarose gels.The purity and contamination of total RNA were checked using NanoPhotometer(IMPLEN, Los Angeles, CA) and Qubit RNA Assay Kit in Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA).RNA integrity was measured using the RNA Nano 6000 Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies,Palo Alto, CA). All pig samples with an RNA integrity number(RIN)value greater than 7.0 and at least 5 μg of total RNA were included and batched for RNA-seq.
Total RNA of samples meeting quality control (QC) criteria were rRNA depleted, and depleted QC was done using the RiboMinus Eukaryote System v2 and RNA 6000 Pico chip according to the manufacturer’s protocol. RNA-seq libraries were constructed using the NEBNext Ultra RNA Library Prep Kit (New England Biolabs, Ipswich, England) for Illumina with 3 μg rRNA-depleted RNA according to the manufacturer’s recommendation. RNA-seq library preparations were clustered on a cBot Cluster Generation System using HiSeq PE Cluster Kit v4 cBot (Illumina, CA) and sequenced using the Illumina Hiseq 2500 platform according to the manufacturer’s instructions, to a minimum of 10 G reads per sample(corresponding to 125 bp paired-end reads).
Peptide mixture from each sample was first lyophilized and reconstituted in buffer A [2% acetonitrile (ACN), 98% H2O,pH 10].Then,it was loaded onto a Xbridge PST C18 Column(130 A?,5 μm,250 mm×4.6 mm,Waters,MA)on the Dionex Ultimate 3000 HPLC(Dionex,CA)equipped with a UV detector. Mobile phase consists of buffer A and buffer B (90%ACN, 10% H2O, pH 10). The column was equilibrated with 100%buffer A for 25 min before sample injection.The mobile phase gradient was set as follows at a flow rate of 1.0 ml/min:1)0–19.9 min:0%buffer B;2)19.9–20 min:0%–4%buffer B;3)20–22 min:4%–8%buffer B;4)22–42 min:8%–20%buffer B;5)42–59 min:20%–35%buffer B;6)59–60 min:35%–45%buffer B;7)60–61 min:45%–95%buffer B;8)61–66 min:95%buffer B;9)66–67 min:95%–0%buffer B;and 10)67–91 min:0% buffer B. A fraction was collected every minute from 24 min to 63 min, and a total of 40 fractions collected were then concentrated to 20 fractions, vacuum dried, and stored at –80 °C until further LC–MS/MS analysis.
Peptide mixture was analyzed on a Q Exactive instrument(ThermoFisher Scientific, MA) coupled to a reversed phase chromatography on a Dionex nano-UPLC system using an Acclaim C18 PepMap100 nano-Trap column (75 μm × 2 cm,2 μm particle size, ThermoFisher Scientific) connected to an Acclaim PepMap RSLC C18 analytical column(75 μm × 25 cm, 2 μm particle size, ThermoFisher Scientific).Before loading, the sample was dissolved in sample buffer,containing 4% ACN and 0.1% formic acid. Samples were washed with 97% mobile phase A (99.9% H2O, 0.1% formic acid) for concentration and desalting. Subsequently, peptides were eluted over 85 min using a linear gradient of 3%–80%mobile phase B(99.9%ACN,0.1%formic acid)at a flow rate of 300 nl/min using the following gradient: 3% B for 5 min;3%–5% B for 1 min; 5%–18% B for 42 min; 18%–25% B for 11 min; 25%–30% B for 3 min; 30%–80% B for 1 min;hold at 80% B for 5 min; 80%–3% B for 0.5 min; and then hold at 3% B for 21.5 min. High mass resolution and higher-energy collisional dissociation (HCD) was employed for peptide identification. The nano-LC was coupled online with the Q Exactive mass spectrometer using a stainless steel emitter coupled to a nanospray ion source. The eluent was sprayed via stainless steel emitters at a spray voltage of 2.3 kV and a heated capillary temperature of 320 °C. The Q Exactive instrument was operated in data-dependent mode,automatically switching between MS and MS2. MS analysis was performed in a data dependent manner with full scans(350–1600m/z) acquired using an Orbitrap mass analyzer at a mass resolution of 70,000 at 400m/zon Q Exactive using an automatic gain control (AGC) target value of 1 × 106charges. All the tandem mass spectra were produced by HCD. Twenty most intense precursor ions from a survey scan were selected for MS/MS from each duty cycle and detected at a mass resolution of 17,500 atm/zof 400 in Orbitrap analyzer using an AGC target value of 2 × 105charges. The maximum injection time for MS2 was 100 ms and dynamic exclusion was set to 20 s.
In total, 71 peptides from 31 proteins (7 known proteins, 11 homologous novel proteins, and 13 non-homologous novel proteins) were randomly selected for peptide synthesis (GL biochem, Shanghai, China) for validation of identified proteins. The synthesized peptide sequences were mixed and processed twice by chromatographic separation using the Thermo Scientific EASY-nLC HPLC system and Thermo Scientific EASY column. Mass spectral analysis was then performed by Q Exactive (ThermoFisher Scientific) and processed by Mascot V2.5.1.Finally, all these peptides were compared with those identified from our proteome analysis to verify novel proteins.
We conducted a QC step on raw fastq reads for efficient and accurate RNA-seq alignment and analysis. In this step, raw reads were cleaned up for downstream analyses using the following steps: removal of adapter sequences using BBDuk(http://sourceforge.net/projects/bbtools/) [33]; calculation of the Q20, Q30, and GC content of the clean data for QC and filtering using FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/); and homopolymer trimming to the 3′end of fragments and removed the N bases from the 3′end using FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/).
RNA-seq data were mapped and genome indexed with Hisat 0.1.6-beta 64-bit [34] to the pig genome release version ofSus scrofa11.1 (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/).Sus scrofa11.1 annotation was used as the transcript model reference for alignment, splice junction identification, and quantification of all PCGs and isoforms. To obtain expression levels for all pig genes and transcripts across all 34 samples, FPKM values were calculated using Stringtie 1.0.4 (Linux_x86_64) [35] with the default parameters. A gene or transcript was defined as expressed if it’s FPKM value was measured greater than 0.1 across all tissues. For each tissue, we applied zFPKM normalization method [26] to generate high-confidence estimates of gene expression.
Refer to the gene classification in human,we also classified the pig genes into one of the three categories based on the zFPKM levels in 34 samples:1)‘‘tissue-enriched”was only detected in a single tissue as well as at least 5-fold higher at the zFPKM level in one tissue compared to the tissue with the second highest expression; 2) ‘‘group-enriched” was detected in all tissues from a group, and the expression of genes in any tissue from the group was higher than that in the tissue(s) not from the group; and 3) ‘‘ubiquitously-expressed” was detected in all 34 tissue samples.
To identify novel proteins and improve existing protein annotations in the pig genome, the database for protein searching (MS/MS data searched against protein database)was taken from three different levels using in-house perl scripts, including: 1) UniProt database (Sus scrofa); 2)three-frame-translated mRNAde novosequences from the current study; and 3) six-frame-translated pig genome database. The details are as follows:
Primary database of proteins
Resource protein datasets for pig (UniProt version 20150717 containing 34,131 entries, with 1486 Swiss-Prot and 32,643 TrEMBL) were downloaded from the UniProt database(http://www.uniprot.org/).
Secondary database of proteins
It is well known that pig proteins were insufficiently represented by the known detectable proteins,because of the incomplete nature of the pig genome assembly and limited annotation. In our study, three RNA resources were used(Table S15): 1) EST datasets including 34,131 entries from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/susScr3/bigZips/) and 1,676,406 entries from the NCBI database(http://www.ncbi.nlm.nih.gov/nucest). ESTs are normally assembled into longer consensus sequences for three-frametranslated mRNA protein database using iAssembler version 1.3.2.x64 [36] with default parameter; 2) paired-end read libraries including 34 RNA-seq libraries from our study and 7 previously published articles and NCBI databases. To construct a complete protein database for three-frame-translated mRNA, we used Trinity (version 2.0.6) [37] forde novotranscriptome assembly from RNA-seq data,and identified potential coding regions within Trinity-reconstructed transcripts by TransDecoder (developed and included with Trinity); and 3)single-end reads from 10 previous studies were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/sra/). The methods for sequence assembly and coding region prediction were similar to that used for the paired-end reads. Finally, the length cutoff of the translated proteins was set as 100 amino acids to ensure the quality of all databases and reduce the number of false positive hits.
Tertiary database of proteins
To capture the proteins missed during the laboratory discovery process as far as possible, protein annotation of the pig genome was carried out using theab initiomethods with GeneScan (version 1.0) software [38].
Finally, we compared different protein databases to detect the repetitive protein isoforms (redundancies) among three protein databases using BLASTP software. And then the repetitive protein isoforms were removed and reserved for only one according to the following database priorities:UniProt >De novo>Ab initio.
All MS/MS data were analyzed using Mascot (version 2.5.1, Matrix Science, London, UK) [39] and X!Tandem(version 2010.12.01.1, The GPM, Rockville, MD) [40]. Mascot was set up to search the three pig databases (UniProt,De novo, andAb initio), as well as the common Repository of Adventitious Proteins (cRAP) database (downloaded on 07 Jul 2015, 116 sequences) assuming the digestion enzyme trypsin.
The high-resolution peaklist files were converted into Mascot generic format prior to database searching by ProteoWizard (version 3.0.6). X!Tandem was set up to search a subset of the pig databases, as well as the cRAP database. The target-decoy option of Mascot and X!Tandem was enabled (decoy database with reversed protein sequences). Mascot and X!Tandem were used for searching with a fragment ion mass tolerance of 0.050 Da and a parent ion tolerance of 10.0 PPM. The number of maximums allowed missed cleavage sites was set to 2. All PSMs identified at 1% FDR were set for all samples. Carbamidomethyl of cysteine was specified in Mascot and X!Tandem as a fixed modification. Gln->pyro-Glu of the N-terminus, oxidation of methionine,and acetylation of the N-terminus were specified in Mascot as variable modifications. Glu->pyro-Glu of the N-terminus, ammonia-loss of the N-terminus, Gln->pyro-Glu of the N-terminus, oxidation of methionine,and acetylation of the N-terminus were specified in X!Tandem as variable modifications.
Scaffold was used to validate MS/MS-based peptide and protein identification. Peptide identification was accepted if it achieved an FDR < 1% by the Scaffold local FDR algorithm. Protein identification was accepted if it had an FDR < 1% and contained at least 2 identified peptides.Protein probability was assigned by the Protein Prophet algorithm. Proteins that contained similar peptides and could not be differentiated based on MS/MS analysis alone were grouped to satisfy the principles of parsimony. Proteins sharing significant peptide evidence were grouped into clusters. In the database searching workflow, unmatched MS/MS spectra generated from the Uniprot database searching were then searched against next level protein database (De novoandAb initio).
We attempted to map all unknown protein isoforms against the pig genome using MAKER annotation workflow [22].First,the low-complexity repeats of pig reference genome were soft-masked by RepeatMasker. Then, the unknown protein isoforms were aligned to the masked reference genome by BLAST [23] for identifying their genomic location roughly.Last, Exonerate [24] was used to realign and polish the exon–intron boundaries of the unknown gene with the splicesite aware alignment algorithm. The house-python script was used to deal with the result: if a successfully aligned protein had 95% identity overall, 95% coverage, and a less than 50-kb distance from its neighboring exon, it was recorded to be an effectively aligned sequence.
The prediction of pig membrane proteins was carried out similarly to how these proteins were classified in the human proteome. A total of seven methods, MEMSAT3 [41],MEMSAT-SVM [42], SPOCTOPUS [43], THUMBUP [44],SCAMPI multi-sequence-version [45], TMHMM [46], and Phobius version 1.01 [47], were used to identify membrane protein topology with different assessment algorithms [e.g.,opological models, neural networks, support vector machines(SVMs), scale of free energy contributions, and hidden Markov models (HMMs)]. In our study, a protein isoform was assigned as transmembrane if it was predicted by at least four out of the seven methods.
In accordance with human secretome analysis, the prediction of signal peptides was based on neural networks and HMMs with three software programs: SignalP4.0 [48], SPOCTOPUS [43], and Phobius version 1.01 [47]. The protein isoforms, which were predicted to contain a signal peptide by at least two out of the three methods, were classified as potentially secreted.
Integrating the prediction results of pig membrane proteins and pig secreted proteins, we classified each pig protein isoform into one of the three classes: secreted, membrane, or soluble (neither membrane nor secreted protein). In order to compare the proteome between pig and human conveniently,we also constructed four major categories to classify proteins composed of single or multipe subunits:1)‘‘soluble”contained proteins with subunits only from the soluble category; 2)‘‘secreted” contained proteins with subunits only from the secreted category or from both the soluble and secreted categories; 3) ‘‘membrane” included proteins with subunits only from the membrane category or from both the soluble and membrane categories; and 4) ‘‘membrane and secreted”contained proteins with subunits from both the secreted and membrane categories or from the soluble, secreted, and membrane categories.
In order to reveal the groups of PCGs that are functionally related in the whole pig organism, 34 pig tissue datasets were constructed using the WGCNA method. In our study, we mainly used the blockwiseModules function in the WGCNA R package [49] to perform the co-expression network construction, with the following parameters:corType =pearson; maxBlockSize = 30,000; power = 8; minModuleSize = 30; mergeCutHeight = 0.1. The brief function of blockwiseModules automatically constructed a correlation network, created a cluster tree, defined modules as branches,merged close modules, and yielded the module colors and module eigen genes for subsequent analysis (such as visualization by the plotDendroAndColors function).
GO analysis and KEGG (http://www.genome.jp/kegg/) pathway enrichment analysis were performed and corrected by FDR method with KOBAS 3.0 (http://kobas.cbi.pku.edu.cn/anno_iden.php). GO terms appearing in this study were summarized within three categories: cell component, molecular function,and biological process.In view of the most complete gene annotation in human genome, we gave priority to those human annotated genes which were homologous to pig genes and utilized them as the background.
All experimental procedures were performed by a license holder in accordance with the protocol approved by the Institutional Animal Care and Use Committee (IACUC) of China Agricultural University (Permit No. DK1023).
The RNA-seq raw data of the 34 pig tissue samples have been deposited in the BioProject database at NCBI (BioProject:PRJNA392949), which are publicly accessible at https://www.ncbi.nlm.nih.gov/, and also in the Genome Sequence Archive [50] at the National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation (BioProject:PRJCA004356), which are publicly accessible at https://ngdc.cncb.ac.cn/gsa. The pig proteomic data have been deposited in PRIDE (PRIDE: PXD006991), which can be accessed at https://www.ebi.ac.uk/pride/archive/projects/PXD006991.
CRediT author statement
Pengju Zhao:Software,Data curation,Writing-original draft.Xianrui Zheng:Validation, Visualization, Writing - original draft.Ying Yu:Writing-review&editing,Supervision.Zhuocheng Hou:Formal analysis, Investigation.Chenguang Diao:Methodology,Data curation.Haifei Wang:Resources,Validation.Huimin Kang:Software.Chao Ning:Data curation.Junhui Li:Resources.Wen Feng:Resources.Wen Wang:Writingreview & editing.George E. Liu:Writing - review & editing.Bugao Li:Writing - review & editing.Jacqueline Smith:Writing - review & editing.Yangzom Chamba:Supervision.Jian-Feng Liu:Conceptualization, Writing - original draft, Writing- review & editing, Supervision, Project administration,Funding acquisition. All author have read and approved the final manuscript.
Competing interests
The authors have declared that they have no competing interests.
Acknowledgments
This work was financially supported by the National Natural Science Foundations of China (Grant No. 31661143013) and the Jinxinnong Animal Science Development Foundation.We thank Ziyao Fan,Yichun Dong,and Kaijie Yang for sample collection. We gratefully acknowledge the assistance from CapitalBio Technology and Beijing Compass Biotechnology Co., Ltd. for high-performance computing platform support.
Supplementary material
Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2021.02.002.
ORCID
ORCID 0000-0001-6990-1147 (Pengju Zhao)
ORCID 0000-0002-1784-7393 (Xianrui Zheng)
ORCID 0000-0002-4524-0791 (Ying Yu)
ORCID 0000-0003-4752-2255 (Zhuocheng Hou)
ORCID 0000-0001-5649-1687 (Chenguang Diao)
ORCID 0000-0003-2424-9743 (Haifei Wang)
ORCID 0000-0002-9396-1948 (Huimin Kang)
ORCID 0000-0001-8247-1700 (Chao Ning)
ORCID 0000-0002-2810-1765 (Junhui Li)
ORCID 0000-0001-8485-5345 (Wen Feng)
ORCID 0000-0002-7801-2066 (Wen Wang)
ORCID 0000-0003-0192-6705 (George E. Liu)
ORCID 0000-0002-3844-3726 (Bugao Li)
ORCID 0000-0002-2813-7872 (Jacqueline Smith)
ORCID 0000-0003-0982-3364 (Yangzom Chamba)
ORCID 0000-0002-5766-7864 (Jian-Feng Liu)
Genomics,Proteomics & Bioinformatics2021年5期