Lun Hu,, Shicheng Yang, Xin Luo, Senior,Huaqiang Yuan, Khaled Sedraoui, and MengChu Zhou,
Abstract—Protein-protein interactions are of great significance for human to understand the functional mechanisms of proteins.With the rapid development of high-throughput genomic technologies, massive protein-protein interaction (PPI) data have been generated, making it very difficult to analyze them efficiently. To address this problem, this paper presents a distributed framework by reimplementing one of state-of-the-art algorithms, i.e., CoFex, using MapReduce. To do so, an in-depth analysis of its limitations is conducted from the perspectives of efficiency and memory consumption when applying it for largescale PPI data analysis and prediction. Respective solutions are then devised to overcome these limitations. In particular, we adopt a novel tree-based data structure to reduce the heavy memory consumption caused by the huge sequence information of proteins. After that, its procedure is modified by following the MapReduce framework to take the prediction task distributively.A series of extensive experiments have been conducted to evaluate the performance of our framework in terms of both efficiency and accuracy. Experimental results well demonstrate that the proposed framework can considerably improve its computational efficiency by more than two orders of magnitude while retaining the same high accuracy.
PROTEINS are crucial to provide valuable insight into understanding the mechanisms of cellular functions for a variety of living organisms. However, proteins have to interact with each other in order to function well [1]. In this regard,protein-protein interactions (PPIs) are of interest in biology because of their ability of regulating a variety of cellular processes, including but not limited to metabolic cycles, cell signaling and DNA transcription. Due to the successful applications of machine learning techniques in bioinformatics[2], [3], a considerable number of computational techniques have been developed in recent years to complete the task of PPI prediction.
When performing PPI prediction, most of existing algorithms require additional information of proteins to be known in advance [4]. Some of them take into account genomic information, such as the conversion of gene-order[5], [6] and the priors of genomic features among interacting proteins [7]. Motivated by the observation that coevolving proteins are more likely to interact with each other, several techniques address the problem from an alternative viewpoint by making use of evolutionary information among proteins,such as phylogenetic trees [8], [9] and protein domain information [10]–[12]. Though efficient, these two kinds of algorithms commonly suffer from the disadvantage that both of genomic and coevolutionary information required by them is not always available especially for newly discovered proteins. To overcome this disadvantage, some attempts have been made by using the sequence information of proteins to predict PPIs [13]–[17], as such sequence information can be easily obtained by high-throughput technologies. Since the performance of sequence-based algorithms heavily relies on the evidence collected from protein sequences, different strategies are adopted to extract appropriate patterns from protein sequences for more accurate PPI prediction.
In [18], a novel feature extraction algorithm, namely CoFex,has been proposed to make use of variable-lengthk-mers.Moreover, unlike existing sequence-based algorithms that intend to concatenate the feature vectors of two proteins,CoFex explicitly composes feature vectors for pairs of proteins based on the presence ofk-mers that provide certain evidence to support the existence of interactions. It can then discover the relationship between the sequence information of proteins and their interactions in a more intuitive manner. To evaluate its performance, a support vector machine (SVM)classifier with sigmoid kernel is trained based on its extracted feature vectors and the results show that such SVM beats the state-of-the-art algorithms in prediction accuracy.
With the rapid development of high-throughput genomic technology, PPI data become so massive that most of existing prediction algorithms are no longer favorable. In order to fulfill the demanding requirements for large-scale PPI prediction, it is essential for us to design a scalable prediction algorithm that not only improves the computational efficiency but also retains high accuracy. Since existing prediction algorithms have been proved with a promising performance when applied to predict PPIs, we adopt a rather intuitive strategy by modifying an existing prediction algorithm instead of developing completely new algorithms that are capable of handling such huge data. To this end, we develop a distributed framework in this work by reimplementing CoFex under the MapReduce framework.
To do so, we first conduct an in-depth analysis to identify the limitations of CoFex in terms of memory consumption and efficiency. There are several points worth noting. First of all,the sequence information of proteins has to be traversed multiple times when CoFex extracts coevolutionary patterns that arek-mers frequently observed in the interacting proteins with statistical significance. Such traversal mechanism could take much time for a large number of proteins. Secondly,CoFex requires the massive data of protein sequence information to be stored in the memory, thus leading to high memory consumption for large-scale PPI prediction. Lastly,its efficiency is constrained by the integration with SVM,which is not applicable to proceed the training process when the number of feature vectors is huge. In this work, respective solutions are developed to overcome these limitations and a distributed framework modified from CoFex, namely CoFex+,is proposed for significantly improved efficiency when applied to large-scale PPI prediction. The experimental results show that CoFex+achieves more than two orders of magnitude improvement in running time with only a negligible loss in accuracy.
The rest of this paper is organized as follows. In Section II,a detailed literature review about related work of PPI prediction is presented. In Section III, preliminaries about the mathematical symbols used are introduced. The details of CoFex+are described in Section IV. Experimental results are discussed in Section V, following which we present an indepth discussion about the effectiveness and efficiency of CoFex+in Section VI. The paper ends with a conclusion in Section VII.
In this section, a detailed literature review about PPI prediction is presented. Existing algorithms can be classified by following different criteria. To distinguish them, we summarize the representative work by following the biological information used to construct the feature vectors of proteins or PPIs. The main sources of such information include genomic information, evolutionary profiles, and protein sequences, thus leading to three categories of algorithms. Moreover, we introduce recent attempts to conduct large-scale PPI prediction.
Genes located in genome sequences can hint at the existence of interaction between pairwise proteins at a comprehensive level due to the availability of whole genomic sequencing data. Dandekaret al. [5] find that proteins encoded by conserved gene pairs are more likely to interact with each other and such conserved gene pairs are within a low level conservation of gene-order. Although this algorithm is useful to predict PPIs by using the conservation of gene-order, it is incapable of predicting the interaction of proteins that are not conserved for gene sequences, such as those encoded by distantly located genes.
In [7], different genomic features, such as messenger RNA coexpression, coessentiality and colocalization, are used to quantify the associations between them and PPIs. Based on these quantified associations, Bayesian networks are constructed to predict PPIs when the genomic features of query proteins are provided.
Another motivation about the use of genomic information for PPI prediction is that interacting proteins are found to be homologous in another genome where they are fused into a single protein [19]. Furthermore, Enrightet al. [20] have developed a computational algorithm to identify such fusion events in different genomes to predict the PPIs of proteins involved in these events. However, its disadvantage is that it is unable to work with proteins whose fusion events are uncovered through the analysis of genomic sequencing.
Evolutionary information of proteins reveals the procedure of how proteins evolve across different species. Since proteins that co-evolve are more likely to interact with each other, the similarity in evolutionary information is of potential relevance to the prediction of PPIs as it indicates to what extent the two proteins co-evolve.
Pazos and Valencia [8] make use of phylogenetic trees of proteins to indicate PPIs. They propose a distance measure to compute the similarity among the phylogenetic trees of proteins, thus determining whether there is a possible interaction among them. Another source of evolutionary information is the domain knowledge of proteins. It is believed that proteins interact as a result of their interacting domains, and hence there are many computational algorithms proposed to predict PPIs based on domain knowledge. A maximum likelihood estimation algorithm is applied to identify interacting domains that infer curated PPIs. Then with such inferred interacting domains the interactions between proteins can be predicted [21]. Similarly, random forest of decision trees is trained by taking into consideration all the proteins domains; and then used to predict PPIs [22].Maetschkeet al. [23] introduce the concept of inducers to make use of gene ontology (GO) information more effectively. Given two GO terms, their induced term set is composed of all GO terms along the shortest path between the two terms. After that, GO2PPI is proposed by integrating all induced GO terms with popular classification techniques to predict PPIs.
Protein sequences, composed of amino acids, are the primary structures of proteins and the motivation behind the use of protein sequences for predicting PPIs derives from the hypothesis that sequence information may contribute to mediate PPIs.
When extracting patterns from protein sequences, sequencebased algorithms normally takek-mers into account to compose feature vectors of proteins. Thesek-mers are amino acid sequence segments with length equal tok. Taking [13] as an example, the value of each element in the feature vector of a protein is the number of occurrences of a particular 3-mers in the sequence of that protein. Once all the feature vectors are obtained for all the proteins in the training dataset, these sequence-based algorithms apply them to traditional classifiers, such as SVM and random forest, to distinguish between interacting and non-interacting proteins. Similar to[13], an SVM-based prediction algorithm is proposed by combining a kernel function and a conjoint triad feature for the description of amino acids [14].
In addition to SVM-based algorithms, Pitreet al. [24]propose PIPE to tackle the problem of predicting PPIs from a different perspective. Their idea is to measure how often pairs of subsequences in the two query proteins co-occur in pairs of proteins that are known to interact. In [25], PPIevo is proposed to extract the feature vectors from Position-Specific Scoring Matrix for each of proteins based on sequence information. It adopts the random forest classifier to predict PPIs based on the feature vectors of all protein pairs involved.
Instead of constructing feature vectors for individual proteins, CoFex [18] intends to extract coevolutionary features from pairs of protein sequences for predicting PPIs. DNN-PPI[26] develop a novel deep neural network framework to automatically extract features, including including semantic associations between amino acids, position-related sequence segments, and their long- and short-term dependencies, from protein primary sequences. As a recent attempt, a novel computational model has been developed to predict PPIs effectively [17]. To do so, it first combines the F-vector,composition and transition to map each protein sequence onto numeric feature vectors, and then employs a principal component analysis to reconstruct the most discriminative feature subspaces as the input for weighted sparse representation classification.
Since PPI networks have rich connectivity patterns [1],network-based prediction algorithms have been developed to make use of these patterns to characterize known PPIs in a given PPI network. Defining a degree-normalized score based on network paths with length three, L3 [27] is able to determine whether two proteins are interact or not if one is similar to the interacting partners of the other. To alleviate the restriction on the length of network paths, Wanget al. [28]develop a novel prediction algorithm based on a mixed membership stochastic block model. The algorithm simulates the generative process of a PPI network according to the likelihoods of pairwise proteins being grouped in the same protein complex. Recently, an improved graph representation learning method, namely S-VGAE, is proposed to predict PPIs based on both sequence information and network structure[29].
At present, PPIs that have been identified take less than 20% of the whole interactome [30]. With the development of high-throughput technologies, the size and complexity of protein interaction data have increased significantly. A new challenge is thus raised for large-scale PPI prediction.Recently, several attempts have been made in this field.
In order to achieve the purpose of effectively and accurately predicting large-scale PPIs, Youet al. [31] propose a parallel SVM model by only requiring the use of protein sequence information for large-scale PPI prediction. Huet al. [32]propose a large-scale protein interaction prediction algorithm based on MapReduce. It first extracts amino acid fragments from sequences of proteins for statistical analysis, and then constructs the corresponding feature vectors of proteins to train SVM. Motivated by the idea of distributed computing,Chenet al. [33] propose a Multi-source Learning based Protein Community Detection algorithm by integrating gene expression data and implement its parallel version on the Apache Spark platform to enhance computational efficiency.
Due to the limitation of high-throughput genomic experiments, existing PPI data suffers the disadvantage of high false-positive and false-negative rates, which could indicate the existence of noise in PPI data. In this regard,several attempts have been made to reduce noise and can also be applied to noise reduction in PPI prediction. Biet al. [34]examine a variety of sequence smoothing methods to eliminate the interference of noise points in the original sequence, and identify Savitzky-Golay (SG) filter as the most effective method among them. Similarly, SGW-SCN(Savitzky-Golay and wavelet-supported stochastic configuration networks) [35] adopts the SG filter method to remove possible outliers and noises in non-stationary workload time series. Kritikoset al. [36] integrates several well established PPI weighting methods to assign weights that can indicate the confidence level that a given PPI is a true-positive one. Luoet al. [37] propose a collaborative filtering-enhanced topology-based method to compute an inter-neighbourhood similarity for assessing PPIs.
Given an alphabet set Γ={τ1,τ2,...,τnΓ} consisting of totalnΓamino acids, a protein sequenceSis represented asS=(st)1≤t≤ns, wherest∈Γ andnsis the length ofS.Therefore, ak-mer segment starting from the positiontinSis denoted asSt,k=(st,st+1,...,st+k?1), where 1 ≤t≤ns?k+1.
A pair of coevolving positions with lengthkmeans that there arek?2 do not-care positions between them. For the sake of convenience, we use (τi,τj)k, where τi,τj∈Γ to represent a possible candidate of coevolutionary patterns. IfScontains (τi,τj)k, it means that ?St,k:st=τiandst+k?1=τj.is the set of coevolutionary patterns with lengthk. Assuming thatkmaxis the maximum value ofk, the set of all coevolutionary patterns is denoted as
As one of state-of-the-art algorithms proposed for PPI prediction, CoFex yields better accuracy than several popular sequence-based algorithms. In the rest of this section, we briefly introduce it and then discuss its efficiency bottlenecks when deployed to perform large-scale PPI prediction.
CoFex is composed of four steps. First, it targets to identify all coevolutionary patterns so as to compose V with some statistical knowledge. Next, it makes use of mutual information theory to quantify the amount of evidence to support or oppose the existence of interaction for each coevolutionary pattern and obtains a weighted version of V denoted as. Thirdly, for each pair of proteins, it constructs a feature vector based on theLastly, it trains an SVM classifier is trained by CoFex to predict PPIs.
Although a series of extensive experiments have demonstrated its promising accuracy, its efficiency could not fulfill the demanding requirements for large-scale PPI prediction. According to Fig. 1(b), we note that its running time is exponentially increased with the number of interacting proteins. Moreover, CoFex is unable to perform its task when the number of interacting proteins is larger than 800 000.
To identify its efficiency bottlenecks, we conduct an indepth analysis into its procedure and identify several key issues. First of all, when extracting coevolutionary patterns to compose V, it has to traverse the sequence information of all proteins for each candidate, and such traversal is very inefficient when the number of proteins increases. Secondly,the heavy computation in its second and third steps further constraints its efficiency. Lastly, due to the memory limitation in a single machine, SVM adopted by CoFex is not feasible for training and testing in the context of big data. Hence, in the following section, we intend to address these bottlenecks of CoFex in a distributed manner so as to well conduct largescale PPI prediction.
As a programming model, MapReduce is convenient to allow programs running in a high-performance parallel platform consisting of a huge number of computing nodes[38]. Engineers only need to provide high-level parallelism information and design map and reduce functions to be executed in parallel across multiple nodes. Because of its convenience and efficiency, it has become one of the most popular frameworks for large-scale date analysis [31],[39]–[41]. It has two parallel processing phases, i.e.,MapandReduce. Each phase adopts a user-defined function to complete its task. The function used inMaptakes key-value pairs as input and generates another series of key-value pairs as intermediate output to write to local disk. The function used inReduceintends to aggregate all the values with the same key obtained from the output ofMap.
As for the implementation of CoFex+, a main concern lies in the specialized environments and configuration requirements when using MapReduce. In order to make it easy to use and simplify the construction of complex environments, we choose a distributed integration platform called Hadoop [42]as the base of our experiments. The reason is that Hadoop is an open source cloud computing platform. It implements the MapReduce framework that have been successfully evaluated by a large number of practical applications in many fields,such as machine learning and computational biology.Moreover, we have accumulated solid experiences in implementing distributed algorithms in Hadoop. In addition to Hadoop, it is also possible for us to implement CoFex+by using other distributed platforms, such as spark [43], [44], if we follow the details of CoFex+and design the corresponding MapReduce-like functions provided by these platforms.
In this section, we describe the procedure of CoFex+. It addresses the efficiency bottlenecks of CoFex by following a MapReduce framework. As shown in Fig. 2, it is divided into four steps including V identification, V weighting, feature vector construction, and large-scale PPI prediction. In fact, the structure of CoFex+is similar to that of CoFex, which also has these four steps. However, the structure of CoFex is not compatible with popular distributed integration platforms, and thus it is impossible to explicitly allow CoFex to execute in a distributed manner. Hence, for each step of CoFex+, we follow the MapReduce framework to redesign its process, and divide it into theMapandReducephases. By doing so, we are able to deploy CoFex+ in a distributed integration platform, thus completing the task of large-scale PPI prediction. The two algorithms mentioned in Fig. 2 are independent with each other, and thus they do not have relationships in terms of input and output parameters. Note that all algorithms and procedures of CoFex+are presented in the supplementary material.
Fig. 1. Efficiency and ROC curves of CoFex+. (a) The efficiency of CoFex+ on the datasets composed of proteins at different magnitudes; (b) the efficiency of CoFex+ on the datasets composed of pairwise proteins at different magnitudes; (c) the ROC curves of CoFex+ on the datasets composed of proteins at different magnitudes.
Fig. 2. The complete procedure of CoFex+.
When verifying each candidate of coevolutionary patterns,CoFex has to traverse the sequence information of all proteins,which is very inefficient for large-scale PPI datasets. To avoid unnecessary traversal, CoFex+adopts a tree-based data structure, namely CF-tree [45], to record the occurrences of all candidates by only traversing the input dataset once. In particular, CF-tree is designed as a multiway tree where each node can take more than two children. For each node in the tree, we define two items, i.e.,AminoAcidandProteinSet. The former is the amino acid assigned to this node, while the latter is a set of proteins where the coevolutionary pattern defined by the root and this node is observed. Each coevolutionary pattern in V is identified by CoFex+with two phases, i.e.,MapandReduce.
1) Map:In this phase, CoFex+aims to construct a CF-tree from the dataset of protein sequences assigned to an arbitrary computing node running a map task. Since the largest length of coevolutionary patterns iskmax, it first splits each sequence, i.e.,S, into a set of segments denoted as{St,kmax}(1 ≤t≤ns?kmax+1). After that, several CF-trees are built by following Algorithm 1 and each CF-tree has a unique amino acid assigned to its root node. For each segment inSt,kmax, we first compose a branch for it and then select the CFtree whose root node is the amino acid at the first position of this segment. Finally, we append the branch to a CF-tree by following Procedure 1. Hence, for each amino acid in this segment, its previous amino acid is the parent node while its next one is the child node. In addition to the growth of the CFtree, the protein owning this segment is added toProteinSetof each node in the corresponding branch.
Algorithm 1 Construction of CF-trees{S} kmax Input: protein sequences , {Tτi}Output: CF-trees Tτi AminoAcid τi 1: initialize each by setting the value of as for its root node{S}2: for S in do St,kmax {St,kmax}3: for in do Grow(Tst,St,kmax)4: call 5: end for 6: end for{Tτi}7: return
Procedure 1 Grow(Tst,St,kmax)Input: , TstSt,kmax Output: Tst Node(St,k)AminoAcid st+k?1 St,kmax Tst 1: assume that returns the kth node whose value of is along the branch of in Node(S t,1)2: add the protein of S to i ∈[1,kmax ?1]3: for each do Node(S t,i+1)4: if does not exist then 5: initialize a node by setting the value of as Node(S t,i)AminoAcid st+i 6: set the parent of this node as 7: end if ProteinS et Node(S t,i+1)8: add the protein of S to the of 9: end for 10: return Tst
The output ofMapare the key-value pairs by following the format 〈τi,Tτi〉, whererepresents the CF-tree whose root node has the value τiin itsAminoAcid.
2) Reduce:CoFex+tends to merge all the CF-trees whose root nodes have the same value ofAminoAcid, thus identifying the coevolutionary patterns to compose V. Since the input ofReduceis the key-value pairs with format 〈τi,{Tτi}〉, each CFtree inis processed by the same computing node according to a MapReduce framework. The identification of V on each computing node is described by Algorithm 2. In particular, givenCoFex+first merges all CF-trees in it into a single CF-tree, as their root nodes have the same value,i.e.,forAminoAcid. For convenience,used afterward r epresents the merged CF-tree.
Algorithm 2 Identification V Input: , V τi{Tτi}Output: V=?1: set 2: merge all CF-trees in to compose τj Γ{Tτi} Tτi 3: for in do k=2 →kmax 4: for do f((τi,τ j)k)5: obtain with (1) and defined by Procedure 2 f((τi,τ j)k)≥1.96 6: if then Count(Tτi,τ j,k)7: add to 8: end if 9: end for 10: end for V(τi,τ j)k V 11: return
Procedure 2 Count(Tτi,τ j,k)Input: , , k Tτiτj Output: the number of occurrences of L={}(τi,τ j)k 1: initialize 2: assume that return all nodes at the depth k of node Nodes(Tτi,k)Nodes(Tτi,k) Tτi 3: for each in do AminoAcid node τj 4: if the of is then L=L∪ProteinS et node 5: of 6: end if 7: end for 8: remove duplicate proteins in L 9: return the size of L
The next step verifies whether a candidate (τi,τj)kcan be considered as a coevolutionary pattern with some statistical knowledge. Assuming that there are totalnunique proteins in the dataset, we would like to confirm that the value ofCount(,τj,k) is sufficient enough to reach the conclusion t hat the occurrences of (τi,τj)kare frequently observed among the sequences of all proteins. Following the statistical analysis in [18], we can further rewrite the definition off((τi,τj)k) as
where
Given (1), we reckon that (τi,τj)kis frequently found am(ong all) protein sequences at a confidence level of 95% iff(τi,τj)k≥1.96, and it can be considered as a coevolutionary pattern. V is obtained once we examine all the candidates by following the same procedure, and hence the output of this phase follows the format of 〈 (τi,τj)k,null〉.
Weighting coevolutionary patterns is another step requiring multiple traversal and heavy computation. In particular, for each pattern, CoFex has to obtain its occurrences in the interacting and non-interacting proteins by traversing the training dataset and then calculate its weight according to mutual information theory. To accelerate such computation,CoFex+decomposes the weighting procedure intoMapandReducephases. InMap, the occurrence information of coevolutionary patterns in the interacting and non-interacting proteins is obtained while the weights of coevolutionary patterns are computed inReduce.
1) Map:The input dataset is composed of interacting and non-interacting proteins. For each computing node, CoFex+targets to count the occurrences of evolutionary patterns from a part of the input dataset assigned to this node. To do so, it first obtains all the CF-trees and coevolutionary patterns generated in the first step. Since similar counting procedures are applied to all coevolutionary patterns, we take an arbitrary pattern, i.e., (τi,τj)k, as an example to illustrate it. For a particular computing node taking the map task, we assume that I is the dataset of interacting proteins andis the dataset of non-interacting proteins, and both of them are allocated to that computing node. Furthermore, we introduceandto denote the numbers of occurrences of (τi,τj)kin I andrespectively. Verifying (τi,τj)kin the sequences of two proteins is done via Procedure 3.
Assuming thatpxandpyare two interacting proteins in I,CoFex+examines CF-treeto verify the occurrences of(τi,τj)kin the sequences ofpxandpy. It first obtains all the nodes from the CF-tree at the depth ofkand only disregards those whose labels are not τjas indicated by Line 4 in Procedure 3. For the remaining nodes, CoFex+further checks whetherpxandpyare found in the values of theirProteinSetas indicated by Lines 5–11. If variablesmatchXandmatchYare both true, we reckon that coevolutionary pattern (τi,τj)koccurs in the sequences ofpxandpy. In this regard, the value ofis increased by 1. After examining all pairs of interacting proteins, we obtain the final value of. Similarly,we could obtainfrom.
?
Input: , , , true false Tτi(τi,τ j)k px py Output: or result ←false 1: matchX ←false 2: matchY ←false 3: 4: for each in do AminoAcid node τj node Nodes(Tτi,k)5: if the value of of is then px ∈ProteinS et node 6: if of then matchX=true 7: 8: end if py ∈ProteinS et node 9: if of then matchY=true 10: 11: end if result=matchX&&matchY 12: 13: end if
14: end for result 15: return
This step targets to construct a feature vector for each pair of proteins, thus preparing for the training of distributed SVM classifiers in the last step.
1) Map:CoFex+makes use of CF-trees, i.e., {} and V to construct the feature vectors for pairwise proteins.Accordingly, the input is the pairs of proteins inDintand.CoFex+splits them into many data blocks, which are then processed by the computing nodes inMap.
Algorithm 3 constructs feature vectors. Taking two proteinspxandpyas an example, it first initializes a feature vector denoted asVxyfor them. Each element inVxyrepresents a unique coevolutionary pattern in V and accordingly the length ofVxyis equal to |V|. For each (τi,τj)k∈V, it selects CF-treeand traverses this tree to verify whether (τi,τj)koccurs in the sequences ofpxandpyas indicated by Line 4 in Algorithm 3. If so, the weight of (τi,τj)k, i.e.,is assigned to the corresponding element inVxyas indicated by Line 5.The output of this phase follows the format of 〈 {px,py},Vxy〉.
2) Reduce:All the feature vectors generated byMapare saved to the distributed file system.
In the final step of CoFex+, we adopt a divide-and-conquer strategy to perform large-scale PPI prediction. To prediction PPIs, we use SVM as the predictor, as the experimental results presented in [18] indicate that the combination of CoFex and SVM has been proved to yield the best accuracy. However,SVMs are unable to handle large-scale training data. To overcome this problem, we first split the training dataset into several data blocks such that SVM could be trained for each block. Hence, the output ofMapfollows the formatwherepxandpyare a pair of query proteins in the testing dataset, andis the prediction score yielded by SVM in them-th computing node. ForReduce, the input of a particular computing node isThus, the final prediction score forpxandpyis the average value of, and can be determined as
wherenxyis the number of items in {}.
It is clearly seen that each of efficiency bottlenecks of CoFex is specifically addressed by CoFex+. In particular, the inefficient traversal of CoFex is overcome by CoFex+by using a CF-tree data structure where a training dataset is only traversed once. The heavy computation caused by the second and third steps of CoFex is decomposed into many subtasks by following the MapReduce framework and these subtasks are able to well run in a parallel manner. Instead of training a single SVM, a divide-and-conquer strategy is adopted by CoFex+such that a set of SVMs are trained on different data blocks of the training data. With all these particularly designed solutions, CoFex+is capable of performing the prediction task for large-scale PPIs, to be shown next.
To evaluate the accuracy and efficiency of CoFex+, we have conducted a series of experiments and the results are discussed in this section. When analyzing its efficiency, we have also investigated the influences made by the change in the number of computing nodes used inMapandReducephases. Each computing node is equipped with Intel SkyLake processors at 3.0 GHz and 16 GB of RAM.
To obtain convincing results, two kinds of benchmarking datasets are used. One is composed of three small datasets collected from the species of arabidopsis thaliana (AT),escherichia coli (EC) and schizosaccharomyces pombe (SP)and the other involves a huge dataset obtained from the human species. The three small datasets are from STRING [46],which is a database of known PPIs stemmed from knowledge transfer between organisms, and from interactions aggregated from other databases. The latter is from the human protein reference database (HPRD) [15] where all the information has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data.These four datasets have been widely adopted for performance evaluation on PPI prediction [18], [23], [26], [28], [29], and their numbers of proteins and pairs of interacting proteins and non-interacting proteins are presented in Table I.
TABLE I DESCRIPTION OF BENCHMARKING DATASETS
Since PPIs generated by high-throughput technologies suffer from the disadvantage of high false-positive and falsenegative rates, such noisy PPI data obviously could negatively influence the accuracy of PPI prediction. To avoid this problem, we only selected PPIs with confidence scores not less than 0.9 from the STRING database for AT, EC and SP datasets. This operation is not applied to the human dataset as all PPIs are manually curated to avoid errors as indicated by[15]. Regarding the generation of non-interacting proteins, we randomly paired up proteins whose interactions were not reported previously.
To verify whether CoFex+is applicable to large-scale PPI prediction while maintaining a promising accuracy comparable to CoFex, the experiments are divided into two parts. The fist part of experiments focuses on evaluating the accuracy of CoFex+by comparing it with several state-of-theart sequence-based algorithms. Since all the algorithms except CoFex+are incapable of performing their prediction tasks on the Human dataset, another three small benchmarking datasets, i.e., AT, EC and SP, are used for accuracy evaluation. The second part of experiments concentrated on verifying its efficiency for large-scale PPI prediction.
a) Accuracy
1) Experimental Setup:In the experiments, we compare the accuracy of CoFex+with CoFex, Ben-Hur and Nobel [13],Shenet al. [14], FCTP-WSRC [17] and S-VGAE [29]. All algorithms are evaluated under the scheme of five-fold crossvalidation to each small dataset. We adopt the receiver operating characteristic (ROC) analysis and use the area under ROC curve (AUC) to compare algorithms’ accuracy. AUC scores are within the range from 0 to 1, and a higher AUC score indicates a better performance in terms of accuracy. In addition, we adopt several evaluation indicators to evaluate them from different perspectives, i.e., i) sensitivity (Sn) that is the percentage of correctly identified PPIs; ii) specificity (Sp)that is the percentage of correctly identified non-interacting proteins; iii) accuracy (Acc) that is the percentage of correctly identified PPIs and non-interacting proteins; iv) matthew’s correlation coefficient (Mcc) that is a more strict evaluation standard considering both under and over predictions.
Regarding the parameter settings, we set the maximum valuekmaxas 10 in the experiments. For a particular PPI dataset, we verify it from 2 to 10 with a step size of 1 and the one that achieves the best performance is taken as its value.The reason why we set the maximum value ofkmaxas 10 is that after our study we find that the size of Vkis rather small whenkis larger than 10. For the distributed configuration of CoFex+, total 10 computing nodes are used in bothMapandReduce. For the other algorithms we use in the experiments,parameters that have to be determined for them to work properly are set to the values recommended by the authors of their references and we list them in Table II, whereCandγused by all algorithms except FCTP-SWRC and S-VGAE arethe parameters of SVM.
TABLE II PARAMETER SETTINGS OF ALGORITHMS
2) Experiment Results:Regarding the AUC scores presented in Table III, CoFex is only better by 1.3% than CoFex+.Moreover, we also perform an independent samples t-test and the results reveals that the difference in the all metrics between CoFex and CoFex+is not significant (p<0.05). The reason for such a slight difference is mainly ascribed to the divide-and-conquer strategy adopted by CoFex+. In particular,a set of SVMs are trained by CoFex+on different data blocks of training data whereas CoFex performs its prediction task on a single SVM. When comparing CoFex+with other sequencebased algorithms, we note the strong performance of CoFex+,as it yields a better average score for each evaluation indicator. We also note that CoFex+performs better than the two state-of-the-art algorithms, i.e., FCTP-SWRC and SVGAE, as the average scores of AUC, Acc and Mcc obtained by CoFex+are larger by 30.2%, 23.6% and 15.1%respectively than those of FCTP-SWRC, and by 7.1%, 7.1%and 7.7% respectively than those of S-VGAE, for the three small datasets. Hence, according to the performance of CoFex+in the three small datasets, we reason that as the reimplementation of CoFex using MapReduce, CoFex+still maintains promising accuracy comparable to CoFex.
TABLE III PERFORMANCE COMPARISON OF SEQUENCE-BASED ALGORITHMS IN THE AT, E C AND SP DATASETS
b) Efficiency
1) Experimental Setup:We apply CoFex and CoFex+to two categories of PPI datasets generated from the Human dataset.The first category includes three datasets composed of proteins at magnitudes of 102, 103, 104and All respectively,and the second category involves the datasets composed of pairwise proteins at magnitudes of 102, 103, 1 04, 105, 106and All respectively. Note that since we concentrate analyzing their efficiency, all the interacting and non-interacting proteins in each dataset are used only for training, and we record their running time. For a fair comparison, CoFex is executed in the machine with the same hardware configuration as for CoFex+.
Regarding the parameter settings, we set the maximum valuekmaxas 5 and take it as an example to demonstrate their efficiency. For the distributed configuration of CoFex+, total 10 computing nodes are used in theMapandReducephases.
2) Experimental Results:From Fig. 1, both of them take more running time when either proteins or PPIs increase.Concerning their efficiency comparison, we have the following findings. First, CoFex+is not as efficient as CoFex in all small datasets. A possible reason for that phenomenon is that the time used for transferring data among different computing nodes is considerably larger than that for computation. Second, the advantage of CoFex+is evident in larger datasets, as the blue curves of CoFex+are always below the red ones of CoFex after the intersection points in Fig. 1(a)and Fig. 1(b). Although the increase in the size of training data requires a heavier computation with more time and resources, CoFex+is able to divide such computation into many tiny tasks and assign them to the computing nodes. In this regard, computing nodes only handle small parts of the original computation task in a parallel manner, thus reducing the running time of the entire procedure. Thirdly, from Fig. 1(b), the training task of CoFex is unable to complete when the amount of interacting proteins exceeds 800 000.Although we could upgrade the hardware configuration of our machine, it is not the best solution as the efficiency bottlenecks of CoFex are still unsolved. That is the reason why we must develop CoFex+for large-scale PPI prediction.Lastly, for two datasets from different categories but at the same magnitude, CoFex+takes more time for the dataset in the first category. In other words, for CoFex+, the majority of computation takes place in the second step where coevolutionary patterns are weighted. Hence, we suggest that CoFex+is preferred for large-scale PPI prediction while CoFex for small datasets.
In order to evaluate whether CoFex+is still effective as the scale of data increases, we apply a five-fold cross-validation to the datasets in the first category, and present corresponding ROC curves in Fig. 1(c). We observe the promising performance of CoFex+when the number of proteins increases. In other words, with the consideration of more protein sequences, CoFex+is able to better identify coevolutionary patterns that are useful for PPI prediction.
In Fig. 3, we present the speedup performance of CoFex+in the datasets of PPIs at different magnitudes. The baseline to calculate the speedup scores is the running time of CoFex given in Fig. 1(b). It is worth noting that its speedup performance is generally improved when PPI dataset size increases. When the size of PPI dataset exceeds 1 05, it is more obvious, as it can considerably improve the efficiency of CoFex by achieving more than two orders of magnitude improvement in computational efficiency as indicated by the red bar in Fig. 3.
Fig. 3. The speedup performance of CoFex+ in the datasets composed of PPIs at different magnitudes.
When compared with CoFex, CoFex+adopts a tree-based data structure, i.e., CF-tree, to accelerate the procedure of identifying coevolutionary patterns. To verify such acceleration effect, we compare the running time of CoFex+using a local mode with that of CoFex on the Human dataset where we have to identify coevolutionary patterns from the sequences of total 13 730 proteins. The reason why we use the local mode for CoFex+in Hadoop is to avoid the bias benefited from distributed computing. The comparison in running time taken by CoFex+and CoFex in the identification of coevolutionary patterns is presented in Fig. 4, where we note that the blue curve representing CoFex+is always below the red one of CoFex. In this regard, we reason that the identification of coevolutionary patterns cost less time with CoFex+than CoFex.
Fig. 4. The comparison in running time taken by CoFex+ and CoFex in the identification of coevolutionary patterns on the Human dataset.
In summary, although the increase in the size of training data requires more computation and thus more time and resources, CoFex+is able to divide such computation into many tiny tasks and assign them to the computing nodes.According to the results and discussions presented above,CoFex+yields more promising performance for large datasets in terms of efficiency and is verified to have the ability to complete the task of large-scale PPI prediction.
Memory Consumption
1) Experimental Setup:Since the change in memory consumption of CoFex+is similar when we increase the number of computing nodes in eitherMaporReduce, we take the results ofMapas an example to discuss how the memory is consumed by CoFex+when training it by using the entire Human dataset.
2) Experimental Results:From Table IV, we note that the amount of memory consumed by CoFex+gradually becomes larger when the number of computing nodes inMapincreases.Obviously, CoFex+requires much less memory than CoFex,as CoFex is unable to handle the entire Human dataset by using the machine whose memory is 16 GB. The major reason for this phenomenon is due to the integration with the CF-tree data structure, which saves a lot of memory space by avoiding storing massive protein sequence information. In the step of V Identification, CoFex+constructs at mostnΓCF-trees in each computing node ofMap. That is to say, when we deploy more computing nodes performing the map tasks of CoFex+, the increase in memory consumption is mainly from newly generated CF-trees.
TABLE IV MEMORY CONSUMPTION OF COFEX+ GIVEN DIFFERENT NUMBER OF COMPUTING NODES IN MAP
When using CoFex+in reality, the only assumption is the possible integration with popular distributed computing platforms, such as Hadoop and Spark. Given that CoFex+is designed by following the MapReduce framework, this assumption can be easily satisfied in reality, as both Hadoop and Spark provide MapReduce-like functions for the integration. In this regard, it is possible to implement CoFex+with Spark by using its MapReduce-like functions. The major reason for us to choose Hadoop is that we have accumulated solid experiences to implement algorithms in Hadoop environment.
Regardless of the parameters we set for SVM, the only parameter we need to specify in advance iskmax, which is the maximum value ofkand also is the depth of CF-trees constructed in the first step of CoFex+. Although a larger value ofkallows CoFex+exploit more information in protein sequences, the improvement in accuracy is limited as the number of coevolutionary patterns extracted by using a largerkbecomes smaller. Thus, these patterns have less significance,as they are not sufficient enough for SVM to find an optimum hypersurface. Taking the five-fold cross validation on the EC dataset as an example, the number of coevolutionary patterns extracted whenk=2 is 207 while that number is only 39 whenk=10. In other words, for two amino acids located far away from each other in a protein sequence, their ability in terms of providing evidence supporting or refusing the existence of interaction is not as strong as those located closely.
According to our computational complexity analysis, there are several parameters affecting the efficiency of CoFex+, i.e.,n,,n1,n2, |I| a ndIn particular,nandn? plays a critical role in the first step of CoFex+, as we need to identify coevolutionary patterns from the sequences of all proteins.The last four parameters decides the efficiency of its second and third steps. Specifically, the processes of weighting V and constructing feature vectors are applied to each pair of proteins in |I| andby traversing CF-trees. Moreover, there are also relationships among these parameters. For example,the values ofn1andn2are determined by that ofn, as the sizes of CF-trees will become larger when more protein sequences are considered. Hence, the number of proteins, i.e.,n, and that of pairwise proteins, i.e.,are the key factors impacting the efficiency of CoFex+.
However, the upper limit of CoFex+’s efficiency exists, as it is impossible to reduce its running time by simply increasing the number of computing nodes. In fact, the runtime overhead of CoFex+is composed of two parts, one is the time used for computation and the other is the time spent to transfer data among different computing nodes. When the number of computing nodes exceeds some threshold, the process of data transfer takes more time than the computation, and thus constrains the further improvement of efficiency.
In this paper, we for the first time propose a distributed framework, namely CoFex+, to complete a challenging prediction task for large-scale PPIs. Modified from a wellestablished algorithm CoFex, it overcomes CoFex’s efficiency bottlenecks from two perspectives. First, it adopts a tree-based data structure to avoid the heavy memory consumption caused by the huge sequence information of proteins. Second, its implementation is integrated with the MapReduce framework such that the task can be completed in a distributed manner. A series of extensive experiments have been conducted and the results demonstrate that CoFex+can considerably improve the efficiency of CoFex by achieving more than two orders of magnitude improvement in computational efficiency while retaining a comparable level of accuracy for large-scale PPI prediction. Moreover, regarding the distributed configuration of CoFex+, we conclude that the efficiency of CoFex+could gradually be improved till a certain level when the number of computing nodes used in eitherMaporReducephase increases. Our next work intends to integrate the noise reduction methods with CoFex+to further improve its accuracy in PPI prediction. We are also interested in exploring the possibility of applying deep neural network [47]–[49],clustering methods [50]–[54] to further improve the effectiveness of CoFex+.
IEEE/CAA Journal of Automatica Sinica2022年1期