Adrian Guthals (a), Jeramie D. Watrous (b, c), Pieter C. Dorrestein (b, c) and Nuno Bandeira* (a, c)
(a) Dept. Computer Science and Engineering, University of California, San Diego, USA. E-mail: bandeira@ucsd.edu
(b) Department of Pharmacology and Department of Chemistry and Biochemistry, University of California, San Diego, USA
(c) Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, USA
First published on 20th April 2012
High-throughput proteomics is made possible by a combination of modern mass spectrometry instruments capable of generating many millions of tandem mass (MS2) spectra on a daily basis and the increasingly sophisticated associated software for their automated identification. Despite the growing accumulation of collections of identified spectra and the regular generation of MS2 data from related peptides, the mainstream approach for peptide identification is still the nearly two decades old approach of matching one MS2 spectrum at a time against a database of protein sequences. Moreover, database search tools overwhelmingly continue to require that users guess in advance a small set of 4–6 post-translational modifications that may be present in their data in order to avoid incurring substantial false positive and negative rates. The spectral networks paradigm for analysis of MS2 spectra differs from the mainstream database search paradigm in three fundamental ways. First, spectral networks are based on matching spectra against other spectra instead of against protein sequences. Second, spectral networks find spectra from related peptides even before considering their possible identifications. Third, spectral networks determine consensus identifications from sets of spectra from related peptides instead of separately attempting to identify one spectrum at a time. Even though spectral networks algorithms are still in their infancy, they have already delivered the longest and most accurate de novo sequences to date, revealed a new route for the discovery of unexpected post-translational modifications and highly-modified peptides, enabled automated sequencing of cyclic non-ribosomal peptides with unknown amino acids and are now defining a novel approach for mapping the entire molecular output of biological systems that is suitable for analysis with tandem mass spectrometry. 
Here we review the current state of spectral networks algorithms and discuss possible future directions for automated interpretation of spectra from any class of molecules.
The dominant paradigm for high-throughput protein identification is based on trypsin digestion of extracted proteins to produce peptides, followed by tandem mass spectrometry to generate single-peptide MS2 spectra that are then computationally matched, one spectrum at a time, against protein sequence databases to obtain peptide and protein identifications. This paradigm has been the basis of nearly all large-scale proteomics studies to date despite its typically low spectrum identification rate of only 15–30%. It remains workable because enzymatic digestion generates multiple peptides per protein and, in the extreme, only one peptide needs to be identified per protein (though more are usually preferred) to enable protein-level quantification and comparison across multiple tissues or experimental conditions. However, the serious downside of this low identification rate is that it consistently leads to missing information on non-tryptic peptides and yields very low protein sequence coverage, thus substantially limiting the chances of detecting alternative splicing or of identifying and localizing post-translational modifications (PTMs). In fact, the limitations of PTM search are so dire that most labs still only allow for 4–6 PTMs per search (about half of which are due to sample handling procedures) even though more than 500 PTMs are known and listed in UniMOD.
Peptidomics, defined as the study of endogenous peptides, is an abundant source of drug candidates derived from neuropeptides,18 toxins19 and non-linear cyclic peptides.20 Conversely, endogenous peptides are also valuable as therapeutic targets21 (neuropeptides) and antigenic peptides are key in immunotherapeutic strategies22 (MHC class-I/II peptides). Despite its critical importance, peptidomics research continues to suffer from the inadequate reutilization of computational tools primarily developed for proteomics, since endogenous peptides (a) are not suitable for enzymatic digestion (as it eliminates the active peptide form), (b) tend to be modified with unexpected PTMs, (c) often contain sequence polymorphisms and (d) generally lack the “MS-friendly” features of trypsin-digested peptides. As such, each endogenous peptide must be identified “on its own” (not being able to benefit from multiple peptides per protein as in proteomics) and new identification algorithms are needed to handle non-tryptic peptides of atypical lengths21 (e.g., ≤6 AA or ≥35 AA) containing unexpected PTMs, sequence polymorphisms20,23 and often featuring non-linear structures.19,20 Finally, metaproteomics analysis of environmental samples from host-pathogen interactions24 and microbial communities (as in the Human Microbiome Project) requires the ability to search mass spectrometry data against very large databases and, in many cases, against six-frame translations of poorly-annotated genomes or even just assembled DNA reads. This enormous growth in the size of the sequence database, together with the need to allow for polymorphisms and/or unexpected PTMs, results in a combined search space so large that 90–95% of all spectra are commonly discarded as unidentified, thus severely limiting proteomics analysis of the role of microbiomes in health and disease.25
We argue that overcoming the identification bottleneck will require new ways of thinking about MS2 spectra in order to develop new ways of interpreting them. In particular, we describe how the spectral networks paradigm differs from the current mainstream paradigm and illustrate its potential with applications where current paradigms perform poorly or completely fail. By finding spectra from related peptides even before considering their possible identifications and using these spectra to determine consensus identifications from sets of spectra from related peptides instead of separately attempting to identify one spectrum at a time, the spectral networking paradigm is capable of addressing many of the pitfalls of mainstream spectra identification paradigms. In addition to improving identification by significantly increasing signal-to-noise ratios and deconvoluting MS2 ion types, spectral networks further open up new computational avenues for analysis of natural products and non-peptidic molecules, including compounds with non-linear structures, novel amino acids or post-translational modifications, lipids, glycans and other families of compounds.
The potential of spectral libraries to improve peptide identification is well illustrated by the recent example of the NeuroPedia30 spectral library of identified neuropeptide spectra. Neuropeptides are peptide neurotransmitters and hormones that mediate cell-to-cell communication for regulation of physiological functions and biological processes.31 Understanding the role and regulation of neuropeptide forms in health, disease, and drug treatments requires the ability to globally analyze neuropeptide expression in an unbiased form. Mass spectrometry based neuropeptidomics is highly suited for untargeted, global neuropeptide studies.31–35 However, the unique characteristics of neuropeptides (i.e. short/long or non-tryptic sequences) present difficulties for identification from tandem mass spectrometry with traditional database search tools. For example, short neuropeptides can lead to inaccurate search results because database search tools usually assign lower scores to short peptides. Conversely, long or non-tryptic neuropeptides are difficult to identify because database search tools are trained for tryptic peptides cleaved at K/R and because fragmentation of long neuropeptides is usually not efficient. In addition, as current databases mature, querying the larger search space requires more time due to the increase in the number of comparisons, and it ultimately reduces the number of identifications by allowing a higher probability of false positive matches.27 Since many spectral libraries, such as NeuroPedia, are directly searchable using mass spectrometry data, the caveats associated with matching experimental data against MS2 spectra predicted from a protein sequence no longer apply, as irregularities in fragmentation efficiency will be shared amongst the annotated and unannotated spectra.
In addition to the expected improvement in sensitivity from searching against a small targeted sequence database, the neuropeptide spectral libraries further improve identification efficiency, sensitivity and reliability by considering all spectral features, including actual fragment intensities, neutral losses from fragments, and various uncommon or even unknown fragments to determine the best matches. As such, NeuroPedia was shown to improve peptide identification by up to tenfold (at the same false discovery rate10,26,27 but searching against a much smaller space of possible matches).
In addition to improving peptide identification, spectral library search opens up new possibilities for interpretation of MS2 spectra. For example, mainstream approaches were developed under the ubiquitous assumption that each MS2 spectrum is generated from a single peptide. While chromatographic procedures greatly contribute to making this a reasonable assumption, there are several situations where it is difficult or even impossible to separate pairs of peptides. Examples include certain permutations of the peptide sequence or post-translational modifications (PTMs, see36 for examples of co-eluting histone modification variants). In addition, innovative experimental setups have demonstrated the potential for increased throughput in peptide identification using mixture spectra; examples include Data-Independent Acquisition,37 Ion-Mobility Mass Spectrometry38 and MSE strategies.39 To address the resulting computational bottleneck, we introduced the first spectral library-based approach (M-SPLIT40) for identification of mixture spectra generated from more than one peptide. Theoretical bounds were proposed to prune the search space using branch-and-bound techniques and further improved using a new projected-cosine metric. In brief, M-SPLIT uses single-peptide matches to prune the search space for mixture peptides: it first matches experimental spectra to single-peptide spectra and then attempts to improve the score of the match by adding more single-peptide matches to form mixture-spectrum matches (false discovery rates are also controlled using decoy spectral libraries26). Thus, M-SPLIT reduces the search space by six orders of magnitude and is able to deliver results at an average of 2 s/spectrum (on a regular laptop with a Pentium Core2Duo, 1.6 GHz, 2 GB RAM), even when searching against proteome-scale spectral libraries.
Despite considering only a tiny fraction of the whole search space, benchmarks on both simulated and experimental data consistently show that M-SPLIT40 has both high sensitivity (≈94%) and high accuracy (up to ≈98%).
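The match-then-extend idea described above can be illustrated with a minimal sketch. This is not the M-SPLIT implementation: it assumes unit-width m/z binning, uses a plain cosine in place of the projected-cosine metric, tries only a few fixed mixing ratios, and omits the branch-and-bound pruning and decoy-based FDR control. All function names and the `alphas` parameter are hypothetical.

```python
import math

def to_unit_vector(peaks, bin_width=1.0):
    """Bin (m/z, intensity) peaks into a sparse vector and scale it to unit length."""
    vec = {}
    for mz, intensity in peaks:
        b = int(round(mz / bin_width))
        vec[b] = vec.get(b, 0.0) + intensity
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {b: x / norm for b, x in vec.items()} if norm else {}

def cosine(u, v):
    """Cosine similarity between two unit-length sparse vectors."""
    return sum(x * v.get(b, 0.0) for b, x in u.items())

def combine(u, v, alpha):
    """Unit-length vector for the mixture u + alpha * v."""
    out = dict(u)
    for b, x in v.items():
        out[b] = out.get(b, 0.0) + alpha * x
    norm = math.sqrt(sum(x * x for x in out.values()))
    return {b: x / norm for b, x in out.items()}

def best_match(query_peaks, library, alphas=(0.25, 0.5, 1.0)):
    """Find the best single library match, then try to improve it with a second spectrum."""
    q = to_unit_vector(query_peaks)
    lib = {name: to_unit_vector(p) for name, p in library.items()}
    # Step 1: best single-peptide match.
    first = max(lib, key=lambda name: cosine(q, lib[name]))
    best_score, best_names = cosine(q, lib[first]), (first,)
    # Step 2: keep a two-peptide mixture only if it scores higher.
    for name, vec in lib.items():
        if name == first:
            continue
        for alpha in alphas:
            score = cosine(q, combine(lib[first], vec, alpha))
            if score > best_score:
                best_score, best_names = score, (first, name)
    return best_score, best_names

# Toy library and a query built as an equal mixture of spectra "A" and "B".
library = {
    "A": [(100, 1.0), (200, 1.0), (300, 1.0)],
    "B": [(150, 1.0), (250, 1.0), (350, 1.0)],
    "C": [(120, 1.0), (220, 1.0)],
}
query = library["A"] + library["B"]
score, names = best_match(query, library)
```

Restricting the second step to extensions of the best single match is what keeps the number of evaluated pairs linear in the library size in this sketch, rather than quadratic.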
Fig. 1 Discovery and identification of post-translational modifications through spectral networks; (a) Spectral alignment between modified and unmodified variants of the peptide TETMA (b-ions shown in blue, y-ions in red, blue/red lines track consecutively matched b/y-ions); (b) Grouped modification states of the peptide MDVTIQHPWFK from a sample of cataractous lenses. Nodes in the spectral network represent individual MS2 spectra and edges between nodes represent significant spectral alignments such as that shown in part (a); (c) Spectra assembled in the spectral network for TNSMVTLGCLVK with diverse Cysteine modifications on a monoclonal antibody. Each arrow corresponds to the propagation of a sequence and/or PTM from an identified spectrum to an unidentified spectrum (repeated arrows are iterative propagations). Arrow colors correspond to types of modifications transferred.
In traditional DNA sequence alignment, it often happens that query sequences differ from the reference sequences by the insertion or deletion of one or more nucleotides.48 While the insertion/deletion of amino acids is also usually allowed when aligning protein sequences, an additional factor needs to be considered when aligning peptides from experimental samples due to the occurrence of post-translational modifications. In fact, multiple groups have shown16,46,52 that the phenomenon of unexpected modifications is much more widespread than commonly acknowledged. From a sequence alignment perspective, a modification could be modeled by following the modified residue with a special character for each type of modification. Thus, the alignment of a modified peptide PEPT*IDE with its unmodified counterpart PEPTIDE would result in a single difference caused by the insertion of the modification ‘*’. In tandem mass spectrometry, however, a modification of mass m conceptually corresponds to the insertion of an additional m Da in the b/y-ion series between the ions immediately preceding and following the site of post-translational modification (i.e. the mass of the residue becomes larger by mass m). Conversely, if the modification causes a loss of m Da from the modified residue then the corresponding effect is the subtraction of m Da between the ions for the modified residue. When applied to unmodified and modified versions of the same peptide, the role of spectral alignment algorithms15,17,53 is to (a) use the spectrum of the unmodified peptide to determine where to position the modification mass in the spectrum of the modified peptide and (b) assess whether the post-alignment match between the two spectra is significant enough to accept the spectra as a pair of modified/unmodified spectra from the same peptide. Fig. 1a illustrates the spectral alignment between MS2 spectra from the peptides TETMA and phosphorylated TET+80MA.
When first analyzing a sample possibly containing modified peptides, one does not know a priori which residues or peptides will be modified. Thus, spectral alignment considers every possible spectral pair and every possible location for the mass difference (i.e. modification mass) between the aligned spectra. By requiring a significant match between the aligned spectrum peaks17 but placing no restrictions on which modifications to consider, this approach can be used to discover novel or unexpected modifications. In fact, when applied to a set of spectra from cataractous lens proteins from a 93-year-old patient, spectral networks were able to rediscover the modifications identified by database search methods and additionally discovered several novel modification events.17,46
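To make the positioning of the modification mass concrete, the following sketch tries every possible location for a mass difference between two spectra and keeps the location that matches the most peaks. It is a deliberately reduced model and not the published spectral alignment algorithm: it assumes spectra are given as lists of b-ion (prefix) masses only, ignores y-ions and intensities, and uses a simple peak count instead of a significance score. The function names and the tolerance are hypothetical; the masses are standard monoisotopic residue masses for TETMA, with +79.966 Da for phosphorylation.

```python
def count_matches(predicted, observed, tol=0.5):
    """Number of predicted masses that have an observed mass within tol."""
    return sum(any(abs(p - o) <= tol for o in observed) for p in predicted)

def align_with_shift(unmodified, modified, delta, tol=0.5):
    """Try every split point k: the first k prefix masses stay in place and the
    remaining ones are shifted by delta. Returns (best_score, best_k)."""
    unmodified = sorted(unmodified)
    best_score, best_k = -1, None
    for k in range(len(unmodified) + 1):
        predicted = unmodified[:k] + [m + delta for m in unmodified[k:]]
        score = count_matches(predicted, modified, tol)
        if score > best_score:
            best_score, best_k = score, k
    return best_score, best_k

# Prefix masses for T, TE, TET, TETM (unmodified) and for TET+80MA,
# where the second threonine carries the modification.
unmodified = [101.048, 230.091, 331.139, 462.179]
modified = [101.048, 230.091, 411.105, 542.145]
delta = 79.966  # difference between the two parent masses

print(align_with_shift(unmodified, modified, delta))  # (4, 2)
```

The returned split point 2 means the first two prefix masses are unshifted and everything from the third residue onward carries the extra mass, which places the modification on the third residue. In an untargeted analysis `delta` would simply be the observed parent mass difference of each candidate spectral pair, with no list of allowed modifications.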
The identification of peptides containing multiple modifications via database search is a challenging problem because of the combinatorial explosion in the number of possible modification variants for all the peptides in a database.46,52 Not only can this make the approach much slower, but the increased number of peptide candidates for any given spectrum significantly increases the risk of incorrect identifications. However, samples containing peptides with two or more modifications often also contain variants of the same peptide with only one or no modification. In these cases, we have found that spectral alignment is able to group these related spectra from multiple modification variants of the same peptide into small spectral networks, thus increasing confidence in their identity as related peptides. Fig. 1b illustrates the spectral network for a particular peptide in a sample of cataractous lens proteins.
By grouping together spectra from multiple variants of the same peptide, spectral networks additionally contribute to the reliable identification of highly modified peptides. While database searching is restricted to matching ion masses between theoretical and observed spectra, spectral networks further capitalize on the occurrence of common fragment ions at corresponding masses with similar peak intensities (Fig. 1c). In general, it becomes easier to identify a highly modified peptide if one additionally observes highly-similar spectra from its intermediate modification states. Thus, spectral alignment not only allows one to discover unexpected modifications (instead of only identifying expected modifications) but additionally provides an alternative route for identification of highly modified peptides.
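The iterative propagation of identifications through a network (the arrows in Fig. 1c) can be sketched as a breadth-first traversal. This is only an illustration under simplifying assumptions: each edge is assumed to carry the parent mass difference between its two spectra, an annotation is reduced to a peptide string plus a cumulative mass offset, and no rescoring is done after each propagation step. The function name, the node labels and the mass values in the example are hypothetical.

```python
from collections import deque

def propagate(edges, seeds):
    """Spread annotations from identified spectra to connected unidentified ones.

    edges: list of (u, v, delta) with delta = parent_mass(v) - parent_mass(u)
    seeds: dict mapping an identified node to (peptide, mass_offset)
    Returns a dict with an annotation for every node reachable from a seed.
    """
    neighbors = {}
    for u, v, delta in edges:
        neighbors.setdefault(u, []).append((v, delta))
        neighbors.setdefault(v, []).append((u, -delta))
    labels = dict(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        peptide, offset = labels[u]
        for v, delta in neighbors.get(u, []):
            if v not in labels:  # the first annotation to reach a node wins
                labels[v] = (peptide, round(offset + delta, 3))
                queue.append(v)
    return labels

# s1 is identified as unmodified; s2 and s3 are reached in two steps.
# s4 and s5 form a separate component and stay unannotated.
edges = [("s1", "s2", 15.995), ("s2", "s3", 42.011), ("s4", "s5", 1.0)]
labels = propagate(edges, {"s1": ("MDVTIQHPWFK", 0.0)})
```

Here `s3` ends up annotated with the same peptide and a combined offset of about 58.006 Da, even though it is not directly connected to the identified spectrum, which mirrors how intermediate modification states help reach highly modified variants.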
Conceptually, sequencing a protein from a set of MS2 spectra can be described by a simple analogy. Imagine a jewellery box with many identical copies of a specific model of bead necklaces. Although all the beads are identical, this model is characterized by having irregular distances between consecutive beads–the set of inter-bead distances is initially chosen by the designer and all necklaces are then made using exactly the same specification. Now assume that one day you open your jewellery box and realize that someone has vandalized all the necklaces by cutting them to fragments at randomly chosen bead positions. Can you recover the original design of this model of necklaces, as specified by the set of consecutive inter-bead distances? In this allegory inter-bead distances correspond to amino acid masses and beads correspond to MS2 fragmentation points (between consecutive amino acids). MS2 data add more than a few difficulties to this necklace assembly problem; for example, most peaks in MS2 spectra do not correspond to any fragment ions (extra beads) and many fragment ions do not result in any peaks (missing beads). Nevertheless, Fig. 2 presents an example of assembled MS2 spectra resulting in a 22 amino acid long segment of a monoclonal antibody.51
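The bead analogy can be made concrete with a small sketch in which peaks are beads and an amino acid is read whenever the distance between two beads equals a residue mass. Extra beads (noise peaks) are tolerated by searching for the longest chain of peaks rather than reading consecutive differences. This is a toy model under stated assumptions: peaks are prefix masses on a common scale, the residue table is a small subset of the standard monoisotopic residue masses (isoleucine is omitted because it is isobaric with leucine), and missing beads are not handled. The function names and the tolerance are hypothetical.

```python
# Subset of standard monoisotopic residue masses (Da).
RESIDUE_MASSES = {
    "G": 57.021, "A": 71.037, "S": 87.032, "P": 97.053, "V": 99.068,
    "T": 101.048, "L": 113.084, "N": 114.043, "D": 115.027,
    "K": 128.095, "E": 129.043, "M": 131.040, "F": 147.068,
}

def residue_for(distance, tol=0.02):
    """Return the residue whose mass matches an inter-peak distance, or None."""
    for residue, mass in RESIDUE_MASSES.items():
        if abs(distance - mass) <= tol:
            return residue
    return None

def longest_ladder(peaks, tol=0.02):
    """Longest sequence readable as a chain of peaks separated by residue masses."""
    peaks = sorted(peaks)
    best = [""] * len(peaks)  # best[i]: longest sequence ending at peak i
    for i in range(len(peaks)):
        for j in range(i):
            residue = residue_for(peaks[i] - peaks[j], tol)
            if residue is not None and len(best[j]) + 1 > len(best[i]):
                best[i] = best[j] + residue
    return max(best, key=len) if best else ""

# Prefix masses of TETMA plus two noise peaks (150.5 and 400.2).
peaks = [0.0, 101.048, 150.5, 230.091, 331.139, 400.2, 462.179, 533.216]
print(longest_ladder(peaks))  # TETMA
```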
Fig. 2 Shotgun Protein Sequencing (SPS) via assembly of tandem mass spectra; (a) Spectral alignment between spectra for peptide WSCILMEPKR (purple), PEWSCILMEPKR (green), WSCILMEPK (red), WSCILMoxEPK (cyan); Mox represents oxidized Methionine. Matching peaks in spectral alignments become pairwise gluing instructions between every pair of aligned spectra. (b) Protein contig resulting from 24 spectra from a monoclonal antibody (aBTLA heavy chain). Each spectrum is shown superimposed with a sequence of arrows indicating its sequence of recovered masses; modified variants of the consensus sequence are indicated by red arrows (6 different modifications on 7 spectra). (c) The complete aBTLA heavy chain sequence recovered by Comparative SPS;57 highlighted sections were covered by protein contigs (95% coverage) and the missing amino acids were obtained from homologous protein sequences.
Shotgun Protein Sequencing (SPS) is a de novo sequencing approach15 that utilizes multiple MS2 spectra from overlapping peptides generated using non-specific proteases or multiple proteases with different specificities.70–74 The original approach was based on the overlap → layout → consensus approach to assembly and was shown to be efficient for the assembly of a single purified unmodified protein. However, practical applications (like sequencing snake venoms) require applicability to mixtures of modified proteins. In fact, most MS2 samples contain both modified and unmodified versions of many peptides, including biological and chemical modifications, both native and introduced during sample preparation. Sequence variations and post-translational modifications present a formidable algorithmic challenge for assembly algorithms, as the performance of the original SPS approach15 degrades steeply as soon as even a small percentage of the spectra are from modified peptides. To use the beads analogy, the necklace puzzle becomes very difficult if, in addition to the canonical necklaces (non-modified proteins), the jewellery box also contains some necklaces that deviate from the designer's specification (modified proteins). Building on spectral networks algorithms for analysis of post-translational modifications based on alignment of spectra from modified and unmodified peptide variants,17,50 we showed how to integrate these alignments into Shotgun Protein Sequencing to derive a completely new form of spectral assembly. This utilized a generalized notion of ABruijn graphs (originally proposed in the context of DNA fragment assembly75) for the assembly of MS2 spectra from overlapping, modified and unmodified peptides into contigs (sets of aligned spectra from overlapping peptides, see Fig. 2), where each contig then capitalizes on the corroborating evidence from the assembled spectra to yield a high-quality consensus de novo sequence.
As a result, SPS consensus de novo sequences were found to be twice as accurate as sequences derived from single spectra (1 mistake per 10 vs. 5 amino acid predictions) while yielding sequences that were much longer than a single peptide/spectrum could support (up to 24 AA long).
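The "gluing" of overlapping spectra into a contig can be illustrated with a two-spectrum sketch: find the mass offset at which the most peaks coincide, then merge the peaks into one longer ladder. This is a heavily simplified stand-in for the overlap and layout steps, not the ABruijn graph construction used by SPS: it assumes both spectra are unmodified prefix-mass ladders, ignores intensities and noise, and handles only a single pair. The function names and the tolerance are hypothetical.

```python
def best_offset(a, b, tol=0.02):
    """Offset s maximizing the number of peaks of b that land on a peak of a
    after shifting by s. Returns (score, offset)."""
    best_score, best_s = 0, 0.0
    for x in a:
        for y in b:
            s = x - y  # candidate offsets come from pairs of peaks
            score = sum(any(abs(q + s - p) <= tol for p in a) for q in b)
            if score > best_score:
                best_score, best_s = score, s
    return best_score, best_s

def glue(a, b, tol=0.02):
    """Merge b into a at the best offset; returns (merged_peaks, overlap_score)."""
    score, s = best_offset(a, b, tol)
    merged = sorted(a)
    for q in b:
        if not any(abs(q + s - p) <= tol for p in merged):
            merged.append(q + s)
    return sorted(merged), score

# Ladder for the prefix T-E-T and a second ladder for the overlapping E-T-M,
# each expressed in its own frame (starting at 0).
spectrum_a = [0.0, 101.048, 230.091, 331.139]
spectrum_b = [0.0, 129.043, 230.091, 361.131]
contig, overlap = glue(spectrum_a, spectrum_b)
```

In this toy case the two ladders share three peaks at an offset of about 101.048 Da, and the merged contig spans five peaks, one more than either input, which is the sense in which assembled spectra support longer sequences than any single spectrum.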
Recently this paradigm was extended in two distinct directions. First, we capitalized on homology between SPS long/accurate de novo sequences and known sequences to deliver the first automated full-length protein sequencing approach (Comparative SPS57) and demonstrated it with database-assisted de novo sequencing of two monoclonal antibodies. Spectral networks also underlie the related work of Castellana et al.,76,77 who proposed an effective method for sequencing monoclonal antibodies with database-guided iterative alignment+assembly of spectra from overlapping peptides. Both of these methods rely upon the existence of a homologous database. To reduce this dependence, we have since developed MetaSPS78 algorithms for assembling SPS contigs into meta-contigs (sets of overlapping contigs). These methods now deliver de novo sequences over 100 AA long at sequencing error rates as low as 1 mistake per 50 predicted amino acids without requiring homology to known sequences, which demonstrates the feasibility of fully-automated de novo protein sequencing with unidentified MS2 spectra. It is expected that the performance of these algorithms will only improve as new types of mass spectrometry data (e.g., Electron Transfer Dissociation) are also incorporated in SPS and spectral networks approaches.
When DNA sequencing is not available, biologists use either Edman degradation or tandem mass spectrometry (MS2) to sequence ribosomal peptides. However, neither of these approaches works for nonribosomal peptides (NRPs) since they differ from ribosomal peptides in many respects: (a) they often represent non-linear structures of amino acids (e.g., cyclic, tree-like, and branch-cyclic peptides), (b) they often contain non-standard amino acids, increasing the number of possible building blocks from 20 to several hundred, (c) they often have a non-standard backbone, and (d) they are often modified. Each of these complications renders traditional Edman degradation and MS2 peptide sequencing approaches useless, leaving NMR as the only technology capable of analyzing NRPs.82–85 The use of NMR for NRP sequencing is time-consuming, difficult to automate (there are currently no software tools for automatic interpretation of NRPs from NMR data), and error-prone (see85,86 for examples of errors in NMR sequencing). In addition, the abundance of these specialized compounds in vivo is often very low, requiring extensive raw biological material in order to purify enough of the compound to perform 2D NMR for structure elucidation. As a result, the extremely difficult process of total chemical synthesis remained one of the only reliable ways to sequence and validate NRPs.87
Having shown how multi-stage mass spectrometry (MSn) can improve de novo sequencing accuracy for linear peptides,88 we then extended spectral networks algorithms using a combination of experimental and computational protocols to enable a mass-spectrometry based approach for de novo sequencing of cyclic peptides.89 The NRP-Sequencing algorithm discovers amino acid masses and reconstructs cyclic peptide sequences directly from a single MS3 spectrum, and MS4 and MS5 spectra are used to rescore all putative MS3 reconstructions. The NRP-Assembly approach assembles MS4 and MS5 spectra, similarly to what was described above for Shotgun Protein Sequencing, and further integrates the resulting contig with the MS3 spectrum and all non-assembled spectra (Fig. 3). These algorithmic foundations were further extended as more data became available90 and we were able to show how these tools can save significant effort using several marine cyanobacterial cyclic peptides. In particular, Cyanopeptide X was an unknown bioactive molecule whose identity was elucidated using the very time-intensive workflow of isolating, purifying and collecting 2D NMR data to obtain the structure.91 However, using very small amounts of raw material, sequencing of MS2 data using our cyclic peptide annotation algorithms revealed that this compound was related to dolastatin 11 (reversed amino acid sequence with a single modification) and majusculamide C with identical scores, which provided great insight into the nature of the structure with very little time investment. The compound turned out to be desmethoxymajusculamide C and a full report on its structure as determined by NMR is now available.92 Another example was compound 879, which was initially assumed to be a novel compound but was found during the patent application to be already known.
Our analysis could dereplicate the spectrum of compound 879 as the known NRP neoviridogrisein and could thus have saved the three years of effort it took to determine the structure.
Fig. 3 Analysis of the cyclic peptide Seglitide. (a) The circular structure of Seglitide is schematically illustrated with each residue represented by a different color (slice sizes not scaled to corresponding masses of the residues). A+14 denotes a non-standard residue with integer mass 71 + 14 = 85 Da. (b) MS2 fragmentation of Seglitide generates up to 6 linear peptides representing different rotated variants of the same cyclic peptide. (c) Theoretical spectrum for Seglitide by superposition of the fragment masses of the linearized peptides. (d) Experimental spectrum of Seglitide resulting from a mixture of 6 linear peptides (the peaks corresponding to fragment ions are shown in red). (e) Spectral network from assembled Seglitide MSn spectra and used for de novo sequencing with unknown amino acid masses.
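The superposition described in Fig. 3b and 3c can be written down directly: a cyclic peptide with n residues can open at any of its n bonds, and the theoretical spectrum is the union of the prefix masses of all n linearized rotations. The sketch below assumes integer residue masses and ignores ion types and charge; the function name and the three-residue example (masses 71, 85 and 99, i.e. A, A+14 and V) are hypothetical and are not the Seglitide sequence.

```python
def cyclic_theoretical_spectrum(residue_masses):
    """Union of prefix masses over every rotation of a cyclic peptide.

    Fragments of length 1 to n-1 are generated from each ring-opening
    position; the full-length mass is the same for every rotation and is
    left out.
    """
    n = len(residue_masses)
    masses = set()
    for start in range(n):  # each possible ring-opening position
        total = 0
        for length in range(1, n):
            total += residue_masses[(start + length - 1) % n]
            masses.add(round(total, 3))
    return sorted(masses)

print(cyclic_theoretical_spectrum([71, 85, 99]))  # [71, 85, 99, 156, 170, 184]
```

A linear peptide with the same residues would produce only the prefix masses 71 and 156, so the cyclic spectrum is a mixture of several overlapping ladders, which is why single-ladder de novo methods break down on such data.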
As with peptide-based spectral networks, molecular spectral networks101 start with raw MS2 data acquired from one or more microbial species, irrespective of the number of spectra or mass spectrometry runs. Then, similarly to the algorithm illustrated in Fig. 1a, pairs of MS2 spectra from related molecules are detected using structure-independent spectral alignment to find spectra with significantly-similar fragmentation patterns, regardless of whether the spectra are identified in advance or not. By avoiding peptide-specific fragmentation models and assumptions, structure-independent spectral alignment reveals molecular networks containing not only spectra of peptides but also primary and secondary metabolites, non-linear natural products, lipids, glycans, and other classes of molecules. Fig. 4 shows a molecular spectral network for Bacillus subtilis 3610 and the chemical structures for several compounds corresponding to specific highlighted subcomponents of the whole network.
Fig. 4 Molecular spectral network of a partial Bacillus subtilis secretome; nodes indicate MS2 spectra of initially-unknown compounds of any class of molecules (no peptide-specific assumptions were made), and edges indicate significant similarity between the MS2 fragmentation patterns of different spectra, mostly between intermediates/variants of the same compounds. Selected molecular structures are shown in black overlaid with the network and next to the correspondingly highlighted network clusters.
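A network like the one in Fig. 4 can be approximated in a few lines once a pairwise similarity is chosen. The sketch below is an assumption-laden illustration rather than the published method: it scores a pair by the fraction of peaks that match either directly or after shifting by the precursor mass difference, ignores intensities and statistical significance, and groups spectra into clusters with a union-find structure. The function names, the dictionary layout and the 0.6 threshold are hypothetical.

```python
def shifted_match_score(spec_a, spec_b, tol=0.02):
    """Fraction of peaks of spec_a found in spec_b, directly or after shifting
    by the precursor mass difference (no structural assumptions)."""
    delta = spec_b["precursor"] - spec_a["precursor"]
    matched = 0
    for mz in spec_a["peaks"]:
        if any(abs(mz - p) <= tol or abs(mz + delta - p) <= tol
               for p in spec_b["peaks"]):
            matched += 1
    return matched / max(len(spec_a["peaks"]), len(spec_b["peaks"]))

def molecular_network(spectra, min_score=0.6):
    """Return (edges, clusters) for spectra given as {name: {...}}."""
    names = list(spectra)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if shifted_match_score(spectra[a], spectra[b]) >= min_score:
                edges.append((a, b))
                parent[find(a)] = find(b)
    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return edges, sorted(sorted(c) for c in clusters.values())

# x and y differ by 14 Da in precursor mass and in two fragments; z is unrelated.
spectra = {
    "x": {"precursor": 500.0, "peaks": [100.0, 200.0, 300.0, 400.0]},
    "y": {"precursor": 514.0, "peaks": [100.0, 200.0, 314.0, 414.0]},
    "z": {"precursor": 700.0, "peaks": [55.0, 155.0, 333.0, 444.0]},
}
edges, clusters = molecular_network(spectra)
print(edges, clusters)  # [('x', 'y')] [['x', 'y'], ['z']]
```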
As with spectral matching and library search, the potential of spectral networks algorithms extends beyond the scope of significantly improving on traditional uses of mass spectrometry data. Despite the significant clinical importance of natural products drug discovery, automated analysis of their mass spectra has always been tremendously challenging since these are often non-ribosomal and have no genomic propeptide template, are assembled with non-standard and heavily-modified amino acids and almost always have non-linear structures such as multi-cyclic, branched-cyclic and others, each of which renders traditional database search and de novo sequencing algorithms essentially useless. Using a combination of new mass spectrometry protocols and novel spectral networks algorithms, we showed89 how amino acid masses can be discovered directly from the data and how spectra of cyclic peptides can be assembled into accurate de novo sequences, a direction that was later explored for the analysis of several more novel natural products.90,91 Building on these results and recent advances,101 the scope of spectral networks analysis has now been extended to the analysis of tandem mass spectra for any type of molecule by aligning spectrum fragmentation patterns without any prior assumptions on molecular structure or composition. As such, preliminary results indicate that the spectral networks paradigm may serve as the foundation to organize and search a mass spectrometry-centric view of the complete biomolecular space.
Being a relatively new paradigm,15,17,103 the field of spectral networks analysis of tandem mass spectrometry data remains rich with open computational problems that stand to substantially benefit from additional developments in spectral matching, alignment, assembly and consensus interpretation. These and related developments continue to be proposed in closely related fields29,40,104 and are expected to have a substantial impact on the quality and extent of future spectral networks repositories and tools.
Footnote
† Published as part of a themed issue dedicated to Emerging Investigators.
This journal is © The Royal Society of Chemistry 2012 |