Protein Inference and Grouping
A key step in many proteomics workflows is the identification of proteins following analysis of tandem MS (MS/MS) spectra, for example by a database search. The core unit reported by a database search is the peptide, yet most researchers wish to know which proteins have been confidently identified in their samples. As such, following peptide identification, a second stage of data analysis, called protein inference, is performed, either internally in the search engine or in a second package. Protein inference is challenging in the common case that proteins have been digested into peptides early in the proteomics workflow, so that there is no direct link between a peptide and its parent protein. Many peptides could theoretically have been derived from more than one protein in the database searched, and thus it is not straightforward to determine which is the correct assignment. A variety of algorithms and implementations have been developed, which are reviewed in this chapter. Most approaches now report “protein groups” as the core unit of identification from protein inference, since it is common for more than one database protein to share the same set of evidence, and thus be indistinguishable. The chapter also describes scoring and statistical values that can be assigned during the protein identification process, to give confidence in the resulting identifications.
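The idea of grouping database proteins that share the same set of evidence can be illustrated with a minimal Python sketch. It is not taken from any particular inference tool: the function name `group_indistinguishable`, the input layout (a mapping from each identified peptide to the database proteins it could derive from), and the peptide and protein identifiers are all hypothetical. The sketch only merges proteins whose peptide sets are identical; it does not handle proteins whose evidence is a subset of another protein's, nor any scoring.

```python
from collections import defaultdict


def group_indistinguishable(peptide_to_proteins):
    """Group proteins that are supported by exactly the same set of peptides.

    peptide_to_proteins: dict mapping a peptide sequence to the list of
    database proteins that contain it (an assumed, illustrative format).
    """
    # Invert the mapping so each protein lists its supporting peptides.
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Proteins with an identical evidence set cannot be told apart,
    # so they are collected under the same key.
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)

    # Return (group members, shared peptides) pairs in a stable order.
    return sorted(
        (sorted(members), sorted(evidence))
        for evidence, members in groups.items()
    )


# Hypothetical identifications: P1 and P2 share all of their peptides.
example = {
    "PEPTIDEA": ["P1", "P2"],
    "PEPTIDEB": ["P1", "P2"],
    "PEPTIDEC": ["P3"],
    "PEPTIDED": ["P3", "P4"],
}

protein_groups = group_indistinguishable(example)
for members, evidence in protein_groups:
    print(members, evidence)
```

In this toy input, P1 and P2 end up in one group because neither has a peptide the other lacks. P4 stays separate from P3 even though its only peptide is also explained by P3; deciding how to report such cases is one of the places where inference approaches differ.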