Proteogenomics: Proteomics for Genome Annotation
One of major bottlenecks in omics biology is the generation of accurate gene models, including correct calling of the start codon, splicing of introns (taking account of alternative splicing), and the stop codon – collectively called genome annotation. Current genome annotation approaches for newly sequenced genomes are generally based on automated or semi-automated methods, usually involving gene finding software to look for intrinsic gene-like signatures (motifs) in the DNA sequence, the propagation of annotations from other (more well annotated) related species, and the mapping of experimental data sets, particularly from RNA Sequencing (RNA-Seq). Large scale proteomics data can also play an important role for confirming and correcting gene models. While proteomics approaches tend not to have the same level of sensitivity as RNA-Seq, they have the advantage that they can provide evidence that a predicted gene/transcript is indeed protein-coding. The use of proteomics data for genome annotation is called proteogenomics, and forms the basis for this chapter. We describe the theoretical underpinnings, different software packages that have been developed for proteogenomics, statistical approaches for validating the evidence, and support for proteogenomics data in file formats, standards and databases.