- 1.1 Introduction
- 1.2 Principles of LC-MS/MS Proteomics
- 1.2.1 Protein Fundamentals
- 1.2.2 Shotgun Proteomics
- 1.2.3 Separation of Peptides by Chromatography
- 1.2.4 Mass Spectrometry
- 1.3 Identification of Peptides and Proteins
- 1.4 Protein Quantitation
- 1.5 Applications and Downstream Analysis
- 1.6 Proteomics Software
- 1.6.1 Proteomics Data Standards and Databases
- 1.7 Conclusions
- References
Chapter 1: Introduction to Proteome Informatics
Published: 15 Nov 2016
C. Bessant, in Proteome Informatics, ed. C. Bessant, The Royal Society of Chemistry, 2016, ch. 1, pp. 1-14.
At its core, proteomics can be defined as the branch of analytical science concerned with identifying and, ideally, quantifying every protein within a complex biological sample. This chapter provides a high level overview of this field and the key technologies that underpin it, as a primer for the chapters that follow. It also introduces the field of proteome informatics, and explains why it is an integral part of any proteomics experiment.
1.1 Introduction
In an era of biology dominated by genomics, and next generation sequencing (NGS) in particular, it is easy to forget that proteins are the real workhorses of biology. Among other tasks, proteins give organisms their structure, they transport molecules, and they take care of cell signalling. Proteins are even responsible for creating proteins when and where they are needed and disassembling them when they are no longer required. Monitoring proteins is therefore essential to understanding any biological system, and proteomics is the discipline tasked with achieving this.
Since the ground-breaking development of soft ionisation technologies by Masamichi Yamashita and John Fenn in 1984,1 liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS, introduced in the next section) has emerged as the most effective method for high throughput identification and quantification of proteins in complex biological mixtures.2 Recent years have seen a succession of new and improved instruments bringing higher throughput, accuracy and sensitivity. Alongside these instrumental improvements, researchers have developed an extensive range of protocols which optimally utilise the available instrumentation to answer a wide range of biological questions. Some protocols are concerned only with protein identification, whereas others seek to quantify the proteins as well. Depending on the particular biological study, a protocol may be selected because it provides the widest possible coverage of proteins present in a sample, whereas another protocol may be selected to target individual proteins of interest. Protocols have also been developed for specific applications, for example to study post-translational modification of proteins, e.g.,3 to localise proteins to their particular subcellular location, e.g.,4 and to study particular classes of protein, e.g.5
A common feature of all LC-MS/MS-based proteomics protocols is that they generate a large quantity of data. At the time of writing, a raw data file from a single LC-MS/MS run on a modern instrument is over a gigabyte (GB) in size, containing thousands of individual high resolution mass spectra. Because of their complexity, biological samples are often fractionated prior to analysis and ten individual LC-MS/MS runs per sample is not unusual, so a single sample can yield 10–20 GB of data. Given that most proteomics studies are intended to answer questions about protein dynamics, e.g. differences in protein expression between populations or at different time points, an experiment is likely to include many individual samples. Technical and biological replicates are always recommended, at least doubling the number of runs and volume of data collected. Hundreds of gigabytes of data per experiment is therefore not unusual.
Such data volumes are impossible to interpret without computational assistance. The volume of data per experiment is actually relatively modest compared to other fields, such as next generation sequencing or particle physics, but proteomics poses some very specific challenges due to the complexity of the samples involved, the many different proteins that exist, and the particularities of mass spectrometry. The path from spectral peaks to confident protein identification and quantitation is complex, and must be optimised according to the particular laboratory protocol used and the specific biological question being asked. As laboratory proteomics continues to evolve, so do the computational methods that go with it. It is a fast moving field, which has grown into a discipline in its own right. Proteome informatics is the term that we have given this discipline for this book, but many alternative terms are in use. The aim of the book is to provide a snapshot of current thinking in the field, and to impart the fundamental knowledge needed to use, assess and develop the proteomics algorithms and software that are now essential in biological research.
Proteomics is a truly interdisciplinary endeavour. Biological knowledge is required to appreciate the motivations of proteomics, understand the research questions being asked, and interpret results. Analytical science expertise is essential – despite instrument vendors’ best efforts at making instruments reliable and easy to use, highly skilled analysts are needed to operate such instruments and develop the protocols needed for a given study. At least a basic knowledge of chemistry, biochemistry and physics is required to understand the series of processes that happen between a sample being delivered to a proteomics lab and data being produced. Finally, specialised computational expertise is needed to handle the acquired data, and it is this expertise that this book seeks to impart. The computational skills required span a wide range of specialities: algorithm design to identify peptides (Chapters 2 and 3); statistics to score and validate identifications (Chapter 4), infer the presence of proteins (Chapter 5) and perform downstream analysis (Chapter 14); signal processing to quantify proteins from acquired mass spectrometry peaks (Chapters 7 and 8); and software skills to devise and utilise data standards (Chapter 11) and analysis frameworks (Chapters 12–14), and to integrate proteomics data with NGS data (Chapters 15 and 16).
1.2 Principles of LC-MS/MS Proteomics
The wide range of disciplines that overlap with proteome informatics draws in a great diversity of people including biologists, biochemists, computer scientists, physicists, statisticians, mathematicians and analytical chemists. This poses a challenge when writing a book on the subject as a core set of prior knowledge cannot be assumed. To mitigate this, this section provides a brief overview of the main concepts underlying proteomics, from a data-centric perspective, together with citations to sources of further detail.
1.2.1 Protein Fundamentals
A protein is a relatively large (median molecular weight around 40 000 Daltons) molecule that has evolved to perform a specific role within a biological organism. The role of a protein is determined by its chemical composition and 3D structure. In 1949 Frederick Sanger provided conclusive proof6 that proteins consist of a polymer chain of amino acids (the 20 amino acids that occur naturally in proteins are listed in Table 1.1). Proteins are synthesised within cells by assembling amino acids in a sequence dictated by a gene – a specific region of DNA within the organism’s genome. As it is produced, physical interactions between the amino acids cause the string of amino acids to fold up into the 3D structure of the finished protein. Because the folding process is deterministic (albeit difficult to model) it is convenient to assume a one-to-one relationship between amino acid sequence and structure, so a protein is often represented by the sequence of letters corresponding to its amino acid sequence. These letters are said to represent residues rather than amino acids: two hydrogens and an oxygen are lost from each amino acid when it is incorporated into a protein, so the letters cannot strictly be said to represent amino acid molecules.
Amino acid | Abbreviation | Single letter code | Monoisotopic residue mass (Da)
---|---|---|---
Alanine | Ala | A | 71.037114 |
Cysteine | Cys | C | 103.009185 |
Aspartic acid | Asp | D | 115.026943 |
Glutamic acid | Glu | E | 129.042593 |
Phenylalanine | Phe | F | 147.068414 |
Glycine | Gly | G | 57.021464 |
Histidine | His | H | 137.058912 |
Isoleucine | Ile | I | 113.084064 |
Lysine | Lys | K | 128.094963 |
Leucine | Leu | L | 113.084064 |
Methionine | Met | M | 131.040485 |
Asparagine | Asn | N | 114.042927 |
Proline | Pro | P | 97.052764 |
Glutamine | Gln | Q | 128.058578 |
Arginine | Arg | R | 156.101111 |
Serine | Ser | S | 87.032028 |
Threonine | Thr | T | 101.047679 |
Valine | Val | V | 99.068414 |
Tryptophan | Trp | W | 186.079313 |
Tyrosine | Tyr | Y | 163.06333 |
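To make the use of Table 1.1 concrete: the monoisotopic mass of any unmodified peptide is the sum of its residue masses plus one water molecule (the two hydrogens and one oxygen regained at the free termini). A minimal Python sketch:

```python
# Monoisotopic residue masses from Table 1.1 (Da)
RESIDUE_MASS = {
    'A': 71.037114, 'C': 103.009185, 'D': 115.026943, 'E': 129.042593,
    'F': 147.068414, 'G': 57.021464, 'H': 137.058912, 'I': 113.084064,
    'K': 128.094963, 'L': 113.084064, 'M': 131.040485, 'N': 114.042927,
    'P': 97.052764, 'Q': 128.058578, 'R': 156.101111, 'S': 87.032028,
    'T': 101.047679, 'V': 99.068414, 'W': 186.079313, 'Y': 163.06333,
}
WATER = 18.010565  # monoisotopic mass of H2O, regained at the termini

def peptide_mass(sequence: str) -> float:
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

print(peptide_mass('PEPTIDE'))  # ≈ 799.35996 Da
```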
Organisms typically have thousands of genes, e.g. around 20 000 in humans. The human body is therefore capable of producing over 20 000 distinct proteins, which illustrates one of the major challenges for proteomics – the large number of distinct proteins that may be present in a given sample (referred to as the search space when seeking to identify proteins). The situation is further complicated by alternative splicing,7 where different combinations of segments of a gene are used to create different versions of the protein sequence, called protein isoforms. Because of alternative splicing, each human gene produces on average around five distinct protein isoforms, so our search space expands to ∼100 000 distinct proteins. If we are working with samples from a population of different individuals, the search space expands still further as some individual genome variations will translate into variations in protein sequence, some of which have transformative effects on protein structure and function.
However, the situation is yet more complex because, after synthesis, a protein may be modified by covalent addition (and possibly later removal) of a chemical entity at one or more amino acids within the protein sequence. Phosphorylation is a very common example, known to be important in regulating the activity of many proteins. Phosphorylation involves the addition of a phosphoryl group, typically (but not exclusively) to an S, T or Y. Such post-translational modifications (PTMs) change the mass of proteins, and often their function. Because each protein contains many sites at which PTMs may occur, there is a large number of distinct combinations of PTMs that may be seen on a given protein. This increases the search space massively, and it is not an exaggeration to state that the number of distinct proteins that could be produced by a human cell exceeds one million. We will never find a million proteins in a single cell – a few thousand is more typical – but the fact that these few thousand must be identified from a potential list of over a million represents one of the biggest challenges in proteomics.
1.2.2 Shotgun Proteomics
The obvious way to identify proteins from a complex sample would be to separate them from each other, then analyse each protein one by one to determine what it is. Although conceptually simple, practical challenges of this so-called top-down method8 have led the majority of labs to adopt the alternative bottom-up methodology, often called shotgun proteomics. This book therefore deals almost exclusively with the analysis of data acquired using this methodology, which is shown schematically in Figure 1.1.
In shotgun proteomics, proteins are broken down into peptides – amino acid chains that are much shorter than the average protein. These peptides are then separated, identified and used to infer which proteins were in the sample. The cleavage of proteins to peptides is achieved using a proteolytic enzyme known to cleave proteins at specific points. Trypsin, a popular choice for this task, generally cuts proteins after K and R, unless these residues are followed by P. The majority of the peptides produced by trypsin are between 4 and 26 amino acids long, equivalent to a mass range of approximately 450–3000 Da, which is well suited to analysis by mass spectrometry. Given the sequence of a protein, it is computationally trivial to determine the set of peptides that will be produced by tryptic digestion. However, digestion is not always 100% efficient, so any data analysis must also consider longer peptides that result from one or more missed cleavage sites.
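Given the cleavage rule just described, an in silico tryptic digest is easy to implement. The sketch below is a minimal illustration (not any particular package’s implementation) that also generates peptides containing up to a chosen number of missed cleavages:

```python
def tryptic_digest(protein: str, max_missed: int = 2) -> list[str]:
    """In silico tryptic digest: cleave after K or R unless followed by P.
    Returns fully cleaved peptides plus those containing up to
    max_missed missed cleavage sites."""
    # Indices after which the protein chain is cut
    cuts = [i + 1 for i, aa in enumerate(protein[:-1])
            if aa in 'KR' and protein[i + 1] != 'P']
    bounds = [0] + cuts + [len(protein)]
    fragments = [protein[bounds[i]:bounds[i + 1]]
                 for i in range(len(bounds) - 1)]
    # Join adjacent fragments to model missed cleavages
    peptides = []
    for i in range(len(fragments)):
        for j in range(i, min(i + max_missed + 1, len(fragments))):
            peptides.append(''.join(fragments[i:j + 1]))
    return peptides

print(tryptic_digest('MKWVTFISLLLLFSSAYSRGVFRR', max_missed=0))
# ['MK', 'WVTFISLLLLFSSAYSR', 'GVFR', 'R']
```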
1.2.3 Separation of Peptides by Chromatography
Adding an enzyme such as trypsin to a complex mixture of proteins results in an even more complex mixture of peptides. The next step in shotgun proteomics is therefore to separate these peptides. To achieve high throughput this is typically performed using high performance liquid chromatography (HPLC). Explanations of HPLC can be found in analytical chemistry textbooks, e.g.,9 but in simple terms it works by dissolving the sample in a liquid, known as the mobile phase, and passing this under pressure through a column packed with a solid material called the stationary phase. The stationary phase is specifically selected such that it interacts with, and therefore retards, some compounds more than others based on their physical properties. This separates different compounds because they are retained in the column for different amounts of time (their individual retention time, RT) and therefore emerge from the column (elute) separately. In shotgun proteomics, the stationary phase is usually chosen to separate peptides based on their hydrophobicity. Protocols vary, but a typical proteomics chromatography run takes 30–240 minutes depending on expected sample complexity and, after sample preparation, is the main factor determining the pace of most proteomic analyses.
While HPLC provides some form of peptide separation, the complexity of biological samples is such that many peptides co-elute, so further separation is needed. This is done in the subsequent mass spectrometry step, which also leads to peptide identification.
1.2.4 Mass Spectrometry
In the very simplest terms, mass spectrometry (MS) is a method for sorting molecules according to their mass. In shotgun proteomics, MS is used to separate co-eluting peptides after HPLC and to determine their mass. A detailed explanation of mass spectrometry is beyond the scope of this chapter. The basic principles can be found in analytical chemistry textbooks, e.g.,10 and an in-depth introduction to peptide MS can be found in ref. 11, but a key detail is that a molecule must be carrying a charge if it is to be detected. Peptides in the liquid phase must therefore be ionised and transferred to the gas phase prior to entering the mass spectrometer. The so-called soft ionisation methods of electrospray ionisation (ESI)1,12 and matrix assisted laser desorption–ionisation (MALDI)13,14 are popular for this because they bestow charge on peptides without fragmenting them. In these methods a positive charge is imparted by transferring one or more protons to the peptide, a process called protonation. If a single proton is added, the peptide becomes a singly charged (1+) ion, but higher charge states (typically 2+ or 3+) are also possible as more than one proton may be added. The mass of a peptide correspondingly increases by one proton (∼1.007 Da) for each charge added. Not every copy of every peptide gets ionised (this depends on the ionisation efficiency of the instrument) and it is worth noting that many peptides are very difficult to ionise, making them essentially undetectable in MS – this has a significant impact on how proteomics data are analysed, as we will see in later chapters.
The charge state is denoted by z (e.g. z = 2 for a doubly charged ion) and the mass of a peptide by m. Mass spectrometers measure the mass to charge ratio of ions, so always report m/z, from which mass can be calculated if z can be determined. In a typical shotgun proteomics analysis, the mass spectrometer is programmed to perform a survey scan – a sweep across its whole m/z range – at regular intervals as peptides elute from the chromatography column. This results in a mass spectrum consisting of a series of peaks representing peptides whose horizontal position is indicative of their m/z (There are invariably additional peaks due to contaminants or other noise.). This set of peaks is often referred to as an MS1 spectrum, and thousands are usually acquired during one HPLC run, each at a specific retention time.
The current generation of mass spectrometers, such as those based on orbitrap technology,15 can provide a mass accuracy better than 1 ppm so, for example, the mass of a singly charged peptide with m/z of 400 can be determined to an accuracy of 0.0004 Da. Determining the mass of a peptide with this accuracy provides a useful indication of the composition of a peptide, but does not reveal its amino acid sequence because many different sequences can share exactly the same mass.
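To illustrate these relationships, the sketch below recovers a neutral peptide mass from an observed m/z and charge state, and expresses a mass difference in ppm. The example ion is the PEPTIDE peptide from the earlier sketch, observed as a hypothetical 2+ ion:

```python
PROTON = 1.007276  # monoisotopic mass of a proton, Da

def neutral_mass(mz: float, z: int) -> float:
    """Neutral (uncharged) peptide mass from observed m/z and charge z."""
    return mz * z - z * PROTON

def ppm_error(observed: float, theoretical: float) -> float:
    """Mass measurement error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

# The PEPTIDE example from above as a 2+ ion (illustrative values)
m = neutral_mass(400.6873, 2)
print(m, ppm_error(m, 799.359964))  # ≈ 799.3600 Da, ≈ 0.1 ppm
```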
To discover the sequence of a peptide we must break it apart and analyse the fragments generated. Typically, a data dependent acquisition (DDA) approach is used, where ions are selected in real time at each retention time by considering the MS1 spectrum, with the most abundant peptides (inferred from peak height) being passed to a collision chamber for fragmentation. Peptides are passed one at a time, providing a final step of separation, based on mass. A second stage of mass spectrometry is performed to produce a spectrum of the fragment ions (also called product ions) emerging from the peptide fragmentation – this is often called an MS2 spectrum (or MS/MS spectrum). Numerous methods have been developed to fragment peptides, including electron transfer dissociation (ETD16) and collision induced dissociation (CID17). The crucial feature of these methods is that they predominantly break the peptide along its backbone, rather than at random bonds. This phenomenon, shown graphically in Figure 1.2, produces fragment ions whose masses can be used to determine the peptide’s sequence.
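In CID-type fragmentation the dominant backbone fragments are conventionally called b-ions (N-terminal pieces) and y-ions (C-terminal pieces). Reusing the RESIDUE_MASS, WATER and PROTON constants from the sketches above, a theoretical singly charged fragment ladder can be computed as follows:

```python
def fragment_ladder(peptide: str) -> tuple[list[float], list[float]]:
    """Theoretical singly charged b- and y-ion m/z values.
    b ions: cumulative N-terminal residue mass + one proton;
    y ions: cumulative C-terminal residue mass + water + one proton."""
    b, y = [], []
    prefix, suffix = 0.0, WATER
    for aa in peptide[:-1]:            # b1 .. b(n-1)
        prefix += RESIDUE_MASS[aa]
        b.append(prefix + PROTON)
    for aa in reversed(peptide[1:]):   # y1 .. y(n-1)
        suffix += RESIDUE_MASS[aa]
        y.append(suffix + PROTON)
    return b, y

b_ions, y_ions = fragment_ladder('PEPTIDE')
# Mass differences between adjacent ions recover the residue masses,
# and hence the sequence - the basis of de novo sequencing (Chapter 2).
```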
The DDA approach has two notable limitations: it is biased towards peptides of high abundance, and there is no guarantee that a given peptide will be selected in different runs, making it difficult to combine data from multiple samples into a single dataset. Despite this, DDA remains popular at the time of writing, but two alternative methods are gaining ground. Selected reaction monitoring (SRM) aims to overcome DDA’s limitations by a priori selection of peptides to monitor (see Chapter 9) at the expense of breadth of coverage, whereas data independent acquisition (DIA) simply aims to fragment every peptide (see Chapter 10).
1.3 Identification of Peptides and Proteins
Determining the peptide sequence represented by an acquired MS2 spectrum is the first major computational challenge dealt with in this book. The purest and least biased method is arguably de novo sequencing (Chapter 2), in which the sequence is determined purely from the mass differences between adjacent fragment ions. In practice, identifying peptides with the help of information from protein sequence databases such as UniProt18 is generally considered more reliable, and an array of competing algorithms has emerged for performing this task (Chapter 3). These algorithms require access to a representative proteome, which may not be available for non-model organisms and some other complex samples. In these cases, a sample specific database may be created from RNA-seq transcriptomics data collected from the same sample (Chapter 16). Spectral library searching (also covered in Chapter 3) offers a further alternative, if a suitable library of peptide MS2 spectra exists for the sample under study.
None of the available algorithms gives a totally definitive peptide match for a given spectrum; instead, each provides a score indicating the likelihood that the match is correct. Historically, each algorithm provided its own proprietary score, but great strides have been made in recent years in developing statistical methods for objectively scoring and validating peptide spectrum matches independently of the identification algorithm used (see Chapter 4). Confidently identified peptides can then be used to infer which proteins are present in the sample. There are a number of challenges here, including the aforementioned problem of undetectable peptides, and the fact that many peptides map to multiple proteins. These issues, and current solutions to them, are covered in Chapter 5.
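One widely used validation strategy (detailed in Chapter 4) is the target-decoy approach: spectra are searched against the real database plus a database of reversed or shuffled decoy sequences, and the rate at which decoys are matched estimates the false discovery rate (FDR). A minimal sketch, assuming a list of scored peptide spectrum matches each labelled as target or decoy:

```python
def fdr_threshold(psms: list[tuple[float, bool]], fdr: float = 0.01) -> float:
    """Lowest score cut-off at which the estimated FDR (decoys/targets
    among accepted PSMs) stays within the requested level.
    psms: (score, is_decoy) pairs, where higher scores are better."""
    targets = decoys = 0
    threshold = float('inf')
    for score, is_decoy in sorted(psms, key=lambda p: -p[0]):
        decoys += is_decoy
        targets += not is_decoy
        if targets and decoys / targets <= fdr:
            threshold = score  # accepting everything down to here is fine
    return threshold

# e.g. fdr_threshold([(87.2, False), (85.1, False), (64.9, True)], 0.01)
```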
As mentioned earlier, the phenomenon of post-translational modification complicates protein identification considerably by massively increasing the search space. Chapter 6 discusses this issue and summarises current thinking on how best to deal with PTM identification and localisation.
1.4 Protein Quantitation
In most biological studies it is important to augment protein identifications with information about the abundance of those proteins. Laboratory protocols for quantitative proteomics are numerous and diverse; indeed, there is a whole book in this series dedicated to the topic.19 Each protocol requires different data processing, leading to a vast range of quantitative proteomics algorithms and workflows. For the purposes of this book we have made a distinction between methods that extract the quantitative information from MS1 spectra (covered in Chapter 7) and those that use MS2 spectra (Chapter 8). Despite the diversity of quantitation methods, the vast majority infer protein abundance from peptide-level features, so there is much in common between the algorithms used.
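As a flavour of what this peptide-to-protein step involves, the sketch below aggregates peptide intensities to protein abundance by simple summation – one strategy among many; real workflows must also handle shared peptides, normalisation and missing values:

```python
from collections import defaultdict

def protein_abundance(peptide_intensity: dict[str, float],
                      peptide_to_protein: dict[str, str]) -> dict[str, float]:
    """Sum peptide intensities per protein (assumes each peptide is
    unique to one protein; shared peptides need more careful treatment,
    see Chapter 5)."""
    totals = defaultdict(float)
    for peptide, intensity in peptide_intensity.items():
        totals[peptide_to_protein[peptide]] += intensity
    return dict(totals)
```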
1.5 Applications and Downstream Analysis
As we have seen, identifying and quantifying proteins is a complex process but is one that has matured enough to be widely applied in biological research. Most researchers now expect that a list of proteins and their abundances can be extracted for a given biological sample. Of course, any serious research project is unlikely to conclude with a simple list of identified proteins and their abundance. Further analysis will be needed to interpret the results obtained to answer the biological question posed, from biomarker discovery through to systems biology studies.
Downstream analysis is not generally covered in this book, partly because there are too many potential workflows to cover, but mainly because many of the methods used are not specific to proteomics. For example, statistical approaches used for determining which proteins are differentially expressed between two populations are often similar to those used for finding differentially expressed genes – typically a significance test followed by some multiple testing correction.20 Similarly, the pathway analysis performed with proteomics data is not dissimilar to that carried out with gene expression data.21
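As a concrete illustration of that typical workflow, the sketch below applies a per-protein Welch t-test followed by Benjamini–Hochberg correction to two intensity matrices (the data shapes are illustrative, not taken from any specific package):

```python
import numpy as np
from scipy import stats

def differential_expression(group_a: np.ndarray, group_b: np.ndarray,
                            alpha: float = 0.05) -> np.ndarray:
    """Per-protein Welch t-test with Benjamini-Hochberg correction.
    group_a, group_b: proteins x replicates intensity matrices.
    Returns a boolean mask of significantly changed proteins."""
    _, p = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)
    n = len(p)
    order = np.argsort(p)
    # BH: find largest rank k with p_(k) <= (k/n) * alpha,
    # then accept all proteins ranked up to k
    below = p[order] <= (np.arange(1, n + 1) / n) * alpha
    significant = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        significant[order[:k + 1]] = True
    return significant
```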
However, caution is needed when applying transcriptomics methods to proteomics data, as there are many subtle differences. Incomplete sequence coverage due to undetectable peptides is one important difference between proteomics and RNA-seq, and the confidence of protein identification and quantification should also be considered. For example, proteins identified from a single peptide observation (so-called “one hit wonders”) should be avoided in any quantitative analysis, as their abundance accuracy is likely to be poor (see Chapter 5). PTMs are another important consideration, as they have the potential to affect a protein’s role in pathway analysis. One area of downstream analysis that we have chosen to cover is genome annotation using proteomics data (proteogenomics, Chapter 15), as this is an excellent and very specific example of proteomics being combined with genomics, and sometimes also transcriptomics, to better understand an organism.
1.6 Proteomics Software
As the proteomics community has grown, so has the available software for handling proteomics data. It is not possible to cover all available software within a book of this size, and nor is it sensible as the situation is in constant flux, with new software being released, existing software updated and old software having support withdrawn (but rarely disappearing completely). For this reason, most of the chapters in this book avoid focussing on specific software packages, instead discussing more generic concepts and algorithms that are implemented across multiple packages. However, for the benefit of readers new to the field, it is worth briefly surveying the current proteomics software landscape.
At the time of writing, proteomics is dominated by a relatively small number of generally monolithic Windows-based desktop software packages. These include commercial offerings such as Proteome Discoverer from Thermo and Progenesis QI from Waters, and freely available software such as MaxQuant22 and Skyline.23 Some of these packages support the whole data analysis workflow, from raw data through protein identification and quantitation and on to statistical analysis of the results. Reliance on Windows is unusual in the scientific research community, but is perhaps explained by the fact that most mass spectrometer control software is Windows-based and some raw MS data formats can only be accessed using Windows-based software libraries.24 From a bioinformatics perspective there are clear disadvantages to the status quo, including a lack of flexibility, a lack of transparency due to closed source code in some cases, and doubts about whether desktop-based Windows software can scale to cope with growing datasets. However, bench scientists appreciate the quality and usability of these packages and they are likely to remain popular for the foreseeable future.
The aforementioned packages are complemented by a vast array of other software tools, most of which have been developed by academic groups and are freely available. Typically, these packages are reference implementations of a published algorithm designed to perform a specific task (e.g. peptide identification), or support a particular protocol (e.g. quantitation with specific labels). Assembling such tools into a pipeline can be challenging, but can be the best way of implementing a specialised workflow. To ease the process of integrating disparate tools, developers are increasingly making their software available within common open frameworks such as OpenMS (Chapter 12), Galaxy (Chapter 13), BioConductor (Chapter 14) and as a set of PSI-centric libraries (see Chapter 11). These frameworks are mainly differentiated by their user interfaces and the programming languages that underpin them (C++ for OpenMS, R for BioConductor and Java for the PSI libraries). Galaxy is largely language agnostic, although much of its internals are written in Python.
1.6.1 Proteomics Data Standards and Databases
As in other data rich fields of biological research, the proteomics community has established databases to share data from proteomics experiments, and to enable interoperability between different pieces of software. This has proven difficult due to the wide range of proteomics protocols in use and different opinions about the most appropriate way to represent the results of a proteomics experiment, e.g. should raw data be stored or is a list of identified proteins sufficient? Questions like these have been tackled by the Human Proteome Organisation Proteomics Standards Initiative (HUPO–PSI), who have drawn up guidelines for reporting minimum information about a proteomics experiment (MIAPE) and data formats that capture the necessary information in a consistent way (see Chapter 11).
Progress in community standards for reporting results has paved the way for public proteomics data repositories. Arguably PRIDE25 is foremost among these, as it is long established and at the time of writing is the only proteomics database backed by a dedicated bioinformatics institution (the European Bioinformatics Institute). Several leading journals request, or require, deposition of data to PRIDE to support any paper that involves proteomics. Other well established databases include PeptideAtlas,26 GPMDB27 and PASSEL28 (specifically for SRM data), but there are many more. A recent review article29 provides an extensive overview of the current state of proteomic repositories.
1.7 Conclusions
At the time of writing, much crucial groundwork in proteome informatics is already in place, but many interesting challenges remain and new challenges continue to appear as new laboratory protocols and biological applications emerge and evolve. Proteome informatics is therefore an active area of research, and it is now easier to get into thanks to an abundance of excellent freely available software tools and large collections of high quality data in public repositories.
The author would like to acknowledge funding from BBSRC grants BB/M020118/1 and BB/M006174/1. Many thanks to Pedro Cutillas for reviewing this chapter.