Celia Blanco,*^ab Samuel Verbanic,^bc Burckhard Seelig^de and Irene A. Chen^abc
^a Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA 93106, USA. E-mail: blanco@ucsb.edu
^b Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, CA 90095, USA
^c Program in Biomolecular Sciences and Engineering, University of California, Santa Barbara, CA 93106, USA
^d Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN 55455, USA
^e BioTechnology Institute, University of Minnesota, St. Paul, MN 55108, USA
First published on 17th January 2020
In vitro selection using mRNA display is currently a widely used method to isolate functional peptides with desired properties. The analysis of high throughput sequencing (HTS) data from in vitro evolution experiments has proven to be a powerful technique but only recently has it been applied to mRNA display selections. In this Perspective, we introduce aspects of mRNA display and HTS that may be of interest to physical chemists. We highlight the potential of HTS to analyze in vitro selections of peptides and review recent advances in the application of HTS analysis to mRNA display experiments. We discuss some possible issues involved with HTS analysis and summarize some strategies to alleviate them. Finally, the potential for future impact of advancing HTS analysis on mRNA display experiments is discussed.
Fig. 1 General scheme for the isolation of active sequences from in vitro evolution experiments, HTS and the analysis of the sequencing data. A large library of mutant variants is subjected to a selection process in which survival depends on the ability to carry out a specific biochemical function (e.g., binding). Selected variants are isolated and amplified while unselected variants are discarded. The cycle of selection and amplification is repeated several times (rounds) until variants with high activity dominate the library. The final library (and possibly intermediate pools) is then sequenced, such as by using high throughput sequencing (HTS) technologies. Finally, HTS data are analyzed using bioinformatic tools appropriate for the project's goal. For a more detailed explanation of the selection process see Fig. 2.
High throughput sequencing (HTS) data analysis for in vitro evolution experiments has become increasingly common in the last decade for DNA and RNA molecules. However, this approach has not been as widely applied to mRNA display selections. The focus of this Perspective is to examine possible applications of mRNA display and HTS in contexts relevant to the field of physical chemistry, such as improved measurement of binding kinetics and a deeper understanding of molecular fitness landscapes. We discuss the capabilities of HTS analysis applied to in vitro selections of peptides and review recent progress made in the field. Based on our experience, we discuss some possible issues that might arise during the sequencing process or the data pre-processing steps. Finally, we offer our perspective on possible future applications of HTS and mRNA display experiments, and discuss the potential effect that future improvements in sequencing technology might have on the field.
The phage genotype–phenotype linkage is exploited in another cellular method, called Phage-Assisted Continuous Evolution (PACE).11 In PACE, a library of plasmids encoding an evolvable gene is transformed into E. coli. The host cells also contain plasmids that express phage proteins, but expression of an essential gene, pIII, is suppressed. Instead, pIII is only expressed if the selection plasmid has the desired activity, enabling production of a functional phage. Phage encoding enzymes with higher activity produce more pIII, in turn producing more viable phage that can infect more cells, propagating their genotype. The major advantage of this system is faster evolution: as its name suggests, PACE is continuous, allowing many generations of selection to proceed without manual intervention to cycle through a selection scheme. PACE has drawbacks, though, primarily the demanding experimental design, the genetic engineering required, and the custom-built apparatus, all of which may be difficult for a non-expert to implement.
Two widely used acellular approaches for protein selection and evolution are ribosome display and mRNA display (Table 1). Library diversity using these methods is usually 10¹²–10¹⁴ variants, surpassing the cellular methods by orders of magnitude.5,12–14 In addition to the increased library size due to the lack of need for transformation, acellular approaches show reduced biases by avoiding cellular expression (e.g., the toxicity of protein sequences is not relevant in acellular approaches). Like cellular methods, acellular methods are amenable to combination with diversity-generating techniques that mimic natural evolution: error-prone PCR is used to introduce random mutations, gene shuffling (recombination) is used to generate permutations of mutations, and non-natural amino acids can be introduced.
| | Cell-based selections^a | In vitro selections: ribosome display | In vitro selections: mRNA display |
|---|---|---|---|
| Library diversity | 10⁶–10⁹ | ∼10¹³ | ∼10¹³ |
| Genotype–phenotype connection | Non-covalent | Non-covalent | Covalent |
| Type of protein | Cell-compatible only | Any | Any |
| Temperature range | ±5 °C^b | ∼4 °C | 0–100 °C |
| Buffer conditions | Must be compatible with cell or phage integrity | High Mg²⁺, low T; must be compatible with ternary complex | Generally tolerant as long as compatible with chemical integrity of protein and RNA |

^a Selection parameters and conditions are limited to ensure compatibility with cell survival. ^b Optimum temperature depends on the type of cell used. Phages may tolerate wider temperature ranges.
In order to create the physical link between genotype and phenotype, both ribosome display and mRNA display take advantage of the fact that an mRNA and its encoded protein, while not covalently bound, are in intimate proximity during translation. Thus, manipulation of events surrounding termination of in vitro translation can capture an mRNA together with its newly expressed protein molecule. In ribosome display,15–17 the mRNA and the translated peptide or protein product are held together non-covalently by the ribosome. To accomplish this, the stop codon of the gene is deleted, and therefore the ribosome does not dissociate at the end of mRNA translation. This ternary complex of mRNA, ribosome and peptide is further stabilized mainly through high Mg²⁺ concentrations and incubation at low temperature. While this complex can be stable for days, any subsequent selection conditions are limited to those that preserve the integrity of the mRNA–ribosome–peptide complex. When more physiological and/or stringent conditions are of interest, mRNA display is an important alternative (Table 1).
Progress of the selection is commonly monitored by measuring the recovery of mRNA-displayed proteins during the selection step, which is expected to increase over rounds if active variants are being selected. This measurement estimates the bulk activity (binding or catalysis) of the library of enriched variants. When the desired variants have been sufficiently enriched, the proteins are identified by DNA sequencing and subsequently analyzed individually as appropriate for the particular activity. In a complementary approach, the progress of the enrichment can also be monitored through DNA sequencing of the library after each round of selection. The comparison of populations of variants over the course of selection can reveal the enrichment of dominant proteins.
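As a concrete illustration of this kind of comparison, the sketch below computes round-over-round enrichment from per-round count tables. It assumes each round's sequencing output has already been reduced to a {sequence: count} dictionary; the sequences, counts, and pseudocount value are hypothetical, chosen only for illustration.

```python
# Minimal sketch: round-over-round enrichment from per-round count tables.
# The pseudocount keeps sequences present in only one of the two rounds
# from producing zero or infinite ratios.

def enrichment(counts_prev, counts_next, pseudocount=0.5):
    """Fold change in relative frequency between consecutive rounds."""
    total_prev = sum(counts_prev.values())
    total_next = sum(counts_next.values())
    seqs = set(counts_prev) | set(counts_next)
    return {
        s: ((counts_next.get(s, 0) + pseudocount) / total_next)
         / ((counts_prev.get(s, 0) + pseudocount) / total_prev)
        for s in seqs
    }

round3 = {"MAKQV": 120, "MLAHT": 30, "MQTRR": 2}   # hypothetical peptide counts
round4 = {"MAKQV": 900, "MLAHT": 60}
for seq, fold in sorted(enrichment(round3, round4).items(), key=lambda x: -x[1]):
    print(f"{seq}\t{fold:.2f}")
```

Sequences whose fold change consistently exceeds 1 across rounds are candidates for dominance, even before they constitute a large fraction of the pool.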
As with other selection techniques, the general biophysical nature of the selectable entity should be kept in mind. In this case, the mRNA–protein fusion is mostly RNA by mass. With an RNA monomer being roughly three times the mass of an amino acid and each codon being three nucleotides long, the mRNA–protein fusion is approximately 1/10 protein by mass. A fortunate consequence of this is that the fusion benefits from the high solubility of the negatively charged RNA. Thus, while random protein sequences are prone to aggregation,21 random mRNA–protein fusions are less so. Solubility can be further improved by a selection preceding the intended selection (a ‘pre-selection’), in which the soluble fraction itself is selected. On the other hand, the covalent linkage of the mRNA to the peptide means that it is possible for an mRNA–protein fusion to survive a selection due to activity of the mRNA, not the peptide. For example, a selection may inadvertently result in ribozyme or aptamer sequences. This outcome can be guarded against by strategies such as ‘protection’ of the mRNA in a duplex with complementary DNA. It is also possible that the presence of the RNA affects the peptide's activity, e.g., by effects on folding, such that a fusion exhibiting a particular activity may not exhibit the same activity when expressed as an isolated peptide. In addition, the selection step often requires experimental measures that should be kept in mind during data interpretation, such as the need to attach an affinity tag to the substrate to render selected molecules isolable. As with any in vitro selection experiment, the resulting ‘hits’ must be validated by additional assays. Nevertheless, due to the minimalist display design, the stability of the covalent link, and the freedom to operate under a wide range of conditions in this in vitro format, mRNA display is a powerful method among peptide and protein display systems.
mRNA-displayed peptide and protein libraries have regularly been selected to isolate protein binders and, in some cases, even enzymes.22,23 For example, mRNA display has been utilized to study protein–protein interactions, or interactions between proteins and small molecules or other targets.24–26 mRNA display has also been used to display cyclic peptide libraries, enabling the discovery of bioactive macrocycles as potential drug candidates.27–29 More detailed reviews of the mRNA display technology and its applications can be found elsewhere.24,28,30–33 Furthermore, mRNA display has proven to be particularly suitable for the investigation of fundamental questions. For example, mRNA display selection can be used to discover entirely de novo proteins from libraries of randomized polypeptides, with implications for the potential origin of the earliest functional proteins.34,35 Another example is using mRNA display to mimic natural Darwinian protein evolution in the lab to examine protein fitness landscapes.36–39 The high versatility of mRNA display adds to its appeal for the in vitro selection of peptides and proteins.
High-throughput sequencing (HTS) refers to a number of technologies capable of producing a large amount of sequence data (Table 2). HTS methods are highly scalable, with some allowing a large number of different variants (thousands to millions or even billions) to be sequenced in parallel. HTS methods are also referred to as next-generation sequencing (NGS) or second-generation sequencing (2GS) in the literature. However, the term NGS intuitively refers to the most recent sequencing technology and has therefore been progressively abandoned in the literature since the advent of newer long-read sequencing methods. In the last 20 years, the data output capacity has outpaced Moore's law, and the associated costs have dropped at almost the same rate. While the sequencing of one entire human genome in the Human Genome Project took 13 years and cost nearly three billion dollars,41 nowadays many whole human genomes can be sequenced within a single day for approximately a thousand dollars each. HTS technologies have tremendously impacted several fields of biological research and have opened the door to new approaches in medicine, such as in personalized medicine.42–44
| Company | Platform | Run time | Maximum output | Maximum read length | Reads per run |
|---|---|---|---|---|---|
| Illumina Inc. | MiSeq | 4–55 h | 15 Gb | 2 × 300 bp | 25 M per lane |
| | NextSeq | 12–30 h | 120 Gb | 2 × 150 bp | 400 M per lane |
| | HiSeq 3000 | <1–3.5 days | 750 Gb | 2 × 150 bp | 2.5 B per lane |
| | HiSeq 4000 | <1–3.5 days | 1.5 Tb | 2 × 150 bp | 5 B per lane |
| | HiSeq X Series | <3 days | 1.8 Tb | 2 × 150 bp | 6 B per flow cell |
| | NovaSeq 6000 | ∼13–38 h | N/A | 2 × 250 bp | 10 B per lane |
| Pacific Biosciences Inc. | PacBio RS II | 0.5–4 h | 1 Gb | ∼10–15 kb | 50–80 k |
| Life Technologies Corp. | Ion GeneStudio S5 | 4.5–19 h | 15 Gb | 200–600 bp | 2–130 M |
| | Ion GeneStudio S5 Plus | 3–20 h | 30 Gb | 200–600 bp | 2–130 M |
| | Ion GeneStudio S5 Prime | 3–10 h | 50 Gb | 200–600 bp | 2–130 M |
| Sequencing by Oligo Ligation Detection | SOLiD 5500 W | 10 days | 120 Gb | 2 × 50 bp | 1.2 B |
| | SOLiD 5500xl W | 10 days | 240 Gb | 2 × 50 bp | 2.4 B |
| Roche Inc. | 454 GS FLX+ | 10–23 h | 450–700 Mb | Up to 1 kb | 1 M |
| | 454 GS Jr | 10 h | 35 Mb | 400 bp | 100 k |
| Oxford Nanopore | Flongle | 1 min–16 h | 2 Gb | >2 Mb | 126 channels |
| | MinION | 1 min–48 h | 50 Gb | >2 Mb | 512 channels |
| | GridION Mk1 | 1 min–48 h | 250 Gb | >2 Mb | 512 × 5 channels |
| | PromethION 24 | 1 min–72 h | 5.2 Tb | >2 Mb | 24 × 3000 channels |
| | PromethION 48 | 1 min–72 h | 10.5 Tb | >2 Mb | 48 × 3000 channels |
The sequencing market is currently dominated by the Illumina platform. However, several other companies offer sequencing platforms that use different technologies (Table 2), which present different advantages and disadvantages. Illumina's high popularity is mainly due to its sheer throughput and low cost, such that for genomic sequencing applications high coverage can be readily obtained. However, a major limitation is the length of individual sequence reads. For longer reads (more than a few hundred base pairs), other sequencing technologies are preferable or necessary (Table 3). Sequencing platforms capable of long reads usually have higher error rates, but strategies can often be devised to circumvent this problem (e.g., multiple effective reads of the same base), and technologies are constantly under development in this highly competitive area. Detailed comparisons among different sequencing methods can be found elsewhere.45–48
| Company | Advantages | Disadvantages | Library amplification | Sequencing technology |
|---|---|---|---|---|
| Illumina Inc. | Large user base platform; low cost per base; high coverage (high output) | Short reads | Bridge PCR on flow cell surface | Reversible-terminator sequencing by synthesis |
| Pacific Biosciences Inc. | Very long reads (>1 kb); short run time; low reagent cost | High basal error rate; low output | N/A | Single-molecule, real-time DNA sequencing by synthesis |
| Life Technologies Corp. | High coverage; longer reads | Lower output | PCR on FlowChip surface | Polymerase synthesis |
| Sequencing by Oligo Ligation Detection | Low cost per base; low reagent cost; inherent error correction (two-base encoding) | Short reads; long run time | Emulsion PCR | Sequencing by ligation |
| Roche Inc. | Longer reads; short run times; high coverage | Higher cost per base; high reagent cost; high error rates in homopolymer repeats | Emulsion PCR on microbeads | Pyrosequencing |
| Oxford Nanopore | Very long reads; customization | High error rate; difficult to design multiple parallel pores | N/A | Nanopore exonuclease sequencing |
In mRNA display selections, HTS enables the tracking of evolutionary paths of selected sequences throughout the selection and evolution process, as well as measurement of the distribution of activity of proteins over sequence space. A clear impact of HTS is that the high depth of sequencing can reveal a greater number of active sequences, especially those without closely related neighbors. Although such sequences might be rare, they could exhibit high activity and thus be of interest. Also, a practical benefit of deep sequencing is the potential for reducing the number of cycles required to identify active clones. Without HTS, a selection is usually pursued until the active variants represent a majority of the library, so that a small number of clones subjected to Sanger sequencing would identify the ‘winning’ sequences. With HTS, however, the selection can be stopped relatively early and clones selected on the basis of the rate of their enrichment, even if they are present at somewhat low relative abundance (e.g., <1%).40 In summary, large sequencing depth while tracking selections can enable both improved identification of active clones and a better understanding of the evolutionary process. We discuss in Section 5 below some important applications in which NGS is transforming our understanding of fundamental problems.
Given the unprecedented increase in data, an interesting question is whether NGS can allow one to entirely circumvent the evolutionary process during discovery of functional peptides from a diverse library. With unlimited sequencing capability, one could imagine sequencing the starting library, subjecting the library to a single screening reaction, and then sequencing the selected pool. In principle, a comparison of the composition of the pool before and after the screen should yield estimates of the relative activities of all of the different sequences in the library. Whether this is attainable in practice depends on the library size. In directed evolution experiments, initial libraries are generated to, ideally, cover as much sequence and structural diversity as possible while targeting the activity of interest; the larger the library, the greater the chance of discovering rare, active sequences. Libraries generated using mRNA display methods can typically contain up to ∼10¹⁴ different variants. While Sanger sequencing could yield perhaps a few hundred sequences, massively parallel HTS methods can read up to 10¹⁰ (as of 2019, Illumina's NovaSeq 6000 System can yield a maximum of 20 billion reads per run49). However, despite the high number of different variants that can be sequenced nowadays, the number of variants present in the initial library is still higher. Therefore, if the initial library has fewer than ∼1 billion variants, it is conceivable to use HTS effectively as a screen. But if mRNA display is used for exploring extremely diverse libraries, at least a few rounds of selection are currently necessary to reduce the complexity of an mRNA display library to a tractable size.
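The read-budget argument can be made quantitative with a back-of-envelope model (our illustration, not a calculation from a specific reference): if R reads are drawn roughly uniformly from a library of N unique variants, per-variant read counts are approximately Poisson with mean R/N, so the fraction of variants observed at least once is 1 − e^(−R/N).

```python
# Back-of-envelope sampling check: how deeply does a run with R reads
# sample a library of N unique variants, assuming roughly uniform sampling?
from math import exp

def fraction_seen(n_variants, n_reads):
    """Expected fraction of variants observed at least once (Poisson model)."""
    mean_coverage = n_reads / n_variants
    return 1.0 - exp(-mean_coverage)

for n_variants in (1e8, 1e9, 1e10, 1e12, 1e14):
    print(f"N = {n_variants:.0e}: {fraction_seen(n_variants, 1e10):.2%} "
          f"seen at least once with 1e10 reads")
```

With 10¹⁰ reads, essentially every member of a 10⁸–10⁹ library is observed, but only ∼1% of a 10¹² library is, which is consistent with the ∼1 billion-variant threshold suggested above.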
To understand the importance of HTS for examining fitness landscapes, let us consider how the shape of these landscapes influences natural selection. Under conditions in which selection pressures are strong, as is common during in vitro selection, sequences evolve by local uphill climbs over the landscape. Over a perfectly smooth peak, one could imagine easily reaching the global optimum through a continuously uphill climb. However, if the landscape contains many local optima with valleys separating them from the global optimum, populations of sequences may become trapped on the local optima.37,51 Thus, the ability of natural selection to discover an optimal sequence is heavily influenced by the ruggedness of the fitness landscape.
One measure of ruggedness is epistasis, which describes how different sites along the sequence interact to determine the fitness contribution of each mutation. In other words, in a landscape with epistasis, the genetic background of a mutation influences how beneficial or detrimental that mutation is. Sign epistasis describes the situation in which the effect on fitness of a single mutation is either positive or negative depending on the presence or absence of another mutation. Reciprocal sign epistasis corresponds to a particular case of sign epistasis in which mutations that are independently advantageous become jointly unfavorable (or vice versa). Such epistasis is particularly important for the landscape, as it leads to local optima.52 Epistasis is therefore an important feature determining the viability of individual evolutionary pathways of protein sequences.37 Calculations to measure epistasis in experimental fitness landscapes have been reviewed elsewhere.53
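For concreteness, the sketch below implements the standard log-additive definition of pairwise epistasis, along with a simple predicate for reciprocal sign epistasis; the function names and example fitness values are ours, and real analyses (reviewed in ref. 53) require additional statistical care about measurement error.

```python
# Pairwise epistasis as the deviation of the double mutant's log-fitness
# from the sum of the single-mutant effects (a common, minimal definition).
from math import log

def pairwise_epistasis(w_wt, w_a, w_b, w_ab):
    """epsilon = log(w_ab) - log(w_a) - log(w_b) + log(w_wt).
    epsilon == 0: additive; > 0: positive; < 0: negative epistasis."""
    return log(w_ab) - log(w_a) - log(w_b) + log(w_wt)

def is_reciprocal_sign_epistasis(w_wt, w_a, w_b, w_ab):
    """Each mutation beneficial alone but deleterious in the other's
    background (or the mirror case with all inequalities reversed)."""
    flip_up = (w_a > w_wt and w_b > w_wt) and (w_ab < w_a and w_ab < w_b)
    flip_down = (w_a < w_wt and w_b < w_wt) and (w_ab > w_a and w_ab > w_b)
    return flip_up or flip_down

# Example: two individually beneficial mutations that clash when combined.
print(pairwise_epistasis(1.0, 1.5, 1.4, 0.6))            # negative epsilon
print(is_reciprocal_sign_epistasis(1.0, 1.5, 1.4, 0.6))  # True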
To map fitness landscapes, individual sequences would be sampled and their fitness determined (such as by sequencing). Let us consider how the depth of sampling influences our ability to probe ruggedness and epistasis. In the simplest, smoothest landscape (i.e., that with a single peak), the absence of local maxima implies that under a regime of strong selection and weak mutation, evolution starting from any point in sequence space will end in the global optimum. Generally, sparse random sampling on these topographies will still give an adequate representation of the landscape, because the fitness of unsampled points in sequence space can be interpolated using an assumption of additivity among mutations (Fig. 3A). In contrast, a highly rugged landscape would occur if the fitness of related sequences were totally uncorrelated (Fig. 3C). Evolution on this type of landscape will almost certainly not end in the global optimum, and populations will be ‘stuck’ in peaks corresponding to local maxima. In these topographies, sparse random sampling cannot give a proper representation of the landscape. Although these two cases (completely correlated and completely random fitness landscapes) are interesting as theoretical limits, most empirical landscapes exhibit an intermediate degree of ruggedness (as well as a certain degree of correlation), lying somewhere in between these two limiting cases (Fig. 3B). The ruggedness of the landscape ultimately determines whether subsampling of sequence space can result in a trustworthy representation of the topography. Severe undersampling of a rugged landscape would miss many epistatic correlations. For realistically rugged landscapes, high sampling levels, enabled by HTS, are essential for understanding fitness landscapes.
The issue of adequately sampling fitness landscapes resembles the core idea behind the Nyquist–Shannon sampling theorem for digital signal processing.54 In this field, sampling refers to the process of converting a continuous signal into a string of discrete values. The theorem states that, for a given continuous function, there is a critical, minimum rate of sampling for which perfect reconstruction of the function is guaranteed, this rate being at least twice the maximum frequency of the signal. That is, f_N = f_S/2, where f_N is the critical frequency (also called the Nyquist frequency) and f_S is the sampling frequency. Similarly, for a given sample rate, there is a maximum bandlimit or frequency that ensures perfect reconstruction. For example, in the case of a sine wave, sampling at less than twice the maximum frequency will lead to a lower frequency sine wave. This phenomenon is known as aliasing (Fig. 3D). Sampling at more than twice the maximum frequency ensures perfect reconstruction of the wave function. In the case of fitness landscapes, whether the number of sampled sequences is large enough to reconstruct the topography of the fitness landscape depends on the topographical features of the landscape, which are determined by the epistatic interactions. In this context, rugged landscapes are at a higher risk of suffering ‘aliasing’.
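The aliasing phenomenon itself is easy to reproduce numerically. The toy sketch below (our illustration of the analogy, not a landscape calculation) shows that a 9 Hz sine sampled at only 10 Hz is indistinguishable, at the sample points, from a 1 Hz sine.

```python
# Toy aliasing demo: a 9 Hz sine sampled at 10 Hz (below the 18 Hz Nyquist
# requirement) coincides, up to sign, with a 1 Hz sine at the sample points.
import numpy as np

f_signal = 9.0        # Hz, true frequency
f_sampling = 10.0     # Hz, below 2 * f_signal -> aliasing expected
t = np.arange(0.0, 1.0, 1.0 / f_sampling)

samples_true = np.sin(2 * np.pi * f_signal * t)
samples_alias = np.sin(2 * np.pi * (f_sampling - f_signal) * t)  # 1 Hz

# sin(2*pi*9*k/10) == -sin(2*pi*1*k/10), so the sampled values coincide
# up to sign: the 9 Hz content masquerades as low-frequency structure.
print(np.allclose(samples_true, -samples_alias))   # True
```

By analogy, sparse sampling of a rugged landscape can make sharp, epistatic fitness variation masquerade as smooth, additive structure.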
In addition to mapping epistatic landscapes, which is discussed further in Section 6, the combined use of mRNA display libraries and HTS methods can provide a direct view into the evolutionary history of peptides over the course of selection.37 Like Sanger sequencing, HTS can identify the different families of sequences selected for high activity at the end of the selection, but with higher depth. Additionally, it can provide valuable information on sequence composition at different points of the selection, i.e., 'snapshots' during evolution. At each snapshot, deep sequencing can reveal the number of families of similar sequences, the size of each family present, as well as information about the common motifs of a family or the different motifs across families. Merging the sequencing data across rounds of selection thus can provide a window into the details of the evolutionary process. For example, one can estimate how the number of families and their sizes changed over the selection, at which point of the selection the different families emerged (or were left behind) and how they compete with and relate to each other. Importantly, one may potentially trace the evolutionary trajectory of the most active families across different rounds of selection.
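Grouping reads into families is itself a bioinformatic step. One minimal approach is sketched below: greedy single-linkage clustering of equal-length peptides by Hamming distance, seeding families with the most abundant sequences. The threshold and greedy strategy are illustrative choices of ours, not the method used in the cited studies.

```python
# Sketch: group equal-length peptide reads into 'families' by greedy
# clustering on Hamming distance, seeding with the most abundant reads.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def assign_families(sequences, max_dist=2):
    families = []  # list of (seed, members)
    for seq in sorted(set(sequences), key=sequences.count, reverse=True):
        for seed, members in families:
            if len(seq) == len(seed) and hamming(seq, seed) <= max_dist:
                members.append(seq)
                break
        else:  # no existing family close enough: start a new one
            families.append((seq, [seq]))
    return families

reads = ["MAKQV", "MAKQV", "MAKQI", "MTKQV", "WLPRH", "WLPRH", "WLPRY"]
for seed, members in assign_families(reads):
    print(seed, "->", members)
```

Applying such a clustering to each round's reads, and matching families between rounds, yields the per-family trajectories described above.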
Another important finding was that many mutations that were generally deleterious were found to be beneficial in at least one alternative mutational background. This is relevant because, although rare, positive epistasis can substantially expand the functional portion of sequence space, and thus, the accessible evolutionary pathways. Again, the depth of data from HTS was required to discover these rare situations, which may have an outsized impact. An illustrative example of the importance of these rare pathways came in 2016, when the same group used mRNA display and HTS to experimentally characterize the fitness landscape of four amino acid sites in protein GB1, corresponding to 20⁴ = 160,000 variants,36 including several mutations with interactions known to be positively epistatic.39 Reciprocal sign epistasis (i.e., mutations that are separately advantageous become jointly unfavorable) blocked many direct evolutionary paths through genotype space,59 leading to an appearance of difficult optimization over the local landscape. However, these ‘dead end paths’ could be circumvented by following longer indirect paths through consecutive gains and losses of mutations. In other words, they are overcome through reversible mutations that avoid the need to lose fitness at any particular step. This mechanism allows protein optimization by natural selection (i.e., uphill climbs) despite epistasis. The indirect paths reduce the constraint on adaptive protein evolution, supporting the idea that the previously ignored regions of the functional sequence space may be crucial for the evolution of proteins. This highlights the qualitative importance of HTS, which allows much deeper exploration of sequence space and discovery of rare but important features, for understanding evolutionary trajectories.
One consequence of using HTS to analyze in vitro selection is that one may obtain quite a long list of candidate sequences. Additional experiments are required to quantify relative binding affinities for each candidate sequence. Even for a relatively small number of sequences, this characterization step is often labor-intensive and represents an experimental bottleneck in analysis of selected sequences. Given results from HTS, the problem is seriously compounded by the number of candidates. To solve this problem, the Roberts group recently used a combination of mRNA display and HTS to calculate the on- and off-rates for many thousands of mRNA-displayed ligands simultaneously, without synthesizing or purifying individual sequences.38 To do so, they devised a method based on the fact that the on- and off-rates of a sequence (k_on and k_off) determine its fractional presence at different time points during the selection step. That is, sequences with high on-rates are present in higher fractions at early time points because they bind quickly to the target; however, at later time points, as the fraction of ligands with slower on-rates bound to the target increases, the fraction of the ‘fast’ ligands bound to the target decreases. Following this idea, they mixed a library of mRNA–peptide fusions with an immobilized target and removed an aliquot at different time points for washing, PCR, and deep sequencing. The resulting HTS data yielded the identity of all the ligands bound to the target at each time point and their frequencies, which could be used to calculate the on-rates of each ligand. Off-rates could be measured in an analogous fashion, and binding affinities (K_d = k_off/k_on) were therefore obtained for thousands of ligands in parallel. This example illustrates the creative use of HTS for not only tracking sequences during evolution, but also for massively parallelizing a binding assay.
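A minimal version of this idea can be sketched with the standard pseudo-first-order association model (our simplification; ref. 38 describes the actual pipeline): the fraction bound follows f(t) = A(1 − e^(−k_obs·t)) with k_obs = k_on[T] + k_off and plateau A = k_on[T]/k_obs, so fitting A and k_obs at a known target concentration [T] yields both rate constants and hence K_d.

```python
# Sketch: recover kon/koff for one sequence from its binding time course
# using the pseudo-first-order association model (an illustrative fit,
# not the exact method of ref. 38).
import numpy as np
from scipy.optimize import curve_fit

def association(t, amplitude, k_obs):
    return amplitude * (1.0 - np.exp(-k_obs * t))

def fit_rates(t, frac_bound, target_conc):
    (amplitude, k_obs), _ = curve_fit(
        association, t, frac_bound, p0=(0.5, 0.1), bounds=(0, np.inf))
    k_on = amplitude * k_obs / target_conc
    k_off = k_obs * (1.0 - amplitude)
    return k_on, k_off, k_off / k_on   # Kd = koff / kon

# Synthetic demo: kon = 1e4 /M/s, koff = 1e-3 /s, [T] = 1 uM.
t = np.linspace(0, 600, 25)
k_obs_true = 1e4 * 1e-6 + 1e-3
true = association(t, (1e4 * 1e-6) / k_obs_true, k_obs_true)
rng = np.random.default_rng(0)
k_on, k_off, k_d = fit_rates(t, true + rng.normal(0, 0.01, t.size), 1e-6)
print(f"kon ~ {k_on:.3g} /M/s, koff ~ {k_off:.3g} /s, Kd ~ {k_d:.3g} M")
```

In the HTS setting, the per-sequence time course comes from read frequencies at each sampled time point, and the same fit is run for every sequence in parallel.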
There are a few means to overcome these issues. One method to combat the problem of overlapping fluorescent spots is to reduce the density of spots by diluting the sample, thus sacrificing sequencing depth for higher quality. Other methods involve increasing nucleotide diversity.65,66 For example, one may add (‘spike in’) a sample of high diversity such as the ΦX174 genome. This genome is from a small, well-characterized bacteriophage that has a relatively uniform base composition (and was one of the first whole genomes to be sequenced). Sequencing reads derived from this genome can be readily removed during bioinformatic processing. Spiking in ΦX174 DNA increases the sample diversity at the beginning of the read, improving intensity distribution issues during initial reading of the template. Depending on the specifics, it might be necessary to spike in between 5% and 50% ΦX174 DNA to achieve a sufficiently diverse sample.67 As with the method of sample dilution, the main disadvantage of spiking in a high amount of ΦX174 is the loss of sequencing depth for the desired sample. Ideally, the amount of ΦX174 used should achieve a good balance between improvement in sequencing quality and loss of reads.
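The bioinformatic removal of spike-in reads can be as simple as discarding any read that shares a k-mer with the ΦX174 reference; the sketch below illustrates this idea (similar in spirit to what dedicated filtering tools do, though not their implementation; the value of k, the matching rule, and `phix_genome` are placeholders).

```python
# Sketch: remove PhiX spike-in reads by shared k-mers with the reference.

def kmers(seq, k=31):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def build_phix_index(phix_genome, k=31):
    """Index k-mers of both strands of the PhiX reference sequence."""
    return kmers(phix_genome, k) | kmers(revcomp(phix_genome), k)

def remove_spike_in(reads, phix_index, k=31):
    """Keep only reads sharing no k-mer with the PhiX reference."""
    return [r for r in reads if not (kmers(r, k) & phix_index)]
```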
An alternative method to increase nucleotide diversity without sacrificing sequencing depth is the use of degenerate insertions after the adapter constant region. To increase the diversity of the pool, a series of random nucleotides can be added to the adapter region, after the primer binding site.68 If the added series of random nucleotides is of varying length (e.g., 2, 4, and 6 nt), this addition can improve sequence diversity by essentially frame-shifting the sequences with respect to one another. This increases the diversity in not only the initial primer but also beyond it, due to the frame shift, and is likely superior to spike-in or dilution methods. However, while the spike-in and dilution methods can be applied to samples that have already been prepared (i.e., after an issue has been identified), the addition of a small randomized region would require design of additional PCRs and fresh sample preparation. Since these methods are not mutually exclusive, a combination of methods could be considered for particularly problematic cases.
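To make the frame-shifting idea concrete, the sketch below generates oligo variants carrying random inserts of 0, 2, 4, and 6 nt between the adapter and the primer binding site; both constant-region sequences are illustrative placeholders, not sequences from a specific kit or study.

```python
# Illustrative generation of frame-shifting library oligos: random 'stagger'
# inserts of varying length offset otherwise identical constant regions.
import random

ADAPTER = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # e.g., an Illumina-style adapter
PRIMER_SITE = "GGTGGCTCT"                      # hypothetical constant region

def frame_shift_oligos(insert_lengths=(0, 2, 4, 6), seed=0):
    rng = random.Random(seed)
    oligos = []
    for n in insert_lengths:
        stagger = "".join(rng.choice("ACGT") for _ in range(n))
        oligos.append(ADAPTER + stagger + PRIMER_SITE)
    return oligos

for oligo in frame_shift_oligos():
    print(oligo)
```

Mixing templates built from these staggered oligos means that, at any given sequencing cycle, the constant region is read at four different offsets, restoring per-cycle base diversity.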
An interesting advantage of HTS is the identification of minor anomalies in the constant regions that may have been otherwise overlooked during selection. These unanticipated insertions, deletions and substitutions might be either functional (i.e., selected) or non-functional (e.g., primer synthesis errors or sequencing errors). Knowledge of expected error rates and profiles during both synthesis and sequencing can be helpful, and the overall error rate from each run should be compared to standard error rates obtained using that technology. In general, we find that it is most useful to have a method for independent reads of the same template (e.g., in Illumina sequencing, paired-end reads with as much overlap as possible; or in PacBio sequencing, consensus sequencing) in order to reduce the error rate. In any case, anomalies point toward a need for further consideration.
During quality assessment, it is important to consider the library's properties and sequencing method, and how they will influence quality scores. For example, low-diversity libraries and read lengths >150 bp will typically have lower average quality scores than high-diversity libraries sequenced with short read lengths. In addition, it is expected in Illumina sequencing that quality decays later into the read. Therefore, reads can be quality-trimmed to remove low-quality bases with tools such as Trimmomatic79 or BBDuk,80 using parameters informed by the quality assessment (e.g. distribution of low-quality bases and their scores). This step ensures that only high quality bases are retained, which enables optimal read joining (for paired-end sequences) and reduces error-based noise in downstream analyses.
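As an illustration of what such trimming does, the sketch below implements a sliding-window trim in the spirit of Trimmomatic's SLIDINGWINDOW step (a simplified sketch of the concept, not the tool's actual implementation; window size, quality threshold, and minimum length are illustrative parameters).

```python
# Sketch: sliding-window quality trimming. Cut the read where the mean
# Phred score of a window first drops below the threshold, then discard
# reads that end up too short.

def phred_scores(quality_string, offset=33):
    return [ord(c) - offset for c in quality_string]

def sliding_window_trim(seq, qual, window=4, min_mean_q=20, min_len=36):
    scores = phred_scores(qual)
    for i in range(len(scores) - window + 1):
        if sum(scores[i:i + window]) / window < min_mean_q:
            seq, qual = seq[:i], qual[:i]
            break
    return (seq, qual) if len(seq) >= min_len else None

# Example: 40 high-quality bases (Q40) followed by 8 poor bases (Q2).
print(sliding_window_trim("ACGT" * 12, "I" * 40 + "#" * 8))
```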
Library metrics should be collected at each step of the pre-processing pipeline to assess progress toward the goal of retaining the most reads at the highest quality, as well as to quickly identify any coding errors. Common metrics include average read quality scores, read length distributions, total read counts, unique read counts, and percentage of reads retained. By monitoring these metrics at each step, the pre-processing pipeline can be fine-tuned to optimize the final output. It should also be noted that pre-processing steps are not limited to those listed here; other methods like length filtering, head cropping, and contaminant filtering can be implemented as needed to further increase the quality of the final library.
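A minimal sketch of such bookkeeping is shown below (the function and field names are our own); the same function is run after every stage and compared against the previous stage's values.

```python
# Sketch: per-step library metrics for a pre-processing pipeline.
from statistics import mean

def library_metrics(reads, quals, n_input_reads, phred_offset=33):
    """reads: list of sequences; quals: matching list of quality strings."""
    return {
        "total_reads": len(reads),
        "unique_reads": len(set(reads)),
        "pct_retained": 100.0 * len(reads) / n_input_reads,
        "mean_read_length": mean(len(r) for r in reads),
        "mean_quality": mean(
            mean(ord(c) - phred_offset for c in q) for q in quals),
    }
```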
Ultimately, the final indicator of successful pre-processing is the set of amino acid sequences produced by in silico translation of the pre-processed reads. These should have near-uniform length distributions at the expected length (or lengths), be consistent with the expected amino acid composition, and retain conserved or semi-conserved motifs and the overall structural framework, if any.
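A sketch of this final check is shown below, assuming Biopython is installed; the example reads and the expectation of a single sharp length peak are illustrative.

```python
# Sketch: translate pre-processed reads in silico and inspect the peptide
# length distribution, which should peak sharply at the designed length(s).
from collections import Counter
from Bio.Seq import Seq

def translate_reads(dna_reads):
    peptides = []
    for read in dna_reads:
        read = read[: len(read) - len(read) % 3]      # drop partial codon
        pep = str(Seq(read).translate(to_stop=True))  # stop at first stop
        peptides.append(pep)
    return peptides

peptides = translate_reads(["ATGGCTAAACAGGTT", "ATGACTAAGCAGGTA"])
print(Counter(len(p) for p in peptides))  # e.g., Counter({5: 2})
```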
As the affordability of HTS increases, future progress is expected in the field of molecular evolution. Experiments performed in the past, in which only a few variants were sequenced and tested for activity, might now become the starting point for future studies. As long as samples for such experiments are still available, a kind of molecular archeology can be performed on the freezer samples. A notable subject of such study is the Lenski lab's famous long-term evolution experiment (LTEE) on E. coli, which began in 1988 and has progressed through more than 60,000 generations.87–89 Although the beginning of the LTEE predated HTS, freezer samples examined from early generations can reveal the emergence of new mutations, including new metabolic activities.
As HTS technology improves, it is interesting to consider whether greater sequencing depth is always desirable. While more information is undoubtedly obtained, greater computational time is also required to process the data, and in some samples, there is likely to be little overall benefit to greater sequencing depth. For example, the number of unique sequences (i.e., the pool complexity) in very diverse samples, such as the initial pool, probably exceeds the capacity of sequencing (i.e., 10¹⁴ different sequences); the benefit of 10⁹ reads compared to 10⁸ reads or even fewer is unclear. At the other extreme, for a highly converged pool of low complexity, such as would be derived from samples late in the selection, additional reads are also less useful; if the pool contains 100 unique sequences, the benefit of having 10⁹ vs. 10⁸ sequences is also marginal. Thus, greater sequencing depth will become most useful for pools of intermediate diversity. Having said this, there are certain scenarios in which very deep sequencing of low and high complexity pools may be useful, such as if one intends to characterize the bias among k-mers of the synthesized pool90,91 or if the frequency distribution of sequences in a highly converged pool is very uneven (i.e., some sequences of interest are present at very low abundance). Another special scenario that may require increasingly deep sequencing is the systematic exploration of fitness landscapes, in which all mutations at certain sites are investigated. If it is desirable to compare frequencies of each mutant before and after selection, then the number of reads that can be obtained from the initial pool becomes a critical parameter (i.e., a 20-fold increase in sequencing depth will allow one additional site to be explored by saturation mutagenesis).
At the same time, while HTS analysis offers important quantitative and qualitative advantages for the analysis of in vitro evolution experiments, care is required during each step of the analysis to ensure that the analysis itself does not bias the results (i.e., high quality data are preserved and artifacts are not introduced). Seemingly minor choices during data processing, such as number of errors tolerated in the adapter or primer sequence, or length of the sequence extracted, can have unexpectedly large effects on the quality and quantity of the resulting sequences; thus attention should be paid to any processing step that gives an unexpectedly low yield of passing sequences. Sequencing error is a frequent issue when studying evolutionary trajectories, since it is essential to distinguish between sequencing errors and true mutations. In our experience, experimental measures taken to reduce sequencing error rates (e.g., paired-end sequencing with stringent joining criteria, consensus sequencing, etc.) are usually worth their effort and expense in order to reduce uncertainty in data interpretation or the need for error correction strategies. It is also good practice to validate results obtained from HTS by classical biochemical assays whenever possible, to ensure the reliability of the results and expose any biases that may have been introduced by the HTS analysis itself. On a practical note, it can be useful to take advantage of rapid 'micro' or 'nano' low-output runs to generate a small preliminary data set to test the analysis pipeline as well as the quality of the input sample. For example, for the low complexity samples generated by in vitro evolution, such preliminary runs can uncover important but correctable sample issues.
As HTS instruments themselves decrease in cost and new instruments replace old ones, another interesting avenue for future research will be custom modification of the instruments themselves to achieve new goals. HTS technology combines miniaturization, massive parallelization, and highly sensitive detection – these features are assets to a number of potential applications. For example, an mRNA bound to the surface of a chip could be translated, assayed, and sequenced all at once.92 Some sequencing technologies assay single molecules, and could probe not only nucleobases and epigenetic modifications but also other chemical varieties that could be of interest. In the next stage of technology exploration, it will become increasingly common not only to use HTS as a ‘black box’ that produces sequencing data, but to adopt and alter the hardware directly in the laboratory for new, ‘off-label’ applications.