How far can you go? Extrapolating values of catalytic activity from known protein landscapes in natural and directed evolution

Douglas B. Kell; Ivayla Roberts

doi:10.1039/D5CS01387A

Douglas B. Kell*^abc and Ivayla Roberts^a

Author affiliations

* Corresponding authors

^a Department of Biochemistry, Cell and Systems Biology, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Crown St, Liverpool L69 7ZB, UK
E-mail: dbk@liv.ac.uk

^b The Novo Nordisk Foundation Centre for Biosustainability, Technical University of Denmark, Building 220, Søltofts Plads 200, 2800 Kongens Lyngby, Denmark

^c Department of Physiological Sciences, Faculty of Science, Stellenbosch University, Stellenbosch Private Bag X1, Matieland, South Africa

Abstract

The number of possible variants representing the landscape of a protein sequence of length N residues, made of the standard unmodified proteinogenic amino acids, is 20^N; its exhaustive experimental analysis is consequently intractable. Our focus is on the real and perceived shapes of different fitness landscapes. Epistasis refers to the phenomenon by which the ‘best’ amino acid at a given residue depends on the nature of the amino acid at one or more other residues. Because of epistasis, real protein landscapes display peaks representing local maxima, in which weak-mutation/strong-selection regimes can cause evolution to become trapped; such landscapes are described as rugged. Fortunately, although they are necessarily somewhat rugged, such protein landscapes possess regularities that admit their modelling from more limited experimental data, using the methods of statistics and machine learning. We provide a variety of arguments that for typical proteins of length 300–500 residues some 10⁵ or 10⁶ examples, and in favourable cases even fewer, are likely sufficient to allow a reasonable initial modelling (and accurate predictive exploration) of the entire 20^N landscape for properties such as k_cat. The distribution of fitness effects (DFE) around an existing wild type is usually fitted reasonably well by a gamma distribution. However, we also survey modern ideas, especially extreme value theory, that allow extrapolation from the known, with a focus on methods – especially the Generalised Pareto Distribution – that provide means of estimating the statistical likelihood of obtaining activities or fitnesses far greater than those observed in existing populations, which have themselves been measured only in small numbers. These likelihoods typically decrease exponentially, as do the errors of deep neural network models (as ‘universal approximators’) as a function of the size of the network and of the training data.
This is entirely consistent with the large differences between the minuscule amount of available sequence-activity data, which are necessarily local in character and reflect evolutionary contingency, and the overall distribution (20^N, where N might usefully be decreased), which would be expected to contain examples with much better properties than any observed thus far. Predictive modelling consequently requires careful choices of examples drawn from an extensive distribution (using active learning). For instance, a widespread view of a trade-off between catalytic activity and thermostability seems to follow directly from inadequate sampling. All of this has significant implications for the understanding, modelling, and optimisation of experiments in directed evolution and the biocatalysts they produce.
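
As an illustration of the kind of extrapolation described in the abstract, the following is a minimal sketch, not the authors' method. It assumes synthetic gamma-distributed activities rather than any real data, a 95th-percentile threshold, and a simple method-of-moments estimator for the Generalised Pareto Distribution (GPD); maximum likelihood is the more usual choice in practice. All function names are hypothetical.

```python
import math
import random


def log10_landscape_size(n_residues, alphabet=20):
    """Return log10 of the number of sequences of a given length (20^N)."""
    return n_residues * math.log10(alphabet)


def fit_gpd_moments(excesses):
    """Method-of-moments estimate of the GPD shape (xi) and scale (sigma).

    Uses mean = sigma / (1 - xi) and var = sigma^2 / ((1 - xi)^2 (1 - 2 xi)),
    so it is only meaningful when xi < 0.5 (finite variance).
    """
    n = len(excesses)
    mean = sum(excesses) / n
    var = sum((x - mean) ** 2 for x in excesses) / (n - 1)
    ratio = mean * mean / var
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * mean * (1.0 + ratio)
    return xi, sigma


def gpd_survival(y, xi, sigma):
    """P(excess > y) for a GPD with shape xi and scale sigma, for y >= 0."""
    if abs(xi) < 1e-9:
        return math.exp(-y / sigma)  # exponential limit as xi -> 0
    base = 1.0 + xi * y / sigma
    if base <= 0.0:
        return 0.0  # beyond the finite upper endpoint when xi < 0
    return base ** (-1.0 / xi)


def exceedance_probability(values, threshold, target):
    """Peaks-over-threshold estimate of P(value > target), target >= threshold."""
    excesses = [v - threshold for v in values if v > threshold]
    xi, sigma = fit_gpd_moments(excesses)
    p_over_threshold = len(excesses) / len(values)
    return p_over_threshold * gpd_survival(target - threshold, xi, sigma)


def demo(seed=0):
    """Estimate the chance of exceeding 1.5x the best observed synthetic value."""
    rng = random.Random(seed)
    # Assumption: synthetic activities drawn from a gamma distribution.
    values = [rng.gammavariate(2.0, 1.0) for _ in range(5000)]
    ordered = sorted(values)
    threshold = ordered[int(0.95 * len(ordered))]  # assumed 95th percentile
    target = 1.5 * ordered[-1]
    return exceedance_probability(values, threshold, target)
```

Calling `log10_landscape_size(300)` shows that a 300-residue protein has roughly 10^390 possible sequences, against which 10⁵ or 10⁶ measured examples is a vanishingly small sample. `demo()` returns an estimated probability of exceeding 1.5 times the best value in the synthetic sample; the estimate is sensitive to the choice of threshold and estimator.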

Chemical Society Reviews
