How far can you go? Extrapolating values of catalytic activity from known protein landscapes in natural and directed evolution

Abstract

The number of possible variants representing the landscape of a protein sequence of length N residues, made of the standard unmodified proteinogenic amino acids, is 20^N; its exhaustive experimental analysis is consequently intractable. Our focus is on the real and perceived shapes of different fitness landscapes. Epistasis refers to a phenomenon by which the ‘best’ amino acid at a given residue depends on the nature of the amino acid at one or more other residues. Because of epistasis, real protein landscapes are rugged: they display peaks representing local maxima in which weak-mutation/strong-selection regimes can cause evolution to become trapped. Fortunately, although they are necessarily somewhat rugged, such protein landscapes possess regularities that admit their modelling from more limited experimental data, using the methods of statistics and machine learning. We provide a variety of arguments that for typical proteins of length 300–500 residues some 10^5 or 10^6 examples, and in favourable cases even fewer, are likely sufficient to allow a reasonable initial modelling (and accurate predictive exploration) of the entire 20^N landscape for properties such as kcat. The distribution of fitness effects (DFE) around an existing wild type is usually reasonably well fitted by a gamma distribution. However, we also survey modern ideas, especially extreme value theory, that allow extrapolation from the known, with a focus on methods – especially the Generalised Pareto Distribution – that provide means for estimating the statistical likelihood of obtaining activities or fitnesses far greater than those observed in existing populations, which are measured from small samples. These likelihoods typically decrease exponentially, as do the errors of deep neural network models (as ‘universal approximators’) as a function of the size of the network and of the training data.
This is entirely consistent with the large differences between the minuscule amount of available sequence-activity data, which are necessarily local in character, reflecting evolutionary contingency, and the overall distribution (20^N, where N might usefully be decreased) that would be expected to contain examples that have much better properties than any observed thus far. This consequently requires careful choices of examples drawn from an extensive distribution (using active learning) for predictive modelling. For instance, a widespread view of a trade-off between catalytic activity and thermostability seems to follow directly from inadequate sampling. All of this has significant implications for the understanding, modelling, and optimisation of experiments in directed evolution and the biocatalysts they produce.
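The two quantitative ideas in the abstract, the size of the 20^N landscape and tail extrapolation with the Generalised Pareto Distribution, can be illustrated with a minimal sketch. It uses the standard GPD survival function for exceedances over a threshold. The function names and all numerical values for threshold, scale and shape are hypothetical, chosen only for illustration; they are not taken from the article.

```python
import math


def log10_landscape_size(n_residues, alphabet=20):
    """Return log10 of alphabet ** n_residues (kept in log space to avoid huge integers)."""
    return n_residues * math.log10(alphabet)


def gpd_exceedance(x, threshold, scale, shape):
    """P(X > x | X > threshold) under a Generalised Pareto Distribution.

    shape == 0 gives an exponential tail; shape < 0 gives a tail with a
    finite upper endpoint; shape > 0 gives a heavy (power-law) tail.
    """
    if x <= threshold:
        return 1.0
    z = (x - threshold) / scale
    if abs(shape) < 1e-12:
        return math.exp(-z)
    base = 1.0 + shape * z
    if base <= 0.0:
        # x lies beyond the finite upper endpoint that exists when shape < 0
        return 0.0
    return base ** (-1.0 / shape)


if __name__ == "__main__":
    # Fraction of a 300-residue landscape covered by 10^6 measured variants.
    log_size = log10_landscape_size(300)
    print(f"log10(20^300) = {log_size:.1f}")
    print(f"log10(fraction sampled by 10^6 variants) = {6 - log_size:.1f}")

    # Hypothetical tail parameters (illustrative only, not fitted to real data).
    for shape in (-0.5, 0.0, 0.5):
        p = gpd_exceedance(3.0, threshold=1.0, scale=1.0, shape=shape)
        print(f"shape={shape:+.1f}: P(X > 3 | X > 1) = {p:.4f}")
```

In practice the threshold, scale and shape would be estimated from measured activities above a chosen cut-off; the sign of the shape parameter then indicates whether improvements are bounded or heavy-tailed.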

Article information

Article type: Review Article
Submitted: 20 Nov 2025
First published: 13 May 2026
This article is Open Access (Creative Commons BY license)

Chem. Soc. Rev., 2026, Advance Article

D. B. Kell and I. Roberts, Chem. Soc. Rev., 2026, Advance Article, DOI: 10.1039/D5CS01387A

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
