
Substrate prediction for RiPP biosynthetic enzymes via masked language modeling and transfer learning

Abstract

Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting the specificity of RiPP biosynthetic enzymes; however, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream prediction of both LazBF and LazDEF substrates, and that masked language modeling of LazDEF substrate preferences likewise produced embeddings that improved prediction of substrates for both enzymes. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations acting within the same biosynthetic pathway. We also found that a single high-quality data set of substrates and non-substrates for one RiPP biosynthetic enzyme improved substrate prediction for distinct enzymes in data-scarce scenarios. Finally, we fine-tuned models on each data set and showed that the fine-tuned models provide interpretable insight that we anticipate will facilitate the design of substrate libraries compatible with desired RiPP biosynthetic pathways.
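To make the workflow described above concrete, the sketch below shows one way such a pipeline could be assembled: masked language modeling on peptides from one enzyme's substrate library, followed by extraction of mean-pooled embeddings and training of a downstream substrate/non-substrate classifier for a second enzyme. This is a minimal illustration, not the authors' code; the ESM-2 checkpoint, hyperparameters, toy peptide sequences, and logistic-regression classifier are all assumptions standing in for the published data sets and models.

```python
# Minimal sketch (not the authors' implementation): task-specific masked
# language modeling on peptide substrates, then transfer of the adapted
# embeddings to a downstream substrate/non-substrate classifier.
# Assumptions: HuggingFace transformers/datasets, an ESM-2 checkpoint, and
# placeholder peptide sequences/labels in place of the published data sets.
import torch
from datasets import Dataset
from sklearn.linear_model import LogisticRegression
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "facebook/esm2_t12_35M_UR50D"           # small ESM-2 model (assumed)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
mlm_model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# 1) Masked language modeling on peptides from one enzyme's substrate library.
lazbf_peptides = ["SSSWGSCCGT", "ASTCWPSGSS"]        # placeholder sequences
mlm_ds = Dataset.from_dict({"sequence": lazbf_peptides}).map(
    lambda x: tokenizer(x["sequence"], truncation=True),
    batched=True, remove_columns=["sequence"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="lazbf_mlm", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to="none"),
    train_dataset=mlm_ds,
    data_collator=collator)
trainer.train()

# 2) Mean-pooled embeddings from the adapted model for a second enzyme's peptides.
def embed(sequences):
    batch = tokenizer(sequences, padding=True, return_tensors="pt")
    batch = {k: v.to(mlm_model.device) for k, v in batch.items()}
    with torch.no_grad():
        hidden = mlm_model(**batch, output_hidden_states=True).hidden_states[-1]
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).cpu().numpy()

lazdef_peptides = ["GSCCGTWSSA", "PSTWGSSCAS"]       # placeholder sequences
lazdef_labels = [1, 0]                               # 1 = substrate, 0 = non-substrate

# 3) Downstream classifier trained on the transferred embeddings.
clf = LogisticRegression(max_iter=1000).fit(embed(lazdef_peptides), lazdef_labels)
print(clf.predict(embed(["SSSWGSCCGT"])))
```

The point of the sketch is the ordering of steps (masked language modeling on one enzyme's substrate data before downstream classification for another), not the particular checkpoint or classifier, both of which are placeholders.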

Graphical abstract: Substrate prediction for RiPP biosynthetic enzymes via masked language modeling and transfer learning


Article information

Article type: Paper
Submitted: 20 Jun 2024
Accepted: 28 Nov 2024
First published: 02 Dec 2024
Open Access: Creative Commons BY-NC licence

Digital Discovery, 2025, 4, 343-354


J. D. Clark, X. Mi, D. A. Mitchell and D. Shukla, Digital Discovery, 2025, 4, 343 DOI: 10.1039/D4DD00170B

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence. You can use material from this article in other publications, without requesting further permission from the RSC, provided that the correct acknowledgement is given and it is not used for commercial purposes.

