Reprogramming pretrained language models for protein sequence representation learning†
Abstract
Machine learning-guided solutions for protein learning tasks have made significant headway in recent years. However, success in scientific discovery tasks is limited by the accessibility of well-defined and labeled in-domain data. To tackle the low-data constraint, recent adaptations of deep learning models pretrained on millions of protein sequences have shown promise; however, constructing such domain-specific large-scale models is computationally expensive. Herein, we propose representation reprogramming via dictionary learning (R2DL), an end-to-end representation learning framework that reprograms deep models trained on alternate-domain tasks to perform well on protein property prediction with significantly fewer training samples. R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences by learning a sparse linear mapping between the English and protein sequence vocabulary embeddings. Our model can attain better accuracy and significantly improve data efficiency, by up to 10^4 times over the baselines set by pretrained and standard supervised methods. To this end, we reprogram several recent state-of-the-art pretrained English language classification models (BERT, TinyBERT, T5, and RoBERTa) and benchmark them on a set of protein physicochemical property prediction tasks (secondary structure, stability, homology, and solubility) as well as on a biomedically relevant set of protein function prediction tasks (antimicrobial activity, toxicity, antibody affinity, and protein–protein interaction).
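To make the core idea concrete, the sketch below illustrates one way a sparse linear mapping between vocabulary embeddings could be computed. This is not the authors' implementation: R2DL trains the mapping end-to-end against the downstream task loss through the frozen pretrained model, whereas this toy example only performs a standalone sparse-coding step (an ISTA-style proximal gradient update) on randomly generated placeholder embeddings; all dimensions, penalty weights, and variable names are assumptions for illustration.

```python
# Minimal sketch (assumptions throughout): learn a sparse coefficient matrix Theta
# so that each protein-token embedding is approximated as a sparse linear
# combination of frozen English token embeddings, V_protein ≈ Theta @ V_english.
import numpy as np

rng = np.random.default_rng(0)

d = 768            # embedding dimension of the pretrained language model (assumed)
n_english = 5000   # size of a subsampled English token vocabulary (assumed)
n_protein = 25     # e.g. 20 amino acids plus a few special tokens (assumed)

V_english = rng.normal(size=(n_english, d))   # frozen source-vocabulary embeddings
V_protein = rng.normal(size=(n_protein, d))   # target embeddings the map should reproduce

def soft_threshold(x, t):
    """Proximal operator of the L1 norm; drives small coefficients to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

Theta = np.zeros((n_protein, n_english))
lam = 0.1                                           # L1 penalty weight (assumed)
step = 1.0 / np.linalg.norm(V_english, 2) ** 2      # 1 / Lipschitz constant of the gradient

for _ in range(200):  # ISTA iterations
    residual = Theta @ V_english - V_protein
    grad = residual @ V_english.T                   # gradient of 0.5 * ||Theta V - P||^2
    Theta = soft_threshold(Theta - step * grad, step * lam)

# Each protein token is now represented as a sparse mixture of English token embeddings.
reprogrammed_embeddings = Theta @ V_english
print(f"fraction of nonzero coefficients: {np.mean(Theta != 0):.3f}")
```

In the full framework, the learned token-level map lets protein sequences be fed to the unchanged English language model, so only the (sparse) mapping parameters need to be trained for each downstream protein task.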