Chemical language models can generate biomolecules atom-by-atom

Abstract

Chemical language models are powerful generative models, capable of learning complex molecular distributions such as the largest molecules in PubChem. In this work, we show that chemical language models can also learn atom-level representations of substantially larger molecules, scaling even to biomolecules such as proteins. These models can generate entire biomolecules atom by atom, effectively learning the multiple hierarchical layers of molecular information from primary sequence to tertiary structure. We further demonstrate that chemical language models can explore chemical space and protein space simultaneously by generating novel examples of protein-drug conjugates. These results demonstrate the potential for atom-level biomolecular design with chemical language models.
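
To make "atom by atom" concrete, the sketch below splits a molecular string into the atom-level tokens a language model would predict one at a time. It is illustrative only and is not the authors' pipeline: the abstract does not state which string representation is used, so SMILES is assumed here, and the names `SMILES_TOKEN` and `tokenize` are hypothetical.

```python
import re

# Assumption: molecules are written as SMILES strings. The pattern covers
# common SMILES syntax only; it is a sketch, not a full SMILES grammar.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"           # bracket atoms such as [NH3+] or [C@@H]
    r"|Br|Cl"               # two-letter atoms, tried before single letters
    r"|[BCNOSPFI]"          # one-letter aliphatic atoms
    r"|[bcnops]"            # aromatic atoms
    r"|%\d{2}|\d"           # ring-closure labels
    r"|[()=#\-+/\\.:~*$]"   # branches, bond symbols and the dot separator
)


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens (hypothetical helper)."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so check nothing was lost.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Glycylglycine, a two-residue peptide, as an atom-level token sequence.
    print(tokenize("NCC(=O)NCC(=O)O"))
```

Under this assumption a protein is simply a much longer token sequence of the same kind, which is why sequence length becomes the main scaling concern at the biomolecule level.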

Article information

Article type: Paper
Submitted: 17 Mar 2025
Accepted: 26 Jun 2025
First published: 10 Jul 2025
This article is Open Access (Creative Commons BY license)

Digital Discovery, 2025, Advance Article

K. Zhu, D. Flam-Shepherd and A. Aspuru-Guzik, Digital Discovery, 2025, Advance Article, DOI: 10.1039/D5DD00107B

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
