Open Access Article
Kevin Zhu‡ab, Daniel Flam-Shepherd‡ab and Alán Aspuru-Guzik*abcde
aDepartment of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada. E-mail: aspuru.assistant@utoronto.ca
bVector Institute for Artificial Intelligence, Toronto, Ontario M5S 1M1, Canada
cDepartment of Chemistry, University of Toronto, Toronto, Ontario M5G 1Z8, Canada
dCanadian Institute for Advanced Research, Toronto, Ontario M5G 1Z8, Canada
eAcceleration Consortium, Toronto, Ontario M5S 3H6, Canada
First published on 10th July 2025
Chemical language models are powerful generative models, capable of learning complex molecular distributions such as that of the largest molecules in PubChem. In this work, we further show that chemical language models can learn atom-level representations of substantially larger molecules – scaling even to biomolecules like proteins. We show that chemical language models can generate entire biomolecules atom by atom – effectively learning the multiple hierarchical layers of molecular information from primary sequence to tertiary structure. Even further, we demonstrate that chemical language models can explore chemical space and protein space simultaneously by generating novel examples of protein–drug conjugates. These results demonstrate the potential for atom-level biomolecular design with chemical language models.
Recently, chemical language models1 were found to be able to generate molecules that are larger and more complex than typical small drug-like molecules, such as the largest molecules in PubChem. These molecules are still much smaller than proteins, but this indicates that atom-level protein generation with language models is feasible. In this work, we demonstrate that chemical language models are capable of generating entire proteins atom by atom, including biomolecules beyond protein space. Specifically, we train models on various biomolecules, including proteins from the Protein Data Bank. We also create a synthetic biomolecular dataset by attaching small molecules from the ZINC database5 to single-domain antibodies (sdAbs) obtained from the antibody structural database.6
We discover that chemical language models can learn the language of proteins entirely from scratch – learning to generate atom-level sequences that define proteins with valid primary sequences corresponding to meaningful secondary and tertiary structure, which we check using AlphaFold7 structure predictions. Importantly, the language model learns valid protein backbones and natural amino acid structures, as well as the primary sequence patterns in the training proteins. We further demonstrate that chemical language models can generate novel proteins and small molecules together, at the same time, as protein–drug conjugates. In particular, we find that the model learns both the protein space of the single-domain antibodies and the chemical space defined by the ZINC molecules – generating antibody–drug conjugates with valid and novel protein sequences and structures attached to novel drug-like warhead molecules similar to the structures in ZINC.
In this study, the datasets are constructed by using small proteins from the Protein Data Bank (PDB) – specifically between 50 and 150 residues – as well as sdAbs obtained from the antibody structural database.6 We use atom-level graph representations of each protein so that sidechain modifications can be made directly. For training, each protein can be parsed to a linear string representation, and random data augmentation can be used to increase the training data size. We describe the main details and results for each dataset in the following sections.
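As a sketch of this preprocessing step (assuming RDKit and the selfies package; the file name and augmentation count below are illustrative placeholders, not the exact settings used in this work), a protein structure can be parsed into an atom-level graph and emitted as several randomized string encodings:

```python
from rdkit import Chem
import selfies as sf

def protein_to_selfies(pdb_path, n_augmentations=5):
    """Parse a PDB file into an atom-level molecular graph and return several
    randomized linear string encodings as data augmentation."""
    mol = Chem.MolFromPDBFile(pdb_path, removeHs=True)
    if mol is None:
        return []
    strings = []
    for _ in range(n_augmentations):
        # A random atom ordering yields a different but equivalent SMILES each
        # time; each SMILES is then translated to SELFIES tokens for training.
        smiles = Chem.MolToSmiles(mol, doRandom=True)
        strings.append(sf.encoder(smiles))
    return strings

# augmented = protein_to_selfies("my_protein.pdb")  # hypothetical file name
```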
We check whether samples generated by the model are proteins by analyzing whether they preserve the basic structure of the protein backbone and the natural amino acid forms. First, we perform a backbone substructure search and then attempt to arrange the backbone from the N terminus to the C terminus while simultaneously classifying each sidechain using another substructure search against the standard set of amino acids. If this is successful and there are no discontinuities in the backbone or other sidechain errors, then we classify the sample as a protein and parse its amino acid sequence. By this process, we determine that roughly 68.2% of samples are proteins. Furthermore, all of the parsed amino acid sequences are unique (there are no duplicates and the model is not repeating specific proteins) and novel (they differ from the training sequences).
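A minimal sketch of this kind of substructure check is shown below using RDKit SMARTS queries; the actual backbone-tracing and sidechain-classification procedure is more involved, so this is only a heuristic illustration, not the exact filter used in the paper.

```python
from rdkit import Chem

# One backbone repeat unit: amide nitrogen, alpha carbon, carbonyl carbon.
BACKBONE_UNIT = Chem.MolFromSmarts("[NX3][CX4][CX3](=O)")
# Peptide bond joining consecutive residues: carbonyl carbon to the next backbone nitrogen.
PEPTIDE_BOND = Chem.MolFromSmarts("[CX3](=O)[NX3][CX4]")

def looks_like_protein(mol, min_residues=10):
    """Heuristic check that a generated molecule contains a continuous peptide backbone."""
    units = mol.GetSubstructMatches(BACKBONE_UNIT)
    links = mol.GetSubstructMatches(PEPTIDE_BOND)
    # A chain of n residues should give n backbone units joined by n - 1 peptide bonds.
    return len(units) >= min_residues and len(links) >= len(units) - 1

# mol = Chem.MolFromSmiles(decoded_smiles)  # a generated sample decoded from SELFIES
# if mol is not None and looks_like_protein(mol): ...
```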
We compare the distribution of amino acids in the training sequences to the distribution learned by the model based on the generated samples. We plot histograms, in Fig. 1(B), displaying the frequency of occurrence of every amino acid in samples from both the model and the training data – both distributions are very similar and mostly overlap although for some amino acids, the language model slightly underestimates the training frequencies.
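A minimal sketch of how such a frequency comparison can be computed (not the authors' plotting code), assuming the parsed sequences are available as one-letter strings:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aa_frequencies(sequences):
    """Relative frequency of each standard amino acid across a set of sequences."""
    counts = Counter(aa for seq in sequences for aa in seq)
    total = sum(counts[aa] for aa in AMINO_ACIDS) or 1
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

# train_freqs = aa_frequencies(train_sequences)  # sequences parsed from the training proteins
# model_freqs = aa_frequencies(model_sequences)  # sequences parsed from generated samples
# The two dictionaries can then be plotted side by side as in Fig. 1(B).
```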
Using AlphaFold,7 in Fig. 1(C), we visualize selected examples of proteins generated by the language model. In each sample, residues are color-coded according to pLDDT, a per-residue estimate of AlphaFold's confidence on a scale from 0 to 100. Regions with pLDDT > 90 are dark blue and have high accuracy. Regions with pLDDT from 90 down to 70 are still expected to be good predictions and are colored light blue, transitioning to green with decreasing confidence. Regions with pLDDT between 50 and 70 are lower confidence and are colored yellow to green. Regions with pLDDT < 50 are not confident, likely disordered, and are colored red.
On this scale, in Fig. 1(C), we see that the selected proteins generated by the model result in good structure predictions – ranging between 70 and 90 pLDDT. This indicates that the model can generate proteins with well-defined structures that are not disordered. For a simple baseline comparison, we considered sequences of random amino acids; the structure predictions for these consistently result in disordered proteins with low pLDDT (< 50).
Additionally, in Fig. 1(C), the proteins generated by the language model contain a variety of secondary structures, including alpha helices, beta sheets, and omega loops. Globally, the generated proteins combine many of these secondary structures into varied and unique domains. We can conclude, based on these samples and further examples in ESI Fig. S1,† that language models can generate proteins, atom by atom, not just with valid primary sequences but with meaningful secondary and tertiary structure.
Furthermore, in Fig. 1(C) and ESI Fig. S1,† under each generated protein we label the primary sequence percentage overlap between the generated protein and its most similar PDB training example, which ranges from 40% to 86% (excluding one generated protein for which no similar PDB training example was found). This is evidence that the model draws heavily from the amino acid sequence patterns in its training data but does not memorize them.
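The paper does not state the exact overlap metric; one simple way to compute a percentage overlap against the nearest training example is sketched below using Python's standard difflib, as an assumption rather than the authors' definition.

```python
import difflib

def percent_overlap(seq_a, seq_b):
    """Percentage of matching residues between two primary sequences."""
    matcher = difflib.SequenceMatcher(None, seq_a, seq_b)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / max(len(seq_a), len(seq_b))

def nearest_training_overlap(generated_seq, training_seqs):
    """Overlap with the most similar sequence in the training set."""
    return max(percent_overlap(generated_seq, train) for train in training_seqs)
```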
Also, in ESI Fig. S2,† we plot histograms comparing atom-level properties of the samples generated by the model with those of the training data. The model roughly approximates the training distributions but slightly underestimates some properties.
After training, we again generate 1k samples from the language model for evaluation and first test the model's ability to explore protein space and learn the distribution of single-domain antibodies; results are shown in Fig. 2(B–D). As with the standard protein data, we compare the distribution of amino acids in the training sequences to the distribution learned by the model. We plot histograms, in Fig. 2(B), displaying the frequency of occurrence of every amino acid in samples from both the model and the training data – from these, we can see the language model accurately learns the training distribution of amino acids. Similarly, in Fig. 2(C), the model accurately learns the size distribution of the training sdAbs.
As with the standard proteins, we can attempt to determine the amino acid sequences of the single-domain antibodies (ignoring the warheads). We determine that roughly 90.8% of samples are proteins, and their primary sequences are unique and novel (there are no duplicates and all differ from the training sequences).
Even further, examples of AlphaFold structure predictions,7 visualized in Fig. 2(D) and ESI Fig. S6,† show with high confidence that the language model can produce sequences that fold into the expected structure for single-domain antibodies. Additionally, based on the primary sequence overlap of model samples with their nearest PDB training example in Fig. 1(C) and ESI Fig. S1,† the model learns to produce amino acid sequences similar to the training sdAbs. The primary sequence overlap with training examples ranges from 63% to 93% in the ESI.† Investigating further, we see that the model draws heavily from the sdAb sequences, creating new sequences by memorizing small snippets of amino acids and by reusing larger training snippets with many single mutations randomly distributed throughout.
From the training examples and model samples, we detach and collect “warheads”, where we expand the definition of warhead to include the linker and sidechain in addition to the small molecule (warhead typically refers only to the small molecule). In Fig. 3(B), two examples of training and model warheads are shown as graphs to clarify this. Additional model and training warheads are shown as graphs in ESI Fig. S8 and S7† – as expected, the same four linkers repeat across samples, but the small molecules attached to them differ and are structurally similar to the ZINC molecules in the training warheads.
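One way to quantify this kind of structural similarity is sketched below, under the assumption of Morgan fingerprints and Tanimoto similarity in RDKit; this is an illustrative metric, not necessarily the comparison used in the paper.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def nearest_tanimoto(model_smiles, train_smiles_list):
    """Tanimoto similarity between a model warhead and its closest training warhead."""
    query = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(model_smiles), 2, nBits=2048)
    best = 0.0
    for smi in train_smiles_list:
        ref = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, nBits=2048)
        best = max(best, DataStructs.TanimotoSimilarity(query, ref))
    return best
```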
We also evaluate the language model's warheads in terms of their atom-level properties. In Fig. 3(A), the model captures the atom-level properties of the training warheads; specifically, it learns the continuous atom-level properties of the training warheads, including log P,11 drug-likeness (QED),12 Synthetic Accessibility Score (SA) and molecular graph complexity (BCT), as well as the number of atoms, bonds, rings and atoms in rings. However, the model slightly underestimates the main modes for QED and SA, as well as the number of rings per warhead.
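These properties can be reproduced with RDKit as sketched below; the SA score here uses the standard RDKit contrib implementation, which may differ in detail from the scoring used for Fig. 3(A).

```python
import os, sys
from rdkit import Chem, RDConfig
from rdkit.Chem import Descriptors, GraphDescriptors, QED

# The synthetic accessibility scorer ships in RDKit's contrib directory.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer

def warhead_properties(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return {
        "logP": Descriptors.MolLogP(mol),        # octanol-water partition coefficient
        "QED": QED.qed(mol),                     # drug-likeness
        "SA": sascorer.calculateScore(mol),      # synthetic accessibility
        "BCT": GraphDescriptors.BertzCT(mol),    # molecular graph complexity
        "atoms": mol.GetNumAtoms(),
        "bonds": mol.GetNumBonds(),
        "rings": mol.GetRingInfo().NumRings(),
        "atoms_in_rings": sum(1 for atom in mol.GetAtoms() if atom.IsInRing()),
    }
```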
Additionally, we assess the model warheads and compare them with the training warheads. We find that the model warheads are unique (there are no duplicates and the model is not repeating a few examples) as well as novel (the model does not make exact copies of warheads from the training data). Given that the linkers are memorized, this indicates that the model is learning to generate new small molecules similar to ZINC molecules, effectively exploring chemical space at the same time as it learns to explore the protein space defined by the sdAbs.
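A minimal sketch of such uniqueness and novelty checks, comparing canonical SMILES so that equivalent molecular graphs are recognized as duplicates:

```python
from rdkit import Chem

def uniqueness_and_novelty(model_smiles, train_smiles):
    """Fraction of unique model warheads and fraction not present in the training set."""
    canonical = lambda s: Chem.MolToSmiles(Chem.MolFromSmiles(s))
    model_set = {canonical(s) for s in model_smiles}
    train_set = {canonical(s) for s in train_smiles}
    uniqueness = len(model_set) / len(model_smiles)        # 1.0 means no duplicates
    novelty = len(model_set - train_set) / len(model_set)  # 1.0 means no training copies
    return uniqueness, novelty
```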
Also, in ESI Fig. S9,† we see that the model does learn the atom-level properties of the training antibody–drug conjugates. Additionally, in ESI Fig. S3–S5,† we show a single training antibody–drug conjugate and four model samples.
In effect, we demonstrate that chemical language models can also serve as biological language models, capable of learning the language of proteins atom by atom. Rather than learning representations of amino acid sequences alone, chemical language models generate entire molecular graphs, which allows them to explore not just chemical space but chemical space and protein space at the same time. This work is an initial demonstration of the potential of chemical language models beyond the scale and space they were designed for – more work is necessary to design real biomolecules that can be experimentally validated.
Further work should be done to enable the model to more consistently generate valid backbones and amino acid forms. This will also assist the model in learning distributions of larger biomolecules, including structures with more than 150 residues and multiple domains. Using memorizing transformers13 may help the model generate valid protein sequences. Also, other architectures built for longer sequence lengths14 could increase the size and range of structures that the model can learn. Another limitation is that chemical sequence representations do not include the three-dimensional structure of the biomolecule. A potential solution could involve using a point cloud representation of every atom of the biomolecule, combined with reinforcement learning15 or Bayesian optimization16 to guide the model to make sidechain modifications in 3D using energy. Additionally, incorporating hierarchical representations – such as group SELFIES – could provide a more robust encoding of the data's inherent structure, allowing the model to represent both amino acid building blocks and small molecules simultaneously.17
The goal of this work is to demonstrate the power of chemical language models and their ability to learn atom-level representations of biomolecules. We envision future language models will be able to explore any combinatorial space in chemistry or biology using any representation type the user wishes.18
The vocabulary consists of standard SELFIES tokens, encoding all the information in a molecular graph, including: atom tokens {[C], [N],…}, bond tokens {[=C], [#N],…}, ring tokens {[Ring1], [Ring2],…} and branching tokens {[Branch1_1], [Branch1_2],…}. In total, across all datasets, the vocabulary contains ∼30 tokens.
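A sketch of how such a vocabulary can be collected with the selfies package; the special padding and end-of-sequence tokens added here are illustrative assumptions rather than the exact tokens used in this work.

```python
import selfies as sf

def build_vocabulary(training_selfies):
    """Collect the set of SELFIES tokens appearing in the training strings."""
    alphabet = sorted(sf.get_alphabet_from_selfies(training_selfies))
    tokens = alphabet + ["[nop]", "[EOS]"]  # padding and end-of-sequence (illustrative)
    return {token: index for index, token in enumerate(tokens)}

def tokenize(selfies_string, vocab):
    """Map one SELFIES string to a list of integer token ids."""
    return [vocab[token] for token in sf.split_selfies(selfies_string)]
```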
Each molecule is represented as a sequence of chemical tokens such that MOL = ([CT]_1, [CT]_2, …, [CT]_n). The joint probability over a single molecule can be written as

p(MOL) = \prod_{i=1}^{n} p([CT]_i \mid [CT]_1, \ldots, [CT]_{i-1})   (1)
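A minimal PyTorch sketch of this autoregressive factorization; the architecture and hyperparameters below are generic placeholders and not the specific chemical language model used in this work.

```python
import torch
import torch.nn as nn

class TokenLM(nn.Module):
    """Generic token-level language model: each output position gives logits for
    p([CT]_{i+1} | [CT]_1, ..., [CT]_i)."""
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):            # tokens: (batch, n) integer ids
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)          # (batch, n, vocab_size)

def negative_log_likelihood(model, tokens):
    """Per-token average of -log p([CT]_i | [CT]_{<i}) from eqn (1)."""
    logits = model(tokens[:, :-1])
    targets = tokens[:, 1:]
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
```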
Footnotes
† Electronic supplementary information (ESI) available: Fig. S1: Examples of proteins generated by the model, visualized by AlphaFold and colored by pLDDT. Fig. S2: Histograms of atom-level properties for the basic proteins training data. These include exact molecular weight (MW), octanol–water partition coefficient (log P),11 molecular complexity (BCT), topological polar surface area (PSA), number of rings (RINGS), number of atoms (ATOMS), number of bonds (BONDS), number of fragments found by breaking up the molecule at rotatable bonds (FRAGS), number of carbons (C), number of nitrogens (N), number of oxygens (O), and number of any other atoms (OTHER). Fig. S3: Antibody–drug conjugates: example antibody–drug conjugates from the training data (above) and one example produced by the language model (below), plotted using ChemDraw as an atom-level graph. Fig. S4: Model antibody–drug conjugates: example antibody–drug conjugates produced by the language model, plotted using ChemDraw as an atom-level graph. Fig. S5: Model antibody–drug conjugates: example antibody–drug conjugates produced by the language model, plotted using ChemDraw as an atom-level graph. Fig. S6: Examples of single-domain antibodies from samples of sdAb–drug conjugates (excluding warheads) generated by the model, visualized by AlphaFold and colored by pLDDT. Fig. S7: Examples of detached warheads from training single-domain antibody–drug conjugates. Fig. S8: Examples of detached warheads from model-generated single-domain antibody–drug conjugates. Fig. S9: Histograms of atom-level properties for the antibody–drug conjugates training data. These include exact molecular weight (MW), octanol–water partition coefficient (log P),11 molecular complexity (BCT), topological polar surface area (PSA), number of rings (RINGS), number of atoms (ATOMS), number of bonds (BONDS), number of fragments found by breaking up the molecule at rotatable bonds (FRAGS), number of carbons (C), number of nitrogens (N), number of oxygens (O), and number of any other atoms (OTHER). See DOI: https://doi.org/10.1039/d5dd00107b
‡ These two authors contributed equally.
This journal is © The Royal Society of Chemistry 2025