Going beyond SMILES enumeration for data augmentation in generative drug discovery

Abstract

Data augmentation can alleviate the limitations of small molecular datasets for generative deep learning by ‘artificially inflating’ the number of instances available for training. SMILES enumeration – wherein multiple valid SMILES strings are used to represent the same molecule – has proven particularly beneficial for improving the quality of de novo molecule design. Herein, we investigated whether rethinking SMILES augmentation techniques could further enhance the quality of de novo design. To this end, we introduce four novel approaches for SMILES augmentation, drawing inspiration from natural language processing and chemistry insights: (a) token deletion, (b) atom masking, (c) bioisosteric substitution, and (d) self-training. Our systematic analysis showed the promise of considering additional strategies for SMILES augmentation. Every strategy showed distinct advantages; for example, atom masking is particularly promising for learning desirable physico-chemical properties in very low-data regimes, and token deletion for creating novel scaffolds. This new repertoire of SMILES augmentation strategies expands the available toolkit for designing molecules with bespoke properties in low-data scenarios.
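
To make two of the four strategies concrete, the following is a minimal sketch of token deletion and atom masking operating on a tokenized SMILES string. It is not the authors' implementation: the tokenization regex, the function names, the per-token probability `p`, and the use of `*` as the mask token are all assumptions made for illustration, and no chemical validity check is applied to the augmented strings. SMILES enumeration, bioisosteric substitution, and self-training are not covered because they need a cheminformatics toolkit or a trained model.

```python
import random
import re

# Assumed tokenization scheme (not taken from the paper): bracket atoms,
# two-letter halogens, organic-subset atoms, ring closures, bonds, branches.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.])"
)
# Tokens that count as atoms for the purpose of masking.
ATOM_RE = re.compile(r"^(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops])$")


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is left over."""
    tokens = SMILES_TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


def delete_tokens(tokens, p, rng):
    """Token deletion: drop each token independently with probability p.

    The result may not be a valid SMILES string; no validity check is done here.
    """
    return [t for t in tokens if rng.random() >= p]


def mask_atoms(tokens, p, rng, mask="*"):
    """Atom masking: replace each atom token with `mask` with probability p.

    The '*' mask token is a hypothetical choice for this sketch.
    """
    return [mask if ATOM_RE.match(t) and rng.random() < p else t for t in tokens]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same strings
    tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    print("".join(delete_tokens(tokens, 0.15, rng)))
    print("".join(mask_atoms(tokens, 0.15, rng)))
```

Each call yields one augmented copy of the input; calling the functions repeatedly with the same random generator produces several different variants of one training molecule.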

Article information

Article type: Paper
Submitted: 20 Jan 2025
Accepted: 05 Aug 2025
First published: 14 Aug 2025
This article is Open Access under a Creative Commons BY license.

Digital Discovery, 2025, Advance Article

H. Brinkmann, A. Argante, H. ter Steege and F. Grisoni, Digital Discovery, 2025, Advance Article, DOI: 10.1039/D5DD00028A

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
