Going beyond SMILES enumeration for data augmentation in generative drug discovery

Helena Brinkmann; Antoine Argante; Hugo ter Steege; Francesca Grisoni

doi:10.1039/D5DD00028A

Helena Brinkmann,^a Antoine Argante,^a Hugo ter Steege^a and Francesca Grisoni*^ab

Author affiliations

* Corresponding authors

^a Institute for Complex Molecular Systems (ICMS), Eindhoven AI Systems Institute (EAISI), Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
E-mail: f.grisoni@tue.nl

^b Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Utrecht, The Netherlands

Abstract

Data augmentation can alleviate the limitations of small molecular datasets for generative deep learning by ‘artificially inflating’ the number of instances available for training. SMILES enumeration, wherein multiple valid SMILES strings are used to represent the same molecule, has proven particularly beneficial for improving the quality of de novo molecule design. Herein, we investigate whether rethinking SMILES augmentation techniques could further enhance the quality of de novo design. To this end, we introduce four novel approaches for SMILES augmentation, drawing inspiration from natural language processing and chemical insights: (a) token deletion, (b) atom masking, (c) bioisosteric substitution, and (d) self-training. A systematic analysis shows the promise of considering additional strategies for SMILES augmentation. Every strategy showed distinct advantages; for example, atom masking is particularly promising for learning desirable physico-chemical properties in very low-data regimes, and token deletion for creating novel scaffolds. This new repertoire of SMILES augmentation strategies expands the available toolkit for designing molecules with bespoke properties in low-data scenarios.
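To make two of the strategies named above concrete, the following is a minimal Python sketch of token deletion and atom masking applied to a SMILES string. It is an illustration under stated assumptions, not the authors' implementation: the simplified regex tokenizer, the function names `tokenize`, `delete_tokens` and `mask_atoms`, the per-token probability `p`, and the use of `*` as the mask token are all hypothetical choices made for this example.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption for this sketch): bracket atoms,
# two-letter halogens, two-digit ring closures, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

# Tokens treated as atoms (organic subset, aromatic atoms, bracket atoms).
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def delete_tokens(smiles, p=0.05, rng=random):
    """Drop each token independently with probability p.

    The result is not guaranteed to be a valid SMILES string.
    """
    kept = [tok for tok in tokenize(smiles) if rng.random() >= p]
    return "".join(kept)


def mask_atoms(smiles, p=0.05, mask="*", rng=random):
    """Replace each atom token with a mask token with probability p.

    Ring-closure digits, bonds, and branches are left untouched.
    """
    out = []
    for tok in tokenize(smiles):
        if ATOM_PATTERN.fullmatch(tok) and rng.random() < p:
            out.append(mask)
        else:
            out.append(tok)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    smiles = "CC(=O)Nc1ccc(Br)cc1"
    print(tokenize(smiles))
    print(delete_tokens(smiles, p=0.2, rng=rng))
    print(mask_atoms(smiles, p=0.2, rng=rng))
```

Each call returns a perturbed copy of the input string, so an augmented training set could be built by calling either function several times per molecule. Because token deletion can break ring closures or branches, a practical pipeline would likely need to decide whether to filter the outputs for validity with a cheminformatics toolkit.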

This article is part of the themed collection: 2025 Digital Discovery Emerging Investigators

Article information

DOI: https://doi.org/10.1039/D5DD00028A
Article type: Paper
Submitted: 20 Jan 2025
Accepted: 05 Aug 2025
First published: 14 Aug 2025
This article is Open Access

Digital Discovery, 2025, Advance Article

H. Brinkmann, A. Argante, H. ter Steege and F. Grisoni, Digital Discovery, 2025, Advance Article , DOI: 10.1039/D5DD00028A

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
