A hitchhiker's guide to deep chemical language processing for bioactivity prediction

Rıza Özçelik; Francesca Grisoni

doi:10.1039/D4DD00311J

Rıza Özçelik^ab and Francesca Grisoni*^ab

Author affiliations

* Corresponding authors

^a Eindhoven University of Technology, Institute for Complex Molecular Systems, Eindhoven AI Systems Institute, Dept. Biomedical Engineering, Eindhoven, Netherlands
E-mail: f.grisoni@tue.nl

^b Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Netherlands

Abstract

Deep learning has significantly accelerated drug discovery, with ‘chemical language’ processing (CLP) emerging as a prominent approach. CLP approaches learn from molecular string representations (e.g., Simplified Molecular Input Line Entry System [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite their growing importance, training predictive CLP models is far from trivial, as it involves many ‘bells and whistles’. Here, we analyze the key elements of CLP and provide guidelines for newcomers and experts. Our study spans three neural network architectures, two string representations, and three embedding strategies across ten bioactivity datasets, for both classification and regression. This ‘hitchhiker's guide’ not only underscores the importance of certain methodological decisions but also equips researchers with practical recommendations on ideal choices, e.g., in terms of neural network architectures, molecular representations, and hyperparameter optimization.
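
To make the idea of treating molecular strings like text more concrete, the following is a minimal sketch of how a SMILES string might be split into tokens and mapped to integer indices before being fed to a neural network. It is an illustration only and not the tokenizer used in the article: the regular expression, the function names (`tokenize`, `build_vocab`, `encode`), the `<pad>` token, and the padding scheme are all assumptions, and the pattern covers only common SMILES syntax.

```python
import re

# Assumed, simplified SMILES token pattern (not taken from the article):
# bracket atoms, two-letter halogens, two-digit ring closures, single-letter
# atoms, ring-closure digits, then bond/branch/stereo symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]"
)


def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES: {smiles!r}")
    return tokens


def build_vocab(token_lists):
    """Map each token to an integer index; index 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to indices, truncated or right-padded to max_len."""
    ids = [vocab[token] for token in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    molecules = ["CC(=O)Oc1ccccc1C(=O)O", "ClCCBr", "C[C@H](N)C(=O)O"]
    token_lists = [tokenize(s) for s in molecules]
    vocab = build_vocab(token_lists)
    for smiles, tokens in zip(molecules, token_lists):
        print(smiles, encode(tokens, vocab, max_len=24))
```

The resulting integer sequences are the kind of input on which an embedding layer could operate; how tokens are embedded and which architecture consumes them are exactly the design choices the article compares.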

Article information

DOI: https://doi.org/10.1039/D4DD00311J
Article type: Communication
Submitted: 26 Sep 2024
Accepted: 13 Dec 2024
First published: 16 Dec 2024
This article is Open Access

Digital Discovery, 2025, 4, 316-325

R. Özçelik and F. Grisoni, Digital Discovery, 2025, 4, 316 DOI: 10.1039/D4DD00311J

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
