MolEncoder: towards optimal masked language modeling for molecules

Fabian P. Krüger; Nicklas Österbacka; Mikhail Kabeshov; Ola Engkvist; Igor Tetko

doi:10.1039/D5DD00369E

Fabian P. Krüger,*^abc Nicklas Österbacka,^a Mikhail Kabeshov,^a Ola Engkvist^ad and Igor Tetko^c

Author affiliations

* Corresponding authors

^a AstraZeneca R&D, Discovery Sciences, Molecular AI, 431 83 Mölndal, Sweden

^b TUM School of Computation, Information and Technology, Department of Mathematics, Technical University of Munich, 80333 Munich, Germany

^c Helmholtz Munich – German Research Center for Environmental Health (GmbH), Institute of Structural Biology, Molecular Targets and Therapeutics Center, 85764 Neuherberg, Germany

^d Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Sweden

Abstract

Predicting molecular properties is a key challenge in drug discovery. Machine learning models, especially those based on transformer architectures, are increasingly used to make these predictions from chemical structures. Inspired by recent progress in natural language processing, many studies have adopted encoder-only transformer architectures similar to BERT (Bidirectional Encoder Representations from Transformers) for this task. These models are pretrained using masked language modeling, in which parts of the input are hidden and the model learns to recover them, and are then fine-tuned on downstream tasks. In this work, we systematically investigate whether core assumptions from natural language processing, which are commonly adopted in molecular BERT-based models, actually hold when applied to molecules represented using the Simplified Molecular Input Line Entry System (SMILES). Specifically, we examine how masking ratio, pretraining dataset size, and model size affect performance in molecular property prediction. We find that higher masking ratios than commonly used significantly improve performance. In contrast, increasing model or pretraining dataset size quickly leads to diminishing returns, offering no consistent benefit while incurring significantly higher computational cost. Based on these insights, we develop MolEncoder, a BERT-based model that outperforms existing approaches on drug discovery tasks while being more computationally efficient. Our results highlight key differences between molecular pretraining and natural language processing, showing that they require different design choices. This enables more efficient model development and lowers barriers for researchers with limited computational resources. We release MolEncoder publicly to support future work and hope our findings help make molecular representation learning more accessible and cost-effective in drug discovery.
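To make the role of the masking ratio concrete, the following is a minimal Python sketch of masking a tokenized SMILES string for masked language modeling. It is an illustration under stated assumptions, not the MolEncoder implementation: the regex tokenizer, the `[MASK]` symbol, the helper names `tokenize` and `mask_tokens`, and the ratio of 0.3 are all hypothetical choices made here. It also simplifies the original BERT recipe by always substituting the mask symbol, rather than occasionally keeping or randomizing the selected tokens.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption for illustration, not the paper's tokenizer).
# Bracket atoms, two-letter halogens, and two-digit ring closures stay single tokens;
# every other character becomes its own token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

MASK = "[MASK]"  # hypothetical mask symbol


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, masking_ratio, rng):
    """Hide a fraction of the tokens.

    Returns (masked_tokens, labels), where labels maps each masked
    position to the original token the model would have to recover.
    """
    # At least one token is masked for non-empty input, never more than all tokens.
    n_mask = min(len(tokens), max(1, round(masking_ratio * len(tokens))))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for i in positions:
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


rng = random.Random(0)  # fixed seed so the example is reproducible
tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked, labels = mask_tokens(tokens, masking_ratio=0.3, rng=rng)
print("".join(masked))
print(labels)
```

Raising `masking_ratio` hides more of each molecule per training example, which is the knob the abstract refers to when it reports that higher masking ratios than commonly used improve performance.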

This article is part of the themed collection: AI in Drug Discovery at ICANN2025

Digital Discovery
