MolEncoder: towards optimal masked language modeling for molecules

Abstract

Predicting molecular properties is a key challenge in drug discovery. Machine learning models, especially those based on transformer architectures, are increasingly used to make these predictions from chemical structures. Inspired by recent progress in natural language processing, many studies have adopted encoder-only transformer architectures similar to BERT (Bidirectional Encoder Representations from Transformers) for this task. These models are pretrained using masked language modeling, in which parts of the input are hidden and the model learns to recover them, and are then fine-tuned on downstream tasks. In this work, we systematically investigate whether core assumptions from natural language processing, which are commonly adopted in molecular BERT-based models, actually hold when applied to molecules represented using the Simplified Molecular Input Line Entry System (SMILES). Specifically, we examine how masking ratio, pretraining dataset size, and model size affect performance in molecular property prediction. We find that masking ratios higher than those commonly used significantly improve performance. In contrast, increasing model size or pretraining dataset size quickly leads to diminishing returns, offering no consistent benefit while incurring significantly higher computational cost. Based on these insights, we develop MolEncoder, a BERT-based model that outperforms existing approaches on drug discovery tasks while being more computationally efficient. Our results highlight key differences between molecular pretraining and natural language processing, showing that the two require different design choices. This enables more efficient model development and lowers barriers for researchers with limited computational resources. We release MolEncoder publicly to support future work and hope our findings help make molecular representation learning more accessible and cost-effective in drug discovery.
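To make the notion of a masking ratio concrete, the following is a minimal Python sketch of how tokens of a SMILES string could be hidden for masked language modeling. It is an illustration only, not the MolEncoder implementation: the regular-expression tokenizer, the `[MASK]` placeholder, the helper names `tokenize` and `mask_tokens`, and the choice to mask an exact fraction of positions (rather than sampling each position independently, as BERT-style pipelines often do) are all assumptions made for this example.

```python
import random
import re

# Hypothetical, simplified SMILES tokenizer (an assumption for this sketch):
# bracket atoms, the two-letter halogens Br and Cl, two-digit ring closures,
# and otherwise one character per token.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

MASK = "[MASK]"  # placeholder token, assumed for illustration


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def mask_tokens(tokens, masking_ratio, rng):
    """Hide a fraction of the tokens.

    Returns (masked, labels): masked is the corrupted input, and labels[i]
    holds the original token at masked positions and None elsewhere, so a
    model would only be scored on the positions it has to recover.
    """
    # Mask at least one token for non-empty input, never more than all of them.
    n_mask = min(len(tokens), max(1, round(masking_ratio * len(tokens))))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    # Two illustrative ratios; neither is taken from the paper.
    for ratio in (0.15, 0.30):
        masked, _ = mask_tokens(tokens, ratio, rng)
        print(ratio, "".join(masked))
```

Raising `masking_ratio` hides more of each molecule per training example, which makes the recovery task harder; the abstract reports that ratios above those commonly used improved downstream performance.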

Article information

Article type: Paper
Submitted: 18 Aug 2025
Accepted: 25 Oct 2025
First published: 28 Oct 2025
This article is Open Access under a Creative Commons BY license.

Digital Discovery, 2025, Advance Article

F. P. Krüger, N. Österbacka, M. Kabeshov, O. Engkvist and I. Tetko, Digital Discovery, 2025, Advance Article, DOI: 10.1039/D5DD00369E

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
