Widespread False Negatives in DNA-Encoded Library Data: How Linker Effects Impair Machine Learning-Based Lead Prediction

Abstract

DNA-encoded chemical libraries (DECLs) have become integral to early-stage drug discovery, yielding active compounds and extensive labeled datasets for machine learning (ML)-based prediction of bioactive molecules. However, the information content of DECL selection data remains scarcely explored. This study systematically investigates for the first time the prevalence of false negatives and the influence of the linker in DECL data. Using a focused DECL targeting the poly-(ADP-ribose) polymerases PARP1/2 and TNKS1/2 as a model system, we found that our DECL selections frequently miss active compounds, with numerous false negatives for each identified hit. The presence of the DNA-conjugation linker emerged as a factor contributing to the underdetection of active molecules. This bias toward false negatives compromises the predictive power of DECL data for prioritizing hits, anticipating target selectivity, and training ML models, as determined by analyzing the effects of undersampling and oversampling techniques in learning the PARP2 data. Conversely, the linker’s presence in DECLs offers advantages, such as enabling the identification of target-selective protein engagers, even when the underlying molecules themselves may not be selective. These findings highlight the challenges and opportunities of DECL data, emphasizing the need for best practices in data handling and ML model development in drug discovery.

Supplementary files

Article information

Article type
Edge Article
Submitted
01 Feb 2025
Accepted
07 May 2025
First published
09 May 2025
This article is Open Access

All publication charges for this article have been paid for by the Royal Society of Chemistry
Creative Commons BY-NC license

Chem. Sci., 2025, Accepted Manuscript

Widespread False Negatives in DNA-Encoded Library Data: How Linker Effects Impair Machine Learning-Based Lead Prediction

A. L. Montoya Arias, A. Hogendorf, S. Tingey, A. Kuberan, L. H. Yuen, H. Schüler and R. Franzini, Chem. Sci., 2025, Accepted Manuscript , DOI: 10.1039/D5SC00844A

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence. You can use material from this article in other publications, without requesting further permission from the RSC, provided that the correct acknowledgement is given and it is not used for commercial purposes.

To request permission to reproduce material from this article in a commercial publication, please go to the Copyright Clearance Center request page.

If you are an author contributing to an RSC publication, you do not need to request permission provided correct acknowledgement is given.

If you are the author of this article, you do not need to request permission to reproduce figures and diagrams provided correct acknowledgement is given. If you want to reproduce the whole article in a third-party commercial publication (excluding your thesis/dissertation for which permission is not required) please go to the Copyright Clearance Center request page.

Read more about how to correctly acknowledge RSC content.

Social activity

Spotlight

Advertisements