Comparative machine learning and deep learning frameworks for robust carcinogenicity prediction and activity cliff analysis

Arkaprava Banerjee; Vinay Kumar; Kunal Roy

doi:10.1039/D5EM01001B

Arkaprava Banerjee,^a Vinay Kumar^a and Kunal Roy*^a

Author affiliations

* Corresponding authors

^a Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700 032, India
E-mail: kunal.roy@jadavpuruniversity.in, kunalroy_in@yahoo.com

Abstract

Many industrial chemicals are recognized as carcinogenic to humans. In this study, we developed predictive models for binary carcinogenicity data in rats that are closely associated with human carcinogenicity. This study involves a range of feature-based and chemical language modeling approaches. After the training-test split and selection of essential structural and physicochemical descriptors, we developed a simple linear discriminant analysis model. Thereafter, we computed similarity- and error-based descriptors, pooled them with structural and physicochemical descriptors, and developed classification read-across structure–activity relationship (c-RASAR) models using a range of machine learning algorithms, including an artificial neural network (ANN). Additionally, the pooled feature matrix was used to compute two ARKA (arithmetic residuals in K-groups analysis) descriptors, and a simple logistic regression model was trained on the two-descriptor feature matrix. Moreover, we adopted the long short-term memory (LSTM) architecture to develop a model based on SMILES strings. The results suggested that the logistic regression RASAR-ARKA model was the best-performing, and it was subsequently used to predict external data efficiently, along with the ANN c-RASAR model. Moreover, the ARKA framework allowed us to identify activity cliffs and explain the reason for mispredictions. In addition to providing an efficient prediction framework, the structure–function analysis suggests that the presence of nitrogen atoms, including in hydrazine derivatives and nitrosamines, and greater branching are responsible for carcinogenicity, while increased molecular size reduces the carcinogenic potency.
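The abstract's step of computing similarity-based descriptors and pooling them with the structural and physicochemical descriptors can be illustrated with a minimal sketch. This is not the authors' implementation: the similarity function (an inverse Euclidean distance), the three read-across features, the neighbour count `k`, and all function names and toy values are assumptions chosen for illustration only, and the error-based and ARKA descriptors are not covered.

```python
import math


def euclidean_similarity(a, b):
    """Map Euclidean distance to a similarity in (0, 1]. Assumed, simple choice."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def read_across_features(query, train_X, train_y, k=3):
    """Hypothetical similarity-based descriptors from the k closest training compounds."""
    neighbours = sorted(
        ((euclidean_similarity(query, x), y) for x, y in zip(train_X, train_y)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(s for s, _ in neighbours)
    return {
        # Similarity-weighted average of neighbour labels (a read-across estimate).
        "weighted_label": sum(s * y for s, y in neighbours) / total,
        # Highest similarity to a positive / negative neighbour (0.0 if none).
        "max_sim_pos": max((s for s, y in neighbours if y == 1), default=0.0),
        "max_sim_neg": max((s for s, y in neighbours if y == 0), default=0.0),
    }


def pooled_row(query, train_X, train_y, k=3):
    """Append the read-across features to the original descriptor vector."""
    feats = read_across_features(query, train_X, train_y, k)
    return list(query) + [
        feats["weighted_label"],
        feats["max_sim_pos"],
        feats["max_sim_neg"],
    ]


# Toy descriptor matrix (made-up values): 1 = carcinogen, 0 = non-carcinogen.
train_X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9]]
train_y = [0, 0, 1, 1]
query = [1.0, 0.9]

features = read_across_features(query, train_X, train_y, k=3)
row = pooled_row(query, train_X, train_y, k=3)
```

In a workflow of this kind, a row such as `row` would be computed for every training and test compound (with a training compound excluded from its own neighbour list) before a classifier is fitted on the pooled matrix.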

Article information

DOI: https://doi.org/10.1039/D5EM01001B
Article type: Paper
Submitted: 03 Dec 2025
Accepted: 08 Jan 2026
First published: 08 Jan 2026

Environ. Sci.: Processes Impacts, 2026, Advance Article

A. Banerjee, V. Kumar and K. Roy, Environ. Sci.: Processes Impacts, 2026, Advance Article , DOI: 10.1039/D5EM01001B

Environmental Science: Processes & Impacts
