Comparative Machine Learning and Deep Learning Frameworks for Robust Carcinogenicity Prediction and Activity Cliffs Analysis

Abstract

Many industrial chemicals are known to be carcinogenic in humans. In this study, we developed predictive models for binary rat carcinogenicity data, which are closely associated with human carcinogenicity. The study covers a range of feature-based and chemical language modeling approaches. After the training-test split and the selection of essential structural and physicochemical descriptors, we developed a simple Linear Discriminant Analysis model. Thereafter, we computed similarity- and error-based descriptors, pooled them with the structural and physicochemical descriptors, and developed classification Read-Across Structure-Activity Relationship (c-RASAR) models using a range of machine learning algorithms, including an Artificial Neural Network (ANN). Additionally, the pooled feature matrix was used to compute two ARKA (Arithmetic Residuals in K-Groups Analysis) descriptors, and a simple logistic regression model was trained on the resulting two-descriptor feature matrix. Moreover, we adopted the Long Short-Term Memory (LSTM) architecture to develop a model based on SMILES strings. The results suggested that the Logistic Regression RASAR-ARKA model performed best; it was subsequently used, along with the ANN c-RASAR model, to predict external data efficiently. The ARKA framework also allowed us to identify activity cliffs and explain the reasons for mispredictions. In addition to providing an efficient prediction framework, the structure-function analysis suggests that the presence of nitrogen atoms, as in hydrazine derivatives and nitrosamines, and greater branching are responsible for carcinogenicity, while increased molecular size reduces carcinogenic potency.
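To make the idea of similarity-based read-across descriptors more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Gaussian-kernel similarity in descriptor space and three illustrative features per query compound; the function name `read_across_features`, the parameters `k` and `sigma`, and the feature names are all hypothetical and are not taken from the article.

```python
import numpy as np

def read_across_features(X_train, y_train, x_query, k=3, sigma=1.0):
    """Illustrative similarity-based features for one query compound.

    Assumption: similarity is a Gaussian kernel on Euclidean distance;
    the article may define its similarity and descriptors differently.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Similarity of the query to every training compound (1.0 = identical).
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    sim = np.exp(-d2 / (2.0 * sigma ** 2))

    # Keep the k most similar training compounds (the "close source" set).
    nearest = np.argsort(-sim)[:k]
    s, y = sim[nearest], y_train[nearest]

    # Similarity-weighted average of the neighbours' binary labels.
    weighted_label = float(np.sum(s * y) / np.sum(s))
    # Highest similarity to a positive and to a negative neighbour.
    max_sim_pos = float(s[y == 1].max()) if np.any(y == 1) else 0.0
    max_sim_neg = float(s[y == 0].max()) if np.any(y == 0) else 0.0

    return {
        "weighted_label": weighted_label,
        "max_sim_pos": max_sim_pos,
        "max_sim_neg": max_sim_neg,
    }
```

Features of this kind would be computed for every compound and appended to the structural and physicochemical descriptors before fitting a classifier. The sketch does not guard against all similarities underflowing to zero, which can occur for very distant neighbours or a very small `sigma`.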

Article information

Article type: Paper
Submitted: 03 Dec 2025
Accepted: 08 Jan 2026
First published: 08 Jan 2026

Environ. Sci.: Processes Impacts, 2026, Accepted Manuscript

A. Banerjee, V. Kumar and K. Roy, Environ. Sci.: Processes Impacts, 2026, Accepted Manuscript, DOI: 10.1039/D5EM01001B
