Comparative Machine Learning and Deep Learning Frameworks for Robust Carcinogenicity Prediction and Activity Cliffs Analysis
Abstract
Many industrial chemicals are known to be carcinogenic in humans. In this study, we developed predictive models for binary carcinogenicity data in rats, which are closely associated with human carcinogenicity. The study covers a range of feature-based and chemical language modeling approaches. After the training-test split and the selection of essential structural and physicochemical descriptors, we developed a simple Linear Discriminant Analysis model. We then computed similarity- and error-based descriptors, pooled them with the structural and physicochemical descriptors, and developed classification Read-Across Structure-Activity Relationship (c-RASAR) models using a range of machine learning algorithms, including an Artificial Neural Network (ANN). Additionally, the pooled feature matrix was used to compute two ARKA (Arithmetic Residuals in K-Groups Analysis) descriptors, and a simple logistic regression model was trained on the resulting two-descriptor feature matrix. We also adopted the Long Short-Term Memory (LSTM) architecture to develop a model based on SMILES strings. The results suggested that the Logistic Regression RASAR-ARKA model performed best, and it was subsequently used, together with the ANN c-RASAR model, to predict external data efficiently. Furthermore, the ARKA framework allowed us to identify Activity Cliffs and explain the reasons for mispredictions. In addition to providing an efficient prediction framework, the structure-function analysis suggests that the presence of nitrogen atoms, as in hydrazine derivatives and nitrosamines, and greater branching are responsible for carcinogenicity, while increased molecular size reduces carcinogenic potency.
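The idea behind the similarity-based descriptors mentioned above can be illustrated with a minimal sketch. It is not the descriptor set used in this study: the function names, the choice of Tanimoto similarity on fingerprint bit sets, the two summary features, and the toy data are all assumptions made for illustration. For a query compound, the sketch finds its closest training compounds and summarizes how similar they are and how many of them are carcinogenic.

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def read_across_features(
    query: Set[int],
    train: List[Tuple[Set[int], int]],
    k: int = 3,
) -> Tuple[float, float]:
    """Return two illustrative read-across features for a query compound:
    (mean similarity to its k nearest training compounds,
     similarity-weighted fraction of those neighbours labelled positive).
    These definitions are assumptions, not the study's descriptors.
    """
    # Rank training compounds by similarity to the query and keep the top k.
    neighbours = sorted(
        ((tanimoto(query, fp), label) for fp, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total_sim = sum(sim for sim, _ in neighbours)
    mean_sim = total_sim / len(neighbours) if neighbours else 0.0
    weighted_pos = (
        sum(sim * label for sim, label in neighbours) / total_sim
        if total_sim > 0
        else 0.0
    )
    return mean_sim, weighted_pos


# Hypothetical toy data: (fingerprint bit set, binary carcinogenicity label).
train = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 3, 5}, 1),
    ({6, 7, 8, 9}, 0),
    ({6, 7, 8, 10}, 0),
]
query = {1, 2, 3, 6}

mean_sim, weighted_pos = read_across_features(query, train, k=2)
print(mean_sim, weighted_pos)
```

Features of this kind would be computed for every compound and appended to the structural and physicochemical descriptors before a classifier is trained on the pooled matrix.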