ARKA: a framework of dimensionality reduction for machine-learning classification modeling, risk assessment, and data gap-filling of sparse environmental toxicity data

Arkaprava Banerjee and Kunal Roy *
Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700 032, India. E-mail: kunal.roy@jadavpuruniversity.in

Received 29th March 2024, Accepted 5th May 2024

First published on 6th May 2024


Abstract

Due to the lack of experimental toxicity data for environmental chemicals, there arises a need to fill data gaps by in silico approaches. One of the most commonly used in silico approaches for toxicity assessment of small datasets is the Quantitative Structure–Activity Relationship (QSAR), which generates predictive models for the efficient prediction of query compounds. However, the reliability of the predictions from QSARs derived from small datasets is often questionable from a statistical point of view. This is due to the presence of a larger number of descriptors as compared to the number of training compounds, which reduces the degree of freedom of the developed model. To reduce the overall prediction error for a particular QSAR model, we have proposed here the computation of the novel Arithmetic Residuals in K-groups Analysis (ARKA) descriptors. We have reduced the number of modeling descriptors in a supervised manner by partitioning them into K classes (K = 2 here) depending on the higher mean normalized values of the descriptors to a particular response class, thus preventing the loss of chemical information. A scatter plot of the data points using the values of two ARKA descriptors (ARKA_2 vs. ARKA_1) can potentially identify activity cliffs, less confident data points, and less modelable data points. We have used here five representative environmentally relevant endpoints (skin sensitization, earthworm toxicity, milk/plasma partitioning, algal toxicity, and rodent carcinogenicity of hazardous chemicals) with graded responses to which the ARKA framework was applied for classification modeling. 
On comparing the performance of the models generated using conventional QSAR descriptors and the ARKA descriptors, the prediction quality of the ARKA-derived models was found, based on decision criteria derived from multiple graded-data validation metrics, to be much better than that of the models derived from QSAR descriptors, signifying the potential of ARKA descriptors in ecotoxicological classification modeling of small data sets. This also holds true for the Read-Across approach, since the Read-Across predictions using ARKA descriptors outperform the predictions generated from QSAR descriptors. For the ease of users, a Java-based expert system has been developed that computes the ARKA descriptors from the input of QSAR descriptors.



Environmental significance

Because experimental data on chemical hazards are very limited for hundreds of endpoints, chemical regulatory authorities accept model-derived data for data-gap filling. However, the sparse data available for several endpoints are insufficient to develop statistically meaningful models, forcing modelers to use a limited number of chemical features for final model development, which compromises the applicability domain and the wide usability of the models for predictions. The problem of small data set classification modeling of ecotoxicity endpoints is addressed here by introducing the concept of Arithmetic Residuals in K-groups Analysis (ARKA), a novel method of supervised dimensionality reduction that judges data set modelability and demonstrates enhanced external prediction quality compared to the corresponding quantitative structure–activity relationship (QSAR) models.

1 Introduction

A vast array of organic molecules found in the environment can potentially induce disruption in aquatic and terrestrial ecosystems.1 This is brought about by different structural, physicochemical, and electronic properties of such molecules that enable them to exert toxic effects on different species of flora and fauna. A growing area of environmental research is to assess the ecotoxicological risk of these harmful chemicals using non-animal alternative approaches. A real threat to the entire biodiversity is the lack of experimental data on the toxicity profile of most of the chemicals existing in the environment, which has resulted in a large data gap. Entry of such substances inside the living system, either directly or by the process of biomagnification, can cause a variety of adverse effects including the disruption of the endocrine system. Therefore, the identification of these unknown “hazardous materials” is of prime importance for their safe disposal, which will reduce the disruption in the ecosystem. Although extensive research studies are going on for the experimental toxicity assessment of these substances, they are often time-consuming and economically unviable. The limited availability of experimental ecotoxicological data warrants the need to shift toward computational methods for the quick, easy, and accurate predictions of the endpoints concerned.2,3 This is in line with regulatory bodies like the Organisation for Economic Co-operation and Development (OECD)4 which encourage the use of in silico approaches, thus reducing time, animal suffering, expenses, and manpower associated with animal experimentation.5 Since in silico approaches generate quick and accurate results, they can efficiently be used for data gap-filling.6 This approach is also acceptable to regulations like the European Union Registration, Evaluation, Authorisation and Restriction of Chemicals (EU REACH)7 which accepts data generated from non-animal approaches. 
Since in silico approaches have worldwide acceptability, they can be considered useful tools to fill ecotoxicity data gaps, thus enabling the identification of toxic substances and their corresponding toxicophores.

1.1 In silico modeling methods for ecotoxicity endpoints

Among the in silico prediction approaches, the Quantitative Structure–Activity Relationship (QSAR)8 has been one of the go-to methods for computational model development and predictions. In its basic form, QSAR generates a simple mathematical model that correlates various structural and physicochemical features with the endpoint of interest.8 With the advancements in this field, researchers have identified that the features may not necessarily be linearly correlated to the response values, thus introducing the concept of a non-linear relationship. To accommodate such correlations, various Machine Learning (ML) modeling algorithms have been adopted that can effectively incorporate non-linear relationships.9,10 These models are developed using a variety of algorithms that efficiently reflect the structure–activity relationship. However, the only drawback that can be associated with the application of different ML modeling algorithms is the lack of interpretability of the features since most of the ML models have a “black box”, although the recent innovations have focused on the explainability of the ML models by introducing concepts like SHAP analysis11 and Swiss knife.12 The limitations associated with the QSAR approach are exemplified quite often in the case of small datasets. Due to the limited number of compounds, the models should ideally be developed using a lower number of descriptors to comply with the statistical requirements, but this reduces the reliability of predictions since it compromises encoding the proper chemical information. There are two ways to deal with it: the first is to adopt non-statistical approaches like Read-Across (RA),13,14 and the second is to adopt dimensionality reduction techniques to reduce the size of the descriptor matrix. While the former approach primarily does not reflect the quantitative contribution of the descriptors, the latter approach adheres to the QSAR methodology. 
In the recent past, several studies applying various in silico methodologies to the prediction of ecotoxicity endpoints have been reported. Hung and Gini used deep learning-based QSAR modeling to predict the mutagenicity of diverse chemicals.15 Chatterjee et al. performed quantitative Read-Across using small datasets of nanoparticles to predict their toxicity.16 Banerjee and Roy integrated the concept of Read-Across into a statistical modeling framework and developed quantitative Read-Across Structure–Activity Relationship (q-RASAR) models to predict the androgen receptor binding affinity of environmental chemicals.17 Srisongkram combined the predictions of RA and ML-based QSAR to develop a stacked model for the prediction of skin cytotoxicity.18 A simple computational workflow for the prediction of environmentally relevant endpoints may be important for regulatory purposes from the viewpoint of transparency and easy transferability of the models, as already reported in several previous studies.19,20

1.2 The curse of dimensionality and the motivation of the current study

Small dataset modeling using the QSAR approach has been a very challenging job since a QSAR modeling data set needs to possess sufficient data points to train a model adequately. To address this problem, different techniques like synthetic sample generation,21 double cross-validation,22 consensus predictions,23 etc., have been used in the literature. The deficiency of sufficient data points often compels the QSAR modeler to include a higher number of features (descriptors) to establish a linear relationship between the data points. In such cases, the statistical aspect is compromised, as the ultimate aim of a modeler is to develop highly predictive models using a lower number of descriptors. Moreover, the application of a higher number of descriptors coupled with ML algorithms generally tends to generate overfitted models that may not perform well on an external set of data. On the other hand, using a lower number of descriptors may not yield robust and effective models since there is a loss of chemical information associated with the reduction in the number of descriptors. This calls for the development of new techniques that use a lower number of descriptors (i.e. a lower degree of freedom) while retaining the chemical information. This represents a form of dimensionality reduction technique that reduces the size of the descriptor matrix, yet retains the chemical information. While dimensionality reduction techniques like Principal Component Analysis (PCA)24 and Partial Least Squares (PLS)25 are already in use, we have presented here a simple form of dimensionality reduction technique, the ARKA descriptors, that effectively encode the chemical information of various descriptors in a particular form of computationally derived descriptors using an (A)rithmetic (R)esiduals in K-groups (A)nalysis approach. 
While developing models using a higher number of descriptors covers a wider chemical space as compared to models developed using a lower number of descriptors, such derived descriptors can encode the complete chemical information into a limited number of descriptors, thus not compromising the applicability domain of the developed models. Dimensionality reduction methods like PCA, when applied to a QSAR problem, conventionally use the descriptor matrix and do not derive any information from the experimental data of the training set compounds. On the other hand, the assignment of the descriptors into different classes and the weighing scheme for each descriptor in the computation of the ARKA descriptors uses the experimental response values of the training set compounds. In this sense, ARKA is a simple supervised dimensionality reduction framework which should be important considering that there is a lack of supervised dimensionality reduction algorithms.26 ARKA descriptors can potentially indicate activity cliffs while other dimensionality reduction techniques like PCA use only the descriptor matrix and at most can detect outliers and not the activity cliffs. We apply the suggested ARKA descriptors to classification modeling of five small data sets of environmental context which were previously analyzed using linear discriminant analysis employing conventional QSAR descriptors. We aim to examine the ability of the proposed framework in data set modelability analysis and the impact of using the novel descriptors on the external predictivity of the models. Additionally, we also apply different machine-learning-based classification modeling approaches for comparison purposes.

2 Materials and methods

2.1 Collection of environmental toxicity datasets

To evaluate the performance of the proposed novel descriptors on external data sets, we have judiciously selected five sample environmental toxicity datasets, each containing a limited number of data points, for which classification-based QSAR models had previously been reported. The purpose of selecting such data is as follows:

(1) These data sets contain a limited number of data points. We are aiming here to address the problems of classification modeling of smaller data sets starting with a relatively large pool of descriptors.

(2) The availability of the already-reported QSAR modeling descriptors has helped the proper comparison of the conventional QSAR models with the models developed using ARKA descriptors.

Dataset 1 represents the graded skin sensitization data of 405 diverse organic chemicals as reported by Banerjee and Roy.27 Dataset 2 consists of 163 graded data points for chemical toxicity to earthworms as reported by Roy et al.28 Dataset 3 represents 185 graded data points for milk/plasma concentration ratios of drugs and environmental pollutants as reported by Kar and Roy.29 Dataset 4 consists of 105 graded data points on chemical toxicity towards Pseudokirchneriella subcapitata as reported by Pramanik and Roy.30 Dataset 5 reports the graded form of 70 rodent carcinogenicity potency data from the work of Kar et al.31 The original sets of QSAR descriptors used in the modeling analysis of the five data sets are listed in Table S1 in ESI SI-3.

2.2 The algorithm for the computation of the ARKA descriptors

Paola Gramatica in one of her works stated that QSAR modeling is not “Push a Button and Find a Correlation”.32 This is the driving force for researchers of the modern era to develop newer approaches that improve the predictive ability of the models. Observing the pictorial architecture of an Artificial Neural Network published in Roy et al., 2015,8 we thought a concept could be developed by clustering the descriptors, assigning a suitable weight to each, and storing them as a composite value in a single descriptor specific for each cluster, giving rise to a concept of dimensionality reduction. Since one of the key motives for this work is to stress the simplicity of the computational approach, thus allowing the broader scientific community to easily adopt the suggested strategy, we have used the same division of the training and test sets as reported by the previous authors for the computation of ARKA descriptors, making the comparison of their performance with the conventional descriptors an easy task. Please note that the objective of the current work is not to develop the best model for each endpoint, but to establish the usefulness of the proposed method of dimensionality reduction in the case of small data set modeling. The basic idea behind the computation of the ARKA descriptors is to group the conventional QSAR descriptors based on a predefined criterion and then assign weightage to each descriptor in each group. We decided to explore the predictive performance of ARKA descriptors initially in a classification QSAR modeling framework and thus selected the data sets having graded response data. Although it is possible to partition the features into K groups, in the present work we have restricted the value of K to 2 (corresponding to positive and negative classes).

Since feature selection is an integral step that is performed on the training set compounds, it is implied that the authors of the source datasets have selected the features based on the training set compounds only. Therefore, from a statistical point of view, the calculations of the ARKA descriptors should ideally be based on the training set. The first step is to normalize the training set descriptors such that the range of values for each descriptor column is from 0 to 1. This was followed by the grouping of the active and inactive class data points. The mean values of a particular descriptor in both the active and inactive classes were computed, and their difference (positive class descriptor mean – negative class descriptor mean) and absolute difference were calculated. This is the methodology of the most discriminating feature selection technique or the molecular spectrum analysis.33,34 It is to be noted that in this work, we have not performed additional feature selection based on the absolute mean difference values as we have already used the selected features from the previous references, and we are only considering the difference and absolute difference in mean values. It must be clear that this operation should be done using the normalized (scaled between 0 and 1) training set descriptor values and not using the standardized training set, since the basic idea behind normalization is to bring the values of each descriptor into the same range, which is essential for the computation and comparison of the mean differences.
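The normalization and mean-difference steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' Java tool: the function names and the toy data are hypothetical, and mapping a constant descriptor column to zeros is our own choice for the degenerate case.

```python
from statistics import mean


def min_max_normalize(columns):
    """Scale each training-set descriptor column to the range [0, 1]."""
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # Assumption: a constant column is mapped to zeros to avoid division by zero.
        scaled.append([(v - lo) / span if span else 0.0 for v in col])
    return scaled


def class_mean_differences(columns, labels):
    """Return signed and absolute differences (positive-class mean minus negative-class mean)."""
    signed = []
    for col in columns:
        pos = [v for v, y in zip(col, labels) if y == 1]
        neg = [v for v, y in zip(col, labels) if y == 0]
        signed.append(mean(pos) - mean(neg))
    return signed, [abs(d) for d in signed]


# Hypothetical toy training set: three descriptors (one list per column), six compounds.
X_train = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [10.0, 8.0, 9.0, 2.0, 1.0, 0.0],
    [0.0, 0.2, 0.1, 0.4, 0.3, 1.0],
]
y_train = [0, 0, 0, 1, 1, 1]  # 1 = positive (active) class, 0 = negative (inactive) class

X_norm = min_max_normalize(X_train)
signed_diff, abs_diff = class_mean_differences(X_norm, y_train)
print(signed_diff)
```

In this toy case the first and third descriptors have higher normalized means in the positive class and the second has a higher mean in the negative class, which is the information later used for class assignment.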

After the computation of the mean difference and absolute difference values of the selected features, we have assigned the descriptors to two different classes. Class 1 consists of descriptors having positive difference values while Class 2 consists of descriptors containing negative difference values. It is to be noted that defining the number of classes depends on the modeler, but in this work, we have adhered to simplicity and uniformity and defined two clusters for all the analyses on different datasets. Once the class membership has been defined, it is now essential to assign weightage to each descriptor of a particular class. A simple weighting strategy was adopted that defines the weightage of a particular descriptor of a class, which has been represented in eqn (1).

 
image file: d4em00173g-t1.tif(1)

Once all the descriptors in the two different clusters have been assigned the corresponding weightage, the computation of the Arithmetic Residuals in K-Groups Analysis (ARKA) descriptors can be easily done. The selected QSAR descriptors of the training and test sets were standardized using the Java-based tool Scale1.0 available from the DTC Lab Supplementary Website.35 In each of the standardized training and test data sets, the descriptor ARKA_1 encodes the information for the descriptors in Class 1 (i.e. descriptors having positive difference values) and ARKA_2 encodes the information for the descriptors in Class 2 (i.e. descriptors having negative difference values). Both ARKA_1 and ARKA_2 were calculated as the weighted sum of the descriptors in their respective classes. Considering a total of 5 contributing descriptors for a particular response, suppose descriptors x1, x2, and x3 have positive difference values and are members of Class 1, while descriptors x4 and x5 have negative difference values and are members of Class 2; the corresponding mathematical expressions for the computation of the ARKA descriptors (Fig. 1) have been represented in eqn (2) and (3).

 
ARKA_1 = w1 × x1 + w2 × x2 + w3 × x3 (2)
 
ARKA_2 = w4 × x4 + w5 × x5 (3)
In eqn (2) and (3), the terms x1,…, x5 represent the descriptor values while w1, …, w5 represent the corresponding weightage values. On generalizing the formulae, the computation of ARKA descriptors has been represented in eqn (4) where “n” represents the number of descriptors in a particular class.
 
$$\mathrm{ARKA} = \sum_{i=1}^{n} w_i \times x_i \quad (4)$$


Fig. 1 Pictorial representation of the scheme for the computation of the ARKA descriptors.

As per the mathematical consideration, each of the two ARKA descriptors is the weighted sum of the descriptors acting positively or negatively to the response values (the positive and negative contributions are identified in a model-independent manner by observing the data structure in the training set).
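The weighted sums of eqn (2) and (3) can be illustrated with a short Python sketch. It assumes that the weightage values from eqn (1) are already available and passes them in as a plain list; the function name and all numbers below are hypothetical and only mirror the five-descriptor example in the text.

```python
def arka_descriptors(x_std, signed_diff, weights):
    """Compute (ARKA_1, ARKA_2) for one compound.

    x_std       : standardized descriptor values of the compound
    signed_diff : training-set mean differences (positive class minus negative class)
    weights     : per-descriptor weightage values from eqn (1), supplied by the caller
    """
    # Class 1: descriptors with positive mean difference.
    arka_1 = sum(w * x for x, d, w in zip(x_std, signed_diff, weights) if d > 0)
    # Class 2: descriptors with negative mean difference.
    arka_2 = sum(w * x for x, d, w in zip(x_std, signed_diff, weights) if d < 0)
    return arka_1, arka_2


# Illustrative values only: x1..x3 belong to Class 1, x4 and x5 to Class 2.
x_std = [1.2, -0.4, 0.3, 0.8, -1.5]
signed_diff = [0.30, 0.10, 0.20, -0.25, -0.15]
weights = [0.5, 0.2, 0.3, 0.6, 0.4]

arka_1, arka_2 = arka_descriptors(x_std, signed_diff, weights)
print(arka_1, arka_2)
```

The class membership and the weights are fixed from the training set, so the same function can be applied unchanged to test set compounds.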

On application of the above-mentioned concept, the computation of the ARKA descriptors for the training and test sets was performed. The workflow for the computation of ARKA descriptors has been presented in Fig. 2. It is to be noted that according to the algorithm, the number of classes denotes the number of ARKA descriptors that can be computed. In the present work, since an initial concept is presented, we have limited the number of classes and ARKA descriptors to 2. However, in the future, the modeler may want to increase the number of ARKA descriptors by adopting different clustering techniques like Hierarchical Clustering Analysis, and therefore, this framework is quite customizable.


Fig. 2 Workflow for the computation of ARKA descriptors and model development.

2.3 Model development and validation

Initially, simple Linear Discriminant Analysis (LDA) models were developed separately using conventional QSAR descriptors and the ARKA descriptors with the Python-based Scikit-learn library36 in the Jupyter Notebook platform.37 Twenty times fivefold cross-validation was performed to check the robustness of the developed LDA models. Since these are classification-based models, common model quality metrics like R2 and error measures like MAE and RMSE cannot be used to evaluate the models' performance, as these metrics deal with quantitative continuous response data. Standard classification-based validation metrics like F1_score, Matthews Correlation Coefficient (MCC), Cohen's kappa (Ckappa), and Area Under the Receiver Operating Characteristic Curve (AUC) have been used to evaluate the performance of the developed models since these metrics reflect the overall model performance for both positive and negative classes and can handle the class imbalance in the data set while evaluating the model performance. Additional Machine Learning (ML) models like Logistic Regression (LR),38 Support Vector Machine classifier (SVM),39 and Random Forest classifier (RF)40 were also then attempted using selected features and corresponding ARKA descriptors. While LDA is a statistical model and requires that the input variables be normally distributed, the same does not apply to the other models presented. In each case, the hyperparameters were optimized separately using a GridSearchCV approach adhering to a fivefold cross-validation strategy. The performance of these ML models was evaluated by the classification-based validation metrics stated above. The comparison of various models using standard QSAR descriptors and ARKA descriptors was based on the F1_score, MCC, Ckappa, and AUC of the test set data to compare the predictive performances. These metrics most effectively reflect the prediction performance of the developed models for both the active and inactive classes. 
The MCC is utilized as a measure of the quality of binary classifications. It considers true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes have different sizes. Cohen's kappa coefficient is more informative than Accuracy when working with imbalanced data.41 However, we have additionally reported the prediction accuracy of the ARKA models (for both the training and test sets) in the ESI (vide infra). Among these different validation metrics, AUC can be deemed to be the most important metric as it reflects the complete classification scenario since it compares the true positive rate with the false positive rate.42 Thus, an analysis of variance (ANOVA)43 of the enhancement of AUC values in the ARKA models compared with corresponding QSAR models for five data sets has also been done.
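To make the four validation metrics concrete, the following sketch computes them from first principles for a hypothetical set of test predictions. The study itself used Scikit-learn; this standard-library version is only an illustration, and the data, the function names, and the 0.5 decision threshold are assumptions.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    p_obs = (tp + tn) / n  # observed agreement (accuracy)
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp) if p_exp != 1 else 0.0


def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test set: 4 positives, 6 negatives, with predicted class scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

counts = confusion_counts(y_true, y_pred)
print(f1_score(*counts), mcc(*counts), cohen_kappa(*counts), roc_auc(y_true, scores))
```

Unlike the other three metrics, the AUC is computed from the scores rather than from thresholded labels, which is why it summarizes performance over all possible thresholds.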

2.4 Analysis of the applicability domain (AD) of the datasets

With every model being developed, there comes the necessity to evaluate the chemical space that the model encodes. This chemical space can be termed the applicability domain (AD), and it is believed that compounds lying outside this chemical space can generate unreliable predictions. As stated previously, the general concept is that the greater the number of descriptors, the larger the chemical space the model encodes. However, since the main focus of this work is to reduce the number of descriptors yet generate better predictive models, it is essential to analyze the AD status of the models that were developed. This calls for the computation of the Leverage values44 for the ease of identification of the structural outliers, separately using the QSAR descriptor matrix and the ARKA descriptor matrix for the training and test sets, and then identifying the number of structural outliers.
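A leverage calculation of this kind can be sketched as follows. The leverage of a compound with descriptor vector x is taken as h = x (XᵀX)⁻¹ xᵀ, where X is the training descriptor matrix. The warning threshold h* = 3(p + 1)/n is a commonly used convention and is an assumption here, as is the use of a descriptor matrix without an intercept column; all names and numbers are hypothetical.

```python
def gram_matrix(X):
    """Return X^T X for a row-wise descriptor matrix X."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def leverages(X_train, X_query):
    """Leverage h = x (X^T X)^-1 x^T of each query compound w.r.t. the training set."""
    G = gram_matrix(X_train)
    return [sum(xi * zi for xi, zi in zip(x, solve(G, x))) for x in X_query]


# Hypothetical standardized two-descriptor matrix (e.g. ARKA_1, ARKA_2) for 8 compounds.
X_train = [[1, 0], [-1, 0], [0, 1], [0, -1], [1, 1], [-1, -1], [1, -1], [-1, 1]]
X_test = [[0.5, -0.5], [3.0, 3.0]]

n, p = len(X_train), len(X_train[0])
h_star = 3 * (p + 1) / n  # assumed warning threshold
h_test = leverages(X_train, X_test)
outside_ad = [h > h_star for h in h_test]
print(h_test, outside_ad)
```

Running the same function on the QSAR descriptor matrix and on the ARKA descriptor matrix allows the number of flagged compounds to be compared between the two descriptor sets.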

2.5 Development of the ARKA descriptor-calculating software

To make the computations more user-friendly, we have developed a Java-based ARKA descriptor calculating software: ARKA_descriptors-v1.0, which has been made available from the DTC Lab tools supplementary website (ARKA tab).35 This tool takes input from the training and test set files and calculates the corresponding ARKA descriptors for the training and test sets.

The detailed modeling analysis is represented in Fig. 3.


Fig. 3 Detailed workflow for the modeling analysis.

2.6 Application of ARKA descriptors to chemical Read-Across analysis

Read-Across is a similarity-based, nonstatistical data gap-filling approach, which uses the endpoint information for one or more chemical(s) to predict the same endpoint for another similar chemical (based on structural features or mechanisms of action).45 The enhanced usage of Read-Across is promoted by regulatory frameworks to minimize new animal testing.46,47 We have applied the ARKA framework in deriving chemical Read-Across predictions of the endpoints considered and compared the quality of these predictions with that of the Read-Across predictions obtained using the conventional chemical descriptors. We have applied the Gaussian kernel-based similarity in the quantitative Read-Across algorithm of Chatterjee et al.16 As no models are generated for the training sets in the case of Read-Across, the prediction quality can be evaluated only from the external set validation metrics.
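The idea of a Gaussian kernel-based Read-Across prediction can be illustrated with the simplified sketch below. It is not the algorithm of Chatterjee et al.: the kernel width `sigma`, the number of close neighbours `k`, the similarity-weighted vote, and the 0.5 cut-off are all assumptions made for illustration, and the data are hypothetical.

```python
from math import exp


def gaussian_similarity(a, b, sigma=1.0):
    """Gaussian kernel similarity based on the squared Euclidean distance."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return exp(-d2 / (2 * sigma ** 2))


def read_across_predict(query, X_train, y_train, sigma=1.0, k=3):
    """Similarity-weighted vote of the k most similar training compounds.

    Returns (score, label), where score is the weighted fraction of positive
    neighbours and label is 1 if score >= 0.5 (assumed cut-off), else 0.
    """
    sims = [(gaussian_similarity(query, x, sigma), y) for x, y in zip(X_train, y_train)]
    nearest = sorted(sims, key=lambda t: t[0], reverse=True)[:k]
    total = sum(s for s, _ in nearest)
    score = sum(s * y for s, y in nearest) / total if total else 0.5
    return score, int(score >= 0.5)


# Hypothetical training compounds described by two descriptors (e.g. ARKA_1, ARKA_2).
X_train = [[0.9, -0.2], [1.1, 0.1], [0.8, 0.0], [-1.0, 0.9], [-0.8, 1.1], [-1.2, 1.0]]
y_train = [1, 1, 1, 0, 0, 0]

score_a, label_a = read_across_predict([1.0, 0.0], X_train, y_train)
score_b, label_b = read_across_predict([-1.0, 1.0], X_train, y_train)
print(score_a, label_a, score_b, label_b)
```

Because no model is fitted, the only quantities derived from the training set are the similarities themselves, which is why the evaluation rests on external predictions alone.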

3 Results and discussion

We have used five representative ecotoxicity datasets for our experimental work to demonstrate the predictive ability of the models developed using ARKA descriptors. We selected these data sets considering that they are small to moderate-sized and that QSAR models had already been developed for them. This makes our comparison task easier. We have used the same set of features (QSAR descriptors), the same division of the data sets, and the same modeling strategy as used in the original analyses for comparison purposes.

3.1 Calculation of the ARKA descriptors

The five different ecotoxicity datasets used for the computation and analysis of the ARKA descriptors have been provided in Excel sheets of ESI SI-1. A representative example of the calculation of ARKA descriptors on Dataset 2 has been provided in ESI SI-2. Note that each sheet of SI-2 contains specific calculations, and the final ARKA descriptors have been computed on the penultimate and the last sheets of the workbook.

3.2 Results of the linear discriminant analysis (LDA) models

It may be noted that the objective of the present study has not been to report new predictive models for various ecotoxicological endpoints but to demonstrate the usefulness of the ARKA approach over the conventional classification QSAR modeling approach in the case of small modeling sets for external predictions of ecotoxicological data. We would like to emphasize that the performance of the ARKA descriptors will depend on the initial set of features selected for QSAR models. With a different set of selected QSAR features, the quality of ARKA models will vary accordingly. Thus, the performance of ARKA models should always be compared with the QSAR models developed with the corresponding conventional descriptors from which ARKA descriptors have been computed. The predictive performance of the models developed using conventional QSAR descriptors and the ARKA descriptors was initially checked using a linear modeling framework (Linear Discriminant Analysis or LDA). For the LDA models, we have provided the coefficient values of individual descriptors (of the QSAR models and ARKA models) in Table S2 in ESI SI-3. As evident from Fig. 4, in most of the cases, the LDA models generated using ARKA descriptors showed better predictive performance (test sets) in terms of the validation metric values than the LDA models generated using a much higher number of conventional QSAR descriptors.
Fig. 4 Results of the external prediction quality (heatmap of quality metrics) of different models developed using conventional QSAR descriptors and ARKA descriptors (Ndesc = the number of descriptors).

3.3 Results of various Machine Learning (ML) models

Having compared the performance of the conventional QSAR descriptors with the ARKA descriptors based on a simple linear modeling framework, we wanted to explore the results of ARKA descriptors with other machine-learning classification schemes. For this, we have adopted the Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF) classifiers to generate ML models using the conventional QSAR descriptors and the ARKA descriptors, separately. The success of the models developed using the ARKA descriptors is evident from the enhanced prediction quality (for test set compounds) as compared to the corresponding conventional QSAR models, which is represented in Fig. 4. Please note that while validating the models (QSAR or ARKA), the predictions for the test set compounds have been compared to the experimental observations while the models were developed solely from the training set compounds. Apart from the enhancement of the prediction quality metrics (like F1_score, MCC, Cohen's kappa, and AUC-ROC) in the majority of the cases, the use of ARKA descriptors was further supported by the reduced number of structural outliers as compared to the conventional QSAR descriptors, which suggests that the chemical space has been preserved. From Fig. 4, it is clear that in most cases, the predictive performance (test sets) of a particular ML model is better when the ARKA descriptors have been used than when the conventional QSAR descriptors have been used. The optimization of the hyperparameters separately for each ML model was performed using a Grid Search Cross-Validation (GridSearchCV) approach adhering to a 5-fold cross-validation strategy. The optimized hyperparameter settings used to develop the ML models are given in Table S3 in ESI SI-3. The 20 times 5-fold cross-validation data of all the different models developed on all five datasets have been tabulated in Table S4 of the ESI SI-3. 
From the results, it can be observed that in the majority of the cases, the difference between Accuracy and cross-validated Accuracy (AccuracyCV) is lower in the models developed using ARKA descriptors, suggesting that the models developed using ARKA descriptors are more robust than the corresponding models developed using conventional QSAR descriptors. This is a common observation when a modeler applies dimensionality reduction approaches to reduce the number of modeling variables, and it supports the concept of the ARKA framework. We have additionally reported the prediction accuracy of the ARKA models (for both the training and test sets) in the ESI SI-3 (Table S5). Although in some cases the statistical quality of the models (training sets) is somewhat inferior (but still statistically acceptable) for ARKA models compared to QSAR models, we are interested here in exploring the enhancement of the predictive performance of the models on the test sets using ARKA descriptors, and hence only test set statistics are reported here (Fig. 4).

3.4 Evaluation of the predictive performance of the models developed using conventional QSAR descriptors and ARKA descriptors

A comprehensive evaluation of the predictive performance of the different models using the two classes of descriptors was performed by a voting approach (multicriteria decision-making). In this approach, one modeling algorithm was taken at a time, and its predictive performance using the QSAR descriptors and the ARKA descriptors was evaluated using four essential external validation metrics, namely F1_score, MCC, Ckappa, and AUC. These metrics were chosen because they give a complete picture of the classifiability of the models by considering the correct and incorrect predictions of the actives/positives and the inactives/negatives in an unbiased manner. Moreover, these metrics, especially AUC, can handle imbalanced data and do not provide results that are biased toward a particular class (actives or inactives). AUC is also well suited for comparing different models because it is derived from multiple threshold values.48 The model with the higher value of a particular metric was assigned a value of 1 for that metric, while the model with the lower value was assigned 0. If the values of a particular metric were equal for both models, a value of 0.5 was assigned to each. This was done for all the external validation metrics stated above, using all the modeling algorithms (LDA, LR, SVM, and RF), for all five datasets, as shown in Fig. 5. For each validation metric, the winner votes were summed separately for the QSAR and ARKA descriptors, and the descriptor class with the higher count was considered the winner for that metric. An overall evaluation was then done to get a comprehensive idea of the performance of the QSAR descriptors and the ARKA descriptors. In this case (represented as “composite” in Fig. 5), the count of voted winners for a particular validation metric in a particular modeling algorithm was taken into consideration, and the final voting is based on the sum of the voted winners among the QSAR and ARKA descriptors. As for the individual datasets, a sum was computed to count the voted winners. From the final analysis of the composite set, it was found that the models derived from the ARKA descriptors were clear winners in terms of AUC, MCC, and Ckappa, i.e., these models showed enhanced performance with the ARKA descriptors, while the F1_score showed equal performance for both the QSAR and ARKA descriptors. The complete picture of this analysis is presented as a heat map in Fig. 5.
Fig. 5 Heat map demonstrating the voting results and how the models developed using ARKA descriptors showed enhanced predictive performance. 1 indicates a winner model for a particular metric, 0 indicates a loser model for a particular metric and 0.5 indicates a tie.
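The voting scheme described above (1 for the winner, 0 for the loser, 0.5 each for a tie) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function names and the metric values in the example are hypothetical.

```python
METRICS = ("F1_score", "MCC", "Ckappa", "AUC")

def vote(qsar_value, arka_value, tol=1e-12):
    """Return (QSAR vote, ARKA vote) for one metric: 1/0 for a win, 0.5 each for a tie."""
    if abs(qsar_value - arka_value) <= tol:
        return 0.5, 0.5
    return (1.0, 0.0) if qsar_value > arka_value else (0.0, 1.0)

def tally(results):
    """Sum the votes per metric over all (dataset, algorithm) comparisons."""
    totals = {m: {"QSAR": 0.0, "ARKA": 0.0} for m in METRICS}
    for comparison in results:
        for m in METRICS:
            q, a = vote(comparison["QSAR"][m], comparison["ARKA"][m])
            totals[m]["QSAR"] += q
            totals[m]["ARKA"] += a
    return totals

def winners(totals):
    """Name the descriptor class with more votes for each metric, or 'tie'."""
    out = {}
    for metric, t in totals.items():
        out[metric] = "tie" if t["QSAR"] == t["ARKA"] else max(t, key=t.get)
    return out

# Hypothetical metric values for two (dataset, algorithm) comparisons
results = [
    {"QSAR": {"F1_score": 0.70, "MCC": 0.40, "Ckappa": 0.39, "AUC": 0.75},
     "ARKA": {"F1_score": 0.72, "MCC": 0.45, "Ckappa": 0.39, "AUC": 0.74}},
    {"QSAR": {"F1_score": 0.80, "MCC": 0.50, "Ckappa": 0.48, "AUC": 0.82},
     "ARKA": {"F1_score": 0.80, "MCC": 0.55, "Ckappa": 0.52, "AUC": 0.85}},
]

totals = tally(results)
print(totals)
print(winners(totals))
```

Summing the votes over all algorithms and datasets in this way corresponds to the "composite" evaluation, assuming each comparison contributes equally.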

We additionally compared the enhancement of model quality across data sets and modeling algorithms, taking AUC-ROC as the objective function, and performed an analysis of variance (ANOVA) of the change in AUC-ROC values with data set and modeling algorithm as the factors. The results show that the use of ARKA descriptors enhanced the prediction quality in most of the cases (Table 1). The ANOVA of the enhancement of AUC values in ARKA models over the corresponding QSAR models for the five data sets showed that the variation in the enhancement values was attributable neither to the structure of the data sets nor to the Machine Learning algorithms employed. The enhancement in quality is therefore attributed to the ARKA descriptors, signifying their importance.

Table 1 The enhancement of AUC-ROC (test set) due to the use of ARKA descriptors compared to QSAR descriptors
Data set LDA SVM RF LR
1 0.02 0.03 −0.04 0.02
2 0.01 −0.08 0.03 0
3 0.02 0.02 −0.04 0.01
4 0.04 0.04 0.02 0.01
5 −0.03 0.02 0.05 −0.05
ANOVA results: row (data set) effect insignificant, F1(df 4,12) = 0.14 (p = 0.965); column (modeling method) effect insignificant, F2(df 3,12) = 0.45 (p = 0.722).
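For readers unfamiliar with this type of analysis, the following minimal Python sketch shows a two-way ANOVA without replication, with data sets as rows and modeling methods as columns. It assumes that this is the design behind the reported degrees of freedom (4 and 12 for rows, 3 and 12 for columns); the function name is hypothetical and this is not the authors' code. Because the tabulated enhancements are rounded to two decimals, the F statistics from this sketch need not reproduce the reported values.

```python
def two_way_anova(table):
    """Two-way ANOVA without replication for a rows x columns table of values."""
    n_rows, n_cols = len(table), len(table[0])
    grand_mean = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]

    # Partition the total sum of squares into row, column, and residual parts
    ss_total = sum((x - grand_mean) ** 2 for row in table for x in row)
    ss_rows = n_cols * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n_rows * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    df_rows, df_cols = n_rows - 1, n_cols - 1
    df_error = df_rows * df_cols
    ms_error = ss_error / df_error
    return {
        "df": (df_rows, df_cols, df_error),
        "F_rows": (ss_rows / df_rows) / ms_error,
        "F_cols": (ss_cols / df_cols) / ms_error,
    }

# Rounded AUC-ROC enhancements from Table 1 (columns: LDA, SVM, RF, LR)
enhancement = [
    [0.02, 0.03, -0.04, 0.02],
    [0.01, -0.08, 0.03, 0.00],
    [0.02, 0.02, -0.04, 0.01],
    [0.04, 0.04, 0.02, 0.01],
    [-0.03, 0.02, 0.05, -0.05],
]

anova = two_way_anova(enhancement)
print(anova)
```

An F statistic well below the critical value for the corresponding degrees of freedom indicates that the factor does not explain the variation in the enhancement values.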


3.5 Analysis of the applicability domain (AD) of the models developed from QSAR descriptors and ARKA descriptors

For the proper evaluation of a model, it is imperative to consider its applicability domain, i.e., the chemical space that the model encodes. We computed the leverages of the QSAR and ARKA descriptor matrices for all five datasets. The number of outliers obtained with the ARKA descriptors was lower in all five test sets and in four of the five training sets (Table S6 in ESI SI-3). Although the ARKA descriptors carry the same chemical information as the QSAR descriptors from which they are derived, they make better use of this information by partitioning the descriptors into K classes.
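A leverage-based check of the applicability domain can be sketched as follows. The leverage of a compound with descriptor vector x is taken as h = xᵀ(XᵀX)⁻¹x, where X is the training descriptor matrix. The warning threshold h* = 3(p + 1)/n, with p descriptors and n training compounds, is a commonly used convention and is an assumption here, since the cut-off applied in this study is not stated in this section. The function names and example values are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; leverages are undefined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def leverages(x_train, x_query):
    """Leverage h = x^T (X^T X)^-1 x of each query compound w.r.t. the training matrix."""
    p = len(x_train[0])
    xtx = [[sum(row[i] * row[j] for row in x_train) for j in range(p)] for i in range(p)]
    inv = invert(xtx)
    return [sum(x[i] * inv[i][j] * x[j] for i in range(p) for j in range(p))
            for x in x_query]

def warning_leverage(n_train, n_desc):
    """Assumed threshold h* = 3(p + 1)/n; compounds above it are flagged as outliers."""
    return 3 * (n_desc + 1) / n_train

# Hypothetical two-descriptor training matrix (e.g. ARKA_1, ARKA_2) and one query compound
x_train = [[1, 0], [0, 1], [1, 1], [-1, 0], [0, -1], [-1, -1]]
x_query = [[3, -3]]

h_star = warning_leverage(len(x_train), len(x_train[0]))
h_train = leverages(x_train, x_train)
h_query = leverages(x_train, x_query)
outliers = [h > h_star for h in h_query]
print(h_star, h_train, h_query, outliers)
```

With only one or two ARKA descriptors, the descriptor matrix is much narrower than the QSAR descriptor matrix, so the same calculation is applied to both and the number of flagged compounds is compared.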

3.6 Establishing a generalized relationship between ARKA_1 and ARKA_2 with the observed activity values – the identification of potential activity cliffs and less confident data points

We generated violin plots of ARKA_1 vs. activity and ARKA_2 vs. activity for both the training and test sets of all five datasets (the results for Dataset 1 are given in Fig. 6 and those for the other data sets in Fig. S1 of ESI SI-3). As a general observation from Fig. 6 and S1, the median values of the ARKA_1 descriptor are higher in the active class than in the inactive class for both the training and test sets. Similarly, the median values of the ARKA_2 descriptor are higher in the inactive class than in the active class for both the training and test sets, which is the expected outcome. Also, from a representative example of Dataset 2 (Fig. S2 in ESI SI-3), it is observed that for most of the active compounds, the values of ARKA_1 are positive and those of ARKA_2 are negative. Similarly, for most of the inactive compounds, the values of ARKA_1 are negative and those of ARKA_2 are positive. Similar observations can be made for the other data sets (for example, the results for Dataset 1 are given in Fig. S3 and S4 of ESI SI-3).
Fig. 6 Violin plots of ARKA_1 vs. activity and ARKA_2 vs. activity for the training and test sets (Dataset 1).

3.7 Dataset modelability from ARKA descriptors

A scatter plot of ARKA_2 vs. ARKA_1 descriptors for the training/test set compounds may indicate the modelability/model performance of the data set. Using knowledge from the data distribution of the violin plots, it can be inferred that the active compounds would most likely be present in the fourth quadrant of the scatter plots (where ARKA_1 is positive and ARKA_2 is negative), and the inactive compounds would most likely be present in the second quadrant (where ARKA_1 is negative and ARKA_2 is positive). Generally, based on the analysis of the studied data sets, for a confident classification of a data point into the positive and negative classes, it is found that the absolute values of both ARKA_1 and ARKA_2 descriptors should be more than 0.5 and their absolute difference is expected to be more than 0.75. Any point very near to the X or Y axis will have low decidability for classification purposes. As per Fig. 7, the 4th and 2nd quadrants signify confident positives and negatives, respectively. If a negative (or inactive) compound is found in the 4th quadrant or a positive (active) compound in the 2nd quadrant, these may be potential activity cliffs reducing the modelability of the data set (in the case of the training set) or potential prediction cliffs indicative of poor prediction performance (in the case of the test set). Again, the compounds falling in the 1st and 3rd quadrants represent less confident data points for classifiability. Any misclassified compound falling in these quadrants as shown in Fig. 7 may be less confident activity/prediction cliffs.
Fig. 7 Generalized interpretation to identify activity cliffs, borderline compounds, less modelable data points, and less confident data points based on their positions in the ARKA_2 vs. ARKA_1 plot. Note that this interpretation is a “model independent” process, where simply the values of ARKA_1 and ARKA_2 are efficient enough to identify the nature of the data points.
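The quadrant-based interpretation can be expressed as a simple rule set. The Python sketch below is one possible reading of the criteria stated above (absolute values of both descriptors above 0.5 and an absolute difference above 0.75 for a confident assignment); the function names, the category labels, and the way the thresholds are combined with the quadrants are assumptions made for illustration.

```python
def arka_region(arka_1, arka_2, min_abs=0.5, min_diff=0.75):
    """Return (quadrant, confident) for a point in the ARKA_2 vs. ARKA_1 plot.

    Quadrant 0 means the point lies on an axis. The default thresholds follow
    the values quoted in the text.
    """
    confident = (abs(arka_1) > min_abs and abs(arka_2) > min_abs
                 and abs(arka_1 - arka_2) > min_diff)
    if arka_1 > 0 and arka_2 < 0:
        quadrant = 4
    elif arka_1 < 0 and arka_2 > 0:
        quadrant = 2
    elif arka_1 > 0 and arka_2 > 0:
        quadrant = 1
    elif arka_1 < 0 and arka_2 < 0:
        quadrant = 3
    else:
        quadrant = 0
    return quadrant, confident

def interpret(arka_1, arka_2, is_active):
    """Label a compound from its position and its observed class (assumed labels)."""
    quadrant, confident = arka_region(arka_1, arka_2)
    if quadrant not in (2, 4):
        return "less confident"      # 1st or 3rd quadrant, or on an axis
    if not confident:
        return "borderline"          # right quadrants, but too close to an axis
    expected_active = quadrant == 4  # 4th quadrant: positives, 2nd: negatives
    return "confident" if is_active == expected_active else "potential activity cliff"

# Hypothetical (ARKA_1, ARKA_2, observed class) triples
examples = [(1.2, -0.9, True), (1.2, -0.9, False), (-1.0, 0.8, False),
            (0.7, 0.9, True), (0.2, -0.9, True)]
labels = [interpret(a1, a2, active) for a1, a2, active in examples]
print(labels)
```

Because the rule uses only the two descriptor values and the observed class, it can be applied before any model is trained, which is what makes the interpretation model independent.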

We illustrate this (Fig. 8) with Dataset 1, where the scatter plot of ARKA_2 vs. ARKA_1 shows five positive activity cliffs (compounds 260, 278, 320, 362, and 370) and one negative activity cliff (compound 93) in the training set. On comparison with the definition of activity cliffs based on the Banerjee–Roy similarity coefficients sm1 and sm2,27 it was found that all six compounds identified as activity cliffs in the training set from the plot in Fig. 8 are also activity cliffs by that definition, i.e., the two approaches (ARKA descriptors and Banerjee–Roy similarity coefficients) agree. In the case of the test set, one active compound (compound 349) and two negative compounds (compounds 74 and 94) were identified as potential prediction cliffs (data points with a potential for poor prediction). Interestingly, compound 74 was also identified as a data point with poor prediction potential based on the Banerjee–Roy similarity coefficients sm1 and sm2.27 It is also observed that sm1 and sm2 identify a greater number of activity cliffs than ARKA_1 and ARKA_2. This is because the ARKA criteria for determining activity cliffs are more conservative than those of the Banerjee–Roy coefficients: the sm1 and sm2 method additionally flags compounds that fall in the 1st or 3rd quadrants or near the axes, where the ARKA plot assigns less confidence to activity cliff behavior. The modelability analysis of Datasets 2 and 3 can be found in the ESI SI-3.


Fig. 8 Representative scatter plots of ARKA_2 vs. ARKA_1 for the training and test sets of Dataset 1.

3.8 Analysis of the conventional QSAR descriptors encoded in ARKA_1 and ARKA_2

This section of the manuscript will detail the various conventional QSAR descriptors that are encoded within the ARKA descriptors in the present study.

Case study 1: MDF analysis of descriptors contributing to skin sensitization potential of diverse organic chemicals.

The dataset used for this study (Dataset 1) reports the skin sensitization data of harmful organic chemicals based on a local lymph node assay (LLNA) of murine species. The basic chemical theory behind skin sensitization is that the skin proteins act as nucleophiles and the sensitizers act as electrophiles.49 As evident from our clustering analysis, two descriptors contributed to ARKA_1 while 12 descriptors contributed to ARKA_2. The descriptors gmin and minsssCH represent the minimum atom E-state value in a molecule and the minimum E-state value of a tertiary carbon atom, respectively. These descriptors have a near-equal contribution to ARKA_1, as evident from the weightage values (Fig. 9). As these descriptors refer to the electronic environment of a molecule, they contribute significantly to the electrophilic properties of the molecules, making them active skin sensitizers. On the other hand, the descriptors B06[C–N], B04[N–O], and B10[C–O] depict the presence or absence of the atom pairs C⋯N, N⋯O, and C⋯O at the topological distances of 6, 4, and 10, respectively, and they have higher contributions than most of the other descriptors contributing to ARKA_2. These 2D atom pair descriptors indicate the presence of electron-rich heteroatoms, which reduce the electrophilic properties of the compounds (Fig. 9).


Fig. 9 Analysis of the descriptors potentiating and inhibiting skin-sensitizing properties of a molecule.

Case study 2: MDF analysis of descriptors contributing to the chemical toxicity to earthworms.

The dataset used for this study (Dataset 2) reports the chemical toxicity of diverse organic chemicals to earthworms. As evident from our clustering analysis, the descriptors B03[O–O] (presence or absence of O⋯O at the topological distance 3), ETA_Psi_1 (hydrogen bonding propensity and/or polar surface area), B02[O–S] (presence or absence of O⋯S at the topological distance 2), and X3A (lower degree of branching) contribute to ARKA_1. Similarly, the descriptors B09[C–C] (presence or absence of C⋯C at the topological distance 9), MLOGP2 (squared Moriguchi o/w partition coefficient), ETA_Beta (depicting electron richness), and MLOGP (Moriguchi o/w partition coefficient) contribute to ARKA_2. Among the descriptors contributing to ARKA_1, B03[O–O] (higher electronegativity) has a significantly higher contribution than the others (Fig. 10), ETA_Psi_1 and B02[O–S] have similar contributions, and X3A has the least contribution. Similarly, among the descriptors contributing to ARKA_2, B09[C–C] (molecular size and hydrophobicity) contributes the most, MLOGP2 and ETA_Beta have near-equal contributions, and MLOGP has the least contribution.


Fig. 10 QSAR descriptors that are encoded in ARKA_1 and ARKA_2 for all five different datasets. Note that the model for dataset 4 is derived from only one ARKA descriptor (ARKA_1).

Case study 3: MDF analysis of descriptors contributing to the milk/plasma concentration ratios of drugs and environmental pollutants.

The dataset used for this study (Dataset 3) reports the data for milk/plasma concentration ratios of drugs and environmental pollutants. The cluster analysis suggests that the descriptors ETA_EtaP_B_RC (indicating branching), nCrs (number of sp3 hybridized secondary carbons present in a ring), and S_tsC (electronic environment of an acetylenic carbon atom) contribute to the ARKA_1 descriptor (positive class). Similarly, the descriptors nAB (depicting the number of aromatic bonds present in the compound), nRCONHR (depicting the number of secondary aliphatic amides present in the compound), and Jurs-DPSA-1 (depicting the difference in the partial positive solvent-accessible surface area and the partial negative solvent-accessible surface area) contribute to the ARKA_2 descriptor (negative class). As evident from the contributions of the descriptors constituting ARKA_1, nCrs has the highest contribution while ETA_EtaP_B_RC has the lowest contribution (Fig. 10). Similarly, in the case of ARKA_2, the descriptors nAB and nRCONHR have similar and highest contributions to ARKA_2.

Case study 4: MDF analysis of descriptors contributing to the toxicity towards P. subcapitata.

The dataset used for this study (Dataset 4) reports the toxicity of organic chemicals towards P. subcapitata. Like in the previous cases, the features were clustered based on the difference values. Since all the features had a positive difference value, this was the only set that used only one ARKA descriptor (ARKA_1) to generate models. Among the different features, MW (denoting the molecular weight of the compound) had the highest contribution towards ARKA_1, which was followed by the contributions of Atype_C_24 (representing fragments containing secondary carbon atoms), 2χv (denoting size and shape) and S_aaaC (representing fused ring system) (Fig. 10).
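The partitioning step referred to here can be illustrated with a short Python sketch. It assumes that the difference value of a descriptor is its mean normalized value in the active class minus that in the inactive class, and that descriptors with a positive difference are assigned to ARKA_1 and the rest to ARKA_2. The normalization and the weighting used to compute the ARKA values themselves are not reproduced. The function name, descriptor names, and data are hypothetical.

```python
from statistics import mean

def partition_descriptors(x_norm, y, names):
    """Group descriptors by the sign of (mean in actives - mean in inactives).

    x_norm: rows of already normalized descriptor values (one row per compound)
    y:      class labels, 1 = active and 0 = inactive
    Assumption: a difference of exactly zero is placed in ARKA_2.
    """
    actives = [row for row, label in zip(x_norm, y) if label == 1]
    inactives = [row for row, label in zip(x_norm, y) if label == 0]
    groups = {"ARKA_1": [], "ARKA_2": []}
    for j, name in enumerate(names):
        diff = mean(r[j] for r in actives) - mean(r[j] for r in inactives)
        groups["ARKA_1" if diff > 0 else "ARKA_2"].append((name, diff))
    return groups

y = [1, 1, 0, 0]

# Hypothetical case with descriptors on both sides
mixed = partition_descriptors(
    [[0.9, 0.2], [0.8, 0.1], [0.1, 0.7], [0.2, 0.9]], y, ["d1", "d2"])

# Hypothetical case where every difference is positive, so that
# only ARKA_1 receives descriptors (the situation described for Dataset 4)
one_sided = partition_descriptors(
    [[0.9, 0.8], [0.7, 0.9], [0.1, 0.2], [0.3, 0.1]], y, ["d1", "d2"])

print(mixed)
print(one_sided)
```

When one of the groups is empty, only a single ARKA descriptor can be computed, which is consistent with the use of ARKA_1 alone for this dataset.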

Case study 5: MDF analysis of descriptors contributing to rodent carcinogenicity potential.

The dataset used for this study (Dataset 5) reports the rodent carcinogenic potency. On assigning clusters to the descriptors, it was observed that the descriptor MAXDP (a measure reflecting the electrophilicity of a molecule) contributes to ARKA_1. Similarly, the descriptors Wap (denoting the Wiener index, i.e., the edge count through the shortest path between all pairs of non-hydrogen atoms), nRNNOx (representing the number of N-nitroso groups that are aliphatic), and Cl-086 (depicting the presence of Cl atoms attached to an sp3 hybridized carbon atom) contribute to ARKA_2. As evident from the weightage values, among the three descriptors contributing to ARKA_2, the descriptor nRNNOx is observed to have the highest contribution in the computation of the ARKA_2 descriptor (Fig. 10).

The details of the descriptors contributing to ARKA_1 and ARKA_2, for all the five different datasets, have been represented in Fig. 10.

3.9 Application of the ARKA framework to chemical Read-Across predictions

We have applied the ARKA descriptors for chemical-similarity-based Read-Across classification analysis for all the considered endpoints and compared the results with the predictions obtained from conventional QSAR descriptors (Table 2). The application of the Gaussian kernel-based similarity16 showed that the ARKA framework outperforms the conventional QSAR descriptors in the external prediction quality for most of the data sets. This warrants further studies on the application of the ARKA framework in similarity-based cheminformatics studies.
Table 2 Effects of ARKA descriptors on the chemical Read-Across-based external predictions using the Gaussian kernel function for five data sets (Ndesc = the number of descriptors, MCC = Matthews correlation coefficient, Ckappa = Cohen's kappa)a
Dataset Descriptors Ndesc F1_score MCC Ckappa AUC
1 QSAR 14 0.729 0.21 0.209 0.66
1 ARKA 2 0.699 0.235 0.227 0.66
2 QSAR 8 0.6 0.42 0.412 0.78
2 ARKA 2 0.645 0.472 0.467 0.79
3 QSAR 6 0.361 −0.079 −0.079 0.43
3 ARKA 2 0.375 −0.144 −0.143 0.49
4 QSAR 4 0.9 0.753 0.723 0.95
4 ARKA 1 0.923 0.812 0.795 1
5 QSAR 4 0.917 0.713 0.673 0.96
5 ARKA 2 0.917 0.713 0.673 0.95
a The winner metric values are shown in bold.
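To indicate how a Gaussian kernel can be used for similarity-based classification Read-Across, the following Python sketch predicts the class of a query compound from a similarity-weighted vote of its closest training compounds. It is a generic illustration and not the tool cited above; the Euclidean distance, the kernel width sigma, the number of neighbors k, the 0.5 decision threshold, and all names and data are assumptions.

```python
import math

def gaussian_similarity(a, b, sigma=1.0):
    """Gaussian kernel similarity exp(-d^2 / (2 sigma^2)) with Euclidean distance d."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))

def read_across_predict(query, train_x, train_y, sigma=1.0, k=3, threshold=0.5):
    """Similarity-weighted vote of the k most similar training compounds.

    Returns (predicted class, weighted score); the class is None when all
    similarities underflow to zero and no vote is possible.
    """
    ranked = sorted(
        ((gaussian_similarity(query, x, sigma), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in ranked)
    if total == 0:
        return None, 0.0
    score = sum(sim * label for sim, label in ranked) / total
    return (1 if score >= threshold else 0), score

# Hypothetical training compounds described by (ARKA_1, ARKA_2); 1 = active
train_x = [[1.0, -1.0], [0.8, -0.9], [-1.0, 1.0], [-0.9, 0.7]]
train_y = [1, 1, 0, 0]

pred_active, score_active = read_across_predict([0.9, -0.8], train_x, train_y)
pred_inactive, score_inactive = read_across_predict([-1.0, 0.9], train_x, train_y)
print(pred_active, round(score_active, 3), pred_inactive, round(score_inactive, 3))
```

With one or two ARKA descriptors, the distance calculation involves far fewer dimensions than with the full QSAR descriptor set, which is one plausible reason why the similarity ranking may change between the two descriptor classes.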


3.10 Limitations and future prospects

With every new method, the associated limitations point to avenues for future work. In this method of dimensionality reduction, we have stressed the modeling of small datasets, mainly in the environmental context, where toxicity and ecotoxicity data are limited, amplifying the need for data gap-filling. Primarily, the two ARKA descriptors (ARKA_1 and ARKA_2) should be employed for modeling small datasets that do not have a very large number of modeling features. When larger datasets with a considerably higher number of QSAR features are involved, the computation of only two classes of ARKA descriptors becomes somewhat of an oversimplification, since there is a high chance that the information of the larger pool of descriptors may not be efficiently encoded in just two classes of ARKA descriptors. In general, this calls for a greater number of ARKA descriptors, obtained by dividing the original QSAR descriptor pool into K groups (“K-groups analysis”) instead of just two. With a somewhat greater number of ARKA descriptors, fewer features are encoded into a single ARKA descriptor, which reduces noise and redundancy. However, this may differ from case to case depending on the complexity of a data set. For example, among the five datasets reported here, Dataset 1 had the highest number of compounds (n = 471) and the highest number of descriptors (Ndesc = 14), and the models developed using two descriptors (ARKA_1 and ARKA_2) still had better predictive performance than the QSAR models developed using 14 descriptors. In this work, we have worked on classification-based models and compared the predictive performance of the models generated using conventional QSAR descriptors and ARKA descriptors. 
However, this method of dimensionality reduction can also be explored when the response values are quantitative, leading to the development of regression-based models. To develop regression-based models, the scheme for the computation of the descriptors may remain the same, and these descriptors may be submitted to a regression-based modeling algorithm. However, because the endpoint values are quantitative in regression modeling, the initial grouping of the training compounds into active and inactive classes should be done using a certain value as the threshold (preferably the mean experimental response of the training compounds). Additionally, the computation and assignment of the weightage to each descriptor can be customized. In this work, we have chosen a simple arithmetic weighting strategy, but other modelers may use a customized weighting strategy of their choice. One can apply different weighted summation schemes, as applied in the computation of mixture descriptors for multi-component chemical mixtures using various algebraic expressions like quadratic mixture descriptors, logarithmic mixture descriptors, etc.50 ARKA need not always be a two-descriptor modeling approach, and the clustering and weighting strategies can be customized using a variety of algorithms, which appear to be interesting directions for future studies. The application of this concept to fingerprints would also be very interesting for future work.
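The thresholding step suggested for regression data can be sketched as follows. The example groups the training compounds using the mean experimental response as the cut-off; treating responses above the mean as the higher-response group (labeled 1) is an assumption, as are the function name and the response values.

```python
from statistics import mean

def group_by_mean_response(responses):
    """Split quantitative responses into two groups around their mean.

    Returns (threshold, labels), where a label of 1 marks a response above the
    training mean and 0 marks a response at or below it (assumed convention).
    """
    threshold = mean(responses)
    labels = [1 if value > threshold else 0 for value in responses]
    return threshold, labels

# Hypothetical quantitative responses of training compounds
train_responses = [2.0, 3.0, 5.0, 6.0]
threshold, labels = group_by_mean_response(train_responses)
print(threshold, labels)
```

The resulting labels would be used only to partition the descriptors and compute the ARKA descriptors; the regression model itself would still be trained on the original quantitative responses.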

The ARKA descriptors can also play an important role in the similarity assessment of chemicals for regulatory decision-making. The plot of ARKA_2 vs. ARKA_1 can not only identify potential activity cliffs but can also help one to understand the similar types of chemicals that are grouped in a cluster – a basic form of Read-Across. Additionally, it is also possible to identify the chemical nature and possible adverse outcome pathways (AOPs) of the close congeners using the concepts of Read-Across51,52 and quantitative Read-Across structure–activity relationship (q-RASAR).27,53,54 Furthermore, this approach can not only be used in assessing environmental/ecotoxicity endpoints but can also be extended to other fields like drug discovery.55

4 Conclusion

As a rule of thumb for any statistical modeling analysis of small datasets, a minimal number of descriptors should be used for modeling. This increases the degrees of freedom of the developed model and its statistical reliability. In this work, we have taken the same chemical information from the descriptors used in previously reported ecotoxicological QSAR models and encoded it in a way that significantly lowers the number of modeling descriptors, yielding the ARKA descriptors (a form of supervised dimensionality reduction). We found that two ARKA descriptors can potentially identify activity cliffs, less confident data points, and less modelable data points; the results obtained in this study comply with the previously reported method for the detection of activity cliffs utilizing the concept of chemical similarity.26 On retraining the models using ARKA descriptors, it was observed that they had better predictivity (for the test sets) than the previously published QSAR models, as was evident from various classification-based statistical validation metrics (Matthews correlation coefficient, Cohen's kappa, F1_score, and AUC-ROC) that provide an unbiased evaluation of a model's ability to classify compounds into actives and inactives, even for imbalanced data sets. To ensure that this observation was not limited to a particular modeling algorithm, we trained various additional Machine Learning (ML) models using the QSAR descriptors and ARKA descriptors separately, the comparative prediction results of which strengthen our inference. From the modeling exercise on five diverse ecotoxicity data sets, we observe that in most of the cases, models developed using ARKA descriptors outperform the models developed using conventional QSAR descriptors in predictive ability. 
Therefore, we infer that the models generated using ARKA descriptors can quickly and efficiently identify toxic environmental chemicals with enhanced predictivity, thus leading to increased reliability of the predictions. However, there is room for further development of the approach through its application to regression-based and/or Read-Across approaches, classification modeling of larger ecotoxicity data sets, and the exploration of other customized weighting strategies for deriving ARKA descriptors.

Declaration

A preprint version of this article is available from https://doi.org/10.26434/chemrxiv-2024-jqkjv.

Author contributions

AB – conceptualization, computation, validation, writing – initial draft, software; KR – conceptualization, supervision, writing – editing, funding.

Conflicts of interest

There are no conflicts to declare.

Acknowledgements

Life Sciences Research Board, Defence Research and Development Organisation, Govt. of India (LSRB/01/15001/M/LSRB-394/SH&DD/2022). AB thanks Life Sciences Research Board, Defence Research and Development Organisation, Govt. of India, for a Senior Research Fellowship.

References

  1. K. Khan and K. Roy, Ecotoxicological Risk Assessment Of Organic Compounds Against Various Aquatic And Terrestrial Species: Application Of Interspecies I-QSTTR And Species Sensitivity Distribution Techniques, Green Chem., 2022, 24, 2160–2178,  DOI:10.1039/D1GC04320J.
  2. N. Fjodorova, M. Novich, M. Vrachko, V. Smirnov, N. Kharchevnikova, Z. Zholdakova, S. Novikov, N. Skvortsova, D. Filimonov, V. Poroikov and E. Benfenati, Directions In QSAR Modeling For Regulatory Uses In OECD Member Countries, EU And In Russia, J. Environ. Sci. Health, Part C: Environ. Carcinog. Ecotoxicol. Rev., 2008, 26, 201–236,  DOI:10.1080/10590500802135578.
  3. K. Khan, B. Baderna, C. Cappelli, C. Toma, A. Lombardo, K. Roy and E. Benfenati, Ecotoxicological QSAR Modeling Of Organic Compounds Against Fish: Application Of Fragment Based Descriptors In Feature Analysis, Aquat. Toxicol., 2019, 212, 162–174,  DOI:10.1016/j.aquatox.2019.05.011.
  4. OECD, https://www.oecd.org/about/, accessed on 18th March 2024.
  5. G. Piir, I. Kahn, A. T. Garcia-Sosa, S. Sild, P. Ahte and U. Maran, Best Practices For QSAR Model Reporting: Physical And Chemical Properties, Ecotoxicity, Environmental Fate, Human Health, And Toxicokinetics Endpoints, Environ. Health Perspect., 2018, 126, 126001,  DOI:10.1289/EHP3264.
  6. A. Banerjee, P. De, V. Kumar, S. Kar and K. Roy, Quick And Efficient Quantitative Predictions Of Androgen Receptor Binding Affinity For Screening Endocrine Disruptor Chemicals Using 2D-QSAR And Chemical Read-Across, Chemosphere, 2022, 309, 136579,  DOI:10.1016/j.chemosphere.2022.136579.
  7. EU REACH, https://echa.europa.eu/it/regulations/reach/legislation, accessed on 18th March 2024.
  8. K. Roy, S. Kar and R. N. Das, Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment, Academic press, NY, 2015,  DOI:10.1016/C2022-0-00080-5.
  9. K. Mansouri, N. F. Cariello, A. Korotcov, V. Tkachenko, C. M. Grulke, C. S. Sprankle, D. Allen, W. M. Casey, N. C. Kleinstreuer and A. J. Williams, Open-Source QSAR Models For Pka Prediction Using Multiple Machine Learning Approaches, J. Cheminf., 2019, 11, 60,  DOI:10.1186/s13321-019-0384-1.
  10. G. Gini and F. Zanoli, Machine Learning and Deep Learning Methods in Ecotoxicological QSAR Modeling, in Ecotoxicological QSARs, ed. K. Roy, Springer, NY, pp. 111–149,  DOI:10.1007/978-1-0716-0150-1_6.
  11. R. Rodriguez-Perez and J. Bajorath, Interpretation Of Compound Activity Predictions From Complex Machine Learning Models Using Local Approximations And Shapley Values, J. Med. Chem., 2020, 63, 8761–8777,  DOI:10.1021/acs.jmedchem.9b01101.
  12. P. Karpov, G. Godin and I. V. Tetko, Transformer-CNN: Swiss Knife For QSAR Modeling And Interpretation, J. Cheminf., 2020, 12, 17,  DOI:10.1186/s13321-020-00423-w.
  13. S. Manganelli and E. Benfenati, Use of Read-Across Tools, in In Silico Methods for Predicting Drug Toxicity, Methods in Molecular Biology, ed. E. Benfenati, Humana Press, New York, NY, 2016, vol. 1425,  DOI:10.1007/978-1-4939-3609-0_13.
  14. N. Ball, M. T. D. Cronin, J. Shen, K. Blackburn, E. D. Booth, M. Bouhifd, E. Donley, L. Egnash, C. Hastings, D. R. Juberg, A. Kleensang, N. Kleinstreuer, E. D. Kroese, A. C. Lee, T. Luechtefeld, A. Maertens, S. Marty, J. M. Naciff, J. Palmer, D. Pamies, M. Penman, A. N. Richarz, D. P. Russo, S. B. Stuard, G. Patlewicz, B. van Ravenzwaay, S. Wu, H. Zhu and T. Hartung, Toward Good Read-Across Practice (GRAP) guidance, ALTEX, 2016, 33, 149–166,  DOI:10.14573/altex.1601251.
  15. C. Hung and G. Gini, QSAR Modeling Without Descriptors Using Graph Convolutional Neural Networks: The Case Of Mutagenicity Prediction, Mol. Diversity, 2021, 25, 1283–1299,  DOI:10.1007/s11030-021-10250-2.
  16. M. Chatterjee, A. Banerjee, P. De, A. Gajewicz-Skretna and K. Roy, A Novel Quantitative Read-Across Tool Designed Purposefully To Fill The Existing Gaps In Nanosafety Data, Environ. Sci.: Nano, 2022, 9, 189–203,  DOI:10.1039/D1EN00725D.
  17. A. Banerjee and K. Roy, First Report Of q-RASAR Modeling Toward An Approach Of Easy Interpretability And Efficient Transferability, Mol. Diversity, 2022, 26, 2847–2862,  DOI:10.1007/s11030-022-10478-6.
  18. T. Srisongkram, Ensemble Quantitative Read-Across Structure–Activity Relationship Algorithm For Predicting Skin Cytotoxicity, Chem. Res. Toxicol., 2023, 36, 1961–1972,  DOI:10.1021/acs.chemrestox.3c00238.
  19. M. H. Keshavarz, F. Gharagheizi, A. Shokrolahi and S. Zakinejad, Accurate Prediction Of The Toxicity Of Benzoic Acid Compounds In Mice Via Oral Without Using Any Computer Codes, J. Hazard. Mater., 2012, 30(237–238), 79–101,  DOI:10.1016/j.jhazmat.2012.07.048.
  20. M. Jafari, M. H. Keshavarz and H. Salek, A Simple Method For Assessing Chemical Toxicity Of Ionic Liquids On Vibrio Fischeri Through The Structure Of Cations With Specific Anions, Ecotoxicol. Environ. Saf., 2019, 182, 109429,  DOI:10.1016/j.ecoenv.2019.109429.
  21. J. Sivakumar, K. Ramamurthy, M. Radhakrishnan and D. Won, Synthetic Sampling From Small Datasets: A Modified Mega-Trend Diffusion Approach Using K-Nearest Neighbors, Knowledge-Based Systems, 2022, 236, 107687,  DOI:10.1016/j.knosys.2021.107687.
  22. A. Nath, P. De and K. Roy, In Silico Modelling Of Acute Toxicity Of 1, 2, 4-Triazole Antifungal Agents Towards Zebrafish (Danio Rerio) Embryos: Application Of The Small Dataset Modeller Tool, Toxicol. in Vitro, 2021, 75, 105205,  DOI:10.1016/j.tiv.2021.105205.
  23. K. Khan, V. Kumar, E. Colombo, A. Lombardo, E. Benfenati and K. Roy, Intelligent Consensus Predictions Of Bioconcentration Factor Of Pharmaceuticals Using 2D And Fragment-Based Descriptors, Environ. Int., 2022, 170, 107625,  DOI:10.1016/j.envint.2022.107625.
  24. S. Wold, K. Esbensen and P. Geladi, Principal Component Analysis, Chemom. Intell. Lab. Syst., 1987, 2, 37–52,  DOI:10.1016/0169-7439(87)80084-9.
  25. S. Wold, M. Sjostrom and L. Eriksson, PLS-Regression: A Basic Tool Of Chemometrics, Chemom. Intell. Lab. Syst., 2001, 58, 109–130,  DOI:10.1016/S0169-7439(01)00155-1.
  26. J. T. Vogelstein, E. W. Bridgeford, M. Tang, D. Zheng, C. Douville, R. Burns and M. Maggioni, Supervised dimensionality reduction for big data, Nat. Commun., 2021, 12, 2872,  DOI:10.1038/s41467-021-23102-2.
  27. A. Banerjee and K. Roy, Prediction-Inspired Intelligent Training For The Development Of Classification Read-Across Structure–Activity Relationship (c-RASAR) Models For Organic Skin Sensitizers: Assessment Of Classification Error Rate From Novel Similarity Coefficients, Chem. Res. Toxicol., 2023, 36, 1518–1531,  DOI:10.1021/acs.chemrestox.3c00155.
  28. J. Roy, P. K. Ojha, E. Carnesecchi, A. Lombardo, K. Roy and E. Benfenati, First Report On A Classification-Based QSAR Model For Chemical Toxicity To Earthworm, J. Hazard. Mater., 2020, 386, 121660,  DOI:10.1016/j.jhazmat.2019.121660.
  29. S. Kar and K. Roy, Prediction Of Milk/Plasma Concentration Ratios Of Drugs And Environmental Pollutants Using In Silico Tools: Classification And Regression Based Qsars And Pharmacophore Mapping, Mol. Inf., 2013, 32, 693–705,  DOI:10.1002/minf.201300018.
  30. S. Pramanik and K. Roy, Predictive Modeling Of Chemical Toxicity Towards Pseudokirchneriella Subcapitata Using Regression And Classification Based Approaches, Ecotoxicol. Environ. Saf., 2014, 101, 184–190,  DOI:10.1016/j.ecoenv.2013.12.030.
  31. S. Kar, O. Deeb and K. Roy, Development Of Classification And Regression Based QSAR Models To Predict Rodent Carcinogenic Potency Using Oral Slope Factor, Ecotoxicol. Environ. Saf., 2012, 82, 85–95,  DOI:10.1016/j.ecoenv.2012.05.013.
  32. P. Gramatica, S. Cassani, P. P. Roy, S. Kovarich, C. W. Yap and E. Papa, QSAR Modeling Is Not “Push A Button And Find A Correlation”: A Case Study Of Toxicity Of (Benzo-)Triazoles On Algae, Mol. Inf., 2012, 31, 817–835,  DOI:10.1002/minf.201200075.
  33. M. Murcia-Soler, F. Perez-Gimenez, F. J. Garcia-March, M. T. Salabert-Salvador, W. Diaz-Villanueva and P. Medina-Casamayor, Discrimination And Selection Of New Potential Antibacterial Compounds Using Simple Topological Descriptors, J. Mol. Graphics Modell., 2003, 21, 375–390,  DOI:10.1016/S1093-3263(02)00184-5.
  34. R. N. Das and K. Roy, Predictive Modeling Studies For The Ecotoxicity Of Ionic Liquids Towards The Green Algae Scenedesmus Vacuolatus, Chemosphere, 2014, 104, 170–176,  DOI:10.1016/j.chemosphere.2013.11.002.
  35. DTC Lab tools Supplementary Website, https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home, accessed on 18th March 2024.
  36. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg and J. Vanderplas, Scikit-learn: Machine Learning in Python, J. Mach. Learn. Res., 2011, 12, 2825–2830.
  37. T. Kluyver, B. Ragan-Kelley, F. Perez, B. E. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. B. Hamrick, J. Grout, S. Corlay and P. Ivanov, Jupyter Notebooks – a publishing format for reproducible computational workflows, in Positioning and Power in Academic Publishing: Players, Agents and Agendas: Proceedings of the 20th International Conference on Electronic Publishing, ed. F. Loizides and B. Schmidt, IOS Press, 2016, pp. 87–90.
  38. J. C. Stoltzfus, Logistic Regression: A Brief Primer, Acad. Emerg. Med., 2011, 18, 1099–1104,  DOI:10.1111/j.1553-2712.2011.01185.x.
  39. K. W. Lau and Q. H. Wu, Online Training Of Support Vector Classifier, Pattern Recognit., 2003, 36, 1913–1920,  DOI:10.1016/S0031-3203(03)00038-4.
  40. M. Pal, Random Forest Classifier For Remote Sensing Classification, Int. J. Remote Sens., 2005, 26, 217–222,  DOI:10.1080/01431160412331269698.
  41. I. M. De Diego, A. R. Redondo, R. R. Fernández, J. Navarro and J. M. Moguerza, General Performance Score For Classification Problems, Appl. Intell., 2022, 52, 12049–12063,  DOI:10.1007/s10489-021-03041-7.
  42. F. S. Nahm, Receiver Operating Characteristic Curve: Overview And Practical Use For Clinicians, Korean J. Anesthesiol., 2022, 75, 25–36,  DOI:10.4097/kja.21209.
  43. G. W. Snedecor and W. G. Cochran, Statistical Methods, Wiley-Blackwell, NJ, 8th edition, 1989.
  44. P. Gramatica, E. Giani and E. Papa, Statistical External Validation And Consensus Modeling: A QSPR Case Study For Koc Prediction, J. Mol. Graphics Modell., 2007, 25, 755–766,  DOI:10.1016/j.jmgm.2006.06.005.
  45. OECD Grouping of Chemicals: Chemical Categories and Read-Across, https://www.oecd.org/chemicalsafety/risk-assessment/groupingofchemicalschemicalcategoriesandread-across.htm, accessed on 18th March 2024.
  46. S. Kovarich, L. Ceriani, M. F. Gatnik, A. Bassan and M. Pavan, Filling Data Gaps By Read-Across: A Mini Review On Its Application, Developments And Challenges, Mol. Inf., 2019, 38, 1800121,  DOI:10.1002/minf.201800121.
  47. G. Patlewicz, Chemical Categories and Read-across, EUR 21898 EN, European Commission Directorate General Joint Research Centre, 2005, https://publications.jrc.ec.europa.eu/repository/bitstream/JRC31792/Chemical%20Categories%20and%20Read%20across_Dec.pdf.
  48. C. X. Ling, J. Huang and H. Zhang, AUC: A better measure than accuracy in comparing learning algorithms, Advances in Artificial Intelligence, Canadian AI 2003, Lecture notes in computer science, ed. Y. Xiang and B. Chaib-draa, Springer, 2003, vol. 2671, pp. 329–341,  DOI:10.1007/3-540-44886-1_25.
  49. S. J. Enoch, M. T. D. Cronin, T. W. Schultz and J. C. Madden, Quantitative And Mechanistic Read Across For Predicting The Skin Sensitization Potential Of Alkenes Acting Via Michael Addition, Chem. Res. Toxicol., 2008, 21, 513–520,  DOI:10.1021/tx700322g.
  50. D. A. Saldana, L. Starck, P. Mougin, B. Rousseau and B. Creton, Prediction Of Flash Points For Fuel Mixtures Using Machine Learning And A Novel Equation, Energy Fuels, 2013, 27, 3811–3820,  DOI:10.1021/ef4005362.
  51. L. E. Lizarraga, G. W. Suter, J. C. Lambert, G. Patlewicz, J. Q. Zhao, J. L. Dean and P. Kaiser, Advancing The Science Of A Read-Across Framework For Evaluation Of Data-Poor Chemicals Incorporating Systematic And New Approach Methods, Regul. Toxicol. Pharmacol., 2023, 137, 105293,  DOI:10.1016/j.yrtph.2022.105293.
  52. N. Spinu, M. T. D. Cronin, S. J. Enoch, J. C. Madden and A. P. Worth, Quantitative Adverse Outcome Pathway (QAOP) Models For Toxicity Prediction, Arch. Toxicol., 2020, 94, 1497–1510,  DOI:10.1007/s00204-020-02774-7.
  53. A. Banerjee and K. Roy, On Some Novel Similarity-Based Functions Used In The ML-Based Q-RASAR Approach For Efficient Quantitative Predictions Of Selected Toxicity End Points, Chem. Res. Toxicol., 2023, 36, 446–464,  DOI:10.1021/acs.chemrestox.2c00374.
  54. V. Kumar, A. Banerjee and K. Roy, Breaking the Barriers: Machine-Learning-Based c-RASAR Approach for Accurate Blood–Brain Barrier Permeability Prediction, J. Chem. Inf. Model., 2024,  DOI:10.1021/acs.jcim.4c00433.
  55. G. Patlewicz and J. M. Fitzpatrick, Current And Future Perspectives On The Development, Evaluation, And Application Of In Silico Approaches For Predicting Toxicity, Chem. Res. Toxicol., 2016, 29, 438–451,  DOI:10.1021/acs.chemrestox.5b00388.

Footnote

Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4em00173g

This journal is © The Royal Society of Chemistry 2024