Consideration of predicted small-molecule metabolites in computational toxicology †

Xenobiotic metabolism has evolved as a key protective system of organisms against potentially harmful chemicals or compounds typically not present in a particular organism. The system's primary purpose is to chemically transform xenobiotics into metabolites that can be excreted via renal or biliary routes. However, in a minority of cases, the metabolites formed are toxic, sometimes even more toxic than the parent compound. Therefore, the consideration of xenobiotic metabolism clearly is of importance to the understanding of the toxicity of a compound. Nevertheless, most of the existing computational approaches for toxicity prediction do not explicitly take metabolism into account and it is currently not known to what extent the consideration of (predicted) metabolites could lead to an improvement of toxicity prediction. In order to study how predictive metabolism could help to enhance toxicity prediction, we explored a number of di ﬀ erent strategies to integrate predictions from a state-of-the-art metabolite structure predictor and from modern machine learning approaches for toxicity prediction. We tested the integrated models on ﬁ ve toxicological endpoints and assays, including in vitro and in vivo genotoxicity assays (AMES and MNT), two organ toxicity endpoints (DILI and DICC) and a skin sensitization assay (LLNA). Overall, the improvements in model performance achieved by including metabolism data were minor (up to +0.04 in the F1 scores and up to +0.06 in MCCs). In general, the best performance was obtained by averaging the probability of toxicity predicted for the parent compound and the maximum probability of toxicity predicted for any metabolite. Moreover, including metabolite structures as further input molecules for model training slightly improved the toxicity predictions obtained by this averaging approach. However, the high complexity of the metabolic system and associated uncertainty about the likely metabolites apparently limits the bene ﬁ t of considering predicted metabolites in toxicity prediction.


Introduction
The metabolic system has evolved as the primary defense system against xenobiotic, potentially toxic substances. Its protective function is based on the biotransformation of xenobiotics into more hydrophilic and, hence, more rapidly excretable compounds (metabolites). However, a minority of metabolites produced by the metabolic system are more active than their parent compound (which is exploited by the prodrug concept) or even toxic. 1 The important role of metabolism in the toxicity of small organic molecules highlights the need for the consideration of metabolic pathways also in the computational prediction of toxicity. However, so far only a few in silico models for toxicity prediction have integrated metabolism information. For example, Dmitriev et al. 2 built linear models for the prediction of rat acute toxicity using self-consistent regression, thereby considering parent compounds and measured metabolites. More specically, they trained a model on about 3000 parent compounds and used it to predict the LD 50 value of 37 test parent compounds and their measured metabolites (around 200 known metabolites). To calculate the nal LD 50 value, different strategies for averaging the LD 50 values predicted for the parent compounds and their metabolites were investigated. However, only minor improvements in the overall performance of the model were achieved compared to using only the predicted probability of the parent compounds (R 2 increased from 0.78 to 0.81 and RMSE remained at 0.49). In a more recent study from the same research group, 3 classication models based on a Bayesian approach were trained on parent compounds with annotated bioactivity data for a variety of endpoints. The bioactivity of a compound was then calculated as the maximum probability predicted among the parent compound and its measured metabolites. For the 28 endpoints in the "toxic and adverse effects" category (with data sets ranging from 15 to 5583 toxic and non-toxic compounds), an increase of up to 0.14 in the precision and 0.16 in the recall during leave-one-out crossvalidation (CV) was obtained on average (compared to taking the predicted probability of the parent compound only). These results show that the consideration of metabolism in prediction models can substantially improve the identication of potentially toxic compounds.
Data on measured metabolites can be valuable for estimating the toxicity of compounds but such approaches rely on the availability of experimental data. For this reason, in silico approaches to predict the likely metabolites of substances based on molecular structures are in high demand. Several predictors of this kind are available today, including Bio-Transformer, 4 CyProduct, 5 GloryX, 6 Meteor Nexus, 7 SyGMA, 8 TIssue MEtabolic Simulator (TIMES) 9 and XenoSite. 10 In previous works, researchers from the Laboratory of Mathematical Chemistry (LMC) have combined in silico models for toxicity prediction with their TIMES metabolite predictor. The rst model from LMC taking into account the parent compound and its metabolites (predicted with the S-9 metabolism simulator of TIMES) was developed for the prediction of in vitro mutagenicity (i.e. outcomes of the AMES assay). 11,12 This AMES model was based on decision trees trained on the reactivity prole of compounds and labeled a compound as toxic if any of its predicted metabolites were predicted as toxic. The evaluation of the model on the training data showed that the metabolism-aware approach resulted in lower sensitivity (0.77) and specicity (0.74) compared to the performance of the model considering only the parent compound (sensitivity 0.82; speci-city 0.94). The lower sensitivity obtained by this approach may be related to the fact that compounds without any predicted metabolites were automatically classied as inactive. Another drawback of this approach is the decrease in specicity due to false positive predictions derived from non-mutagenic parents with metabolites predicted as mutagenic. In addition to the training data, the model was evaluated on a test set of 36 mutagenic compounds, obtaining a sensitivity of 0.58 (corresponding to 21 correctly classied compounds). Despite the overall drop in performance, the metabolism-aware approach correctly identied compounds of which their mutagenicity is related to the metabolites formed.
Two further decision tree models from LMC targeting skin and respiratory sensitization, respectively, 13,14 also included the evaluation of several properties of predicted metabolites (e.g. reactivity prole or ability to cross-link proteins) to classify the parent compounds as non-sensitizers or sensitizers (further distinguished between strong or weak sensitizers in the case of the skin). The evaluation of this skin sensitization model on the training data yielded 80% correct predictions for strong sensitizers, 34% for weak sensitizers and 72% for non-sensitizers, while the respiratory sensitization model obtained a sensitivity of 0.89 and a specicity of 0.52.
A further model of this kind from LMC was reported for the in vivo micronucleus test (MNT). 15 By comparing the assay outcomes of the (in vitro) AMES assay with a liver genotoxicity and an MNT in vivo assays, bioactivated compounds and "bioexhausted" compounds (i.e. highly reactive compounds interacting with off-targets before reaching the target) were analyzed to establish in vitro-in vivo relationships. Based on this analysis, an in vivo rat liver metabolism predictor reproducing phase II conjugation reactions and detoxication pathways was developed. The toxicity prediction model of MNT applied on the predicted metabolites (derived with the in vivo metabolite predictor) reached a sensitivity of 0.82 and a specicity of 0.61 on the training data.
The performance of this MNT model, as well as the skin and respiratory sensitization models, was not compared to the performance of models not considering predicted metabolites. Therefore it is not possible to conclude on the benets or drawbacks of these metabolism-aware models compared to models considering only parent compounds.
Overall, these recent reports on efforts to enhance toxicity prediction of small organic molecules by the consideration of their biotransformation provide valuable insights and starting points for the further development of methods for computational toxicology. Although metabolism is key to understanding the pharmacokinetics and toxicity of compounds, the inherent uncertainty of the complex metabolic data could also hinder the improvement of models integrating this information. So far, the existing works on this topic are either based on only a few parent compounds and their measured metabolites, or focused on a single endpoint, making it therefore difficult to derive more general conclusions.
With this work, we aim to provide a systematic study on how, and to what extent, the consideration of metabolism can help the in silico prediction of toxicity. In order to cover a wide chemical space and make models applicable to new, untested compounds, we referred to the use of predicted metabolites. Five relevant toxicological endpoints and assays were selected for investigation: the in vitro AMES assay (considering metabolic activation with S-9 liver extract), the in vivo micronucleus test (MNT), a skin sensitization assay (the murine local lymph node assay, LLNA), and the drug-induced liver injury (DILI) and cardiological complications (DICC) endpoints. [16][17][18] All selected endpoints and assays have in common that their outcome is known to be related, to some extent, to the biological activity of metabolites. Positive outcomes of the genotoxicity assays (AMES and MNT) and the skin sensitization assay (LLNA) can be produced by reactive metabolites that bind to DNA or skin proteins. The in vitro AMES assay (considering metabolic activation) was specically chosen to evaluate the impact of adding metabolism information to a less complex endpoint (that is less dependent on pharmacokinetic variables than other in vivo endpoints). Moreover, reactive metabolites are also known to be a recurrent trigger of idiosyncratic adverse effects of drugs (i.e. unpredictable and infrequent adverse reactions oen unrelated to dose). 16 The role of metabolites in the two organ toxicity endpoints (DILI and DICC), oen triggered by idiosyncratic adverse reactions, was hence also investigated. 17,18 Materials and methods Data sets AMES. AMES assay data were collected from the Chemical Carcinogenesis Research Information System (CCRIS), 19 the Genetic Toxicology Data Bank (GENE-TOX) 20 and the U.S. National Toxicology Program (NTP; Table S1 †). 21 These data sources were selected because they provide information about the consideration of metabolic activation in the assay setup. Since the inuence of the metabolites on the toxic effect was investigated in this study, only results obtained from the AMES assay accounting for metabolic activation were considered.
More specically, the CCRIS database (stored in XML le format) was queried for mutagenicity studies based on the AMES assay, resulting in 67 907 study results (i.e. experimental assay outcomes on a set of compounds). For extracting these studies, the word "ames" was queried in the test system eld ("mstu/tsstm") of the XML le. The retrieved AMES data were further ltered for experiments that test for metabolic activation, by querying the data for the words "liver", "hepatocytes", "s9" and "s-9" in the "matvm" eld. The resulting data (38 267 study results) were further curated by removing any inconclusive or potentially ambiguous results. This was achieved by removing studies with results labeled as "weak" or as both "positive" and "negative" (e.g. "positive (retest was negative)"). Also inconclusive results caused by precipitating compounds were removed from the data set by querying the labels "negative" and "precipitation" (e.g. "negative, precipitation at 3 highest doses.").
The remaining data (38 200 study results) were labeled as "toxic" if the results eld matched the word "positive", or "nontoxic" if the results eld matched the word "negative". To obtain only one result per compound, the data were deduplicated based on the CAS number and any compounds with conicting class labels were removed from the data set. This resulted in 4721 compounds with AMES data.
The GENE-TOX database was obtained from PubChem. 22 The different genotoxicity study types contained in this database were queried to select only those studies belonging to the AMES assay (i.e. matching the "Histidine reverse gene mutation, Ames assay" assay type). From the 1057 compounds with AMES data only the 238 results considering metabolic activation (i.e. matching "with metabolic activation" in the "activation" eld) were conserved. The activity labels were used as is.
The NTP AMES data set contains 64 246 study results. Results from assay setups without S-9 activation and from assays with microsome-activating conditions of less than 5% were removed from the data set. Results without an activity label reported in the study conclusion and results labeled as "equivocal" were removed from the data set. These ltering steps resulted in 40 859 study results. Study outcomes with a "positive" or "weakly positive" study conclusion label were annotated as "toxic", and study outcomes with the "negative" conclusion label as "non-toxic". Compounds were deduplicated based on the CAS number, and duplicate compounds with conicting labels were removed from the data set. In contrast to the above data sets, the NTP set did not include SMILES strings for the tested compounds. The SMILES strings were obtained by querying PubChem via the PUG REST interface 23 using the CAS numbers provided with the NTP data set. This resulted in 1959 compounds annotated with AMES results.
The data from the three databases were merged based on the canonical SMILES (see section Structure preparation for details). Compounds with identical canonical SMILES but differing AMES activity labels (72 compounds) were removed from the data set. This resulted in a total of 5061 compounds (1908 toxic and 3153 non-toxic compounds; Table 1).
Micronucleus test. MNT data was collected, as described by Garcia de Lomana et al., 24 from (i) the European Chemicals Agency (ECHA; available at the eChemPortal), 25 (ii) the European Food Safety Authority (EFSA), curated by Benigni et al., 26 and (iii) the work of Yoo et al. 27 The nal, processed and deduplicated MNT data set consists of a total of 1775 compounds (315 toxic and 1460 non-toxic compounds; Table  1).
Drug-induced liver injury. The data set for the DILI endpoint was obtained from the veried DILIrank data (i.e. the revised version of their original DILIrank data set) of the U.S. Food and Drug Administration (FDA). 28 These data were derived from the observed hepatotoxicity of FDA-approved drugs described in drug labeling documents as well as evidence in literature. The drugs in this data set are classied as "most-DILI-concern", "less-DILI-concern", "no-DILI-concern" and "ambiguous-DILIconcern". For this study, binary class labels were assigned: 182 "most-DILI-concern" and 271 "less-DILI-concern" compounds were labeled as "toxic", 268 "no-DILI-concern" compounds as "non-toxic", and 239 "ambiguous-DILI-concern" compounds were removed from the data set. The nal, processed and deduplicated DILI data set consists of a total of 661 compounds (435 toxic and 226 non-toxic compounds; Table 1).
Drug-induced cardiological complications. The data set for DICC was compiled, as described by Garcia de Lomana et al., 24 from the work of Cai et al. 29 The DICC data set covers ve cardiological complications: hypertension, arrhythmia, heart block, cardiac failure and myocardial infarction. Compounds were labeled as "toxic" if they were active in at least one of the ve cardiological endpoints and labeled as "non-toxic" otherwise. The nal, processed and deduplicated DICC data set contains a total of 3208 compounds (965 toxic and 2243 nontoxic compounds; Table 1). Murine local lymph node assay. The data set for the LLNA was obtained from the work of Wilm et al. 30 The binary activity labels from this data set were used as is, resulting, aer processing and deduplication in a total of 1270 compounds (521 toxic and 749 non-toxic compounds; Table 1).

Structure preparation
The standardization of the molecular structures followed the same procedure as described by Garcia de Lomana et al. 24 (with one exception, indicated below). Briey, the SMILES strings were standardized with the ChemAxon Standardizer 31 node in KNIME 32 to remove solvents and salts, annotate aromaticity, neutralize charges and mesomerize structures (i.e. returning the canonical resonant form of the molecule). Moreover, compounds containing any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br and I as well as multi-component compounds were removed from the data set. Lastly, compounds with fewer than four heavy atoms or with molecular weight greater than 1000 Da (this criterion has been introduced for the current work only) were ltered out from the respective data set.
For the remaining standardized structures, canonical SMILES were derived with RDKit 33 in KNIME. These canonical SMILES were used for the deduplication of compounds in each data set. Compounds with identical canonical SMILES but conicting labels for an endpoint were removed from the respective endpoint data set.

Descriptor calculation
Molecular structures were encoded with count-based Morgan ngerprints with a radius of 2 bonds and a length of 2048 bytes (computed with the "RDKit Count-Based Fingerprint" node in KNIME) plus 119 1D and 2D physicochemical property descriptors (computed with the "RDKit descriptor calculation" node in KNIME). These RDKit physicochemical property descriptors capture properties such as the number of occurrences of a specic atom type, bond or ring, as well as global molecular properties such as polarity and solubility. Moreover, up to two acidic and two basic pK a values were calculated for each molecule with the "pK a " KNIME node from ChemAxon. 34 For molecules with fewer than two acidic or basic groups, the remaining pK a feature values were lled with the mean value of the respective data set.

Model development and evaluation
Prior to model development, a variance lter was applied to all input features to remove those with a variance of less than 0.001. The remaining features were scaled with the Stand-ardScaler class of scikit-learn 35 by subtracting the mean and scaling to unit variance. Both variance ltering and scaling were performed individually for each data set.
The models were evaluated within a 5-fold cross-validation (CV) framework by splitting the data into 80% training and 20% test set with the StratiedShuffleSplit class of scikit-learn. To account for data imbalance, oversampling with SMOTENC (an extension of SMOTE that handles categorical features) 36 was performed on the training set (with a ratio of samples in the minority class with respect to the majority class of 0.8). All molecular ngerprints and discrete RDKit descriptor features (e.g. number of hydrogen bond donors or ring count) were specied as categorical features in SMOTENC.
For evaluating the performance of the models, the precision, recall, F1 score and Matthews Correlation Coefficient (MCC) were calculated on the respective test set of the CV. The precision measures the proportion of true positive predictions out of all positive predictions, while the recall measures the proportion of correctly predicted positive samples. The F1 score is the harmonic mean of precision and recall. The MCC takes into consideration all four classes of predictions (true positive, true negative, false positive and false negative predictions) and ranges between À1 and +1 (being +1 the perfect prediction). Both the F1 score and the MCC are robust against data imbalance.
Differences in the performance between models were evaluated with the nonparametric Mann-Whitney U test. 37 For comparing a pair of models, the values for a given performance metric obtained in the different CV runs were used as input for the "mannwhitneyu" function implemented in SciPy. 38 The pvalue threshold of 0.05 was applied to consider a difference as signicant. Due to the negligible number of signicant results, a correction of the p-value accounting for the number of comparisons performed was deemed to be not necessary.

Metabolite prediction with Meteor Nexus
The metabolites were predicted with Meteor Nexus, 7,39 a leading soware package for metabolism prediction that is widely applied in the industries. Meteor Nexus covers a broad range of approximately 500 manually curated biotransformations gathered from several public sources and proprietary data sets from member organizations of Lhasa Limited.
In this study, starting from the prepared molecular structures (canonical SMILES), four generations of metabolites were predicted and subsequently scored with the "Site of Metabolism (SOM) Scoring" method, 40 which is the default scoring method of Meteor Nexus. Other processing options were retained at their default setting. The score given to each metabolite is based on experimental data for compounds that are chemically related to the query compound around the site of metabolism. The molecular structures of the predicted metabolites were prepared and standardized following the same procedure described for the parent compounds (starting from the SMILES string output by Meteor Nexus).

Predicted metabolite information as input descriptors for parent compounds
Two different approaches for including metabolite information as input features in machine learning were explored (Fig. 1A). In the rst approach, the above-mentioned molecular ngerprints and physicochemical properties for each parent compound were concatenated with chemical descriptors calculated for the top-5 predicted metabolites of that parent compound (if available; metabolite scoring with Meteor). The chemical descriptors of the metabolites comprise count-based Morgan ngerprints (radius of 2; length of 1024 bytes) and all of the 200 physicochemical property descriptors of RDKit listed under "rdkit.Chem.Descriptors._descLis". For parent compounds with fewer than ve predicted metabolites, the empty values of the Morgan ngerprint vectors from the remaining metabolites were lled with zeros (indicating the absence of the structural feature) and the features corresponding to RDKit descriptors were lled with the mean value of the whole data set for that feature. Models were trained combining the molecular descriptors of the parent compounds with (a) Morgan ngerprints of the metabolites, (b) RDKit physicochemical property descriptors of the metabolites or (c) a combination of both.
In the second approach, the above-mentioned ngerprints and physicochemical properties for each parent compound were concatenated with a count-based "biotransformation ngerprint". The biotransformation ngerprint encodes the number of occurrences of a particular biotransformation (as labeled by Meteor Nexus) in the predicted metabolic tree. For each endpoint data set only those biotransformation predicted for at least one parent compound were included in the ngerprint. The feature length of the ngerprint ranges from 238 for the LLNA data set to 330 for the AMES data set. In addition to models based on the complete descriptor vector, models were also built on subsets of features selected prior to model building (in an attempt to reduce noise related to the sparsity of the biotransformation ngerprints). The feature selection was conducted on all descriptors (including ngerprints and physicochemical descriptors) and using the LassoCV implementation from scikit-learn within a 5-fold CV. Any feature with an output coefficient of zero was removed from the data prior to the training of the RF models.
Combination of the probabilities of toxicity predicted for a parent compound and its predicted metabolites Overall predicted probability of a compound's toxicity. An overall probability for the parent compounds' toxicity was calculated by combining the predicted probabilities for the parent compounds and their predicted metabolites.
Two types of models were used for predicting the probability of toxicity: (i) Baseline model: without the consideration of metabolites (i.e. trained only on the parent compounds).
(ii) Metabolism-aware model: with the consideration of metabolites (i.e. trained on the parent compounds and labeled metabolites).
The molecular descriptors dened in the "Descriptor calculation" section were used as input features for the parent compounds and metabolites in both types of models. For the metabolism-aware model the labels of the metabolites were assigned according to the workow described in "Assignment of toxicity labels to metabolites". The predicted probabilities for the parent compounds (with the baseline model) were used as a baseline result to analyze whether model performance improves when considering metabolites for the prediction of toxicity.
In an attempt to obtain the most accurate predicted probability for the parent compounds and metabolites, two approaches combining the baseline model and metabolismaware model were investigated: (a) Baseline-approach: baseline model for the prediction of both parent compounds and metabolites.
(b) Hybrid-approach: baseline model for the prediction of parent compounds plus metabolism-aware model for the prediction of metabolites.
To obtain the overall probability of toxicity of a compound (i.e. with the consideration of its metabolites), the selected model was applied to calculate the probability of toxicity of the parent compound and that of the predicted metabolites (up to four generations; Fig. 2). In addition, a number of different strategies for ltering predicted metabolites according to their relevance to toxicity were explored by a grid search. These lters are based on calculated log P, the Meteor score and/or predicted phase II metabolism, and are intended to remove any non-toxic (since readily excretable or unlikely) metabolites. The investigated threshold values, below which metabolites were removed, are 0 and 3 for log P, and 100, 200 and 300 for the Meteor score. When the phase II metabolism lter was applied, metabolites formed by phase II reactions, as well as those metabolites further transformed by phase II reactions, were ltered out. A grid search over the 23 possible combinations of lters (always including the possibility of not ltering for one or more properties) was performed.
The predicted probabilities of toxicity calculated for the selected metabolites were then combined with the predicted probability for the respective parent compound. For the combination of the predicted probabilities of toxicity, four strategies were explored (Fig. 1B): (1) Strategy 1: mean predicted probability over all compounds (i.e. the parent compound and all predicted metabolites).
(2) Strategy 2: median predicted probability over all compounds (i.e. the parent compound and all predicted metabolites).
(3) Strategy 3: maximum predicted probability among the parent compound and its predicted metabolites.
(4) Strategy 4: mean between the predicted probability of the parent compound and the maximum probability among all predicted metabolites.
If the overall probability was above 0.5, the compound was predicted as toxic and otherwise as non-toxic.
Assignment of toxicity labels to metabolites. In preparation of the use of the predicted metabolites for the generation of the metabolism-aware models, the metabolites were assigned toxicity labels according to the following procedure, individually for each endpoint data set: (1) All metabolites with identical canonical SMILES as a parent compound were assigned the toxicity label of the parent compound.
(2) All metabolites not covered by step 1 and originating from non-toxic parent compounds were labeled as "non-toxic".
(3) All metabolites not covered by step 1 and originating from toxic parent compounds were compared with the already labeled metabolites. If an identical metabolite (based on the canonical SMILES) was labeled in one of the previous steps (as toxic or non-toxic), the same label was assigned.
Data splitting. All models were trained within a 5-fold CV framework. In order to ensure comparability between the baseline models and the metabolism-aware models, the same splits (with regard to parent compounds) were used in both cases.
To ensure that no data leak occurred in the metabolismaware model due to the presence of identical metabolites in the training and test sets, the following procedure was conducted on each split: (1) Stratied shuffle split was applied on the parent compounds (see Model development for details).
(2) The metabolites from the parent compounds in the test and training set were collected independently.
(3) The metabolites in the training set, which were also present in the test set (as parent or metabolite), were removed from the training set.
(4) The compounds of the training set were deduplicated based on the canonical SMILES (duplicates may appear due to repeated metabolites or metabolites identical to parent compounds).
Machine learning methods for further modeling optimization RF, gradient boosted trees and k-nearest neighbors models with optimized hyperparameters were also trained in the hybridapproach. The scikit-learn implementations 'GradientBoos-tingClassier' and 'KNeighborsClassier' were used for training the gradient boosted trees and k-nearest neighbor models, respectively. The hyperparameter optimization was conducted on the training set within a grid search evaluated on an inner 5fold CV over the hyperparameters shown in Table 3.
A further set of molecular descriptors, the Continuous and Data-Driven molecular Descriptors (CDDD), 41 was employed as input for RF models. These descriptors are derived from a neural network trained to translate between two syntactically Fig. 2 Workflow for calculating the overall probability of toxicity. The baseline model or the metabolism-aware model are used to predict the probability of toxicity of parent compounds and predicted metabolites independently. The predictions for a compound and its predicted metabolites are then combined into an overall probability to obtain the toxicity label. different molecular representations. In order to make the translation, the model rst learns to compress meaningful information for the representation of molecules into a vector. This vector can hence be used as a data-driven molecular descriptor, offering a conceptually different method to represent molecules, compared to the xed Morgan ngerprints and RDKit physicochemical descriptors.

Analysis of the chemical space of the parent compounds and their predicted metabolites
To understand the nature and composition of the metabolites predicted for the parent compounds in each data set, several characteristics of the predicted metabolites were analyzed.
The predicted metabolites result from phase I or phase II reactions (considering up to four generations of metabolites).
The number of unique metabolites for the individual parent compounds (aer removing duplicate metabolites from the respective metabolic tree) varied greatly (from 0 to 828). However, the median number of predicted metabolites among all parent compounds of an endpoint-specic data set was between 8 and 12 in all cases (Table 4).
By comparing the molecular properties of the parent compounds and their predicted metabolites (Fig. 3 reports on the AMES and MNT data sets; the graphs for the other endpoints are provided in Fig. S1 †) we found the latter to have, averaged over all endpoints, a higher molecular weight (+43.9 Da) as well as a larger polar surface area (+44.4 A 2 ). The predicted metabolites also tended to have a lower log P value than the parent compounds (À1.5; averaged over all endpoints). These shis are primarily a result of the addition of polar groups to the parent compounds, which make them more water soluble and therefore easier to excrete. This observation is in concordance with the higher number of hydrogen bond donors and acceptors observed in metabolites compared to parent compounds (1.8 more hydrogen bond donors and acceptors on average; Fig. 3). Overall, the shis in the physicochemical property space between the parent compounds and the predicted metabolites are consistent with those observed for parent compounds and experimentally detected metabolites, 42 a fact that supports the relevance of the predicted metabolites.

Analysis of metabolites originating from toxic and non-toxic parent compounds
The toxicity observed for a compound may be a direct result of the parent compound or of one or several of its metabolites.  Understanding the differences in the metabolites formed by toxic and non-toxic compounds may therefore help in their discrimination. However, when comparing the physicochemical properties of the (predicted) metabolites originating from toxic and from non-toxic parent compounds, we did not detect any substantial, systematic differences. This is not surprising because toxic effects may be related to a single metabolite, which is difficult to detect. Most notable was a minor shi in the log P distribution (see Fig. S2 † for an example of the log P distributions of AMES and MNT): the log P of metabolites originating from non-toxic compounds was generally lower (log P of 0.8; averaged over all metabolites of all endpoints) than for metabolites from toxic compounds (log P of 1.2; averaged over all metabolites of all endpoints). The higher log P of metabolites originating from toxic parent compounds could be related to the observed toxicity, as these metabolites are more likely to evade excretion and to cross membranes.
Another aspect that could differ from toxic to non-toxic compounds are the types of biotransformations that they are undergoing. Testa et al. 43 observed that some reactions are more prone to generate reactive or toxic metabolites than others. They showed that toxic metabolites are mainly formed by redox reactions, followed by conjugation reactions and, lastly, hydrolysis. Hence, the type of biotransformation that a compound undergoes may be an indicator of the compound's toxicity. To investigate whether the types of biotransformations in the metabolic trees of toxic and non-toxic compounds differ, the percentage of parent compounds of each toxicity class undergoing each biotransformation (as labeled by Meteor Nexus) was calculated for all endpoints.
We observed that some biotransformations occur more frequently in toxic parent compounds than in non-toxic ones (and vice versa). However, there was no single biotransformation observed to be related to the same toxicity class for all endpoints (see Fig. S3 † for the examples of AMES and MNT). For instance, "aromatic reductive dehalogenation" is predicted more frequently for toxic compounds in the MNT assay (than for non-toxic compounds in this assay) while it is more oen observed for non-toxic compounds in the AMES assay (than for toxic compounds in this assay).
In an analogous way, the enzymes catalyzing biotransformations in the metabolic tree of toxic and non-toxic compounds were also investigated. Similar results as for the biotransformations were observed, but, in this case, the differences between classes were smaller (i.e. there were few enzymes metabolizing a higher percentage of toxic or non-toxic compounds).

Baseline performance of the models
To enable the (later) quantication of the added value of metabolism prediction in toxicity prediction we generated baseline models trained exclusively on physicochemical properties of the parent compounds (encoded by count-based Morgan ngerprints and RDKit physicochemical property descriptors; see Materials and methods section for details).
The mean F1 score obtained by the baseline models within 5fold CV ranged from 0.64 (for MNT) to 0.82 (for AMES; Table 5). The superior performance of the AMES baseline model (F1 score at least 0.09 higher than for any other baseline model) is attributed to the larger size of the data set (it is the biggest data set considered in this study with at least 1853 compounds more than any other data set) as well as the nature of the endpoint: the AMES test is an in vitro assay carried out on bacteria, hence representing a more simple problem than the in vivo endpoints based on living mammals and considered in this work. Among the in vivo endpoints, the model for the LLNA assay, a skin sensitization assay measuring cellular proliferation in the draining lymph nodes of mice, obtained the highest mean F1 score (0.73). The lowest F1 score (0.64) was obtained by the MNT baseline model. The precision and recall yielded by each endpoint-specic model were on a similar level in all cases, indicating a balanced ratio of false positive and false negative predictions.

Metabolite information as input descriptors for parent compounds
Molecular descriptors for metabolites. One or several chemical features present in the metabolites could be associated with the toxic effect observed for a parent compound. In an attempt to include this information in the model, molecular descriptors of the ve best-scored predicted metabolites were included as further input features for model building. These molecular descriptors include (a) count-based Morgan ngerprints, (b) RDKit physicochemical property descriptors and (c) a combination thereof (see Materials and methods for details). In cases where fewer than ve metabolites were predicted for a parent compound (between 12% and 24% of the compounds; Table 4), the remaining features were lled with zeros (in the case of the Morgan ngerprints) or with the mean value of the feature (in the case of the RDKit property descriptors). The trained models were evaluated by comparing the predicted label for each test parent compound with their experimental toxicity label within 5-fold CV.
When comparing the performance of these models containing metabolite information with that of the baseline models, no improvements of performance were observed (Table  S2 †). The few minor gains in performance did not exceed a value of +0.04 among all evaluated metrics and were not signicant (at a p-value of 0.05; Table S3 †). In several cases the addition of descriptors for the predicted metabolites led to small decreases in performance (up to a value of À0.09 among all metrics).
Biotransformation ngerprint. Our analysis of the types of biotransformations recorded for toxic and non-toxic compounds (see "Analysis of metabolites originating from toxic and non-toxic parent compounds") found indications that this information could be utilized to enhance toxicity prediction. Therefore, we derived a biotransformation ngerprint which encodes the number of occurrences of each biotransformation in the predicted metabolic tree of a compound. In combination with the molecular descriptors calculated for the parent compounds, this biotransformation ngerprint was used for the training of machine learning models (see the Materials and methods section for details).
Within the 5-fold CV framework, the performance of these models was comparable to the baseline performance of each endpoint. For all evaluated metrics the difference from the baseline performance did not exceed AE0.01 (Tables S4 and  S5 †). The lack of an improvement in performance may be related to the sparsity of the biotransformation ngerprint: most of the biotransformations were not predicted to take place on more than 10% of the compounds. This low coverage of compounds may not be sufficient to enhance toxicity prediction. In order to remove possible noise caused by the sparse ngerprints, feature selection with a lasso model was applied to all input features (in order to discard irrelevant features prior to the training of the RF model). However, no relevant improvement in the performance compared to the baseline models was observed when feature selection was included prior to model training (F1 score deviations ranged from À0.05 to +0.01 among all endpoints).

Combination of predicted probabilities for parent compounds and metabolites
Another approach for considering metabolite information in toxicity prediction is the calculation of an "Overall predicted probability of toxicity" by combining the probabilities predicted for the parent compounds and their metabolites. A related approach (although based on distinct modeling methods and utilizing measured metabolites; explored for different endpoints) was applied, with some success, by Dmitriev et al. 2 and Filimonov et al. 3 (see the Introduction section for details).
In this work, we explored four strategies to combine prediction probabilities: Strategy 1: mean of the probabilities of the parent compound and all predicted metabolites.
Strategy 2: median probability of the parent compound and all predicted metabolites.
Strategy 3: maximum probability among the parent compound and all predicted metabolites.
Strategy 4: mean between the predicted parent compound probability and the maximum probability among all metabolites (i.e. the probability of the metabolite that the model deems most likely to be toxic, among all predicted metabolites).
To evaluate model performance, the obtained "Overall probability of toxicity" (derived by the different strategies) was compared to the experimental toxicity label of each parent   compound (within 5-fold CV; see the "Data splitting" section for details). Note that all predicted metabolites (not only the ve best-scored metabolites) were considered here. The four strategies were applied to two approaches that differ in the underlying models used for calculating the predicted probabilities (Fig. 4ii). In the baseline-approach, we applied the baseline models on the test parent compounds and their metabolites and combined them with each of the abovementioned strategies. With strategy 1, strategy 2 and strategy 3, a drop in F1 score and MCC was observed for all investigated endpoints. Strategies 1 and 2 especially showed a decrease in recall (up to À0.17), which was sometimes compensated, to some extent, by an increased precision (up to +0.04), while the opposite effect was observed for strategy 3 (Table S6 †).
Out of the four strategies, the best classication performance was obtained, in general, with strategy 4. However, the gain in F1 scores compared to the respective baseline models was 0.02 or less (and hence not signicant, according to the Mann-Whitney U test; see Table S7 † for details). Compared to strategies 1 and 2, strategy 4 may provide a well-balanced compromise between an improved capacity to detect toxicity related to metabolism and noise introduced by the predicted metabolites. A similar result was also observed in the study by Dmitriev et al.,2 where several strategies to combine the predicted LD 50 value (for acute rat toxicity) for parent compounds and their measured metabolites were investigated (mean of the predicted LD 50 of all metabolites; mean of the predicted LD 50 of the parent compound and all metabolites; maximum predicted LD 50 among all metabolites; mean of the predicted LD 50 of the parent compound and the most toxic metabolite). In agreement with our observations, Dimitriev et al. obtained their best results for the prediction of acute rat toxicity when taking the mean of the predicted LD 50 for the parent compound and that of the most toxic metabolite. Also the increase in model performance (compared to taking the prediction of the parent compound only) in their case was minor (+0.03 in R 2 and no differences in RMSE).
In the hybrid-approach, the predicted probabilities of the metabolites to be toxic were calculated with a dedicated model. We addressed the possibility that the absence of relevant improvement by the four above-mentioned strategies was due to a decient coverage of the chemical space of the metabolites by the baseline model. The differences observed in the chemical space of parent compounds and metabolites (see the section "Analysis of the chemical space of the parent compounds and predicted metabolites" for details) could indicate that some metabolites fall outside the applicability domain of the models trained only on parent compounds (baseline models).
To expand the chemical space coverage of the models and try to improve the toxicity predictions for the metabolites, models including metabolites as input data (i.e. with their molecular descriptors as input features and the assigned toxicity as class label) were also developed (metabolism-aware models). The toxicity label of the metabolites for these models was assigned following the workow described in the section "Assignment of toxicity labels to metabolites". Instead of applying this straightforward labeling approach, the toxicity labels of the metabolites could have also been predicted with the baseline model. However, we did not investigate this option further as it would increase the complexity of the workow and does not t the purpose of this study. By labeling the metabolites we pretend to analyze whether the reason for the small model performance improvement is due to poor quality of the predicted probabilities of toxicity of the metabolites. Hence, predicting the toxicity label of the metabolites would suffer from the same limitation. We acknowledge that any manual or automatic metabolite labeling approach is a limitation of this study. The only way to overcome this limitation is the use of a large dataset of metabolites with measured toxicities. However, to our best knowledge no such dataset is in existence in the public domain.
With the hybrid-approach we aim to obtain the best predictions for each compound by predicting the probability of the parent compound to be toxic with the baseline model, and the probability to be toxic of the individual metabolites with the metabolism-aware model. Note that we also investigated the possibility to predict both the toxicity of the parent compound and the metabolites with the metabolism-aware model, but we did not see a relevant improvement compared to the baselineor hybrid-approaches in this case and therefore did not further investigate this direction.
Compared to the baseline-approach, the hybrid-approach yielded better results in toxicity prediction. However, with improvements in the F1 scores and MCCs not exceeding 0.03 and 0.05, respectively, these results are not signicantly better (based on the Mann-Whitney U test) than those obtained with the baseline model (Table 6). Few signicant improvements were recorded for precision or recall for the MNT and DICC models (Table S8 †).
The decrease in performance with strategies 1 and 2 (considering the predictions of all metabolites) in combination with the hybrid-approach was in general not as drastic as with the baseline-approach. This may indicate that the predicted probabilities for the metabolites were more accurate and did not include as much noise in the overall prediction. Again in this case, the best performance was observed with strategy 4 (averaging the probability of the parent compound and the most toxic metabolite), with only minor improvements in the F1 score of up to +0.03. Only for the DILI endpoint the F1 score decreased (by À0.02) with this strategy.
In addition, we analyzed whether the improvements in model performance may be limited due to the consideration of metabolites that are irrelevant to the observed toxic effect. In order to reduce the noise in the prediction caused by these metabolites, we applied several metabolite lters removing predicted metabolites that (a) have a low Meteor Nexus prediction score, (b) have a low calculated log P, or (c) are predicted to be further metabolized by conjugating enzymes (Fig. 4i).
Metabolites predicted with a low score by Meteor Nexus may be less likely to be observed in vivo and hence irrelevant to toxicity prediction. Metabolic reactions oen lead to compounds with low log P values, making them more water soluble and therefore easier to excrete. These metabolites are also unlikely to cross membranes and they are less likely to induce toxic effects. Along the same lines, phase II metabolism facilitates the conjugation of compounds with polar moieties, making them more water soluble. It has already been observed that only few conjugation reactions lead to toxic metabolites. 43 Following this reasoning, several thresholds for removing metabolites based on their Meteor Nexus score as well as calculated log P values were investigated. Also strategies to remove metabolites formed by phase II reactions, or remove metabolites which are further transformed by phase II reactions were explored. A grid search over all ltering possibilities (and all above-mentioned approaches and strategies) was conducted on each data set to obtain the most favorable combinations.
In most cases, reducing the number of metabolites considered for the prediction based on these parameters did not yield better models. Among the ve top-ranked models (based on the F1 score) of the grid search, only in a few cases, minor improvements of up to +0.06 among all metrics and endpoints were observed (Table S9 †). However, these performance improvements were not signicant for any endpoint compared to the baseline performance (Table S10 †).

Exploration of further modeling approaches with the hybridapproach
To evaluate whether the predictions may be improved by optimizing the modeling approach, different machine learning modeling methods with optimized hyperparameters (within a grid search; see Materials and methods section for details) and a further, distinct set of descriptors (CDDD descriptors) 41 were investigated at the example of the best performing approach, namely the hybrid-approach.
The F1 score obtained for the following machine learning setup combinations is shown in Table S11: † RF, gradient boosted trees and k-nearest neighbors, each with and without the use of oversampling with SMOTENC (based on Morgan ngerprint and RDKit physicochemical descriptors as input descriptors). Moreover, the performance of RF models trained on CDDD descriptors (including oversampling with SMOTE) are also provided.
The results obtained with these new models do not deviate from those obtained with the models generated with the initial modeling setup (i.e. RF with xed hyperparameters; combination of Morgan ngerprints and RDKit physicochemical descriptors; oversampling with SMOTENC; results reported in Table 6): the largest observed improvement in F1 scores yielded by the new models was of just +0.01. The conclusions derived in the 'Combination of predicted probabilities for parent compounds and metabolites' section remain consistent with the new results. The explicit incorporation of predicted metabolite information in toxicity prediction models did not signicantly improve the toxicity predictions of these models either. Although there was oen no benet compared to the baseline models (or the benet was small), the best strategy for combining the predicted probabilities of parent compounds and metabolites was, also in this case, strategy 4 (taking the mean between the predicted probability of the parent compound and the maximum probability among all predicted metabolites).

Conclusions
In this work we systematically investigated a variety of strategies to enhance toxicity prediction by taking into account xenobiotic metabolism. Our results show that none of these strategies produces models that consistently outperform others. The best results were obtained by averaging the probability of toxicity predicted for the parent compound and the maximum probability of toxicity predicted for any metabolite. This approach yielded models with F1 scores up to +0.03 higher than the baseline models disregarding metabolism. We observed that models trained exclusively on the parent compounds oen produce poor predictions for the metabolites as their chemistry oen differs. Including labeled metabolites in the training set of the models slightly improved the predictions of toxicity for the metabolites and hence the overall result of averaging the probabilities of toxicity for parent compounds and their metabolites. In some cases, discarding unlikely or water-soluble metabolites slightly improved the predictions (F1 score up to +0.04 higher than for the baseline models).
While metabolites can be key to detecting and understanding toxicity, they also add a new layer of complexity. The metabolites formed, their concentrations in the organism, and their excretion kinetics are oen unknown. Therefore, including metabolism data in toxicity prediction poses veritable challenges. The fragile balance between added signal and added noise, when working with predicted metabolites in machine learning, may explain the small differences in performance of the models including metabolism information for toxicity prediction compared to the baseline models. It is clear from these results that there is still a long way to go in the development of sufficiently accurate models for metabolism prediction which, in turn, can boost toxicity prediction.

Data availability
All data sets used in this study are publicly available. Due to licensing reasons, the original data and the predicted metabolites cannot be provided with this publication. However, a detailed protocol for the reproducible collection and preprocessing of the data utilized in this work is provided in the Materials and methods section. Moreover, Table S1 † contains links for downloading the original data and complementary information about the data sets. Also detailed KNIME work-ows used for preprocessing each data set and calculating the chemical descriptors of the parent compounds are provided in the ESI. † The workows and parameters used for developing the models and necessary for reproducing the results are described in detail in the Materials and methods section. The code used for model training and evaluation can be accessed at https:// github.com/marinaglr/metabio.

Conflicts of interest
MGL and MM are employed at BASF SE. AV served as consultant for BASF SE.