Paulina Körner,a Juliane Glüge,*a Stefan Glüge b and Martin Scheringer a
aInstitute of Biogeochemistry and Pollutant Dynamics, ETH Zürich, 8092 Zürich, Switzerland. E-mail: juliane.gluege@usys.ethz.ch
bInstitute for Computational Life Science, ZHAW, 8820 Wädenswil, Switzerland
First published on 26th September 2024
The focus of this work is to enhance state-of-the-art Machine Learning (ML) models that predict the aerobic biodegradability of organic chemicals through a data-centric approach. To that end, an existing dataset that was previously used to train ML models was analyzed for mismatching chemical identifiers and data leakage between the test and training sets, and the detected errors were corrected. Chemicals with high variance between study results were removed and an XGBoost model was trained on the dataset. Despite extensive data curation, only marginal improvement was achieved in the classification model's performance. This was attributed to three potential reasons: (1) a significant number of data labels were noisy, (2) the features could not sufficiently represent the chemicals, and/or (3) the model struggled to learn and generalize effectively. All three potential reasons were examined, and point (1) seemed to be the most decisive one preventing the model from generating more accurate results. Removing data points with possibly noisy labels by performing label noise filtering using two other predictive models increased the classification model's balanced accuracy from 80.9% to 94.2%. The new classifier is therefore better than any previously developed classification model for ready biodegradation. The examination of key characteristics (molecular weight of the substances, proportion of halogens present, and distribution of degradation labels) and of the applicability domain indicates that no large share of difficult-to-learn substances was removed in the label noise filtering, meaning that the final model is still very robust.
Environmental significance
Resistance to environmental degradation is one of the characteristics of hazardous substances. Our newly developed yes/no classification model for ready biodegradation is currently the most accurate model available for organic chemicals and will enable better prediction of ready biodegradation. We also present a list of substances, called “curated_removed”, with noisy labels (uncertain degradability); these substances should no longer be used to train and test degradation models. Instead, these substances should be tested further experimentally to elucidate their biodegradation behavior.
Whether a substance is persistent is, at the regulatory level, often assessed in a first step using ready-biodegradability tests (RBT). Biodegradation is an important degradation mechanism for chemicals in the environment; it refers to the capacity of a substance to be broken down and transformed into simpler compounds by microorganisms.6 If a chemical passes the RBT, it is likely to be readily biodegradable (RB) in the environment. Conversely, chemicals that do not pass the RBT are likely to be not readily biodegradable (NRB) in the environment.3,7 However, it has also been shown that the test results depend on several factors, including the test procedure, the initial concentration of the substrate, and the activity and adaptation of the microbial population.7 Additionally, environmental conditions such as temperature, pH, and oxygen levels can impact the test results.7 Nevertheless, RBT are established as screening tests in many regulatory frameworks and form the basis for the assessment of persistence.3,8,9 Models that predict the biodegradation of substances have also been developed in the past as cheaper and less time-consuming alternatives to experimental studies.1,8,10–21 A summary of the previous work on models for ready biodegradability is provided in Section 2.
Recently, Huang and Zhang20 used a machine learning (ML) approach to build both a classification and a regression model to predict the ready biodegradability of organic substances. They gathered the largest dataset so far, with 12750 samples covering 6032 substances for regression and 6139 substances for classification. The classification dataset was based on the regression data but enhanced with data from Lunghini et al.19 The original dataset was obtained through the eChemPortal, which accesses data from the Japan Chemicals Collaborative Knowledge (J-CHECK) database, Canadian Categorization Results (CCR), the European Chemicals Agency (ECHA) database and the Organisation for Economic Co-operation and Development (OECD) Existing Chemicals Screening Information Data Sets (SIDS).22 Seven molecular fingerprints (FPs) were tested by Huang and Zhang20 as input features describing the chemicals. The addition of features containing information about chemical speciation was also examined. In total, 14 ML algorithms were tested, and the best results were achieved with the Molecular Access System key (MACCS key) as input features and an eXtreme Gradient Boosting (XGBoost) model. The XGBClassifier achieved a balanced accuracy of 84.9%. Adding further features containing information on chemical speciation, i.e., whether or not the chemical is charged, improved the balanced accuracy to 87.6%. Huang and Zhang20 used a model-centric approach with an emphasis on feature and model selection and hyperparameter tuning rather than data quality.23–25
In contrast, the Data-Centric Artificial Intelligence (DCAI) paradigm that has emerged in recent years shifts the focus towards the systematic design, engineering, and continuous improvement of data to build robust and efficient ML models. This strong focus on data quality ensures that ML models generalize better, making them more effective tools for real-world applications.23 Refining data includes enhancing the quality of individual data points and the dataset in total. Even though the model-centric and data-centric approaches are often contrasted, it is important to emphasize their complementary nature. Both paradigms should be combined to build robust ML-based systems.23
With the current paper, we intend to improve the ML model of Huang and Zhang20 by first taking a data-centric and then a model-centric approach. In particular, we want to answer the question of how important correct representations of chemical structures (Simplified Molecular Input Line Entry Specification (SMILES)) are for the model and whether it is possible to bring the model to a balanced accuracy of over 90% by improving SMILES alone. If this is not possible, the aim is to find out what is preventing the model from making better predictions – too much noise in the data labels themselves, features that cannot adequately represent the chemicals, or whether the model cannot generalize well enough. Noisy labels refer here to mislabeled or inaccurately labeled instances in the training and test datasets.26,27 To investigate these points, first, the dataset published by Huang and Zhang20 is analyzed, and all SMILES are assessed to determine whether they match the provided Chemical Abstracts Service Registry Number™ (CAS RN™). Follow-up steps include assessing the data labels and critically investigating the test and training sets of the ML model. Finally, a model-centric approach is applied to examine if other ML algorithms are more suitable for the curated dataset.
Model | Dataset size | Balanced accuracy | Sensitivity | Specificity |
---|---|---|---|---|
Howard et al. (1992)10 (non-linear) | 264 | |||
Test set | 7.4% | 88.8%* | — | — |
Boethling et al. (1994)1 (non-linear) | 295 | |||
Training set | — | 93.2%* | — | — |
Loonen et al. (1999)11 (with fragment interactions) | 894 | |||
Test set | 25% | 89%* | — | — |
Tunkel et al. (2000)12 (linear) | 884 | |||
Validation set | 33.3% | 74.9%* | — | — |
Cheng et al. (2012)13 | 1440 | |||
Test set (GASVM-kNN) | 11.4% | 81.9% | 72.6% | 91.2%
External test set (GASVM-kNN) | 27 | 53.8% | 25.0% | 82.6% |
External test set (consensus model) | 27 | 100% | 100% | 100% |
Mansouri et al. (2013)14 (consensus II) | 1055 | |||
Test set | 20% | 91%† | 89%† | 94%† |
External test set | 670 | 87%‡ | 81%‡ | 94%‡ |
Cao and Leung (2014)15 | 1055 | |||
Test set | 20% | 86.0% | 77% | 93% |
External test set | 670 | 83.5% | 74% | 93% |
Lombardo et al. (2014)16 | 728 | |||
Test set | 20% | 82.1% | 87.3% | 76.9%
External test set | 874 | 78.4% | 73.1% | 83.6% |
Blay et al. (2016)17 (ANN) | 130 | |||
Test set | 20% | 91.5% | 94.1% | 88.9% |
Zhan et al. (2017)18 (NBC) | 1055 | |||
Test set | 20% | 83.8% | 86.1% | 81.5% |
External test set | 670 | 82.6% | 79.6% | 85.6% |
Lunghini et al. (2020)19 | 3146 | |||
Test set | 30% | 81%* | — | — |
External test | 362 | 75%* | 65% | 85% |
Huang and Zhang (2022)20 | 6139 | |||
Test set | 20% | 84.9% | 89.0% | 80.9% |
Test set with chemical speciation | 20% | 87.6% | 87.8% | 87.4% |
Yin et al. (2023)21 | 1928 | |||
Test set | 26% | 87.3% | 94% | 72% |
Model | Dataset size | Accuracy | Sensitivity | Specificity |
---|---|---|---|---|
a Signifies balanced accuracy. | ||||
BIOWIN1 | ||||
Train | 295 | 89.5% | 97.3% | 76.1% |
External test set (MITI) | 884 | 65.4% | 92.7% | 44.3% |
External test set (premanufacture notices (PMN)) | 305 | 54% | 85% | 44% |
BIOWIN2 | ||||
Train | 295 | 93.2% | 97.3% | 86.2% |
External test set (MITI) | 884 | 67.5% | 86.0% | 53.3% |
External test set (PMN) | 305 | 67% | 78% | 63% |
BIOWIN5 | ||||
Test | 295 | 81.4% | 80.2% | 82.3% |
External test set (PMN) | 305 | 83% | 82% | 83% |
BIOWIN6 | ||||
Test set | 295 | 80.7% | 78.6% | 82.3% |
External test set (PMN) | 305 | 83% | 72% | 87% |
VEGA | ||||
Test set | 146 | 81.7% | 87.3% | 76.9% |
External test set | 491 | 80.7% | 75.6% | 90.7% |
OPERA | ||||
Test set | 411 | 79% | 81% | 77% |
Boethling et al. (1994)1 continued the work of Howard et al. (1992)10 and built linear and nonlinear classification models using 295 compounds and 36 molecular substructures plus molecular weight. The modeling approach was the same as in Howard et al. (1992), just using slightly different molecular substructures. No validation set was created; therefore, the performance of the two models was only reported for the training set. The linear model achieved an accuracy of 89.5% and the nonlinear model an accuracy of 93.2% on the training set. The models of Boethling et al. (1994)1 were later used for BIOWIN1 and BIOWIN2 (see Section 2.2).
Loonen et al. (1999)11 trained models on a dataset containing 894 compounds tested under the Ministry of International Trade and Industry of Japan (MITI) protocol. The chemicals were characterized by a set of 127 predefined structural fragments. Partial least squares (PLS) discriminant analysis was used for the model development. The authors pointed out that hydroxy, ester, and acid groups that were present were easily degraded, while aromatic rings and halogen substituents were not conducive to biodegradation. The average percentage of correct predictions from four external validation studies was 83%. However, no predictions were made for <10% of the substances because the calculated scores were in the borderline area between readily and not readily biodegradable. Model optimization by including fragment interactions improved the model predicting capabilities to 89%.
Tunkel et al. (2000)12 refitted the molecular substructures of Boethling et al. (1994)1 to 884 compounds tested under the MITI protocol. Two-thirds of the compounds were used for the training set and one-third for the validation set. Again, a linear and a non-linear model were developed. The linear model achieved an accuracy of 74.9% on the validation set and the nonlinear model an accuracy of 73.6%. The models of Tunkel et al. (2000)12 were later used for BIOWIN5 and BIOWIN6 (see Section 2.2).
Cheng et al. (2012)13 trained models on a dataset containing 1440 compounds tested under the MITI protocol. Different features and molecular fingerprints were used to construct Support Vector Machine (SVM), k-Nearest Neighbors (kNN), Naive Bayes (NB), and Decision Tree (DT) models. The best model (SVM with genetic algorithm – GASVM-kNN) achieved a balanced accuracy of 81.9% in 5-fold Cross-Validation (CV). The best seven combinations of models and features and a consensus model were also tested on 27 new chemicals, which were experimentally tested for their biodegradability under the Japanese MITI test protocol. The consensus model and two of the other models predicted the test results of all 27 substances 100% correctly.13 In contrast, the formerly best model (GASVM-kNN) only achieved a balanced accuracy of 53.8%.
Mansouri et al. (2013)14 trained kNN, Partial Least Squares Discriminant Analysis (PLSDA), and SVM models on a dataset of 1055 experimental biodegradation data points. The dataset originated from the National Institute of Technology and Evaluation of Japan (NITE) and underwent thorough data screening and improvement. Additionally, an external test set of 670 substances was created based on data from Cheng et al. (2012)13 and the Canadian DSL database. All three models (kNN, PLSDA, and SVM) performed similarly well. Mansouri et al. (2013)14 created two consensus models based on the three models. The first consensus model assigned each substance the most common label predicted by the three models. The second consensus model only assigned a class to a substance if the three models agreed on one label. The second consensus model performed best, achieving an accuracy of 91% and 87% on the test and external test set, respectively. However, it only made predictions for 85% of the test set and 87% of the external test set, since a molecule was left unassigned when the three models disagreed. Overall, all models showed conservative behavior, with a higher specificity than sensitivity.14
Cao and Leung (2014)15 used the data of Mansouri et al. (2013)14 and introduced the differential evolution (DE) algorithm into the SVM to optimize the parameters of the classifier in order to produce an improved classifier called DE-SVC. The DE-SVC had a slightly lower performance than the consensus II model of Mansouri et al. (2013)14 but was able to classify all substances, which was not the case for the consensus model II of Mansouri et al. (2013).14
Lombardo et al. (2014)16 built a decision tree with a set of seven rules based on 728 compounds that were split into a training set (80%) and an internal test set (20%). Additionally, a set of 874 compounds originating from the study of Cheng et al. (2012)13 was used as an external test set. The fragments for this model were derived both from a statistical part (SARpy) and an expert-based part. The balanced accuracy was 82.1% on the internal test set and 78.4% on the external test set. The model of Lombardo et al. (2014) was later used for VEGA (see Section 2.2).16
Blay et al. (2016)17 developed ready-biodegradability prediction models for fragrances using 130 compounds. They applied linear discriminant analysis (LDA) and artificial neural networks (ANNs) to build two classification models. For external validation, a random set of molecules was held out before training; this hold-out set contained 20% of the original dataset of 130 molecules. Additionally, the LDA model was validated internally using 5-fold cross-validation. The LDA model had a balanced accuracy of 86.5% based on the 5-fold cross-validation. The ANN had a balanced accuracy of 91.5% on the external validation set.
Zhan et al. (2017)18 developed a naïve Bayesian classifier (NBC) to classify the 1055 compounds from Mansouri et al. (2013).14 Three representative structure partitioning methods, including Murcko framework, Scaffold Tree and a scheme based on different complexities of ring combinations and side chains, were used to characterize the structural features of the studied molecules. About 284 RB and 553 NRB chemicals (80%) served as training set and the remaining chemicals as the test set I. In addition, the test set II collected by Mansouri et al. (2013)14 was also used. The best descriptors achieved a balanced accuracy of 85.6% on test set I and 83.8% on test set II, respectively.
Lunghini et al. (2020)19 created a new ready biodegradability dataset by curating and combining data from multiple data sources and additional industry data. This new dataset contained 3146 data points. Furthermore, an additional test set was created based on data from Cheng et al. (2012)13 and Mansouri et al. (2013).14 Lunghini et al. (2020) trained three models based on SVM with linear and Radial Basis Function Kernels (RBF kernels), Random Forest (RF) and NB. Finally, a consensus model was created, which makes a decision based on the majority vote of the three sub-models. The consensus model had balanced accuracies of 81 ± 1.4% on the test set and 75% on the external test set.
Yin et al. (2023)21 trained models on a dataset containing 1928 compounds, of which 1424 were used in the training set and 504 in the test set. CORINA descriptors, MACCS fingerprints, and ECFP_4 fingerprints were utilized to characterize the molecules and, after filtering, were used as input features for the models. Models were built using the SVM, DT, RF, and deep neural network (DNN) algorithms. In addition, models based on Graph- and Transformer-CNN architectures were constructed. The best performing model (Transformer-CNN with 77 MACCS key fingerprints) achieved a balanced accuracy of 87.3%.
The models from the scientific work are summarized in Table 1.
The predictive models BIOWIN1 and BIOWIN2 were trained on a dataset of only 295 substances. The BIOWIN1 model is based on multiple linear regression, while the BIOWIN2 model is based on logistic regression.1,10,28 Therefore, they are also called linear and non-linear models, respectively. BIOWIN1 and BIOWIN2 have a reported accuracy of 65% and 67% on an external test set containing 884 substances, and a reported accuracy of 54% and 67% on an external test set containing 305 substances, respectively.
The models BIOWIN5 and BIOWIN6, which are also part of EPI Suite™, were developed with a similar approach as BIOWIN1 and BIOWIN2 but were trained on a dataset of 884 discrete organic substances from the MITI ready biodegradation tests.12,28 Again, multiple linear regressions were performed to obtain a linear model, BIOWIN5, and a logistic regression was fitted to create a non-linear model, BIOWIN6. BIOWIN5 and BIOWIN6 have a reported accuracy of 83% on an external test set containing 305 substances.
VEGA, which stands for Virtual models for property Evaluation of chemicals within a Global Architecture, is a non-proprietary and openly available tool designed to predict the ready biodegradability of chemical compounds.29 The model behind VEGA is based on Lombardo et al. 2014 and a dataset of 728 mono-constituent organic substances tested according to the OECD 301C Modified MITI(I) Test. An external testing dataset was extracted from Cheng et al. (2012).13 VEGA was developed based on 78 substructures statistically related to ready biodegradability, which were extracted using expert knowledge and the SARpy software.16,29 VEGA's performance scores are similar to the performance of BIOWIN5 and BIOWIN6 on the test set and the external test set.16
OPERA is a freely accessible application that contains Quantitative Structure–Activity Relationship (QSAR) models to predict thirteen different physicochemical and environmental fate properties of organic chemicals.8 Among those thirteen models is a model for assessing the ready biodegradability of organic substances.30 The biodegradation model is based on data from the PHYSPROP database. To ensure data quality, a workflow was utilized for data curation, which involved standardizing chemical structures, correcting the identity of chemicals, and only selecting high-quality data.8 The data curation resulted in a dataset of 1609 substances. The ten most impactful molecular descriptors were calculated using PaDEL, an open-source software for calculating molecular descriptors and FPs.30 The model was trained using a weighted k-nearest neighbor approach and was validated using 5-fold CV.8 The OPERA model predicted the ready biodegradability of the substances in the test set with a balanced accuracy of 79%, a sensitivity of 81%, and a specificity of 77%.8,30
The open-access applications are summarized in Table 2.
First, the CAS RN™ and their corresponding SMILES were split into two groups based on whether or not they were verified by Glüge et al.31 In cases where a CAS RN™ was included in the Gluege-Dataset, the verified and valid SMILES for this CAS RN™ from the Gluege-Dataset was used as the SMILES for this data point. Furthermore, for the substances in the Gluege-Dataset, it was also checked if the experimental study was based on read-across. Studies based on read-across were removed. For the substances not checked by Glüge et al.,31 valid SMILES had to be retrieved. For one-component substances, the SMILES were retrieved via an Application Programming Interface (API) based on the CAS RN™ from CAS Common Chemistry.32 For the remaining substances, a weight-of-evidence approach was taken. The SMILES had to be found from at least two independent sources. If this was not possible, the substance was removed from the dataset.
Once the SMILES were found by CAS RN™, further processing steps were performed. Mixtures and organometallic substances were removed, and all counterions were removed from the SMILES representations. For stereoisomers, the SMILES of one stereoisomer was randomly selected. Furthermore, for all ionizable substances, the retrieved SMILES was replaced with the SMILES of the substance's dominant species at pH 7.4 and 298 K. The dominant species was retrieved from the pKa plugin in MarvinSketch 22.18 by using the option “show distribution chart” in “macro” mode.33 Substances were removed when no dominant species existed under the specified conditions. Huang and Zhang20 did not adjust the SMILES of ionizable substances but rather introduced extra features (pKa and α-values) that represent the chemical speciation of the substances. They reported an increase in balanced accuracy from 84.9% to 87.6% when including pKa and α-values as extra features. However, using the same model, we could not reproduce this performance increase (see Table S6 in ESI-I†). Therefore, we did not include information on chemical speciation directly as features. However, this information is reflected in the SMILES.
To overcome these problems, we applied two existing estimation models for biodegradability to filter the data for label noise. Specifically, BIOWIN5 and BIOWIN6 were used to identify and filter out data points with noisy labels (see also Section 2). If BIOWIN5 and/or BIOWIN6 disagreed with the experimental study result, the substance was removed from the CuratedSCS dataset. The resulting dataset was called CuratedBIOWIN. All substances that were removed in this step were grouped in the CuratedProblematic dataset (Fig. 2). The substances from the CuratedProblematic dataset were afterward tested with a third classifier, an XGBClassifier trained on the CuratedBIOWIN dataset. In cases where the third classifier agreed with the experimental label of a substance in CuratedProblematic, that substance was re-added to the CuratedBIOWIN dataset. Otherwise, the substance remained removed. This led to the creation of the CuratedFinal and the CuratedRemoved datasets. The workflow is also summarized in Table S2 in the ESI-1.†
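The filtering and re-adding logic can be sketched in plain Python (the function names and the dictionary-based data layout are illustrative, not taken from the authors' code; the BIOWIN predictions and the third classifier are stand-ins):

```python
def split_by_biowin(data, biowin5, biowin6):
    """Split substances by agreement with BIOWIN5/6 predictions.

    data:              dict mapping substance -> experimental label ("RB"/"NRB")
    biowin5, biowin6:  dicts mapping substance -> predicted label
    Returns (curated_biowin, curated_problematic).
    """
    curated_biowin, curated_problematic = {}, {}
    for substance, label in data.items():
        # Keep only substances where both BIOWIN models agree with the study
        if biowin5[substance] == label and biowin6[substance] == label:
            curated_biowin[substance] = label
        else:
            curated_problematic[substance] = label
    return curated_biowin, curated_problematic

def readd_with_third_classifier(curated_biowin, curated_problematic, predict):
    """Re-add problematic substances whose label the third classifier confirms."""
    curated_final = dict(curated_biowin)
    curated_removed = {}
    for substance, label in curated_problematic.items():
        if predict(substance) == label:
            curated_final[substance] = label    # third classifier agrees
        else:
            curated_removed[substance] = label  # stays removed
    return curated_final, curated_removed
```

With a toy dataset of three substances, a substance is retained only if both BIOWIN stand-ins confirm its label, and a rejected substance returns only if the third classifier confirms it.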
The analysis explored three key characteristics: the molecular weight of the substances, the proportion of halogens present, and the distribution of the biodegradation labels. Further, the Applicability Domain (AD) of the models trained on the three datasets was determined using the Tanimoto similarity. The Tanimoto similarity calculates similarities between two chemicals based on the number of common molecular fragments.20,38 The defined ADs were then used to evaluate how many of the substances in the Distributed Structure-Searchable Toxicity (DSSTox) database are in the ADs of the models. As Huang and Zhang20 did the same for their model, it was possible to compare the broadness of the ADs of the models. More information regarding the similarity threshold and how the AD was applied to the DSSTox database is given in Section S4 in ESI-I.†
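The Tanimoto similarity used here reduces to a simple set operation when a fingerprint is represented as the set of its "on" bit positions (a generic sketch, not the authors' implementation):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of
    'on' bit indices (e.g. MACCS key positions): |A ∩ B| / |A ∪ B|."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Two chemicals sharing 2 of 4 distinct fragments -> 2/4 = 0.5
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```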
Finally, the feature space of the Curated-Datasets was visualized and analyzed using Uniform Manifold Approximation and Projection (UMAP). UMAP is a dimensionality reduction technique used to visualize high-dimensional data in a lower-dimensional space that preserves the underlying structure of the data.39 UMAP was also used to evaluate the impact of using three different chemical representations as model input. More information on the three chemical representations tested can be found in ESI-1† Section S7.2.
To compare the performance metrics of models trained on the different Curated-Datasets, the test sets were kept fixed for all models. Maintaining a consistent test set is typically recommended when a data-centric approach is used and the dataset is augmented. This ensures that any observed changes in model performance are genuinely attributed to data augmentation rather than variations in the test sets. Due to the limitations found and a lack of information regarding the original test set used by Huang and Zhang,20 the test sets were derived from the CuratedSCS dataset. However, a partial objective of the data augmentation was to eliminate data points with noisy labels. Therefore, the models were also tested on fixed test sets from the CuratedBIOWIN dataset.
All models trained on the different datasets were evaluated using these identical test sets. To do so, the CuratedSCS or the CuratedBIOWIN dataset was randomly split into five training (80%) and test (20%) sets using a random seed of 42. Stratified splitting was used to maintain an approximate class distribution across all training and test subsets (cf.44 Ch. 7.10). Further, this ensures that every sample from the dataset was in the test set once. The test sets were then employed as the test sets for all datasets. The training sets were constructed for each dataset by removing the substances of the test set from the dataset.
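The stratified splitting described above can be sketched in pure Python (an illustrative implementation; in practice scikit-learn's StratifiedKFold with random_state=42 provides the same behavior):

```python
import random

def stratified_kfold(labels, n_splits=5, seed=42):
    """Return n_splits test-index lists for stratified k-fold splitting.

    Indices of each class are shuffled with a fixed seed and dealt
    round-robin across folds, so every sample appears in exactly one
    test fold and class proportions are approximately preserved.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds

# Toy dataset roughly matching the ~35/65 RB/NRB class balance
labels = ["RB"] * 35 + ["NRB"] * 65
folds = stratified_kfold(labels)
print([len(fold) for fold in folds])  # [20, 20, 20, 20, 20]
```

Each 20% test fold then contains 7 RB and 13 NRB samples, mirroring the overall class distribution.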
LazyPredict offers the capability to evaluate the performance of nearly all estimators from the SKLEARN library on a given dataset.45 SKLEARN is an open-source ML library containing diverse algorithms.46 Beyond the SKLEARN estimators, LazyPredict also assesses the performance of XGBoost. The outcome of LazyPredict is a table that ranks the most effective models, showing their performance metrics alongside the time taken in seconds for fitting the model on the provided dataset.45
3721 of the substances in the Huang-Classification-Dataset were also in the Gluege-Dataset. It was found that for approximately 20% of these substances, the SMILES added by Huang and Zhang20 converted to InChI™ that did not match the InChI™ associated with the CAS RN™. 5.0% of the SMILES added by Huang and Zhang20 did not even convert into InChI™ with the same chemical formula as the InChI™ corresponding to the CAS RN™ and 8.5% did not have the same InChI™-main-layer. Examples of added SMILES that were not according to the CAS RN™ can be found in Table S3 in ESI-1.† Table S4 in ESI-1† shows examples of substances that appeared in the Huang-Classification-Dataset multiple times with different versions of the same SMILES. We concluded overall that the quality of the SMILES is not sufficient to continue working with them. Therefore, all SMILES that were added by Huang and Zhang20 were removed and new validated SMILES were added (see Section 3.2).
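The layer comparison used in this check can be illustrated without chemistry software: a standard InChI is a "/"-separated sequence of layers, where the element after the header is the chemical formula and the main layer comprises the formula plus the connectivity (c) and hydrogen (h) layers. A sketch (the two InChI strings are for ethanol and dimethyl ether, a pair with identical formula but different main layer):

```python
def inchi_layers(inchi: str) -> dict:
    """Split an InChI string into its layers. The formula layer has no
    prefix letter; subsequent layers start with 'c', 'h', 'q', etc."""
    parts = inchi.split("/")[1:]  # drop the "InChI=1S" header
    layers = {"formula": parts[0]}
    for part in parts[1:]:
        layers[part[0]] = part[1:]
    return layers

def same_formula(a: str, b: str) -> bool:
    return inchi_layers(a)["formula"] == inchi_layers(b)["formula"]

def same_main_layer(a: str, b: str) -> bool:
    """Main layer = formula + connectivity (c) + hydrogen (h) layers."""
    la, lb = inchi_layers(a), inchi_layers(b)
    return all(la.get(k) == lb.get(k) for k in ("formula", "c", "h"))

ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
dme = "InChI=1S/C2H6O/c1-3-2/h1-2H3"
print(same_formula(ethanol, dme))     # True: both C2H6O
print(same_main_layer(ethanol, dme))  # False: different connectivity
```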
Dataset | Data points | RB | NRB |
---|---|---|---|
Huang-Classification-Dataset | 6139 | 34.9% | 65.1% |
CuratedSCS | 5185 | 34.6% | 65.4% |
CuratedBIOWIN | 3864 | 31.9% | 68.1% |
CuratedProblematic | 1321 | 42.6% | 57.4% |
CuratedFinal | 4371 | 31.7% | 68.3% |
CuratedRemoved | 814 | 50.1% | 49.9% |
A third classifier (an XGBClassifier trained on the CuratedBIOWIN dataset) was consulted for the substances in the CuratedProblematic dataset to make a decision. According to this third classifier, 507 substances were added back to the CuratedBIOWIN classification dataset. The CuratedFinal dataset thus contains 4371 substances and the CuratedRemoved dataset 814 substances (Table 3).
Fig. 3 Model balanced accuracy reported by Huang and Zhang20 and balanced accuracies for the XGBClassifiers trained on the Huang-Classification-Dataset and the Curated-Datasets. The trained classifiers were tested five times on fixed test sets from (a) the CuratedSCS and (b) the CuratedBIOWIN dataset. The definition of “balanced accuracy” is given in ESI-1† Section S1.8. |
The model trained on the CuratedSCS dataset also has a balanced accuracy of 80.9 ± 1.7%. The XGBClassifiers trained on the CuratedBIOWIN and the CuratedFinal datasets show balanced accuracies of 78.3 ± 0.9% and 79.4 ± 0.8%, respectively.
Fig. 3a shows that despite correcting dataset limitations such as incorrect CAS RN™–SMILES pairings or removing read-across studies, no improvement was observed in the performance of the classifiers trained on the Huang-Classification-Dataset and the CuratedSCS dataset. Furthermore, removing substances with potentially wrong labels also did not increase the performance of the model. This might be attributed to three different reasons: (1) a significant portion of the data points in the test sets may have noisy labels due to high variance in experimental studies, (2) the features used may not cover all information required to predict ready biodegradability, and (3) the model may have been unable to learn and generalize well enough to make correct predictions for difficult-to-predict substances.
In fact, no dataset of substances containing only accurate labels could be identified. For the majority of substances, only one experimental study result for ready biodegradation carried out over 28 days exists. The substances with multiple such test results could often not be labeled with certainty because conflicting study results exist.
However, to build robust models, label noise should be reduced as much as possible. Therefore, the model performance was also evaluated on test sets derived from the CuratedBIOWIN dataset as shown in Fig. 3b. The Replicated-Huang-Classifier had a balanced accuracy of 88.0 ± 1.3% and, therefore, performed similarly to the best-performing classifier reported by Huang and Zhang.20 The XGBClassifier trained on the CuratedSCS dataset performed similarly with a balanced accuracy of 88.4 ± 2.1%.
The model trained on the CuratedFinal dataset, the CuratedFinal-Classifier, was the best-performing model and achieved a balanced accuracy of 94.2 ± 1.2%, a sensitivity of 91.6 ± 2.8%, and a specificity of 96.9 ± 0.4%. Therefore, the CuratedFinal-Classifier showed a higher performance than any other previously published classifier (see also Section 2). The classifier trained on the CuratedBIOWIN dataset only had a slightly lower balanced accuracy of 93.7 ± 1.0%. The performance metrics of all classifiers are provided in Table S7 in ESI-1.†
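As a reminder, balanced accuracy is the mean of sensitivity and specificity, which makes the reported metrics easy to cross-check (a generic sketch with illustrative confusion-matrix counts):

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Balanced accuracy = (sensitivity + specificity) / 2."""
    sensitivity = tp / (tp + fn)  # true-positive rate (RB correctly found)
    specificity = tn / (tn + fp)  # true-negative rate (NRB correctly found)
    return (sensitivity + specificity) / 2

# Illustrative counts: 90/100 RB and 95/100 NRB correct -> 0.925
print(balanced_accuracy(tp=90, fn=10, tn=95, fp=5))

# The reported 94.2% is consistent with the reported sensitivity (91.6%)
# and specificity (96.9%): (0.916 + 0.969) / 2 = 0.9425
print((0.916 + 0.969) / 2)
```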
However, it has to be kept in mind that the models were tested on test sets from the CuratedBIOWIN dataset, which underwent label noise filtering. This might have introduced bias, or the difficult-to-predict data points could have been removed. A thorough analysis of the new datasets was therefore carried out to understand if this is the case.
Table 4 Share of the substances in the CuratedSCS dataset that remain in the CuratedBIOWIN and CuratedFinal datasets, overall and by molecular weight, halogen content and biodegradation class

| Characteristic | CuratedSCS | CuratedBIOWIN | CuratedFinal |
|---|---|---|---|
| All substances | 100% | 74.5% | 84.3% |
| Molecular weight | | | |
| 0 to 250 Da | 100% | 68.9% | 80.8% |
| 250 to 500 Da | 100% | 79.7% | 87.0% |
| 500 to 750 Da | 100% | 75.4% | 83.1% |
| 750 to 1000 Da | 100% | 87.0% | 88.0% |
| 1000 to 2000 Da | 100% | 62.9% | 65.7% |
| Halogens | | | |
| F | 100% | 86.1% | 96.6% |
| Br | 100% | 79.4% | 93.4% |
| Cl | 100% | 85.4% | 92.0% |
| Biodegradation class | | | |
| NRB | 100% | 75.9% | 86.8% |
| RB | 100% | 67.4% | 75.6% |
Table 4 shows that there are slight compositional differences between the CuratedSCS dataset and the CuratedBIOWIN and CuratedFinal datasets. The largest difference was observed for the group “1000 to 2000 Da”, with a difference of up to 19% compared to “all substances”. All other sub-groups in Table 4, however, lie within −9% to +13% of the “all substances” percentages, meaning that no characteristic or chemical group among them could be identified that was disproportionately over- or under-represented in the CuratedBIOWIN or CuratedFinal datasets relative to the CuratedSCS dataset. Therefore, based on the three characteristics analyzed, the CuratedProblematic and CuratedRemoved datasets can be considered very similar in composition to the CuratedBIOWIN and CuratedFinal datasets.
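The retained-share figures of the kind tabulated in Table 4 amount to a simple set comparison per subgroup. A minimal sketch with invented substance identifiers and molecular weights (the real datasets hold thousands of CAS RN™/SMILES entries):

```python
# Hypothetical substances as (identifier, molecular weight in Da) pairs
curated_scs = [("A", 120), ("B", 300), ("C", 610), ("D", 180), ("E", 190)]
curated_final = {"A", "C", "D"}  # identifiers kept after label noise filtering

def retained_share(entries, subset, mw_range=None):
    """Percentage of `entries` (optionally restricted to a molecular-weight
    range) whose identifier also appears in `subset` -- the quantity
    tabulated per subgroup in Table 4."""
    if mw_range is not None:
        lo, hi = mw_range
        entries = [(i, mw) for i, mw in entries if lo <= mw < hi]
    kept = sum(1 for i, _ in entries if i in subset)
    return 100.0 * kept / len(entries)

overall = retained_share(curated_scs, curated_final)           # 3 of 5 -> 60.0%
low_mw = retained_share(curated_scs, curated_final, (0, 250))  # 2 of 3 -> 66.7%
```

Comparing these shares across subgroups to the "all substances" share is what reveals whether any group was removed disproportionately.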
Fig. 4 shows the visual representations of the datasets after dimensionality reduction with the UMAP algorithm. Subplots (a–c) show the results for the CuratedBIOWIN and CuratedProblematic datasets. Subplots (d–f) show the results for the CuratedFinal and CuratedRemoved datasets.
Huang and Zhang20 had found that 98.4% of the substances in the DSSTox database would fall within the AD of the Huang-Classifier. For the XGBClassifier trained on the CuratedSCS, CuratedBIOWIN, and the CuratedFinal datasets, it was found that 97.9%, 97.3%, and 97.7% of the substances in the DSSTox database are in the AD, respectively. Therefore, reducing the dataset size due to curation based on the CAS RN™–SMILES pairings and removing data points with noisy labels did not significantly reduce the AD. This indicates that the substances in the curated datasets comprise a similarly broad chemical space as the substances in the Huang-Classification-Dataset. If the substances in the CuratedProblematic and CuratedRemoved datasets had different structural characteristics, the AD of the XGBClassifier trained on the CuratedBIOWIN and CuratedFinal datasets should have been much narrower than the reported AD of the Huang-Classifier. However, one has to note that the Tanimoto Index is based on molecular fragments. If these molecular fragments do not cover certain properties of substances (such as intramolecular hydrogen bonds), then the AD would also not reveal if substances with these properties were excluded from our test set.
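The AD check described above can be sketched as a nearest-neighbour Tanimoto comparison on fingerprint on-bits. The similarity threshold of 0.3 below is an arbitrary illustration value, not the cut-off used in this work, and the fingerprints are toy bit sets rather than real molecular fingerprints:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto index on the sets of on-bits of two molecular fingerprints:
    |intersection| / |union|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def in_applicability_domain(query_fp, training_fps, threshold=0.3):
    """A substance is considered inside the AD if its maximum Tanimoto
    similarity to any training substance reaches the threshold
    (0.3 is an assumed illustration value)."""
    return max(tanimoto(query_fp, fp) for fp in training_fps) >= threshold

train = [{1, 2, 3, 4}, {5, 6, 7}]
assert tanimoto({1, 2, 3}, {2, 3, 4}) == 0.5       # 2 shared bits / 4 total
assert in_applicability_domain({1, 2, 9}, train)    # max similarity 0.4
assert not in_applicability_domain({8, 9, 10}, train)  # max similarity 0.0
```

Because the comparison operates purely on fragment bits, a property not encoded in the fragments (such as an intramolecular hydrogen bond) is invisible to this check, which is exactly the caveat noted above.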
Points (2) and (3) were addressed in the model-centric approach. For point (2), four different feature creation methods were tested. The resulting number of features per data point ranged from 167 to 2048. However, none of the methods led to an improved model performance for the CuratedFinal dataset when the test set from CuratedSCS was used. To evaluate whether the lack of performance improvement was due to the model's inability to learn and generalize well enough (point (3)), 31 ML models were screened. No ML algorithm could be identified that led to a significant performance increase for the CuratedFinal dataset. Even though we could not find a better feature creation method or ML algorithm, that does not mean none exists: although our findings do not point in this direction, an inadequate model algorithm or feature set cannot be fully ruled out as the cause of the limited performance increase.
Another explanation is the presence of data with noisy labels in the test datasets. RBT results depend on various factors and have been shown to depend, among others, on the test protocol used and the laboratory that carried out the test.19 However, test sets without label noise are necessary to build and evaluate robust ML models. Given the inherent noise in RBT results, no fully reliable test set could be identified, so in the second approach label noise filtering was applied to the test set as well. All models performed significantly better when tested on the CuratedBIOWIN dataset than when tested on the CuratedSCS dataset, which was not filtered for label noise. When the training set was also filtered for label noise, the balanced accuracy increased from 88.4 ± 2.1% to 94.2 ± 1.2%.
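The idea of label noise filtering with two predictive models can be illustrated as a consensus filter: a data point is discarded when the out-of-fold predictions of two independent models both contradict its label. The sketch below uses synthetic data and generic scikit-learn models as stand-ins for the actual filtering models used in this work:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic binary dataset with 10% of the labels deliberately flipped
X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
flipped = rng.choice(len(y), size=30, replace=False)
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Out-of-fold predictions from two independent models
pred_a = cross_val_predict(RandomForestClassifier(random_state=0), X, y_noisy, cv=5)
pred_b = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy, cv=5)

# Consensus filter: drop a point only when both models contradict its label
keep = ~((pred_a != y_noisy) & (pred_b != y_noisy))
X_filtered, y_filtered = X[keep], y_noisy[keep]
```

Requiring both models to disagree with the label makes the filter conservative: a point that only one model misclassifies (a plausibly difficult-to-learn substance) is retained rather than discarded.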
Overall, 184 substances could not be labeled due to contradicting experimental test results and 814 substances were identified as having potentially noisy labels (ESI-2†). Of these 814 substances, 408 were assigned to be RB. This is concerning because substances that have been found to be RB in RBT do not have to undergo further testing for their biodegradability.
Our findings indicate that label noise filtering can lead to a more robust and reliable classifier for predicting the aerobic ready-biodegradability of chemicals. We could not find any indications that the label noise filtering led to the removal of difficult-to-learn substances. However, we cannot completely exclude that instead of data points with noisy labels, data points with difficult-to-learn substances have been removed. Therefore, we recommend that those substances that prevent the model from being more accurate (substances in CuratedRemoved) should be tested further experimentally to investigate whether the labels were noisy or not.
Footnote
† Electronic supplementary information (ESI) available: ESI-1 contains seven sections with additional information on, inter alia, some terms and definitions used in the article, the SMILES-retrieval pipeline, label noise filtering, the applicability domain and model performance. ESI-2 is an MS Excel file that contains all datasets that were used and generated in this study. See DOI: https://doi.org/10.1039/d4em00431k. The entire Python code, including the XGBClassifier that was trained on the CuratedFinal dataset, is provided in our GitHub repository at https://github.com/pkoerner6/Prediction-of-Aerobic-Biodegradability-of-Organic-Chemicals. The final classification model is also available through a graphical user interface at https://biodegradability-prediction-app.streamlit.app/
This journal is © The Royal Society of Chemistry 2024