Daniel Crusius,a Flaviu Cipcigan,b and Philip C. Biggin*a
aDepartment of Biochemistry, University of Oxford, South Parks Road, Oxford OX1 3QU, UK. E-mail: philip.biggin@bioch.ox.ac.uk
bIBM Research Europe, The Hartree Centre STFC Laboratory, Sci-Tech Daresbury, Warrington WA4 4AD, UK
First published on 4th June 2024
Data-driven techniques for establishing quantitative structure–property relations are a pillar of modern materials and molecular discovery. Fuelled by the recent progress in deep learning methodology and the abundance of new algorithms, it is tempting to chase benchmarks and incrementally build ever more capable machine learning (ML) models. While model evaluation has made significant progress, the intrinsic limitations arising from the underlying experimental data are often overlooked. In the chemical sciences, data collection is costly; datasets are therefore small and experimental errors can be significant. These limitations affect the predictive power of such datasets, a fact that is rarely considered in a quantitative way. In this study, we analyse commonly used ML datasets for regression and classification from drug discovery, molecular discovery, and materials discovery. We derive maximum and realistic performance bounds for nine such datasets by introducing noise based on estimated or actual experimental errors. We then compare the estimated performance bounds to the reported performance of leading ML models in the literature. Of the nine datasets and corresponding ML models considered, four were found to have reached or surpassed the dataset performance bounds and may therefore be fitting noise. More generally, we systematically examine how data range, the magnitude of experimental error, and the number of data points influence dataset performance bounds. Alongside this paper, we release the Python package NoiseEstimator and provide a web-based application for computing realistic performance bounds. This study and the resulting tools will help practitioners in the field understand the limitations of datasets and set realistic expectations for ML model performance. This work stands as a reference point, offering analysis and tools to guide the development of future ML models in the chemical sciences.
The ML literature distinguishes two types of uncertainty: aleatoric and epistemic.16–18 Aleatoric uncertainty arises from random or systematic noise in the data. ML models are capable of fitting noise perfectly,19 so it is important to consider the aleatoric limit: a maximum performance limit of ML models due to noise in the underlying data. The aleatoric limit primarily refers to the evaluation or test set data: it has been shown that the performance of ML models trained on noisy data can surpass the performance expected from the training-set noise, if they are evaluated on a noise-free dataset.18 Nonetheless, in practice, training and test datasets usually have comparable noise levels, and this effect most likely remains hidden. Epistemic uncertainty, on the other hand, arises from the limited expressiveness of a model, known as model bias, and from suboptimal parameter choice, often referred to as model variance.17
In this study, we specifically focus on how aleatoric uncertainty, or experimental noise, can limit ML model performance. We extend the method by Brown et al. to define performance bounds for common datasets in chemistry and materials, distinguishing between experimental noise (σE) and prediction noise (σpred). Assuming a perfect model (σpred = 0), we obtain the aleatoric limit or maximum performance bound. When incorporating non-zero model prediction noise σpred, which could arise from model bias, model variance, or noise in the training dataset, we also identify a realistic performance bound.
The method of Brown et al. derives performance bounds by computing performance metrics between a set of data points and the same set with added noise. If the added noise matches the magnitude of the underlying experimental error, the method reveals limits of model accuracy that should not be surpassed.
We investigate the impact of data range, experimental error, and dataset size on these performance bounds. We then examine nine ML datasets from biological, chemical, and materials science domains, estimate performance bounds based on experimental errors, and compare to reported performance of leading ML models.
Our analysis uses synthetic datasets uniformly distributed in the range [0,1]. For regression tasks, we use both the Pearson correlation coefficient R and the coefficient of determination r2 as evaluation metrics. To obtain maximum performance bounds, we add noise to the dataset labels and compute the evaluation metrics between the original dataset labels and the noisy labels. For the realistic performance bounds, instead of the original dataset labels, we consider a second set of noisy prediction labels, which simulate a model evaluation. Repeating this procedure multiple times yields distributions for each performance metric, from which we can estimate standard deviations or confidence intervals of the performance bounds.
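As a minimal illustration of this procedure (not the released NoiseEstimator package), the sketch below generates uniform synthetic labels in [0, 1], adds Gaussian noise of width σE for the maximum bound and a second noise draw of width σpred for the realistic bound, and reports the mean and standard deviation of the Pearson R distribution over repeats. The noise levels, dataset size, and number of repeats are illustrative choices, not values taken from the paper.

```python
# Sketch of the performance-bound estimation described above; parameter values
# (sigma_e, sigma_pred, n_points) are illustrative only.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

def performance_bounds(n_points=500, sigma_e=0.1, sigma_pred=0.1, n_repeats=1000):
    """Return arrays of Pearson R for the maximum and realistic bounds."""
    max_r, real_r = [], []
    for _ in range(n_repeats):
        y = rng.uniform(0.0, 1.0, n_points)                  # noise-free labels
        y_noisy = y + rng.normal(0.0, sigma_e, n_points)     # 'experimental' labels
        y_pred = y + rng.normal(0.0, sigma_pred, n_points)   # simulated model predictions
        max_r.append(pearsonr(y, y_noisy)[0])                # maximum bound: perfect model
        real_r.append(pearsonr(y_noisy, y_pred)[0])          # realistic bound: noisy model vs noisy labels
    return np.array(max_r), np.array(real_r)

max_r, real_r = performance_bounds()
print(f"maximum bound   R = {max_r.mean():.3f} +/- {max_r.std():.3f}")
print(f"realistic bound R = {real_r.mean():.3f} +/- {real_r.std():.3f}")
```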
Additionally, we compute a maximum performance bound for binary classification tasks obtained from regression datasets, for which we use the Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic curve (ROC-AUC) as performance metrics. Details of this method are described in Section 4.1.
The performance bounds can be computed for different noise distributions. Here, we exclusively consider Gaussian noise: first, we add Gaussian noise of a single level across all data points to identify general trends. Next, we mirror real-world data complexities by considering different noise levels depending on the label size. We study how the presence of two noise levels changes performance bounds relative to Gaussian noise of a single level. In principle, performance bounds could also be derived for other noise distributions, such as uniform, bimodal, or cosh distributed noise.
What is the impact of dataset size on these bounds? Increasing the dataset size at constant noise levels did not improve the maximum or realistic performance bounds of the datasets. However, the standard deviations of the observed performance metrics reduced. Thus, the predictive power of a dataset of larger size can be more confidently defined. This effect is similar to what is observed for significance testing, when comparing two distributions.3 The performance bounds considered here do not assess how well or efficiently a ML model might learn from a given dataset. The maximum performance bounds consider an intrinsic predictive limitation when evaluating models, based on the experimental uncertainty present in the datasets alone. The realistic performance bounds also consider a prediction error σpred. It is important to point out that σpred will likely depend on the specific ML model and contributions of model bias, model variance, as well as how well the model can deal with experimental noise in the training data. In principle, models trained on datasets with noise-levels of σE can achieve higher predictive performance (i.e. σpred < σE), if evaluated on a test set with noise < σE.18 A future avenue of research could be to train ML models on abundant noisy data, while evaluation could be performed on smaller high-quality datasets. Thus, models with high predictive power could be obtained, even if the performance bounds of the training data sets are lower.
As can be seen in Fig. 2, the dataset with σE = 0.1 had a higher maximum performance bound than the dataset with the two noise levels. Furthermore, the performance bound was more sharply defined, i.e. it had a lower standard deviation σR. For comparison, the resulting distributions of the Pearson correlation R for single noise levels of σE = 0.05 and σE = 0.2 are also plotted. Two noise levels (high and low) are therefore worse than a single moderate noise level applied to all datapoints. This hints at a broader conclusion: the presence of a few outliers or datapoints with high noise in an otherwise low-noise dataset can degrade performance disproportionately. We illustrate this by varying the location of the noise boundary, as shown in Fig. 2c, which is equivalent to changing the fraction of the dataset that is exposed to high noise levels. The maximum expected performance bound decreased steadily with an increasing fraction of datapoints experiencing high noise levels. Therefore, datapoints with high noise levels should be excluded, if possible, to maximise the predictive performance of a given dataset.
| Dataset name/range | No. of datapoints | Assay/experimental method | Experimental error estimate σE | Mean of maximum (realistic) performance bound: Pearson R (regression), MCC (classification) | Mean of maximum performance bound in ML eval. metric | Mean of realistic performance bound in ML eval. metric | Best ML model performance/model name/data split |
|---|---|---|---|---|---|---|---|
| a Not defined for the classification case. b Estimated by us, based on pairwise estimate of repeats performed in the original assay literature. c Estimated by us, based on pairwise error estimate via duplicates in raw data. d Rzepiela et al. report two different models. Bold ML performance metric values indicate models exceeding the estimated maximum performance bounds. | |||||||
| Drug binding | |||||||
| CASF 2016 (PDBBind 2016 core set)/9.75 log units | 285 | Binding affinity, multiple targets log Ki | 0.69 pKi units24 | 0.95 (0.91) | R: 0.95 | R: 0.91 | R: 0.845/ΔLin_F9XGB 34/— |
| BACE regression/6.0 log units | 1513 | Binding affinity, single target log Ki | 0.69 pKi units24 | 0.89 (0.79) | RMSE: 0.69 | RMSE: 0.98 | RMSE: 1.32/RF 35/scaffold |
| BACE classification/{0,1} | 1513 | Binding affinity converted to binary classes: 0,1 | 0.69 pKi units24 | MCC: 0.69 (—) | ROC-AUC: 0.84 | —a | **ROC-AUC: 0.86**/Uni-Mol/scaffold |
| Drug pharmacokinetics/molecular | |||||||
| Lipophilicity AstraZeneca (MolNet, TDC)/6.0 log units | 4200 | Lipophilicity assay log-ratio | 0.34b log units | 0.96 (0.93) | MAE: 0.27 | MAE: 0.38 | MAE: 0.47/Chemprop-RDKit 36/scaffold |
| AqSolDB (TDC)/15.3 log units | 9982 | Solvation assay log (S) | 0.56c log units | 0.97 (0.95) | MAE: 0.45 | MAE: 0.63 | MAE: 0.76/Chemprop-RDKit 36/scaffold |
| Caco-2 permeability (Wang) (TDC)/4.3 log units | 906 | Caco-2 permeability assay (log (Papp)) | 0.42 log units30 | 0.88 (0.77) | MAE: 0.34 | MAE: 0.47 | **MAE: 0.27**/MapLight 37/scaffold |
| Rzepiela dataset/3.5 log units | 4367 | Pampa permeability assay (log (Papp)) | 0.2 log units for high-perm., 0.6 log units for low-perm.21 | 0.91 (0.83) | r2: 0.80 | r2: 0.66 | r2: **0.81**/0.77d/Rzepiela QSPR 21/random |
| Buchwald–Hartwig HTE/0–100% | 3955 | Chemical reaction yields, obtained via high-throughput screening (%) | 5.3b % | 0.98 (0.96) | r2: 0.96 | r2: 0.93 | r2: 0.95/yield-BERT 38/random |
| Materials | |||||||
| Matbench: matbench_expt_gap/11.7 eV | 4604 | Experimentally measured band gaps (eV) | 0.14c eV | 1.0 (0.99) | MAE: 0.11 | MAE: 0.16 | MAE: 0.29/Darwin 39/random (NCV) |
The AqSolDB dataset26 is an aggregation of aqueous solubility measurements. We estimated the experimental error as σE = 0.56 log units via reported duplicates in the raw data that were removed in the compiled dataset. Since the range of the AqSolDB dataset is large (15.3 log units) relative to the error estimate (0.56 log units), performance bounds are high. The best reported ML model performance does not reach the performance bounds.
The lipophilicity dataset27 has a smaller range of 6.0 log units compared to some of the previous datasets; however, the estimated performance bounds are still high, because all datapoints are from the same assay, with an estimated experimental error of σE = 0.34 log units.28 Reported ML models have not reached the performance bounds of the dataset.
The Rzepiela dataset (ref. 21) is a collection of PAMPA permeability measurements, all performed via the same assay. In the publication, the authors report experimental error estimates that differ for high and low permeability compounds. We simulated the effect of two levels of noise in Section 2.1 for a synthetic dataset and apply the same method here. We used a value of σE,1 = 0.2 log units for values of log Peff > −7.6, and a value of σE,2 = 0.6 log units for values of log Peff ≤ −7.6. As already seen for the synthetic dataset, performance bounds are decreased due to the higher noise levels of some of the data points. The reported ML model performance exceeds the performance bounds estimated here. It could be that the reported experimental error is too large, or the ML model might be fitting to noise in the dataset. The authors applied 10-fold cross-validation with random splits to generate training and test data sets and evaluate ML model performance. The dataset contained 48 topologically different macrocyclic scaffolds, so there might have been structurally similar compounds in the train and test sets, and it would be interesting to see how the performance of the reported QSPR models would change for e.g. a scaffold-based split.
The Caco-2 dataset29 is a collection of Caco-2 permeability measurements with a range of 4.25 log units, aggregated from different publications. We used an error estimate of σE = 0.42 log units from an inter-lab comparison study for Caco-2 assays.30 The reported ML model performance is higher than the maximum performance bounds, indicating potential issues with fitting to noise.
Finally, we investigated a dataset of reaction yields (range of 0–100%) of Buchwald–Hartwig reactions from a high throughput experiment.25 We estimated a noise level of σE = 5.3%, which is based on repeat measurements performed as part of validating the original experimental protocol.31 The best reported ML models have high reported r2 scores and are between the realistic and maximum performance bounds. This could indicate a high-quality ML model, but since the dataset was split randomly, some fitting of noise cannot be ruled out.
The Rzepiela and Caco-2 permeability datasets and ML models were both flagged. The underlying datasets are complex permeability endpoints with a narrow data range relative to the estimated error, resulting in relatively low performance bounds.
The BACE classification ML model also exceeded the performance bounds estimated.
Our findings highlight the need to carefully consider noise when building ML models based on experimental data, since several ML models report performance that seems unlikely given the estimated experimental error of the underlying data. Future studies and novel ML algorithms should take these easy-to-calculate performance bounds into account when evaluating model performance, to ensure that advancements in ML models are genuine and do not result from overfitting to experimental noise.
ML model evaluation is itself still a debated topic, but efforts such as the Therapeutic Data Commons (TDC) that include pre-defined datasets, data splits and standardised evaluation metrics are a step in the right direction. However, the commonly reported tabular benchmarks of ML models are not enough, and more thorough evaluations based on statistical tests should be used to convincingly claim performance advances of new algorithms.3 When generating evaluation datasets, we recommend increasing the data range, or reducing the experimental error if possible. Additionally, the use of low-noise data points as test sets should be considered if data of varying quality is available.
Datasets with computational endpoints are often used in materials science applications. Such datasets do not have experimental noise, and the use of these synthetic datasets is a promising path forward if experimental data are scarce or impossible to acquire. For synthetic datasets and corresponding ML models, it will be interesting to further study the addition of artificial noise of varying levels, to see how different ML models deal with noise and whether they can surpass the noise levels present in the training datasets when evaluated on noise-free or low-noise test sets.18 When constructing synthetic datasets of experimentally measurable endpoints, e.g. via physics-based simulations, the addition of noise at the same levels as observed in experiments should be considered. Further, the synthetic datasets should mirror the data range of the experimental assays. Otherwise, the performance bounds will be artificially increased, the task is effectively simplified, and models should not be expected to transfer well to predicting the underlying experimental tasks.
We obtain the noisy labels y′ by adding noise to the labels y (see Fig. 4 for several examples of synthetic datasets with different noise levels). Given an original label yi and a noise sample ni, we obtain a noisy label y′i via:

y′i = yi + ni, with ni drawn from a Gaussian with mean μ = 0 and standard deviation σE.

We can then compute regression metrics, such as the Pearson correlation coefficient R, the coefficient of determination r2, etc., directly between the original dataset labels y and the noisy labels y′ to obtain maximum performance bounds, since we do not consider any predictor noise. For estimating a realistic performance bound, we draw a second set of noisy prediction labels y″, with noise from a Gaussian with mean μ = 0 and standard deviation σpred. We then compute the relevant metrics between y′ and y″, which effectively simulates evaluation of a ML model.
To simulate effects of noise when converting regression datasets to binary classification datasets, we add noise as described to the labels y to obtain noisy labels y′. Then, with a sharply defined class boundary b, which serves to split the dataset into binary classes {0,1}, we obtain the noise-free class labels yc via:

yc,i = 0 if yi ≤ b, and yc,i = 1 if yi > b.

The noisy classification labels y′c are then equivalently defined as:

y′c,i = 0 if y′i ≤ b, and y′c,i = 1 if y′i > b.

We can then compute classification metrics, such as the Matthews correlation coefficient MCC, ROC-AUC, etc., between yc and y′c. For both classification and regression performance bound estimates, we independently repeat the noise addition and performance bound computation 1000 times if not specified otherwise. This yields a distribution of values for each metric considered, of which we report the mean and standard deviation.
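A minimal sketch of this classification procedure is given below, assuming uniform synthetic labels, an illustrative class boundary b = 0.5, and an illustrative noise level; it is not the released NoiseEstimator code. Following the description above, the noisy binary labels are used directly when computing both MCC and ROC-AUC.

```python
# Sketch of the classification-bound procedure: perturb regression labels with
# Gaussian noise, binarise both label sets at boundary b, and compute metrics
# between the clean and noise-derived classes. All parameter values are illustrative.
import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_auc_score

rng = np.random.default_rng(0)

def classification_bound(y, sigma_e, b=0.5, n_repeats=1000):
    """Distributions of MCC and ROC-AUC between clean and noise-derived classes."""
    y_c = (y > b).astype(int)                      # noise-free class labels y_c
    mcc, auc = [], []
    for _ in range(n_repeats):
        y_noisy = y + rng.normal(0.0, sigma_e, y.shape)
        y_c_noisy = (y_noisy > b).astype(int)      # noisy class labels y'_c
        mcc.append(matthews_corrcoef(y_c, y_c_noisy))
        auc.append(roc_auc_score(y_c, y_c_noisy))  # noisy binary labels used as scores
    return np.array(mcc), np.array(auc)

y = rng.uniform(0.0, 1.0, 1000)                    # synthetic regression labels
mcc, auc = classification_bound(y, sigma_e=0.1)
print(f"MCC bound     = {mcc.mean():.2f} +/- {mcc.std():.2f}")
print(f"ROC-AUC bound = {auc.mean():.2f} +/- {auc.std():.2f}")
```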
We also performed addition of Gaussian noise of two different levels. For this, we split the dataset along a boundary b′. To obtain the noisy labels y′, we add Gaussian noise of σ1 to all values of y that are below b′; for values above b′ we add Gaussian noise of σ2. The estimation of the performance bounds is then performed as described above.
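The two-level variant only changes how the noisy labels are generated; a sketch under the same assumptions as above (illustrative boundary and noise widths) is shown here, after which the bound estimation proceeds exactly as in the single-level case.

```python
# Sketch of two-level Gaussian noise: sigma_1 below the boundary b_prime,
# sigma_2 above it. Boundary and noise widths are illustrative values.
import numpy as np

def add_two_level_noise(y, b_prime, sigma_1, sigma_2, rng):
    """Return noisy labels with a label-dependent Gaussian noise level."""
    sigma = np.where(y < b_prime, sigma_1, sigma_2)   # per-point noise width
    return y + rng.normal(0.0, sigma)

rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, 500)
# e.g. low noise for small labels, high noise for large labels
y_noisy = add_two_level_noise(y, b_prime=0.5, sigma_1=0.05, sigma_2=0.2, rng=rng)
```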
Fig. 4 shows an example synthetic dataset with N = 50, with various levels of experimental noise added in (b), (c) and (d).
We characterise each dataset by:
• Labels: experimental or computational observable.
• Source: single source and assay or aggregate of multiple sources or assays.
• Task: regression task, or classification task (or regression converted to classification).
Every dataset has the following properties: (1) range of labels or number of classes in the classification context, (2) size of experimental error, which is often unknown or not reported, and (3) number of datapoints. When estimating performance bounds, selection of a realistic estimate of the experimental noise is key. In the following, we detail the selected datasets and how error estimates were obtained.
The BACE dataset23 (N = 1513) is part of the MoleculeNet benchmark suite.42 As the BACE dataset originates from various sources, we assume an experimental error of 0.69 log units, identical to the CASF 2016 dataset. Since the BACE dataset has been used for both regression and classification, we also derive performance bounds for the classification task. The BACE dataset was obtained from https://moleculenet.org/datasets-1 on March 21, 2024.
The Lipophilicity AstraZeneca dataset27 (N = 4200) is a single-source dataset of lipophilicity measurements (octanol/water distribution coefficient, log D at pH 7.4). All data points were measured via a single, well-defined shake-flask method,28 and we estimated an experimental standard deviation of 0.34 log units (RMSE: 0.46 log units). This value was based on a pairwise comparison of reported assay values to the 22 reference literature values as reported in the assay publication.28 This includes six compounds for which the reported assay values were outside of the assay range, <−1.5 or >4.5; we set those values to be equal to −1.5 or 4.5, respectively. The assay publication lists an RMSE of 0.2 log units (corresponding standard deviation of 0.16 log units), which can be obtained if the six ‘out-of-range’ datapoints are excluded. The experimental range of the assay is 6.0 log units. The lipophilicity dataset was obtained via the Therapeutic Data Commons python package, as described at https://tdcommons.ai/single_pred_tasks/adme/#lipophilicity-astrazeneca on March 20, 2024.
The Wang Caco-2 permeability dataset29 (N = 906) is another of the datasets listed in the Therapeutic Data Commons repository. The dataset is an aggregate of Caco-2 permeability measurements from different sources. Caco-2 cells are used as an in vitro model to simulate the human intestinal tissue. Since this dataset was compiled from different sources, we estimated the experimental error based on a quantitative inter-lab comparison study to be 0.42 log units.30 This is based on 10 compounds, measured in seven different laboratories, yielding 169 value pairs that were used to estimate the standard deviation. The Wang dataset was obtained via the Therapeutic Data Commons python package on March 20, 2024, as described at https://tdcommons.ai/single_pred_tasks/adme/#caco-2-cell-effective-permeability-wang-et-al.
The Rzepiela dataset21 (N = 3600) is a single-source, single-assay dataset of macrocycle PAMPA measurements (parallel artificial membrane permeability assay). In contrast to many other datasets encountered, the authors provide an uncertainty estimate that depends on the permeability value: the experimental error was higher for low permeability values (0.6 log units at −log Peff ∼ 7.6), whereas at higher permeability values (−log Peff ∼ 5.8) the standard error of the PAMPA measurement is only ∼0.2 log units. To estimate performance bounds, we applied noise levels of σE,1 = 0.6 log units for −log Peff values > 6.7 and σE,2 = 0.2 log units for −log Peff values ≤ 6.7. The Rzepiela dataset was obtained from the supplementary data of the original publication.
The AqSolDB dataset26 (N = 9982) is an aggregate of nine different datasets of experimental aqueous solubility measurements (log S). When merging the nine datasets, the authors attempted to select the most reliable values if duplicates were present. Some of the datapoints have an associated standard deviation if duplicates were measured. We estimated the experimental error via pairwise computation of the standard deviation based on duplicate values, using the method of Kramer24 as defined in Section 4.4. This yields an overall experimental standard deviation of σE = 0.56 log units. The AqSolDB dataset was obtained via the Therapeutic Data Commons python package, as described at https://tdcommons.ai/single_pred_tasks/adme/#solubility-aqsoldb, on March 20, 2024.
The Buchwald–Hartwig HTE dataset25 (N = 3955) is a single source, high-throughput experimentation-based dataset of reaction yield measurements of a palladium-catalysed Buchwald–Hartwig cross-coupling reaction. To the best of our knowledge, no experimental uncertainties were recorded as part of the dataset directly. The high-throughput experimental protocol was developed in the Merck Research Laboratories for nanomole-scale experimentation in 1536-well plates.31 In the original protocol publication, 64 reactions were run twice as part of an experiment. We used these 64 reactions to estimate an experimental standard deviation based on the pairwise method defined in Section 4.4. This yields an experimental standard deviation of the high-throughput protocol of σE = 5.3%, which we used as an approximate error for the Buchwald–Hartwig HTE dataset. The Buchwald dataset was obtained from https://github.com/rxn4chemistry/rxn_yields on March 21, 2024.
None of the datasets considered here had individually reported standard deviations for all datapoints. For datasets that originated from a single, well-defined assay, we used the reported standard deviation of that assay as a noise estimate.
For datasets that are aggregates of multiple studies or methods performed by different labs, we went back to the raw data before de-duplication, if available, and estimated the standard deviation based on pairwise deviations according to the method described by Kramer et al.,24 briefly summarised here: the estimated experimental standard deviation σE is computed from all m possible pairs of measured duplicate values (pair i has the measured values ypub,i,1 and ypub,i,2):

σE = sqrt[ (1/(2m)) Σi=1..m (ypub,i,1 − ypub,i,2)² ]
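The sketch below implements this pairwise estimate; the factor of 1/2 reflects that both measurements of a duplicate pair carry noise. The example pair values are purely illustrative, not data from AqSolDB or the HTE repeats.

```python
# Sketch of the pairwise standard-deviation estimate from duplicate measurements
# (Kramer et al. style): sigma_E = sqrt(sum((y1 - y2)^2) / (2m)).
import numpy as np

def pairwise_sigma(pairs):
    """pairs: array of shape (m, 2) with the two duplicate values per row."""
    diffs = pairs[:, 0] - pairs[:, 1]
    return np.sqrt(np.mean(diffs**2) / 2.0)

# Illustrative duplicate measurements (log units).
pairs = np.array([[-3.1, -2.8], [-5.0, -4.6], [-1.2, -1.5]])
print(f"sigma_E ~ {pairwise_sigma(pairs):.2f} log units")
```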
If no duplicate raw data was available, we looked for quantitative inter-lab comparison studies of the specific methods to obtain a noise estimate. For classification datasets, it is more difficult to find reliable noise estimates. For the BACE classification task, we went back to the original regression data, added noise to the regression labels, while maintaining the same class boundary as used for conversion to the classification task. We then derived noisy classification labels, which we compared to the true classification labels as described in Section 4.1 to obtain estimates of the classification performance metrics.