Are we fitting data or noise? Analysing the predictive power of commonly used datasets in drug, materials, and molecular discovery.

Data-driven techniques for establishing quantitative structure-property relations are a pillar of modern materials and molecular discovery. Fuelled by the recent progress in deep learning methodology and the abundance of new algorithms, it is tempting to chase benchmarks and incrementally build ever more capable machine learning (ML) models. While model evaluation has made significant progress, the intrinsic limitations arising from the underlying experimental data are often overlooked. In the chemical sciences data collection is costly, thus datasets are small and experimental errors can be significant. These limitations affect a dataset's predictive power, a fact that is rarely considered in a quantitative way. In this study, we analyse commonly used ML datasets for regression and classification from drug discovery, molecular discovery, and materials discovery. We derive maximum and realistic performance bounds for nine such datasets by introducing noise based on estimated or actual experimental errors. We then compare the estimated performance bounds to the reported performance of leading ML models in the literature. Out of the nine datasets and corresponding ML models considered, four were identified to have reached or surpassed dataset performance limitations and thus may potentially be fitting noise. More generally, we systematically examine how data range, the magnitude of experimental error, and the number of data points influence dataset performance bounds. Alongside this paper, we release the Python package NoiseEstimator and provide a web-based application for computing realistic performance bounds. This study and the resulting tools will help practitioners in the field understand the limitations of datasets and set realistic expectations for ML model performance. This work stands as a reference point, offering analysis and tools to guide development of future ML models in the chemical sciences.


Introduction
The focus of the ML community and literature is often on state-of-the-art algorithms.[11][12][13] Assessing the variability in experimental data is important,14 but ML applications in chemistry are also often limited by the high cost of data collection and the presence of experimental noise in the data. This challenge is recognised but not always accounted for when evaluating ML model performance and uncertainty.15,17,18 Aleatoric uncertainty arises due to random or systematic noise in the data. ML models are capable of fitting noise perfectly,19 therefore it is important to consider the aleatoric limit, a maximum performance limit of ML models due to noise in the underlying data. The aleatoric limit primarily refers to the evaluation or test set data: it has been shown that the performance of ML models trained on noisy data can potentially surpass the performance expected from noise in the training set, if the models are evaluated on a noise-free dataset.18 Nonetheless, in practice, training and test datasets usually have comparable noise levels, and this effect most likely remains hidden. Epistemic uncertainty, on the other hand, is uncertainty due to the limited expressiveness of a model, known as model bias, and suboptimal parameter choice, often referred to as model variance.17 In this study, we specifically focus on how aleatoric uncertainty, or experimental noise, can limit ML model performance. We extend the method by Brown et al.20 to define performance bounds for common datasets in chemistry and materials science, distinguishing between experimental noise (σ_E) and prediction noise (σ_P). Assuming a perfect model (σ_P = 0), we obtain the aleatoric limit or maximum performance bound. When incorporating non-zero model prediction noise σ_P, which could arise from model bias, model variance, or noise in the training dataset, we also identify a realistic performance bound.
The method of Brown et al. derives performance bounds by computing performance metrics between a set of data points and the same set with added noise. If the added noise matches the size of the underlying experimental error, the method reveals limits of model accuracy that should not be surpassed.
We investigate the impact of data range, experimental error, and dataset size on these performance bounds. We then examine nine ML datasets from biological, chemical, and materials science domains, estimate performance bounds based on experimental errors, and compare them to the reported performance of leading ML models.

Results and Discussion
In section 2.1, we analyse the general influence of dataset properties, such as the data range, the size of experimental errors, and the number of data points, on the maximum and realistic performance bounds of datasets used for ML models. Utilising synthetic datasets, we specifically investigate how Gaussian noise, applied at one and two levels, affects these bounds. This analysis is the foundation for section 2.2, where we compare estimated performance bounds of nine real-world ML datasets to the reported performance of leading ML models. This allows us to distinguish between datasets where ML models have reached the limit of performance due to experimental error, and datasets where there is still room for ML model improvement.

Impact of data range, experimental error, and number of datapoints on realistic and maximum performance bounds
In the following, we investigate the effect of data range, magnitude of experimental error, and dataset size on performance bounds, using the method developed by Brown et al.,20 described in detail in section 4.1 and extended by us to classification datasets. We define two types of performance bounds: a maximum performance bound, where we only assume the presence of an experimental error σ_E, and a realistic performance bound, which also considers a model prediction error σ_P. The maximum performance bounds describe an intrinsic predictive limitation when evaluating ML models, based on the experimental uncertainty present in the datasets alone. For the realistic performance bounds, we assumed a prediction error σ_P equal to the experimental error σ_E, which we consider reasonable for most ML models.
Our analysis uses synthetic datasets uniformly distributed in the range [0,1]. For regression tasks, we use both the Pearson correlation coefficient R and the coefficient of determination r² as evaluation metrics. To obtain maximum performance bounds, we add noise to the dataset labels and compute the evaluation metrics between the original dataset labels and the noisy labels.
For the realistic performance bounds, instead of the original dataset labels, we use a second set of noisy prediction labels, which simulates a model evaluation. Repeating this procedure multiple times yields distributions for each performance metric, from which we can estimate standard deviations or confidence intervals of the performance bounds.
Additionally, we compute a maximum performance bound for binary classification tasks obtained from regression datasets, for which we use the Matthews correlation coefficient (MCC) as well as the area under the receiver operating characteristic curve (ROC-AUC) as performance metrics. Details of this method are described in section 4.1.
The performance bounds can be computed for different noise distributions. Here, we exclusively consider Gaussian noise: first, we add Gaussian noise of a single level across all data points to identify general trends. Next, we mirror real-world data complexities by considering different noise levels depending on the label size. We study how the presence of two noise levels changes performance bounds relative to Gaussian noise of a single level. In principle, performance bounds could also be derived for other noise distributions, such as uniform, bimodal, or cosh-distributed noise.

Gaussian noise of one level
First, we consider adding Gaussian noise with standard deviation σ_E, which we present in % relative to the dataset range [0,1] of the synthetic datasets: a noise level of 10 % corresponds to Gaussian noise drawn from a normal distribution with mean μ = 0 and standard deviation σ = 0.1. Figure 1 shows maximum performance bounds (σ_P = 0) for regression (Fig. 1a, 1d), realistic performance bounds (σ_P = σ_E) for regression (Fig. 1b, 1e), and maximum performance bounds (σ_P = 0) for classification (Fig. 1c, 1f) for different dataset sizes and noise levels. As expected, increased noise levels reduced the maximum and realistic performance bounds of a dataset. For regression tasks, noise levels of σ_E ≤ 15% yielded maximum Pearson correlation coefficients of R > 0.9, and noise levels of σ_E ≤ 10% yielded r² scores above 0.9. To increase the performance bounds of a dataset, one therefore needs to reduce noise levels or increase the range of the data.
What is the impact of dataset size on these bounds? Increasing the dataset size at constant noise levels did not improve the maximum or realistic performance bounds of the datasets. However, the standard deviations of the observed performance metrics decreased. Thus, the predictive power of a larger dataset can be defined more confidently. This effect is similar to what is observed in significance testing when comparing two distributions.3 The performance bounds considered here do not assess how well or efficiently a ML model might learn from a given dataset. The maximum performance bounds reflect an intrinsic predictive limitation based on the experimental uncertainty present in the datasets alone, while the realistic performance bounds also consider a prediction error σ_P. It is important to point out that σ_P will likely depend on the specific ML model and on the contributions of model bias and model variance, as well as on how well the model can deal with experimental noise in the training data. In principle, models trained on datasets with noise levels of σ_E can achieve higher predictive performance (i.e. σ_P < σ_E) if evaluated on a test set with noise below σ_E.18 A future avenue of research could be to train ML models on abundant noisy data, while evaluation is performed on smaller high-quality datasets. Thus, models with high predictive power could be obtained even if the performance bounds of the training datasets are lower. For the classification datasets, the regression datasets were divided into 0 (inactive) for values < 0.5 and 1 (active) for values ≥ 0.5. This binarisation was applied before and after the addition of noise, such that noise can lead to misclassification of datapoints.
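The dataset-size effect described above can be reproduced in a few lines of NumPy. This is a minimal sketch under the synthetic-dataset assumptions of this section, not the released NoiseEstimator package; the function name bound_distribution is ours.

```python
import numpy as np

rng = np.random.default_rng(7)

def bound_distribution(n_points, sigma_e, n_repeats=400):
    # Mean and spread of the maximum Pearson-R bound for a dataset of size n_points.
    rs = []
    for _ in range(n_repeats):
        y = rng.uniform(0.0, 1.0, size=n_points)               # uniform labels on [0, 1]
        y_noisy = y + rng.normal(0.0, sigma_e, size=n_points)  # add experimental noise
        rs.append(np.corrcoef(y, y_noisy)[0, 1])
    rs = np.asarray(rs)
    return rs.mean(), rs.std()

mean_small, std_small = bound_distribution(50, sigma_e=0.1)
mean_large, std_large = bound_distribution(5000, sigma_e=0.1)
# the mean bound barely moves with dataset size, but its spread shrinks
```

Running this shows that the mean bound is essentially unchanged between N = 50 and N = 5000, while the standard deviation of the bound shrinks markedly, mirroring the contour lines in Fig. 1.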

Gaussian noise of two levels in a single dataset
For some experimental measurements, error sizes can vary with the absolute size of the quantity measured. Such size-dependent errors were seen in the Rzepiela dataset,21 one of the nine datasets we study in more detail in section 2.2. As can be seen in Fig. 2, the dataset with a single noise level of σ_E = 0.1 had a higher maximum performance bound than the dataset with the two noise levels. Furthermore, the performance bound was more sharply defined, i.e. it had a lower standard deviation σ. For comparison, the resulting distributions of the Pearson correlation R for single noise levels of σ_E = 0.05 and σ_E = 0.2 are also plotted.
Noise at two levels (high and low) is therefore worse than a moderate noise level applied to all datapoints. This hints at a wider-ranging conclusion: the presence of a few outliers or datapoints with high noise in an otherwise low-noise dataset can degrade performance disproportionately. We show this by varying the location of the noise barrier, as shown in Fig. 2c, which is equivalent to changing the fraction of the dataset that is exposed to high noise levels. The maximum expected performance bound decreased steadily with an increasing fraction of datapoints experiencing high noise levels. Therefore, datapoints with high noise levels should be excluded, if possible, to maximise the predictive power of a given dataset.

Are we fitting data or noise? Assessing performance bounds of application datasets and comparison to ML model performance
The maximum and realistic performance bounds for a total of nine datasets from drug discovery, materials discovery, and molecular discovery applications that have been used for building ML models are shown in Table 1 and Fig. 3. We used error estimates in the following order of preference, as available: (1) reported experimental standard deviations for datapoints, (2) the reported standard deviation for the specific experimental assay, (3) the standard deviation estimated from duplicate values via pairwise comparison (see section 4.4 for details), (4) the standard deviation obtained from inter-lab comparison studies of the general method.
Table 1 shows a detailed overview of the datasets used, the experimental error estimates, and the resulting maximum and realistic performance bounds for Pearson R / MCC, as well as the performance bounds expressed in the evaluation metric of the best-performing ML models from the literature. Fig. 3 shows a direct comparison of the performance bounds with the reported ML performance for all datasets considered. For three of the nine datasets, ML model performance exceeded or was at the maximum performance bound, and thus the reported ML performance seems unrealistically high given the error estimates made here. An additional ML model exceeds the realistic performance bound but stays below the maximum performance bound. The remaining five datasets have ML models that are below the performance bounds. We discuss the individual datasets in more detail in the following.

Drug binding tasks
Both the CASF2016 22 and the BACE 23 datasets contain protein-ligand binding affinity data. CASF2016 has a range of 9.75 log units, while BACE-r only covers 6 log units. Since both datasets originate from different laboratories and do not necessarily use the exact same experimental protocol, we estimated the experimental error as σ_E = 0.69 log units. This estimate is based on a systematic study of duplicate values in the ChEMBL database.12,24 Owing to the greater range, the maximum and realistic performance bounds of CASF2016 are higher than those of BACE-r, even though the experimental error estimate is the same. For both BACE-r and CASF2016, the development of improved ML models seems possible, given the dataset performance bounds. Conversion of the BACE dataset into a classification task (BACE-c) leads to an ML model that exceeds the maximum predictive performance of the classification dataset. This suggests that the classification setup simplified the bioactivity prediction task; however, the model might also be fitting noise in the dataset.

Drug pharmacokinetics and molecular ML tasks
Next, we consider properties relevant in both molecular and drug discovery settings: chemical reaction yields via the Buchwald-Hartwig HTE dataset,25 physicochemical properties such as aqueous solubility and lipophilicity, as well as permeability measured in the PAMPA and cell-based Caco-2 assays.
The AqSolDB dataset 26 is an aggregation of aqueous solubility measurements. We estimated the experimental error as σ_E = 0.56 log units via reported duplicates in the raw data that were removed in the compiled dataset. Since the range of the AqSolDB dataset (15.3 log units) is large relative to the error estimate (0.56 log units), the performance bounds are high. The best reported ML model performance does not reach the performance bounds.
The lipophilicity dataset 27 has a smaller range of 6.0 log units compared to some of the previous datasets; however, the estimated performance bounds are still high. This is because all datapoints are from the same assay, with an estimated experimental error of σ_E = 0.32 log units.28 Reported ML models have not reached the performance bounds of the dataset.
The Rzepiela dataset 21 is a collection of PAMPA permeability measurements, all performed with the same assay. In the publication, the authors report experimental error estimates that differ between high- and low-permeability compounds. We simulated the effect of two levels of noise for a synthetic dataset in section 2.1 and apply the same method here: we used σ_E,1 = 0.2 log units for label values > -7.6, and σ_E,2 = 0.6 log units for label values ≤ -7.6. As already seen for the synthetic dataset, the performance bounds are decreased by the higher noise level of some of the data points. The reported ML model performance exceeds the performance bounds estimated here. It could be that the reported experimental error is too large, or the ML model might be fitting noise in the dataset. The authors applied 10-fold cross-validation with random splits to generate training and test datasets and evaluate ML model performance. The dataset contained 48 topologically different macrocyclic scaffolds, so there might have been structurally similar compounds in the train and test sets, and it would be interesting to see how the performance of the reported QSPR models would change for, e.g., a scaffold-based split.
The Caco-2 dataset 29 is a collection of Caco-2 permeability measurements with a range of 4.25 log units, aggregated from different publications. We used an error estimate of σ_E = 0.42 log units from an inter-lab comparison study of Caco-2 assays.30 The reported ML model performance is higher than the maximum performance bound, indicating potential issues with fitting to noise.

Faraday Discussions Accepted Manuscript
Finally, we investigated a dataset of reaction yields (range of 0-100%) of Buchwald-Hartwig reactions from a high-throughput experiment.25 We estimated a noise level of σ_E = 5.3%, based on repeat measurements performed as part of validating the original experimental protocol.31 The best reported ML models have high r² scores that lie between the realistic and maximum performance bounds. This could indicate a high-quality ML model, but since the dataset was split randomly, some fitting of noise cannot be ruled out.

Materials science datasets
Many of the common materials science ML datasets have computational rather than experimental endpoints. This avoids the issue of experimental noise and allows the construction of accurate ML models. We chose a dataset of experimentally measured band gaps 32 reported as part of the Matbench suite 33 of materials science benchmarks; however, only non-zero values were measured experimentally. We estimated the experimental noise as σ_E = 0.14 eV from the unprocessed dataset, which contained duplicate values. The estimated performance bounds are high, since the noise value is small relative to the range of the dataset (11.7 eV), and further ML model improvements seem possible.

ML Model Performances Exceeding Performance Bounds
Out of the nine datasets studied, four surpassed the estimated realistic performance bounds. Three of these four cases also reached or surpassed the estimated maximum performance bounds. Why do certain ML models surpass our calculated performance bounds? Two of the flagged models (Rzepiela, Buchwald-Hartwig) were evaluated using random data splits, which might lead to inflated performance estimates due to overfitting to noise, memorisation, and overlap between train and test sets.
The Rzepiela and Caco-2 permeability datasets and ML models were both flagged. The underlying datasets are complex permeability endpoints with a narrow data range relative to the estimated error, resulting in relatively low performance bounds.
The BACE classification ML model also exceeded the performance bounds estimated.
Our findings highlight the need to carefully consider noise when building ML models based on experimental data, since several ML models report performances that seem unlikely given the estimated experimental error of the underlying data. Future studies and novel ML algorithms should consider these easy-to-calculate performance bounds when evaluating model performance, to ensure that advancements in ML models are genuine and do not result from overfitting to experimental noise.
In general, increasing the dataset size leads to higher confidence in the value of the performance metrics, but does not increase the performance bounds themselves. The value of the maximum and realistic performance bounds is determined by the size of the experimental noise relative to the data range. The performance bounds defined here can serve as a quantitative evaluation metric to assess whether models fit to noise. This could also be applied during model training: evaluating ML models on a validation dataset and ensuring that the performance bounds are not exceeded could serve as an alternative, quantitative criterion to avoid over-fitting. As part of this study, we identified nine commonly used ML datasets from drug, molecular, and materials discovery and derived a systematic protocol to estimate realistic experimental errors. We show that for some datasets, the reported ML model performance exceeds or is close to what we believe to be an upper performance limit. High ML performance is encouraging, but only if the model evaluation was rigorous. ML model performance that is at the performance bounds or even higher suggests that some ML models may be fitting to noise. This is a significant issue because such models will likely underperform in application scenarios. For some of the datasets investigated, ML model performance has not yet reached the maximum performance that could theoretically be achieved with the underlying datasets. This highlights the need for further efforts in model and algorithm development, e.g. for ligand binding affinity predictions.

ML model evaluations themselves are still a debated topic, but efforts such as the Therapeutic Data Commons (TDC), which include pre-defined datasets, data splits, and standardised evaluation metrics, are a step in the right direction. However, the commonly reported tabular benchmarks of ML models are not enough, and more thorough evaluations based on statistical tests should be used to convincingly claim performance advances of new algorithms.3 When generating evaluation datasets, we recommend increasing the data range, or reducing the experimental error if possible. Additionally, the use of low-noise data points as test sets should be considered if data of varying quality is available.
Datasets with computational endpoints are often used in materials science applications. Such datasets do not have experimental noise, and the use of these synthetic datasets is a promising path forward if experimental data is scarce or impossible to acquire. For synthetic datasets and corresponding ML models, it will be interesting to further study the addition of artificial noise of varying levels to see how different ML models deal with noise, and whether they can surpass the noise levels of their training datasets when evaluated on noise-free or low-noise test sets.18 When constructing synthetic datasets of experimentally measurable endpoints, e.g. via physics-based simulations, the addition of noise at the same levels as observed in experiments should be considered. Further, one should ensure that the synthetic datasets mirror the data range of the experimental assays. Otherwise, the performance bounds are artificially increased, the task is effectively simplified, and models should not be expected to transfer well to the underlying experimental prediction tasks.

Addition of Gaussian noise and estimation of performance metric bounds
For a dataset of size N, with range [y_min, y_max] and labels y, we draw N random samples from a normal (Gaussian) distribution with mean μ = 0 and standard deviation σ equal to the desired experimental noise level, via the NumPy package.40 The probability density of the Gaussian distribution is

p(x) = 1/(σ√(2π)) · exp(−x²/(2σ²)).

We obtain the noisy labels y' by adding the noise to the labels y (see Fig. 4 for several examples of synthetic datasets with different noise levels). Given an original label y_i and a noise sample ε_i, we obtain a noisy label y'_i via

y'_i = y_i + ε_i.

We can then compute regression metrics, such as the Pearson correlation coefficient R or the coefficient of determination r², directly between the original dataset labels y and the noisy labels y' to obtain maximum performance bounds, since we do not consider any predictor noise. For estimating a realistic performance bound, we draw a second set of noisy labels y'_P, with noise from a Gaussian with mean μ = 0 and standard deviation σ_P. We then compute the relevant metrics between y' and y'_P, which effectively simulates the evaluation of a ML model.
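The regression procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released NoiseEstimator package; the function name performance_bounds and its arguments are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def pearson_r(a, b):
    # Pearson correlation coefficient between two label vectors.
    return np.corrcoef(a, b)[0, 1]

def performance_bounds(y, sigma_e, sigma_p=0.0, n_repeats=1000):
    """Mean and standard deviation of the Pearson-R performance bound.

    sigma_p = 0 gives the maximum bound (perfect model);
    sigma_p > 0 (e.g. sigma_p = sigma_e) gives the realistic bound."""
    rs = []
    for _ in range(n_repeats):
        # noisy "experimental" labels y'
        y_noisy = y + rng.normal(0.0, sigma_e, size=y.size)
        if sigma_p > 0.0:
            # second noisy set y'_P simulating model predictions
            y_pred = y + rng.normal(0.0, sigma_p, size=y.size)
            rs.append(pearson_r(y_noisy, y_pred))
        else:
            rs.append(pearson_r(y, y_noisy))
    rs = np.asarray(rs)
    return rs.mean(), rs.std()

# synthetic dataset: uniform labels in [0, 1], 10 % noise level
y = rng.uniform(0.0, 1.0, size=1000)
r_max, _ = performance_bounds(y, sigma_e=0.1)                # maximum bound
r_real, _ = performance_bounds(y, sigma_e=0.1, sigma_p=0.1)  # realistic bound
```

For a uniform dataset on [0, 1] (standard deviation 1/√12 ≈ 0.289), a 10 % noise level gives a maximum Pearson-R bound of roughly 0.94 and a realistic bound of roughly 0.89, consistent with the trends discussed for Fig. 1.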
To simulate the effects of noise when converting regression datasets to binary classification datasets, we add noise as described above to the labels y to obtain noisy labels y'. Then, with a sharply defined class boundary b, which serves to split the dataset into binary classes {0,1}, we obtain the noise-free class labels c_i via

c_i = 0 if y_i < b, and c_i = 1 if y_i ≥ b.

The noisy classification labels c'_i are then equivalently defined from the noisy labels y'_i as

c'_i = 0 if y'_i < b, and c'_i = 1 if y'_i ≥ b.

We can then compute classification metrics, such as the Matthews correlation coefficient MCC, or ROC-AUC, between c_i and c'_i.
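A minimal sketch of this classification case, with the MCC computed explicitly from the confusion-matrix counts; the helper names to_classes and mcc are illustrative and not part of any released package.

```python
import numpy as np

rng = np.random.default_rng(1)

def to_classes(y, b=0.5):
    # Sharply defined class boundary b: 0 (inactive) below b, 1 (active) at or above.
    return (y >= b).astype(int)

def mcc(c_true, c_pred):
    # Matthews correlation coefficient from the confusion-matrix counts.
    tp = int(np.sum((c_true == 1) & (c_pred == 1)))
    tn = int(np.sum((c_true == 0) & (c_pred == 0)))
    fp = int(np.sum((c_true == 0) & (c_pred == 1)))
    fn = int(np.sum((c_true == 1) & (c_pred == 0)))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

y = rng.uniform(0.0, 1.0, size=2000)
y_noisy = y + rng.normal(0.0, 0.1, size=y.size)

c = to_classes(y)              # noise-free class labels c_i
c_noisy = to_classes(y_noisy)  # noisy class labels c'_i: datapoints near b can flip
score = mcc(c, c_noisy)        # one sample of the maximum MCC bound
```

Only datapoints whose labels lie within a few noise standard deviations of the boundary b can flip class, which is why classification bounds degrade more gently with noise than regression bounds.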
For both classification and regression performance bound estimates, we independently repeat the noise addition and performance bound computation 1000 times, if not specified otherwise. This yields a distribution of values for each metric considered, of which we report the mean and standard deviation.
We also performed the addition of Gaussian noise at two different levels. For this, we split the dataset along a boundary b'. To obtain the noisy labels y', we add Gaussian noise with standard deviation σ_1 to all values of y below b'; for values above b', we add Gaussian noise with standard deviation σ_2. The estimation of the performance bounds is then performed as described above.
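The two-level case can be sketched as follows, assigning the high noise level to labels below the boundary as in Fig. 2. This is an illustrative sketch, not the NoiseEstimator API.

```python
import numpy as np

rng = np.random.default_rng(2)

def max_bound(y, sigma_per_point, n_repeats=500):
    # Maximum Pearson-R bound for (possibly label-dependent) noise levels.
    rs = []
    for _ in range(n_repeats):
        y_noisy = y + rng.normal(0.0, 1.0, size=y.size) * sigma_per_point
        rs.append(np.corrcoef(y, y_noisy)[0, 1])
    return float(np.mean(rs))

y = rng.uniform(0.0, 1.0, size=1000)

# two levels: sigma_1 = 0.2 below the boundary b' = 0.5, sigma_2 = 0.05 above
sigma_two = np.where(y < 0.5, 0.2, 0.05)
r_two = max_bound(y, sigma_two)

# single moderate level of 0.1 for all datapoints
r_one = max_bound(y, np.full(y.size, 0.1))
# the mixed low/high-noise dataset has the lower maximum bound
```

This reproduces the observation in section 2.1: the dataset with mixed low and high noise has a lower maximum bound than the dataset with a single moderate noise level, because the bound is governed by the mean noise variance, which the high-noise points dominate.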

Synthetic dataset generation
Synthetic datasets were generated via the NumPy package.40 All synthetic datasets have range [0,1], with datapoints distributed uniformly over the full range. After generating a uniformly distributed dataset of size N, we draw N random samples from a normal (Gaussian) distribution with μ = 0 and σ equal to the desired noise level, as described in the previous section. This noise is then added to the datapoints as described in section 4.1 to obtain y' or y'_P. Figure 4 shows an example synthetic dataset with N = 50 and various levels of experimental noise added in (b), (c), (d).

Experimental dataset selection and dataset details
We selected datasets that were used for ML modelling from drug discovery, materials science, and molecular science applications.
We can distinguish datasets based on the following attributes:
• Labels: experimental or computational observable
• Source: single source and assay, or aggregate of multiple sources or assays
• Task: regression task, or classification task (or regression converted to classification)
Every dataset has the following properties: (1) the range of labels, or the number of classes in the classification context, (2) the size of the experimental error, which is often unknown or not reported, and (3) the number of datapoints. When estimating performance bounds, the selection of a realistic estimate of the experimental noise is key. In the following, we detail the selected datasets and how error estimates were obtained.

Drug binding datasets
The CASF 2016 dataset 22 (also referred to as the PDBbind 2016 core set, N = 285) is a commonly used evaluation dataset for ML / DL scoring functions for the prediction of protein-ligand binding affinities.41 The experimental error of binding affinity data depends on the specific binding assay method; error estimates range from around 0.2 log units for industrial drug research up to 0.69 log units for public affinity data from various sources, as applicable for PDBbind.15,24 The data was obtained from http://www.pdbbind.org.cn/casf.php. The experimental error estimate used was 0.69 log units, as derived in Kramer et al.24 This estimate is based on 2,540 systems with 7,667 measurements.
The BACE dataset 23 (N = 1513) is part of the MoleculeNet benchmark suite.42 As the BACE dataset originates from various sources, we assume an experimental error of 0.69 log units, identical to the CASF 2016 dataset. The BACE dataset has been used for both regression (BACE-r) and classification (BACE-c) tasks.

To obtain an estimate of the experimental noise, we relied on the following order of preference: (1) reported experimental standard deviations for datapoints, (2) the reported standard deviation for the specific experimental assay (if a single well-defined assay was performed for the entire dataset), (3) the standard deviation estimated from duplicate values via pairwise comparison, (4) inter-lab comparison studies of the general method used.
None of the datasets considered here had individually reported standard deviations for all datapoints (1). For datasets that originated from a single, well-defined assay, we used the reported standard deviation of that assay as the noise estimate.
For datasets that are aggregates of multiple studies or methods performed by different labs, we went back to the raw data before de-duplication, if available, and estimated the standard deviation based on pairwise deviations according to the method described by Kramer et al.24 If no duplicate raw data was available, we looked for quantitative inter-lab comparison studies of the specific methods to obtain a noise estimate. For classification datasets, it is more difficult to find reliable noise estimates. For the BACE classification task, we went back to the original regression data and added noise to the regression labels, while maintaining the same class boundary as used for the conversion to the classification task. We then derived noisy classification labels, which we compared to the true classification labels as described in section 4.1 to obtain estimates of the classification performance metrics.

a. Department of Biochemistry, University of Oxford, South Parks Road, Oxford OX1 3QU, UK
b. IBM Research Europe, The Hartree Centre STFC Laboratory, Sci-Tech Daresbury, Warrington WA4 4AD, UK
+ Presenting author at Faraday Discussion

Figure 1 :
Figure 1: Distributions of different performance metrics for regression (a, b, d, e) and classification (c, f) of synthetic datasets, shown as heatmaps. The mean values of the performance metrics are shown in the heatmaps; the standard deviations are overlaid as black contour lines. The synthetic datasets vary in sample size, as shown on the x-axes, and in noise level σ_E, given relative to the data range on the y-axes. For cases (a), (d), (c), and (f), we only considered experimental noise σ_E; for cases (b) and (e), we considered experimental noise σ_E and predictor noise σ_P = σ_E. The range for all datasets is [0,1], with datapoints distributed uniformly over the whole range.

Here, we simulate this effect for a synthetic dataset of N = 100 with range [0,1], by adding Gaussian noise with σ_E,1 = 0.2 to the lower half of the dataset (y < 0.5), and a second noise level of σ_E,2 = 0.05 to the other half of the dataset (y ≥ 0.5). We compute maximum performance bounds and directly compare this case to adding Gaussian noise of σ_E = {0.05, 0.1, 0.2} to the whole dataset.

Figure 2 :
Figure 2: (a) Synthetic dataset of size N = 100 with Gaussian noise of two levels added (σ_E,1 = 0.2 for [0, 0.5), σ_E,2 = 0.05 for [0.5, 1]), shown in blue. The same synthetic dataset with Gaussian noise of σ_E = 0.1 is shown in red. (b) Distributions of the Pearson correlation R for the two scenarios, shown as histograms. The maximum expected performance for the dataset with the two levels of low and high noise (blue) is worse than for the single level of moderate noise (red). For comparison, the low and high noise levels are also shown when applied to the whole dataset (black dashed and dot-dashed lines, respectively). (c) Variation of the mean and standard deviation of the Pearson correlation R with the location of the noise barrier for a uniform synthetic dataset. Varying the noise barrier location corresponds to varying the fraction of the dataset that experiences high noise addition. At a barrier location of 0, Gaussian noise with σ_E = 0.05 is added to the entire dataset (dashed line in (b)). At a barrier location of 1.0, the entire dataset experiences Gaussian noise with σ_E = 0.2 (dot-dashed line in (b)). The barrier location of 0.5 corresponds to the blue case in (b).

Figure 3 :
Figure 3: Performance bounds for different datasets compared to reported ML performance from the literature. Metrics with best performance at a value of 1.0 are shown in blue (left axis); error metrics with best performance at a value of 0 are shown in orange (right axis). For each dataset, the mean and standard deviation of the realistic performance bounds (σ_P = σ_E), as well as the mean of the maximum performance bounds, are shown, if defined. The reported ML model performances for the BACE classification dataset (BACE-c), the Caco-2 dataset, and the Rzepiela dataset seem unrealistically high, given the estimated experimental error. For most other datasets, reported ML model performance remains below the realistic performance bounds, indicating further room for ML model improvement.

Figure 4 :
Figure 4: Uniformly distributed synthetic datasets of size N = 50 with no added noise (a), and with Gaussian noise added with standard deviations of σ = 0.1 (b), σ = 0.25 (c), and σ = 0.5 (d). For the classification case, the boundary b is shown as a vertical dashed line. The resulting false negatives (fn) and false positives (fp) due to the addition of noise are colour-coded. Predictor noise σ_P = 0 for all cases.


The method of Kramer et al.24 is briefly summarised here: the estimated experimental standard deviation σ_E is computed from all possible m pairs of measured duplicate values, where pair i has the measured values y_d,i,1 and y_d,i,2:

σ_E = sqrt( (1 / (2m)) · Σ_{i=1}^{m} (y_d,i,1 − y_d,i,2)² )
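Assuming this form of the pairwise estimator, it can be sketched as follows; the simulated duplicates serve only as a sanity check, and all names are illustrative.

```python
import numpy as np

def sigma_from_duplicates(pairs):
    """Estimate the experimental standard deviation sigma_E from m duplicate
    pairs (y_i1, y_i2) via sigma_E = sqrt( sum_i (y_i1 - y_i2)^2 / (2 m) )."""
    pairs = np.asarray(pairs, dtype=float)
    diffs = pairs[:, 0] - pairs[:, 1]
    return float(np.sqrt(np.sum(diffs ** 2) / (2.0 * len(diffs))))

# sanity check: simulated duplicate measurements with known noise level 0.5
rng = np.random.default_rng(3)
true_values = rng.uniform(-10.0, 0.0, size=5000)
duplicates = np.stack(
    [true_values + rng.normal(0.0, 0.5, size=5000),
     true_values + rng.normal(0.0, 0.5, size=5000)],
    axis=1,
)
estimate = sigma_from_duplicates(duplicates)  # recovers a value close to 0.5
```

The factor of 2 in the denominator accounts for both members of a pair carrying independent noise: the variance of the pairwise difference is twice the single-measurement variance.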

Open Access Article. Published on 04 June 2024. This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. DOI: 10.1039/D4FD00091A

Table 1: Maximum and realistic performance bounds for chemical datasets, compared to leading ML models.