From the journal Digital Discovery: Peer review history

Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS

Round 1

Manuscript submitted on 21 Dec 2022
 

08-Mar-2023

Dear Mr Tom:

Manuscript ID: DD-ART-12-2022-000146
TITLE: Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

This work studies machine learning specific to small (<2000) datasets for molecular design, providing an empirical comparison of different modelling techniques and molecular representations. They analysed the data according to specific metrics that they described: predictive performance, calibration of uncertainty, generalization, and quality of uncertainty in optimization campaigns. They performed experiments on two common applications for molecular design: iterative molecule selection with Bayesian optimization and generalization on out-of-distribution (OOD) molecules.
Comments on the manuscript:
1. The statement that molecular design datasets with <2000 data points are more relevant to real-world experiment is an interesting one. Unfortunately, I am not an expert in this specific subfield, but it could be helpful to elaborate further with substantiation to provide clarity on their motivation.

2. The choice of datasets is comprehensive, with a variety of regression and binary classification tasks. The authors also note that multi-class is a possible extension for future work. However, the spread of volumes could possibly be a little skewed, since the regression datasets run from 150-1116, while the classification datasets are between 1513-1870. In figure 4, we see this being the case where the bottom row of results indicates somewhat close performance for all combinations of featurization and models. Also seen in figure 8, where all 3 classification datasets seem to perform similarly, with the exception of BACE. Is there a way to explain why the clusters for BACE are like this?

3. In page 7/22 of the main text, the authors state that “exact GPs are effective in low-data regimes”. While true, it would be nice if this can be substantiated.

4. Across all figures, I would prefer if the number of datapoints is put in brackets next to the dataset name for better clarity of results.

5. For Bayesian optimization, the idea of categorizing an evaluation as a ‘hit’ to be top 10% (regression) or positive label (classification) is quite interesting. To my understanding, the authors want to illustrate and compare the consistency of results by reporting the number of hits – i.e. the frequency at which it is able to return relatively good results. Is there a clear motivation behind doing this rather than simply returning the mean value (regression) or accuracy (classification) instead?

6. The study of a modified UCB acquisition function is quite useful to illustrate the exploration-exploitation trade-off in BO. Figure 7 indicates that “Models with better predictions and calibration do not necessarily give faster Bayesian optimization” which is shown by how lower beta values allow faster convergence despite poorer models. Would this imply that in a greedy approach, the randomness of sampling is what drove optimisation instead?

Code comments:
1. Datasets may be biased by size and clustering as discussed above.
2. Not all library versions are specified in the repository.
3. The scripts seem to be working in order, although BO ones are unclear.

Reviewer 2

In my opinion, the following issues should be addressed before the paper is suitable for publication.

1. The authors write: “Despite poorer optimization, the fully variance-based sampling strategy achieves better R2 and AMA scores than the greedy strategy … Models with better predictions and calibration do not necessarily give faster Bayesian optimization.” … “When performing Bayesian optimization, even though purely predictive models (UCB with β = 0) find hits faster, their model performance is much worse than any model with some exploratory component (β > 0). We found that for the UCB acquisition function on the Delaney dataset, β = 0.33 tends to give best performance.”

The authors never really explain the observation as far as I can see. To me, the explanation is likely that a low β leads to training sets with molecules that tend to have values close to the target value. So the ML model makes better predictions in this area, which speeds up the search. But the overall performance is worse compared to ML models from training sets where the largest uncertainties are systematically eliminated (β = 1).

Furthermore, is the search efficiency for β = 0 and 0.33 really statistically different given the confidence intervals?

2. In addition to random selection, I would also suggest 1-nearest neighbor as an appropriate null model. In other words: what happens if you simply pick the molecule that is most similar to the current best molecule in the training set? The similarity could be computed using either the representation or as a Tanimoto similarity.

3. The Mordred descriptors include Crippen logP, which has been parameterised against experimental logP values. logP values correlate very strongly with solubility and solvation free energies. Thus, the good performance of Mordred descriptors for these two datasets may not be transferable to other datasets and could impact conclusions such as:

• Mordred features are quite robust, independent of model choice.
• GPs with Mordred features are a solid modelling choice for small datasets. This combination fared well in all tasks and experiments. Model setup and optimization is relatively straightforward.

This should be discussed.

4. In general, I would suspect that simpler representations would perform better for small datasets. For example, that’s probably why the GNN doesn’t perform well. So, using ECFP4 (or some other simpler fingerprints, like MACCS keys) may perform better than ECFP6.

5. The authors write “Note that the graph embeddings were not used due to the poor performance of GNN embedder at extremely low data regimes (~50).” Where did that come from? I can’t see any data to support this. It also surprises me. Given point 4, I would suggest a pretrained model to outperform the GNN for small datasets.

6. The authors write “For regression, the initial design dataset comprises 5% of the dataset size and is randomly sampled.” I don’t understand this sentence. Does that mean that the first ML model that is employed is trained on 1116*0.05 = 56 points for the Delaney set?

If so, then I don’t understand the curves in Figure 6 prior to the dotted line. At N < 56, is y the lowest solubility value for a random sample of N points? If so, why are the curves different? Is it because it corresponds to different random samples? This should be clarified. Also, how is 5% chosen?

7. Related to point 6: there are a couple of Bayesian optimisation studies (that probably should be cited: 10.1039/d1me00093d and 10.1002/anie.202108116) that get better-than-random performance with fewer than 50 points. However, their feature vectors are significantly simpler than the ones used here (see point 4). This should be discussed.

Jan Jensen (I choose to review this paper non-anonymously)

Reviewer 3

The authors have made a python package called DIONYSUS that can be used to calibrate and evaluate probabilistic models on chemical problems with low datasets. The package contains four popular chemical representations that can be used with both deep learning and machine learning models. The authors assessed the performance of various combinations of representations and prediction models for binary classification and regression tasks with small data. The evaluation was based on calibration, accuracy metrics (R^2 and AUROC), and generalization ability. The results showed that deep learning models underperformed due to the lack of data, and Gaussian process and NGBoost models with Mordred descriptors performed the best among the other combinations, which supports a common expectation that shallow learning performs better than deep learning for low-data problems.

Unlike other studies that focus solely on achieving state-of-the-art accuracy, this study emphasizes the importance of assessing models from various perspectives, particularly their generalization ability of prediction on out-of-distribution data. Regarding this, the authors also devised a new data clustering method. This is important for many chemical and biological problems that suffer from small data. Overall, the manuscript is well-written and provides a useful toolkit. Therefore, this reviewer recommends that the manuscript be published in this journal.


 

Reviewer 1:

1. The statement that molecular design datasets with <2000 data points are more relevant to real-world experiment is an interesting one. Unfortunately, I am not an expert in this specific subfield, but it could be helpful to elaborate further with substantiation to provide clarity on their motivation.

Molecular design datasets, particularly those with experimentally determined targets, are limited due to the temporal and monetary costs associated with conducting the measurements. Such datasets also typically contain several flavours of bias, including bias toward reporting only positive results (e.g. chemical reactions with large yields) and bias toward measuring molecules that satisfy criteria such as the number of atoms, the element types, and the availability of synthetic routes. Such biases are common in molecular datasets generated in both academia and industry, and can further limit their size.

We further elaborated on the point by adding to the introduction, Section I:
“In real-world molecular design campaigns, particularly in the initial stages, only small molecular datasets (≤ 2000 data points) are available due to the expense (monetary, resource, or labour) associated with the design, synthesis, and characterization of chemicals. Examples of particular applications in the low-data regime include design of optoelectronic materials (ie. organic photovoltaics,6 or photoswitching molecules7), prediction of biochemical properties (ie. olfactory response,8,9 or mosquito repellency10), and drug discovery.11,12”


2. The choice of datasets is comprehensive, with a variety of regression and binary classification tasks. The authors also note that multi-class is a possible extension for future work. However, the spread of volumes could possibly be a little skewed, since the regression datasets run from 150-1116, while the classification datasets are between 1513-1870. In figure 4, we see this being the case where the bottom row of results indicates somewhat close performance for all combinations of featurization and models. Also seen in figure 8, where all 3 classification datasets seems to perform similarly, with the exception of BACE. Is there a way to explain why the clusters for BACE are like this?

We thank the reviewer for the note about the dataset sizes. The datasets were chosen to span a wide range of sizes. We address possible biases of the dataset size by considering data ablation studies with cluster splits of each dataset.

We note that the tasks and chemical space of the datasets are very different from each other, and we can only make qualitative assessments of trends across regression and classification tasks, and across the datasets. For example, the close performance in Figure 4 for the classification sets is likely due to binary classification being a relatively simpler task than regression of continuous values, and the AUROC metric being less strict than the R2 score. Instead we make conclusions about the features and models within the dataset, such as the relative performance of GPs or the Mordred descriptors.

The BACE molecules in the cluster space predominantly form two distinct and separated superclusters. This is inherent to the dataset, and will vary with different molecules and tasks. As more data is included in the training set, we can observe the fraction of molecules (f) in the training data that occupy the lower cluster of the reduced dimensional space (UMAP 2 < 6.0). This fraction is normalized by the fraction observed in the test set, meaning a normalized training fraction of f = 1.0 is the most representative of the held out test set, which we observe occurs near the maximal size of the training set.

Looking at this fraction as a function of the training size, we see a peak at around 50% of the training data, at the same point where we observe a drop in performance and calibration of the model in the generalizability experiment. A high f indicates that the training splits are concentrated in the lower supercluster relative to the test set, explaining the poorer performance on the test set. Not only does this explain the characteristic shape of the generalizability plots for BACE, but it also demonstrates the ability of the cluster splits to simulate OOD molecules within the dataset.
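
For illustration, a minimal sketch of how such a normalized training fraction can be computed from the 2D reduced-dimensional coordinates is given below. This is an illustrative sketch, not the DIONYSUS implementation; the UMAP 2 < 6.0 cutoff follows the description above, and the array names are hypothetical.

import numpy as np

def normalized_lower_fraction(umap_train, umap_test, cutoff=6.0):
    """Fraction of training molecules in the lower supercluster (UMAP 2 < cutoff),
    normalized by the corresponding fraction in the held-out test set."""
    f_train = np.mean(umap_train[:, 1] < cutoff)  # second UMAP coordinate
    f_test = np.mean(umap_test[:, 1] < cutoff)
    return f_train / f_test  # f = 1.0: the training split is representative of the test set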

The plots demonstrating this are added in SI: Figures S5 and S6. Text was added to explain our observation at the end of Section IV C.:
“For all models, there is a dip in performance at training set size of about 50% for BACE due to the distribution of molecules in the chemical space of the dataset. The cluster splits feature space is shown in Figure S5. The BACE molecule clusters predominantly fall into two larger superclusters. Initially, we observe an increase in AUROC and decrease in ECE due to inclusion of more training data. However, at around 50% of the training set size, the training set becomes more concentrated around one of the superclusters, and hence is less representative of the test set (Figure S6). This is particularly pronounced in the calibration metrics, in which there is a rise in ECE at around 50% of the BACE dataset. This not only explains the characteristic shape of the model performance metrics for BACE ablated cluster splits, but also demonstrates the ability of the cluster splits to simulate OOD molecules within molecular datasets.”


3. In page 7/22 of the main text, the authors state that “exact GPs are effective in low-data regimes”. While true, it would be nice if this can be substantiated.

We have included a more detailed description of why GPs are effective in low-data regimes, emphasizing the ability to perform direct inference with fewer parameters, making the model more robust to overfitting. We have also added citations for works describing the use of GPs in low-data regimes. Additions are found in Section III C.:
“GP is a distance-aware Bayesian model in which predictions are made via transformed relative distances between data points, dictated by a kernel function. Unlike Bayesian deep learning, which require approximate inference of the posterior distribution such as through approximate variational methods,70,71 GPs allow for exact inference of the posterior predictive distribution directly from the input, using relatively few parameters and making it more robust to overfitting. Additionally, exact inference becomes computationally expensive in larger datasets, making GPs an ideal choice for low-data probabilistic predictions, as demonstrated in many previous works.7,36,72,73”
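
As a concrete illustration of exact GP inference on a small featurized dataset, a minimal sketch using scikit-learn is shown below. This is for illustration only; it is not the GP implementation used in DIONYSUS, and the synthetic arrays stand in for molecular features and targets.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# X: (n_molecules, n_features) descriptor matrix; y: property values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))          # stand-in for e.g. Mordred features
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=100)

# Exact inference: the posterior mean and variance follow in closed form from the
# kernel matrix, with only the kernel hyperparameters to fit.
kernel = Matern(nu=2.5) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X[:80], y[:80])

mean, std = gp.predict(X[80:], return_std=True)  # predictive mean and uncertainty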


4. Across all figures, I would prefer if the number of datapoints is put in brackets next to the dataset name for better clarity of results.

We thank the reviewer for the suggestion. We have updated the figures to include the total number of datapoints in each dataset. The changes are found in Figures 4, 8 and 9.


5. For Bayesian optimization, the idea of categorizing an evaluation as a ‘hit’ to be top 10% (regression) or positive label (classification) is quite interesting. To my understanding, the authors want to illustrate and compare the consistency of results by reporting the number of hits – i.e. the frequency at which it is able to return relatively good results. Is there a clear motivation behind doing this rather than simply returning the mean value (regression) or accuracy (classification) instead?

We thank the reviewer for the question. Indeed, we wanted to use hits as a metric in order to compare the efficiency of each optimization strategy in a simulated molecular discovery campaign, where there is a specific design scenario (maximizing some property) and the optimal structure is unknown. Here, we would like to accumulate as many promising molecular structures as possible, as quickly as possible. [1] In the drug discovery pipeline, for instance, it is beneficial to collect many promising drug candidates during early-stage tasks such as hit identification, given that the attrition rate of drug candidates is high later in the pipeline (e.g. pre-clinical and clinical) for reasons of efficacy and toxicity. [2]

Regarding the preference toward recording hits rather than accuracy for the Bayesian optimization tasks with binary labels: surrogate model classification accuracy across the entire dataset is not necessarily correlated with the hit identification rate, and these two metrics may correspond to different research objectives. If the goal is to maximize classification accuracy across the molecular library, it would behoove one to use an active learning-type strategy instead. This discrepancy in performance between these two paradigms is illustrated in Figure 7 for Bayesian optimization experiments with continuous targets (i.e. regression surrogate model).

To increase readability of our results, we have normalized the metrics in Tables II and III by the maximum number of “hits” achievable in the dataset in order to emphasize the score-like nature of the “hits” in achieving a certain design scenario. Additional discussion is added in Section IV B referring to these scores. Text is added to describe this normalization in Section IV B.:
“To succinctly summarize all experiments, the number of hit molecules are recorded for the optimization trace over the separate runs, not including those found by serendipity in the initial random search. The number of hits is normalized by the total number of possible hits accessible throughout the optimization to give a fraction of hits achieved, mimicking a real-life molecular design scenario of finding as many materials that extremize a certain property as possible. The results for the datasets of interest are shown in Table II and III.”
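
A minimal sketch of this normalization for a regression task is given below. It is illustrative only: the variable names are hypothetical, and hits are defined as the top 10% of the library, following the text.

import numpy as np

def fraction_of_hits(queried, all_values, top_fraction=0.10, n_init=0):
    """Fraction of achievable hits found during optimization.

    queried: property values in the order they were measured by the optimization loop.
    all_values: property values of the whole library (known for benchmark datasets).
    Hits are molecules in the top `top_fraction` of the library; hits found in the
    first `n_init` (random initial design) measurements are excluded.
    """
    threshold = np.quantile(all_values, 1.0 - top_fraction)
    n_possible = int(np.sum(np.asarray(all_values) >= threshold))
    n_found = int(np.sum(np.asarray(queried[n_init:]) >= threshold))
    return n_found / n_possible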


6. The study of a modified UCB acquisition function is quite useful to illustrate the exploration-exploitation trade-off in BO. Figure 7 indicates that “Models with better predictions and calibration do not necessarily give faster Bayesian optimization” which is shown by how lower beta values allow faster convergence despite poorer models. Would this imply that in a greedy approach, the randomness of sampling is what drove optimisation instead?

We thank the reviewer for this question. Indeed, the Bayesian optimization experiments with the interpolated UCB acquisition function indicated that by tuning the exploration-exploitation parameter, we were able to interpolate between two different behaviours of active learning: one where the speed of convergence to the best molecule in the dataset is prioritized (mean term emphasized), and the other where regression performance across the entire dataset is prioritized (uncertainty term emphasized).

In the former case, we intended to comment on how models with better overall prediction/calibration on the dataset do not necessarily give better Bayesian optimization performance. A mean-emphasized UCB function (lower value of beta) does not promote selection of molecules for which the model is highly uncertain, and selects molecules based only on their merit value. Thus, the model becomes very accurate only for high-merit molecules. We would therefore not say that randomness drove the optimization per se, although the initial design random sampling is important in this case.


Code comments:
1. Datasets may be biased by size and clustering as discussed above.
We provide analysis over a wide variety of dataset sizes in order to observe qualitative trends across datasets and tasks. Ablation studies in the generalizability experiment systematically study the effect of the size on the model performance and calibration. Dataset size differences between regression and binary classification are addressed in our above response.

While each dataset will have biases (not only in the size, but also in how the data is obtained and the difficulty of the task), our experiments aim to observe qualitative effects of model and featurization choices by looking across a wide variety of chemical datasets.

2. Not all library versions are specified in the repository.
This has been updated in the repository.

3. The scripts seem to be working in order, although BO ones are unclear.
The script has been improved and updated in the repository.


Reviewer 2:

1. The authors write: “Despite poorer optimization, the fully variance-based sampling strategy achieves better R2 and AMA scores than the greedy strategy … Models with better predictions and calibration do not necessarily give faster Bayesian optimization.” … “When performing Bayesian optimization, even though purely predictive models (UCB with β = 0) find hits faster, their model performance is much worse than any model with some exploratory component (β > 0). We found that for the UCB acquisition function on the Delaney dataset, β = 0.33 tends to give best performance.” The authors never really explain the observation as far as I can see. To me, the explanation is likely that a low β leads to training sets with molecules that tend to have values close to the target value. So the ML model makes better predictions in this area, which speeds up the search. But the overall performance is worse compared to ML models from training sets where the largest uncertainties are systematically eliminated (β = 1). Furthermore, is the search efficiency for β = 0 and 0.33 really statistically different given the confidence intervals?

We thank the reviewer for the comment. We agree that this observation is under-explained in the main text, and the reviewer’s explanation of the differences in performance for the various β values is almost identical to our own. Note that we have changed β for the interpolated UCB (used in the scan) to be βinter to improve readability, and avoid confusion with the term in the typical UCB. Higher βinter gives better surrogate models as the training set becomes more diverse and representative of the full dataset, including points of greatest uncertainty into the training data.

Indeed, the difference in search efficiency for β = 0 and 0.33 is not significant for the GPs due to overlap in the confidence intervals. Likewise, for β = 0, 0.25, and 0.5 for NGBoost. We were trying to indicate the improvement in the performance metrics, particularly in the R2 score in the earlier batches of optimization when using more explorative strategies like β = 0.33 for the GPs (Figure 7B).

We have added text to the manuscript Sec IV B. to indicate the similarity of their search efficiencies, but the improvement of prediction and calibration metrics at β = 0.33. We also elaborate on the improved surrogate model performance with higher βinter values:
“In the BO traces (Figure 7A), the best mean trace values correspond to βinter = 0.25 for GPs, and βinter = 0.5 for NGBoost, although there is overlap in the confidence intervals with lower β traces. This corresponds to β = 0.33 and 1, respectively, in the typical UCB acquisition function (Eq. 7). At higher values of βinter, optimization performance quickly degrades in the GP model, while NGBoost remains performant, identifying the optimal molecule within the budget. NGBoost is able to find the optimal molecule at all values of βinter , and in general, NGBoost performs better than GP at data regimes of ∼ 100 molecules.”

“The variance-based exploration strategy suggests more diverse candidates at each iteration, allowing the models to train on a more diverse set of molecules, hence improving the prediction and calibration metrics on the test set. Models with better predictions and calibration do not necessarily give faster Bayesian optimization. A good balance of optimization search efficiency and surrogate model performance is achieved at βinter = 0.25 for GPs, and 0.5 for NGBoost. At these values, we observe improved predictions and uncertainty calibration, especially at early stages of the optimization, with similar search efficiencies to those of more exploitative (lower βinter) strategies.”
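
For reference, one convex-combination form of the acquisition function that is consistent with the correspondence quoted above (βinter = 0.25 ↔ β = 0.33, βinter = 0.5 ↔ β = 1) is sketched below. The exact definition used in the manuscript is its Eq. 7 and surrounding text; this snippet is only an illustration.

def interpolated_ucb(mean, std, beta_inter):
    """Interpolate between exploitation (beta_inter = 0, mean only) and pure
    variance-based exploration (beta_inter = 1, std only).

    Dividing by (1 - beta_inter) recovers the usual UCB form mean + beta * std
    with beta = beta_inter / (1 - beta_inter), e.g. 0.25 -> 0.33 and 0.5 -> 1.
    """
    return (1.0 - beta_inter) * mean + beta_inter * std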



2. In addition to random selection, I would also suggest 1-nearest neighbor as an appropriate null model. In other words: what happens if you simply pick the molecule that is most similar to the current best molecule in the training set? The similarity could be computed using either the representation or as a Tanimoto similarity.

We thank the reviewer for proposing to use 1-nearest neighbor as an additional baseline model. We have included this baseline for all BO traces (Figure 6, S1, and S2). The nearest neighbor is determined by the Tanimoto similarity of the circular Morgan fingerprints of the molecules, as suggested by the reviewer. The tabular results are also added in Tables II and III.

Where appropriate, we have also added comparison to this new null model. We have added the following explanation for our baseline models in Sec. IV. B.:
“As baseline search strategies, we use a random search (random) and a 1-nearest-neighbour strategy (1-nn), the latter of which queries the molecule which has the highest MFP Tanimoto similarity to the current best for measurement.”
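
A minimal sketch of such a 1-nn query with RDKit Morgan fingerprints and Tanimoto similarity is shown below. It is illustrative only; the function and parameter names are not the repository's API.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def one_nearest_neighbour(best_smiles, candidate_smiles, radius=3, n_bits=2048):
    """Index of the unmeasured candidate most similar to the current best molecule."""
    fp_best = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(best_smiles), radius, nBits=n_bits)
    fps = [AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(s), radius, nBits=n_bits) for s in candidate_smiles]
    sims = DataStructs.BulkTanimotoSimilarity(fp_best, fps)
    return max(range(len(sims)), key=sims.__getitem__)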



3. The Mordred descriptors include Crippen logP, which has been parameterised against experimental logP values. logP values correlate very strongly with solubility and solvation free energies. Thus, the good performance of Mordred descriptors for these two datasets may not be transferable to other datasets and could impact conclusions such as:
• Mordred features are quite robust, independent of model choice.
• GPs with Mordred features are a solid modelling choice for small datasets. This combination fared well in all tasks and experiments. Model setup and optimization is relatively straightforward.

We thank the reviewer for raising this point. The Mordred descriptors contain 21 descriptors that are based on the Crippen [3] partition coefficient (logP) and molar refractivity (MR). We have masked these descriptors and re-evaluated the prediction and calibration performance on the Delaney and Freesolv datasets. The difference in the prediction R2 score between the original and masked Mordred features is not statistically significant for either dataset. The miscalibration score is also largely unaffected, with the exception of the BNN and GP models on the Delaney dataset, where the masked Mordred features counterintuitively perform slightly better. We conclude that the presence of the Crippen values does not strongly affect our results for the solvation-based datasets, and the conclusions on the transferability of the Mordred descriptors between datasets remain valid.

We hypothesize that the success of the Mordred features can be explained via the effect of “random kitchen sinks” [5]: a high-dimensional collection of random features can serve as a reliably strong feature set, as models learn the correlations between the feature dimensions and the target. In this case, the Mordred features can be interpreted as a large collection of random molecular graph statistics, and when this collection is large enough, we have a robust feature set that performs well across multiple scenarios.

We have added plots demonstrating the masked Mordred feature performance in the SI Figure S1, and made a note in the main text (Section IV A.):
“We note that the Mordred features contain 21 descriptors that are related to the Crippen partition coefficient and molar refractivity, which are parameterized to experimental solubility measurements.84 To ensure that there is no advantage provided to the solvation datasets (Freesolv and Delaney), the Crippen-related descriptors are masked, and the predictions are repeated for the datasets (Figure S1). Indeed we see no statistically significant difference between predictions on the original Mordred features, and those that have the Crippen-related descriptors removed, indicating that the impact of the Crippen values in the Mordred features on the performance of the model on the solvation datasets is low.”
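
For example, the masking could be done by dropping the matching descriptor columns by name. The sketch below assumes the Mordred features are held in a pandas DataFrame; the name patterns are illustrative and should be replaced by the 21 Crippen-related descriptor names identified in the text.

import pandas as pd

def mask_crippen_descriptors(features: pd.DataFrame, patterns=("SLogP", "SMR")) -> pd.DataFrame:
    """Drop Mordred columns whose names match Crippen-related patterns (illustrative)."""
    crippen_cols = [c for c in features.columns
                    if any(p.lower() in c.lower() for p in patterns)]
    return features.drop(columns=crippen_cols)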



4. In general, I would suspect that simpler representations would perform better for small datasets. For example, that’s probably why the GNN doesn’t perform well. So, using ECFP4 (or some other simpler fingerprints, like MACCS keys) may perform better than ECFP6.

There are a myriad of hyperparameters involved in selecting fingerprints. Our goal was to attain the simplest MFP that still encoded important molecular motifs. The choice of ECFP6 (radius 3) is motivated by the work in [6], which observed that ECFP6 outperforms ECFP4 because it can capture aromatic six-member ring structures and their attached functional groups, which minimally require a radius of 3 to be fully encoded. The dataset used in [6] has 449 data points. Ultimately, the optimal choice of structural fingerprints is highly dependent on the dataset and task. We have added this detail in Section III A.:
“A radius of 3 was chosen in order to capture important molecular motifs such as the aromatic six-member ring structure, present in many organic molecules in our datasets.61”

Regarding the GNN, the poor performance at lower data regime is likely related to the number of parameters in the neural networks, which require more data to properly train.

Additionally, we remove zero-variance feature dimensions, which contain no information across the molecules in the dataset; this increases speed and eliminates redundant dimensions. We have included this in the discussion of the data pre-processing in Section III B.:
“For all datasets, the SMILES were canonicalized using RDKit, while duplicated, invalid, and entries with fragments or salts are removed. Additionally, all feature dimensions that have zero variance across the molecules were removed to reduce the size of feature space along redundant features, and improve speed of the models.”
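
For instance, the zero-variance filtering can be expressed with a variance threshold of zero, as in the illustrative sketch below (not the repository code).

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 0.0, 3.2],
              [0.5, 0.0, 1.1],
              [2.3, 0.0, 0.7]])  # second feature is constant across all molecules

X_reduced = VarianceThreshold(threshold=0.0).fit_transform(X)  # shape (3, 2): constant column dropped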



5. The authors write “Note that the graph embeddings were not used due to the poor performance of GNN embedder at extremely low data regimes (~50).” Where did that come from? I can’t see any data to support this. It also surprises me. Given point 4, I would suggest a pretrained model to outperform the GNN for small datasets.

The GNN model is pretrained on the datasets we have studied, using the same splits as our probabilistic models in the first experiment. The pretraining required to generate the graph embeddings is done on the molecules and targets of the entire dataset. This would give an unfair advantage to the embeddings, as the GNN would have access to values prior to measurement in the BO setting. Instead, the pretrained model can only be trained on the initial set of the BO (~50 molecules). The poor performance of the same GraphNets architecture in the GNNGP on the BioHL dataset suggests that GNN embeddings at the even lower data regimes encountered in BO will not be a good feature choice.

To clarify this point, we have added to the text in Section IV B.:
“Note that graph embeddings were not used, as the GraphNets GNN embedder requires training on the targets. While the embeddings may be effective when trained on the entire dataset, the GNN embedder would have seen the data prior to measurement in the setting of a BO experiment. Instead, the embeddings would have to be trained from the data available to the BO algorithm (∼ 25 − 100), and would not be informative, provided the performance of the GraphNets architecture in the GNNGP on BioHL (Figure 4).”

Additionally, to clarify the pretraining and generation of the GraphNets embedding, we have added to the caption of Figure 2:
“Graph embeddings are extracted from the global node of a GraphNets GNN predictor pretrained on the molecules and targets of the dataset. Molecules in the test set are not accessible during pretraining to ensure no data leakage.”



6. The authors write “For regression, the initial design dataset comprises 5% of the dataset size and is randomly sampled.” I don’t understand this sentence. Does that mean that the first ML model that is employed is trained on 1116*0.05 = 56 points for the Delaney set?
If so, then I don’t understand the curves in Figure 6 prior to the dotted line. At N < 56, is y the lowest solubility value for a random sample of N points? If so, why are the curves different? Is it because it corresponds to different random samples? This should be clarified. Also, how is 5% chosen?

Indeed, the initial sample size for the Delaney dataset is 5%, so 1116*0.05=56, which are selected randomly based on the seed of the run. This is what is indicated by the vertical dotted line in each subplot of Figure 6. The traces in the region to the left of the line are slightly different because each optimization run commences with its own random seed, regardless of the surrogate model used. Indeed, this approach could introduce some variability when comparing the optimization performance of each strategy, but since we execute 30 independent runs for each strategy, we believe this variability is inconsequential, as evidenced by the overlapping confidence intervals.
Regarding the choice of 5% of the dataset, we address this in our response to the next point (comment 7).
We agree that Figure 6 would benefit from added clarification, and have updated the figure in the revised version of the manuscript to better explain the initial design portion of the experiments. We have added to the caption of Figure 6 to reflect the changes, and changed the vertical dotted line into a shaded region to indicate the initial random sampling strategy. This was also done for the optimization traces in Figure 7, S2 and S3.

We have added the following sentence to Sec. IV. B. to clarify the experimental parameters of the BO experiments:
“For regression, the initial design dataset comprises 5% of the dataset size and is randomly sampled based on the seed of the run, with a minimum of 25 molecules. For binary classification, we start with 10%, with a maximum of 100 molecules, to avoid only sampling molecules of the same class. In the case of the Delaney dataset, which comprises 1116 molecules, we use 56 for the initial design. For all datasets except for BioHL, the allotted budget is 250 measurements, excluding the randomly sampled initial design. The budget for the BioHL dataset is reduced to 75 measurements due to the small size of the dataset”
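
The sizing rule described above can be summarized as follows (an illustrative sketch; the function name is hypothetical).

def initial_design_size(n_molecules: int, task: str) -> int:
    """Number of randomly sampled molecules used to seed the surrogate model."""
    if task == "regression":
        return max(25, round(0.05 * n_molecules))   # 5% of the dataset, at least 25
    if task == "classification":
        return min(100, round(0.10 * n_molecules))  # 10% of the dataset, at most 100
    raise ValueError(f"unknown task: {task}")

initial_design_size(1116, "regression")  # 56, matching the Delaney example above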



7. Related to point 6: there are a couple of Bayesian optimisation studies (that probably should be cited: 10.1039/d1me00093d and 10.1002/anie.202108116) that get better-than-random performance with fewer than 50 points. However, their feature vectors are significantly simpler than the ones used here (see point 4). This should be discussed.

We thank the reviewer for pointing us to these Bayesian optimization studies. Indeed, the number of initial design points can greatly affect the outcome of a data-driven optimization campaign and is therefore an important hyperparameter for real-world experiments. Also, we certainly agree that smaller, expert-crafted feature vectors tailored to a specific application will often outperform more general higher-dimensional features, especially on low-data predictive tasks.

Regarding “Bayesian optimization of nanoporous materials” by Deshwal et al.: The authors elect to vary the initial training set size ( {5, 10, 15, 20, 25} ) and study its impact on BO performance. They find that their algorithm can deliver promising COFs faster if initialized with fewer points, indicating that a small initial training set is sufficient to ground the surrogate model. We believe this is an interesting observation. However, one striking difference between this study and ours is the construction of feature vectors. Deshwal et al. construct a 12-element representation of COFs based on expert-selected features representing structural and chemical descriptors thereof. Expert–selected descriptors can accelerate Bayesian optimization performance if they provide sufficient correlation with the objective. [4] In the current study, we focus on providing a general purpose tool which can be used without the need for extensive down-selection of molecular features on an application-specific basis. Thus, we deal with larger feature vectors, and we may need more initial design points to assure proper training of the surrogate model.

Regarding “Bayesian optimization of high-entropy alloy compositions for electrocatalytic oxygen reduction” by Pedersen et al.: The authors use a GP surrogate to map alloy molar fraction vectors to current densities. In their experiments however (as the reviewer has pointed out), the molar fraction input vector has only 5 elements, while ours typically have thousands. In our experience, lower dimensional problems often enable starting from fewer initial design points.

Finally, both studies employ strictly GP or tree-based surrogate models, which, according to our study and others, perform well in limited-data settings. In our study, we also test neural network-based surrogates which typically struggle in this regime. Thus, a (relatively) large initial design set (5/10%) is chosen in an attempt to level the playing field in the optimization experiments.

We have cited both articles and have added the following discussion to main text Sec IV. B.:
“Although several studies73,95 have shown that a smaller initial design set (as low as 5 data points) can result in increased BO performance on materials science design tasks, we elect not to vary the number of initial design points, as we are focused on the effects of the surrogates and the molecular features on the optimization. The aforementioned studies both employ only GP or tree-based surrogate models, which have been observed to perform sufficiently well with limited training data, while in this study, we also consider neural network surrogate models, which typically have trouble training in such data regimes. Additionally, the studies employ expert-crafted low-dimensional features tailored to their respective datasets, which have been shown to improve BO performance,93 while we use general purpose higher-dimensional features across multiple datasets and tasks. Thus, we select relatively larger initial design sets of 5% and 10% of the datasets for regression and binary classification, respectively, in order to account for the inclusion of deep surrogate models, and the effects of higher-dimensional but multi-purpose molecular features.”


Reviewer 3:

The authors have made a python package called DIONYSUS that can be used to calibrate and evaluate probabilistic models on chemical problems with low datasets. The package contains four popular chemical representations that can be used with both deep learning and machine learning models. The authors assessed the performance of various combinations of representations and prediction models for binary classification and regression tasks with small data. The evaluation was based on calibration, accuracy metrics (R^2 and AUROC), and generalization ability. The results showed that deep learning models underperformed due to the lack of data, and Gaussian process and NGBoost models with Mordred descriptors performed the best among the other combinations, which supports a common expectation that shallow learning performs better than deep learning for low-data problems. Unlike other studies that focus solely on achieving state-of-the-art accuracy, this study emphasizes the importance of assessing models from various perspectives, particularly their generalization ability of prediction on out-of-distribution data. Regarding this, the authors also devised a new data clustering method. This is important for many chemical and biological problems that suffer from small data. Overall, the manuscript is well-written and provides a useful toolkit. Therefore, this reviewer recommends that the manuscript be published in this journal.

We thank the reviewer for the positive assessment of our work.


#####
Citations in the response:
[1] Graff, David E., Eugene I. Shakhnovich, and Connor W. Coley. "Accelerating high-throughput virtual screening through molecular pool-based active learning." Chemical science 12.22 (2021): 7866-7881.
[2] Sun, Duxin, et al. "Why 90% of clinical drug development fails and how to improve it?." Acta Pharmaceutica Sinica B (2022).
[3] Wildman, Scott A., and Gordon M. Crippen. "Prediction of physicochemical parameters by atomic contributions." Journal of chemical information and computer sciences 39.5 (1999): 868-873.
[4] Häse, Florian, et al. "Gryffin: An algorithm for Bayesian optimization of categorical variables informed by expert knowledge." Applied Physics Reviews 8.3 (2021): 031406.
[5] Rahimi, Ali, and Benjamin Recht. "Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning." Advances in neural information processing systems 21 (2008).
[6] Jorner, Kjell, et al. "Machine learning meets mechanistic modelling for accurate prediction of experimental activation energies." Chemical Science 12.3 (2021): 1163-1175.




Round 2

Revised manuscript submitted on 20 Apr 2023
 

21-Apr-2023

Dear Mr Tom:

Manuscript ID: DD-ART-12-2022-000146.R1
TITLE: Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry


 
Reviewer 2

The authors have addressed my concerns
Jan Jensen

Reviewer 1

My comments and questions have been satisfactorily addressed, and I am happy to recommend this for publication.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.