From the journal Digital Discovery
Peer review history

Improving molecular machine learning through adaptive subsampling with active learning

Round 1

Manuscript submitted on 13 Mar 2023
 

13-Apr-2023

Dear Dr Reker:

Manuscript ID: DD-ART-03-2023-000037
TITLE: Improving Molecular Machine Learning Through Adaptive Subsampling with Active Learning

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

Wen et al. present an interesting and relevant study of the capacity of subsampling with active learning to improve the performance of machine learning models. The manuscript is well written but I see a few major issues and questions that need to be addressed before the manuscript can be recommended for publication:

The Authors should introduce the individual sampling methods that they are comparing, and they should motivate their choice (i.e., make the case that the selected methods are indeed representative of the state of the art). They should clearly define, use and distinguish the terms “sampling”, “subsampling”, “oversampling” and “undersampling” throughout the manuscript.

The Authors should please check whether the selected sampling methods are at all applicable to binary molecular fingerprints. For example, I don’t think that SMOTE can be used in this scenario.

Abstract: Should include concrete results (values).

Abstract: The statement “We show that active learning-based subsampling can lead to better molecular machine learning performance” is too broad (in fact, it may be misleading) since this was only shown for random forest and only for the case of Morgan fingerprints.

Abstract and Methods section: The Authors should clarify that this work is about binary classification tasks.

Methods section (this should actually become a part of the Results section): The Authors should motivate their choice of the data sets employed: what makes them particularly useful to investigate this research question? How do the individual data sets differ from each other – what aspects do the individual data sets cover?

Methods section: The Authors should motivate why they use a different data split protocol for the Breast Cancer data set than the MoleculeNet AI data sets. The implications of their choice should be elaborated.

Methods section, pg 7: It seems that “Balanced” is undefined or at least confusing, since there are many things that can be balanced (e.g., weights, or instances).

Results section: Observations made by the Authors should be reported in past tense, not present tense (e.g., page 8).

Results section: “In contrast to previous studies”: Readers will expect their citation at this point, not some sentences later.

Results section, page 9: “To determine whether this drop”: “this” is undefined; avoid linking back to the previous paragraph.

Conclusions section: It should be made clear that the observations were only made for random forest and only in the combination with binary Morgan fingerprints.

Figure 1: Parts of the figure are clearly too small to be interpretable. Please enlarge the panels, e.g., by using landscape format and by splitting the figure into two parts. Importantly, all axes of all panels and insets should be scaled consistently (within a series) to make them comparable. In the current version, the use of different scales causes severe problems for the interpretation.

Reviewer 2

From the data perspective, I think the authors did a great job of documenting their protocol and open-sourcing their code, as well as sharing an example notebook. The only little thing I would suggest to the authors is adding the access dates of the datasets. Also, if the authors find it desirable, maybe package the code into an installable package for easier usage, instead of just a script file.

Reviewer 3

Authors benchmark the active learning subsampling of molecular and breast cancer datasets against other subsampling methods. Subsampling is shown to generally improve the performance of the random forest (RF) models (trained with the subsampled datasets), both compared to models trained with full data and other subsampling methods, albeit there is some variation in the results depending on the dataset in question and the reference subsampling approach.

This contribution is a comprehensive benchmarking work that also provides insights on how AL subsampling principally works. I appreciate especially the analysis as a function of error levels in the data, analyses on the possible origins of the AL benefit, and the results being supported with statistical testing. The manuscript should be of interest to the readers of Digital Discovery after the following major and minor comments have been addressed properly by the authors:

1. Can the authors comment on the computational load of the active learning (AL) subsampling compared to other subsampling methods, or possible downsides of the AL approach?

2. Are there any specific properties of datasets/applications that are expected to benefit (or not to benefit) from the AL subsampling? This would be useful information for the readers trying to decide if they should adopt the AL approach in their project and to me it seems the variety of datasets would provide some basis also for this kind of insight.

3. The benchmarking has been performed with RF models as the surrogate models of the AL subsampling approach and RF as the model of choice for the final benchmarking. Can the authors comment on how a change of the 'final use' model would affect the results? Is the AL benefit expected to remain if the 'final use' model is of a different type than what is used as the surrogate model during the subsampling step, or should users choose a similar type of model? This is useful information for the readers because increasingly complex model architectures (i.e., hard to tune and choose hyperparameters for, slow to train) are being developed, as the authors also state, as well as models tailored to a specific application. In these cases, it would be of interest to be able to use simple models for the initial stages of the ML modeling pipeline and to switch to a complex model only at later stages.

4. In Fig. 1B 'random selection' runs, the proportion of pos. labels in the smallest % of train data runs seems surprisingly high for Clintox and HIV datasets (~45% and 30%, respectively, if I read the graph correctly) compared to the low proportion of pos. labels in the full data (7% and 3%, respectively) and to the fact that there are 20 repetitions in the runs that should provide pretty good statistics. Is the proportion of pos. labels not supposed to fluctuate only a little with random selection?

4. In page 9, lines 17-21 seem like repetition of page 6 line 22 onward.

5. The graphs in Fig. 1F are shown with the error rate that maximally benefits AL. Do the results hold also for 0% error rate? Consider commenting on this topic e.g. in the supplementary because readers could then compare to the estimated error rates in their own applications.

6. As such, Fig. 1 is difficult to read both in print and on a computer screen due to blurred text and lines, especially in the insets. Please consider restructuring the figure, e.g. moving some of the insets to a new row.


 

Thank you very much for this helpful feedback, which we have addressed in full. Our responses to the individual comments are pasted below and attached as a separate PBP file.

Referee: 1

Wen et al. present an interesting and relevant study of the capacity of subsampling with active learning to improve the performance of machine learning models. The manuscript is well written but I see a few major issues and questions that need to be addressed before the manuscript can be recommended for publication:
Thank you very much for your positive feedback and careful evaluation of our manuscript. We are thrilled about your supportive comments and constructive feedback and have addressed all your suggestions as described below to further improve the manuscript.

The Authors should introduce the individual sampling methods that they are comparing, and they should motivate their choice (i.e., make the case that the selected methods are indeed representative of the state of the art). They should clearly define, use and distinguish the terms “sampling”, “subsampling”, “oversampling” and “undersampling” throughout the manuscript.
We thank the reviewer for raising these important points. We have now included a brief description of all sampling methods used in the manuscript in the methods section. The method section now reads:
“Imblearn is a Python library that provides various imbalanced learning techniques to address the issue of imbalanced datasets. Some of the implemented undersampling methods in imblearn include AllKNN, which applies the k-NN algorithm to every sample to remove majority class samples, and CondensedNearestNeighbour, which uses k-NN to reduce majority class samples while retaining all minority class samples. Another approach, EditedNearestNeighbours, removes majority class samples based on the k-NN algorithm's classification errors. InstanceHardnessThreshold removes samples with high instance hardness scores, i.e., samples that are frequently misclassified by a classifier. NearMiss selects majority class samples based on distance to minority class samples, while NeighbourhoodCleaningRule removes noisy samples by applying k-NN to every sample. OneSidedSelection selects majority class samples based on distance to minority class samples and their nearest neighbors, and RandomUnderSampler randomly removes samples from the majority class. Oversampling methods in imblearn include BorderlineSMOTE, which generates synthetic samples for the minority class near the borderline between minority and majority classes using SMOTE. RandomOverSampler duplicates samples from the minority class, while SMOTE generates synthetic samples for the minority class based on interpolation between existing samples. SMOTEN is an extension of SMOTE that works with categorical data. Finally, SVMSMOTE uses SVM to generate synthetic samples for the minority class.”
Additionally, we have included further clarification that we did not manually choose certain sampling methods but holistically implemented all methods available in imblearn (version 0.8). This was done to prevent any personal bias in the selection process and to enable the evaluation of a wide range of sampling methods. The method section now reads:
“We used all 15 established sampling strategies from the imblearn Python library (version 0.8) with default parameters.”
We have also added definitions for “sampling”, “subsampling”, “oversampling” and “undersampling” in the introduction and have completed another round of proofreading to ensure that we use these terms consistently throughout the manuscript. The introduction now reads:
“One of the state-of-the-art approaches for data curation is sampling of the training data, i.e. algorithmic selection of training data to reduce class imbalances. We mainly distinguish two types of sampling methods, oversampling and subsampling (also called “undersampling”). In oversampling, we duplicate datapoints from the minority class or create new synthetic datapoints to increase the number of minority class samples. In subsampling, we reduce the number of majority class samples to mitigate imbalances and other biases in the training data.”
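
For readers who want to reproduce such a comparison of sampling strategies, the following minimal sketch shows how two of the imblearn samplers described above can be applied to a binary fingerprint matrix; the data, sampler choices, and parameters are illustrative assumptions and are not taken from the manuscript or its code.

# Minimal sketch (illustrative only, not the authors' pipeline): applying two
# imblearn samplers to a hypothetical binary fingerprint matrix.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1024))   # hypothetical binary Morgan-like fingerprints
y = np.array([1] * 20 + [0] * 180)         # imbalanced binary labels (10% positive)

# Undersampling: randomly drop majority class rows until both classes have equal size
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Oversampling: randomly duplicate minority class rows until both classes have equal size
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)

print(X_under.shape, X_over.shape)   # (40, 1024) and (360, 1024)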

The Authors should please check whether the selected sampling methods are at all applicable to binary molecular fingerprints. For example, I don’t think that SMOTE can be used in this scenario.
The reviewer raises a very important point that was not sufficiently clear in the previous version of the manuscript. The expert reviewer is correct that SMOTE was originally designed with continuous variables in mind, since SMOTE creates synthetic datapoints with continuous feature values. The same reasoning applies to BorderlineSMOTE and SVMSMOTE. SMOTEN was therefore developed as an alternative SMOTE strategy designed specifically for categorical and binary features. That being said, since we here implement Random Forest models that threshold the data, there is no practical reason why SMOTE could not be used on binary features since the model effectively “binarizes” any type of input. Interestingly, our data shows that the classic SMOTE implementation outperforms SMOTEN on most of our datasets (see for example Supplementary Table 4), indicating that the Random Forest model can make effective use of data samples with SMOTE even when the original data was binary or categorical. Therefore, we were able to include all the here described sampling methods. To further clarify this point in the manuscript, we have now added the following clarification to the methods:
“We note that SMOTE and its extensions except for SMOTEN do not intrinsically support categorical or binary features as used in this study. However, since we here implement a random forest model that thresholds the data, we are still able to apply these methods here and can compare our performance against the SMOTE approach.”
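
To make the distinction concrete, the sketch below illustrates (under our own illustrative assumptions, not with code or data from the study) that SMOTE interpolation produces fractional feature values on binary fingerprints, whereas SMOTEN keeps the features binary; a random forest can still split such fractional features at a threshold, which is the argument made above.

# Minimal sketch (illustrative assumption, not from the manuscript): SMOTE interpolation
# yields fractional values on binary fingerprints, while SMOTEN keeps them categorical.
import numpy as np
from imblearn.over_sampling import SMOTE, SMOTEN

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 64))   # hypothetical binary fingerprints
y = np.array([1] * 10 + [0] * 90)        # imbalanced binary labels

X_smote, _ = SMOTE(k_neighbors=3, random_state=0).fit_resample(X, y)
X_smoten, _ = SMOTEN(k_neighbors=3, random_state=0).fit_resample(X, y)

print(np.unique(X_smote)[:5])   # typically contains fractional values created by interpolation
print(np.unique(X_smoten))      # remains {0, 1}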

Abstract: Should include concrete results (values).
Thank you very much for this great suggestion. We have now included concrete result values in the abstract by including the following sentence:
“Active subsampling can achieve an increase in performance of up to 139% compared to training on the full dataset.”

Abstract: The statement “We show that active learning-based subsampling can lead to better molecular machine learning performance” is too broad (in fact, it may be misleading) since this was only shown for random forest and only for the case of Morgan fingerprints.
We would like to thank the reviewer for raising this important point and adjusted the abstract to describe our work using random forest models more clearly. The sentence “better molecular machine learning performance” has been replaced with the following sentence:
“We show that active learning-based subsampling leads to better performance of a random forest model trained on Morgan circular fingerprints on all four established binary classification tasks when compared to both training models on the complete training data and 19 state-of-the-art subsampling strategies.”

Abstract and Methods section: The Authors should clarify that this work is about binary classification tasks.

Thank you for pointing out this important detail. We have adjusted the abstract and method section to include a clarification that this work is focused on binary classification tasks. The abstract has been modified as described above, and the method section now includes the sentence:
“All four single-task binary classification datasets from the MoleculeNet AI benchmarking repository were accessed via DeepChem.”

Methods section (this should actually become a part of the Results section): The Authors should motivate their choice of the data sets employed: what makes them particularly useful to investigate this research question? How do the individual data sets differ from each other – what aspects do the individual data sets cover?
We fully agree with the reviewer on this important point and have expanded the description of the datasets in the result section. The result section now reads:
“The datasets are “BBBP”, “BACE”, “ClinTox”, and “HIV”. “BBBP” contains 2039 molecular structures annotated for whether they can cross the blood-brain barrier. “BACE” is a dataset of 1513 molecules annotated for their ability to inhibit human beta-secretase 1. A total of 1478 molecules are annotated in “ClinTox” for whether they caused toxicity in clinical trials. “HIV” is the largest dataset in our benchmark and contains 41127 molecules annotated for their ability to inhibit HIV replication. Importantly, our datasets thereby cover a range of different sizes (from 1478 for ClinTox up to 41127 for HIV), different class imbalances (e.g., 76% positive for BBBP, balanced for BACE, 4% positive for HIV), both in vitro and in vivo readouts, and different levels of modelling complexity based on previously published benchmarking results.”
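
For context, the sketch below shows one way these MoleculeNet datasets can be accessed programmatically; it assumes the standard deepchem.molnet loaders with default options and is not the authors' exact data-access code (which uses a 50:50 scaffold split as described in the methods).

# Minimal sketch (assumption: standard DeepChem MoleculeNet loaders with default
# split fractions; not the authors' exact 50:50 scaffold-split pipeline).
import deepchem as dc

loaders = {
    "BBBP": dc.molnet.load_bbbp,
    "BACE": dc.molnet.load_bace_classification,
    "ClinTox": dc.molnet.load_clintox,
    "HIV": dc.molnet.load_hiv,
}

for name, load in loaders.items():
    # "ECFP" featurization yields binary circular fingerprints for each molecule
    tasks, (train, valid, test), transformers = load(featurizer="ECFP", splitter="scaffold")
    print(name, tasks, train.X.shape, train.y.shape)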

Methods section: The Authors should motivate why they use a different data split protocol for the Breast Cancer data set than the MoleculeNet AI data sets. The implications of their choice should be elaborated.
Thank you very much for this important question. The reason we used a different splitting technique for the Breast Cancer data is that this dataset is not molecular and therefore cannot be split following the same scaffold-based splitting strategy. We have now clarified this in the methods:
“We carried out a 50:50 scaffold split (as implemented in DeepChem) for the molecular datasets and a 50:50 stratified split for the Breast Cancer dataset since this dataset does not contain molecular structures and can therefore not be split based on scaffold groups.”
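
A hedged sketch of the two split protocols is shown below; it assumes the molecular dataset is loaded unsplit through DeepChem and that scikit-learn's Breast Cancer (Wisconsin) dataset stands in for the non-molecular data, so it should be read as an illustration rather than the authors' exact code.

# Minimal sketch of the two 50:50 split protocols described above (assumptions noted).
import deepchem as dc
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# 50:50 scaffold split for a molecular dataset (assumption: load the full, unsplit dataset)
tasks, (dataset,), _ = dc.molnet.load_bace_classification(featurizer="ECFP", splitter=None)
train, test = dc.splits.ScaffoldSplitter().train_test_split(dataset, frac_train=0.5)

# 50:50 stratified split for the non-molecular Breast Cancer data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)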

Methods section, pg 7: It seems that “Balanced” is undefined or at least confusing, since there are many things that can be balanced (e.g., weights, or instances).

We appreciate the reviewer bringing up this important point; we have now clarified this further. Our control algorithm balances the number of instances. This has been clarified in the method section as follows:
“‘Balanced’ uses random supervised subsampling to create a training data subset with an equal number of instances from each of the two binary class labels.”
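
For clarity, a minimal sketch of such a balanced random subsample (with our own hypothetical helper name, not the authors' implementation) could look as follows:

# Minimal sketch (hypothetical helper, not the authors' code) of the "Balanced"
# control: randomly keep an equal number of instances from each binary class.
import numpy as np

def balanced_subsample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))                      # size of the smaller class
    idx = np.concatenate([rng.choice(pos, n, replace=False),
                          rng.choice(neg, n, replace=False)])
    rng.shuffle(idx)                                 # avoid class-ordered training data
    return X[idx], y[idx]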

Results section: Observations made by the Authors should be reported in past tense, not present tense (e.g., page 8).
Thank you for this important clarification. We have written the result section in past tense, but it seems there were a few instances where we incorrectly used present tense. We have completed another round of proofreading and have adjusted tenses throughout.

Results section: “In contrast to previous studies”: Readers will expect their citation at this point, not some sentences later.

Per this reviewer’s recommendation, we have now added relevant references directly after the statement “In contrast to previous studies”.

Results section, page 9: “To determine whether this drop”: “this” is undefined; avoid linking back to the previous paragraph.
Thank you for bringing this to our attention, we have adjusted the sentence to replace “this” as follows:
“To determine whether the improvement in performance for models trained at the “turning point” compared to performance for models trained on the complete dataset is statistically significant, we repeated our active learning runs 20 times with different initial training datapoints.”

Conclusions section: It should be made clear that the observations were only made for random forest and only in the combination with binary Morgan fingerprints.
Thank you for suggesting this important clarification, we have now adjusted the conclusion section to include this important detail by adding the following sentence:
“We implemented an automated data curation pipeline based on active machine learning that can improve performance of a random forest model using binary Morgan fingerprints for a range of different machine learning applications.”

Figure 1: Parts of the figure are clearly too small to be interpretable. Please enlarge the panels, e.g., by using landscape format and by splitting the figure into two parts. Importantly, all axes of all panels and insets should be scaled consistently (within a series) to make them comparable. In the current version, the use of different scales causes severe problems for the interpretation.

Thank you very much for this suggestion; we have now split Figure 1 into two figures and have moved the insets into a new panel to improve readability. After trying to adjust scales across series, we noted that this would make the figures unreadable since many of the graphed data have different scales (e.g., some of the datasets contain mostly positive samples and other datasets contain mostly negative samples – scaling them consistently would make the “positive selection ratio” panel unreadable). Therefore, we thank the reviewer for this suggestion but politely decline to adjust the scales of individual panels in order to preserve the readability of the figures.


Referee: 2

Comments to the Author
From the data perspective, I think the authors did a great job of documenting their protocol and open-sourcing their code, as well as sharing an example notebook.
Thank you very much for your positive feedback, we are excited about this approach and hope that open-sourcing the code and the example notebook will enable full reproducibility and integration of this approach into pipelines developed by other researchers.
The only little thing I would suggest to the authors is adding the access dates of the datasets.
Thank you for this suggestion, we have now included the accession dates for the datasets in the methods section of the manuscript.
Also, if the authors find it desirable, maybe package the code into an installable package for easier usage, instead of just a script file.

Thank you very much for this excellent proposal. We have now made the package installable via the following command.
pip install git+https://github.com/RekerLab/active-subsampling.git

We have also updated the GitHub repository and the example notebook to reflect this change. We hope that this will further improve the reproducibility of our research and will allow other scientists in academia and industry to use our method.


Referee: 3

Comments to the Author
Authors benchmark the active learning subsampling of molecular and breast cancer datasets against other subsampling methods. Subsampling is shown to generally improve the performance of the random forest (RF) models (trained with the subsampled datasets), both compared to models trained with full data and other subsampling methods, albeit there is some variation in the results depending on the dataset in question and the reference subsampling approach.

This contribution is a comprehensive benchmarking work that also provides insights on how AL subsampling principally works. I appreciate especially the analysis as a function of error levels in the data, analyses on the possible origins of the AL benefit, and the results being supported with statistical testing. The manuscript should be of interest to the readers of Digital Discovery after the following major and minor comments have been addressed properly by the authors:
Thank you very much for your careful evaluation of our manuscript and your positive feedback, which we have addressed as described below to further improve the manuscript.

1. Can the authors comment on the computational load of the active learning (AL) subsampling compared to other subsampling methods, or possible downsides of the AL approach?

We are very grateful to this reviewer for raising this critical question and apologize for not having been sufficiently clear about the potential downsides of the AL subsampling approach in the original submission. As noted by this expert reviewer, the computational cost of running the active learning campaign is significantly larger than that of established sampling approaches. However, we expect this additional computational time to be offset by not having to benchmark multiple alternative approaches. This has now been clarified in the conclusion section:
“Admittedly, running a full retrospective active learning campaign is computationally more expensive than other currently implemented sampling approaches, but we expect this additional computation time to be offset by not having to benchmark multiple different sampling approaches.”

2. Are there any specific properties of datasets/applications that are expected to benefit (or not to benefit) from the AL subsampling? This would be useful information for the readers trying to decide if they should adopt the AL approach in their project and to me it seems the variety of datasets would provide some basis also for this kind of insight.
We thank the reviewer for bringing up this important point and agree that insights into when this technique would be most beneficial would be helpful for other researchers. Based on our analysis, we have found that active subsampling appears beneficial in all our benchmarks including datasets of different sizes, imbalances, and describing different types of properties. Our analysis suggests that active subsampling could be particularly beneficial when applied to datasets with wrong annotations, such as data coming from high-throughput screens with false positive and false negative readouts. We also expect that open sourcing our code will enable other researchers to explore this approach to provide additional data on its applicability. This all has been clarified in the conclusion section of the manuscript as follows:
“This effect was consistent across all our datasets, indicating that active learning as a subsampling technique could be useful for molecular datasets of various sizes and class imbalances that describe different types of properties. The benefits of subsampling appear most pronounced when error is introduced into the datasets, indicating that this technique could be particularly useful for data with incorrect annotations, for example artifactual readouts from high-throughput screens. We have made the code of this work available and hope that broad deployment will not only aid other researchers in their data curation workflows but also help to further characterize the most beneficial use cases for this novel sampling technique.”


3. The benchmarking has been performed with RF models as the surrogate models of the AL subsampling approach and RF as the model of choice for the final benchmarking. Can the authors comment on how a change of the 'final use' model would affect the results? Is the AL benefit expected to remain if the 'final use' model is of a different type than what is used as the surrogate model during the subsampling step, or should users choose a similar type of model? This is useful information for the readers because increasingly complex model architectures (i.e., hard to tune and choose hyperparameters for, slow to train) are being developed, as the authors also state, as well as models tailored to a specific application. In these cases, it would be of interest to be able to use simple models for the initial stages of the ML modeling pipeline and to switch to a complex model only at later stages.

We thank the reviewer for suggesting this interesting extension of our work. We are currently exploring this concept and this work is ongoing, but we do not yet have publishable results ready to share. In full responsiveness to the reviewer, we have included this potential future aspect of the work in the conclusion section:
“In the future, it will be important to test whether other machine learning models beyond random forest could be used for sampling and whether a data sample extracted by one machine learning model might be transferable to another machine learning model.”

4. In Fig. 1B 'random selection' runs, the proportion of pos. labels in the smallest % of train data runs seems surprisingly high for Clintox and HIV datasets (~45% and 30%, respectively, if I read the graph correctly) compared to the low proportion of pos. labels in the full data (7% and 3%, respectively) and to the fact that there are 20 repetitions in the runs that should provide pretty good statistics. Is the proportion of pos. labels not supposed to fluctuate only a little with random selection?
We are very grateful to this reviewer for so carefully evaluating our manuscript and spotting this potential point of confusion. The reason we observe a higher proportion of positive labels for “random selection” at the beginning is that all our selection algorithms are initialized in the exact same way, with one positive datapoint d_1 ∈ A^+ and one negative datapoint d_2 ∈ A^-. This shared initialization is necessary to track the model performance at every sampling iteration, since the model cannot be trained otherwise. It leads to the data being artificially more balanced at the beginning of random selection until the class ratio rapidly converges towards the underlying dataset distribution. We have now clarified this in the figure caption to avoid confusion about this aspect of our work. The caption now reads:
“Note that the random sampling will be initialized in the same way as the active learning algorithm, with one positive and one negative example.”
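
To make the shared initialization explicit, the sketch below outlines how both selection strategies could be seeded with one positive and one negative datapoint before iterating; the uncertainty criterion, parameters, and names are our own illustrative assumptions and not the authors' exact implementation.

# Minimal sketch (assumptions noted; not the authors' exact code): both active learning
# and random selection start from one positive and one negative datapoint so that a
# model can be trained and tracked from the first iteration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def run_selection(X, y, n_steps=50, active=True, seed=0):
    rng = np.random.default_rng(seed)
    selected = [int(rng.choice(np.flatnonzero(y == 1))),   # one positive datapoint
                int(rng.choice(np.flatnonzero(y == 0)))]   # one negative datapoint
    pool = sorted(set(range(len(y))) - set(selected))
    for _ in range(n_steps):
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X[selected], y[selected])
        if active:
            # uncertainty sampling (assumption): pick the pool sample closest to p = 0.5
            p = model.predict_proba(X[pool])[:, 1]
            pick = pool[int(np.argmin(np.abs(p - 0.5)))]
        else:
            pick = int(rng.choice(pool))                   # random selection baseline
        selected.append(pick)
        pool.remove(pick)
    return selected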

4. In page 9, lines 17-21 seem like repetition of page 6 line 22 onward.
Thank you for pointing out this redundancy. We have adjusted the language on page 6 and on page 9 to avoid repetition.

5. The graphs in Fig. 1F are shown with the error rate that maximally benefits AL. Do the results hold also for 0% error rate? Consider commenting on this topic e.g. in the supplementary because readers could then compare to the estimated error rates in their own applications.
Thank you so much for bringing up this very important point. We have now conducted the suggested experiment and have compared all methods at a 0% error rate. Interestingly, although AL subsampling is occasionally outperformed by other subsampling methods, it remains the only method that performs best in more than one benchmark dataset and shows the highest median performance. This data is now included as Supplementary Table 4. We have additionally added the following paragraph in the result section of the manuscript:
“For additional context, we also compared the performance of active learning-based subsampling against all the other state-of-the-art subsampling methods without error introduction. Although active learning-based subsampling does not outcompete every other method on all datasets, it was the only method that performed best in more than one dataset (BBBP and BACE) and also showed the highest median performance across all datasets (Table S4). This shows that, although active learning-based subsampling appears particularly attractive for erroneous data, it also presents a competitive approach for other types of data.”

6. As such, Fig. 1 is difficult to read both as a print or on a computer screen due to blurred text and lines especially in the inlets. Please consider restructuring the figure, e.g. moving some of the inlets to a new row.
Thank you very much for this suggestion; we have now split Figure 1 into two figures and have moved the insets into new, separate panels to improve readability.




Round 2

Revised manuscript submitted on 07 May 2023
 

12-Jun-2023

Dear Dr Reker:

Manuscript ID: DD-ART-03-2023-000037.R1
TITLE: Improving Molecular Machine Learning Through Adaptive Subsampling with Active Learning

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link :

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

The authors have made a significant effort to address my comments. The single answer that I am not fully satisfied with concerns my point regarding the use of SMOTE on binary molecular fingerprints.

The authors state that "we here implement Random Forest models that threshold the data, there is no practical reason why SMOTE could not be used on binary features since the model effectively “binarizes” any type of input. Interestingly, our data shows that the classic SMOTE implementation outperforms SMOTEN on most of our datasets (see for example Supplementary Table 4), indicating that the Random Forest model can make effective use of data samples with SMOTE even when the original data was binary or categorical. Therefore, we were able to include all the here described sampling methods."

However, I do continue to see problems with using SMOTE because:
1. SMOTE only generates new samples for the class which is underrepresented (e.g., active). If the ML model gets a fingerprint with fractional numbers, it can immediately predict "active". Sure, this helps on validation data, but in the end the ML model is possibly good for the wrong reasons. In fact, this could be the problem underlying the Authors' observation that models using SMOTE outperform those using SMOTEN.
2. The interpretation of such models is tricky. One could e.g. investigate the nearest neighbors in the training set. If the neighbor does not have a real fingerprint (of 0 and 1), what can we then learn from this?
3. SMOTE changes the distribution of the fingerprint bits - there may also be problems related to that.

The Authors should please consider these issues carefully and use only sampling methods that are adequate for use with binary fingerprint data.

Apart from that, all seems fine and sound now.

Reviewer 3

The authors have addressed the comments that I had on the manuscript and I am convinced of their responses.

However, after seeing the responses I have a minor comment that needs to be addressed to avoid confusion among readers: In Figure 1C, the authors state that the blue and red curves are both initialized with one positive and one negative sample. This means that, in all the graphs, both blue and red curves should start from the value of 50% positive labels (at near 0% of training data). This is not the case now (some curves do not start even near 50%). It could be caused either because there is a mistake in the algorithm, or because the plot does not have enough resolution to show the initiation step. It is difficult for the readers to judge which one is the case. To convince the future readers and to ensure a positive reception to their good work, the authors should double-check that the initial sampling and the algorithm truly work as they state, and then clarify the issue for example by plotting in higher resolution or by noting e.g. in the caption that the initiation step is cut out from the plots.

I will recommend accepting the manuscript for publication in Digital Discovery after this concern has been appropriately addressed.

There is a typo in the subtitle on p. 12 (“Staring Point”).


 

Dear Dr. Linda Hung and the Digital Discovery editorial team,

Thank you very much for your feedback. We have addressed the remaining questions as described in the attached PBP. We have updated the manuscript accordingly. Please let us know if any further information or clarification is needed. Thank you for your consideration and we look forward to hearing from you,

Daniel Reker


Referee: 1

The authors have made a significant effort to address my comments.
Thank you for your helpful feedback and recognizing our effort to address your comments.

The single answer that I am not fully satisfied with concerns my point regarding the use of SMOTE on binary molecular fingerprints. The authors state that "we here implement Random Forest models that threshold the data, there is no practical reason why SMOTE could not be used on binary features since the model effectively “binarizes” any type of input. Interestingly, our data shows that the classic SMOTE implementation outperforms SMOTEN on most of our datasets (see for example Supplementary Table 4), indicating that the Random Forest model can make effective use of data samples with SMOTE even when the original data was binary or categorical. Therefore, we were able to include all the here described sampling methods."

However, I do continue to see problems with using SMOTE because:
1. SMOTE only generates new samples for the class which is underrepresented (e.g., active). If the ML model gets a fingerprint with fractional numbers, it can immediately predict "active". Sure, this helps on validation data, but in the end the ML model is possibly good for the wrong reasons. In fact, this could be the problem underlying the Authors' observation that models using SMOTE outperform those using SMOTEN.
2. The interpretation of such models is tricky. One could e.g. investigate the nearest neighbors in the training set. If the neighbor does not have a real fingerprint (of 0 and 1), what can we then learn from this?
3. SMOTE changes the distribution of the fingerprint bits - there may also be problems related to that.
The Authors should please consider these issues carefully and use only sampling methods that are adequate for use with binary fingerprint data.
We are grateful to this reviewer for sharing their detailed and careful consideration of the applicability of SMOTE to our data. In full responsiveness to this referee, we have now removed all usage of SMOTE and extensions thereof (SVMSMOTE and BorderlineSMOTE) except for SMOTEN (which allows categorical features and is therefore compatible with our binary fingerprints) from our manuscript and figures.

Apart from that, all seems fine and sound now.
Thank you again for your helpful comments and support.

Referee: 2

NA


Referee: 3

The authors have addressed the comments that I had on the manuscript and I am convinced of their responses.
Thank you very much for your helpful feedback and confirming that we have responded to your comments adequately.

However, after seeing the responses I have a minor comment that needs to be addressed to avoid confusion among readers: In Figure 1C, the authors state that the blue and red curves are both initialized with one positive and one negative sample. This means that, in all the graphs, both blue and red curves should start from the value of 50% positive labels (at near 0% of training data). This is not the case now (some curves do not start even near 50%). It could be caused either because there is a mistake in the algorithm, or because the plot does not have enough resolution to show the initiation step. It is difficult for the readers to judge which one is the case. To convince the future readers and to ensure a positive reception to their good work, the authors should double-check that the initial sampling and the algorithm truly work as they state, and then clarify the issue for example by plotting in higher resolution or by noting e.g. in the caption that the initiation step is cut out from the plots.
We thank the reviewer for pointing out this potential confusion and we apologize that this was not sufficiently clear in the manuscript. As suspected by the referee, we had omitted the initial step and thereby skipped the balanced initial training dataset in the graph. When the positive selection ratio is plotted only after the first new datapoint has been added, this datapoint is often from a specific class for very imbalanced datasets, which leads to some curves starting close to 30% rather than 50%.
To avoid any future confusion, we have updated the figure to include the first step – thereby starting all curves at 50%. Furthermore, we have updated the caption to explain this aspect of the graph.

I will recommend accepting the manuscript for publication in Digital Discovery after this concern has been appropriately addressed.
Thank you very much for your positive evaluation and recommendation.

There is a typo in the subtitle on p. 12 (“Staring Point”).
We thank this reviewer for pointing out this typo. We have corrected this typo and also went over the manuscript one more time to rule out any other typos.




Round 3

Revised manuscript submitted on 19 Jun 2023
 

22-Jun-2023

Dear Dr Reker:

Manuscript ID: DD-ART-03-2023-000037.R2
TITLE: Improving Molecular Machine Learning Through Adaptive Subsampling with Active Learning

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.