From the journal Digital Discovery
Peer review history

Synthetic data enable experiments in atomistic machine learning

Round 1

Manuscript submitted on 09 Dec 2022
 

21-Feb-2023

Dear Dr Deringer:

Manuscript ID: DD-ART-12-2022-000137
TITLE: Synthetic data enable experiments in atomistic machine learning

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript, please include a point-by-point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can log in to your account (https://mc.manuscriptcentral.com/dd), where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

Overall, this publication is a valuable resource for researchers in the field, and the publicly available dataset and clear instructions on how to use it make it an especially valuable contribution.

The study at hand creates a new database from scratch using ASE and C-GAP-17 as implemented in LAMMPS. The authors have made the dataset publicly available on GitHub, which is a valuable resource for researchers in the field. The authors provide a detailed explanation of the data and how it can be accessed and analyzed. Additionally, the authors have provided a small example in their GitHub repository demonstrating how to read the features (energies and MD trajectory) of each data point (extended XYZ) using ASE, which is a helpful resource for those looking to use the dataset in their own research. The authors show that the atomic energies of their carbon systems can be machine-learned. For this purpose, they compare three approaches, namely GPR, NN, and DKL, concluding that NN approaches are the most suitable in terms of efficiency and accuracy for the atomic energy prediction task.
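For readers unfamiliar with ASE, reading such a file takes only a few lines; the following minimal sketch illustrates the idea (the file name and the per-atom array key are placeholders, not taken from the authors' repository):

    from ase.io import read

    # Read every frame of an extended-XYZ trajectory into a list of Atoms objects.
    frames = read("trajectory.extxyz", index=":")  # hypothetical file name

    for atoms in frames:
        # Per-cell energy, if stored with the frame.
        energy = atoms.get_potential_energy()
        # Hypothetical key under which per-atom (synthetic) energies might be stored.
        local_energies = atoms.arrays.get("local_energies")
        print(len(atoms), energy)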

Thus, the study at hand presents a valuable contribution to the field of atomic-scale ML models. However, there are a few areas where the publication could be improved.

One concern is the method used for data splitting. While the code in the authors' GitHub repository indicates that Python's random.shuffle method is used, this is not clearly stated in the paper. It is also not clear how a test-training-validation split of the data is performed in the NN pipeline. To improve the transparency (and reproducibility) of the study, it would be helpful for the authors to clearly state the data splitting method in the paper and to include more detailed documentation in the code on GitHub (e.g. by adding docstrings).

Another issue is the installation instructions provided in the GitHub repository. The package "sklearn" should be replaced with "scikit-learn" in the requirements.txt file, and it would also be helpful to specify version numbers for all packages in that file. These changes would help ensure that other researchers are able to easily set up and run the code.

Overall, while this publication presents a valuable contribution to the field, there are a few areas where it could be improved. I suggest the article for publication with the comments above taken into consideration.

Reviewer 2

A brief summary of this work is that the authors use synthetic data from an ML potential to pre-train a new potential that can then be fine-tuned for a specific task. They find this is better than training from scratch, and it mitigates the need for large, expensive training data generation in some cases. The work is well written and supported by evidence, and I recommend it for publication in its present form.

I agree with the authors' assessment of the novelty of the work. On the one hand, what they do is similar to ideas like transfer learning, or ideas from large language models where embeddings are used as input features to new models that are fine-tuned. On the other hand, it is probably true that the specific implementation here has not been reported before, and the article is well written to articulate the similarities and differences between the present work and related work. The community would benefit from this discussion.

The main suggestion I have for the authors is that they register the git repos with an archive service like Zenodo or Figshare so that they have DOIs and are archived at the versions used for the publication.

Reviewer 3

This work proposes the use of synthetic data, specifically atomic energies obtained from existing potentials, to aid analysis in atomistic machine learning. Through four main numerical tests, on (a) the accuracy of various potentials, (b) hyperparameter selection in GPR, (c) transfer learning in NN potentials, and (d) UMAP visualization of atomic environments, the authors demonstrate the effectiveness of the synthetic data. For example, DKL does not perform better than a simple NN model, and a pre-training and fine-tuning approach can improve NN performance. This work would be of interest to researchers working in AI for materials/chemistry, particularly potential developers. I would recommend its publication in Digital Discovery.

Strengths:

1. Multiple tasks, demonstrating the use case of synthetic data in various settings.
2. Very clear presentation and nicely drawn figures.

Weakness/places for improvement:

1. In the “GPR insights” section, it reads “In retrospect, both plots in Fig. 4 seem to confirm settings that have been intuitively used in ML potential fitting using the GAP framework.” Since, in the current setting, both the DFT dataset and the synthetic data are available, will the conclusion on hyperparameter selection change if it is tested using the cell energies from DFT (as in most potential-fitting cases)? Readers would be interested in how synthetic data can help in this case, and a comparison between using the DFT cell energies and the synthetic atomic energies would be helpful.

2. In Fig. 5b, I believe it would be more informative to test against some DFT total energy, instead of against atomic energies from the GAP model. After all, the directly trained NN and the fine-tuned NN are trained against DFT energies in the last step. We prefer DFT as the standard instead of the GAP (which was itself fitted to DFT), right?

3. Fig. 5d suggests that, if more QM data is used, direct training will outperform pre-training. The number of QM data points required to achieve this is actually not large; probably on the order of 5k? While it is not surprising that direct training performs well with more data, do the authors have any thoughts on the practical adoption of the pre-training approach, given that 5k data points are not difficult to obtain?

4. The authors mentioned their previous teacher-student model, and also “Transfer learning for atomistic models has been demonstrated by Smith et al., although we are not aware of prior work going from ML-potential to DFT-level accuracy, or performing this transfer learning via synthetic atomic energies (rather than quantum-mechanical per-structure energies or forces).”
There are highly related publications the authors might want to take a look at: the work by Shui et al. (https://doi.org/10.48550/arXiv.2210.08047) goes from physics-based potentials to DFT-level accuracy using transfer learning, and the work by Yoo et al. (https://doi.org/10.1103/PhysRevMaterials.3.093802) considers training against atomic energies.


 

Dear Dr Hung,

Thank you very much indeed for your email - we have revised our manuscript accordingly. Please find enclosed a point-by-point response to the reviewers' comments (submitted as PDF, and also appended below).

Yours sincerely,

Volker Deringer

===

Response to Reviewers’ Comments – Manuscript ID DD-ART-12-2022-000137

We thank all three reviewers for their careful evaluation of our manuscript and for their constructive suggestions. [In the plain-text version of this response, the reviewers' comments are highlighted by ">" signs, and our response is interspersed.]


Referee: 1

> Overall, this publication is a valuable resource for researchers in the field, and the publicly available dataset and clear instructions on how to use it make it an especially valuable contribution.

> The study at hand creates a new database from scratch using ASE and C-GAP-17 as implemented in LAMMPS. The authors have made the dataset publicly available on GitHub, which is a valuable resource for researchers in the field. The authors provide a detailed explanation of the data and how it can be accessed and analyzed. Additionally, the authors have provided a small example in their GitHub repository demonstrating how to read the features (energies and MD trajectory) of each data point (extended XYZ) using ASE, which is a helpful resource for those looking to use the dataset in their own research. The authors show that the atomic energies of their carbon systems can be machine-learned. For this purpose, they compare three approaches, namely GPR, NN, and DKL, concluding that NN approaches are the most suitable in terms of efficiency and accuracy for the atomic energy prediction task.

> Thus, the study at hand presents a valuable contribution to the field of atomic-scale ML models. However, there are a few areas where the publication could be improved.

Response: We thank the reviewer for their positive evaluation, and we address their suggestions below.

> One concern is the method used for data splitting. While the code in the authors' GitHub repository indicates that Python's random.shuffle method is used, this is not clearly stated in the paper. It is also not clear how a test-training-validation split of the data is performed in the NN pipeline. To improve the transparency (and reproducibility) of the study, it would be helpful for the authors to clearly state the data splitting method in the paper and to include more detailed documentation in the code on GitHub (e.g. by adding docstrings).

Response: We agree – being explicit about the exact validation procedure used is very important.

Action: We added clarification regarding data splitting on p. 4, and added further details in the Appendix on p. 9. Please note that during revision, the reported values for directly- and pre-trained models were corrected and updated to take into account all 10 cross-validation folds. These changes to the final values are small and do not affect the interpretation of any results.
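For illustration, a reproducible shuffle-based k-fold split of this kind can be written in a few lines of Python (a minimal sketch, not the exact code from our repository; the seed and the data are placeholders):

    import random

    data = list(range(1000))  # placeholder for the list of data points

    random.seed(42)           # fixed seed for reproducibility
    random.shuffle(data)      # shuffle once, before partitioning

    n_folds = 10
    folds = [data[i::n_folds] for i in range(n_folds)]

    for k in range(n_folds):
        test = folds[k]
        train = [x for j in range(n_folds) if j != k for x in folds[j]]
        # ... fit on `train`, evaluate on `test` ...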

We added docstrings to the code, as requested, and also added comments in the associated notebooks. Merge requests highlighting these changes can be found here:

- https://github.com/jla-gardner/synthetic-data-experiments/pull/1
- https://github.com/jla-gardner/synthetic-fine-tuning-experiments/pull/1

> Another issue is the installation instructions provided in the GitHub repository. The package "sklearn" should be replaced with "scikit-learn" in the requirements.txt file, and it would also be helpful to specify version numbers for all packages in that file. These changes would help ensure that other researchers are able to easily set up and run the code.

Action: We updated the “requirements.txt” file on GitHub, changing “sklearn” to “scikit-learn”, and adding version numbers for all packages (see the above merge requests).
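For illustration, a pinned requirements file takes the following form (the version numbers shown here are examples only; the authoritative pins are in the repository):

    # requirements.txt (illustrative pins)
    scikit-learn==1.2.1
    numpy==1.24.2
    ase==3.22.1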

> Overall, while this publication presents a valuable contribution to the field, there are a few areas where it could be improved. I suggest the article for publication with the comments above taken into consideration.

Response: We thank the reviewer for their recommendation to publish; their comments have been addressed as detailed above.


Referee: 2

> A brief summary of this work is that the authors use synthetic data from an ML potential to pre-train a new potential that can then be fine-tuned for a specific task. They find this is better than training from scratch, and it mitigates the need for large, expensive training data generation in some cases. The work is well written and supported by evidence, and I recommend it for publication in its present form.

> I agree with the authors' assessment of the novelty of the work. On the one hand, what they do is similar to ideas like transfer learning, or ideas from large language models where embeddings are used as input features to new models that are fine-tuned. On the other hand, it is probably true that the specific implementation here has not been reported before, and the article is well written to articulate the similarities and differences between the present work and related work. The community would benefit from this discussion.

Response: We thank the reviewer for their strongly positive review of our work.

> The main suggestion I have for the authors is that they register the git repos with an archive service like Zenodo or Figshare so that they have DOIs and are archived at the versions used for the publication.

Response: This is a good suggestion. We agree that, alongside public git repositories, it is important to timestamp and archive the specific version of the code at the time of publication.

Action: We uploaded all code and related data to Zenodo, where they have been assigned the following permanent DOIs:

- 10.5281/zenodo.7688015 (https://zenodo.org/record/7688015)
- 10.5281/zenodo.7688032 (https://zenodo.org/record/7688032)

Additionally, we deposited a copy of the full dataset in Zenodo:

- 10.5281/zenodo.7704087 (https://zenodo.org/record/7704087)
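For readers who wish to retrieve the archived files programmatically, the public Zenodo REST API can be used. The sketch below assumes the /api/records endpoint and the requests library; the JSON field names reflect the Zenodo API at the time of writing and may differ in later versions:

    import requests

    # Query the public Zenodo API for the dataset record.
    record = requests.get("https://zenodo.org/api/records/7704087").json()

    # List the archived files and their download links.
    for f in record["files"]:
        print(f["key"], f["links"]["self"])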


Referee: 3

> This work proposes the use of synthetic data, specifically atomic energies obtained from existing potentials, to aid analysis in atomistic machine learning. Through four main numerical tests, on (a) the accuracy of various potentials, (b) hyperparameter selection in GPR, (c) transfer learning in NN potentials, and (d) UMAP visualization of atomic environments, the authors demonstrate the effectiveness of the synthetic data. For example, DKL does not perform better than a simple NN model, and a pre-training and fine-tuning approach can improve NN performance. This work would be of interest to researchers working in AI for materials/chemistry, particularly potential developers. I would recommend its publication in Digital Discovery.

Response: We thank the reviewer for their recommendation to publish, and for their helpful suggestions for improvement. We respond to the individual points below.

> Strengths:

> 1. Multiple tasks, demonstrating the use case of synthetic data in various settings.
> 2. Very clear presentation and nicely drawn figures.

> Weakness/places for improvement:

> 1. In the “GPR insights” section, it reads “In retrospect, both plots in Fig. 4 seem to confirm settings that have been intuitively used in ML potential fitting using the GAP framework.” Since, in the current setting, both the DFT dataset and the synthetic data are available, will the conclusion on hyperparameter selection change if it is tested using the cell energies from DFT (as in most potential-fitting cases)? Readers would be interested in how synthetic data can help in this case, and a comparison between using the DFT cell energies and the synthetic atomic energies would be helpful.

Response: We agree with the reviewer that a more specific comparison between DFT-based and synthetic datasets would be useful. We considered showing timing benchmarks using the GAP framework, but ultimately decided not to do so: the comparison would not be entirely fair, as the “learning tasks” are different (GAP includes derivatives in the fitting, for example).

To nonetheless more clearly substantiate our claim that synthetic atomic energies are useful for experiments with GPR models, we now give a specific timing example for our own models.

Action: We now state that

“We argue that the speed at which these atomistic GPR energy models can be fit makes this proposition attractive (for reference, training a model on 10^6 atomic environments using 5000 sparse points took 61 minutes on a mid-range dual-CPU node)” (p. 5).
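To give a sense of what such a fit involves in practice, the following sketch approximates a sparse (inducing-point) GPR energy model using a Nystroem kernel approximation in scikit-learn. This is an illustrative stand-in rather than the model used in our paper, and the descriptors and labels are random placeholders, scaled down from the 10^6 environments quoted above:

    import numpy as np
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 64))  # placeholder descriptors (e.g. SOAP-like vectors)
    y = rng.normal(size=10_000)        # placeholder synthetic atomic energies

    # The Nystroem landmarks play a role analogous to the "sparse points" above.
    model = make_pipeline(
        Nystroem(kernel="rbf", n_components=500, random_state=0),
        Ridge(alpha=1e-6),
    )
    model.fit(X, y)
    print(model.predict(X[:5]))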

> 2. In Fig. 5b, I believe it would be more informative to test against some DFT total energy, instead of against atomic energies from the GAP model. After all, the directly trained NN and the fine-tuned NN are trained against DFT energies in the last step. We prefer DFT as the standard instead of the GAP (which was itself fitted to DFT), right?

Response: This is a good suggestion, and accordingly we added a new table with the requested information. We think that this additional display item complements the scatter plots in Fig. 5b well, and we decided to keep those plots in the revised manuscript. We note that we use the table to summarise both the atomic-energy and the per-cell DFT energy errors (which address different aspects of our approach, and which we think are both useful to provide to the reader).

Action: We added a table on p. 6 that reports errors both for the atomic-energy and the per-cell DFT energy predictions. We updated the corresponding text in the “Pre-training” subsection.

> 3. Fig. 5d suggests that, if more QM data is used, direct training will outperform pre-training. The number of QM data points required to achieve this is actually not large; probably on the order of 5k? While it is not surprising that direct training performs well with more data, do the authors have any thoughts on the practical adoption of the pre-training approach, given that 5k data points are not difficult to obtain?

Response: We thank the reviewer for bringing this point to our attention. We agree entirely that 5k data points are not difficult to obtain at the DFT level. However, we envision that (in future work) pre-training and subsequent fine-tuning might become particularly useful for higher-level data that are much more demanding to obtain, e.g., from the periodic random-phase approximation (RPA) or quantum Monte Carlo.

Action: We now discuss this point in the revised manuscript:

“The trends present in Fig. 5d suggest that on the order of 5000 QM data labels would close the gap between fine-tuned and directly trained models. While this amount of data would not be difficult to obtain with standard DFT approaches, there are more accurate methods available (such as periodic random-phase approximation or quantum Monte Carlo) where this amount of data would be much more expensive to generate. We therefore propose that these would be particularly interesting use cases for synthetic pre-training, especially because this technique provides the largest performance increase for low N.” (p. 7)
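For illustration, the two-stage recipe discussed here can be sketched in PyTorch as follows. This is a minimal sketch under strong assumptions (a generic MLP acting on fixed-size descriptors, random placeholder data, illustrative hyperparameters) and is not our actual architecture or training code:

    import torch
    import torch.nn as nn

    # Generic MLP mapping a 64-dimensional descriptor to an atomic energy.
    model = nn.Sequential(nn.Linear(64, 128), nn.SiLU(), nn.Linear(128, 1))

    def train(model, X, y, epochs, lr):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(model(X).squeeze(-1), y)
            loss.backward()
            opt.step()

    # Placeholder data: abundant synthetic labels, scarce "QM" labels.
    X_synth, y_synth = torch.randn(10_000, 64), torch.randn(10_000)
    X_qm, y_qm = torch.randn(500, 64), torch.randn(500)

    # Stage 1: pre-train on the synthetic atomic energies.
    train(model, X_synth, y_synth, epochs=200, lr=1e-3)
    # Stage 2: fine-tune all weights on the small QM-labelled set, at a lower rate.
    train(model, X_qm, y_qm, epochs=100, lr=1e-4)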

> 4. The authors mentioned their previous teacher-student model, and also “Transfer learning for atomistic models has been demonstrated by Smith et al., although we are not aware of prior work going from ML-potential to DFT-level accuracy, or performing this transfer learning via synthetic atomic energies (rather than quantum-mechanical per-structure energies or forces).”
There are highly related publications the authors might want to take a look at: the work by Shui et al. (https://doi.org/10.48550/arXiv.2210.08047) goes from physics-based potentials to DFT-level accuracy using transfer learning, and the work by Yoo et al. (https://doi.org/10.1103/PhysRevMaterials.3.093802) considers training against atomic energies.

Response: We thank the reviewer for these suggestions. We agree that they are helpful for providing context, and we now cite and briefly discuss them in the revised manuscript.

Action: We now cite the work by Shui et al. (new Ref. 47), that by Yoo et al. (new Ref. 66), and an additional reference cited in the Yoo et al. study (new Ref. 69). We removed our statement about “prior work” (p. 7) to make the discussion more neutral and more focused on the actual references.

Additionally, we briefly reference the work by Shui et al. in the context of embeddings, as it is somewhat related to our Fig. 6d:

“Where applicable, our findings are in line with a very recent study by Shui et al., who demonstrated that pre-training can lead to more meaningful embeddings for SchNet models compared to random initialisation” (p. 8).

For context, we finally added two references from the wider field of ML research, on the subject of pre-training NN models in domains different from chemistry (new Refs. 73 and 74).


In conclusion, we thank the reviewers again for their helpful comments. We believe that their feedback and suggestions have substantially strengthened the manuscript, and we hope that it will be interesting and useful to a wide readership as an article in Digital Discovery.




Round 2

Revised manuscript submitted on 07 Mar 2023
 

20-Mar-2023

Dear Dr Deringer:

Manuscript ID: DD-ART-12-2022-000137.R1
TITLE: Synthetic data enable experiments in atomistic machine learning

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


 
Reviewer 1

Overall, I found the updated code very convincing and appreciate the changes that have been made to it. I would especially like to point out that the documentation is helpful in understanding the code, and that the versioning in 'requirements.txt' makes it easy for third parties to install and use the code.

Reviewer 3

I’d like to thank the authors for considering my comments! The revised manuscript has addressed my concerns and I recommend its publication.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.