From the journal Digital Discovery Peer review history

Impact of noise on inverse design: the case of NMR spectra matching

Round 1

Manuscript submitted on 16 Jul 2023
 

05-Sep-2023

Dear Dr von Lilienfeld:

Manuscript ID: DD-ART-07-2023-000132
TITLE: Impact of noise on inverse design: The case of NMR spectra matching

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

This paper analyzes the existing NMR spectra database in terms of the impact of noise. By changing experimental settings, it examines how accurately a noise-contaminated query can identify the correct isomer in the database. This type of analysis has not been done before and can potentially serve as a way of evaluating new NMR spectra prediction algorithms. Nevertheless, I have the following comments.

1) Please explain more about Surge and the NMR spectra estimation for the Surge-generated molecules. The authors mix two datasets with different DFT levels (QM9 and Surge). Please discuss why this is not a problem.

2) About section 3E. Figure 5 can possibly be used for evaluating new spectra estimation algorithms. Can you elaborate on this possibility?

3) The authors use a KRR method proposed by their group. Are there any reasons to choose it over other machine learning-based spectra estimation methods? Please show other possible spectra estimation methods and explain why KRR is preferable.

Reviewer 2

It is nice that the authors provide an all-in-one package to reproduce the manuscript. However, the required Python packages (and their versions) are currently not included in the README file. That is, I have to guess which packages to install. I managed to do so for most of them but can't find the module named 'g2s', which seems not well-known in the community.

Reviewer 3

The paper is well-written overall. There could be more explanations. What is the relationship between Gaussian noise and MAE? In Fig. 2 and at the start of III A, the two are used together; I guess it means Gaussian noise which causes a mean deviation of X MAE (it isn't really an error, is it?). That should be clearer.
Also, in III A, it says "Fig.2 a-b) depicts a sigmoidal shaped trend of Top-1 elucidation accuracies at increasing candidate pool sizes NQM9 as a function of MAE." I don't think this is strictly correct: the "trend at (is 'at' a correct English preposition here, btw?) pool sizes" isn't sigmoidal; the sigmoidal shape refers to the elucidation performance vs. MAE curves, and the multiple curves together aren't sigmoidal. Perhaps make that two sentences, explaining one curve first, then the combination in a second sentence?
My general remark about the paper is that it is nice, but not really that insightful. Essentially, you say that if there is a two-step structure elucidation, where the first step produces a candidate list and the second ranks the candidates by NMR similarity, then you need a combination of a first step which produces a good enough (i.e. small enough) candidate list and a second step which uses a good enough spectrum prediction to do the ranking. To a degree, the two can compensate for each other. Now that is not very surprising; everybody was working with that assumption anyway in the field. And the paper does not really quantify it, since it says that it all depends on the structure (and two of the three examples in Fig. 2 are even relatively similar).


 

We would like to thank all reviewers and the editor for the constructive criticism and for contributing to the overall improvement of the quality of the manuscript. For a point-by-point response to the reviewers’ comments and corresponding changes we refer to the uploaded response letter and to the revised manuscript.

This text has been copied from the PDF response to reviewers and does not include any figures, images or special characters:

We would like to thank all reviewers and the editor for the constructive criticism and for contributing to the overall improvement of the quality of the manuscript. In the following we provide a point-by-point response to the reviewers' comments and refer to the corresponding changes in the revised manuscript.

I. REVIEWER 1

Reviewer Comment:
This paper analyzes the existing NMR spectra database in terms of the impact of noise. By changing experimental settings, it examines how accurately a noise-contaminated query can identify the correct isomer in the database. This type of analysis has not been done before and can potentially serve as a way of evaluating new NMR spectra prediction algorithms. Nevertheless, I have the following comments.
Authors’ response:
We thank the reviewer for investing time into providing us with valuable feedback for our work. We appreciate the recognition of the novelty and potential application of our analysis. In the following we will address all comments and criticism.

Reviewer Comment:
1) Please explain more about Surge and the NMR spectra estimation for the Surge-generated molecules. The authors mix two datasets with different DFT levels (QM9 and Surge). Please discuss why this is not a problem.
Authors’ response:
We thank the reviewer for raising this point. Surge has solely been used to systematically generate graphs of the C7O2H10 chemical space. Next, 3D structures have been generated using the ETKDG method in RDKit. Lowest lying conformer structures were sampled using the CREST algorithm, using the semi-empirical GFN2-xTB/GFN-FF composite method in a meta-dynamics based sampling scheme. Finally, the lowest energy conformers have been relaxed using GFN2-xTB and used as input to machine learning models to predict the corresponding 13C and 1H chemical shifts.
While the 13C and 1H machine learning models have been trained on B3LYP geometries, our previous research on the C7O2H10 chemical space has shown that GFN2-xTB relaxed geometries are within 0.06 Å RMSD of the corresponding B3LYP structures (see Tab. 1 in the SI of Ref. [1]). Hence, within this proof of concept application, we relied on the GFN2-xTB relaxed geometries as reasonable approximations to DFT structures of the C7O2H10 chemical space.
To further explain the data generation as well as to make the reader aware and clarify this point in the manuscript, we edited Sec.II D as follows:
To extend the QM9-NMR C7O2H10 constitutional isomers space, we used the systematic graph enumeration software Surge [2] to generate 54'641 SMILES. 3D geometries of all SMILES have been generated using the ETKDG [3] method in RDKit. Lowest lying conformer structures were sampled using the CREST [4] algorithm, using the GFN2-xTB/GFN-FF composite method in a meta-dynamics based sampling scheme, with a final relaxation at the GFN2-xTB level. Adding all successfully generated structures to QM9, a total pool size of 56.95k C7O2H10 isomers was obtained.
as well as Sec. III D:
Note that within this proof of concept application we rely on GFN2-xTB relaxed geometries as queries, which on average are within 0.06 Å RMSD of C7O2H10 B3LYP level of theory structures.

Reviewer Comment:
2) About section 3E. Figure 5 can possibly be used for evaluating new spectra estimation algorithms. Can you elaborate on this possibility?
Authors’ response:
A fundamental concept in statistical learning is the relationship between the out-of-sample prediction (test) error and the number of training points. Learning curves such as the one depicted in Fig. 5 (orange) reflect the power-law decay of the test error with training set size, which appears linear on a double logarithmic scale, and aid in assessing machine learning data needs and efficiency. However, learning curves of chemical shift prediction models do not provide full insight into how well spectra of similar compounds can be distinguished. Hence, as shown in Fig. 5, considering both learning and elucidation performance curves (shown in red and blue), the ML prediction error can be evaluated as a function of both training data and pool size, respectively. As such, new spectra estimation algorithms that improve with more training data can be evaluated in terms of data efficiency and elucidation strength; extrapolating the curves also offers an estimate of how much data would be necessary to elucidate a certain percentage of queries at a given pool size.
To clarify the potential appeal of extending learning curves this way, we added the following sentence in Sec.III E:
Similar to learning curves, which show the systematic decay of out-of-sample machine learning prediction errors as a function of training data, elucidation performance curves show, for a specific elucidation threshold, e.g. 90%, the machine learning prediction error as a function of pool size. Note that while learning curves of chemical shift predictions only show the predictive accuracy, e.g. in terms of MAE, the addition of elucidation performance curves allows a multifaceted evaluation of new spectra estimation algorithms, considering data efficiency as well as pool size.
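The power-law character of such learning curves can be sketched numerically. The following minimal illustration uses purely invented training set sizes and errors (not data from this work) to show how the decay exponent is recovered from a log-log fit:

```python
import numpy as np

# Synthetic learning curve: test error assumed to decay as E(N) = a * N^(-b),
# which is a straight line on a log-log plot. a = 5.0 and b = 0.4 are invented.
train_sizes = np.array([100, 200, 400, 800, 1600, 3200])
test_errors = 5.0 * train_sizes ** -0.4

# A linear fit in log-log space recovers the decay exponent b from the slope.
slope, _ = np.polyfit(np.log(train_sizes), np.log(test_errors), 1)
b = -slope
print(f"decay exponent b = {b:.2f}")
```

On real learning-curve data, the fitted exponent indicates data efficiency: the larger b, the fewer additional training points are needed to reach a target MAE.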

Reviewer Comment:
3) The authors use a KRR method proposed by their group. Are there any reasons to choose it over other machine learning-based spectra estimation methods? Please show other possible spectra estimation methods and explain why KRR is preferable.
Authors’ response:
Kernel ridge regression (KRR) machine learning has been chosen due to its simplicity and due to its common use and efficient performance in learning NMR properties from quantum chemical simulations [5-11]. Note that Gerrard et al. [5] reported that there is no significant advantage in using neural networks in terms of accuracy for the datasets considered, e.g. QM9. Hence, due to the simplicity and yet effectiveness of KRR, this model and representation choice seems preferable, and also follows the example set by the authors of the existing benchmark on the QM9-NMR dataset [9]. In terms of practical arguments, this choice also seems preferable since all models and learning curves could be conveniently generated within hours, without the need for extensive deep learning workflows or GPU resources. However, we would like to note that while the focus of this work was not on model comparisons but rather on a proof of concept of our analysis, other regressors such as neural networks (or random forests) could certainly have been used as well.
To better reflect on this discussion and also highlight other possible ML approaches, we added the following sentence to Sec.II C:
We relied on kernel ridge regression (KRR) for machine learning 13C and 1H chemical shifts as presented in Ref. [9] and as commonly used in learning NMR properties from quantum chemical calculations [5-8, 10, 11]. We use a Laplacian kernel and the local atomic Faber-Christensen-Huang-Lilienfeld (FCHL19 [12]) representation with a radial cutoff [9] of 4 Å. The kernel width and regularization coefficient have been determined through 10-fold cross-validation on a subset of 10'000 chemical shifts of the training set. Note that while we relied on KRR within this work, other NMR shift estimation methods could have been used, such as hierarchically ordered spherical environment (HOSE) codes [13, 14] or neural network based approaches [15-18].
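As a minimal sketch of such a KRR setup, the following toy example uses random feature vectors in place of the actual FCHL19 atomic representations, synthetic targets in place of chemical shifts, and placeholder values for the kernel width and regularizer rather than the cross-validated ones:

```python
import numpy as np

def laplacian_kernel(A, B, sigma):
    """K_ij = exp(-||a_i - b_j||_1 / sigma) for rows of A and B."""
    dist = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1)
    return np.exp(-dist / sigma)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 8))   # stand-ins for atomic representations
y_train = X_train.sum(axis=1)        # stand-ins for chemical shifts
sigma, lam = 10.0, 1e-8              # placeholder hyperparameters

# Ridge-regularized kernel solve: (K + lam * I) alpha = y
K = laplacian_kernel(X_train, X_train, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)

# Predictions for new points use the train-test kernel
X_test = rng.normal(size=(5, 8))
y_pred = laplacian_kernel(X_test, X_train, sigma) @ alpha
```

With a small regularizer the model interpolates the training targets; in practice sigma and lam are tuned by cross-validation, as described above.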

II. REVIEWER 2
Reviewer Comment:
It is nice that the authors provide an all-in-one package to reproduce the manuscript. However, the required Python packages (and their versions) are currently not included in the README file. That is, I have to guess which packages to install. I managed to do so for most of them but can't find the module named 'g2s', which seems not well-known in the community.
Authors’ response:
We thank the reviewer for investing time into testing our code and for identifying an issue with the Python packages. We added a corresponding requirements.txt file which lists all necessary packages and versions. Moreover, we removed the g2s import from the corresponding files and manually added the necessary helper functions. We tested the requirements in a fresh Python environment to confirm that all imports work now. 

III. REVIEWER 3
Reviewer Comment:
The paper is well-written overall. There could be more explanations.
Authors’ response:
We thank the reviewer for investing time into providing us with valuable feedback for our work and appreciate the recognition that the paper is well written.

Reviewer Comment:
What is the relationship between Gaussian noise and MAE? In Fig. 2 and at the start of III A, the two are used together; I guess it means Gaussian noise which causes a mean deviation of X MAE (it isn't really an error, is it?). That should be clearer.
Authors’ response:
We thank the reviewer for highlighting this point. It is correct that the MAE indeed refers to the error, or rather the deviation, caused by applying the Gaussian noise. To make the results comparable with the ML based predictions, we chose to present an error metric rather than a standard deviation or sigma value, using the relation that the mean absolute deviation of a normal distribution is √(2/π) ≈ 0.8 of the standard deviation [19].
To avoid possible confusion, we adapted Fig. 2 and 3 to report the mean absolute deviation (MAD) instead of MAE and edited the corresponding captions, Sec. II B, and Sec. III A-C to distinguish better between deviation and MAE. We refer to the revised manuscript for all changes (red).
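The MAD-to-sigma conversion can be checked numerically; the target MAD below is an arbitrary illustrative value:

```python
import numpy as np

# For X ~ N(0, sigma), the mean absolute deviation is
# E|X| = sigma * sqrt(2/pi) ~= 0.798 * sigma.
target_mad = 1.0                           # illustrative MAD, e.g. in ppm
sigma = target_mad / np.sqrt(2.0 / np.pi)  # std. dev. that yields this MAD

# Empirical check with a large Gaussian sample
rng = np.random.default_rng(42)
noise = rng.normal(0.0, sigma, size=1_000_000)
empirical_mad = np.abs(noise).mean()       # approaches target_mad for large N
```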

Reviewer Comment:
Also, in III A, it says "Fig.2 a-b) depicts a sigmoidal shaped trend of Top-1 elucidation accuracies at increasing candidate pool sizes NQM9 as a function of MAE." I don't think this is strictly correct: the "trend at (is 'at' a correct English preposition here, btw?) pool sizes" isn't sigmoidal; the sigmoidal shape refers to the elucidation performance vs. MAE curves, and the multiple curves together aren't sigmoidal. Perhaps make that two sentences, explaining one curve first, then the combination in a second sentence?
Authors’ response:
We thank the reviewer for this comment. We adapted the sentence such that the increasing pool size is only mentioned in a second sentence:
Fig. 2 a-b) depicts a sigmoidal shaped trend of Top-1 elucidation performances as a function of the mean absolute deviation (MAD), corresponding to √(2/π) ≈ 0.8 of the standard deviation [19] of the applied Gaussian noise. Note that increasing the maximum candidate pool size NQM9 leads to an offset of the trend towards less permissible errors.
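A toy simulation of this matching experiment, with invented pool parameters unrelated to the QM9 data, shows the same qualitative behavior: Top-1 accuracy is near unity for small deviations and decays towards chance level as the noise grows:

```python
import numpy as np

rng = np.random.default_rng(0)
N_SHIFTS, N_TRIALS = 9, 200  # shifts per spectrum, repetitions (both invented)

def top1_accuracy(pool_size, noise_mad):
    """Fraction of noisy queries whose true candidate ranks first by MAE."""
    sigma = noise_mad / np.sqrt(2.0 / np.pi)  # convert MAD to std. dev.
    hits = 0
    for _ in range(N_TRIALS):
        # Random candidate "spectra"; one of them is the true structure
        pool = rng.normal(0.0, 50.0, size=(pool_size, N_SHIFTS))
        true_idx = rng.integers(pool_size)
        query = pool[true_idx] + rng.normal(0.0, sigma, size=N_SHIFTS)
        mae = np.abs(pool - query).mean(axis=1)  # score every candidate
        hits += int(np.argmin(mae) == true_idx)
    return hits / N_TRIALS

for mad in (0.1, 10.0, 100.0):  # sweeping MAD traces the sigmoidal decay
    print(f"MAD {mad:6.1f}: Top-1 accuracy {top1_accuracy(100, mad):.2f}")
```

Increasing pool_size shifts the decay towards smaller permissible deviations, consistent with the offset described in the adapted sentence above.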

Reviewer Comment:
My general remark about the paper is that it is nice, but not really that insightful. Essentially, you say that if there is a two-step structure elucidation, where the first step produces a candidate list and the second ranks the candidates by NMR similarity, then you need a combination of a first step which produces a good enough (i.e. small enough) candidate list and a second step which uses a good enough spectrum prediction to do the ranking. To a degree, the two can compensate for each other. Now that is not very surprising; everybody was working with that assumption anyway in the field. And the paper does not really quantify it, since it says that it all depends on the structure (and two of the three examples in Fig. 2 are even relatively similar).
Authors’ response:
Current automated ML based elucidation workflows commonly 1) search automatically through the full candidate graph space using enumerations or reinforcement learning, and 2) use only a single type of shift for scoring. Our analysis quantifies how the interplay between ML accuracy, candidate space, and training data can be balanced in ML based elucidation methods, providing quantitative insight into potential performance benefits. As shown in Fig. 3, this kind of analysis is applicable to any system and is not restricted to just three stoichiometries; here, we have demonstrated it for the 20 most common stoichiometries in the QM9 chemical space. The learning and elucidation performance curves presented in Fig. 5 quantify the accuracy of shift predictions required to achieve a certain elucidation success and, thus, in combination with multiple shift types, can significantly reduce ML data needs. A potential and interesting extension of the shown analysis, as hinted at by the reviewer, could be to estimate the predictive accuracy required to successfully elucidate unseen chemical spaces or stoichiometries. However, such an ML application builds on the presented work and would go well beyond its scope, as a more diverse and systematically enumerated NMR dataset would first have to be generated. Nevertheless, since such a method could be of immense value, we now mention it in our outlook in Sec. IV as:
Rather than solely relying on more accurate models, future approaches could deal with estimating the applicability of machine learning models to successfully elucidate unseen chemical spaces, as well as including explicit knowledge of chemical reactions, functional groups or data from mass spectrometry, infrared- or Raman spectroscopy, respectively.
To the best of our knowledge, and as also noted by reviewer 1, the analysis presented in this manuscript has not been performed before and quantifies:
1. The necessary accuracy of chemical shift predictions to reach specific elucidation performances.
2. Trends in elucidation performance across chemical compound space.
3. ML data needs in relation to the elucidation performance and candidate space.
4. Benefits of combining multiple shift types to lessen matching ambiguities and significantly reduce ML data needs.
For all these reasons, we do believe that our work is a significant stepping stone, and we hope that given all the included changes and explanations the manuscript provides useful insights for the community.

REFERENCES
[1] D. Lemm, G. F. von Rudorff, and O. A. von Lilienfeld, "Machine learning based energy-free structure predictions of molecules, transition states, and solids," Nature Communications, vol. 12, July 2021.
[2] B. D. McKay, M. A. Yirik, and C. Steinbeck, "Surge: a fast open-source chemical graph generator," Journal of Cheminformatics, vol. 14, Apr. 2022.
[3] S. Riniker and G. A. Landrum, "Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation," Journal of Chemical Information and Modeling, vol. 55, pp. 2562–2574, Dec. 2015.
[4] P. Pracht, F. Bohle, and S. Grimme, "Automated exploration of the low-energy chemical space with fast quantum chemical methods," Phys. Chem. Chem. Phys., vol. 22, pp. 7169–7192, 2020.
[5] W. Gerrard, L. A. Bratholm, M. J. Packer, A. J. Mulholland, D. R. Glowacki, and C. P. Butts, "IMPRESSION – prediction of NMR parameters for 3-dimensional chemical structures using machine learning with near quantum chemical accuracy," Chemical Science, vol. 11, no. 2, pp. 508–515, 2020.
[6] W. Gerrard, C. Yiu, and C. P. Butts, "Prediction of 15N chemical shifts by machine learning," Magnetic Resonance in Chemistry, vol. 60, pp. 1087–1092, Aug. 2021.
[7] R. Gaumard, D. Dragún, J. N. Pedroza-Montero, B. Alonso, H. Guesmi, I. Malkin Ondík, and T. Mineva, "Regression machine learning models used to predict DFT-computed NMR parameters of zeolites," Computation, vol. 10, no. 5, p. 74, 2022.
[8] Y.-H. Tsai, M. Amichetti, M. M. Zanardi, R. Grimson, A. H. Daranas, and A. M. Sarotti, "ML-J-DP4: An integrated quantum mechanics-machine learning approach for ultrafast NMR structural elucidation," Organic Letters, vol. 24, pp. 7487–7491, May 2022.
[9] A. Gupta, S. Chakraborty, and R. Ramakrishnan, "Revving up 13C NMR shielding predictions across chemical space: benchmarks for atoms-in-molecules kernel machine learning with new data for 134 kilo molecules," Machine Learning: Science and Technology, vol. 2, p. 035010, May 2021.
[10] M. Cordova, E. A. Engel, A. Stefaniuk, F. Paruzzo, A. Hofstetter, M. Ceriotti, and L. Emsley, "A machine learning model of chemical shifts for chemically and structurally diverse molecular solids," The Journal of Physical Chemistry C, vol. 126, pp. 16710–16720, Sept. 2022.
[11] F. M. Paruzzo, A. Hofstetter, F. Musil, S. De, M. Ceriotti, and L. Emsley, "Chemical shifts in molecular solids by machine learning," Nature Communications, vol. 9, Oct. 2018.
[12] A. S. Christensen, L. A. Bratholm, F. A. Faber, and O. Anatole von Lilienfeld, "FCHL revisited: Faster and more accurate quantum machine learning," The Journal of Chemical Physics, vol. 152, p. 044107, Jan. 2020.
[13] W. Bremser, "HOSE—a novel substructure code," Analytica Chimica Acta, vol. 103, no. 4, pp. 355–365, 1978.
[14] S. Kuhn and S. R. Johnson, "Stereo-aware extension of HOSE codes," ACS Omega, vol. 4, pp. 7323–7329, Apr. 2019.
[15] P. A. Unzueta, C. S. Greenwell, and G. J. O. Beran, "Predicting density functional theory-quality nuclear magnetic resonance chemical shifts via ∆-machine learning," Journal of Chemical Theory and Computation, vol. 17, pp. 826–840, Jan. 2021.
[16] H. Rull, M. Fischer, and S. Kuhn, "NMR shift prediction from small data quantities," arXiv preprint arXiv:2304.03361, 2023.
[17] H. Han and S. Choi, "Transfer learning from simulation to experimental data: NMR chemical shift predictions," The Journal of Physical Chemistry Letters, vol. 12, pp. 3662–3668, Apr. 2021.
[18] E. Jonas and S. Kuhn, "Rapid prediction of NMR spectral properties with quantified uncertainty," Journal of Cheminformatics, vol. 11, no. 1, pp. 1–7, 2019.
[19] R. C. Geary, "The ratio of the mean deviation to the standard deviation as a test of normality," Biometrika, vol. 27, p. 310, Oct. 1935.




Round 2

Revised manuscript submitted on 20 Sep 2023
 

11-Oct-2023

Dear Dr von Lilienfeld:

Manuscript ID: DD-ART-07-2023-000132.R1
TITLE: Impact of noise on inverse design: The case of NMR spectra matching

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below. Please address the remaining minor issues noted by the data reviewer directly in your code.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


 
Reviewer 3

My concerns have been addressed.

Reviewer 1

The authors addressed all of my comments appropriately.

Reviewer 2

Data and code were deposited fine this time, but there are still some minor errors caused by hardcoded file locations.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.