From the journal Digital Discovery Peer review history

Automated quantum chemistry for estimating nucleophilicity and electrophilicity with applications to retrosynthesis and covalent inhibitors

Round 1

Manuscript submitted on 17 Nov 2023
 

04-Dec-2023

Dear Dr Jensen:

Manuscript ID: DD-ART-11-2023-000224
TITLE: Automated Quantum Chemistry for Estimating Nucleophilicity and Electrophilicity with Applications to Retrosynthesis and Covalent Inhibitors

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************


 
Reviewer 1

The manuscript by Ree, Goller and Jensen develops an automated pipeline to estimate the main nucleophilic and electrophilic sites in small organic molecules without exotic atoms. Additionally, they provide the community with a webserver to make their method accessible to non-experts.

In my opinion, the main novelty here stems from the automatization level (including the atom site identification part) and the ease of use through the webserver interface and open source code. The use of methyl cation and methyl anion affinities instead of other indicators (e.g. hydride and proton affinities) to complement Mayr's model has been introduced by others, and thus in that sense this work is derivative (or engineering-like) in nature.

Thorough benchmarking of the many level-of-theory compromises that are necessary to make such computations routine, as well as validation regarding applications, is also included; neither was present in previous related work.

Overall, this is a well written, concise and to the point, nice piece of work and should be published. I commend the authors for their work!

Before that, however, I would like the following (minor) comments addressed:

- A major weakness, in my opinion, is the use of predetermined moieties that are hardcoded (from what I understood) as plausible nucleophilic and electrophilic sites. This is not so different from the manual identification that the authors are trying to solve in the first place, and is not trivial to extend to other functional groups (e.g. sulfones, boranes).

- Why does an approach similar to RegioML not apply here? Couldn't the computations be bypassed with ML? Is this just something for future work?

- Can the authors guarantee that clustering based on RMSDs cannot select the same conformer twice if, for instance, there is a tert-butyl moiety that rotates arbitrarily? In other words: are the heavy-atom RMSDs computed using atom indices? If not, I would suggest using dihedral angle lists to perform clustering instead, in the spirit of CENSO (shipped with CREST) by the Grimme group.

- The Figure 1 caption and discussion are unclear. I think the text could be rewritten to go panel by panel and clearly detail what is x and what is y.

- Something similar applies to Figure 3. I also ask the authors to rewrite or relabel appropriately so that every panel is addressed individually from the text. For example, it costs nothing to say "The propargylamides (Figure 3b and 3e)" etc.

- P11 The claim that the R2 coeff. with canonical HOMO and LUMO should probably be supported either in SI or, most likely, with a reference cited there.

- Some journal names are abbreviated and others aren't. Please standardize.

After discussing or dealing with these points, I think the manuscript can be published in Digital Discovery.

Reviewer 2

1. Data Sources

1b. If using an external database, is an access date or version number provided?

Data Reviewer Comment:
In the manuscript, while the two QM-derived datasets by Tavakoli et al. are referenced, specific access dates are not provided. While the relevant paper is cited, it would be beneficial to include a direct link to the dataset (https://cdb.ics.uci.edu/cgibin/ReactivitiesDatasetsWeb.html) for ease of access. This link could be conveniently placed either within the manuscript itself or in the GitHub repository.

1c. Are any potential biases in the source dataset reported and/or mitigated?

Data Reviewer Comment:
The documentation for the source datasets, as presented in the literature, does not explicitly address any potential biases. For a more comprehensive understanding and interpretation of the results, it would be advisable for the authors to discuss any known or potential biases in these datasets, or to state clearly if such biases are not identified or reported in the original sources.

6. Code and reproducibility

6b. Are scripts to reproduce the findings in the paper provided?

Data Reviewer Comment:
The provided code on GitHub is well-organized and user-friendly. However, the SLURM commands are hard coded in the Python scripts, which may not be compatible with various supercomputing environments. To enhance usability, I recommend making these SLURM commands user-defined arguments in the scripts.

Additionally, while the datasets necessary to reproduce the findings are included as CSV files, their correspondence to specific figures in the manuscript is not clearly delineated. For greater clarity, I suggest adding detailed information in the GitHub README to clearly specify which CSV file relates to which figure. This enhancement would streamline the process of matching the data with the results presented.

Reviewer 3

This manuscript presents an innovative approach to estimating nucleophilicity and electrophilicity via an automated quantum chemistry-based workflow. This methodology, which focuses on computing methyl cation and anion affinities to quantify these properties, represents a significant step forward in computing these important physical organic parameters. The validation against experimental data and higher-level quantum mechanical calculations underscores the robustness of the proposed workflow. This work would be an excellent addition to Digital Discovery, given the minor issues identified are adequately addressed.

Minor Issues:

1. The presented method mainly focuses on DMSO as a solvent. While theoretically applicable to other solvents, it would be beneficial for the authors to validate this approach in various other common organic solvents, both polar and non-polar. This would enhance the understanding of the method's universality and applicability across different chemical environments.

2. The authors’ approach offers a novel perspective compared to existing data-driven research in the field. Given the prior work in this area, particularly references 21 and 23, it would be informative for the authors to select representative examples and conduct a comparative analysis. Such an analysis should focus on the efficiency and accuracy of the proposed method in contrast to established machine learning approaches. This would not only benchmark the new method against existing technologies but also provide valuable insights into its unique advantages and potential limitations.

Reviewer 4

This study presents an automated procedure for computing methyl cation affinities (MCA) and methyl anion affinities (MAA) for quantifying nucleophilicity and electrophilicity, respectively. The workflow involves conformational search using a force-field method followed by semiempirical optimization with a DFT method and final energy evaluation using a DFT composite method. Implicit solvation correction is applied throughout the calculations. The most impressive parts of the workflow are the high degree of automation and the high efficiency in the requirements of time and computational resources. The manuscript is recommended for publication, but I have some comments for consideration.

The results from the new workflow are compared to PBE0-D3(BJ)/DEF2-TZVP COSMO(∞) level of theory. However, the comparison would benefit from a better description of the latter method. Is the same level of theory used both for energy evaluation and geometry optimization? What about conformational search?

The workflow uses a minimal basis set + heavy atom polarization for geometry optimization, which is usually not considered sufficient for anions. Does that have an effect on the MAA? Do the single point DFT energies include diffuse functions?

The authors cite a textbook that states that more or less all organic reactions involve electrophiles and nucleophiles. This may be correct, but it does not mean that all electrophiles and nucleophiles are identical and have the same regioselective preference. In other words, methyl cations and methyl anions may not always be the best probes, e.g. when analyzing retrosynthetic routes. This should be acknowledged when discussing the applicability of the computed MCA and MAA for retrosynthesis. The difference between different nucleophiles becomes clear when reading the section about the Covalent Reactivity Inhibitors, where the CH3S- probe gives very different results than the CH3- probe. Or is there some limitation in the computational procedure when it comes to handling sulfide anions?

In general the discussion of the use of the MCA for analyzing the Covalent Reactivity Inhibitors seems too limited. The results raise a lot of interesting questions that are not analyzed or answered. Maybe the topic for another paper?


 

Reviewer 1:

- A major weakness, in my opinion, is the use of predetermined moieties that are hardcoded (from what I understood) as plausible nucleophilic and electrophilic sites. This is not so different from the manual identification that the authors are trying to solve in the first place, and is not trivial to extend to other functional groups (e.g. sulfones, boranes).

The implemented set of SMIRKS for automatically detecting plausible nucleophilic and electrophilic sites is quite extensive and includes sulfones (indeed anything with double bonds). We have now added boranes and, in general, the list of SMIRKS can be easily modified and further extended in “src/esnuel/locate_atom_sites.py”.


- Why does an approach similar to RegioML not apply here? Couldn't the computations be bypassed with ML? Is this just something for future work?

As stated in the paper, an atom-based ML model similar to RegioML could indeed be suitable for predicting the MAA and MCA scores. However, due to the lack of large and diverse datasets, ML methods are currently only applicable to confined parts of chemical space and generalize poorly to out-of-sample molecules. Therefore, we are currently using the presented QM-based workflow to generate a large and diverse synthetic dataset that can then be used to train an atom-based ML model similar to RegioML.


- Can the authors guarantee that clustering based on RMSDs cannot select the same conformer twice if, for instance, there is a tert-butyl moiety that rotates arbitrarily? In other words: are the heavy-atom RMSDs computed using atom indices? If not, I would suggest using dihedral angle lists to perform clustering instead, in the spirit of CENSO (shipped with CREST) by the Grimme group.

The clustering is performed with RDKit, where molecules are aligned before the heavy-atom RMSD is calculated and the clustering is carried out. However, the clustering is only used to remove similar conformers in order to reduce the computational time. Hence, the clustering method does not compromise the quality of the results.
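
As an illustration of the point made in this response, the following is a minimal sketch of threshold-based conformer pruning on a precomputed pairwise RMSD matrix. It is not the authors' RDKit implementation: the function name, the toy matrix, and the 0.5 Å threshold are all hypothetical, and the conformers are assumed to be sorted by increasing energy.

```python
from typing import List, Sequence


def cluster_by_rmsd(rmsd: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return the indices of the conformers kept as cluster representatives.

    rmsd[i][j] is assumed to be the heavy-atom RMSD between conformers i and j
    after alignment. Conformers are assumed to be sorted by increasing energy,
    so the first member of each cluster is its lowest-energy representative.
    """
    representatives: List[int] = []
    for i in range(len(rmsd)):
        # Keep conformer i only if it is farther than `threshold`
        # from every representative kept so far.
        if all(rmsd[i][j] > threshold for j in representatives):
            representatives.append(i)
    return representatives


# Toy 4-conformer RMSD matrix in angstrom (made-up numbers, not from the paper).
toy_rmsd = [
    [0.0, 0.1, 1.2, 1.3],
    [0.1, 0.0, 1.1, 1.2],
    [1.2, 1.1, 0.0, 0.2],
    [1.3, 1.2, 0.2, 0.0],
]
kept = cluster_by_rmsd(toy_rmsd, threshold=0.5)
print(kept)
```

In a scheme of this kind, two symmetry-equivalent conformers that receive a large index-based RMSD would both be kept as representatives. That adds redundant calculations but does not discard any distinct conformer, which is consistent with the statement that the pruning affects cost rather than quality.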


- The Figure 1 caption and discussion are unclear. I think the text could be rewritten to go panel by panel and clearly detail what is x and what is y.

We have added references to the specific panels. For example, the first passage below is the original text and the second is the revised text:

“Specifically, the R2 coefficients of 0.84 and 0.94 are somewhat similar to 0.89 and 0.96 for MCAs and MAAs, respectively, with the latter based on Gibbs free energies at the PBE0-D3(BJ)/DEF2-TZVP COSMO(∞) level of theory.”

“Specifically, the R2 coefficients of 0.84 and 0.94 in Figures 1a and 1b are somewhat similar to 0.89 and 0.96 for MCAs and MAAs, respectively, with the latter based on Gibbs free energies at the PBE0-D3(BJ)/DEF2-TZVP COSMO(∞) level of theory as reported by Van Vranken and Baldi.15,16”

- Something similar applies to Figure 3. I also ask the authors to rewrite or relabel appropriately so that every panel is addressed individually from the text. For example, it costs nothing to say "The propargylamides (Figure 3b and 3e)" etc.

All of the panels are now referenced individually in the text.

- P11 The claim that the R2 coeff. with canonical HOMO and LUMO should probably be supported either in SI or, most likely, with a reference cited there.

This statement has now been clarified with a reference to prior results from Hermann et al. (J. Comput. Aided Mol. Des. 2020, 35, 531–539).


- Some journal names are abbreviated and others aren't. Please standardize.

This has been corrected.



Reviewer 2:

1. Data Sources
1b. If using an external database, is an access date or version number provided?

Data Reviewer Comment:
In the manuscript, while the two QM-derived datasets by Tavakoli et al. are referenced, specific access dates are not provided. While the relevant paper is cited, it would be beneficial to include a direct link to the dataset (https://cdb.ics.uci.edu/cgibin/ReactivitiesDatasetsWeb.html) for ease of access. This link could be conveniently placed either within the manuscript itself or in the GitHub repository.

We have now provided an overview of the different datasets including access dates and direct links in our GitHub repository as a README file inside the data folder: https://github.com/jensengroup/ESNUEL/tree/main/data/README.md


1c. Are any potential biases in the source dataset reported and/or mitigated?

Data Reviewer Comment:
The documentation for the source datasets, as presented in the literature, does not explicitly address any potential biases. For a more comprehensive understanding and interpretation of the results, it would be advisable for the authors to discuss any known or potential biases in these datasets, or to state clearly if such biases are not identified or reported in the original sources.

While dataset biases are highly important for ML models, our presented method is based on quantum chemical calculations and the datasets do not influence the quality of the results.
Of course, any potential dataset bias would be relevant for the evaluation of our method, which is why we test our method against several very different datasets.
There are no reported biases in the original sources.


6. Code and reproducibility
6b. Are scripts to reproduce the findings in the paper provided?

Data Reviewer Comment:
The provided code on GitHub is well-organized and user-friendly. However, the SLURM commands are hard coded in the Python scripts, which may not be compatible with various supercomputing environments. To enhance usability, I recommend making these SLURM commands user-defined arguments in the scripts.

The SLURM commands can now be modified via command line arguments and the GitHub README is updated accordingly.
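
A minimal sketch of what exposing SLURM settings as command-line arguments can look like is given below. It is not the ESNUEL interface: the option names, defaults, and the helper functions `build_parser` and `sbatch_command` are hypothetical, and only the `sbatch` flags themselves (`--partition`, `--cpus-per-task`, `--mem`, `--time`) are standard SLURM options.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose options mirror common sbatch settings."""
    parser = argparse.ArgumentParser(description="Submit a calculation via SLURM (illustrative).")
    # All option names and defaults below are hypothetical, not the ESNUEL CLI.
    parser.add_argument("--partition", default="compute", help="SLURM partition name")
    parser.add_argument("--cpus", type=int, default=4, help="CPUs per task")
    parser.add_argument("--mem-gb", type=int, default=8, help="Memory per job in GB")
    parser.add_argument("--time", default="01:00:00", help="Wall-time limit (HH:MM:SS)")
    return parser


def sbatch_command(script: str, argv: Optional[List[str]] = None) -> List[str]:
    """Translate user-supplied options into an sbatch argument list."""
    args = build_parser().parse_args(argv)
    return [
        "sbatch",
        f"--partition={args.partition}",
        f"--cpus-per-task={args.cpus}",
        f"--mem={args.mem_gb}G",
        f"--time={args.time}",
        script,
    ]


# Build (but do not run) a submission command with two settings overridden.
command = sbatch_command("job.sh", ["--partition", "gpu", "--cpus", "8"])
print(" ".join(command))
```

Keeping the cluster-specific values out of the script body in this way lets users on differently configured clusters change partitions or resource limits without editing the source.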


Additionally, while the datasets necessary to reproduce the findings are included as CSV files, their correspondence to specific figures in the manuscript is not clearly delineated. For greater clarity, I suggest adding detailed information in the GitHub README to clearly specify which CSV file relates to which figure. This enhancement would streamline the process of matching the data with the results presented.

This has been clarified in the README file inside the data folder: https://github.com/jensengroup/ESNUEL/tree/main/data/README.md



Reviewer 3:

Minor Issues:
1. The presented method mainly focuses on DMSO as a solvent. While theoretically applicable to other solvents, it would be beneficial for the authors to validate this approach in various other common organic solvents, both polar and non-polar. This would enhance the understanding of the method's universality and applicability across different chemical environments.

The choice of solvent does not appear to alter the predictions significantly. For example, Van Vranken and co-workers [15,16] compared MCA and MAA predicted for DMSO and an infinite dielectric constant and found a very high degree of correlation (R2 values ≥ 0.95). The experimental values shown in Figure 1a are obtained for 3 different solvents (dichloromethane, acetonitrile, and DMSO) while the calculated values are obtained with DMSO. Furthermore, the correlation is no better for the points measured in DMSO than for those measured in dichloromethane.


2. The authors’ approach offers a novel perspective compared to existing data-driven research in the field. Given the prior work in this area, particularly references 21 and 23, it would be informative for the authors to select representative examples and conduct a comparative analysis. Such an analysis should focus on the efficiency and accuracy of the proposed method in contrast to established machine learning approaches. This would not only benchmark the new method against existing technologies but also provide valuable insights into its unique advantages and potential limitations.

We compare our approach to the ML model of ref. 24:
“The ability to replicate higher-level results is further supported by the very strong correlation in the bottom panels of Figure 1 with R2 coefficients of 0.98 and 0.99 for MCAs and MAAs, respectively. These results actually outperform the ML models by Tavakoli et al. 24, which achieved 10-fold cross-validation R2 coefficients of 0.92 ± 0.02 and 0.94 ± 0.02 for MCAs and MAAs, respectively.” Furthermore, we note that despite what is stated in the paper, the model is not available to others.

The method described in reference 23 only gives one prediction per molecule and is therefore more limited than, and not directly comparable to, our method. Reference 21 only presents cross-validation errors for the training set (and only for nucleophilicity); although these errors are lower, using them would not be a fair comparison. Finally, it is not clear from the GitHub repo how to obtain predictions for individual atoms, which we would need for a thorough comparison.
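
For readers who wish to reproduce comparisons of this kind, the sketch below computes an R2 coefficient between two sets of values. It assumes that R2 denotes the coefficient of determination of a simple linear fit, which equals the squared Pearson correlation; the response does not state how R2 was calculated. The function name and the toy numbers are hypothetical.

```python
from math import fsum
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between x and y.

    For an ordinary least-squares line fitted to (x, y), this equals
    the coefficient of determination of the fit.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = fsum(x) / n
    mean_y = fsum(y) / n
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("R2 is undefined when either series is constant")
    return (sxy * sxy) / (sxx * syy)


# Toy data (made-up numbers): a perfectly linear relationship.
reference_values = [1.0, 2.0, 3.0, 4.0]
predicted_values = [2.0, 4.0, 6.0, 8.0]
print(r_squared(reference_values, predicted_values))
```

Note that a cross-validated R2, such as the one quoted for the ML models, is evaluated on held-out folds, so it is not computed in exactly the same way as a correlation over a full benchmark set.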


Reviewer 4:

The results from the new workflow are compared to PBE0-D3(BJ)/DEF2-TZVP COSMO(∞) level of theory. However, the comparison would benefit from a better description of the latter method. Is the same level of theory used both for energy evaluation and geometry optimization? What about conformational search?

The PBE0-D3(BJ)/DEF2-TZVP COSMO(∞) calculations are extracted from refs. 15, 16, and 24. Both the energy evaluation and geometry optimization are performed at the PBE0-D3(BJ)/DEF2-TZVP COSMO(∞) level of theory. Details on the conformational search are not described.


The workflow uses a minimal basis set + heavy atom polarization for geometry optimization, which is usually not considered sufficient for anions. Does that have an effect on the MAA? Do the single point DFT energies include diffuse functions?

In ref. 15, MAAs at the PBE0-D3(BJ)/def2-TZVP COSMO(∞) level of theory are shown to give the same or better (R2 0.97 vs 0.96) linear correlation with Mayr electrophilicity compared to B3LYP/6-311++G(3df,2pd), which demonstrates that the inclusion of diffuse functions has a minor or no effect on the MAAs.
The presented workflow employs the r2SCAN-3C functional for the final energy evaluation, which uses a modified version of the def2-TZVP basis set and does not include diffuse functions.


The authors cite a textbook that states that more or less all organic reactions involve electrophiles and nucleophiles. This may be correct, but it does not mean that all electrophiles and nucleophiles are identical and have the same regioselective preference. In other words, methyl cations and methyl anions may not always be the best probes, e.g. when analyzing retrosynthetic routes. This should be acknowledged when discussing the applicability of the computed MCA and MAA for retrosynthesis. The difference between different nucleophiles becomes clear when reading the section about the Covalent Reactivity Inhibitors, where the CH3S- probe gives very different results than the CH3- probe. Or is there some limitation in the computational procedure when it comes to handling sulfide anions?

We have added the following sentence:
“In summary, the use of MCAs and MAAs to flag retrosynthetic steps for further inspection seems promising but will require further work. For example, the effect of using other reactivity probes than methyl anion and cation.”


In general the discussion of the use of the MCA for analyzing the Covalent Reactivity Inhibitors seems too limited. The results raise a lot of interesting questions that are not analyzed or answered. Maybe the topic for another paper?

The goal of both this section and the one on retrosynthesis is to give examples of use cases and set directions for possible future studies. As such, we feel they are appropriate and important to the paper.




Round 2

Revised manuscript submitted on 18 Dec 2023
 

30-Dec-2023

Dear Dr Jensen:

Manuscript ID: DD-ART-11-2023-000224.R1
TITLE: Automated Quantum Chemistry for Estimating Nucleophilicity and Electrophilicity with Applications to Retrosynthesis and Covalent Inhibitors

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery


 
Reviewer 1

The manuscript is now suitable for publication.

Reviewer 3

My previous concerns have been resolved. I support its publication.

Reviewer 2

The authors have satisfactorily addressed all comments related to the data and code. The revisions and additions are well-reflected in both the updated manuscript and the corresponding GitHub repository. Based on these updates, I recommend the manuscript for publication in Digital Discovery.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.


Content on this page is licensed under a Creative Commons Attribution 4.0 International license.