From the journal Digital Discovery Peer review history

Fast exploration of potential energy surfaces with a joint venture of quantum chemistry, evolutionary algorithms and unsupervised learning

Round 1

Manuscript submitted on 29 Jun 2022
 

07-Aug-2022

Dear Dr Mancini:

Manuscript ID: DD-ART-06-2022-000070
TITLE: Artificial Intelligence meets Quantum Chemistry for the effective exploration and exploitation of intra- and inter-molecular energy landscapes.

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy from CASRAI, https://casrai.org/credit/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

The paper "Artificial Intelligence meets Quantum Chemistry for the effective exploration and exploitation of intra- and inter-molecular energy landscapes" presents a new strategy, based on genetic algorithms, for the automatic exploration of complex potential energy surfaces.
The work tackles a problem that is certainly extremely relevant in computational chemistry - i.e., the application of machine learning algorithms to automatic reaction network exploration - and that is well suited for "Digital Discovery".
However, I believe that the authors should significantly restructure and extend the presentation of the algorithm before the paper can be considered for publication.
In fact, from the current version of the paper, it is in my opinion extremely hard to understand how the algorithms proposed by the authors work in practice, and how they differ from one another.
This makes it very difficult to interpret the results that are reported in the final part of the paper.
Specifically, I believe that the changes in the theory part should address the following issues:

1. In the introduction, the authors mention that one of the novelties introduced by the present paper compared to their previous one is the use of internal coordinates.
However, the theory part does not discuss - if I am not mistaken - in sufficient detail how internal coordinates are actually used in the context of the present work.
Defining internal coordinates for systems as complex as the ones that the authors study in the present work (especially the hydrated ion, where non-bonded molecules are present) is highly non-trivial and, therefore, the authors should explain in detail in the theory section how they exploit the power of internal coordinates.

2. Still in the context of internal coordinates, the authors mention that they introduce a "complete new set of operators".
However, from my understanding, it's not clear what these operators are from the discussion in the theory part.

3. In the first column of Page 2, the sentence "Once again we carry out ..." does not read correctly.

4. In section 2.1, I do not understand how the lambda/mu ratio can be kept unitary and, at the same time, have a selection rate s != 1.
More in general, I believe that the authors should explain in much more detail how the algorithm works, focusing on the meaning of the lambda/mu/s parameters.
The authors should also highlight how this algorithm works, in practice, when applied to PES exploration.
In fact, in the last part of Page 2 the description of the algorithm is abstract, while at the beginning of Page 3 the authors mention how their choice of the parameters enables saving computational time.
In order to understand why this is the case, I believe that a clearer description of how the GA is combined with electronic-structure calculations is crucial.

5. The description of the algorithm in the first column of Page 3 is entirely not clear to me.
I guess the main issue is that the authors consistently interchange between GA-specific (gene/alleles) and chemistry (dihedral/conformers...) terms.
This makes the algorithm description extremely confusing, in my opinion.
I believe that the text would be much clearer if the authors first introduced the algorithm from an abstract perspective, and then described its application to the PES exploration problem.
A figure (or taking a small molecule as a test case) may help a lot.

6. Similarly to point 5.: it's not clear to me how the Simulated Binary Crossover works in practice. Specifically: what are P1 and P2? How are the C1 and C2 parameters used in practice?

7. Regarding the presentation of the Latin Hypercube: why is the output of the LH algorithm N ensembles of m replicas? If the interval is divided in N non-overlapping intervals, and an element is sampled from each of these intervals, how is it possible that the sample has m replicas? I would assume that one gets m ensembles of N replicas - however, only if the intervals contain the same number of elements.

8. Is the m parameter introduced at the beginning of section 2.4.1 the same as the one used on page 3? If not, I would suggest using another symbol.

9. The "hall of fame" mechanism should be explained in much more detail - possibly in the theory section. Specifically, it's not clear to me if it's an alternative to the algorithm that the authors propose to generate the initial population, or if it's an alternative to the whole GA. Also, it's not clear where the equation reported at the beginning of page 6 comes from.

10. In the application section, the authors explain that they are not including the "stiff" coordinates in the exploration.
They instead modify the "large-amplitude" ones and relax, at each step, the stiff coordinates.
However, how do they define the "large-amplitude" coordinates? In the introduction, they claim to not only include dihedral angles, but instead use a "complete new set of coordinates".
How is this realized, in practice?

11. Figures 2 and 3 lack the labels.

12. In the caption of Figure 5, the authors say that one of the two plots is obtained with PM7 calculations. Shouldn't it be DFTB?

Note that my comments mostly concern the theory part. This is the case because the results part is, in my opinion, difficult to read as a result of the problems of the theory part.
I, therefore, leave the comments to the result part to the second revision.

Reviewer 2

The manuscript “Artificial Intelligence meets Quantum Chemistry for the effective exploration and exploitation of intra- and intermolecular energy landscapes.” by Mancini and coworkers describes a sampling algorithm based on evolutionary algorithms and the use of curvilinear coordinates. The efficient & fairly automated sampling of structures is of utmost importance for becoming predictive with quantum chemical simulations. Hence, this paper will be of interest to the computational chemistry community. Apart from the title, which does not reflect the content of the paper (remark 1), I think that the manuscript can be published after a revision of the following points:

1. The title is attempting to be catchy, but misleading and not reflecting the content of the paper. First, the presented method does not use artificial intelligence & I think we should be careful when using this term and distinguish it from other subfields, like optimization and machine learning. Cluster analyses etc. are part of unsupervised learning algorithms. Yet, it is used to reduce the number of considered structures, nothing is “learned” from it & e.g., used in subsequent simulations. Hence, this does not qualify as “artificial intelligence”. Furthermore, to call the presented approach “effective” requires the presentation of timings (remark 2).

2. An important point about sampling algorithms is the computational time that one spends to sample the conformational space. Due to extensive sampling and reoptimization that is necessary in CREST, which is the current state-of-the-art, one is roughly restricted to systems of about 200 atoms. Even then, the sampling may not be complete as the authors also pointed out in this work. But how does the computational cost compare to CREST, e.g., to obtain the data in Table 3?

3. The claim made on p.2 that the “current standards” only permit an a posteriori interpretation of spectroscopic data is not really justified. Particularly, since the publication of CREST along with modern SE methods, the state-of-the-art became that a conformational sampling plus refinement is done that precedes the calculation of spectroscopic properties. Particularly, the Grimme group has promoted this idea strongly in several papers:
10.1021/acs.joc.1c02008, 10.1021/acs.jpca.1c00971, 10.1002/anie.201708266. So, the claim made by the authors is not justified or it must be clarified what actually distinguishes the present work from the established workflows described in the literature.

4. On p. 3: The 4th operator (mirror) would change the chirality of chiral compounds and should be avoided, if one is interested in simulating optical rotation or circular dichroism. This should be discussed & is it possible to exclude it or only activate it for achiral compounds?

5. This paper is quite strong in self citations (22/80). I recommend giving credit also to other groups who worked in this field and in computational spectroscopy. I already mentioned some papers in remark 3. Two more from the Grimme group actually fit well to be cited on p.8 along with Refs. 69 & 70: 10.1039/C3CP52293H and 10.1002/jcc.23649.

6. None of the identified structural ensembles are given in the SI. The authors must provide the structural ensembles (digitally, not in a pdf) that were generated with the different approaches. Particularly the ones used in the computation of spectroscopic data (Table 6) must be provided.

7. On p.11: “…requires proper account of 4d electrons”. I am not entirely sure, which shells are considered explicitly in SC and LC (this information should be given maybe in the SI). But could also other shells play a role here or can this be ruled out?

Minor issues:
A. p.2: “Once again we carry out an in depth benchmark of again we last generation…” That sentence is broken.
B. p.2: “brief recap of IM-EA” -> “brief recap of the IM-EA”
C. p.3: What is eta in equation 1 & 2?
D. p.4: “GFN-xTB” -> “GFN2-xTB”
E. p.7: “best XTB and DFTBA replicas”, but in the caption of Figure 5, it says “PM7 and XTB”. Which one is it? Also here, the two plots should be labeled (method and molecule investigated).
F. p.13: “authors red” -> “authors read”

Reviewer 3

In this article, Mancini and co-workers expand on their recent previous work (Mancini, G. et al., J. Chem. Phys., 2020, 153, 124110, ref. 15 of the present work) by improving their algorithms to explore complex potential energy surfaces. Their updated approach is showcased via application to two rather different and challenging systems – conformational structures of aspartic acid in the gas phase, and the microsolvation structure of silver cations in water.

The core work is sound. The authors convincingly demonstrate that their approach can efficiently find all major conformers of the aspartic acid molecule and provide useful benchmark results for different algorithm parameters (mutation, crossover, use of elitism in the evolutionary algorithm) and levels of theory. The demonstrated agreement with rotational constants determined by microwave spectroscopy is impressive, and their analysis of the conformer that the CREST approach was unable to find in their test runs was quite interesting. Similarly, in the case of the silver cation in water, the authors show that their approach is able to find structures in reasonable agreement with experimental XAS data, with nearly linear structures in the first coordination sphere.

The work here is clearly original and of significant impact, particularly as the field moves towards larger and more flexible systems. While only two molecular systems were considered, they are wildly different and so the extent of agreement between the computational methods presented here and the experimental results is quite impressive. This paper should definitely be published.

My only minor note that the authors may wish to address is that Figure 5 and its accompanying caption are a bit confusing. First, in what is certainly just a typo, the caption indicates that the two plots are of data calculated with the PM7 and XTB methods (not obvious from the caption which is the top plot and which is the bottom plot), but the accompanying text in section 4.1.1 says this is for the XTB and DFTBA methods. A further point with this figure is that I was uncertain how (or if) I should interpret the x- and y-axes or instead simply look at the clustering patterns. In particular, I’m aware this is a low dimensional representation of high dimensional objects, and so there may not be any physically intuitive mapping of the data to the plot axes. That said, given that the objects in the two different plots are presumably very similar sets of conformers, the difference in scale of approximately a factor of 11 was confusing and somewhat distracting (e.g. x-axis in top plot ranges from roughly -750 to +800, while the same axis in the bottom plot ranges from roughly -65 to +70). A bit more guidance to readers for how to interpret this figure (either in the caption or main text) would be appreciated.

Other than that, I have no other major critiques. Again, the text was clear and it was easy to follow the key points of the manuscript. I look forward to future developments in this area of research!


 

Pisa, August 13th 2022

To:
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

Dear Editor,

Enclosed please find a revised version of the manuscript entitled “Fast exploration of Potential Energy Surfaces with a joint venture of Quantum Chemistry, Evolutionary Algorithms and Unsupervised Learning” (formerly “Artificial Intelligence meets Quantum Chemistry for the effective exploration and exploitation of intra- and inter-molecular energy landscapes”) by G. Mancini, M. Fusè, F. Lazzari and V. Barone that we are submitting for publication in RSC Digital Discovery, after major revisions.

We note that all Referees agreed about the overall quality and significance of the manuscript and of the results presented. Referees 1 and 2 required clarifications to the text, mostly in the Methods section, while Referee 3 just required some improvement of Figure 5. In the present revision, we have carefully considered all the suggestions and remarks proposed by the Referees, and a detailed point-by-point response together with a list of changes is included here below. In particular, we have completely rewritten the Methods section and added an accompanying appendix in the SI to clarify the workflow and the meaning of IM/EA parameters.
We are confident that the quality and readability of the manuscript have been improved thanks to the Referees’ suggestions. We would like to thank you and the Referees for your kind cooperation and hope that the manuscript is now suitable for publication in RSC Digital Discovery.

Note about the list of changes
Note that page numbers and bibliographic entries reported in the following text refer to the present revised version of the manuscript in which some citations and paragraphs have been added or modified. Finally, the whole manuscript has been carefully reviewed and minor style modifications have been made in order to improve its readability. We attach a version of the manuscript with all changes highlighted for your convenience.


Thank you in advance for your attention.
Sincerely yours,


Giordano Mancini

List of Changes
Referee: 1

Comments to the Author
The paper "Artificial Intelligence meets Quantum Chemistry for the effective exploration and exploitation of intra- and inter-molecular energy landscapes" presents a new strategy, based on genetic algorithms, for the automatic exploration of complex potential energy surfaces.
The work tackles a problem that is certainly extremely relevant in computational chemistry - i.e., the application of machine learning algorithms to automatic reaction network exploration - and that is well suited for "Digital Discovery".
However, I believe that the authors should significantly restructure and extend the presentation of the algorithm before the paper can be considered for publication. In fact, from the current version of the paper, it is in my opinion extremely hard to understand how the algorithms proposed by the authors work in practice, and how they differ from one another. This makes it very difficult to interpret the results that are reported in the final part of the paper.

Specifically, I believe that the changes in the theory part should address the following issues:

1. In the introduction, the authors mention that one of the novelties introduced by the present paper compared to their previous one is the use of internal coordinates. However, the theory part does not discuss - if I am not mistaken – in sufficient detail how internal coordinates are actually used in the context of the present work. Defining internal coordinates for systems as complex as the ones that the authors study in the present work (especially the hydrated ion, where non-bonded molecules are present) is highly non-trivial and, therefore, the authors should explain in detail in the theory section how they exploit the power of internal coordinates.

Answer: The Referee is right in asking for a clearer discussion of this issue. We have introduced six quaternion-based operators for the exploration of non-covalently bound systems such as the Ag+(aq) ion; these operators take into account the topology of the system and are broadly referred to as “internal coordinates” since they do not operate on individual Cartesian coordinates, as is usually done in conventional MD simulations (disregarding the use of constraints), to generate new configurations. One reason for using these operators is that it is easier to generate “large amplitude” motions (i.e. to obtain a new configuration which is significantly different from the starting one(s)) while still keeping the number of unphysical configurations under control, which is synergistic with the exploration feature of metaheuristics. We have clarified these points in the Introduction and extended the presentation of the operators in the Methods section (see the new 2.2 and 2.3 subsections) and also added the following references:
G. Mancini, S. Del Galdo, B. Chandramouli, M. Pagliai and V. Barone, J. Chem. Theory Comput., 2020, 16, 5747–5761.
C. F. Karney, J. Mol. Graphics Modell., 2007, 25, 595–604.


2. Still in the context of internal coordinates, the authors mention that they introduce a "complete new set of operators". However, from my understanding, it's not clear what these operators are from the discussion in the theory part.

Answer: We have clarified that the “operators” are the types of moves that are randomly carried out with predefined probabilities to generate new configurations in non-covalent systems. We have modified section 2.2 in order to clarify this aspect.


3. In the first column of Page 2, the sentence "Once again we carry out ..." does not read correctly.

Answer: The sentence has been corrected.

4. In section 2.1, I do not understand how the lambda/mu ratio can be kept unitary and, at the same time, have a selection rate s != 1. More in general, I believe that the authors should explain in much more detail how the algorithm works, focusing on the meaning of the lambda/mu/s parameters. The authors should also highlight how this algorithm works, in practice, when applied to PES exploration. In fact, in the last part of Page 2 the description of the algorithm is abstract, while at the beginning of Page 3 the authors mention how their choice of the parameters enables saving computational time. In order to understand why this is the case, I believe that a clearer description of how the GA is combined with electronic-structure calculations is crucial.

Answer: The Referee is right in raising this point; in our previous contribution we included a lengthy explanation of the parameters, which was only mentioned in this manuscript. For the sake of completeness and clarity we have added a specific section in the Methods Section (2.2).
The λ/μ ratio is the relative amount of new offspring (λ) generated by the parents chosen for reproduction (μ); for example, in a population of 10 specimens (P=10) we could pick two random ones (μ=2) to generate four offspring (λ=4 and λ/μ=2) and then apply some control mechanism to keep P constant (i.e. eliminate 4 specimens in this example). In our application the default is to generate a pair of new offspring from a pair of parents, hence λ/μ=1; the selection rate (s) parameter specifies the fraction of total new offspring (i.e. a fraction of P) that must be generated at each step (usually 0.5), and in this case s=λ=μ. We have introduced s in an attempt to be consistent with the specific meaning of λ and μ, since the amount of offspring generated can be decided in different ways.
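
The bookkeeping in this answer can be illustrated with a short sketch. It is not taken from the IM-EA code: the function name and its arguments are hypothetical, and the value s = 0.4 in the first call is simply λ/P for the ten-specimen example above.

```python
def offspring_counts(pop_size, s, ratio=1.0):
    """Return (mu, lam, n_removed) for one generation (illustrative only).

    pop_size -- P, the population size that is kept constant
    s        -- selection rate: fraction of P generated as offspring per step
    ratio    -- lambda/mu, offspring produced per parent (1 in the default scheme)
    """
    lam = round(s * pop_size)   # new offspring created in this step
    mu = round(lam / ratio)     # parents needed to produce them
    n_removed = lam             # specimens eliminated to restore P
    return mu, lam, n_removed


# Example from the answer: P=10, two parents give four offspring (lambda/mu = 2)
print(offspring_counts(10, 0.4, ratio=2.0))  # (2, 4, 4)
# Default scheme: lambda/mu = 1 and s = 0.5
print(offspring_counts(20, 0.5))             # (10, 10, 10)
```

With λ/μ = 1 the numbers of parents and offspring coincide, so a single parameter s is enough to fix both.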

5. The description of the algorithm in the first column of Page 3 is entirely not clear to me. I guess the main issue is that the authors consistently interchange between GA-specific (gene/alleles) and chemistry (dihedral/conformers...) terms. This makes the algorithm description extremely confusing, in my opinion. I believe that the text would be much clearer if the authors first introduced the algorithm from an abstract perspective, and then described its application to the PES exploration problem. A figure (or taking a small molecule as a test case) may help a lot.

Answer: In order to address the Referee point we have reorganized the Methods Section in the following way:
Subsection 2.1 (The IM-EA engine) is a general introduction to GAs and to IM/EA; Figure 1 has been added as well. It also explains from an abstract point of view (not connected to chemistry) how the method works and the meaning of the parameters. A flow chart has been added as well (Figure 2). The explanations of the Simulated Binary Crossover (SBX) and Hall of Fame (HOF) mechanisms have been moved here.
Subsection 2.2 (Manipulation of molecular structures) explains the analogies between GA and chemistry terminology and how covalent structures are manipulated.
Subsection 2.3 (Mutation and crossover for non-covalent interactions) explains how the quaternion operators work (see also point 1 above).
Subsections 2.4 (Initial population), 2.5 (Quantum Chemistry) and 2.6 (Analysis of structures) have not been modified significantly.

6. Similarly to point 5.: it's not clear to me how the Simulated Binary Crossover works in practice. Specifically: what are P1 and P2? How are the C1 and C2 parameters used in practice?

Answer: P1 and P2 (i.e. parent 1 and parent 2) are the actual structures that mate (that is, the coordinates of P1 and P2 are mixed), while C1 (child 1) and C2 (child 2) are the corresponding children/offspring. We have clarified this point as well as the meaning of the β, μ and η parameters: β is defined in terms of a uniformly distributed random number μ and a spread factor η (the latter controls how closely the offspring alleles will resemble those of the parents).
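
For readers unfamiliar with SBX, a minimal sketch of the textbook form of the operator for a single real-valued gene is given below. The manuscript's own equations 1 and 2 are not reproduced in this review history, so the formulas here are the standard ones and are an assumption about what the authors use; the function name is hypothetical, the variable u plays the role of the random number called μ in the answer, and the periodicity of dihedral angles is ignored.

```python
import random


def sbx_pair(p1, p2, eta, rng=random):
    """Textbook simulated binary crossover for one real-valued gene.

    p1, p2 -- parent values (e.g. the same dihedral angle in two structures)
    eta    -- spread factor: larger values keep children closer to the parents
    rng    -- object with a random() method returning a float in [0, 1)
    """
    u = rng.random()  # uniformly distributed random number
    if u <= 0.5:
        beta = (2.0 * u) ** (1.0 / (eta + 1.0))
    else:
        beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
    # The children are placed symmetrically around the parents' midpoint
    c1 = 0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2)
    c2 = 0.5 * ((1.0 - beta) * p1 + (1.0 + beta) * p2)
    return c1, c2


c1, c2 = sbx_pair(60.0, 180.0, eta=2.0, rng=random.Random(0))
```

In this form the mean of C1 and C2 always equals the mean of P1 and P2; only their spread around it depends on β.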

7. Regarding the presentation of the Latin Hypercube: why is the output of the LH algorithm N ensembles of m replicas? If the interval is divided in N non-overlapping intervals, and an element is sampled from each of these intervals, how is it possible that the sample has m replicas? I would assume that one gets m ensembles of N replicas - however, only if the intervals contain the same number of elements.

Answer: We have tried to reformulate the definition of LH to clarify it. In a one-dimensional LH sampling, if we have to extract N samples from a distribution, we divide it into N evenly spaced regions and then pick a value from each region with uniform probability; in other words, we get one ensemble of N points. Scaling to two variables, we divide the space of each variable into N intervals and thus get an N by N square grid from which we can draw one set of N points (with the requirement that they will not be neighbours or touch at a vertex). With m variables the procedure is similar and there will be just one sampling point for each m-dimensional interval. This procedure is repeated for each specimen that must be generated in the initial population.
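
The construction described in this answer can be sketched as follows. This is a generic Latin hypercube on the unit cube, not the authors' implementation: the function name is hypothetical, and only the usual stratification constraint (exactly one point per interval along each variable) is enforced, not the additional non-neighbour condition mentioned above.

```python
import random


def latin_hypercube(n_samples, n_vars, rng=random):
    """Return n_samples points in [0, 1)^n_vars, one per interval along each axis."""
    columns = []
    for _ in range(n_vars):
        # One uniform draw inside each of the n_samples equal-width intervals
        col = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so that the intervals of different variables are paired at random
        rng.shuffle(col)
        columns.append(col)
    # Transpose: row k collects the k-th value of every variable
    return [tuple(col[k] for col in columns) for k in range(n_samples)]


points = latin_hypercube(5, 3, rng=random.Random(1))
```

Each coordinate can then be rescaled to the range of the corresponding variable, for instance multiplied by 360 for a dihedral angle in degrees.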


8. Is the m parameter introduced at the beginning of section 2.4.1 the same as the one used on page 3? If not, I would suggest using another symbol.

Answer: It is not; we have renamed it Nstruct.

9. The "hall of fame" mechanism should be explained in much more detail - possibly in the theory section. Specifically, it's not clear to me if it's an alternative to the algorithm that the authors propose to generate the initial population, or if it's an alternative to the whole GA. Also, it's not clear where the equation reported at the beginning of page 6 comes from.

Answer: Following the Referee’s suggestions we have moved the explanation of the Hall of Fame (HOF) to Subsection 2.2 and extended the theory section. An explanation of the equation concerning the Hall of Fame has been given as well, and the equation has been numbered. In more detail, without the HOF the population size changes as follows:
P at start
P+s*P after offspring generation (0<= s*P <= P)
mutation: the size does not change, but old and new specimens may mutate and, for these, a new fitness evaluation (an Electronic Structure calculation here) will be made
P after fitness selection (the s*P least fit specimens are removed)

When the HOF is in place we want to prevent the h best specimens from undergoing mutation; to do that we copy them so that one copy can mutate and the other cannot. Hence the population size changes as follows:
P at start
P+s*P after offspring generation (0<= s*P <= P)
hall of fame: the h best specimens are copied, hence Pnew = P+s*P+h*P
mutation on P and s*P
P after fitness selection: the number of specimens removed is P(s+h)
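
The two size sequences above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function name and dictionary keys are hypothetical, and h is treated as a fraction of P, as the expression Pnew = P+s*P+h*P suggests.

```python
def population_sizes(P, s, h=0.0):
    """Population size at each stage of one generation (h=0 means no hall of fame)."""
    n_offspring = round(s * P)   # s*P new specimens
    n_protected = round(h * P)   # h*P copies that are shielded from mutation
    after_offspring = P + n_offspring
    after_hof = after_offspring + n_protected
    return {
        "start": P,
        "after_offspring": after_offspring,
        "after_hof": after_hof,              # Pnew = P + s*P + h*P
        "mutable": after_offspring,          # mutation acts on P and s*P only
        "removed": n_offspring + n_protected,  # P(s+h) specimens are discarded
        "end": P,
    }


without_hof = population_sizes(100, 0.5)
with_hof = population_sizes(100, 0.5, h=0.1)
```

For P = 100, s = 0.5 and h = 0.1 the population grows to 150 and then to 160, and 60 specimens are removed by fitness selection.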

10. In the application section, the authors explain that they are not including the "stiff" coordinates in the exploration. They instead modify the "large-amplitude" ones and relax, at each step, the stiff coordinates. However, how do they define the "large-amplitude" coordinates? In the introduction, they claim to not only include dihedral angles, but instead use a "complete new set of coordinates". How is this realized, in practice?
Answer: it is standard practice to consider bond lengths and valence angles as stiff degrees of freedom. Concerning dihedral angles, we use a threshold on the bond order of the central bond. A sentence has been added in order to clarify this point.

11. Figures 2 and 3 lack the labels.
Answer: We have added the labels. Note that the Figure numbers have changed due to the pictures added in the Methods section and that Figures 4 and 5 (formerly Figures 2 and 3) actually contained a wrong set of panels; this has been corrected.

12. In the caption of Figure 5, the authors say that one of the two plots is obtained with PM7 calculations. Shouldn't it be DFTB?
Answer: The Referee is right; we have corrected the caption.


Referee 2
The manuscript “Artificial Intelligence meets Quantum Chemistry for the effective exploration and exploitation of intra- and intermolecular energy landscapes.” by Mancini and coworkers describes a sampling algorithm based on evolutionary algorithms and the use of curvilinear coordinates. The efficient & fairly automated sampling of structures is of utmost importance for becoming predictive with quantum chemical simulations. Hence, this paper will be of interest to the computational chemistry community. Apart from the title, which does not reflect the content of the paper (remark 1), I think that the manuscript can be published after a revision of the following points:

1. The title is attempting to be catchy, but misleading and not reflecting the content of the paper. First, the presented method does not use artificial intelligence & I think we should be careful when using this term and distinguish it from other subfields, like optimization and machine learning. Cluster analyses etc. are part of unsupervised learning algorithms. Yet, it is used to reduce the number of considered structures, nothing is “learned” from it & e.g., used in subsequent simulations. Hence, this does not qualify as “artificial intelligence”. Furthermore, to call the presented approach “effective” requires the presentation of timings (remark 2).

Answer: We have modified the title as follows “New Routes toward the exploration and exploitation of intra- and inter-molecular energy landscapes” avoiding the words ‘Artificial Intelligence’ and ‘effective’.

2. An important point about sampling algorithms is the computational time that one spends to sample the conformational space. Due to extensive sampling and reoptimization that is necessary in CREST, which is the current state-of-the-art, one is roughly restricted to systems of about 200 atoms. Even then, the sampling may not be complete as the authors also pointed out in this work. But how does the computational cost compare to CREST, e.g., to obtain the data in Table 3?

Answer: This is a difficult question, since the two procedures are implemented differently: IM-EA uses wrappers and template input files, trading performance for flexibility, while CREST (probably) uses function calls to GFN2-xTB or Molecular Mechanics. Moreover, CREST directly outputs the number of rotamers or conformers after filtering, not all the points generated in the GC-MTD procedure. However, the computational cost depends mainly on the number of energy/gradient evaluations carried out in the two procedures, which (i) in IM-EA is the number of generated structures and (ii) in CREST is the total number of MD steps plus the crude and tight optimizations carried out between MTD iterations.
Looking at the four replicated runs carried out for Aspartic Acid (with run parameters set automatically by the application itself), we have 2 MTD iterations of 1000 steps each (5 ps with a time step of 5 fs) plus about 3000 optimizations, which is roughly twice the number of geometry optimizations needed by IM-EA with optimal settings. Note, however, that we let CREST set most of its run parameters, which may have led to conservative values in some cases.
Finally, it must be observed that, in general, any claim about the greater effectiveness of IM-EA vs CREST and/or vs a third method would be at least problematic according to the “No Free Lunch Theorem” (Wolpert, D.H., Macready, W.G. (1997), "No Free Lunch Theorems for Optimization", IEEE Transactions on Evolutionary Computation 1, 67), which implies that no algorithm can be the absolute best for every class of problems. For this reason, in the current version of the manuscript we have focused on the relative improvement gained by modifying IM-EA and only included a comparison with CREST at the end, in order to justify the use of another method in conjunction with it or as a comparison. We have added a slightly modified version of this answer to the manuscript (page 7).

3 The claim made on p.2 that the “current standards” only permit an a posteriori interpretation of spectroscopic data is not really justified. Particularly, since the publication of CREST along with modern SE methods, the state-of-the-art became that a conformational sampling plus refinement is done that precedes the calculation of spectroscopic properties. Particularly, the Grimme group has promoted this idea strongly in several papers: 10.1021/acs.joc.1c02008, 10.1021/acs.jpca.1c00971, 10.1002/anie.201708266. So, the claim made by the authors is not justified or it must be clarified what actually distinguishes the present work from the established workflows described in the literature.

Answer: We have added most of the references indicated by the referee and attempted to better explain our point on page 2 of the revised text. In particular, we have added the following text at line 167 and following:

Several approaches have been proposed to compute the spectroscopic outcome of flexible molecules in terms of averages among the spectra of significantly populated structures.34,35 However, high-resolution (especially microwave) spectroscopy in the gas phase requires the accurate individual properties of those low-lying structures unable to relax to more stable energy minima under the experimental conditions.36 The current standards for the study of biomolecule building blocks in the gas phase (see e.g. refs. 31,32,36-41) do not permit the a priori prediction of the relative energies, interconversion barriers and spectroscopic outcome with sufficient accuracy, but only the a posteriori interpretation of experimental results in terms of the agreement with the computed spectroscopic parameters for a predefined number of conformers without explicit reference to their computed relative stability and possible relaxation.
We have, instead, performed a comprehensive study of aspartic acid with the aim of obtaining an unbiased a priori description of its conformational landscape to be validated only in a second step by the comparison of computed and experimental rotational spectroscopic parameters for the most stable conformers separated from each other by sufficiently high energy barriers.

and also added the following references:
S. Grimme and M. Steinmetz, Phys. Chem. Chem. Phys., 2013, 15, 16031–16042.
T. Risthaus, M. Steinmetz and S. Grimme, J. Comput. Chem., 2014, 35, 1509–1516.


4 On p. 3: The 4th operator (mirror) would change the chirality of chiral compounds and should be avoided, if one is interested in simulating optical rotation or circular dichroism. This should be discussed & is it possible to exclude it or only activate it for achiral compounds?

Answer: The Referee is right in pointing out this potential issue. Note however that (i) this operator has been applied to generate structures for the Ag+(aq) complex which is not chiral and (ii) as outlined in the methods section, the specific operator applied to coordinates is determined randomly after specifying its relative probability. To exclude it, it is sufficient to set this probability to 0.
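The selection mechanism described in this answer can be illustrated with a minimal sketch. It is not the IM-EA implementation: the function name, the operator labels other than "mirror", and the numerical weights are all hypothetical, and the only point shown is that an operator whose relative probability is 0 is never drawn.

```python
import random
from collections import Counter


def choose_operator(weights, rng):
    """Draw one operator name with probability proportional to its weight.

    `weights` maps operator names to relative (unnormalised) probabilities.
    Operators with weight 0 are excluded from the draw.
    """
    names = [name for name, w in weights.items() if w > 0]
    if not names:
        raise ValueError("at least one operator needs a positive weight")
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical operator set: only "mirror" is named in the review;
# the other labels and all weights are placeholders.
default_weights = {
    "operator_1": 0.3,
    "operator_2": 0.3,
    "operator_3": 0.3,
    "mirror": 0.1,
}

# For a chiral compound, switch the mirror operator off by zeroing its weight.
chiral_safe_weights = dict(default_weights, mirror=0.0)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
picks = [choose_operator(chiral_safe_weights, rng) for _ in range(1000)]
counts = Counter(picks)
print(counts)
```

With the weight set to 0 the mirror operator drops out of the draw entirely, so the chirality of the starting structure is preserved by construction rather than by filtering afterwards.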

5 This paper is quite strong in self citations (22/80). I recommend giving credit also to other groups who worked in this field and in computational spectroscopy. I already mentioned some papers in remark 3. Two more from the Grimme group actually fit well to be cited on p.8 along with Refs. 69 & 70: 10.1039/C3CP52293H and 10.1002/jcc.23649.

Answer: we have added the references suggested by the Referee.

6 None of the identified structural ensembles are given in the SI. The authors must provide the structural ensembles (digitally, not in a pdf) that were generated with the different approaches. In particular, the ones used in the computation of spectroscopic data (Table 6) must be provided.

Answer: The ensemble used to generate the data of Table 6 and another replica are available for download in the SI as a gzipped PDB trajectory.

7. On p.11: “…requires proper account of 4d electrons”. I am not entirely sure, which shells are considered explicitly in SC and LC (this information should be given maybe in the SI). But could also other shells play a role here or can this be ruled out?

Answer: It has been clarified that the small- and large-core potentials replace 28 and 46 electrons, respectively. As a consequence, with the small-core potential all 18 electrons with principal quantum number n = 4 are treated explicitly.
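The electron bookkeeping behind this answer can be checked with a short sketch. The core sizes (28 and 46) are taken from the answer above and the atomic number of silver (47) is standard; the function name and dictionary layout are illustrative only.

```python
# Electron bookkeeping for silver with the two effective core potentials.
Z_AG = 47  # atomic number of silver

# Number of electrons replaced by each pseudopotential, as stated in the answer.
CORE_ELECTRONS = {"small core": 28, "large core": 46}

# Occupation of the n = 4 shell in silver: 4s2 4p6 4d10.
N4_SHELL = {"4s": 2, "4p": 6, "4d": 10}


def explicit_electrons(z, core, charge=0):
    """Electrons left for explicit treatment after removing core and charge."""
    return z - core - charge


for label, core in CORE_ELECTRONS.items():
    neutral = explicit_electrons(Z_AG, core)
    cation = explicit_electrons(Z_AG, core, charge=1)
    print(f"{label}: Ag -> {neutral} explicit electrons, Ag+ -> {cation}")

print("electrons with n = 4:", sum(N4_SHELL.values()))
```

For Ag+ the small-core count matches the full n = 4 shell (4s, 4p and 4d), which is the sense in which the 4d electrons are properly accounted for.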

Minor issues:
A. p.2: “Once again we carry out an in depth benchmark of again we last generation…” That sentence is broken.
B. p.2: “brief recap of IM-EA” -> “brief recap of the IM-EA”
C. p.3: What is eta in equation 1 & 2?
D. p.4: “GFN-xTB” -> “GFN2-xTB”
E. p.7: “best XTB and DFTBA replicas”, but in the caption of Figure 5, it says “PM7 and XTB”. Which one is it? Also here, the two plots should be labeled (method and molecule investigated).
F. p.13: “authors red” -> “authors read”

Answer: all the minor issues have been corrected.

Referee 3
In this article, Mancini and co-workers expand on their recent previous work (Mancini, G. et al., J. Chem. Phys., 2020, 153, 124110, ref. 15 of the present work) by improving their algorithms to explore complex potential energy surfaces. Their updated approach is showcased via application to two rather different and challenging systems – conformational structures of aspartic acid in the gas phase, and the microsolvation structure of silver cations in water.

The core work is sound. The authors convincingly demonstrate that their approach can efficiently find all major conformers of the aspartic acid molecule and provide useful benchmark results for different algorithm parameters (mutation, crossover, use of elitism in the evolutionary algorithm) and levels of theory. The demonstrated agreement with rotational constants determined by microwave spectroscopy is impressive, and their analysis of the conformer that the CREST approach was unable to find in their test runs was quite interesting. Similarly, in the case of the silver cation in water, the authors show that their approach is able to find structures in reasonable agreement with experimental XAS data, with nearly linear structures in the first coordination sphere.

The work here is clearly original and of significant impact, particularly as the field moves towards larger and more flexible systems. While only two molecular systems were considered, they are wildly different and so the extent of agreement between the computational methods presented here and the experimental results is quite impressive. This paper should definitely be published.

1 My only minor note that the authors may wish to address is that Figure 5 and its accompanying caption are a bit confusing. First, in what is certainly just a typo, the caption indicates that the two plots are of data calculated with the PM7 and XTB methods (not obvious from the caption which is the top plot and which is the bottom plot), but the accompanying text in section 4.1.1 says this is for the XTB and DFTBA methods. A further point with this figure is that I was uncertain how (or if) I should interpret the x- and y-axes or instead simply look at the clustering patterns. In particular, I’m aware this is a low dimensional representation of high dimensional objects, and so there may not be any physically intuitive mapping of the data to the plot axes. That said, given that the objects in the two different plots are presumably very similar sets of conformers, the difference in scale of approximately a factor of 11 was confusing and somewhat distracting (e.g. x-axis in top plot ranges from roughly -750 to +800, while the same axis in the bottom plot ranges from roughly -65 to +70). A bit more guidance to readers for how to interpret this figure (either in the caption or main text) would be appreciated.

Answer: As correctly pointed out by the Referee, there is a typo concerning the use of PM7 vs DFTBA, which has been fixed. The Referee is also correct in pointing out the nature of Figure 5, i.e. the representation of high dimensional spaces in a low (just two) dimensional space. Concerning the meaning of the scales for the DFTBA and XTB plots, we have added a sentence in the Figure caption.
Specifically, we used t-SNE (t-distributed Stochastic Neighbour Embedding, https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) to reduce the dimensions instead of the more widespread Principal Component Analysis. The reason for this choice was that we wanted to convey the information that the XTB surfaces include close clusters of neighbours located in different energy ranges. t-SNE (at variance with PCA) is usually applied exactly for this reason, i.e. to conserve as much as possible the local structure with a restricted number of features, while the global structure is lost because the distance between far away points is not significant. The determination of the reduced space depends on the relative distance of neighbours in the original space, and thus the two surfaces have a different representation. From a practical point of view the intuition of the Referee is correct, i.e. here the message is carried by the local clusters while the axes do not have any relevant meaning. The situation would be different if we were projecting more closely related spaces (e.g. two DFTBA searches). For the sake of completeness, we chose t-SNE because the sum of variances with two principal components (using dihedral angles) was always below 0.5, and we considered it too low.
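The criterion mentioned at the end of this answer (the fraction of variance captured by the first two principal components) can be sketched as follows. This is not the authors' analysis script: the function names are hypothetical, the data are synthetic random angles rather than conformers, and the sine/cosine encoding of the dihedrals is an assumption made here to handle periodicity; the answer only states that dihedral angles were used.

```python
import numpy as np


def explained_variance_ratio(features, n_components=2):
    """Fraction of the total variance carried by the leading principal components."""
    x = np.asarray(features, dtype=float)
    cov = np.cov(x, rowvar=False)            # feature-by-feature covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, largest first
    return float(eigvals[:n_components].sum() / eigvals.sum())


def dihedrals_to_features(dihedrals_deg):
    """Encode periodic angles as (sin, cos) pairs. Assumption, not from the source."""
    rad = np.deg2rad(np.asarray(dihedrals_deg, dtype=float))
    return np.hstack([np.sin(rad), np.cos(rad)])


# Synthetic stand-in for a conformer set: 500 structures, 6 dihedral angles each.
rng = np.random.default_rng(0)
angles = rng.uniform(-180.0, 180.0, size=(500, 6))
ratio = explained_variance_ratio(dihedrals_to_features(angles), n_components=2)

THRESHOLD = 0.5  # value quoted in the answer as too low for a 2D PCA plot
print(f"variance captured by 2 components: {ratio:.2f}")
print("2D PCA projection adequate:", ratio >= THRESHOLD)
```

When the ratio falls below the threshold, a two-dimensional PCA plot discards most of the variance, which is the situation in which a neighbour-preserving embedding such as t-SNE becomes the more informative choice for visualising local clusters.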

Other than that, I have no other major critiques. Again, the text was clear and it was easy to follow the key points of the manuscript. I look forward to future developments in this area of research!




Round 2

Revised manuscript submitted on 13 Aug 2022
 

14-Sep-2022

Dear Dr Mancini:

Manuscript ID: DD-ART-06-2022-000070.R1
TITLE: Fast exploration of Potential Energy Surfaces with a joint venture of Quantum Chemistry, Evolutionary Algorithms and Unsupervised Learning

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry


Please contact the journal at digitaldiscovery@rsc.org

 
Reviewer 1

The authors have made a significant effort to improve the paper based on my notes. For this reason, the paper can be considered for publication.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.