Open Access Article
Yosef Knattrup
,
Jakub Kubečka
,
Haide Wu
,
Frank Jensen
and
Jonas Elm
*
Department of Chemistry, Aarhus University, Langelandsgade 140, Aarhus C, 8000, Denmark. E-mail: jelm@chem.au.dk; Tel: +45 28938085
First published on 21st June 2024
Atmospheric molecular clusters, the onset of secondary aerosol formation, are a major part of the current uncertainty in modern climate models. Quantum chemical (QC) methods are usually employed in a funneling approach to identify the lowest free energy cluster structures. However, the funneling approach highly depends on the accuracy of low-cost methods to ensure that important low-lying minima are not missed. Here we present a reparameterized GFN1-xTB model based on the clusteromics I–V datasets for studying atmospheric molecular clusters (AMC), denoted AMC-xTB. The AMC-xTB model reduces the mean of electronic binding energy errors from 7–11.8 kcal mol−1 to roughly 0 kcal mol−1 and the root mean square deviation from 7.6–12.3 kcal mol−1 to 0.81–1.45 kcal mol−1. In addition, the minimum structures obtained with AMC-xTB are closer to the ωB97X-D/6-31++G(d,p) level of theory compared to GFN1-xTB. We employ the new parameterization in two new configurational sampling workflows that include an additional meta-dynamics sampling step using CREST with the AMC-xTB model. The first workflow, denoted the “independent workflow”, is a commonly used funneling approach with an additional CREST step, and the second, the “improvement workflow”, is where the best configuration currently known in the literature is improved with a CREST + AMC-xTB step. Testing the new workflow we find configurations lower in free energy for all the literature clusters with the largest improvement being up to 21 kcal mol−1. Lastly, by employing the improvement workflow we massively screened 288 new multi-acid–multi-base clusters containing up to 8 different species. For these new multi-acid–multi-base cluster systems we observe that the improvement workflow finds configurations lower in free energy for 245 out of 288 (85.1%) cluster structures. Most of the improvements are within 2 kcal mol−1, but we see improvements up to 8.3 kcal mol−1. Hence, we can recommend this new workflow based on the AMC-xTB model for future studies on atmospheric molecular clusters.
Sulfuric acid has been shown to be the main driver of cluster formation. Other key species are believed to be, bases (ammonia and amines), acids (methanesulfonic acid, nitric acid, iodine acids or organic acids), highly oxygenated organic molecules and water.4–7 It is extremely difficult to experimentally measure the composition and formation mechanism of the initial clusters due to their small size and neutral charge. Mass spectrometry techniques can measure the cluster compositions of the charged cluster, however, it is unknown if the ionization of neutral clusters significantly changes the cluster composition/structure and it is also believed that fragmentation may happen in the instruments.8–10 This leaves theoretical studies as the only way to elucidate the thermodynamics, kinetics, and molecular interactions governing cluster formation and its evolution. The main challenge for studying atmospheric molecular clusters is their complex configurational spaces, which require advanced configurational sampling techniques and computationally demanding quantum chemistry methods to evaluate the cluster properties accurately.5 Furthermore, atmospheric clustering is believed to be a multi-species process,11 adding another dimension of chemical complexity.
Thoroughly exploring the configurational space of atmospheric molecular clusters using, for instance, metadynamics simulations12,13 or genetic algorithms14–16 at a high level of theory is extremely computationally demanding. Hence, usually, a funneling approach5,7,17 is applied, where the configurational space is initially explored at a low level of theory such as force-field or semiempirical methods, and only a subset of low energy structures is selected, reoptimized, and reexamined at a higher level of theory. This process is repeated with an increasing level of theory until only a few structures remain for evaluation at the desired high level. Schematically, the process can be given as:
(1) Generate initial cluster configurations:
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
ABCluster/OGOLEM/Basin hopping or similar.
(2) Semi-empirical calculations:
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
Optimization at the PM6/PM7/GFN1-xTB/GFN2-xTB or similar level.
(3) DFT calculations:
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
DFT optimization and vibrational frequency calculations.
(4) Single point energy refinement:
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif)
Single-point energy calculation at coupled cluster level on the DFT optimized geometry.
Between each step in the funneling approach, filtering can be applied to reduce the number of structures that need to be handled. This can either be based on an energy threshold or a set number of cluster structures. Eventually, we end up with a handful of structures at the highest obtainable level.
The first step in the funneling procedure is the generation of a large number of configuration candidates. The key idea is sampling a large part of the potential energy surfaces at a low level of theory to get estimates for the global free energy minimum. This is usually carried out using force-field methods in combination with genetic algorithms such as in ABCluster15,16,18–21 and OGOLEM,14,22–27 by random/manual sampling or using dynamic methods such as basin hopping.28–32 The major issue at this step is that most force-field methods are unable to describe bond-breaking, such as proton transfer reactions, which are important for atmospherically relevant molecular clusters, requiring the sampling to include monomers where the hydrogens have been transferred to get adequate sampling. Furthermore, the accuracy of force-field methods is also insufficient to determine a subsample of the conformer candidates and all the candidates have to be taken to the higher level of theory.
The next step is semi-empirical calculations as these are a better description of the chemistry and filtering can be applied. Of the common semiempirical methods, GFN1-xTB,33 GFN2-xTB,34 PM6,35 and PM7
36 are the most used in configurational sampling procedures for atmospheric molecular clusters.5–7 GFN stands for geometries, frequencies and non-covalent interactions, which are the main target properties for the method. PM stands for parameterization method indicating the model version. Of these methods, GFN1-xTB has shown to have the highest correlation with electronic binding energies at a higher level of theories37,38 and have been shown to have a higher correlation with DFT trajectories for molecular dynamics than GFN2-xTB.39 The reason GFN2-xTB performs worse than GFN1-xTB for atmospheric molecular clusters (often involving sulfuric acid) is that there is a decrease in the number of d-functions for sulfur in the basis set for the newer GFN2-xTB model.
The third step is the subsequent optimization and vibrational frequency calculation of the structures with DFT. This is the main bottleneck in the sampling methodology as limited computational resources only allow a fixed number of DFT structures to be optimized. Therefore some form of filtering is required, often based on structural properties or electronic energies from the semiempirical calculations. To circumvent the inaccuracies of semiempirical methods an intermediate step can be included, involving single-point energy calculations at the DFT level on as many structures as possible. Another option for an intermediate step is the utilization of machine learning (ML) methods. One can calculate a subset of the structures at a desired DFT level and train an ML model to predict the energies of the remaining structures.39,40 However, to mimic accurate DFT energies, kernel-based ML methods become computationally demanding40–42 and neural-networks will require an extensive set of training data and hyperparameter optimization.43 Moreover, ML methods often fail when predicting on structures different from the training set.
Overall, the funneling approach is never more efficient than its weakest link given by the semiempirical step in 2, in which accuracy determines the number of structures that have to be optimized/have single points calculated at the DFT level. In this paper, we focus on reparameterizing the GFN1-xTB method based on DFT energies of atmospherically relevant molecular clusters yielding a GFN1-xTB model reparametrized based on ωB97X-D/6-31++G(d,p) for ‘atmospheric molecular clusters’ denoted AMC-xTB. This new parameterization is used to sample 288 large multi-acid–multi-base clusters containing AM equivalent to the clusters studied by Knattrup et al.44
47 was used for the DFT calculations. CREST 2.12
12,13 with an energy window of 15 kcal mol−1 and in noncovalent interaction mode and ABCluster 2.0
15,16 with a population of SN = 3000, number of generations of gmax = 200, and gen. survival of glimit = 4 were used for additional configurational sampling.
All new data calculated is freely available in the Atmospheric Cluster DataBase54 along with the new AMC-xTB parameter file (see Section S2).†
![]() | (1) |
We chose the electronic binding energies as the target properties to get a better tool for filtering based on energies in configurational sampling procedures. The gradients were included directly in the target function. We use equilibrium structures at the given level of theory, which are supposed to have near-zero gradients. However, the upper limit is set by the accuracy threshold within the xTB program during the optimization, which makes the gradients non-zero in the calculations.
Including only the electronic bindings energies in the target function yields a much better fit for the energies but causes the gradients to “explode”, effectively rendering the optimization functionality of AMC-xTB useless for our target species. Giving higher weight to the gradient norms in the target function makes the optimized structures more similar to the target ωB97X-D/6-31++G(d,p) level of theory, however, we found that it causes problems in the configurational sampling procedure where the decreased accuracy in determining the binding energies causes our configurational sampling to yield high-energy conformers at the DFT level. Overall, we found that including the gradient norms and difference in electronic bindings energies in a 1
:
1 ratio as the best compromise between the two properties.
| ABCluster → GFN1-xTBoptall → DFToptN lowest |
| ABCluster → AMC-xTBoptall → CREST(AMC-xTB)1 lowest → DFTopt50 lowest |
| Best structure → CREST(AMC-xTB) → DFTopt50 lowest |
The initial sampling yields binding free energies ranging from −28.43 kcal mol−1 [(MSA)1(NA)1(FA)1(AM)2] to −104.0 kcal mol−1 [(SA)3(NA)1(AM)1(MA)1(DMA)2] for the cluster configurations lowest in free energy at the ωB97X-D/6-31++G(d,p) level of theory.
436 data points), the sulfuric acid–multi-base (SA)1–4(AM/MA/DMA/TMA/EDA)1–4 cluster data set (684 data points) by Kubečka et al.41 and the multi-acid–muti-base (SA/FA/MSA/NA)1–4(MA/DMA/TMA)1–4 by Knattrup et al.44 including the new AM-containing clusters (1629 data points). All the tested structures are equilibrium structures at the ωB97X-D/6-31++G(d,p) level of theory. Although the Gaussian version and integration grid used for optimization differ for some structures, it was found to have a negligible effect on this comparison as we are studying the binding energies and not the absolute energies.
![]() | ||
| Fig. 2 Error in the electronic binding energies for the GFN1-xTB and AMC-xTB methods compared with the ωB97X-D/6-31++G(d,p) level of theory. The clusteromics I–V48–52 sets have (SA/FA/MSA/NA)0–2(AM/MA/DMA/TMA/EDA)0–2 clusters, the Kubečka et al.41 set has sulfuric acid–multi-base (SA)1–4(AM/MA/DMA/TMA/EDA)1–4 clusters, Knattrup et al.44 set has the multi-acid–muti-base (SA/FA/MSA/NA)1–4(MA/DMA/TMA)1–4 clusters including the new AM-containing clusters sampled in this work. | ||
For all the data sets shown in Fig. 2 the reparameterization results in near-zero means of the energy errors. This is a reduction from error means of 3.7–11.8 kcal mol−1 for GFN1-xTB. In addition, the AMC-xTB model achieves a more narrow error distribution with the root mean square deviations decreasing from 7.6–12.3 kcal mol−1 to 0.81–1.45 kcal mol−1, implying that there will be fewer outliers. The error span on the larger clusters for the Knattrup et al.44 and Kubečka et al.41 sets are similar to the error span for the smaller clusters in the optimization set. This shows that reparameterizing on smaller clusters is adequate for calculations on larger-sized clusters as the model gets some of the underlying chemistry correct and scales effectively with system size. The new AMC-xTB model reduces the number of structures needed to pass from the semiempirical step to the DFT step in configurational sampling. For atmospheric molecular clusters, this implies that the AMC-xTB model is unequivocally better to apply in the configurational sampling funneling workflow compared to GFN1-xTB.
![]() | ||
| Fig. 3 The gradient norms for the GFN1-xTB and AMC-xTB methods. α is the Bohr radius. The clusteromics48–52 set is (SA/FA/MSA/NA)0–2(AM/MA/DMA/TMA/EDA)0–2 clusters, the Kubečka et al.41 set is sulfuric acid–multi-base (SA)1–4(AM/MA/DMA/TMA/EDA)1–4 clusters, Knattrup et al.44 is the multi-acid–muti-base (SA/FA/MSA/NA)1–4(MA/DMA/TMA)1–4 clusters including the new AM-containing clusters sampled in this work. The structures are equilibrium structures at the ωB97X-D/6-31++G(d,p) level of theory. | ||
| Method/data set | Mean | Median | Std |
|---|---|---|---|
| GFN1-xTB/clusteromics | 0.484 | 0.387 | 0.330 |
| AMC-xTB/clusteromics | 0.355 | 0.242 | 0.282 |
| GFN1-xTB/Knattrup et al. | 0.378 | 0.367 | 0.138 |
| AMC-xTB/Knattrup et al. | 0.235 | 0.193 | 0.125 |
| GFN1-xTB/Kubečka et al. | 0.376 | 0.361 | 0.140 |
| AMC-xTB/Kubečka et al. | 0.189 | 0.164 | 0.094 |
We find that the AMC-xTB model reduces the mean RMSD of the full clusteromics set from 0.484 Å to 0.355 Å and a similar reduction is seen for the Knattrup et al.44 and Kubečka et al.41 sets with RMSDs being reduced from 0.378 Å to 0.235 Å and from 0.376 Å to 0.189 Å, respectively. This, coupled with the smaller gradients, suggests that the reparameterized model is closer to a minimum at the DFT level. This implies that the preoptimization step in a funneling approach with the AMC-xTB model compared to GFN1-xTB yields structures closer to the DFT structure and will likely reduce the subsequent optimization time at the DFT level.
Fig. 4 shows the difference in binding free energy for the lowest free-energy configuration found by employing the independent and improvement workflows for the (SA)4(EDA)4, (SA)4(AM)4, and (SA)4(AM)1(MA)1(DMA)1(TMA)1 clusters compared to Kubečka et al.41 and the (SA)1(MSA)1(NA)1(FA)1(AM)4 and the new (SA)1(MSA)1(NA)1(FA)1(AM)1(MA)1(DMA)1(TMA)1 clusters sampled in this work. The SA–AM clusters have been extensively studied previously6,56–61 and are therefore believed to be well-sampled using the original configurational sampling procedure and thereby difficult to improve. Still, the new CREST + AMC-xTB methodology manages to find cluster structures lower in free energy by 0.21 kcal mol−1 compared to the previous works.
![]() | ||
| Fig. 4 Comparison of the lowest free energy conformer found by the independent and improvement configurational workflows compared to the configurations found by Kubečka et al.41 (a–c) and two new multi-component AM clusters sampled in the same way as Knattrup et al.44 (d and e). Gibbs free energies are calculated at the ωB97X-D/6-31++G(d,p) level of theory with the quasi-harmonic approximation (cutoff of 100 cm−1) and vib. frequencies scaled by 0.996 in accordance with Kubečka et al.41 | ||
In the case of the (SA)4(AM)4 and (SA)1(MSA)1(NA)1(FA)1(AM)1(MA)1(DMA)1(TMA)1 clusters, the independent/improvement workflows perform similar and yield similar free energy improvements. However, for the (SA)4(AM)1(MA)1(DMA)1(TMA)1 clusters, the independent workflow works slightly better, finding a cluster 0.85 kcal mol−1 lower in free energy compared to the improvement workflow. This illustrates that the sampling is very sensitive to the configuration used as input for CREST, although it might also be due to the randomness of the dynamic processes in CREST. The reason for the difference might be that the original work's configurational sampling was worse than the independent workflow, yielding a worse starting structure for the CREST sampling within the improvement workflow. We see a massive improvement in the configurational sampling of the (SA)4(EDA)4 clusters by −18/−21 kcal mol−1. This is caused by the flexibility of the EDA molecule, as it is the only monomer that contains a C–C bond it can rotate around, making it difficult to sample the full configurational space using only ABCluster with rigid molecules. This improvement should primarily be attributed to the inclusion of metadynamics sampling in CREST and not purely the parameterization of AMC-xTB as it allows the EDA to rotate around its bonds and find a structure with more/better paired intermolecular interactions as seen in Fig. 5. It should also be noted that the main improvements are the electronic binding energy and the thermal contribution varies very little between the clusters.
However, this shows the strength of the presented workflows as they can be used for clusters containing more flexible organic molecules.
In most cases, the improvement is between 0–2 kcal mol−1. However, for the (SA)1(MSA)1(NA)1(FA)1(AM)1(DMA)2(TMA)1, (SA)3(NA)1(AM)1(MA)3, and (SA)3(FA)1(AM)1(MA)1(DMA)2 clusters a massive improvement of 8.3, 5.2 and 3.9 kcal mol−1 is observed, respectively.
Comparing the conformer index at the AMC-xTB level of theory with the final ωB97X-D/6-31++G(d,p) level of theory with the quasi-harmonic approximation (cutoff of 100 cm−1) the Gibbs free energy minimum at the DFT level is also the electronic energy minimum at the AMC-xTB level of theory for 66 of the clusters (see Fig. S1†). If 10 conformers are included from the AMC-xTB level of theory, the free energy minimum energy is captured for 155 out of the 288 clusters with improvements found for 209 (see Fig. S2†). Furthermore, the maximum error is 2 kcal mol−1 with a mean of 0.12 kcal mol−1 when reducing from 50 to 10 conformers.
This highlights the need for including dynamics-based sampling procedures for atmospheric clusters even though the system might seem fairly rigid. It can also be envisioned that the improvement workflow will be quite important when studying much larger (SA)1–20(base)1–20 clusters as recently done by Engsvang et al.42,62 and Wu et al.38 For large clusters, the global minimum is tricky to locate, and adding dynamics-based configurational sampling might aid in the process.
We tested two new configurational sampling procedures with the new parameterizations being employed in the xTB and CREST programs. The first workflow, denoted as “improvement workflow,” is based on improving the best structure currently known in the literature with CREST and then doing the DFT calculations. The second workflow, denoted the “independent workflow,” starts by configurational sampling using ABCluster, followed by xtb, CREST, and then DFT. Using the two workflows, we find cluster structures lower in free energy for the following (SA)4(EDA)4, (SA)4(AM)4,(SA)4(AM)1(MA)1(DMA)1(TMA)1, (SA)1(MSA)1(NA)1(FA)1(AM)4 and (SA)1(MSA)1(NA)1(FA)1(AM)1(MA)1(DMA)1(TMA)1 systems in all cases compared to the best-known value in the literature.
Testing the improvement workflow on 288 large multi-acid–multi-base cluster systems, the workflow finds improvements for 85.1% of the clusters, showing the need for dynamics-based sampling.
The parameterization strategy given here is not specific to either GFN1-xTB or atmospheric clusters and could be used in general. For instance, one could imagine increasing the number of d-functions in the basis set for sulfur atoms in GFN2-xTB and then reparameterizing the new GFN2-xTB model or doing a reparameterization for much larger clusters.
Footnote |
| † Electronic supplementary information (ESI) available: Figure showing the conformer index at the AMC-xTB level of theory out of the 50 conformers optimized at the DFT level. Figure showing the comparison of the lowest free energy conformer found by the improvement configurational workflow compared to the original workflow if only 10 conformers are selected. Guide to downloading the AMC-xTB parameter file and the structures/calculations generated in this work. See DOI: https://doi.org/10.1039/d4ra03021d |
| This journal is © The Royal Society of Chemistry 2024 |