Calculation of absolute molecular entropies and heat capacities made simple

We propose a fully-automated composite scheme for the accurate and numerically stable calculation of molecular entropies by efficiently combining density-functional theory (DFT), semi-empirical methods (SQM), and force-field (FF) approximations. The scheme is systematically expandable and can be integrated seamlessly with continuum-solvation models. Anharmonic effects are included through the modified rigid-rotor-harmonic-oscillator (msRRHO) approximation and the Gibbs–Shannon formula for extensive conformer ensembles (CEs), which are generated by a metadynamics search algorithm and are extrapolated to completeness. For the first time, variations of the ro-vibrational entropy over the CE are consistently accounted-for through a Boltzmann-population average. Extensive tests of the protocol with the two standard DFT approaches B97-3c and B3LYP-D3 reveal an unprecedented accuracy with mean deviations <1 cal mol−1 K−1 (about <1–2%) for the total gas phase molecular entropy of medium-sized molecules. Even for the hardship case of extremely flexible linear alkanes (C14H30–C16H34), errors are only about 3 cal mol−1 K−1. Comprehensive tests indicate a relatively strong variation of the conformational entropy on the underlying level of theory for typical drug molecules, inferring the complex potential energy surfaces as the main source of error. Furthermore, we show some application examples for the calculation of free energy differences in typical chemical reactions.


Statistical error measures
Statistical measure for a set x 1 , · · · , x n of data points with references r 1 , · · · , r n are: • Average: • Mean deviation (MD): • Mean absolute deviation (MAD): • Standard deviation (SD): • Root-mean-square deviation (RSMD): • P earson correlation coefficient (r p ): 2 Implementation, Algorithmic and Calculation Details 2.1 RMSD based metadynamics The RMSD metadynamics (MTD) were introduced in Ref. 1 and are based on a bias potential where ∆ i is the atomic RMSD 2 between a reference structure i and the calculated molecule.
k i and α i empirical or automatically determined parameters that shape the potential. During a metadynamics simulation points on the simulated PES, i.e. snapshots of the MD simulation are saved for the calculation of ∆ i , which is then used to generate a repulsive V bias contribution at the repective geometry. By a continuous collection and update of reference structures (from new snapshots) over the whole length of the simulation, V bias will dynamically increase and form a history-dependent potential. This way previously found regions of the PES are blocked for the exploration and new conformers (PES minima) are found more safely.
As a new alternative we introduce another type of metadynamics, called static metadynamics (sMTD). In contrast to the MTD discussed in Refs. 1,3, this simulation is initialized with a given set of reference geometries and the MD will hence exhibit one global (and unchanged) V bias potential. This version of MTD is more similar in nature to the well-known umbrella sampling or global optimization procedures. With regards to the PES sampling, sMTD has a less explorative character than MTD for finding the global minimum, but will more continously expand the conformational ensemble with new higherenergetic structures.

Molecular flexibility description
Many settings for the here discussed workflow are generated automatically and based on the individual structure of the investigated molecule. An important parameter is the molecular flexiblity, because it is directly related to the molecules accessible low-energy space. In Ref.
3 we proposed a molecular flexibility measure ξ f,cov , defined by The summation over all non-terminal bonds i with the atoms A,B ∈ i includes the Wiberg- Non-covalent interactions are quantified from the total hydrogen-bond energy E HB and D4 dispersion energy E disp , relative to the respective energies of a known reference system. In order to be comparable to the reference, the energies must be normalized to the number of atoms N in the system. The final formulation for ξ f,NCI then is given by

Rotamer numbers
One of the key assumptions in the proposed scheme is that every contributing conformer i can be effectively represented by a number of energetically degenerate rotamer structures with its degeneracy number g i . This number is composed of three parts where g rot is a factor arising from single-bond rotations, g core denotes a factor resulting from complex inversion and g sym includes the molecular symmetry into g i . Here, the factor g rot is a constant that is the same for all unique conformers and (pseudo-)enantiomers. All conformers of a molecule have the same number of rotatable groups, each resulting in a fixed prefactor equal to the number of equivalent nuclei exchanged by the rotation (i.e., 3 for methyl, 2 for phenyl, 5 for η 5 -C 5 H − 5 , and so forth). We assume this factor to be constant for a given molecules since all combinations of the rotations would be observed at some point in time (t → ∞). The factor g core results from more complicated inversion-type processes that are responsible for the generation of other degenerate structures such as (pseudo-)enantiomers of a conformer. g core is unique for every conformer since it is linked to the molecular symmetry of the respective structure. As an example the two lowest conformers of the n-pentane molecule are shown in Fig. 1a and 1b. With two terminal methyl groups n-pentane has a rotamer degeneration of g rot = 3 2 . The lowest conformer of n-pentane in the gas-phase has C 2v symmetry and no other enantiomeric structures exist (g core = 1). In the second conformer, however, one of the terminal methyl groups is slightly twisted resulting in a total of four different (pseudo-)enantiomers (g core = 4). Hence, 9 rotamers are to be expected for the lowest conformer of n-pentane, but there are 36 degenerate rotamers for the second conformer.
The rotamer number g i also depends on the molecular symmetry. If symmetry operations exist that coincide with some of the rotations included in g rot and can impose a nucleus on itself, g i has to be reduced by a factor g sym . A simple example is the ethane molecule as shown in Fig. 1c. With two terminal methyl groups one could expect 3 2 rotamers, but since ethane has D 3d symmetry (D 3h for the eclipsed form) there are only three different rotamers for the molecule. Here, g rot equals the symmetry number of the primary rotation axis. Other examples are neopentane (T d ), isobutane (C 3v ), or ferrocene (D 5d /D 5h ). For most molecules g rot simply is unity and is only important for high symmetry cases.
The rotamer number g i is generated automatically from the CRE obtained by a conformational search as implemented in the CREST program and information of chemically equivalent nuclei. Nuclear equivalencies are obtained as a by-product of the conformational search directly from the structure comparison as described in Ref. 6. For the identification of rotational groups, the topology of the molecule is set up for the lowest energy conformer and analyzed. Herein, the topology can be either based on quantum chemical data (covalent bond orders) or just set up directly from the coordination numbers (CNs). Molecular rings are identified in a graph representation of the topology, using an custom depth-first all-pairshortest-path algorithm. Rotational groups are obtained form groups of equivalent nuclei and must obey some simple heuristic rules: • The equivalent nuclei must be connected to a common neighboring atom.
• The neighboring atom may have a maximum of one neighbor other than the equivalent nuclei to be considered "freely rotatable". (An alternative definition via the WBO is possible).
• Rotations from different groups of equivalent atoms in the same ring must only be counted once to avoid double counting.
• The rotation number of the group is equivalent to the number of its members (i.e., the equivalent nuclei).
These rules work recursively (e.g., a tert-butyl group results in 3 4 rotamers), but special consideration has to be paid to freely coordinated rings (e.g., Cp in ferrocene), which will not be discussed here any further.
The factor g core is generated from a Cartesian RMSD comparison 2 of all structures in a CRE that belong to the same conformer. In this comparison all atoms that are rotationally equivalent must be neglected in the RMSD, leading to the identification of all the different "core" structures for each conformer, i.e., its (pseudo-)enantiomers. For an example again see Fig. 1b. The key assumption for this is, that the conformational search was able to generate all relevant enantiomeric structures of a single conformer at least once.

Single point hessian procedure
Within the described workflow and the calculation of the S msRRHO population average, the entropy for a reference structure S msRRHO,ref has to be calculated. For consistency this reference term has to be calculated at the same level of theory as the population average in S msRRHO , that is GFN2-xTB or GFN-FF. A geometry optimization at this level might lead to an alteration of the frequencies and hence calculated entropy. On the other hand, if calculated directly for the DFT reference geometry, there is a high probability to observe imaginary modes because the DFT geometry will not necessarily be a minimum on the GFN2 or GFN-FF PES. To account for this problem in a "best of two worlds" approach we employ a new procedure called single point hessians (SPH). Details of the SPH approach will be published elsewhere, 7 but basically it works by applying an additive potential 1,3 similar to Eq. 7 above, where ∆ is the atomic RMSD 2 between two molecular structures, and k and α define the potential shape. Within the SPH procedure k and α are calculated automatically in an iterative process, by repeatedly calculating the RMSD between the DFT input structure and a GFNn-xTB or GFN-FF reoptimized structure and updating V bias , until no change in the geometry is observed. This essentially reshapes the PES at GFN2-xTB (or GFN-FF) level and removes any imaginary modes for frequencies calculated directly for the DFT geometry. Entropies calculated with frequncies from SPH resemble those at the DFT level, but retain a slight level of theory dependent shift, which makes them compatible with S msRRHO .
With regards to computational cost the SPH approach is much cheaper than calculating frequencies at the DFT level, but more expensive as standard GFN2-xTB or GFN-FF Hessian calculations.    Table 4: Heat capacities for a subset of the LBH benchmark set. C p are given for a combination of C p,msRRHO calculated at DFT(B97-3c or B3LYP-D3/def2-TZVP) and C p,conf calculated at a lower (GFN2-xTB or GFN-FF) level. Mean deviation (MD), mean average deviation (MAD), root-mean-square deviation (RMSD) and standard deviation (SD) are given below. All values correspond to cal mol −1 K −1 .

Empirical entropy estimates
As mentioned in the manuscript, the empirical formulation is used in some studies 11,12 to estimate the conformational entropy. However, while this formulation may be used for very simple molecules, it breaks down for challenging energy surfaces. One could easily imagine a case where only a few conformers of an otherwise large ensemble contribute to the entropy (e.g., sofosbuvir at GFN-FF level), or an opposite case with many high-energetic conformers that individually contribute nothing, but in sum make a large part of the entropy (e.g., continuous ensembles, large n-alkanes). Population differences can thus lead to significant differences even for ensembles of same size, and would therefore not be captured by the approximation via N conf . The approximation is further unable to capture vibrational entropy averages as in S msRRHO . Differences ∆S conf /simple between this estimated and the fully converged entropy often exceed several cal mol −1 K −1 in either direction, which is shown for the CD25 in Tabs. 7,8 below.