A D -machine learning approach for force ﬁ elds, illustrated by a CCSD(T) 4-body correction to the MB-pol water potential

D -Machine learning ( D -ML) has been shown to e ﬀ ectively and e ﬃ ciently bring a low-level ML potential energy surface to CCSD(T) quality. Here we propose extending this approach to general force ﬁ elds, which implicitly or explicitly contain many-body e ﬀ ects. After describing this general approach, we illustrate it for the MB-pol water potential which contains CCSD(T) 2-body and 3-body interactions but relies on the TTM4-F 4-body and higher body interactions. The 4-body MB-pol (TTM4-F) interaction fails at very short range and for the water hexamer errors up to 0.84 kcal mol (cid:1) 1 are seen for some isomers, owing mainly to 4-body errors. We apply D -ML for the 4-body interaction, using a recent dataset of CCSD(T) 4-body energies that we used to develop a new water potential, q-AQUA. This 4-body correction is shown to improve the accuracy of the MB-pol potential for the relative energies of 8 isomers of the water hexamer as well as the harmonic frequencies. The new potential is robust in the very short range and so should be reliable for simulations at high pressure and/or high temperature.


Introduction
Machine learning (ML) correction methods aim to elevate the level of accuracy of ML properties, for example potential energy surfaces (PESs).There are two approaches currently being investigated to accomplish this goal.3][4] For example, Meuwly and co-workers applied transfer learning using thousands of local CCSD(T) energies to improve their MP2-based neural network PESs for malonaldehyde, acetoacetaldehyde and acetylacetone. 3The basic idea of transfer learning is that a t obtained from one source of data (perhaps a large one) can be ne-tuned for a related problem by using limited data.Therefore, in the present context of PES tting, an ML-PES trained with low-level electronic energies/gradients can be reused as the starting point of the model for an ML-PES with the accuracy of a high-level electronic structure theory.As noted, this is typically done with articial neural networks, where weights and biases trained on lower-level data hopefully require minor changes in response to additional training using high-level data.
The other approach is D-machine learning.6][7][8] We recently proposed and tested a D-ML approach, that uses a small number of CCSD(T) energies, to correct a low-level PES based on DFT electronic energies and gradients. 9,10The equation for this approach is simply where V LL/CC is the corrected PES, V LL is a PES t to low-level electronic data (such as DFT energies and gradients), and DV CC-LL is the correction based on high-level coupled cluster energies.It is noted that the difference between CCSD(T) and DFT energies, DV CC-LL , is usually not as strongly varying as V LL with respect to the nuclear congurations and therefore just a small number of high-level electronic energies are adequate to t the correction PES.The method was validated for PESs of small molecules, CH 4 and H 3 O + , 12-atom N-methyl acetamide, and 15-atom acetylacetone.In all cases, the coupled cluster energies were obtained over the same large span of congurations used to get the lower-level PES.
Here we propose to extend this D-ML approach from molecular PESs to general, non-reactive force elds that are explicitly or implicitly many-body.There are many examples of such force elds that determine the total energy of N monomers.For example, consider force elds for water.(For a recent review see ref. 11.)For simplicity, we denote these by "MB-FF".Suppose our goal is to bring this force eld to the "gold-standard" CCSD(T) level of theory.Clearly this cannot be done by simply applying the above equation for an arbitrary number of monomers owing to the prohibitively exponential computational cost of CCSD(T) calculations with respect to the number of monomers.Instead we propose a D-ML force-eld for N monomers given by the sum of many-body corrections, namely where DV n-b are the many-body corrections to the MB-FF manybody terms, given by the difference between CCSD(T) and MB-FF n-body (n-b) interaction energies.To be clear, recall that the n-b interaction energy is obtained from a cluster of n monomers.For example, the 4-b interaction is obtained by calculating the total energy of the tetramer (four monomers) and subtracting all the 1-, 2-, and 3-b interactions from the total energy.Note for simplicity, we assume that an accurate 1b term, e.g., the single water molecule, is given in the MB-FF.We have truncated explicit correction terms at the 4-b level with force-elds for water in mind.This is because it has been established by high-level calculations that 4-b, while small, are needed to obtain nearly 100 percent of the electronic dissociation energies of water clusters up to the 21-mer. 12This has also been shown previously for some isomers of the water hexamer 13,14 and we show this again explicitly here.And crucially, we have the CCSD(T) electronic energies needed for the correction up to 4-b, and these datasets were recently used to develop the pure many-body water potential q-AQUA. 15,16q-AQUA ¼ where the meaning of each term is clear.In q-AQUA the 2-, 3-, and 4-b interactions are permutationally invariant polynomial (PIP) 17,18 ts to thousands of CCSD(T) energies (details are in ref. 16.)These CCSD(T) electronic energies for the 2-b, 3-b and 4b are ready to be used to obtain a DV n-b correction to any water MB-FF, and we return to this below.In this work, the focus is on the 4-b correction to the MB-pol force eld. 19,20In MB-pol the 2-b and 3-b are already at CCSD(T) level, but the 4-b interaction is essentially given by the TTM4-F potential, 21 which is a sophisticated MB-FF for water.Errors between 0.1 and 0.84 kcal mol À1 for the these 4-b interactions for the hexamer isomers against direct CCSD(T) calculations, were reported in 2015. 22 In the next section we present the details of the 4-b correction PES, followed by several tests that indicate the effectiveness of this correction PES.

DV 4-b fitting details
The data set for the DV 4-b t is simply the difference between the 4-b CCSD(T)-F12/haTZ (aug-cc-pVTZ basis for O atoms and cc-pVTZ for H atoms) and MB-pol/TTM4-F energies.The total number of congurations in the data set is 3692 and all of them are used for the t.The t uses the PIP approach, in which the PIPs are generated using MSA soware. 18,23The PIPs are usually polynomials in the Morse variables of the internuclear distances, exp(Àr ij /l), where r ij is the distance between atoms i and j and l is a range parameter, taken for this calculation to be 1.5 bohr.We used 22221111 permutational symmetry at a maximum polynomial order of 3. Two additional issues concerning the basis set are important to note.
First, it is desirable not to include polynomials that do not have the correct limiting behavior as one or more monomers are removed from the others to a large distance.In the 4-body case, we need to consider the removal of each monomer from the other three and the removal of each possible dimer from the other one.In all of these cases, the 4-body interaction energy must vanish.The process of identifying PIPs that do not have the correct limiting behavior is what we call purication. 24,25To identify the PIPs with incorrect limiting behavior, the relevant distances are augmented by 100 Å, and we accept the polynomial as having the correct behavior if its Morse value is below 10 À6 .We cannot, however, immediately eliminate these polynomials because there may be other polynomials that, for example, are composed of products between one with a correct limit and one with an incorrect limit.At rst, we simply rename the ones with an incorrect limit.Aer all the polynomials have been evaluated, we examine the denitions of all those with the correct limits and determine which of the monomials and which of the renamed polynomials with incorrect limits are required to calculate them.Finally, we remove those polynomials that are not required and renumber those that remain, keeping the order of calculation to ensure that no partial calculation that contributes to any polynomial needs to be performed twice.We then have a set of polynomials that all have the correct limiting behavior and that can be calculated efficiently. 26he second issue that we need to consider is how to maintain permutational symmetry, not only in each monomer, but when monomers as a whole are interchanged with one another.This latter exchange is not taken into account by the MSA soware, so the polynomials that we create by the previously described purication will not, in general, have permutational symmetry with respect to exchange of identical monomers.A common method for dealing with this issue is to augment the dataset by adding all relevant permutations of the Cartesian coordinates and assigning them the same energy, thus requiring a set of n! geometries for each energy, where n is the number of monomers (4, in this case).A better method is to identify groups of polynomials that have permutational symmetry with respect to monomer exchange and then form "superpolynomials" that are the sum of the polynomial members of each group.We identify the permutationally invariant groups of polynomials by taking a single set of n! permutationally related geometries and calculating the value of each polynomial.While the values of individual polynomials vary from permutation to permutation, the groups of polynomials, taken together for each permutation, will have the same group of values.For each permutation, one can form pairs of the polynomial identities and their values, and then sort the pairs by their values.Looking at all pairs that have the same value component in all permutations gives the identities of the polynomials, some of which may be repeated, that make up a permutationally invariant group.In general, there will be as many groups as there were original polynomials.These groups, each with n! (not necessarily unique) polynomial contributions, are then summed to form "superpolynomials" having permutational symmetry with respect to exchange of identical molecules.Having formed these superpolynomials, there is no need for augmentation of the dataset with permutationally related geometries.
We used basis sets of different sizes, with 200, 500, and 1000 "superpolynomials".More details of the bases are given in Table 1.As seen, although tting with more polynomials can reduce the tting error, the computational cost is roughly proportional to the sum of the number of monomials and polynomials.The results presented in this paper are based on the basis with 200 "superpolynomials", as this achieves reasonably good accuracy with smaller cost.
The nal energy is written as S ijkl DV 4-b ði; j; k; lÞ; where S ijkl is a switching function whose value is 1 at short range and 0 at the long range.Specically, where r max is the maximum OO distance in a water tetramer, and S is 1 when r max is smaller than r i and is 0 when r max is greater than r f .In this work we used r i ¼ 5.5 Å and r f ¼ 7.0 Å unless otherwise specied.shown, the equilibrium R OO distance is 2.7 Å, so that large corrections are in the steeply repulsive part of the potential.However, there are also large corrections for highly distorted tetramers geometries not visited by this cut, discussed below.The bottom panels plot, as a function of the maximum R OO distance of the tetramer, the difference between TTM4-F and CCSD(T) energies (panel c) and the difference between the corrected TTM4-F and CCSD(T) energies (panel d).It is visually clear that the correction provides both a better correlation and a smaller error with respect to the CCSD(T)-F12 4-b energies.Note that in addition to large errors in the short range, the TTM4-F 4-b energies also have signicant errors (>50 cm À1 ) even when the R OO reaches 6.5 Å, as panel c shows.

Results and discussions
We now consider the binding energies of the eight isomers of the water hexamer.The binding energies for this particular water cluster have become a "benchmark" test for both electronic structure methods as well as numerous water potentials.For a recent example see ref. 29.That the hexamer has taken this central role in both theory 13,28,30,31 and experiment 32,33 is due to the fact that it is the smallest cluster with a non-planar conguration (for the O atoms).Thus, it has been termed the smallest drop of water/ice.In addition for the present purposes, this cluster has a signicant number (15) of 4-b interactions.The absolute binding energies of these clusters from the benchmark electronic structure calculation (CCSD(T)/CBS values taken from ref. 28) and various PESs are listed in Table 2, and comparisons are shown in Fig. 4. As seen, with the 4-b correction, the binding energies are in better agreement with the benchmark values, especially for bag and cyclic isomers.
Table 3 shows for four of the hexamer isomers the mean absolute error (MAE) in the harmonic frequencies for both uncorrected and corrected MB-pol PES.For prism and cage, the 4-b corrected version is essentially as accurate as the original MB-pol, while for book and cyclic ring, the 4-b correction clearly improves the frequencies.
Next, we comment briey on the timing requirements for the TTM4-F + DV 4-b potential.As shown in the last line of Table 1, the additional time for calculating the correction for the results presented in this paper (PIP 200 ) is only about 1 second for 100 000 energy evaluations and approximately 3 seconds if the energy and associated gradients are evaluated at the same time.Table 4 shows the overall timing to evaluate the energy of 64, 128, and 256 monomers, with different cutoffs of the 4b correction.In this table, t MBX is the time to evaluate all the terms in the MB-pol, including the TTM4-F and MB-pol 2-b and 3-b, with the latest MBX soware, 35 while t DV 4-b is the time to evaluate our 4-b correction term, which is the extra cost when the DV 4-b is added to MB-pol.All timings are evaluated using a single Intel i7-8750H core.It can be seen that the extra cost of the 4-b correction is in general a small fraction of the cost of MB-pol, and the cutoff distance can be tuned to achieve a balance between the cost and accuracy.
We have made comparisons between the TTM4-F 4-b potential, the TTM4-F 4-b with DV 4-b correction, and the CCSD(T) results.A remaining question that should be addressed is how the correction potential performs in comparison with the previously reported full 4-b potential 15 as recently improved. 16f course, many larger studies may already be based on the very successful MB-pol potential.For these, improvement by DV 4-b might be the easiest upgrade, with some extra computational cost, depending on the choice of the 4-b cutoff.But for those who are interested only in the 4-b potential, we suggest previously reported 4-b PES since it is much faster than TTM4-F 4-b + DV 4-b , and is slightly more accurate as well.
Finally, we note that many potentials or components of potentials can be corrected by this method, which has already been shown to be accurate and efficient for CH 4 , H 3 O + , Nmethyl acetamide, 9 and for acetylacetone, 10 with more applications in process.In the current study, we have shown substantial improvement of the 4-b potential for TTM4-F, but one might have reasonable hope that this D-ML method is a general approach that could provide substantial improvements to other potentials at relatively small cost.In this case, the corrections would begin with the 2-b ones and could go up to 4-b.Our recent CCSD(T) 2-b, 3-b, and 4-b datasets are available, 36 and so the corrections, DV n-b , simply require a water force eld.Some examples are the well-known potentials AMOEBA, 37 and TTM2. 1  (ref.38) or TTM3. 39These are polarizable potentials, however, with signicant differences.Another force eld that might be interesting to "correct" is MB-UCB. 40Since this potential relies heavily on DFT calculation, using the uB97X-V/def2-QZVPPD functional, it appears that the correction to MB-UCB would be analogous to the correction of DFT to CCSD(T) PESs that we considered originally in our D-ML method.

Summary and conclusions
The 4-b interaction in the MB-pol potential has been corrected using the proposed D-ML approach.The 4-b interaction itself and the correction are "small" compared to say the 2-b interaction.However, as noted above the 4-b correction has been shown to be the "ultimate" interaction by Xantheas and coworkers for large water clusters.Further it extends the successful MB-pol potential to this level of interaction.It is worth noting that the PIP t to the difference 4-b energies is challenging because it is a 12-atom PES.While this number of atoms is not at the frontier of ML-PESs currently, it was not feasible years ago when PIP 2-b and 3-b potentials were reported for water in the WHBB 41 and the MB-pol 19,20  Numerous applications of this 4-b corrected MB-pol potential can and hopefully will be done.From the present analysis, it is clear that the correction is not large; however, it does improve the accuracy of this excellent potential.
The methods described above offer three important ideas.First, we have described a method for ensuring that the potentials go to zero when appropriate distances get large.Second, we have also described a method for maintaining permutational symmetry not only in each monomer but when monomers as a whole are interchanged with one another.Finally, the DV method itself allows large improvements for small amount of effort, and this approach appears to be general and could be applied for other water force elds and similar types of force elds for other liquids and materials.
These are fairly large fractions of the 4b energy itself.Stimulated by recent assessments of the importance of the 4-b interaction and inaccuracy of the MB-pol 4-b, we report a correction 4-b PES, denoted DV 4-b , that is aimed directly at extending the MB-pol potential to the CCSD(T) 4b level.The correction is a PIP t to the energy difference between the CCSD(T) 4-b interactions and TTM4-F 4b interactions.

First we examine two 1 -
d cuts where we compare the 4b CCSD(T)-F12/haTZ energies to those from MB-pol/TTM4-F and to those from MB-pol + DV 4-b .Fig. 1 and 2 show cuts of the potential for separating two dimers from one another and for separating a monomer from the remaining trimer, respectively.The major improvements of the 4-b correction are in the short-range.Whereas MB-pol/TTM4-F is quite accurate in the long range, it is not designed with the proper Pauli exchange and repulsion in the short range.Despite the fact that TTM4-F fails badly in the short range, the D 4-b potential does provide a reasonable correction.It should be noted that for both cuts

Fig. 1 4
Fig. 1 4-b energies from indicated sources as a function of the oxygen-oxygen distance between pairs of water dimers in the tetramer.The arrows indicate the dimer pair that separates from the rigid tetramer.The equilibrium value of this distance is 2.7 Å.

Fig. 2 4
Fig. 2 4-b energies from indicated sources with a single monomer separating from the tetramer.R OO is the distance between the O atoms on the two monomers on the axis inferred from the arrow.

Fig. 3
Fig. 3 (a) Correlation plot between TTM4-F 4-b and CCSD(T)-F12 4-b energies; (b) correlation plot between TTM4-F + DV 4-b and CCSD(T)-F12 4-b energies; (c) error of TTM4-F 4-b as a function of max R OO in the tetramer; (d) error of TTM4-F + DV 4-b as a function of max R OO in the tetramer.

Fig. 4
Fig.4The binding energies of the eight isomers of the water hexamer from indicated sources.
potentials.That the correction potential is a signicant improvement over the TTM4-F potential can be seen (a) by comparing cuts of the potentials for TTM4-F 4-b, and TTM4-F 4-b + DV 4-b along with the CCSD(T)-F12 values; (b) by comparing the correlation between these potentials with the CCSD(T) values and the errors as a function of R OO ; (c) by comparing results for the binding energies of the water hexamer isomers; and (d) comparing the MAEs in harmonic frequencies for four of the isomers.

Table 2
Absolute binding energies (in kcal mol À1 ) of 8 isomers of water hexamer, from CCSD(T)/CBS benchmark calculations (ref.28)and the PESs

Table 3
Mean absolute errors (MAE) in harmonic frequencies (in cm À1 ) for indicated PESs.The benchmarks are from ref.34

Table 4 Time
(in seconds) needed to evaluate the energy of N mono monomers.Here t MBX is the time to obtain all terms in MB-pol using the latest version (MBX), and t DV 4-b is the time to evaluate our 4b correction.N tetra is the number of tetramers needed to be evaluatedN mono t MBX Cutoff ¼ 6.0 Å Cutoff ¼ 6.5 Å Cutoff ¼ 7.0 Å