A transferable active-learning strategy for reactive molecular force fields

Predictive molecular simulations require fast, accurate and reactive interatomic potentials. Machine learning offers a promising approach to construct such potentials by fitting energies and forces to high-level quantum-mechanical data, but doing so typically requires considerable human intervention and data volume. Here we show that, by leveraging hierarchical and active learning, accurate Gaussian Approximation Potential (GAP) models can be developed for diverse chemical systems in an autonomous manner, requiring only hundreds to a few thousand energy and gradient evaluations on a reference potential-energy surface. The approach uses separate intra- and inter-molecular fits and employs a prospective error metric to assess the accuracy of the potentials. We demonstrate applications to a range of molecular systems with relevance to computational organic chemistry: ranging from bulk solvents, a solvated metal ion and a metallocage onwards to chemical reactivity, including a bifurcating Diels–Alder reaction in the gas phase and non-equilibrium dynamics (a model SN2 reaction) in explicit solvent. The method provides a route to routinely generating machine-learned force fields for reactive molecular systems.

. (a) A GAP model for molecular water (fitted using the P1 parameter set, Table S1) as characterized by its accuracy on external validation data not encountered in training; dark and light shaded area bound the 'chemically accurate' ±1 kcal mol -1 and ±2 kcal mol -1 regions, respectively. (b) The resulting difference between predicted and true single point energies on MD frames at 300 K (δt = 0.5 fs). Dynamics with 30 water molecules in a cubic box (l = 10 Å, ρ ~ 1 g cm -3 ). Error range (min-max, shaded) and average of five simulations using the same trained GAP from different initial randomly placed then minimized points. DFTB(3ob) ground truth. Orange lines depict the total cumulative error (solid) and cumulative error above 0.1 eV (dashed).  Table S2. Outline of training strategies used to train a GAP for bulk water (Figure 1). N total ground truth evaluations used to train the final potential. Errors are quoted as standard errors in the mean from 5 independent samples where appropriate to one significant figure. All training used 10 water molecules in a cubic box with side length 7 Å.

Strategy Notes
Classical molecular mechanics MD simulations were performed at 100, 300, 500 and 1000 K using GROMACS v. 2019.2 with a timestep of 1 fs and TIP3P water. Following a 1 ns NVT equilibration of a random configuration of water 10 ns of NVT dynamics were performed taking 1000 evenly spaced frames from the simulation. 0 1000 2 rand.
Configurations were generated by adding GFN2-XTB optimised water molecules into the box in a random position and orientation, ensuring that there are no intermolecular distances < 1.5 Å. Random displacements were added to each Cartesian coordinate sampled from a random normal distribution with σ=0.05 Å, which samples over intramolecular bond stretches and bends. 0 1000 3 rand. min.
As rand. where each random configuration is minimized to |Fi| < 1 eV Å -1 , where Fi is the force on atom i. Subsequently, random normal displacements were added to each atom to ensure some sampling of the intramolecular modes.

AIMD
Ab-initio MD simulations were performed at the ground truth level (DFTB, 3ob parameters) for 1 ps at 300 K with a 0.5 fs timestep. Frames were randomly selected from the trajectory with at least a 2 fs interval.
0.011 ± 0.003 2570 ± 10 5 AL Active learning is initiated from a GAP trained on 10 random configurations with a 1.5 Å minimum distance between water molecules. MD simulations at 300 K was then propagated using this potential for n 3 + 2 fs, where n is the number of iterations of the MD trajectory. The error between the ground truth (E0) and predicted energy (EGAP) is evaluated for the final frame (|E0 -EGAP|) and, if above 0.1 eV the configuration is added to the training data. If above 10× the threshold then the error is backtracked in intervals of 2 fs until a suitable configuration is obtained, as to not add any very high energy configurations. If the error is 100× the threshold, then the first frame of the trajectory is returned, as the backtracking is likely to be too slow. If the error is below the threshold, then a further MD trajectory is propagated, and n incremented by one. If n > 10 then no configuration is added from this set of trajectories.  (Table S2). max(τacc) = 1 ps calculated in a 2500 K simulation for a ~1% probability of accessing a configuration 1 eV above the minimum, El = 0.043 eV, Et = 0.43 eV. (b, c) Energy and force distribution of the test data used to calculate a root mean squared error (RMSE) generated on a grid over rOHa, rOHb ∈ [1.0, 1.3] Å and rHaHb ∈ [1.0, 2.5] Å and truncated above 1 eV of the minimum to 103 datapoints that are not coincident with any training data. Figure S3. Water dimer PES predicted using SOAP and 2b GAPs with the ground truth (DFTB(3ob)) in black. Trained on then evaluated on the PES points. Intramolecular component subtracted using the same intra-GAP as Figure 1 (2b+3b, rc=3.0 Å) evaluated in separate boxes. Figure S4. Learning curves for a bulk water GAP trained on random configurations, with or without selection strategies. Water molecules randomized in the box by applying a random rotation and translation to each water molecule ensuring no intermolecular distance < 1.5 Å, apart from rand. 1.7 Å min (purple triangles) where the minimum distance is 1.7 Å. τacc calculated with a 1 fs interval, El = 0.1 eV, Et = 1 eV averaged over 5 initial random configurations. Error bars are standard error in the mean over 5 independent iterations. CUR-K 1/10 is a CUR selection of the square kernel matrix between SOAP descriptors calculated using Dscribe, 1 keeping 1 in 10 rows, CUR is selection on the SOAP matrix averaged over atoms in the box.

S1. Other solvents
To demonstrate the transferability of the intra+inter active learning training method to other solvents presented here are learning curves in τacc for a selection of small, commonly used, organic solvents. The accuracy of the potential is quantified by the final error metric following the full active learning, and as a quick validation the radial distribution functions are shown for all pairs in each system. Given the much higher dimensionality of the intramolecular PES in all of the solvents a dense grid over all coordinates is not possible (e.g. Natoms = 6 in MeCN ⇒ 8 (3×6) -6 ~ 10 11 points) and active learning needs to be employed for the intra surface. Employing active learning on the intramolecular modes of water requires both a high temperature, as to sample the curvature of the PES at regions often sampled at 300 K, and an energy threshold to prevent high energy configurations entering the fit (Figure S13).
Active learning at 1600 K then readily dissociates Cl• with a DFTB ground truth.    Table S3. 1600 K for all intramolecular active learning and 300 K for the intermolecular equivalent from 10 initial random normal displacements of all atoms (σ = 0.05 Å, µ = 0 Å). Maximum τacc shown as dashed lines and min(τacc) = 0.1 fs for plotting. Error bars plotted as the standard error in the mean from 5 independent repeats. Active learning halted if no configurations have error above the threshold. Table S3. Box sizes for intermolecular training and SOAP descriptors used for GAPs shown in Figure S14. All intermolecular training used 10 molecules initially optimised at the GFN2-XTB level of theory. Box size chosen to ensure P(generated) ⪆ 0.1 when molecules are added to the box ensuring a minimum distance of >2 ×XVdW -0.5 Å where XVdW is the van der Waals radius of the largest atom (X) in the system. All descriptors used rc SOAP = 3 Å and other parameters as Table S1. SOAP descriptors shown as e.g. X: Y, Z, as a SOAP on atom type X which includes other types Y and Z. 0.04 eV ~ 1 kcal mol -1 . . Radial distributions functions for acetonitrile generated using GAPs (purple) trained with active learning on inter and intramolecular degrees of freedom (as Figure S14) ground truth DFTB(3ob) level (black). Dynamics run for 30 ps at 300 K in a 13.8 Å length box with a time-step of 0.5 fs. Figure S16. (a) Learning curve for TS originated dynamics in the gas phase for Cl -+ CH3Cl → Cl -+ CH3Cl. Number of ground truth evaluations does not include those used to find the initial transition state. SOAP descriptors with rc = 3.5 Å on C and Cl. TS optimisation and energy/force evaluations performed with ORCA at the PBE/ma-def2-SVP level of theory. τacc calculated using a 2 fs time interval, 1 kcal mol -1 error threshold, 10 kcal mol -1 maximum total error to a maximum of τacc = 100 fs, as only short time dynamics are required from the TS.