Troy D. Loeffler,a Tarak K. Patra,ab Henry Chan ac and Subramanian K. R. S. Sankaranarayanan *ac
aCenter for Nanoscale Materials, Argonne National Laboratory, Argonne, IL 60439, USA
bDepartment of Chemical Engineering, Indian Institute of Technology Madras, Chennai, TN 600036, India
cDepartment of Mechanical and Industrial Engineering, University of Illinois, Chicago, IL 60607, USA. E-mail: skrssank@uic.edu
First published on 26th March 2020
Neural network (NN) based potentials represent flexible alternatives to pre-defined functional forms. Well-trained NN potentials are transferable and provide a level of accuracy on par with the reference model used for training. Despite their tremendous potential and the interest in them, there are at least two challenges that need to be addressed – (1) NN models are interpolative, and hence are trained by generating large quantities (∼10⁴ or greater) of structural data in the hope that the model has adequately sampled the energy landscape both near and far from equilibrium. It is desirable to minimize the amount of training data, especially if the underlying reference model is expensive. (2) NN atomistic potentials (like any other classical atomistic model) are limited in the time scales they can access. Coarse-grained NN potentials have emerged as a viable alternative. Here, we address these challenges by introducing an active learning scheme that trains a CG model with a minimal amount of training data. Our active learning workflow starts with a sparse training data set (∼1 to 5 data points), which is continually updated via a nested ensemble Monte Carlo scheme that iteratively queries the energy landscape in regions of failure and improves the network performance. We demonstrate that with ∼300 reference data points, our AL-NN is able to accurately predict both the energies and the molecular forces of water, within 2 meV per molecule and 40 meV Å−1 of the reference (coarse-grained bond-order potential) model. The AL-NN water model provides good predictions of several structural, thermodynamic, and temperature-dependent properties of liquid water, with values close to those obtained from the reference model. The AL-NN also captures the well-known density anomaly of liquid water observed in experiments. Although the AL procedure has been demonstrated for training CG models with sparse reference data, it can be easily extended to develop atomistic NN models against a minimal amount of high-fidelity first-principles data.
Design, System, Application
Artificial neural networks (ANNs) for molecular simulations are currently trained by generating large quantities (on the order of 10⁴ or greater) of structural data in the hope that the ANN has adequately sampled the energy landscape both near and far from equilibrium. This can, however, become prohibitive for more accurate levels of quantum theory. As such, it is desirable to train a model using the smallest possible data set, especially when the costs of high-fidelity calculations such as CCSD and QMC are high. Here, we present an active learning approach that starts with a minimal number of training data points, iteratively samples the energy landscape using nested ensemble Monte Carlo to identify regions of failure, and retrains the neural network on-the-fly to improve its performance. We find that this approach is able to train a neural network to reproduce the thermodynamic, structural and transport properties of bulk liquid water by sampling fewer than 300 configurations and their energies. This study reports a new active learning scheme with a promising sampling and training strategy for developing accurate force-fields for molecular simulations using extremely sparse training data sets. The approach is quite generic and can be easily extended to other classes of materials.
In this context, it should be noted that neural network based potential models4–7 provide flexibility and can simultaneously be efficient if appropriately trained. NN models are emerging as a popular technique owing to the rapid advancement in computational resources as well as a myriad of electronic structure codes that allow for efficient generation of training data. The bulk of the work on NN models is on representing atomistic interactions, with an underlying aim to retain first-principles accuracy at a comparatively lower computational cost. Nonetheless, NN potentials remain expensive compared to pre-defined functional forms, and improving their efficiency remains a challenge. The other challenge with NN models is the need for large amounts of training data – NN models are interpolative, and as such the traditional approach for training NNs has relied on generating as large a training data set as possible. Often, large-scale generation of high-fidelity training data can become challenging.
To address the time scale challenge of DNN models, one can circumvent their high computational cost via coarse-graining. CG models tend to sacrifice accuracy to gain efficiency. Zhang and co-workers have recently introduced a deep-learning coarse-grained potential (DeePCG) that constructs a coarse-grained neural network trained with full atomistic data while preserving the natural symmetries of the system.8 These models sample configurations of the coarse-grained variables accurately, albeit at a much lower computational cost than the original atomistic model. Our previous work on the development of a CG-DNN model showed that one can accurately predict both the energies and the molecular forces of water, within 0.9 meV per molecule and 54 meV Å−1 of a reference (bond-order potential) model.6 While this CG-DNN water model provides good predictions of several structural, thermodynamic, and temperature-dependent properties of liquid water, it did require elaborate training data, i.e. energies of ∼30 000 bulk water configurations. Generating such extensive training data is difficult, especially if one is interested in high-fidelity calculations such as quantum Monte Carlo (QMC).
There have been several recent efforts on the use of active learning strategies to address the challenge of generating and efficiently sampling training data for NN models. Smith et al. have employed active learning (AL) via Query by Committee (QBC) to develop a machine learning model.9 QBC uses the disagreement between ensembles of ML potentials to infer the reliability of the ensemble's prediction. QBC automatically samples regions of chemical space where the potential energy is not accurately represented by the ML potential. They validated their AL approach on the COMP6 benchmark containing a diverse set of organic molecules. They demonstrated that only 10% to 25% of the data is required to accurately represent the chemical space of these molecules. Likewise, Zhang et al. have developed an active learning scheme (deep potential generator (DP-GEN)) that can be used to construct machine learning models for molecular simulations of materials.10,11 Their procedure involves exploration, generation of accurate reference data, and training. Using Al and Al–Mg as examples, they demonstrate that ML models can be developed with a minimal number of reference data points. On the other hand, Vandermause et al. have recently introduced an adaptive Bayesian inference method to automate the training of low-dimensional multiple-element interatomic force fields using structures sampled on the fly from AIMD.12 Their active learning framework uses the internal uncertainty of a Gaussian process regression model to decide between accepting the model prediction and augmenting the training data. The aim in these studies is to minimize the ab initio training data required to develop interatomic potentials.
Here, we introduce an active learning (AL) strategy that starts with minimal training data (∼1 to 5 data points) and is continually updated via a nested ensemble Monte Carlo scheme that iteratively queries the energy landscape in regions of failure and improves the network performance. We choose water as a representative system given its various thermodynamic anomalies, which have proven to be a challenge for modeling.1 Liquid water exhibits its density maximum at 277 K.6 Several existing water models fail to capture the density anomaly accurately.7 One of the main barriers to the development of accurate CG-DNN models is the lack of the large amounts of high-quality data that are essential for understanding the neural network topology, basis functions and parameterizations required to capture a wide range of water properties across different thermodynamic conditions. Given these challenges, water represents an excellent test system for assessing the efficacy of our AL procedure. Note that our focus is not on developing a new coarse-graining procedure, but instead on the ability to train a CG-DNN using sparse training data. We show that our AL-NN is able to adequately represent the CG energy landscape and the thermodynamic and structural properties of water by sampling a minimal amount of reference data (∼300 reference data points in total).
Fig. 1 Schematic showing the active learning workflow employed for the generation of the NN potential model for bulk liquid water.
To test our AL scheme, we train a neural network to a reference Tersoff-based coarse-grained water model (BOP water), which contains two- and three-body terms in its functional form.1 The neural networks used in this study were constructed and trained using the Atomic Energy Network (AENet) software package,13 which was modified to implement the active learning scheme outlined above. Simulations using these networks were carried out by interfacing AENet with the Classy Monte Carlo simulation software to perform the AL iterations. The main steps in our active learning iteration include the following (a minimal code sketch is given after the list):
1. Training of the NN using the current structure pool (of bulk water configurations).
2. Running a series of stochastic algorithms to test the trained network's current predictions.
3. Identification of regions of configurational space where the NN is currently struggling.
4. Updating of the structure pool with the failed configurations.
5. Retraining of the NN with the updated pool, then returning to step 2.
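A minimal Python sketch of this loop is given below. It is purely illustrative: the function and method names (train_network, sample_configs, reference_energy, network.energy) are hypothetical placeholders, and the production workflow couples AENet with the Classy Monte Carlo package rather than Python.

```python
def active_learning(initial_pool, train_network, sample_configs, reference_energy,
                    max_iterations=50, tol_per_molecule=0.002):
    """Illustrative AL loop: train, sample, flag failures, augment the pool, retrain.

    initial_pool     : list of (configuration, reference energy) pairs (~1-5 entries)
    train_network    : callable that fits a NN to the pool and returns it
    sample_configs   : callable running Metropolis + nested MC with the current NN
    reference_energy : callable returning the ground-truth (here coarse-grained BOP) energy
    """
    pool = list(initial_pool)
    network = train_network(pool)                         # step 1: fit NN to the current pool
    for _ in range(max_iterations):
        candidates = sample_configs(network)              # step 2: query the energy landscape
        failed = []
        for config in candidates:                         # step 3: locate regions of failure
            e_ref = reference_energy(config)              # one ground-truth call per candidate
            if abs(network.energy(config) - e_ref) > tol_per_molecule * len(config):
                failed.append((config, e_ref))
        if not failed:                                    # no failures: network has converged
            return network
        pool.extend(failed)                               # step 4: update the structure pool
        network = train_network(pool)                     # step 5: retrain, then re-test
    return network
```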
G1 | η (Å−2) | G1 | η (Å−2) | G1 | η (Å−2) | G1 | η (Å−2) | G1 | η (Å−2) |
---|---|---|---|---|---|---|---|---|---|
1 | 0.00417 | 6 | 0.01551 | 11 | 0.0576 | 16 | 0.21386 | 21 | 0.79406 |
2 | 0.00543 | 7 | 0.02016 | 12 | 0.07488 | 17 | 0.27802 | 22 | 1.03229 |
3 | 0.00706 | 8 | 0.02621 | 13 | 0.09734 | 18 | 0.36143 | 23 | 1.34197 |
4 | 0.00917 | 9 | 0.03408 | 14 | 0.12654 | 19 | 0.46986 | 24 | 1.74456 |
5 | 0.01193 | 10 | 0.04430 | 15 | 0.16451 | 20 | 0.61082 | 25 | 2.26790 |
G2 | η (Å−2) | λ | ζ | G2 | η (Å−2) | λ | ζ |
---|---|---|---|---|---|---|---|
1 | 0.0004 | 1 | 2 | 14 | 0.0654 | 1 | 4 |
2 | 0.0054 | 1 | 2 | 15 | 0.0704 | 1 | 4 |
3 | 0.0104 | 1 | 2 | 16 | 0.0754 | −1 | 4 |
4 | 0.0154 | −1 | 2 | 17 | 0.0804 | −1 | 4 |
5 | 0.0204 | −1 | 2 | 18 | 0.0854 | −1 | 4 |
6 | 0.0254 | −1 | 2 | 19 | 0.0904 | 1 | 5 |
7 | 0.0304 | 1 | 3 | 20 | 0.0954 | 1 | 5 |
8 | 0.0354 | 1 | 3 | 21 | 0.1004 | 1 | 5 |
9 | 0.0404 | 1 | 3 | 22 | 0.1054 | −1 | 5 |
10 | 0.0454 | −1 | 3 | 23 | 0.1104 | −1 | 5 |
11 | 0.0504 | −1 | 3 | 24 | 0.1154 | −1 | 5 |
12 | 0.0554 | −1 | 3 | 25 | 0.1204 | 1 | 6
13 | 0.0604 | 1 | 4 |
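The η, λ and ζ values above parameterize the radial (G1) and angular (G2) fingerprints that describe each molecule's environment. As a rough illustration, the standard Behler–Parrinello forms of such functions can be written as below; this is only a sketch assuming a cosine cutoff function f_c and no radial shift, and the exact forms and normalization used by AENet may differ.

```python
import numpy as np

def cutoff(r, r_c):
    """Smooth cutoff: 0.5*(cos(pi*r/r_c) + 1) for r <= r_c, zero beyond r_c."""
    return np.where(r <= r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def g1_radial(r_ij, eta, r_c):
    """Radial fingerprint of one molecule: sum over its neighbour distances r_ij."""
    return np.sum(np.exp(-eta * r_ij**2) * cutoff(r_ij, r_c))

def g2_angular(r_ij, r_ik, r_jk, cos_theta, eta, lam, zeta, r_c):
    """Angular fingerprint over triplets (j, i, k) centred on molecule i."""
    return 2.0 ** (1.0 - zeta) * np.sum(
        (1.0 + lam * cos_theta) ** zeta
        * np.exp(-eta * (r_ij**2 + r_ik**2 + r_jk**2))
        * cutoff(r_ij, r_c) * cutoff(r_ik, r_c) * cutoff(r_jk, r_c)
    )
```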
Here, we employ the generalized representation of NNs for constructing the PES.15 Within this representation, each molecule of a given configuration is represented by a NN. The total energy of a configuration is thus obtained as a sum of the molecular energies, E = ∑_{i=1}^{N_A} E_i, where E_i is the output of the ith NN and N_A is the total number of molecules in a given configuration. We note that the architecture and weight parameters of all the molecular NNs are identical. During the training, the symmetry functions of each molecule of a configuration are fed to the corresponding NN via its input layer. In every NN, each compute node in the hidden layers receives the weighted signals from all the nodes of the previous layer and feeds them forward to all the nodes of the next layer via an activation function, y_j = f(∑_i w_{ij}x_i + b_j). Here, f(x) = tanh(x) is used as the activation function for all the compute nodes. As mentioned earlier, the sum of the outputs of all the NNs serves as the predicted energy of the system. The error in the NNs, which is the difference between the predicted and reference energies of a given configuration, is propagated backward via the standard backpropagation algorithm.17 All the weights that connect any two nodes are optimized using the Levenberg–Marquardt method18 in order to minimize the error, as implemented within the framework of the AENet13 open-source code (see the next sub-section for details).
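As an illustration of this architecture, a minimal NumPy sketch of the shared per-molecule network is shown below; the weight and bias containers are hypothetical placeholders and do not reflect the layer sizes of the production AENet model.

```python
import numpy as np

def molecular_energy(g, weights, biases):
    """Feed one molecule's symmetry-function vector g through the shared NN."""
    h = np.asarray(g, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)                     # hidden layers use f(x) = tanh(x)
    return float(weights[-1] @ h + biases[-1])     # linear output node gives E_i

def configuration_energy(fingerprints, weights, biases):
    """Total energy E = sum of E_i over the N_A identical molecular sub-networks."""
    return sum(molecular_energy(g, weights, biases) for g in fingerprints)
```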
In addition, a wide collection of non-physical moves or non-thermal sampling approaches can be used. For the purposes of this work, Boltzmann-based Metropolis sampling and a nested ensemble based approach19 were used to generate the structures for each AL iteration. This was done to gather information on both the thermally relevant structures predicted by the neural network and the higher-energy structures that may still be important for creating an accurate model. The Metropolis simulation was run for 5000 MC cycles at 300 K, with the initial structure picked randomly from the current neural network training pool. The nested ensemble simulations were run for another 5000 cycles. Here, a Monte Carlo cycle is defined as the number of moves such that, on average, each particle has a chance to move at least once. This convention is traditionally used to facilitate comparison with molecular dynamics, where every particle moves at each time step. Thus, if there are N molecules in the system, a cycle is defined as N attempted moves.
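Under this definition, one Metropolis cycle can be sketched as follows; the coordinate array, energy callable and maximum displacement are illustrative, and details such as periodic boundary handling are omitted.

```python
import math
import random
import numpy as np

K_B = 8.617333e-5   # Boltzmann constant in eV/K

def metropolis_cycle(coords, energy_fn, temperature=300.0, max_disp=0.2):
    """One MC cycle: N attempted single-particle moves for N CG water molecules."""
    beta = 1.0 / (K_B * temperature)
    e_old = energy_fn(coords)
    for _ in range(len(coords)):                               # a cycle = N attempted moves
        i = random.randrange(len(coords))
        trial = coords.copy()
        trial[i] += np.random.uniform(-max_disp, max_disp, size=3)
        e_new = energy_fn(trial)
        if e_new <= e_old or random.random() < math.exp(-beta * (e_new - e_old)):
            coords, e_old = trial, e_new                       # accept the trial move
    return coords
```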
Note that nested sampling works by a halving approach with respect to the density of states. Initially, when the nested-MC simulation is performed, the particles are allowed to move around freely with no rejection. During this stage, the probability distribution of the system's potential energy is collected. After a specified number of cycles, the median energy of the population is estimated from this probability distribution. This value now becomes the new maximum energy limit for the system. During the next run, if a given move would cause the system energy to go above the median value from the previous run, the move is rejected and the system is returned to its previous state, in a manner analogous to Metropolis sampling. After the same number of cycles, the median energy of this run is computed and used as the maximum limit of the next run. This is applied repeatedly until the system is stuck (implying that a local minimum has been found) or the maximum number of cycles is reached.
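A minimal sketch of this median-halving loop is given below (using the same illustrative coordinate array and energy callable as in the Metropolis sketch); each pass records the energies visited, and the median of that population becomes the energy ceiling for the next pass.

```python
import statistics
import numpy as np

def nested_sampling_run(coords, energy_fn, n_passes=10, cycles_per_pass=500, max_disp=0.2):
    """Nested-ensemble sampling: each pass roughly halves the accessible density of states."""
    ceiling = float("inf")                                     # first pass: no rejection
    snapshots = []
    for _ in range(n_passes):
        energies = []
        for _ in range(cycles_per_pass * len(coords)):         # cycles_per_pass MC cycles
            i = np.random.randint(len(coords))
            trial = coords.copy()
            trial[i] += np.random.uniform(-max_disp, max_disp, size=3)
            if energy_fn(trial) < ceiling:                     # reject moves above the ceiling
                coords = trial
            energies.append(float(energy_fn(coords)))
        snapshots.append(coords.copy())                        # keep a structure from this pass
        new_ceiling = statistics.median(energies)              # median becomes the next ceiling
        if new_ceiling >= ceiling:                             # stuck: a local minimum is reached
            break
        ceiling = new_ceiling
    return snapshots
```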
The primary advantage of the nested sampling approach is that it can identify not only situations where the potential energy surface is under-predicted, but also situations where it is grossly over-predicted and would be naturally blocked from access under thermal sampling. It ensures a good spread from very high energies (∼+0.01 eV per atom) down to the most stable configurations (∼−0.42 eV per atom). This combats the entropic problem in sampling, where there are far more high-energy structures than low-energy ones, as well as the limitation of purely thermal sampling, in that it can access high-energy regions to look for configurations that are over-predicted by the neural network.
In order to rigorously validate the neural network models, we created a test set consisting of ∼150 000 bulk configurations of liquid water. Each bulk configuration for training and testing contains 108 water molecules. The test set was generated using a combination of nested, random, and thermal sampling employing the reference (Tersoff) potential, so as to obtain as diverse a set of system configurations as possible.
To measure the network performance as a function of the number of AL iterations, we choose the best network from each AL iteration and use it to predict the energies of the test set comprising 150 000 configurations. The correlation between the energy predicted by the AL-NN and the reference energy (BOP energy) for the training data sets generated over the course of the active learning is shown in Fig. 3a. As expected, we find that the final optimized network reliably predicts the configuration energies of the test data set not only in the near-equilibrium region, but also in the highly non-equilibrium region that extends far beyond it. In addition to evaluating the performance of the energy predictions of the AL-NN, we also evaluate its performance for forces. Note that the forces were not included as part of the training during the AL iterations. The correlation plot comparing the predicted and reference forces is shown in Fig. 3(b). Each point in the correlation plot represents the reference and predicted values of one of the force components, Fx, Fy and Fz, acting on a particle. The overall MAE between the reference and AL-NN predicted forces was found to be ∼40 meV Å−1. Considering that the network had not been trained on the forces, this agreement is excellent. Overall, the AL-NN optimized network performs satisfactorily over the extensively sampled configurations in the test data set.
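For reference, the two error metrics quoted here are simple mean absolute errors over the held-out test set; a short sketch of how they can be computed is shown below (the array names are illustrative, and energies are assumed to be in eV for 108-molecule configurations).

```python
import numpy as np

def mae_energy_per_molecule(e_pred, e_ref, n_molecules=108):
    """MAE of total configuration energies, reported per molecule (eV per molecule)."""
    return np.mean(np.abs(np.asarray(e_pred) - np.asarray(e_ref))) / n_molecules

def mae_force_components(f_pred, f_ref):
    """MAE over all Cartesian force components Fx, Fy, Fz (eV per angstrom)."""
    return np.mean(np.abs(np.asarray(f_pred) - np.asarray(f_ref)))
```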
We next rigorously assess the performance of the actively learnt NN water model by computing various temperature-dependent properties of liquid water as obtained from our MC simulations. The MC simulations are all performed in an isothermal–isobaric (NPT) ensemble at P = 1 bar, and the temperature is varied in the range of 250–320 K to extract the bulk water properties. Each simulation consisted of a system of bulk water with 1024 water molecules. The systems are equilibrated for 10⁵ MC cycles in a periodic simulation box. The equilibration run is then followed by another 10⁵ MC cycles for the production run. At each temperature, the property values are averaged over four independent production runs. Fig. 4 compares one such temperature-dependent property computed using our actively learnt NN model vs. the predictions of our reference BOP coarse-grained model as well as experimental20 values. The reference properties for the BOP model are computed from MD simulations performed using LAMMPS.21 Fig. 4a shows the performance of the models in capturing the most famous density anomaly of liquid water. The actively learnt NN predicts the temperature of maximum density (TMD) to be T = ∼278 K, which is close to both the target BOP model (TMD = ∼280 K) and the experimental value (TMD = 277 K). Overall, we find that the actively learnt NN predicts temperature-dependent density values that are well within 0.01 gm cc−1 of the reference model and experiments. For instance, our AL-NN predicts a liquid water density of 1.001 gm cc−1 at T = 300 K, which is in excellent agreement with the BOP-predicted value of 0.9997 gm cc−1. Overall, our actively learnt NN model captures the correct temperature–density correlation in liquid water (including the notoriously difficult density anomaly).
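The TMD quoted above can be extracted from the simulated density–temperature data, for instance by fitting a low-order polynomial to ρ(T) and locating its maximum; the sketch below is one such post-processing choice and is not necessarily the procedure used in this work.

```python
import numpy as np

def temperature_of_maximum_density(temperatures, densities, order=4):
    """Fit rho(T) with a polynomial and return the temperature at which it peaks."""
    coeffs = np.polyfit(temperatures, densities, order)
    t_grid = np.linspace(min(temperatures), max(temperatures), 1000)
    return t_grid[np.argmax(np.polyval(coeffs, t_grid))]
```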
We next assess the structural predictions by evaluating structural correlation functions such as the radial distribution function (RDF) and angular distribution function (ADF) of liquid water at a representative temperature of T = 300 K. Fig. 4b and c compare the RDF and ADF of the actively learnt NN and the BOP model, respectively. The RDF of our AL-NN water model displays 1st, 2nd and 3rd peaks (corresponding to the first, second and third coordination shells, respectively) at r = 2.8 Å, 4.5 Å, and 6.8 Å. Likewise, the ADF for our AL-NN shows the 1st and 2nd peaks at θ = ∼47° and θ = ∼95°, respectively. Both the RDF and ADF peak positions are in excellent agreement with those of the reference BOP model. Moreover, the quantitative agreement is also excellent, i.e. the peak intensities and widths of the RDF and ADF of the two models closely match. As shown in Fig. 4b, we observe that the predicted RDF minimum for the AL-NN at ∼3.4 Å is deeper than that of the experimental RDF.22 This might indicate over-structuring of liquid water, where the exchange of molecules between the first and second coordination shells is underestimated. The over-structuring feature is also evident in the reference model. However, the average coordination number of water molecules, i.e. the number of water neighbors in the first solvation shell, integrated out to the experimentally determined temperature-independent isosbestic point (r = 3.25 Å), is 4.7. This is in excellent agreement with the typical range of 4.3–4.7 observed in experiments.22,23 Overall, the AL-NN model reasonably captures the molecular structure of liquid water. We further assess the dynamical properties of the AL-NN by performing MD simulations in an NPT ensemble using the ASE open-source package.24 Fig. 4d shows the block averages of the mean square displacement (MSD) of a water molecule at T = 300 K obtained over the first 10 ps of an MD simulation. The water diffusivity calculated from the slope of the MSD curve is ∼3.05 × 10−5 cm2 s−1, which is close to the value of 3.04 × 10−5 cm2 s−1 predicted by the BOP model and the experimental value of 2.3 × 10−5 cm2 s−1.25
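The diffusivity follows from the Einstein relation, D = lim_{t→∞} MSD(t)/(6t); a minimal sketch of the slope-based estimate is shown below, assuming time in ps and MSD in Å², with the conversion to cm2 s−1 made explicit.

```python
import numpy as np

def diffusivity_from_msd(time_ps, msd_A2):
    """Einstein relation: D = slope(MSD vs t) / 6, converted from A^2/ps to cm^2/s."""
    slope = np.polyfit(time_ps, msd_A2, 1)[0]     # linear fit; slope in A^2/ps
    return (slope / 6.0) * 1.0e-4                 # 1 A^2/ps = 1e-4 cm^2/s
```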
We find that the quality of the output network correlates more strongly with data quality than with data quantity, meaning that simply flooding the training process with data can actually produce worse results with the most commonly used loss functions. The primary problem with many current approaches is oversampling, i.e. an excess of training data in one region and a lack of training data in another. We note that the AL training is able to adequately sample the configurational space while minimizing the number of training configurations. In some of our initial tests of the AL workflow, we observed that over-representation of a given region can bias the data and skew the neural network's weight-fitting procedure. If a given area of phase space is sampled much more heavily than another, the fitting process can achieve a low loss value by simply fitting the over-represented region well. Unfortunately, this occurs at the expense of the less represented regions. An over-abundance of one type of data is therefore a genuine problem for training a neural network, and one that we address in this work. Hence, we do not include all of the expensive ground-truth calculations, primarily to avoid such degeneracy, which in turn improves the predictive power of the trained network.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c9me00184k
This journal is © The Royal Society of Chemistry 2020 |