Mingjian Wen,ab Samuel M. Blau,b Evan Walter Clark Spotte-Smith,ab Shyam Dwaraknath,b and Kristin A. Persson*ac
aDepartment of Materials Science and Engineering, University of California, Berkeley, CA 94720, USA
bEnergy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
cMolecular Foundry, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA. E-mail: kapersson@lbl.gov
First published on 8th December 2020
A broad collection of technologies, including drug metabolism, biofuel combustion, photochemical decontamination of water, and interfacial passivation in energy production/storage systems, relies on chemical processes that involve bond-breaking molecular reactions. In this context, a fundamental thermodynamic property of interest is the bond dissociation energy (BDE), which measures the strength of a chemical bond. Fast and accurate prediction of BDEs for arbitrary molecules would lay the groundwork for data-driven projections of complex reaction cascades and hence a deeper understanding of these critical chemical processes and, ultimately, how to reverse-design them. In this paper, we propose a chemically inspired graph neural network machine learning model, BonDNet, for the rapid and accurate prediction of BDEs. BonDNet maps the difference between the molecular representations of the reactants and products to the reaction BDE. Because of the use of this difference representation and the introduction of global features, including molecular charge, it is the first machine learning model capable of predicting both homolytic and heterolytic BDEs for molecules of any charge. To test the model, we have constructed a dataset of both homolytic and heterolytic BDEs for neutral and charged (−1 and +1) molecules. BonDNet achieves a mean absolute error (MAE) of 0.022 eV for unseen test data, significantly below chemical accuracy (0.043 eV). Besides its ability to handle complex bond dissociation reactions that no previous model could consider, BonDNet distinguishes itself even when only predicting homolytic BDEs for neutral molecules; it achieves an MAE of 0.020 eV on the PubChem BDE dataset, a 20% improvement over the previous best-performing model. We gain additional insight into the model's predictions by analyzing the patterns in the features representing the molecules and the bond dissociation reactions, which are qualitatively consistent with chemical rules and intuition. BonDNet is just one application of our general approach to representing and learning chemical reactivity, and it could easily be extended to the prediction of other reaction properties in the future.
A bond dissociation reaction can be categorized into one of two types: homolysis,

A:B → A⋅ + B⋅   (1)

in which each fragment retains one of the two bonding electrons, and heterolysis,

A:B → [A:]− + [B]+   (2)

in which both bonding electrons are retained by a single fragment. Bond dissociation also occurs in charged molecules; for example,

[A:B]− → A⋅ + [B⋅]−   (3)
is the counterpart of eqn (1) for a −1 charged molecule. Whatever the bond dissociation type and the molecular charges are, the BDE is calculated as the energy change between the products and the reactant, ΔE = E(A) + E(B) − E(AB). Historically, reaction enthalpy has been used, as these values have been tabulated in textbooks;15,16 recently, however, the Gibbs free energy has become prevalent in the chemical literature.17–20 Quantum mechanical computational chemistry methods like density functional theory (DFT) are well suited to calculate a large (∼10³ to 10⁴) but still limited number of BDEs with high accuracy.21,22 However, they become too computationally demanding to be widely adopted for chemical design of real-system reaction cascades,23 where millions of BDEs need to be calculated to screen for appropriate molecules or reactions. Machine learning models could be a promising alternative to provide orders of magnitude faster predictions without a significant sacrifice in accuracy. Contemporary machine learning methods, especially deep learning, have already demonstrated success in solving many chemistry problems, such as retrosynthesis planning,4–6,24 reaction product prediction,25–28 molecule generation,29–32 and molecular property prediction.33,34 The most crucial component of chemical machine learning models is a suitable molecular representation to extract information relevant to the problem of interest. Conventional approaches utilize feature engineering to encode variable-size molecules as fixed-length vectors that emphasize particular aspects of molecules deemed important for a property while ignoring others.35–38 However, these manually crafted molecular representations are not easily transferable to new problems. More recently, molecular representations have been automatically generated using graph neural network (GNN) methods.39–43 GNNs treat molecules as graphs and learn molecular representations from data via message passing between atoms and bonds. Models based on GNNs can significantly outperform conventional methods that rely on manual feature engineering.44–46
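As a concrete illustration of the definition ΔE = E(A) + E(B) − E(AB), the short Python sketch below computes a BDE from total energies; the energy values and the helper function are illustrative placeholders, not data from this work.

```python
HARTREE_TO_EV = 27.211386  # CODATA conversion factor

def bde(e_reactant: float, e_fragment_a: float, e_fragment_b: float) -> float:
    """BDE as the energy change of the products relative to the reactant,
    ΔE = E(A) + E(B) − E(AB), in the same units as the inputs."""
    return e_fragment_a + e_fragment_b - e_reactant

# Hypothetical DFT total energies (hartree) for AB, A, and B:
print(bde(-115.0, -75.2, -39.6) * HARTREE_TO_EV)  # -> ~5.44 eV
```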
For the prediction of BDEs, there already are several machine learning models relying on molecular representations either from manual feature engineering or GNNs. Early works restricted themselves to extremely small datasets of one or two bond types and trained simple learning algorithms such as polynomial fitting47 and support vector machines48 on manually crafted features. More recent works have leveraged high-throughput DFT calculations to generate larger BDE datasets of various bond types and have adopted neural networks as the learning algorithm. Qu et al.2 trained an associative neural network (ANN) on ∼12000 BDEs for molecules made up of C, H, O, N, and S atoms, achieving a mean absolute error (MAE) of 0.145 eV. St. John et al.3 trained ALFABET (a GNN model) on ∼290000 BDEs for molecules made up of C, H, O, and N atoms, achieving an MAE of 0.025 eV. These MAEs are close to or even below the chemical accuracy of 0.043 eV (i.e. 1 kcal mol−1).49 Despite their successes, current machine learning models for BDE prediction still suffer from two interdependent limitations. First, these models assume particular states of the products (e.g. neutral charge) and predict BDEs from the reactants by specifying the breaking bonds, without considering feature updates of the products. Second, these models are only applicable to the homolysis of neutral molecules as in eqn (1); heterolysis (eqn (2)) and bond dissociation of charged molecules (eqn (3)) are beyond their capabilities. This is likely due to the lack of publicly accessible BDE datasets for charged molecules or heterolytic bond dissociation. Unlike the homolysis of neutral molecules (eqn (1)), where the two products exhibit the same charge, cleaving a neutral molecule heterolytically (eqn (2)) or a charged molecule homolytically (eqn (3)) will yield products of different charges. Depending on which product possesses which charge, there might be several different possible ways for the bond to break, and thus several different values for the BDE. Without explicitly including product information, it is impossible for a model to distinguish between these different possible reactions.
In this paper, we overcome these limitations and propose a general GNN model, Bond Dissociation Network (BonDNet), capable of predicting both homolytic and heterolytic BDEs for molecules of any charge. In addition to the atom and bond features widely used in previous GNNs for molecules,39–43 BonDNet adds global features50,51 to encode molecule-level information. Specifically, the total charge of a molecule is included as a global feature to distinguish molecules with the same connectivity but different charge. BonDNet then takes the differences of the atom, bond, and global features between the products and the reactant to represent a bond dissociation reaction.52–54 We show that these chemically inspired difference features assist the model in learning better representations of bond dissociation reactions, and thus, even when only predicting homolytic BDEs of neutral molecules, BonDNet surpasses previous models by a considerable margin. We trained BonDNet on a novel dataset consisting of both homolytic and heterolytic BDEs for neutral and charged (−1 and +1) molecules. The model achieves an MAE of 0.022 eV for unseen BDEs in this complex dataset, approaching the accuracy of the DFT method used to generate the data. Finally, we demonstrate how chemical insight can be extracted from BonDNet by analyzing the features representing the molecules and the reactions. An interface to use the developed model for the prediction of BDEs is provided via binder55 and can be accessed at https://github.com/mjwen/bondnet.
BonDNet has two major components. The first is a graph-to-graph (g2g) module that takes a molecular graph as input and yields the same molecular graph but with updated atom, bond, and global features. The g2g module is applied multiple times to learn better molecular representations from the data. The second component is a graph-to-property (g2p) module. Taking as input the molecular representations learned by the g2g module, the g2p module constructs chemically inspired representations of reactions and maps them to chemical properties (in this work, BDEs). In this section, we first provide a thorough discussion of the two components and then briefly introduce the input features and how the model is trained.
A schematic illustration of the g2g module of BonDNet is shown in Fig. 1a. First, each bond feature vector e_k is updated from the feature vectors of the two atoms participating in the bond, v_rk and v_sk, the global feature vector u, and the current bond feature vector:
e_k ← ReLU(W_e[v_rk ‖ v_sk ‖ e_k ‖ u] + b_e)   (4)

Each atom feature vector v_i is then updated from its current value, the global feature vector, and the aggregated features of the bonds it participates in:

v_i ← ReLU(W_v[v_i ‖ ē_i ‖ u] + b_v)   (5)

ē_i = (1/|N(i)|) Σ_{k∈N(i)} e_k   (6)

where ‖ denotes vector concatenation, N(i) is the set of bonds connected to atom i, and the W and b are learnable weight matrices and bias vectors. Finally, the global feature vector is updated from the aggregated atom and bond features and its current value:

u ← ReLU(W_u[mean_i(v_i) ‖ mean_k(e_k) ‖ u] + b_u)   (7)
The bond feature update in eqn (4) and the atom feature update in eqn (5) pass messages based on the connectivity of the molecular graph, and the information exchange is thus inherently localized. To enable long-range interactions, we can compose multiple g2g modules together, taking the output of one module as the input for another module. For example, with four stacked g2g modules, the hydrogens of the H2CO3 molecule shown in Fig. 1a can interact because each g2g module lets an atom “see” other atoms one bond away from it. However, this is not realistic for large molecules where a large number of g2g modules would be needed to let all atoms interact as such an approach would make the model too deep to be effectively trained. This long-range interaction problem is addressed by the global feature update in eqn (7). In addition to encoding molecule-level information, the global features also serve as a central memory bank to facilitate long-range interaction. The bond and atom messages are aggregated to the memory bank in each g2g module and then disseminated from the memory bank to all bonds and atoms in the next g2g module. As a result, starting from the second g2g module, an atom or a bond can interact with all other atoms and bonds in the molecule. Our tests show that, with the help of the global features, three g2g modules are sufficient to learn good molecular representations.
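For concreteness, the following PyTorch sketch implements one g2g update in the spirit of eqns (4)–(7). The equal feature dimensions, the mean aggregations, and all layer names are simplifying assumptions on our part, not the exact BonDNet implementation.

```python
import torch
import torch.nn as nn

class G2GModule(nn.Module):
    """One graph-to-graph update following eqns (4)-(7).
    Assumes atom, bond, and global features share dimension `dim`."""

    def __init__(self, dim: int):
        super().__init__()
        self.bond_update = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU())    # eqn (4)
        self.atom_update = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())    # eqn (5)
        self.global_update = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())  # eqn (7)

    def forward(self, v, e, u, bonds):
        # v: (n_atoms, dim) atom features; e: (n_bonds, dim) bond features;
        # u: (dim,) global features; bonds: (n_bonds, 2) endpoint atom indices.
        n_atoms, n_bonds = v.shape[0], e.shape[0]
        # eqn (4): update each bond from its two atoms, itself, and the global state
        e = self.bond_update(
            torch.cat([v[bonds[:, 0]], v[bonds[:, 1]], e, u.expand(n_bonds, -1)], dim=1)
        )
        # eqn (6): mean-aggregate updated bond features onto their endpoint atoms
        idx = bonds.reshape(-1)
        agg = torch.zeros_like(v).index_add_(0, idx, e.repeat_interleave(2, dim=0))
        deg = torch.zeros(n_atoms).index_add_(0, idx, torch.ones(2 * n_bonds))
        agg = agg / deg.clamp(min=1).unsqueeze(1)
        # eqn (5): update each atom from itself, its aggregated bonds, and the global state
        v = self.atom_update(torch.cat([v, agg, u.expand(n_atoms, -1)], dim=1))
        # eqn (7): update the global "memory bank" from mean atom and bond features,
        # which enables long-range interaction in the next stacked module
        u = self.global_update(torch.cat([v.mean(dim=0), e.mean(dim=0), u]))
        return v, e, u
```

Stacking several such modules, as described above, then lets every atom and bond exchange information through the global state from the second module onward.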
Our approach is illustrated in Fig. 1b. First, we stack multiple g2g modules together (a later module takes as input the output of a former module) and apply them to the reactant and products to obtain a better molecular representation for each of them. The number of g2g modules is determined via a hyperparameter search based on the model performance on the validation set (see Table S4 in the ESI†). We then take the differences of the features of each atom, each bond, and the global state between the products and the reactant:
Δv′_i = v′_i(products) − v′_i(reactant)   (8)

and likewise for each bond (Δe′_k) and the global state (Δu′). The difference features for all atoms and all bonds are then pooled into fixed-length vectors Δv′ and Δe′, which are concatenated with Δu′ to form the reaction representation

r = Δv′ ‖ Δe′ ‖ Δu′,   (9)

which the g2p module maps to the BDE.
The key aspect of our approach is to represent a bond dissociation reaction with difference features. Operating on the difference features has several advantages. First, they are obtained by subtracting the features of the reactant from those of the products, mirroring how a BDE is computed from the energies of the products and the reactant. Second, since atoms and bonds far away from the breaking bond tend to have similar feature values in the reactant and the products,53,54 the difference features deviate significantly from zero only for atoms and bonds near the breaking bond. This enables the model to focus on the breaking bond and its surrounding environment, consistent with the chemical intuition that a BDE depends on the relatively local environment of the bond.
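A minimal sketch of how such difference features might be assembled is given below. The atom/bond index mappings between the reactant and the products, and the mean pooling before concatenation, are bookkeeping assumptions on our part.

```python
import torch

def reaction_features(reactant, products, atom_maps, bond_maps):
    """Difference features of eqns (8) and (9).
    reactant / each product: dict with "v" (n_atoms, d), "e" (n_bonds, d),
    and "u" (d,). atom_maps[j] / bond_maps[j] give, for product j, the
    reactant index of each of its atoms / bonds (assumed bookkeeping)."""
    dv = -reactant["v"].clone()
    de = -reactant["e"].clone()
    du = -reactant["u"].clone()
    for p, amap, bmap in zip(products, atom_maps, bond_maps):
        dv[amap] += p["v"]   # eqn (8) for atoms
        de[bmap] += p["e"]   # eqn (8) for bonds; the broken bond, having no
                             # product counterpart, keeps -e_k of the reactant
        du += p["u"]         # eqn (8) for the global state
    # eqn (9): pool over atoms and bonds (mean, an assumption) and concatenate
    return torch.cat([dv.mean(dim=0), de.mean(dim=0), du])
```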
Each dataset is split into three subsets for training, validation, and testing with a split ratio of 8:1:1. We optimize all model parameters in an end-to-end fashion using the training set, select hyperparameters based on the performance on the validation set, and report results on the test set unless otherwise stated. The model is implemented in Python using DGL66 with the PyTorch67 backend. To facilitate the training, we add a batch normalization (BN) layer68 and a dropout layer69 before the ReLU activation functions in eqn (4), (5) and (7). We train the model with the Adam optimizer to minimize a mean-squared-error loss function with a mini-batch size of 100. The learning rate starts at 0.001 and is reduced by a factor of 0.5 whenever the validation error does not decrease for 50 epochs. The training stops when the validation error does not decrease for 150 epochs, and the optimization is allowed to run for a maximum of 1000 epochs. The optimal hyperparameters are obtained using a grid search and are given in Table S4 in the ESI.†
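This schedule can be reproduced with standard PyTorch utilities. A self-contained sketch follows, with a trivial linear model and random tensors standing in for BonDNet and the BDE data:

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Stand-ins for BonDNet and the BDE data (purely illustrative):
model = torch.nn.Linear(16, 1)
train_x, train_y = torch.randn(1000, 16), torch.randn(1000, 1)
val_x, val_y = torch.randn(100, 16), torch.randn(100, 1)

optimizer = Adam(model.parameters(), lr=0.001)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=50)
loss_fn = torch.nn.MSELoss()

best_val, stale = float("inf"), 0
for epoch in range(1000):                              # at most 1000 epochs
    for i in range(0, len(train_x), 100):              # mini-batch size of 100
        optimizer.zero_grad()
        loss = loss_fn(model(train_x[i:i + 100]), train_y[i:i + 100])
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        val_err = loss_fn(model(val_x), val_y).item()
    scheduler.step(val_err)       # halve the learning rate after 50 stagnant epochs
    if val_err < best_val:
        best_val, stale = val_err, 0
    else:
        stale += 1
        if stale >= 150:          # early stopping after 150 stagnant epochs
            break
```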
The mean absolute errors (MAEs) of BDEs predicted by BonDNet are presented in Table 1. The standard deviations are obtained by running the model five times with different data splits. Also included are the MAEs of two other machine learning models: (1) ALFABET3 for the PubChem BDE dataset and (2) ANN2 for the ZINC BDE dataset. The ALFABET model is a GNN, and the ANN model is an associative neural network trained on manually crafted features. The MAEs of BonDNet are far below the chemical accuracy of 0.043 eV (i.e. 1 kcal mol−1)49 for both the BDNCM and PubChem BDE datasets. Although both BonDNet and ALFABET are GNN models, BonDNet outperforms ALFABET by 20% on the PubChem BDE dataset. BonDNet does not perform as well on the ZINC BDE dataset, with an MAE of about twice the chemical accuracy. One plausible explanation is that the ZINC BDE dataset is much smaller than the other two datasets (it consists of 16626 BDEs, only 5.7% the size of the PubChem BDE dataset); another could be that the reference BDE values in this dataset are less reliable and consistent, perhaps because the molecular geometries are optimized using a semiempirical tight-binding method as opposed to the DFT methods employed for the other two datasets. Nevertheless, BonDNet still achieves a 30% performance boost compared with the ANN model.
| Modela | BDNCM | PubChem | ZINC |
|---|---|---|---|
| BonDNet | 0.0221 ± 0.0026 | 0.0204 ± 0.0002 | 0.1013 ± 0.0076 |
| ALFABET | | 0.0252 | |
| ANN | | | 0.1453 |

a MAEs are reported in eV.
To briefly test the transferability of BonDNet, we applied it to predict the BDEs of a set of 82 drug-like molecules that are much larger than the molecules in the PubChem training set. The MAE for the drug-like molecules is 0.0460 eV, about twice the MAE of 0.0204 eV for the PubChem test set. This is acceptable considering that the drug-like molecules are much larger than those in the PubChem dataset and that the error is still roughly equal to chemical accuracy. See Fig. S2 in the ESI† for individual predictions.
As discussed in Section 1, it is possible to construct a machine learning model for the prediction of homolytic BDEs for neutral molecules based only on the reactants. For example, given the molecular graph G = (E,V) of a reactant (with no global features), we can update the atom and bond features with a procedure similar to the g2g module and then map the updated bond features to BDEs. In fact, ALFABET3 is such a model. In contrast, BonDNet (1) introduces global features to encode molecule-level information and (2) takes advantage of the chemically inspired difference features between the products and the reactant to represent a bond dissociation reaction.
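To make the contrast concrete, a reactant-only readout (in the spirit of ALFABET; our simplified sketch, with an assumed feature size of 64) maps the updated feature vector of the breaking bond directly to a BDE, never touching the products:

```python
import torch
import torch.nn as nn

# Reactant-only readout (sketch): predict a BDE from the breaking bond's
# updated feature vector alone; product states are never represented.
readout = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))

e_breaking = torch.randn(8, 64)   # updated features of 8 breaking bonds (placeholder)
bde_pred = readout(e_breaking)    # (8, 1) predicted BDEs
```

Because such a readout sees only the reactant, it cannot distinguish the multiple charge-partitioned product possibilities of eqns (2) and (3), which is precisely the limitation that BonDNet's difference features remove.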
To determine whether it is the inclusion of global features or the use of difference features that allows BonDNet to perform better than ALFABET, we conducted an ablation analysis by training a reactant-only model and testing its performance on the PubChem BDE dataset. The reactant-only model sits between ALFABET and BonDNet. It is effectively the same as ALFABET except that it includes global features which are not present in ALFABET. Compared with BonDNet, the reactant-only model uses reactant features instead of the difference features between the products and the reactant. (See Table S5 in the ESI† for architectural details of the reactant-only model.) The reactant-only model achieves an MAE of 0.0251 eV for the PubChem BDE dataset, virtually the same as ALFABET (see Table 1), suggesting that the difference features are responsible for the superior performance of BonDNet. In addition to the whole PubChem BDE dataset, we also trained on randomly selected 1/2, 1/4, …, 1/128 subsets. Fig. 2 shows the MAE versus dataset size relation. We see that BonDNet performs better than the reactant-only model across a range of dataset sizes, small or large. The trend suggests that the accuracy of both models can be further improved when more data becomes available.
BonDNet is a general model capable of learning any type of BDEs. To obtain a deeper understanding of its behavior on complex datasets, we provide a more fine-grained performance analysis on the newly generated BDNCM dataset consisting of both homolytic and heterolytic BDEs for neutral and charged molecules.
Fig. 3a shows the BDEs predicted by BonDNet versus the reference values from quantum chemical computations. The predictions closely follow the reference along the diagonal line, yielding good results over a range of BDEs from −5 eV to 20 eV. Fig. 3b shows the distribution of the prediction error, defined as the difference between the predicted BDE and the reference BDE. The prediction errors are tightly localized around 0, although there are a few larger ones, which can be seen more clearly in the inset of Fig. 3b, where the vertical axis is plotted on a log scale. We analyzed the reactions for which the magnitude of the prediction error is larger than 0.43 eV (10 times the chemical accuracy) and found that they can be broadly categorized into two groups. First, some types of reactions are underrepresented in the dataset, and a machine learning model such as BonDNet cannot be expected to provide good predictions for such underrepresented data. Second, most reactions with large prediction errors are more complex than one-bond dissociation; for example, breaking a bond may lead to the spontaneous change of a neighboring single bond into a double bond. Such a change substantially alters the reference BDE, adding complexity that BonDNet is not yet designed to deal with. The reactions with the 10 largest prediction errors are given in Fig. S1 in the ESI,† together with an explanation for each of them.
Table 2 presents the MAEs and bond counts by the type of the breaking bond for both the training set and test set. BonDNet makes predictions almost equally well for all types of bonds in the training set irrespective of their counts. However, this does not mean the model would generalize equally well for unseen data (e.g. the test set) of different bond types. In fact, if a bond type has more instances in the training set, the model can more easily learn the corresponding underlying chemistry; thus, the model would generalize better for unseen data of this bond type. This can be seen from the test set MAEs in Table 2: the MAE decreases in general as the bond counts increase. As a specific example, although the training MAEs for C–O and F→Li+ bonds are almost the same, the test MAE for C–O bonds is only about one third of that for F→Li+ bonds because the dataset has many more BDEs for C–O bonds. This data imbalance problem can be solved by collecting more BDEs for the underrepresented bonds in the future.
| Bond type | MAE (train) | Counts (train) | MAE (test) | Counts (test) |
|---|---|---|---|---|
| C–O | 0.0050 | 17037 | 0.0185 | 2152 |
| C–H | 0.0045 | 12920 | 0.0189 | 1545 |
| C–C | 0.0047 | 11774 | 0.0177 | 1557 |
| O→Li+ | 0.0046 | 3868 | 0.0272 | 474 |
| H–O | 0.0046 | 2313 | 0.0197 | 270 |
| C–F | 0.0049 | 1890 | 0.0269 | 228 |
| C→Li+ | 0.0051 | 1070 | 0.0496 | 138 |
| F→Li+ | 0.0055 | 437 | 0.0539 | 54 |
| O–F | 0.0131 | 75 | 0.0409 | 8 |
| O–O | 0.0137 | 51 | 0.4886 | 5 |
| H–F | 0.0181 | 7 | — | 0 |
| F–F | 0.0031 | 4 | — | 0 |
| H–H | 0.0088 | 4 | — | 0 |

a MAEs are reported in eV; the arrow “→” denotes a coordinate bond.
Next, we assess how BonDNet performs with respect to the reactant charge using C–O bonds as an example. (Results for other bond types are given in Fig. S3 and S4 in the ESI.†) We divide the C–O bonds into three groups according to the charge of the reactants and plot the distribution of the prediction error in Fig. 3c. For all three groups, the prediction error is centered around 0. The prediction error for −1 charged molecules is somewhat more localized than for neutral molecules. As a result, the MAE for −1 charged molecules is smaller than for neutral molecules, as can be seen in Table 3. For the same reason, the MAEs for both −1 charged and neutral molecules are smaller than the MAE for +1 charged molecules. Nevertheless, these differences are not large, and BonDNet is able to accurately predict the BDEs for molecules of different charges. In a similar manner, we assess the performance of BonDNet with respect to the bond dissociation type: homolysis or heterolysis. The difference in the distributions of the prediction error (Fig. 3d) is negligible; the same can be said for the MAEs (Table 3), demonstrating that BonDNet is able to accurately predict both homolytic and heterolytic BDEs.
| | | MAE (eV) | Counts |
|---|---|---|---|
| Charge | −1 | 0.0146 | 787 |
| | 0 | 0.0178 | 890 |
| | +1 | 0.0265 | 475 |
| Dissociation type | Homolysis | 0.0189 | 1373 |
| | Heterolysis | 0.0178 | 779 |
First, we look at the learned representations of the bond dissociation reactions. This gives us an idea of how the model learns to map the inputs to the BDE predictions. For easier visual discovery of patterns, we embed the high-dimensional difference feature vector in eqn (9) for each reaction into a two-dimensional (2D) space using the uniform manifold approximation and projection (UMAP) method.70 Fig. 4 shows the embedding for the BDNCM test set. In general, points that are close together in the 2D embedded space are similar in the original vector space. Therefore, since reactions with similar BDEs are close to each other in the embedded 2D space (Fig. 4a), their feature vectors are similar to each other. Note that all model parameters are optimized in an end-to-end fashion, where the g2g module and the g2p module work together to reproduce the reference BDEs in the training set. Consequently, the feature vectors representing the reactions are adapted in accordance with the BDEs during the training process. Fig. 4b shows that reactions with the same type of breaking bond tend to “cluster” together, but there can be multiple well-separated clusters for each bond type. The former is simply because reactions with the same type of breaking bond are similar to each other, as we would expect. The latter, however, arises because the surrounding environments of the bonds and/or the global states (e.g. total charge) of the molecules differ, such that the model assigns distinctive feature vectors to them despite the shared bond type. These observations suggest that the model “listens” to both the input (e.g. bond type) and the target (BDE) and learns by transforming the feature vectors to align with them.
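Such an embedding can be produced with the umap-learn package. A minimal sketch follows, with random feature vectors and BDEs standing in for the actual BonDNet outputs:

```python
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

# Placeholders: (n_reactions, d) difference features r from eqn (9)
# and their reference BDEs.
reaction_feats = np.random.rand(500, 128)
bdes = np.random.uniform(-5, 20, 500)

# Project the high-dimensional reaction features into 2D.
embedding = umap.UMAP(n_components=2).fit_transform(reaction_feats)

plt.scatter(embedding[:, 0], embedding[:, 1], c=bdes, s=5, cmap="viridis")
plt.colorbar(label="BDE (eV)")
plt.show()
```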
Furthermore, the patterns in the data yield chemical insights that may align with common chemical knowledge or, in some cases, challenge chemical intuition.42,71,72 Such insight provides new perspectives on the data and thus helps us better understand the system under study. For example, in Fig. 4b we see that O–H bonds (pink) are always associated with C–H bonds (dark blue). This means that, despite the unique nature of O–H bonds, the model finds them to be fairly similar to C–H bonds. From the perspective of the learning model, however, this is unsurprising: both O–H and C–H are covalent bonds formed with hydrogen atoms, and, more importantly, unlike other atoms, hydrogen can only form one bond. The behavior of bonds formed with lithium is more interesting. We might expect F→Li+, C→Li+, and O→Li+ bonds to be similar because they are all coordinate bonds involving a lithium ion Li+. This is indeed the case for some F→Li+ bonds, as can be seen in the upper part of Fig. 4b, where the F→Li+ (orange), C→Li+ (red), and O→Li+ (gray) bonds are close to each other. Surprisingly, a fair number of F→Li+ bonds (orange) are deemed more similar to C–F bonds (dark green) than to the other coordinate bonds. There are two major reasons for this counterintuitive behavior. First, both F→Li+ and C–F are bonds formed with F. Second, the F→Li+ bonds span a wide spectrum of BDEs (−0.2 to 21.1 eV in the dataset), and some of them have BDEs close to those of the C–F bonds. Such close BDEs drive the feature vectors of these F→Li+ bonds toward those of the C–F bonds during training. For example, the F→Li+ and C–F bonds in the circle have very similar energies and, accordingly, are close to each other in the embedded 2D space.
In addition to the reaction-level difference features in the g2p module, each bond has its own features in each g2g module. To investigate how the bond features evolve during learning, we calculate the similarity between bond pairs as the Pearson correlation coefficient between their feature vectors and observe how the similarity changes across the layers of BonDNet (a layer here means a g2g module). Taking the fluorine-substituted lithium butylene dicarbonate molecule (F-LBDC) in the BDNCM dataset as an example (Fig. 5b), Fig. 5a shows the heatmap of the bond similarity matrix for various layers of BonDNet. The input bond features only include “whether a bond is in a ring”, “ring size”, and “whether a bond is a coordinate bond” (see Table S2 in the ESI† for more information on input features). As a result, at the input (layer 0) the bonds aggregate into two groups mainly based on the “whether a bond is in a ring” information. Moreover, the bonds in rings (bonds 1, 2, 3, and 4 in Fig. 5b) further separate into two subgroups according to “whether a bond is a coordinate bond”. As the learning proceeds, the bond similarity heatmap develops a distinctive pattern in later layers. For example, were it not for the fluorine substitution, bonds 9 and 11 would exhibit a similarity score of 1 in all layers due to the symmetry of the LBDC molecule. However, bond 11 in layer 3 is more similar to bond 10 (correlation score 0.92) than to bond 9 (correlation score 0.81), in agreement with our chemical intuition that the fluorine atom substantially impacts the properties of its neighboring bonds.
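Such a similarity matrix is straightforward to compute; a sketch with placeholder feature values (np.corrcoef treats each row of the input as one variable, so the result is the matrix of pairwise Pearson correlations between bond feature vectors):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder: (n_bonds, d) feature matrix of one molecule at a given g2g layer.
bond_feats = np.random.rand(12, 64)

# Pairwise Pearson correlation between bond feature vectors.
similarity = np.corrcoef(bond_feats)   # (12, 12), values in [-1, 1]

plt.imshow(similarity, vmin=-1, vmax=1, cmap="coolwarm")
plt.colorbar(label="Pearson correlation")
plt.xlabel("bond index")
plt.ylabel("bond index")
plt.show()
```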
As a comparison, Fig. 5c displays the heatmap of the bond similarity matrix for layer 3 before training the BonDNet model. There is hardly any visual pattern in the heatmap that is in strong agreement with the chemical structure of the F-LBDC molecule. This demonstrates that BonDNet has learned to transform the raw input features into more refined features via the exchange of information among atoms, bonds, and the global state in the g2g module. More importantly, the refined features are in agreement with our understanding of the molecules, suggesting that BonDNet learns to predict the BDE by trying to understand the underlying chemical rules.
BonDNet does not take as input any geometric information of molecules, and thus stereoisomers (e.g. cis/trans isomers) cannot be distinguished. This, however, could be addressed by directly encoding the isomerism information into the atom, bond, and global features without explicitly using the geometric information, which we leave for future investigation.
In essence, BonDNet is a model that represents chemical reactions using molecular features of both the reactants and the products. Therefore, our approach is not limited to predicting BDEs but could be applied to learn other reaction properties such as activation energy, retrosynthetic reactivity, and reaction conditions (e.g. temperature and solvents). Such capabilities would require little to no modification of the current model besides changing the training target to another property of interest. Future generation of large quantitative datasets through high-throughput experimentation and/or quantum computational chemistry methods will thus enable the adoption of BonDNet and similar methods for rapid and accurate prediction of such properties.
Footnotes

† Electronic supplementary information (ESI) available: dataset description, raw input features, model hyperparameters, and additional error analysis. See DOI: 10.1039/d0sc05251e

‡ It should be noted that, if the breaking bond is part of a ring, only one product will be formed. Without loss of generality, we assume the reactant cleaves into two products in this paper.