Marco Anselmi,a Greg Slabaugh,*b Rachel Crespo-Otero*c and Devis Di Tommaso*a
aSchool of Physical and Chemical Sciences, Queen Mary University of London, Mile End Rd, Bethnal Green, London, E1 4NS, UK. E-mail: d.ditommaso@qmul.ac.uk
bDigital Environment Research Institute, Queen Mary University of London, Empire House, 67-75 New Road, London, E1 1HH, UK. E-mail: g.slabaugh@qmul.ac.uk
cDepartment of Chemistry, University College London, 20 Gordon Street, London WC1H 0AJ, UK. E-mail: r.crespo-otero@ucl.ac.uk
First published on 23rd April 2024
Graph Neural Networks (GNNs) have revolutionized material property prediction by learning directly from the structural information of molecules and materials. However, conventional GNN models rely solely on local atomic interactions, such as bond lengths and angles, neglecting crucial long-range electrostatic forces that affect certain properties. To address this, we introduce the Molecular Graph Transformer (MGT), a novel GNN architecture that combines local attention mechanisms with message passing on both bond graphs and their line graphs, explicitly capturing long-range interactions. Benchmarking on MatBench and Quantum MOF (QMOF) datasets demonstrates that MGT's improved understanding of electrostatic interactions significantly enhances the prediction accuracy of properties like exfoliation energy and refractive index, while maintaining state-of-the-art performance on all other properties. This breakthrough paves the way for the development of highly accurate and efficient materials design tools across diverse applications.
In the field of quantum chemistry, the development of Graph Neural Networks (GNNs) has provided a means of computing the properties of molecules and solids without the need to approximate the solution to the Schrödinger equation. Furthermore, compared to other Machine Learning (ML) techniques, they have shown immense potential in the field of chemistry, since they do not require manual feature engineering and perform significantly better than other ML models.19 GNN models represent molecules or crystalline materials as graphs, with a node for each constituent atom and edges for bonds or inter-atomic relationships. By passing messages through the edges, they update the molecular representation and learn the function that maps these graphs to properties obtained from reference electronic structure calculations such as Density Functional Theory (DFT).
There has been rapid progress in the development of GNN architectures for predicting material properties, such as SchNet,10 Crystal Graph Convolutional Neural Network (CGCNN),11 MatErials Graph Network (MEGNet),12 Atomistic Line Graph Neural Network (ALIGNN)13 and similar variants.14–18,20–25 These models consider only the pairwise interactions between bonded atoms or between atoms within a cut-off radius of typically 6 Å to 8 Å. Some have also incorporated many-body relationships, such as bond angles, into the molecular representation.13,14,16,17,20 Nevertheless, all of the GNN models developed so far can be categorised as local methods26 and are limited to analysing only the local environment around atoms or to relying on multiple message passing layers to approximate long-range interactions.
However, for certain systems and/or tasks, long-range interactions can be important. To date, only a few ML models have incorporated electrostatic interactions into their architecture.27–30 One of the most important contributions to the long-range interactions is electrostatics, which, together with van der Waals interactions, defines the non-bonded term in the potential energy (PE) equation (eqn (1)).
| E_PE = E_bonded + E_non-bonded = E_bonded + E_electrostatic + E_vdW | (1) |
Although non-bonded interactions comprise both electrostatic and van der Waals interactions, the latter decays much faster and is therefore less relevant for the purpose of modelling long-range interactions. Thus, to represent long-range interactions, only the electrostatic term is necessary. The electrostatic interaction between two atoms in a structure can be obtained using Coulomb's law, shown in eqn (2).
| E_ij = (1/(4πε0)) · (q_i q_j / r_ij) | (2) |
where q_i and q_j are the charges of atoms i and j, r_ij is the distance between them, and ε0 is the vacuum permittivity.
The other component of the non-bonding energy, the van der Waals interaction, unlike electrostatics, does not have a simplified representation. Obtaining the van der Waals energy would require the calculation of the parameters of the Lennard-Jones potential, making it unsuitable as a descriptor of the pair-wise interaction between atoms in an ML architecture.
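As a concrete illustration of eqn (2), the minimal sketch below computes pairwise Coulomb energies for a small set of point charges. The charges are assumed inputs (e.g. formal or partial charges) and atomic units are used so the 1/(4πε0) prefactor drops out; this illustrates the physics only and is not the descriptor used by MGT, which instead relies on the Coulomb matrix of eqn (5).

```python
import numpy as np

# Minimal illustration of eqn (2): pairwise Coulomb interactions from point
# charges and positions (atomic units, so the 1/(4*pi*eps0) prefactor is 1).
def coulomb_pair_energies(charges, positions):
    """Return an (N, N) matrix of pairwise Coulomb energies (diagonal = 0)."""
    charges = np.asarray(charges, dtype=float)
    positions = np.asarray(positions, dtype=float)
    diff = positions[:, None, :] - positions[None, :, :]   # (N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)                    # (N, N)
    np.fill_diagonal(dist, np.inf)                          # exclude self-interaction
    return charges[:, None] * charges[None, :] / dist

# Example: a toy three-atom system with assumed charges
energies = coulomb_pair_energies([1.0, -1.0, 0.5],
                                 [[0, 0, 0], [0, 0, 1.5], [0, 1.2, 0]])
print(energies.sum() / 2)  # total electrostatic energy (each pair counted once)
```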
With the aim of enhancing GNN architectures akin to ALIGNN13 through the incorporation of long-range interactions, this paper introduces the Molecular Graph Representation (MGR) and the Molecular Graph Transformer (MGT). In this endeavour, the simplified Coulomb interactions that can be obtained from the Coulomb Matrix31 are explicitly included within the MGR and subsequently analysed by the MGT. The MGR splits the graphical representation of the system into three graphs: a local graph (Glocal), a line graph (Gline), and a fully connected graph (Gglobal). The MGT alternates between graph attention layers on Gglobal and graph convolutions on Gline and Glocal to update the molecular representation through non-bonding, three-body and two-body information. Our model is trained on both the MatBench32 and QMOF33 datasets to predict energetic, electronic and vibrational properties of solid-state materials directly from their unrelaxed structures.
In the Message Passing Neural Network (MPNN) framework, each node first aggregates messages from its neighbours (eqn (3)) and then updates its feature vector (eqn (4)):
| m_i^(t+1) = Σ_(j∈N(i)) M_t(h_i^t, h_j^t, e_ij) | (3) |
| h_i^(t+1) = U_t(h_i^t, m_i^(t+1)) | (4) |
where M_t is the message function, U_t is the update function and N(i) is the neighbourhood of node i.
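The following minimal sketch makes eqns (3) and (4) concrete for a toy graph; the message and update functions here are fixed placeholder functions, standing in for the learned neural networks M_t and U_t used in practice.

```python
import numpy as np

# Schematic single message-passing step (eqns (3) and (4)).
# In a trained GNN, M_t and U_t are learned networks; here they are fixed toy
# functions so the example runs without a deep-learning framework.
def mpnn_step(h, edges, e):
    """h: (N, F) node features; edges: list of (i, j) pairs; e: (E, F) edge features."""
    m = np.zeros_like(h)
    for k, (i, j) in enumerate(edges):
        # eqn (3): message from neighbour j to node i, here M_t(h_i, h_j, e_ij) = h_j + e_ij
        m[i] += h[j] + e[k]
    # eqn (4): node update, here U_t(h_i, m_i) = tanh(h_i + m_i)
    return np.tanh(h + m)

h = np.random.rand(4, 8)                   # 4 atoms, 8 features each
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]   # directed edge list
e = np.random.rand(len(edges), 8)          # one feature vector per edge
h_new = mpnn_step(h, edges, e)
```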
Restricting the input to unordered graph-structured data makes all GNNs invariant to permutations of the input. Furthermore, acting only on node and edge features – hi, eij – composed of scalar values, such as atomic numbers for nodes and interatomic distances for edges, makes MPNNs invariant to the operations of the geometric symmetry group E(3) – rotation, inversion and translation. This is a desirable property, since prediction tasks for most scalar properties, such as molecular energy prediction, require E(3) invariance. In contrast, the inclusion of vector graph features and the prediction of vector properties, such as atomic forces, require E(3) equivariance. Recently, a class of models known as equivariant neural networks35–40 has been developed that can act directly on vector inputs, while maintaining equivariance, by using only equivariant operations.
In the context of GNNs, however, the input data structure is a graph. To adapt attention mechanisms to graphs, one can interpret the original self-attention as constructing a fully connected graph over the input tokens and computing interactions between them. Recently, many attempts have been made to adapt attention mechanisms to MPNNs, most notably the Graph Attention Network (GAT) introduced by Veličković et al.46 and GATv2 introduced by Brody et al.,47 which implement an attention function closely following the work of Bahdanau et al.48 while using the multi-headed approach of the transformer. Other works5,49–51 have tried to adapt the transformer directly to MPNNs, with positional encodings being one of the most active research directions.
The construction of the local and line graphs is the same as in the ALIGNN13 representation. The local graph is constructed using a periodic 12-nearest-neighbour methodology, in which an edge is formed between an atom and its 12 nearest neighbours within a cut-off distance of 8 Å. Each atom is then assigned a set of nine features based on its atomic species: (1) electronegativity; (2) covalent radius; (3) valence electrons; (4) group number; (5) electron affinity; (6) first ionization energy; (7) period number; (8) block number; (9) atomic volume. The feature sets are then one-hot encoded to form the feature vectors of the atoms. The edges are instead initialized with the distance between the two atoms that they connect. To form the feature vectors for the edges, a Radial Basis Function (RBF) expansion is used with limits of 0 Å and 8 Å. The local graph can then be defined as Glocal = (h, e), where h are nodes and e are edges between pairs of atoms, and Glocal has associated feature sets H = {h1, …, hi, …, hN} and E = {eij, eik, ekl, emi, …}, where hi is the feature vector given to node i and eij is the feature vector of the edge connecting nodes i and j.
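A minimal sketch of the RBF edge featurization described above is given below, assuming a Gaussian basis with 80 evenly spaced centres between 0 and 8 Å (matching the 80 edge input features listed in Table 1); the exact basis form and width used by the implementation may differ.

```python
import numpy as np

# Gaussian radial basis expansion of an interatomic distance into an edge
# feature vector, with 80 centres spanning 0-8 Angstrom (assumed values).
def rbf_expand(distance, d_min=0.0, d_max=8.0, n_centres=80):
    centres = np.linspace(d_min, d_max, n_centres)
    gamma = 1.0 / (centres[1] - centres[0]) ** 2   # width set by the centre spacing
    return np.exp(-gamma * (distance - centres) ** 2)

edge_feature = rbf_expand(2.35)   # e.g. a 2.35 Angstrom bond
print(edge_feature.shape)         # (80,)
```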
The line graph is derived from the local graph. Each node in the line graph represents an edge in the local graph, and nodes and corresponding edges share the same feature vector, so that any update on a node of the line graph is reflected on the corresponding edge in the local graph. Edges in the line graph correspond to the relationship between pairs of edges in the local graph that have one atom in common, representing a three-body interaction between triplets of atoms, i.e. a bond pair eij, eik and an atom triplet hi, hj, hk where atom hi is the shared atom. The line graph edge features are given by an RBF expansion of the angle formed by two connected local graph edges, shown in green in Fig. 1. The line graph can then be defined as Gline = (e, t), where e are local graph edges and t are angles between connected edges or atom triplets, and Gline has associated feature sets E = {eij, eik, ekl, emi, …} and T = {tijk, tikl, tijm, …}.
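To make the triplet construction concrete, the sketch below computes the angle attached to a line-graph edge from the positions of the three atoms involved, with the shared atom hi as the vertex; this is a hypothetical helper for illustration, not code from the MGT implementation.

```python
import numpy as np

# Angle between two local-graph edges (i, j) and (i, k) that share atom i,
# i.e. the feature attached to the line-graph edge connecting them.
def triplet_angle(r_i, r_j, r_k):
    v1 = np.asarray(r_j) - np.asarray(r_i)
    v2 = np.asarray(r_k) - np.asarray(r_i)
    cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(cos_t, -1.0, 1.0))

# Example triplet with the shared atom at the origin
angle = triplet_angle([0, 0, 0], [1.0, 0, 0], [0, 1.0, 0])
print(np.degrees(angle))   # 90 degrees
# The angle is then RBF-expanded into the 40-dimensional "Angle input features"
# of Table 1, analogously to the distance expansion above.
```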
The full graph is constructed similarly to the local graph. Each node in the full graph represents an atom in the structure, and it shares its latent representation with the corresponding node of the local graph. Edges in the full graph represent an interaction between pairs of atoms, and they are formed between all atoms that are within a cut-off distance from each other. Full graph edge features are derived from the Coulomb matrix31 of the structure, and they represent the Coulomb repulsion between the two different atoms as described by the Coulomb matrix in eqn (5)
| M_ij = 0.5 Z_i^(2.4) for i = j; M_ij = Z_i Z_j/|R_i − R_j| for i ≠ j | (5) |
where Z_i is the atomic number and R_i the position of atom i.
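A minimal sketch of building the Coulomb matrix of eqn (5) from atomic numbers and positions is shown below; the off-diagonal entries are the pairwise Coulomb repulsions from which the full-graph edge features are derived, while the diagonal self-interaction terms are not attached to any edge.

```python
import numpy as np

# Coulomb matrix (eqn (5)) from atomic numbers Z and Cartesian positions R.
def coulomb_matrix(Z, R):
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)              # placeholder to avoid divide-by-zero
    M = Z[:, None] * Z[None, :] / dist       # off-diagonal: Z_i * Z_j / |R_i - R_j|
    np.fill_diagonal(M, 0.5 * Z ** 2.4)      # diagonal: 0.5 * Z_i^2.4
    return M

# Example: a water-like geometry (O at the origin, two H atoms)
M = coulomb_matrix([8, 1, 1], [[0, 0, 0], [0.96, 0, 0], [-0.24, 0.93, 0]])
```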
The modifications to the feature vectors follow a global-to-local sequence. Initially, the nodes requiring increased attention are identified through the non-bonding two-body interactions of the global graph, and the node features are updated using these same interactions. Subsequently, updates involving three-body interactions are executed using the line graph, followed by updates involving two-body interactions from the local graph. Lastly, updates based on single-body information are performed using only the node information.
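The update order just described can be summarised with the schematic below; the four callables (mha, alignn_update, eggc_update, node_mlp) are placeholders standing in for the learned layers and are not taken from the released MGT code.

```python
# Schematic of one MGT block following the global-to-local sequence described above.
def mgt_block(graphs, feats, mha, alignn_update, eggc_update, node_mlp):
    g_global, g_line, g_local = graphs
    node_feats, edge_feats, angle_feats, coulomb_feats = feats
    # 1) non-bonding two-body interactions: attention over the fully connected graph
    node_feats = mha(g_global, node_feats, coulomb_feats)
    # 2) three-body interactions: update edge (and angle) features via the line graph
    edge_feats, angle_feats = alignn_update(g_line, edge_feats, angle_feats)
    # 3) two-body interactions: update node features via the local (bonded) graph
    node_feats, edge_feats = eggc_update(g_local, node_feats, edge_feats)
    # 4) single-body update using the node information alone
    node_feats = node_mlp(node_feats)
    return node_feats, edge_feats, angle_feats
```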
| Q = W_Q h_i, K = W_K h_j, V = W_V h_j, F = W_F f_ij | (6) |
![]() | (7) |
![]() | (8) |
where the summation runs over all the neighbours of node h_i, and softmax is used to normalize each message along the edges of node h_i such that:![]() | (9) |
The update to each node is then defined as:
![]() | (10) |
![]() | (11) |
![]() | (12) |
![]() | (13) |
![]() | (14) |
which are used to update h_i and e_ij, respectively.
![]() | (15) |
![]() | (16) |
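Because the equation images for eqns (7)–(16) are not reproduced here, the sketch below illustrates only the general mechanism of eqns (6)–(10): single-head scaled dot-product attention restricted to the neighbours of a node, with an additive edge-feature term built from the projections of eqn (6). It is a generic illustration under these assumptions, not the exact MGT formulation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Generic single-head attention over the neighbours of node i with an additive
# edge-feature term, using the projections of eqn (6). Illustrative only.
def attend(h, f, neighbours, W_q, W_k, W_v, W_f, i):
    """h: (N, F) node features; f: dict[(i, j)] -> edge feature; neighbours: dict[i] -> list of j."""
    d = W_q.shape[1]
    q = h[i] @ W_q                                           # query for node i
    scores, values = [], []
    for j in neighbours[i]:
        k = h[j] @ W_k + f[(i, j)] @ W_f                     # key biased by edge features
        scores.append(q @ k / np.sqrt(d))
        values.append(h[j] @ W_v)
    alpha = softmax(np.array(scores))                        # normalized over the neighbours of i
    return (alpha[:, None] * np.array(values)).sum(axis=0)   # aggregated message for node i
```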
| Parameter | Value |
|---|---|
| Encoder layers | 2 |
| MHA layers | 1 |
| ALIGNN blocks | 3 |
| EGGC layers | 1 |
| Atom input features | 90 |
| Edge input features | 80 |
| Angle input features | 40 |
| Coulomb input features | 120 |
| Embedding features | 256 |
| Hidden features | 512 |
| FC layer features | 128 |
| Global pooling function | Average |
| Batch size per GPU | 2 |
| Learning rate | 0.0001 |
113 for the “Formation Energy” task. As reported by Dunn et al.,32 each task has been appropriately curated, according to specific per-task procedures, to ensure that there is no extraneous data and biases in the datasets.
The “jdft2d” task involves predicting the exfoliation energy for separating 2D layers from crystal structures, computed with the OptB88vdW55 and TBmBJ56 exchange–correlation DFT functionals. The “Phonons” task is dedicated to predicting the vibrational properties of crystal structures, computed using the ABINIT code within the harmonic approximation based on density functional perturbation theory. The “Dielectric” task is concerned with predicting the refractive index from crystal structures. The “log10 GVRH” and “log10 KVRH” tasks involve predicting the logarithm (base 10) of the shear (GVRH) and bulk (KVRH) moduli of crystal structures. The “Bandgap” task focuses on the prediction of the electronic bandgap of crystal structures computed with the Perdew–Burke–Ernzerhof57 (PBE) functional. Lastly, the “Perovskites” and “Formation energy” tasks are dedicated to predicting the formation energy of crystal structures, with the “Perovskites” task focusing on perovskites. For all tasks, the train-test splits provided by the MatBench API, which avoid data leakage between the splits, were used. The training set was then divided into train-validation splits of approximately 80% and 20%, respectively, for each task. The performance of the MGT model on the MatBench dataset is shown in Table 2.
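As a rough sketch of how the leakage-free folds are obtained through the MatBench API (assuming the `matbench` Python package; the model call is a placeholder):

```python
from matbench.bench import MatbenchBenchmark

# Load a single MatBench task and iterate over its predefined folds.
# `my_model` is a placeholder for the GNN being benchmarked.
mb = MatbenchBenchmark(autoload=False, subset=["matbench_jdft2d"])
for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        # ... further split train_inputs/train_outputs ~80/20 into train and validation ...
        test_inputs = task.get_test_data(fold, include_target=False)
        # predictions = my_model.fit(train_inputs, train_outputs).predict(test_inputs)
        # task.record(fold, predictions)
```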
| Tasks | jdft2d | Phonons | Dielectric | log10 GVRH | log10 KVRH | Perovskites | Bandgap | Formation E |
|---|---|---|---|---|---|---|---|---|
| Units | meV per atom | cm−1 | Unitless | log10 GPa | log10 GPa | eV per unit cell | eV | eV per atom |
| ALIGNN | 43.4244 (8) | 29.5385 (3) | 0.3449 (13) | 0.0715 (3) | 0.0568 (5) | 0.0288 (2) | 0.1861 (3) | 0.0215 (3) |
| AMMExpress v2020 | 39.8497 (6) | 56.1706 (13) | 0.3150 (6) | 0.0874 (10) | 0.0647 (9) | 0.2005 (12) | 0.2824 (13) | 0.1726 (15) |
| CGCNN v2019 | 49.2440 (13) | 57.7635 (14) | 0.5988 (15) | 0.0895 (11) | 0.0712 (12) | 0.0452 (9) | 0.2972 (14) | 0.0337 (7) |
| coNGN | 36.1698 (5) | 28.8874 (2) | 0.3142 (5) | 0.0670 (1) | 0.0491 (1) | 0.0290 (3) | 0.1697 (2) | 0.0178 (2) |
| coGN | 37.1652 (4) | 29.7117 (4) | 0.3088 (4) | 0.0689 (2) | 0.0535 (2) | 0.0269 (1) | 0.1559 (1) | 0.0170 (1) |
| CrabNet | 45.6104 (9) | 55.1114 (12) | 0.3234 (9) | 0.1014 (14) | 0.0758 (13) | 0.4065 (14) | 0.2655 (12) | 0.0862 (13) |
| DimeNet++ (kgcnn v2.1.0) | 49.0243 (12) | 37.4619 (6) | 0.3400 (12) | 0.0792 (6) | 0.0572 (6) | 0.0376 (8) | 0.1993 (5) | 0.0235 (5) |
| Finder_v1.2 composition only version | 47.9614 (11) | 46.5751 (10) | 0.3204 (8) | 0.0996 (13) | 0.0764 (14) | 0.6450 (16) | 0.2308 (10) | 0.0839 (12) |
| Finder_v1.2 structure based version | 46.1339 (10) | 50.7406 (11) | 0.3197 (7) | 0.0910 (12) | 0.0693 (11) | 0.0320 (4) | 0.2193 (7) | 0.0343 (8) |
| MegNet (kgcnn v2.1.0) | 54.1719 (15) | 28.7606 (1) | 0.3391 (11) | 0.0871 (9) | 0.0668 (10) | 0.0352 (6) | 0.1934 (4) | 0.0252 (6) |
| MODNet (v0.1.12) | 33.1918 (2) | 34.2751 (5) | 0.2711 (1) | 0.0731 (4) | 0.0548 (3) | 0.0908 (10) | 0.2199 (8) | 0.0448 (10) |
| MODNet (v0.1.10) | 34.5368 (3) | 38.7524 (7) | 0.2970 (2) | 0.0731 (5) | 0.0548 (4) | 0.0908 (11) | 0.2199 (9) | 0.0448 (11) |
| RF-SCM/Magpie | 50.0440 (14) | 67.6126 (15) | 0.4196 (14) | 0.1040 (15) | 0.0820 (15) | 0.2355 (13) | 0.3452 (15) | 0.1165 (14) |
| SchNet (kgcnn v2.1.0) | 42.6637 (7) | 38.9636 (8) | 0.3277 (10) | 0.0796 (7) | 0.0590 (7) | 0.0342 (5) | 0.2352 (11) | 0.0218 (4) |
| MGT | 31.4223 (1) | 39.0179 (9) | 0.3047 (3) | 0.0840 (8) | 0.0636 (8) | 0.0361 (7) | 0.2145 (6) | 0.0378 (9) |
| Dummy | 67.2851 (16) | 323.9822 (16) | 0.8088 (16) | 0.2931 (16) | 0.2897 (16) | 0.5660 (15) | 1.3272 (16) | 1.0059 (16) |
The “jdft2d” and “Dielectric” tasks have benefited the most from the inclusion of electrostatic interactions in the MGT model, with improvements of 27% and 12%, respectively, compared to ALIGNN. The “jdft2d” task is related to the prediction of the exfoliation energy of crystal structures, which is the energy required to remove a layer of the material from its surface.58 Since molecular layers are usually connected through weak non-covalent interactions,59–61 as in the case of graphene, the inclusion of non-bonding interactions, such as electrostatics, can benefit GNN models in this task. Inclusion of electrostatics can also benefit the “Dielectric” task, which predicts the refractive index and is affected by the electrostatic interactions between all neighbouring atoms within 12 Å in the structure.
On the other hand, the “Formation energy” task has seen the least benefit from this inclusion, with an MAE of 0.0378 eV per atom for the MGT model compared to 0.0215 eV per atom for ALIGNN. The “Formation energy” task is a regression task for predicting the energy required to form the crystal structure. As the energy needed to form a covalent or ionic bond between two atoms is much greater than the electrostatic interaction between them, the bond energy plays a dominant role in determining the formation energy of a molecule or compound. Therefore, including simplified Coulomb interactions without a filtering function that correctly combines the processed electrostatics with the bonded interactions can have a negative impact on the performance of the model.
The remaining tasks (“Phonons”, “log10 GVRH” and “log10 KVRH”) refer to properties that benefit more from a description of the structure than from its electronic properties. The “Perovskites” and “Bandgap” tasks encounter the same problem as the “Formation energy” task, in which the covalent or ionic bonds play a much more important role in the calculation. Thus, the inclusion of electrostatic interactions can have a slight negative impact on the performance on these tasks.
From the test results on all the tasks in the MatBench dataset, the incorporation of the attention module for long-range interactions shows promising results. While the overall MAE of MGT across all tasks is higher than that of ALIGNN, it ranks fifth out of 16 models, as shown in Table S1 (ESI†), and it still manages to rank among the top 10 best-performing models across all tasks. This demonstrates its broad applicability and competitive performance.
The QMOF33 dataset consists of 20,375 Metal Organic Framework (MOF) structures with results from electronic structure calculations performed using the following functionals: PBE,57 High Local Exchange 2017 (ref. 62) (HLE17), and Heyd–Scuseria–Ernzerhof63 (HSE06) with 10% and 25% of the Hartree–Fock exact exchange. As documented by Rosen et al.,33 the QMOF database includes structures whose chemical elements cover nearly the entire periodic table, facilitating the creation of versatile ML models. Nevertheless, it is worth noting that structures containing Cu, Zn and Cd constitute a large part of the QMOF database, because these are the three most common types of inorganic nodes in the literature. Furthermore, as a result of the curation process, certain types of MOFs are under-represented. A train-validation-test split of 16,000/2000/2375 was applied. The test portion of the dataset was randomly sampled from the whole dataset and kept fixed across all training sessions to ensure comparability of results and to prevent data leakage between training and testing sets. For each training session, the validation and training sets were randomly selected, eliminating potential biases from algorithm/heuristic-based splitting approaches. The performance of MGT on this dataset is presented in Table 3.
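A minimal sketch of this splitting strategy is shown below: a test set drawn once with a fixed seed and reused across runs, with the remaining structures re-shuffled into training and validation sets for each session. The seed and the use of NumPy are illustrative assumptions.

```python
import numpy as np

# Fixed test set shared by all training sessions; train/validation re-drawn per session.
rng_test = np.random.default_rng(seed=0)        # fixed seed -> same test set every run
indices = rng_test.permutation(20_375)
test_idx, pool_idx = indices[:2_375], indices[2_375:]

rng_run = np.random.default_rng()               # fresh randomness for each training session
pool_idx = rng_run.permutation(pool_idx)
val_idx, train_idx = pool_idx[:2_000], pool_idx[2_000:]
assert len(train_idx) == 16_000
```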
| Property | MGT | ALIGNN | CGCNN |
|---|---|---|---|
| Bandgap (eV) | 0.240 | 0.224 | 0.330 |
| HOMO (eV) | 0.263 | 0.245 | 0.361 |
| LUMO (eV) | 0.252 | 0.232 | 0.330 |
The performance on the prediction of bandgap values in the QMOF database is comparable to that observed in the bandgap task within the MatBench dataset. However, in addition to the bandgap predictions we also considered two other properties: the Highest Occupied Molecular Orbital (HOMO) and the Lowest Unoccupied Molecular Orbital (LUMO) energy levels. Given the greater significance of bonded interactions over long-range interactions in determining the HOMO and the LUMO, it is expected that integrating long-range interactions into the model may introduce unnecessary complexity, resulting in the difference of 0.02 eV between the MAEs of ALIGNN and MGT. Additionally, as the bandgap is the difference between the HOMO and LUMO, it follows that the difference in MAE between ALIGNN and MGT mirrors that for the HOMO and LUMO.
Overall, the results on both datasets suggest that the attention module can be a valuable tool for analysing long-range interactions and improving the performance of graph neural networks.
| Component | Number of repetitions ||||
|---|---|---|---|---|
| | 1 | 2 | 3 | 4 |
| MHA | 0.4031 | 0.3981 | 0.3880 | 0.3816 |
| ALIGNN | 0.3224 | 0.2894 | 0.2767 | 0.2609 |
| EGGC | 0.3840 | 0.3328 | 0.3152 | 0.3016 |
Excluding all three modules (MHA, ALIGNN and EGGC), the model has an error of 0.8734 eV; including even just a single MHA layer yields an improvement of at least 54%, bringing the error down to 0.4031 eV, which demonstrates the importance of these layers. Excluding two modules and using just one shows the individual contribution of each layer type. Using only EGGC layers gives an improvement of at least 56% over using no layers, and the performance saturates at 4 layers with an error of 0.3016 eV. Using only one ALIGNN layer improves the error to 0.3224 eV, saturating at 4 layers with an error of 0.2609 eV. Meanwhile, using MHA layers only shows a performance saturation at 4 layers with an error of 0.3816 eV.
The effect of each layer and the coupling between them can also be studied by varying the number of layers while using all modules at the same time. Due to the number of possible configurations and the training time, only a subset of them has been tested. Increasing the number of MHAs within an encoder has almost no effect on the performance of the model; using configurations with 1, 2, 3 and 4 MHAs, the MAEs obtained are 0.2888 eV, 0.2880 eV, 0.2865 eV and 0.2877 eV respectively, showing a very small improvement when using more MHAs, with a performance saturation at 3 layers.
Increasing the number of ALIGNN blocks, on the other hand, has the biggest effect on the QMOF task, with errors of 0.2888 eV, 0.2685 eV, 0.2661 eV and 0.2672 eV using 1, 2, 3 and 4 layers respectively, showing improvements up to 3 layers. Increasing the number of EGGCs, similarly to MHA, also brings small improvements in performance, with errors of 0.2888 eV, 0.2828 eV, 0.2750 eV and 0.2716 eV using 1, 2, 3 and 4 layers respectively. These results were obtained by changing the number of repetitions of each component while keeping the other two components at one. Nevertheless, even when testing all possible combinations, the results (shown in Table S2 in the ESI†) are almost the same.
Changing the number of layers of each module impacts not only the performance but also the size, training time and inference time of the model. Although access to more powerful computers is becoming easier, not everyone has the latest and best computing resources; thus, the decision to add more or fewer layers also depends on their impact on computational requirements. From Fig. 4a it can be seen that the ALIGNN blocks have the biggest impact on the model size, with each block adding 2,630,656 parameters, while the MHA and EGGC modules add 1,052,672 and 1,315,328 parameters respectively. Nevertheless, the module that has the largest impact on training and inference times is the MHA module, as shown in Fig. 4b and c. Each additional MHA layer adds around 20 seconds to the inference time, double that of each additional ALIGNN block and quadruple that of each EGGC layer, which add around 10 seconds and 5 seconds respectively.
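For readers wishing to reproduce this kind of accounting for their own models, a small PyTorch-style sketch is given below; `model`, `batch` and the submodule names are placeholders, not references to the released MGT code.

```python
import time
import torch

# Count trainable parameters per submodule and time a forward pass.
def count_parameters(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Assuming `model` is an instantiated torch.nn.Module and `batch` a prepared input:
# for name, submodule in model.named_children():
#     print(name, count_parameters(submodule))
#
# start = time.perf_counter()
# with torch.no_grad():
#     _ = model(batch)
# print(f"inference time: {time.perf_counter() - start:.1f} s")
```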
Footnote
| † Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00014e |