Open Access Article
Weiyi Gong a, Tao Sun b, Hexin Bai c, Shah Tanvir ur Rahman Chowdhury d, Peng Chu c, Anoj Aryal a, Jie Yu e, Haibin Ling *b, John P. Perdew‡ *ef and Qimin Yan *a
aDepartment of Physics, Northeastern University, Boston, MA 02115, USA. E-mail: q.yan@northeastern.edu
bDepartment of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA. E-mail: hling@cs.stonybrook.edu
cDepartment of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA
dDepartment of Materials Science, Thayer School of Engineering, Dartmouth College, Hanover, NH 03755, USA
eDepartment of Physics, Temple University, Philadelphia, PA 19122, USA
fDepartment of Chemistry, Temple University, Philadelphia, PA 19122, USA
First published on 15th August 2023
In a data-driven paradigm, machine learning (ML) is the central component for developing accurate and universal exchange–correlation (XC) functionals in density functional theory (DFT). It is well known that XC functionals must satisfy several exact conditions and physical constraints, such as density scaling, spin scaling, and derivative discontinuity. However, these physical constraints are generally not incorporated into machine learning models, whether implicitly through model design or through pre-processing of large material datasets. In this work, we demonstrate that contrastive learning is a computationally efficient and flexible method to incorporate a physical constraint, especially when the constraint is defined by an equality, in ML-based density functional design. We propose a schematic approach to incorporate the uniform density scaling property of the electron density for exchange energies by adopting contrastive representation learning during the pretraining task. The pretrained hidden representation is transferred to the downstream task to predict the exchange energies calculated by DFT. Based on the computed electron densities and exchange energies of around 10 000 molecules in the QM9 database, an augmented molecular density dataset is generated using the density scaling property of exchange energy functionals with the chosen scaling factors. The electron density encoder transferred from the contrastive pretraining task predicts exchange energies that satisfy the scaling property, while the model trained without contrastive learning gives poor predictions for the scaling-transformed electron density systems. Furthermore, the model with the pretrained encoder gives satisfactory performance with only small fractions of the whole augmented dataset labeled, comparable to a model trained from scratch using the whole dataset. The results demonstrate that incorporating exact constraints through contrastive learning can enhance the learning of the density–energy mapping by neural network (NN) models with less data labeling, which will help generalize NN-based XC functionals to a wide range of scenarios that are not always accessible experimentally but are theoretically available and justified. This work represents a viable pathway toward the machine-learning design of a universal density functional via representation learning.
Considerable effort has been devoted to leveraging physical constraints in ML of XC functionals. In a previous work by Lei et al.,16 rotationally invariant descriptors were extracted using CNN encoders and projected onto a basis of spherical harmonic kernels. In another work by Hollingsworth et al.,17 it was found that the scaling property, one of the exact conditions that the exchange energy must satisfy, can be utilized to improve the machine learning of XC functionals; that study is, however, limited to one-dimensional systems and lacks generalizability to two- and three-dimensional systems. Machine learning can nevertheless follow human-devised strategies to satisfy exact constraints faithfully, even in three dimensions. This is especially true for semilocal functional forms, such as GGAs and meta-GGAs. In this way, the SCAN meta-GGA,7 which satisfies 17 exact constraints, has been combined with machine learning in the works of Dick and Fernandez-Serra14 and of Nagai, Akashi, and Sugino.18 In these works, the uniform density scaling constraint on the exchange energy functional is satisfied exactly by employing an exchange enhancement factor that is a machine-learned function of semilocal descriptors d(r) that scale to d(γr) when the electron density n(r) scales to γ³n(γr). Ref. 18 preserved many of the exact constraints satisfied by SCAN in a machine-learned functional fitted to data for small molecules. These works suggest that SCAN is close to the limit of what a meta-GGA can achieve, but that meta-GGA accuracy for molecules can still be boosted by machine learning. The approach that we will present here satisfies the uniform density scaling constraint only approximately, but is not limited to human-devised functional forms. More recently, another exact condition – the derivative discontinuity – was incorporated into NN-based XC functional design,19 although that study is again limited to one-dimensional systems. A more recent work has demonstrated that this fundamental limitation can be overcome by training a neural network on molecular data and on fictitious systems with fractional charge and spin,20 and the resulting NN-based functional DM21 demonstrated universality and greatly improved predictive power for molecular energetics and dynamics. Around the time this work was written, schemes incorporating the Lieb–Oxford bound21 and the spin scaling property10 into machine learning density functional design were proposed.22
Many of the previous works use data augmentation to improve model performance by directly increasing the amount of labeled data following a given physical constraint. However, increasing the amount of data is not always possible due to the computational cost. Going beyond data augmentation, self-supervised learning has gained popularity because of its ability to avoid the cost of annotating large-scale datasets. It adopts self-defined pseudo labels as supervision and uses the learned representations for downstream tasks. Self-supervised learning has been widely used in image representation learning23 and natural language processing,24 and has been applied in molecular machine learning.25,26 Specifically, contrastive learning (CL) has recently become a dominant branch in self-supervised learning methods for computer vision, natural language processing, and other domains.27 It aims at embedding augmented versions of the same sample close to each other while trying to push away embeddings from different samples in the representation space. The goal of contrastive learning is to learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart, and the CL process can be applied in both unsupervised and supervised settings.28 In molecular systems, the application of contrastive learning, in conjunction with molecular graph representation,29–31 has emerged as an effective strategy. This method has shown promise in enhancing predictive accuracy and training efficiency, particularly in scenarios where data availability is limited. In this work, we will explore the incorporation of physical constraints in density functional learning through contrastive learning.
One of the most important and fundamental constraints for the exchange energy of an electron system is derived from the principle of uniform scaling.9 Consider an electron density distribution n(r) and a uniformly scaled density

nγ(r) = γ³n(γr)

Uniform scaling preserves the shape of the density, apart from an overall change of length scale. (Unless the origin of r is at the center of electronic charge, scaling also translates that center relative to the origin, from 〈r〉 to 〈r〉/γ.) Several important exact constraints on density functionals can be written using the scaled density. In this work, we focus on the exchange energy Ex[n] and its scaling property:9

Ex[nγ] = γEx[n]
This important constraint is satisfied exactly in almost all human-designed density functionals, whether non-empirical or semi-empirical. As a chemical example, atomic one-electron ions of nuclear charge Z are scaled versions of the hydrogen atom with scale factor γ = Z. The exchange energy in this case, −5Ze²/(16a₀), cancels the Hartree electrostatic interaction of the density with itself. Using this constraint as an important and illustrative example, we propose a schematic approach to incorporate any physical constraint represented by an equality into the NN-based model design via contrastive learning.
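As a minimal numerical illustration of this constraint, the sketch below checks Ex[nγ] = γEx[n] for the LDA exchange functional evaluated on a model Gaussian density; LDA and the Gaussian density are illustrative assumptions here, not the functional or data used in this work.

```python
# Numerical check of E_x[n_gamma] = gamma * E_x[n] using LDA exchange on a model
# Gaussian density (both are illustrative assumptions, not the paper's NN functional).
import numpy as np

C_X = 0.75 * (3.0 / np.pi) ** (1.0 / 3.0)   # LDA exchange constant (Hartree atomic units)

def lda_exchange(n, dv):
    """E_x^LDA[n] = -C_x * integral of n(r)^(4/3), on a uniform grid with voxel volume dv."""
    return -C_X * np.sum(n ** (4.0 / 3.0)) * dv

# Two-electron Gaussian model density on a cubic grid (lengths in bohr)
edge, npts = 20.0, 129
axis = np.linspace(-edge / 2, edge / 2, npts)
dv = (axis[1] - axis[0]) ** 3
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
r2 = X**2 + Y**2 + Z**2
amp = 2.0 / (np.sum(np.exp(-r2 / 2.0)) * dv)                 # normalize to 2 electrons
n = amp * np.exp(-r2 / 2.0)

gamma = 2.0
n_scaled = gamma**3 * amp * np.exp(-(gamma**2) * r2 / 2.0)   # exactly gamma^3 * n(gamma*r)

print(lda_exchange(n_scaled, dv) / lda_exchange(n, dv))      # ~ gamma = 2.0
```

The same one-line ratio check applies to any candidate exchange functional that is supposed to obey the uniform scaling constraint.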
Specifically, we found that traditional supervised learning without data augmentation was not able to incorporate the scaling constraint into the ML functional when the electron density encoder was trained solely on a dataset of unscaled electron densities, as the model failed to extrapolate to scaled densities. To incorporate the scaling constraint, we chose to pre-train an electron density encoder by maximizing the similarity between a molecular electron density and its scaled version with a randomly chosen scaling factor, within SimCLR,32 a widely used framework for contrastive image pretraining. To obtain an encoder that gives similar representations (differing only by a scaling factor) for scaled and unscaled electron densities, we added a scaling factor predictor component to the framework. The pre-trained encoder was then transferred to the downstream task of predicting the exchange energies of scaled electron densities of molecular systems. We compared the model performance of this method with that of supervised learning with data augmentation. The model pretrained using contrastive learning makes predictions that are more consistent with the scaling relation and outperforms the supervised learning model in predicting exchange energies. We will show that contrastively learned encoders can encode molecular electron density at lower labeling cost: fine-tuning with only a small percentage of labeled data gives predictions comparable to those of a model trained on the whole labeled dataset by supervised learning. This shows that contrastive learning using constraints can enhance the understanding of DFT by neural network models with a small amount of labeled data, while generalizing the application of NN XC functionals to a wide range of scenarios that are not always accessible experimentally but are theoretically available and justified.
Electron density matrices were first computed for around 10 000 molecular systems following the procedures described in the Methods section. These matrices were then projected onto a grid of size (129, 129, 129) within a cube of edge length 40 angstroms to create the unscaled density n(r), with the center of mass located at the center of the cube. To generate the scaled density nγ(r), the uniform density scaling constraint nγ(r) = γ³n(γr) was applied by taking the value of n(r) at γr and multiplying it by γ³. Inherent to this choice of representation for the electron densities, using a larger number of grid points generally leads to improved model performance. However, a balance must be struck between model performance and computational cost, as the storage requirement for the volumetric data and the training time of the model increase rapidly with the grid size. To partially mitigate this issue, we implemented a down-sampling technique: the input data on the (129, 129, 129) grid are passed to a fully connected linear layer with a rectified linear unit (ReLU) activation to create data on a (65, 65, 65) grid. The down-sampled data retain more information than data obtained by projecting the density matrix directly onto a (65, 65, 65) grid. A comparison of model performance with and without down-sampling is provided in the ESI.†
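As an illustration of this data-preparation step, the following sketch builds nγ(r) = γ³n(γr) from a density sampled on the (129, 129, 129) grid by trilinear interpolation and then reduces it to (65, 65, 65); the helper names and the use of average pooling (rather than the learned linear + ReLU layer described above) are simplifying assumptions.

```python
# Sketch: generate the scaled density n_gamma(r) = gamma^3 * n(gamma*r) on the same cube
# and down-sample 129^3 -> 65^3. Helper names are illustrative; the learned linear+ReLU
# down-sampling described in the text would replace the average pooling used here.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.interpolate import RegularGridInterpolator

def scale_density(n, edge_length, gamma):
    """n: (d, d, d) density on a cube of the given edge length centered at the origin."""
    d = n.shape[0]
    axis = np.linspace(-edge_length / 2, edge_length / 2, d)
    interp = RegularGridInterpolator((axis, axis, axis), n,
                                     bounds_error=False, fill_value=0.0)
    X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
    pts = np.stack([gamma * X, gamma * Y, gamma * Z], axis=-1)   # sample n at gamma * r
    return gamma**3 * interp(pts)

def downsample_129_to_65(n):
    """Average-pool a (129, 129, 129) volume to (65, 65, 65), keeping the origin on-grid."""
    t = torch.as_tensor(n, dtype=torch.float32)[None, None]      # shape (1, 1, 129, 129, 129)
    return F.avg_pool3d(t, kernel_size=3, stride=2, padding=1)[0, 0].numpy()

# Example usage, assuming n129 is a (129, 129, 129) density on a 40 angstrom cube:
# n65_scaled = downsample_129_to_65(scale_density(n129, 40.0, gamma=0.5))
```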
In this work, we intend to design a pre-training task such that the electron density encoder is aware of the uniform density scaling property. To do so, unscaled and scaled electron densities on a fixed-size spatial grid are generated using the PySCF code33 at low computational cost, represented as three-dimensional arrays xi, x̃iγ ∈ ℝ^(d×d×d), where the scaling factor γ is chosen from five different scales: 1/3, 1/2, 1, 2, and 3. The scaled density is then translated randomly in three-dimensional space to incorporate translational symmetry. We included translational symmetry because our uniform density scaling translates the center of electronic charge, but this additional constraint was not found to be numerically important (Table 1). Electron density arrays are encoded as hidden representations hi = f(xi), h̃iγ = f(x̃iγ) ∈ ℝ^m through the density encoder, a mapping f: ℝ^(d×d×d) → ℝ^m to be learned. The hidden representations are then projected as a set of points zi = g(hi) ∈ ℝ^n on a high-dimensional unit sphere by a mapping g: ℝ^m → ℝ^n (n < m) that is a multilayer perceptron (MLP). For a batch of N molecules, the output Z ∈ ℝ^(2N×n) contains the projected representations of the unscaled and scaled densities. We then calculate the normalized temperature-scaled cross-entropy (NT-Xent) loss,32 defined as:
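Following the SimCLR definition in ref. 32 (reproduced here for completeness, with the notation above and τ a temperature hyperparameter), the NT-Xent loss for a positive pair (i, j) of projected representations is

```latex
\ell_{i,j} = -\log
  \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
       {\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]}\,\exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)},
\qquad
\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert},
```

and the total contrastive loss averages ℓi,j over all positive pairs (the unscaled and scaled densities of the same molecule) in the batch.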
Table 1 MAE on the test set for the two learning approaches and different train/validate splits

| Train/validate split | MAE on test set (eV), unscaled (size = 1000, γ = 1) | MAE on test set (eV), scaled (size = 5000, γ = 1/3, 1/2, 1, 2, 3) |
|---|---|---|
| Supervised learning with data augmentation | | |
| 40 000/5000 | 0.481 | 0.757 |
| Contrastive + transfer learning | | |
| 40 000/5000 | 0.461 | 0.739 |
| 32 000/5000 | 0.505 | 0.874 |
| 24 000/5000 | 0.561 | 0.932 |
| 16 000/5000 | 0.738 | 1.070 |
| 8000/5000 | 0.973 | 1.289 |
In the original SimCLR framework,32 augmented and unaugmented views of the same input form positive pairs, while those of different inputs form negative pairs. We emphasize that, without any module added to distinguish positive pairs, the trained encoder would be too "lazy" to learn different representations for the two "views" of the same input, since the simplest mapping f that minimizes the loss learns the same hidden representation for the augmented and unaugmented input from the same molecule, i.e. it satisfies h̃iγ = f(x̃iγ) = f(xi) = hi. Therefore, a module predicting the scaling factor from the two hidden representations of the same molecule is added to distinguish the scaled density data from the unscaled data. The final loss of the contrastive pretraining task is the sum of the contrastive loss and the scaling factor prediction loss. The workflow of the pretraining task is shown in Fig. 1(a).
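A compact sketch of this combined objective is given below: an NT-Xent contrastive term over the 2N projected representations plus a mean-squared error on the predicted scaling factor. The module and argument names (ScalePredictor, tau, etc.) are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch of the combined pretraining objective: SimCLR-style NT-Xent contrastive loss
# plus an MSE loss on a predicted scaling factor (names are illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.1):
    """z1, z2: (N, n) projected representations of unscaled / scaled densities."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, n), points on the unit sphere
    sim = z @ z.t() / tau                                # cosine similarities scaled by 1/tau
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity terms
    n2 = z.shape[0]
    targets = torch.arange(n2, device=z.device).roll(n2 // 2)   # positive of i is i +/- N
    return F.cross_entropy(sim, targets)

class ScalePredictor(nn.Module):
    """Predict the scaling factor gamma from a pair of hidden representations."""
    def __init__(self, m):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * m, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, h, h_scaled):
        return self.net(torch.cat([h, h_scaled], dim=1)).squeeze(-1)

def pretraining_loss(z, z_scaled, h, h_scaled, gamma, scale_predictor):
    contrastive = nt_xent_loss(z, z_scaled)
    gamma_pred = scale_predictor(h, h_scaled)
    return contrastive + F.mse_loss(gamma_pred, gamma)   # final loss = sum of the two terms
```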
The cosine similarity of the learned projected representations z and z̃ for a batch of 32 molecules is shown in Fig. 2(a). As expected, the cosine similarity is maximal for positive pairs (unscaled and scaled densities of the same molecule), while it is close to zero for negative pairs (densities of different molecules). As shown in Fig. 2(b), we further verify that the projected representations of different molecules are well separated from each other by computing the t-distributed stochastic neighbor embedding (t-SNE). In Fig. 2(c), two example molecules, their learned projected representations, and the predictions of the scaling factors are shown. The best model achieves a contrastive loss of 0.01976 and a mean squared error of 2 × 10⁻⁴ for scaling factor prediction.
Within the data-driven paradigm, the mapping from molecular electron density to exchange energy is learned directly in a supervised manner by feeding electron densities to an electron density encoder, with the corresponding exchange energies calculated from first-principles calculations as the learning targets. The electron density in three-dimensional space is represented by a three-dimensional array, with the dimension along each axis equal to the grid dimension along that axis. Encoding and decoding of volumetric data in three-dimensional space have been previously studied in 3D-UNet,34 with a DoubleConv layer consisting of two consecutive 3D convolutional layers as the building block. In the same 3D-UNet framework, residual networks can be used instead of DoubleConv as the building block to extract useful information from raw three-dimensional volumetric data.35 In this work, only the mapping from electron density to exchange energy is learned, so only the encoder part of 3D-UNet is adopted. The encoder consists of several connected building block layers, either DoubleConv or ResNet (see Methods). Because ResNet outperforms DoubleConv for our learning tasks, as shown in the ESI,† we chose ResNet as the building block of the encoder.
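A minimal sketch of such a ResNet-based 3D encoder is shown below, using the feature-map sizes (16, 32, 64, 128) quoted in the Methods; the kernel sizes, normalization, and pooling choices are assumptions for illustration, not the exact architecture of Fig. 1(b).

```python
# Sketch of a 3D convolutional encoder with residual building blocks mapping a
# (1, d, d, d) electron-density volume to an m-dimensional hidden representation.
# Layer details are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv3d(c_in, c_out, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(c_out, c_out, kernel_size=3, padding=1)
        self.skip = nn.Conv3d(c_in, c_out, kernel_size=1)   # match channels for the residual
        self.norm1, self.norm2 = nn.BatchNorm3d(c_out), nn.BatchNorm3d(c_out)
        self.act = nn.ReLU(inplace=True)
    def forward(self, x):
        y = self.act(self.norm1(self.conv1(x)))
        y = self.norm2(self.conv2(y))
        return self.act(y + self.skip(x))

class DensityEncoder(nn.Module):
    """Map a (B, 1, d, d, d) density volume to a (B, m) hidden representation."""
    def __init__(self, channels=(16, 32, 64, 128), m=256):
        super().__init__()
        blocks, c_prev = [], 1
        for c in channels:
            blocks += [ResBlock3D(c_prev, c), nn.MaxPool3d(2)]   # halve resolution per stage
            c_prev = c
        self.blocks = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(channels[-1], m)
    def forward(self, x):
        h = self.pool(self.blocks(x)).flatten(1)
        return self.fc(h)
```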
The architecture of the encoder is shown in Fig. 1(b). A hidden representation that captures the density–energy correlation is learned and fed to a subsequent fully connected prediction layer to give a single-value prediction of the exchange energy. The original electron densities of the molecules (with a scaling factor equal to one) are included in the dataset. For reliable evaluation of the models, the dataset is split into 80%, 10%, and 10% as training, validation, and testing datasets, containing 8000, 1000, and 1000 unscaled data, respectively. The training set is used to train the model for 500 epochs by minimizing the mean squared error (MSE) loss, and the model performance is then evaluated on the test set using the mean absolute error (MAE) as the measure.
To investigate whether the model trained with only unscaled densities understands the uniform density scaling property, we test its performance on both unscaled and scaled density datasets. As shown in Fig. 3(a), the difference in energy between predictions and targets on the unscaled dataset is close to 0.45 eV on average. Instead of minimizing this prediction error for unscaled electron density by improving existing learning frameworks, the focus in this work is to demonstrate the role of contrastive learning in the process of incorporating physical constraints in density functional design. A clear observation is that the model does not provide reasonable predictions for the exchange energies of the scaled density dataset. This indicates that the models trained in a supervised manner using only unscaled density in general do not satisfy the uniform density scaling property and thus give unreliable predictions for scaled densities, although they may achieve very high accuracy on the unscaled density dataset. This motivates us to apply contrastive learning in a pretraining task to give our model the ability to understand the density scaling property. As shown in Fig. 3(b), the model trained by our approach provides reasonable predictions even for scaled electron density. This shows the capability of the model trained by our approach to obey the uniform density scaling property.
We compare the two learning approaches using training/validation/testing splits of 40 000/5000/5000, 32 000/5000/5000, 24 000/5000/5000, 16 000/5000/5000 and 8000/5000/5000. As shown in Table 1, using the same dataset, our approach outperforms supervised learning with data augmentation in terms of exchange energy prediction accuracy, as demonstrated by smaller mean absolute errors (MAE) after fine-tuning with the same amount of training data (40 000). Based on this set of results, we estimate that the two learning approaches would achieve the same prediction accuracy with a training dataset of size between 32 000 and 40 000. This demonstrates that our contrastive learning model can reduce the need for a large amount of labeled data while achieving comparable performance.
Furthermore, the model trained with the contrastive learning method predicts exchange energies that satisfy the uniform density scaling property. As shown in Fig. 4, predicted and target exchange energies demonstrate a strong linear correlation even when the number of training data is decreased. Note that for the case of 8000 training data, the model uses the same amount of training data as the supervised learning task in the previous section. The dramatic difference in performance between the models shown in Fig. 3 demonstrates that the uniform scaling property is captured only by our proposed models. Because the same uniform grids are used for both scaled and unscaled densities, when the electron densities are "squeezed", the number of effective grid points with finite density values decreases. As a result, the prediction accuracy for the scaled electron densities with γ > 1 is in general worse. Note that the model prediction accuracy can be further improved by using nonuniform density grids or by representing the electron densities with a set of local orbitals.14 Alternatively, this can be addressed in future studies by learning the exchange energy directly from density matrices instead of from a projected uniform grid with limited resolution.
Note that our contrastive learning model can interpolate: it predicts exchange energies satisfying the uniform scaling property for electron densities with random scaling factors that are not present in the training data, demonstrating that the approach generalizes to scaling factors not seen during training. The details of the interpolation test are provided in the ESI.†
We expect that the MAE can be further reduced by increasing the number of training steps or using a larger dataset. Our model currently includes only 10 000 molecules from the QM9 dataset; adding more molecules would improve the prediction accuracy. The dataset we used, which is based on a large cubic grid of dimensions (129 × 129 × 129), is approaching the limits of our computational power. According to the DM21 study,20 using a different type of grid, such as the Treutler grid,36 might improve our model's efficiency and accuracy.
In the current study, our methods yield non-self-consistent predictions: energy values are obtained directly from the electron densities, without adjusting or repeating the predictions until they converge to a required accuracy. A method for incorporating ML into self-consistent predictions was suggested in a recent work,37 in which a machine learning module that predicts the exchange–correlation energy from electron densities works in combination with other modules that solve the Kohn–Sham equations and extract kinetic energies. In this framework, the exchange–correlation potential is obtained by taking the functional derivative of the predicted exchange–correlation energy functional through automatic differentiation. The self-consistent energy can therefore be obtained by following the workflow of ref. 37. From this point of view, the development of accurate and efficient machine learning models for predicting the energies of electron densities can greatly accelerate self-consistent Kohn–Sham calculations.
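As a sketch of how such a self-consistent loop could use a density-to-energy network like ours, the exchange–correlation potential on a uniform grid can be approximated by automatic differentiation of the predicted energy with respect to the gridded density; the helper below is an illustrative assumption, not the implementation of ref. 37.

```python
# Sketch: v_xc(r) ~ delta E_xc / delta n(r) from a density-to-energy NN by autodiff.
# On a uniform grid the functional derivative is the gradient divided by the voxel volume.
import torch

def xc_potential(model, density, voxel_volume):
    """density: (1, 1, d, d, d) tensor; model: NN mapping a density volume to a scalar E_xc."""
    density = density.clone().requires_grad_(True)
    e_xc = model(density).sum()
    grad, = torch.autograd.grad(e_xc, density)
    return grad / voxel_volume          # v_xc sampled on the same grid as the density
```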
The method proposed in this study is most effective for incorporating constraints formulated as equalities. This presents an opportunity for incorporating more equality-based constraints among the 17 exact constraints. For instance, the spin scaling property,10 size-extensivity property, and the second-order gradient expansion38 are likely candidates for applying our proposed contrastive learning method to push forward the development of universal and accurate machine learning density functionals.
A similar effect occurs with human-designed density functionals: those that are constructed to satisfy more exact constraints require fewer fit parameters that can be determined from smaller sets of molecular data, and a nonempirical meta-GGA functional7 satisfying 17 exact constraints can perform rather well without any fitting to molecular data. The improvement of generalized gradient approximations (GGAs) or meta-GGAs by their global hybridization39 with exact exchange is a good example, since the exact constraints on the underlying GGA or meta-GGA are preserved for any value of the fraction of exact exchange that is mixed with a complementary fraction of GGA or meta-GGA exchange.
We selected around 10 000 molecules from the QM9 dataset40,41 by imposing the following criteria: (i) each molecule contains fewer than 20 atoms; (ii) no molecule contains atoms with an atomic number larger than 36 (element Kr); (iii) the size of each molecule is less than 12 angstroms; and (iv) the DFT-calculated exchange energy of the molecule is greater than −200 eV. Molecular density matrices are calculated by DFT with the PBE functional3 as implemented in the PySCF package.33 To prepare grid-like input data with fixed dimensions, we project the density matrices onto real-space grid points with shape (65, 65, 65) on a fixed-size cube centered at the origin with an edge length of 40 angstroms. The number of grid points is set to odd integers to include the origin. A larger grid with shape (129, 129, 129) is also used to construct more detailed density data. Due to the storage limit for the whole dataset, an average-pooling down-sampling pre-process is applied to reduce the grid dimensions from 129 to 65. A comparison of the results using these two grids is given in the ESI.† The projection of the density matrices onto grids in three-dimensional space is performed using the PySCF code.33 The exchange energies are calculated from the density matrices as they would be in Hartree–Fock or exact-exchange theories, using the NWChem code.42
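The density-matrix and grid-projection step can be sketched with PySCF as below; the molecule geometry and basis set are placeholders, and the paper's exact settings may differ.

```python
# Sketch of dataset generation: a PBE density matrix from PySCF projected onto a
# uniform real-space grid. Geometry and basis set are placeholders, not the paper's settings.
import numpy as np
from pyscf import gto, dft
from pyscf.dft import numint

mol = gto.M(atom="O 0 0 0; H 0 0.757 0.587; H 0 -0.757 0.587", basis="def2-svp")
mf = dft.RKS(mol)
mf.xc = "PBE"
mf.kernel()
dm = mf.make_rdm1()                                   # PBE density matrix

# Uniform (65, 65, 65) grid on a cube of edge 40 angstrom centered at the origin
edge_ang, npts = 40.0, 65
axis = np.linspace(-edge_ang / 2, edge_ang / 2, npts) / 0.52917721   # angstrom -> bohr
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
coords = np.column_stack([X.ravel(), Y.ravel(), Z.ravel()])

ao = numint.eval_ao(mol, coords)                      # atomic orbital values on the grid
density = numint.eval_rho(mol, ao, dm).reshape(npts, npts, npts)
```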
The supervised learning task uses the unscaled electron densities of the 10 000 molecules selected from the QM9 dataset. To find the best model for encoding the electron density, two different types of building block layers, ResNet and DoubleConv, were tested to build the density encoder. The model was built and trained using the PyTorch-Lightning package,43 a framework based on the PyTorch package.44 The whole dataset is split into 80%, 10% and 10% for training, validation and testing, containing 8000, 1000 and 1000 data samples, respectively. The training loss is backpropagated to update the model parameters by an Adam optimizer45 with a learning rate of 0.001. The best model was chosen as the one with the smallest MAE after 500 epochs.
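A minimal PyTorch-Lightning version of this supervised training setup might look as follows, reusing the DensityEncoder sketched earlier; the module and logging names are illustrative assumptions.

```python
# Sketch of the supervised training setup (PyTorch-Lightning, Adam, lr = 0.001, MSE training
# loss, MAE for evaluation). Names are illustrative; DensityEncoder is the sketch above.
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class ExchangeEnergyModel(pl.LightningModule):
    def __init__(self, encoder, m=256):
        super().__init__()
        self.encoder = encoder                         # e.g. the ResNet-based DensityEncoder
        self.head = torch.nn.Linear(m, 1)              # single-value exchange-energy prediction
    def forward(self, x):
        return self.head(self.encoder(x)).squeeze(-1)
    def training_step(self, batch, batch_idx):
        density, e_x = batch
        loss = F.mse_loss(self(density), e_x)
        self.log("train_mse", loss)
        return loss
    def validation_step(self, batch, batch_idx):
        density, e_x = batch
        self.log("val_mae", (self(density) - e_x).abs().mean())
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# trainer = pl.Trainer(max_epochs=500)
# trainer.fit(ExchangeEnergyModel(DensityEncoder()), train_loader, val_loader)
```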
The contrastive pretraining task uses the electron densities of the 10 000 molecules chosen from the QM9 dataset. Each raw electron density is augmented by a scaled one, with the scaling factor chosen from 1/3, 1/2, 1, 2, and 3, leading to a dataset with 50 000 data samples. The scaled density is then translated randomly in three-dimensional space. As a result of hyperparameter searching, ResNet with feature maps (16, 32, 64, 128) and DoubleConv with feature maps (32, 64, 128) are chosen for the comparison of performance on the downstream task. The whole dataset is split into 80%, 10% and 10% for training, validation and testing, containing 40 000, 5000 and 5000 data samples, respectively (see Fig. 3 and 4, and Table 1). The total training loss is the sum of the contrastive loss and the scaling factor prediction loss, which is backpropagated to update the model parameters by an Adam optimizer with a learning rate of 0.001. The best model was chosen as the one with the smallest total loss after 1000 epochs.
The downstream task uses the augmented electron densities of the 10 000 molecules chosen from the QM9 dataset, resulting in a dataset containing 50 000 electron densities. The whole dataset is split into 80%, 10% and 10% for training, validation and testing, with 40 000 training data, 5000 validation data and 5000 testing data. The model consists of an encoder transferred from the contrastive learning task and a simple linear layer. For a given scaled density nγ, the model predicts the scaling factor γ and the unscaled exchange energy Eγ=1, from which the predicted scaled energy can easily be calculated as Eγ = γEγ=1. The total loss is the mean squared error between the real and predicted γ and Eγ=1. To ensure a fair comparison, we also train a model from scratch without the transferred encoder, which represents the simple method of supervised learning with data augmentation. The comparison results are shown in Table 1.
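The downstream prediction described above can be sketched as a small head on top of the transferred encoder, returning γ, Eγ=1, and their product Eγ = γEγ=1; the layer sizes and names are assumptions.

```python
# Sketch of the downstream model: transferred encoder + linear layer predicting
# (gamma, E_{gamma=1}), with E_gamma = gamma * E_{gamma=1}. Sizes/names are assumptions.
import torch
import torch.nn as nn

class ScaledExchangeHead(nn.Module):
    def __init__(self, encoder, m=256):
        super().__init__()
        self.encoder = encoder                    # transferred from the contrastive pretraining
        self.linear = nn.Linear(m, 2)             # outputs: (gamma, E_{gamma=1})
    def forward(self, density):
        gamma, e_unscaled = self.linear(self.encoder(density)).unbind(dim=-1)
        return gamma, e_unscaled, gamma * e_unscaled   # the last entry is E_gamma
```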
Footnotes

† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00114h

‡ Present address: Department of Physics and Engineering Physics, Tulane University, New Orleans, LA 70118, USA. E-mail: perdew@tulane.edu

This journal is © The Royal Society of Chemistry 2023