Julia
Westermayr
and
Reinhard J.
Maurer
*
Department of Chemistry, University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL, UK. E-mail: r.maurer@warwick.ac.uk
First published on 30th June 2021
Modern functional materials consist of large molecular building blocks with significant chemical complexity which limits spectroscopic property prediction with accurate first-principles methods. Consequently, a targeted design of materials with tailored optoelectronic properties by high-throughput screening is bound to fail without efficient methods to predict molecular excited-state properties across chemical space. In this work, we present a deep neural network that predicts charged quasiparticle excitations for large and complex organic molecules with a rich elemental diversity and a size well out of reach of accurate many body perturbation theory calculations. The model exploits the fundamental underlying physics of molecular resonances as eigenvalues of a latent Hamiltonian matrix and is thus able to accurately describe multiple resonances simultaneously. The performance of this model is demonstrated for a range of organic molecules across chemical composition space and configuration space. We further showcase the model capabilities by predicting photoemission spectra at the level of the GW approximation for previously unseen conjugated molecules.
Novel functional materials in modern optoelectronic devices are often characterized by their molecular charge transport properties between acceptor and donor molecules. Such devices include organic diodes and transistors, which crucially depend on the subtle alignment of molecular acceptor and donor levels of different compounds with respect to each other. These fundamental molecular resonances associated with electron addition and removal in matter can be studied with photoemission and inverse photoemission spectroscopy.6,7 However, the search for optimal materials combinations is limited by the speed at which organic materials combinations can be spectroscopically characterized. This is exacerbated by the challenge of interpreting macroscopically averaged photoemission data for complex molecules.8–11
First-principles simulation of photoemission signatures have the potential to dramatically accelerate high throughput screening of organic materials, but the high computational cost associated with accurate many-body excited-state calculations limits their applicability to small molecular systems.12,13 Machine learning (ML) methods have the ability to overcome the gap between experiment and theory for spectroscopic characterization by reducing the computational effort of spectroscopic simulations without sacrificing prediction accuracy.14,15
ML methods in the context of spectroscopy have previously focused on predicting single energy levels,15–19 oscillator strengths,20,21 dipole moments,22–24 highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) energies25–28 or band gaps.29–31 They have also been applied successfully to identify and characterize structures from X-ray absorption spectra.32–34 Electronic excitations of molecules across chemical compound space show crossings of states with different character and discontinuous behaviour. For ML models based on smooth features to capture this behaviour while simultaneously predicting multiple electronic excitations is a formidable challenge.15,35 By predicting spectral lineshapes36,37 or continuous densities-of-states38 directly, some of these problems can be circumvented as spectral signatures are smooth. Furthermore spectra can be represented by basis functions or discrete grids providing a consistent representation that is independent of the number of energy levels or the size of the molecule.39–41 However, a consequence of this simplification is that direct information on the number and character of the molecular resonances is lost.
In this work, we develop a deep convolutional neural network that accurately predicts molecular resonances across a wide range of organic molecular compounds. We encode the fundamental physics of molecular resonances by representing them via a Hamiltonian matrix associated with a closed set of secular equations. In contrast to previous efforts,42–45 this matrix representation is not based on local atomic orbital features and the elements of this matrix have no direct physical correspondence beyond the fact that the matrix eigenvalues correspond to the learned molecular resonances. As we are only training on rotationally invariant quantities, the model achieves this without the need to explicitly encode vectorial46–49 or tensorial equivariance properties23,25 beyond the rotationally invariant representation of the input molecular coordinates.28 The simple algebraic modification of describing vectorial targets by diagonalization of a matrix output leads to increased learning rates, reduced prediction errors, and increased transferability in predicting electron addition and removal energies across molecular composition space. We showcase the capabilities of this model by predicting photoemission spectra of previously unseen organic electronics precursor molecules at the level of Density Functional Theory (DFT). We further show that the model can be augmented to account for solvation effects or many-body electron correlation effects using only a small fraction of the original training data. Correlation effects are described at the level of GW many-body perturbation theory, which provides spectroscopic predictions of large, complex molecules in close agreement with experiment.
In order to learn n molecular resonances with the conventional scalar SchNet model, n ML models, one for every electronic state or resonance i need to be trained. In the following, we refer to this as a one-state (1S) model (panel a). Similarly, a vector of n molecular resonances can be represented using one ML model with a single vectorial output, which we refer to as multi-state (MS) model (panel b).51 This is identical to a previously proposed model in the context of photochemistry.35 The pseudo-Hamiltonian model (SchNet + H), which we propose here is shown in panel c and internally builds an ML basis that satisfies the properties of a quantum mechanical Hamiltonian, i.e., it is symmetric and has eigenvalues that correspond to electron addition/removal energies. The dimension of the effective Hamiltonian output layer scales with the number of eigenvalues defined by the user. This is in contrast to a full quantum mechanical Hamiltonian, which scales with the size of the molecular system. This advantage makes it feasible to learn a large set of molecular resonances in a defined energy range for molecules of arbitrary size. The eigenvalues are obtained after diagonalization of the ML pseudo-Hamiltonian. Further details on the model training are given in the Methods Section 4.
The prediction accuracy of the three models is first analyzed by training on the 15 lowest Kohn–Sham DFT eigenvalues of 1000 configurations of the H2O molecule generated by ab initio molecular dynamics (for details on the training data, see ESI†) as shown in panels d and e of Fig. 1. As can be seen from the scatter plots in Fig. 1d and the prediction errors reported in Table S1,† the set of 15 1S models shows an accurate prediction of eigenvalues compared to the reference values with mean absolute errors (MAEs) ranging from 0.6 meV up to 5.5 meV for a given orbital energy. This is known and expected as each model only has to cover a small energy range.28 A single deep neural network with multi-variate outputs to predict all 15 eigenvalues shows substantial deviation between reference and prediction across all energies, i.e., for low-lying semi-core as well as for valence and virtual eigenstates (panel e) with MAEs of up to 300 meV. The MS model is about twenty times less accurate in terms of MAEs of the HOMO energy than the 1S models (52 meV vs. 2 meV). This finding is in line with similar models reported in the literature.17,18,22,26,27,35,39,42
The lack of prediction accuracy of the MS model can be understood as the model has to cover a large range of energies while having to capture the dependence of each eigenvalue as a function of input. In contrast, our proposed model, SchNet + H, which learns eigenvalues indirectly via the pseudo Hamiltonian matrix, faithfully reproduces orbital energies across the whole energy range. The maximum MAE is 67 meV and the HOMO orbital energy can be predicted with 26 meV accuracy. Analysis of the learning behaviour shows that the prediction error decreases faster with the number of data points for the SchNet + H model compared to the MS model (see ESI Fig. S1†). In Fig. 1e, the predicted and reference eigenvalue energies of frontier orbitals around the HOMO energy are plotted as a function of the bending angle in H2O. While all models provide a qualitatively correct description of the smooth dependence, the MS model shows larger deviations with respect to the reference values compared to the SchNet + H model.
Fig. 2 (a) Eigenvalues and (b) diagonal matrix elements of the pseudo-Hamiltonian of the SchNet + H model trained on molecules of the QM7-X data set52 along a trajectory of conformational change in 2-methylpentane. |
To further validate the accuracy of the model, we train it to represent 12 Kohn–Sham eigenvalues of ethanol42,54 along a molecular dynamics trajectory. Scatter plots are shown in ESI Fig. S2† and errors on a hold-out test set are reported in the ESI Table S2† along with other models reported in the literature for comparison. By comparing broadly across literature, we find that SchNet + H provides the same or better accuracy for the prediction of multiple resonances (between 12 and 53 across different training sets) compared to what most other models achieve for a single molecular resonance (e.g. the HOMO).17,18,26,35,39,63 The exception to this is the atomic-orbital-based SchNOrb Hamiltonian model,42 which predicts an average MAE for the same 12 eigenvalues of about 0.02 eV. However, we note that SchNOrb is a much larger and more flexible model, which is trained on eigenvalues and Hamiltonian matrices to predict all molecular eigenvalues (with a total averaged MAE of 0.48 eV). SchNOrb in its current form can only predict eigenvalues as a function of atomic positions for a fixed molecular composition.
Encouraged by the promising performance of SchNet + H, we have trained a transferable model of molecular electronic states based on the OE62 data base.66 This data set is especially challenging as it features greater elemental diversity and more heteroatoms and functional groups than there are in the QM9 or QM7-X data bases.26,66 The 62k molecules in OE62 are selected from known molecular crystal structures in the Cambridge Structural Database.67 For each equilibrium structure, the data set reports Kohn–Sham orbital eigenvalues calculated at the PBE + vdW and hybrid PBE (PBE0) functional level of DFT. The SchNet + H model trained on the PBE0 orbital energies is termed ML(PBE0). The predicted orbital energies against reference values of a test set are shown in Fig. 3a in light blue. The SchNet + H model is trained to capture up to 53 electronic states between −10 eV up to and including the LUMO+1 state. The model error for each data point in the whole training set shows a very large deviation for some systems with particularly high structural complexity. One such outlier is shown in panel a, which contains an 8-membered nitrogen cage in the center (see also Fig. S4 in the ESI†). We note that these data points do not influence the model accuracy and its ability to generalize across chemical compound space, which we have tested by removing outliers and retraining the model. Training errors are further reported along with the number of training data in ESI Table S2.† The model error (MAE of 0.13 eV) is quite convincing with few prominent deviations at low orbital energies that are associated with a small number of outlier molecules of particularly high structural complexity.
Fig. 3 Validation of the SchNet + H model to predict PBE0 eigenvalues of the OE62 data base and the Δ-ML model that corrects the PBE0 fitted eigenvalues to G0W0@PBE0 accuracy or to PBE0 + implicit water solvation. (a) Scatter plots of a test set show the accuracy of each model. (b) Histograms of orbital eigenvalue (quasiparticle) energies for PBE0 in implicit water solvation and G0W0@PBE0 are shown for the GW5000 data set. A Gaussian envelope with 0.5 eV width is placed over each peak to depict the energy shifts between data sets and ML models. The eigenvalues of (c) the molecule with most eigenvalues within the modelled energy range and with (d) the worst predicted eigenvalues in the test set are shown using a Pseudo-Voigt lineshape64,65 based on a 30% Lorentzian and 70% Gaussian ratio with 0.5 eV width. |
For a subset of 30876 molecules, the OE62 set further reports PBE0 (ref. 68) eigenvalues calculated with the Multipole Expansion (MPE) implicit solvation method.69 For a further subset of 5239 molecules in vacuum (termed GW5000), the data set reports quasiparticle energies calculated at the many-body perturbation theory in the G0W0@PBE0 approximation.70–72 With the exception of the HOMO, Kohn–Sham orbital energies lack a physical meaning73 and important properties of optoelectronic materials, such as donor and acceptor levels20,39 or band gaps are often incorrectly described.70 In order to obtain charged excitations in molecules and materials, the GW method13,71 can be used to correct artifacts that arise from approximations in the exchange–correlation functional in DFT. The computation of quasiparticle energies is computationally unfeasible for the full OE62 data set and for much larger molecular systems with potential relevance in organic electronics. The electronic resonances that include solvation effects and correlation effects captured in the two data subsets should principally deviate from the PBE0 energies of the full data set in relatively systematic ways. We therefore apply a Δ-ML approach20,74 to train ML models to capture the difference in orbital energy and quasiparticle energy between PBE0 in vacuum and in water and PBE0 and G0W0@PBE0, respectively. Our Δ-ML approach is explained in more detail in the Methods section. Briefly, the SchNet + H model of the PBE0 eigenvalues learns a baseline for the full 62k data set (50k training data points), whereas the Δ-ML models learn the difference with respect to this ML(PBE0) baseline from a much smaller training data set (4k).
Test errors of orbital (quasiparticle) energies predicted by the two Δ-ML models are also reported in Fig. 3a. We note that the error distribution is narrower for the Δ-ML-corrected models than for ML(PBE0). Fig. 3b shows that the ML(PBE0) and the two Δ-ML models predict eigenenergies with high fidelity and accurately represent the data sets with a MAE (RMSE) as low as 2 and 4 meV for PBE0(H2O) and G0W0@PBE0, respectively. On closer inspection, we find that the excitation spectrum of the molecule in the test set with the most eigenvalues in the represented energy range shows quantitative agreement with the reference spectrum and a MAE (RMSE) of 29 (52) meV in the vicinity of the peaks (see Fig. 3c). The spectrum for the molecule with the highest prediction error (Fig. 3d) shows noticeable deviations only for the Δ-ML(G0W0@PBE0) model. Here the model predicts a splitting of the HOMO levels and underestimates the energy of the LUMO compared to the reference data with a MAE of 0.51 meV and a RMSE of 0.94 meV on the spectrum in the vicinity of the peaks. We note that this molecule is a rare case in the data base that contains more heteroatoms than carbon atoms, which could be a reason for the increased prediction errors.
The Δ-ML(G0W0@PBE0) is only trained on a subset of 4k datapoints of the GW5000 data set as no quasiparticle energies are available for the full 62k data points of the OE62 data set. By applying the SchNet + H ML(PBE0) and Δ-ML(G0W0@PBE0) models to predict the quasiparticle energies of the full OE62 data set, we can gauge the transferability of the models across chemical space. We find that the models predict the same vertical shift of occupied and unoccupied states between PBE0 and G0W0@PBE0 levels of theory for the full OE62 data set that we have shown in Fig. 3b for the GW5000 set (see ESI Fig. S4b†). In addition, the predictions show a linear correlation of the Kohn–Sham HOMO and LUMO orbital energies with the corresponding quasiparticle energies (Fig. S4a†). This linear relation has previously been identified for HOMO energies of the smaller GW5000 subset in ref. 66, which we can now extend for all orbitals in the OE62 set. Not surprisingly, the application of the Δ-ML(G0W0@PBE0) induces a downward shift of occupied PBE0 energies and an upward shift in energy for unoccupied orbitals to create electron removal and addition quasiparticle energies. Hardly any shift can be found for the eigenenergies obtained from the implicit solvation model indicating that solvation has a minor impact on the molecular resonances.
The combined SchNet + H ML(PBE0) and Δ-ML(G0W0@PBE0) models can predict (inverse) photoemission spectra, ionization potentials and electron affinities of large and complex organic molecules which are well out of reach for ab initio calculations at this level of theory. Previous works have predicted individual HOMO and LUMO quasiparticle energies of the GW5000 (ref. 27) and GW100 (ref. 63 and 78) data sets. Our model is able to predict many quasiparticle resonances over a wide energy range and is therefore able to simulate photoemission spectra.
The excitation spectra are predicted with the ML model trained on PBE0 orbital energies of the OE62 data set (denoted as ML(PBE0)) and the Δ-ML model trained on the difference of the ML(PBE0) model and the G0W0@PBE0 values of 4k datapoints of the GW5000 data set. The combination of both models is denoted as ML(G0W0@PBE0) in the following. All photoemission spectra shown in Fig. 4a–d and ESI Fig. S6–S8† are ML predictions of molecules the model has not seen before. In addition to the photoemission spectra, the LUMO energies are plotted and the spectra obtained from Kohn–Sham eigenvalues are shown to highlight the Δ-ML quasiparticle correction. The spectra obtained with ML(G0W0@PBE0) are in excellent agreement with experiment. Compared to spectra based on Kohn–Sham orbital energies, they accurately reflect the positions and intensities of photoemission features. In addition, the model correctly predicts the spectral fingerprints of similar molecules and accurately describes substituent effects. For instance, the model accurately predicts the differences of 1,3-dibromoaculene and 1,3-dichloroaculene (see panel d and ESI† for details). Even a highly complex molecule such as 1,3-dibenzoylazulene with 48 atoms (see Fig. 4d), is predicted with high accuracy with respect to the experimental spectrum.
Fig. 4 Experimental and ML predicted photoemission spectra along with the LUMO (quasiparticle) orbital energies at the PBE0 (G0W0@PBE0) level for (a) perylene, (b) chrysene, (c) 1,3-dibenzoylazulene, and (d) 1,3-dichloroazulene. A Pseudo-Voigt lineshape64,65 based on a 30% Lorentzian and 70% Gaussian ratio with 0.3 eV width was used. (e) Electron affinities and ionization potentials of acene molecules are plotted with increasing ring size. (1)Experimental photoemission spectra have been extracted from ref. 75, (2) ref. 76, and (13) ref. 77. |
In addition to the photoemission spectra, we predict the electron affinities and ionization potentials of molecules of the acene family. As can be seen in Fig. 4d, acenes are built from linearly condensed benzene rings and are often referred to as “1d graphene strips”. Acenes are especially interesting as they are relevant in electronic devices due to their narrow HOMO–LUMO gaps that can result in generally high conductivity.2,77 The predicted ionization potentials and electron affinities fit well to experimental values although the HOMO–LUMO gaps are slightly underestimated. This underestimation is not an artifact of the ML model, but is a well known limitation of the G0W0 method for acene molecules.77 Due to the instability of hexacene (n = 6), the experimental prediction of charged excitations is challenging, hence no electron affinity value is available to which the ML predictions can be compared.2 The respective photoemission spectra are reported in ESI Fig. S8† and are in qualitatively good agreement with experimental spectra reported in literature.77
The ability to efficiently predict molecular resonances at high accuracy is key to enable large-scale computational screening of novel acceptor and donor molecules to be used in organic electronics and thin film device applications.7,81,84 We expect that the presented method will be very useful in this context. It will likely be especially powerful in combination with generative ML85,86 or reinforcement learning models87 that can recommend new molecular structures with specific tailored properties. In this way, a fully automated search algorithm for new molecules with optimally tuned acceptor and donor levels could be created.81,88,89
diag({εMLi}) = UTHMLU. | (1) |
As SchNet learns the molecular representation, the need for extensive hyperparameter search is reduced. As illustrated in Fig. 1, Hamiltonian elements for states i and j, Hij, are obtained by sum-pooling of atomic features, xa. wija denotes the weights that connect the last layer of the standard SchNet network to the pseudo-Hamiltonian layer.
(2) |
Diagonalization of the pseudo-Hamiltonian matrix is carried out after each pass trough the network and the eigenvalues predicted by the ML model enter the loss function, L2:
(3) |
SchnNet + H models consistently provide better accuracy than MS-SchNet models. While the accuracy of direct training in MS-SchNet can be improved by placing a Gaussian function on top of the orbital energies in the loss function, this did not lead to more accurate results than the SchNet + H model. Our goal was to develop a model that predicts molecular resonances across chemical space and does not scale with system size. We therefore define an energy range within which we represent all orbital energies up to a maximum number of values that defines the size of HML. The energy range that was fitted for each data set is reported in ESI Table S2.† A varying number of orbital energies are used for training with the maximum number of eigenvalues being 53 for the OE62 and GW5000 training sets.66 Every molecule that contains fewer orbital energies than the maximum amount of fitted values can be predicted by using a mask in the loss function that makes sure only relevant values are included.
ΔεML(G0W0 − DFT) = εref(G0W0) − εML(DFT) | (4) |
For the Δ-ML model, a conventional MS model is sufficient as the differences in DFT (predicted by the SchNet + H model) and G0W0 vary less strongly as a function of input than the actual targets.93,94 The architecture of the Δ-ML model is identical to panel (b) in Fig. 1. The Δ-ML model is trained separately from the SchNet + H model and is not combined in an end-to-end fashion. Nevertheless, the models depend on each other as the SchNet + H models provides the baseline for the Δ-ML model and predictions of both models need to be combined to obtain reliable quasiparticle energies.
Although the accuracy of the Δ-models can be improved by using DFT reference values as the baseline for Δ-models (MAE of 0.02 eV are obtained with DFT baseline models compared to MAEs of 0.16 eV with SchNet + H(PBE0) baseline models), the ML predicted DFT values are chosen as a baseline to circumvent the use of DFT reference calculations for new predictions altogether. This provides an ML prediction that is independent of electronic structure calculations and practical for large-scale screening studies. The predicted G0W0@PBE0 values are obtained by using the following equation:
εML(G0W0) = εML(DFT) + ΔεML(G0W0 − DFT). | (5) |
For the prediction of G0W0@PBE0 values, we thus use two ML models, one SchNet + H model trained on DFT orbital energies and one MS-SchNet model trained on the difference between quasiparticle and orbital energies. Further details on model size, training and test set split, and model parameters can be found in the ESI.† The chosen model parameters are reported in ESI Table S3.†
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d1sc01542g |
This journal is © The Royal Society of Chemistry 2021 |