Open Access Article
Prakriti Kayastha†, Sabyasachi Chakraborty† and Raghunathan Ramakrishnan*
Tata Institute of Fundamental Research Hyderabad, Hyderabad 500046, India. E-mail: ramakrishnan@tifrh.res.in
First published on 18th August 2022
In this study, we explore the potential of machine learning for modeling molecular electronic spectral intensities as a continuous function in a given wavelength range. Since presently available chemical space datasets provide excitation energies and corresponding oscillator strengths for only a few valence transitions, we present a new dataset, bigQM7ω, with 12 880 molecules containing up to 7 CONF atoms and report ground state and excited state properties. A publicly accessible web-based data-mining platform is presented to facilitate on-the-fly screening of several molecular properties including harmonic vibrational and electronic spectra. We present all singlet electronic transitions from the ground state calculated using the time-dependent density functional theory framework with the ωB97XD exchange-correlation functional and a diffuse-function augmented basis set. The resulting spectra predominantly span the X-ray to deep-UV region (10–120 nm). To compare the target spectra with predictions based on small basis sets, we bin spectral intensities and show that good agreement is obtained only at the expense of the resolution. Compared to this, machine learning models with the latest structural representations trained directly using <10% of the target data recover the spectra of the remaining molecules with better accuracies at a desirable <1 nm wavelength resolution.
ML models have been shown to accurately forecast a multitude of global13,16,17 and quasi-atomic molecular properties.18–20 For atomization or bonding energies, their prediction uncertainties are comparable to that of hybrid density functional theory (DFT) approximations.14,21–26 They have also successfully modeled non-adiabatic molecular dynamics,27 vibrational spectra,28,29 electronic coupling elements,30 excitons,31 electronic densities,32 excited states in diverse chemical spaces,33–35 as well as excited-state potential energy surfaces (PES).34,36–39 A key difference in the performance of ML in the latter two application domains is that ambiguities due to atomic indices and size-extensivity that affect the quality of structural representations for chemical space explorations40,41 do not arise in PES modeling or dipole surface modeling42–44 resulting in better learning rates.
ML models of global molecular energies (atomization/formation energies, etc.) with a robust structural representation benefit from the well-known mapping between the ground state electronic energy and the corresponding minimum energy geometry established by the Hohenberg–Kohn theorem.45 The Runge–Gross theorem provides a similar mapping between the time-dependent potential and the time-evolved total electron density.46 However, the target quantities in ML modeling of excited states are state-specific stemming from local molecular regions. For quasi-atomic properties such as 13C NMR shielding constants18–20,39 or K-edge X-ray absorption spectroscopy,18,39 a representation encoding the local environment of the query atom results in better learning rates. Similarly, quasi-particle density-of-states—interpreted as intensities in a photo-emission spectrum—have also been successfully modeled.47,48 However, for valence electronic excitations that are also local, the corresponding molecular substructure varies non-trivially across the chemical space. Hence, intensities based on oscillator strengths derived from many-electron excited state wave functions obeying dipole selection rules exhibit slow learning rates.34 Determining the characteristic chromophore responsible for the electronic excitations is non-trivial for chemical space datasets such as QM9 (ref. 49) that exhibit large structural diversity. This complexity, in turn, hinders the development of local descriptors that can map to the composition or structure of the chromophore and its environment. Hence, we are limited to using global structural representations for ML modeling of electronic excited state properties. This limitation becomes evident from the modest performances of ML models of excitation energies,33,34 and their zero-order approximations, the frontier molecular orbital (MO) energies.22,35,50,51
In this study, we: (i) present a high-quality chemical space dataset, bigQM7ω, containing ground-state properties and electronic spectra of 12 880 molecules containing up to 7 CONF atoms modeled at the ωB97XD level with different basis sets; (ii) demonstrate the resolution-vs.-accuracy dilemma in modeling spectroscopic intensities; and (iii) present ML models trained on the bigQM7ω dataset for an accurate reconstruction of the electronic spectra of allowed transitions in a given wavelength domain.
In the present work, we explore molecules with up to 7 CONF atoms. We begin with the GDB11 set of SMILES because several important molecules such as ethylene and acetic acid present in GDB11 were filtered out in GDB13 and GDB17. Our new dataset contains 12 883 molecules, making it almost twice as large as the QM7 sets. The breakdown for subsets with 1/2/3/4/5/6/7 heavy atoms is 4/9/20/80/352/1850/10 568. The previous datasets QM7, QM7b, and QM9 have been generated using yesteryear's quantum chemistry workhorses: PBE,58 PBE0,59 and B3LYP.60 Here, we use the range-separated hybrid DFT method, ωB97XD,61 that is gaining widespread popularity for its excellent accuracy. Hence, we name this dataset bigQM7ω, with the last character emphasizing the DFT approximation utilized. The high-throughput workflow used for generating bigQM7ω is shown in Fig. 1, and Table 1 puts the new dataset in perspective by comparing it with other popular datasets of similar constitution. While bigQM7ω is smaller than QM9, it provides a better coverage of molecules for the same number of CONF atoms. Further, bigQM7ω also provides excited state data collected at various theoretical levels, hence comprehensively covering the property domain. A summary of properties of bigQM7ω, made available in the form of structured datasets,62 is provided in Table 2. As unstructured datasets, we provide raw input/output files63 to kindle future endeavors. For example, for ML modeling of forces, properties of non-equilibrium geometries may be extracted from these raw data.
| Details | QM7 | QM7b | QM9a | bigQM7ω |
|---|---|---|---|---|
| Origin | GDB13 | GDB13 | GDB17 | GDB11 |
| Elements | CHONS | CHONSCl | CHONF | CHONF |
| Size | 7165 | 7211 | 133 885 | 12 880 |
| Geometry optimization | PBE0/tight tier-2 | PBE/tight tier-2 | B3LYP/6-31G(2df,p) | ωB97XD/3-21G; ωB97XD/def2SVP; ωB97XD/def2TZVP |
| Frequencies | | | B3LYP/6-31G(2df,p) | ωB97XD/3-21G; ωB97XD/def2SVP; ωB97XD/def2TZVP |
| Excited states | | E1 | E1, E2, f1, f2 | All states |
| | | GW/tight tier-2 | RICC2/def2TZVP; TDPBE0/def2SVP; TDPBE0/def2TZVP; TDCAMB3LYP/def2TZVP | TDωB97XD/3-21G; TDωB97XD/def2SVP; TDωB97XD/def2TZVP; TDωB97XD/def2SVPD |

a Contains 3993/22 786 molecules with up to 7/8 CONF atoms. Excited state data are available for the 22 786 subset (ref. 33).
| PM6 |
|---|
| Equilibrium geometries (Å) |
| All molecular orbital energies (hartree) |
| Total electronic and atomization energies (hartree) |

| ωB97XD/(3-21G, def2SVP, def2TZVP) |
|---|
| Equilibrium geometries (Å) |
| All molecular orbital energies (hartree) |
| Atomization energies (hartree) |
| All harmonic frequencies (cm⁻¹) |
| Zero-point vibrational energy (kcal mol⁻¹) |
| Mulliken charges, atomic polar tensor charges (e) |
| Dipole moment (debye) |
| Polarizability (bohr³) |
| Radial expectation value (bohr²) |
| Internal energy at 0 K and 298.15 K (hartree) |
| Enthalpy at 298.15 K (hartree) |
| Free energy at 298.15 K (hartree) |
| Total heat capacity (cal mol⁻¹ K⁻¹) |

| ZINDO, TDωB97XD/(3-21G, def2SVP, def2TZVP, def2SVPD) |
|---|
| Excitation energy of all states (eV, nm) |
| Oscillator strengths of all excitations (dimensionless) |
| Transition dipole moment of all excitations (a.u.) |
883 molecules in bigQM7ω were generated from SMILES by relaxing with the universal force field (UFF)64 employing tight convergence criteria using OpenBabel.65 As a guideline for quantum chemistry big data generation, a previous study proposed connectivity preserving geometry optimizations (ConnGO) to eliminate structural ambiguities due to rearrangements encountered in automated high-throughput calculations.66 Accordingly, we used a 3-tier ConnGO workflow to generate geometries at the ωB97XD61 DFT level using def2SVP and def2TZVP basis sets.67 Geometry optimizations at the simpler levels such as PM6 and ωB97XD/3-21G were performed without ConnGO, directly starting from the UFF structures. For ωB97XD/def2SVP final geometries, we used HF/STO3G and ωB97XD/3-21G as intermediate tier-1 and tier-2 levels, respectively. Similarly, for ωB97XD/def2TZVP, HF/STO3G and ωB97XD/def2SVP were the lower tiers. In each tier, ConnGO compares the optimized geometry with the covalent bonding connectivities encoded in the initial SMILES and detects molecules undergoing rearrangements. For this purpose, we used the ConnGO thresholds: 0.2 Å for the maximum absolute deviation in covalent bond length and 6% for the mean percentage absolute deviation. In DFT calculations, tight optimization thresholds were used, along with ultrafine grids for evaluating the exchange–correlation (XC) energy. A few molecules required relaxing the optimization thresholds for monotonic convergence towards a minimum. All final geometries were confirmed to be local minima through harmonic frequency analysis. For molecules that are highly symmetric or have multiple triple bonds, converging to minima was only possible with the very tight optimization threshold and superfine grids. At both ωB97XD/def2SVP and ωB97XD/def2TZVP levels, 3 molecules with the SMILES O=c1cconn1, N=c1nconn1, and O=c1nconn1 failed the ConnGO connectivity tests. Further investigation revealed these molecules to contain an –NNO– substructure in a 6-membered ring facilitating dissociation, as previously noted in ref. 66. After removing these molecules, the size of bigQM7ω stands at 12 880.
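The connectivity test described above can be illustrated with a minimal sketch. It assumes that deviations are measured between the covalent bond lengths of a starting geometry and those of the optimized geometry, for the bonds listed in the SMILES; the function names and the bond-list input are hypothetical and do not reproduce the actual ConnGO code. Only the two thresholds (0.2 Å and 6%) are taken from the text.

```python
import numpy as np

def bond_lengths(coords, bonds):
    """Length of each bond (i, j) for an (n_atoms, 3) coordinate array in Å."""
    coords = np.asarray(coords, dtype=float)
    i, j = np.asarray(bonds).T
    return np.linalg.norm(coords[i] - coords[j], axis=1)

def passes_connectivity_check(ref_coords, opt_coords, bonds,
                              max_ad=0.2, max_mpad=6.0):
    """Hypothetical ConnGO-style test (assumed form, not the published code).

    max_ad   : threshold on the maximum absolute bond-length deviation (Å)
    max_mpad : threshold on the mean percentage absolute deviation (%)
    """
    r_ref = bond_lengths(ref_coords, bonds)
    r_opt = bond_lengths(opt_coords, bonds)
    abs_dev = np.abs(r_opt - r_ref)
    mpad = 100.0 * np.mean(abs_dev / r_ref)
    return bool(abs_dev.max() <= max_ad and mpad <= max_mpad)
```

A geometry in which one bond stretches by 1 Å fails the first criterion, while a geometry in which every bond stretches by 10% fails the second even though no single deviation exceeds 0.2 Å.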
We performed vertical excited-state calculations at Zerner's intermediate neglect of differential overlap (ZINDO)68 and TDωB97XD levels. ZINDO calculations were done on PM6 minimum energy geometries, while TDωB97XD calculations with the 3-21G, def2SVP, and def2TZVP basis sets were done at the corresponding ground state equilibrium geometries. We also performed TDωB97XD calculations with the diffuse function augmented basis set, def2SVPD, on ωB97XD/def2SVP geometries. All electronic structure calculations were performed using the Gaussian suite of programs.69 The number of excited states accessible to the TDDFT formalism is limited by the number of electrons and the size of the orbital basis set. With the finite basis set used in this study, the spectrum is practically discrete. To ensure that all the singlet-type electronic bound states are calculated, we set an upper bound of 10 000 for the number of states in the TDDFT single point calculations with the keyword nstates = 10000. For benchmarking the quality of TDDFT excitation spectra, we also performed calculations with the similarity transformed equation-of-motion coupled cluster with singles and doubles excitations (STEOM-CCSD)70 method and the aug-cc-pVTZ basis set as implemented in Orca.71,72
| $p^{\mathrm{est}}(\mathbf{d}_q) = \sum_{i=1}^{N} c_i\, k(\mathbf{d}_q, \mathbf{d}_i)$ | (1) |
The coefficients, {ci}, are obtained by regression over the training data. The kernel function, k(·), captures the similarity in the representations of the query, q, and all N training examples. For ground state energetics, the Faber–Christensen–Huang–Lilienfeld (FCHL) formalism in combination with KRR-ML has been shown to perform better than other structure-based representations.22,74 However, for excitation energies and frontier MO gaps, FCHL's performance drops compared to the Spectral London-Axilrod-Teller-Muto (SLATM) representation.75 In this study, we compare the performance of FCHL and SLATM for modeling the full-electronic spectrum. SLATM delivers best accuracies with the Laplacian kernel, k(dq, di) = exp(−|dq − di|1/σ), where σ defines the length scale of the kernel function and |·|1 denotes L1 norm. For the FCHL formalism, we found an optimal kernel width of σ = 5 through scanning and a cutoff distance of 20 Å was used to sufficiently capture the structural features of the longest molecule in the bigQM7ω dataset, heptane.
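As an illustration of kernel ridge regression with the Laplacian kernel quoted above, the following is a minimal NumPy sketch. The descriptor arrays stand in for SLATM-like vectors and are placeholders; the function names and the small regularizer `zeta` are assumptions made for this example and are not the code used in the study.

```python
import numpy as np

def laplacian_kernel(A, B, sigma):
    """k(a, b) = exp(-|a - b|_1 / sigma) for every row of A against every row of B."""
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-l1 / sigma)

def krr_fit(D_train, p_train, sigma, zeta=1e-8):
    """Regression coefficients {c_i}; zeta is an assumed small regularizer."""
    K = laplacian_kernel(D_train, D_train, sigma)
    return np.linalg.solve(K + zeta * np.eye(len(D_train)), p_train)

def krr_predict(D_query, D_train, coeffs, sigma):
    """Prediction as a kernel-weighted sum over the N training examples."""
    return laplacian_kernel(D_query, D_train, sigma) @ coeffs
```

With a negligible regularizer the model interpolates the training targets, so predicting on the training descriptors returns the training values to within numerical noise; out-of-sample accuracy depends on σ and the training set size.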
The kernel width, σ, is traditionally determined through cross-validation. For multi-property modeling, σ can be estimated using the ‘single-kernel’ approach,16 where σ is estimated as a function of the largest descriptor difference in a sample of the training set, σ = max{dij}/log(2). Previous works16,20,76 have shown single-kernel modeling to agree with cross-validated results within the uncertainty arising from training set shuffles, especially for large training set sizes. KRR with a single kernel facilitates seamless modeling of multiple molecular properties using standard linear solvers
| [K + ζI][c1, c2, …] = [p1, p2, …], | (2) |
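The single-kernel idea in eqn (2) amounts to solving one linear system with several right-hand sides. The sketch below assumes L1 descriptor distances with a Laplacian kernel and takes the σ estimate from the full set of descriptors passed in; the function names and the value of ζ are illustrative choices and are not taken from the original work.

```python
import numpy as np

def l1_distances(D):
    """Pairwise L1 distances between descriptor rows."""
    return np.abs(D[:, None, :] - D[None, :, :]).sum(axis=2)

def single_kernel_sigma(D_sample):
    """sigma = max{d_ij} / log(2), so the smallest kernel element equals 0.5."""
    return l1_distances(D_sample).max() / np.log(2.0)

def fit_many_properties(D_train, P_train, sigma, zeta=1e-8):
    """Solve [K + zeta*I] C = P once; each column of P is one property
    (here, one wavelength bin) and each column of C its coefficient vector."""
    K = np.exp(-l1_distances(D_train) / sigma)
    return np.linalg.solve(K + zeta * np.eye(len(D_train)), P_train)
```

Because the kernel matrix does not depend on the property, the cost of the factorization is shared by all bins, which is what makes modeling hundreds of binned intensities practical.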
The property (pi vector in eqn (2)) modeled in this study corresponds to the sum of binned oscillator strengths of electronic transitions from the ground state. Conventionally, the band intensity due to the k-th excitation is the molar absorption coefficient, which is proportional to the corresponding oscillator strength, denoted shortly as fk.79 In order to model a full spectrum in a given wavelength range, one can consider each value of fk (in atomic units, a.u.) as a separate target quantity. However, the number of states is not uniform in a dataset such as bigQM7ω. Further, in practice, one is interested in the integrated oscillator strength within a small resolution, Δλ. Hence, we uniformly divide the spectral range in powers of 2, Δλ = λspectrum/Nbin, where Nbin = 1, 2, 4, … is the number of bins. For small organic molecules such as those in bigQM7ω, we set the spectral range, λspectrum, to 120 nm, capturing most of the excitations. For wavelengths >120 nm the bigQM7ω dataset provides too few examples; hence, data in this long wavelength domain are inadequate for ML modeling. The target for ML is the sum of fk in a bin
| $p_i = \sum_{k \,\in\, \mathrm{bin}\ i} f_k$ | (3) |
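The binning of oscillator strengths can be sketched with a weighted histogram. The example assumes that the bins start at 0 nm and end at λspectrum = 120 nm, and that excitations outside this window are discarded; the function name and inputs are hypothetical.

```python
import numpy as np

def bin_oscillator_strengths(wavelengths_nm, f, n_bins, lambda_max=120.0):
    """Sum the oscillator strengths f_k falling into each of n_bins uniform bins.

    Assumes the spectral window is [0, lambda_max] nm; transitions outside
    the window are ignored by np.histogram.
    """
    edges = np.linspace(0.0, lambda_max, n_bins + 1)
    p, _ = np.histogram(wavelengths_nm, bins=edges, weights=f)
    return p, edges
```

With Nbin = 32, 64, and 128 the bin widths are 3.75, 1.875, and 0.9375 nm, which correspond to the 3.75 nm, 1.88 nm, and 0.94 nm resolutions quoted in the text.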
![]() | (4) |
i,a is the normalized oscillator strength for the i-th bin defined as
. For two spectra binned at a common resolution, Δλ, the accuracy metric for normalized spectra (Φ) is given by:![]() | (5) |
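When the reference and target property vectors are the same, the accuracy is maximum, Φ = 100. For a sample with Nmol molecules, an overall prediction accuracy (Φ̄) can be obtained as an average
| $\bar{\Phi}(\Delta\lambda) = \frac{1}{N_{\mathrm{mol}}} \sum_{a=1}^{N_{\mathrm{mol}}} \Phi_a(\Delta\lambda)$ | (6) |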
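To make the idea of a normalized-spectrum score concrete, the sketch below uses one plausible form that is an assumption of this example and not a reproduction of eqn (5): binned intensities are normalized by their sum, and the score is 100 × (1 − ½ Σi |p̃i,ref − p̃i,pred|). This form equals 100 for identical spectra and 0 for spectra with no overlapping bins. The function names are hypothetical.

```python
import numpy as np

def normalize(p):
    """Assumed normalization: divide binned intensities by their sum."""
    p = np.asarray(p, dtype=float)
    return p / p.sum()

def accuracy_metric(p_ref, p_pred):
    """Assumed score in [0, 100] for non-negative spectra (not the published eqn (5))."""
    diff = np.abs(normalize(p_ref) - normalize(p_pred)).sum()
    return 100.0 * (1.0 - 0.5 * diff)

def mean_accuracy(refs, preds):
    """Average of the per-molecule scores over a sample of molecules."""
    return float(np.mean([accuracy_metric(r, q) for r, q in zip(refs, preds)]))
```

Under this assumed form, two single-bin spectra always score 100 regardless of their total intensity, which is in line with the behavior described for the Δλ = 120 nm limit.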
For the low-lying excited states of small molecules, equation-of-motion coupled cluster with singles and doubles (EOM-CCSD)84 and approximate second-order coupled-cluster (CC2) deliver a mean error of 0.10–0.15 eV compared to higher-level wave function methods.59,80–82,85–88 While these methods can be made more economical by using the resolution-of-identity (RI) technique, as in RICC2 (ref. 89), or the domain-based local pair-natural orbital (DLPNO) variant of EOM-CCSD,90 they have known limitations when modeling the full electronic spectra of thousands of molecules. Formally, the total number of electronic states accounted for by these wave function methods scales as $\mathcal{O}(N_o^2 N_v^2)$, where No and Nv are the numbers of occupied and virtual MOs. Even for a small molecule such as benzene with a triple-zeta basis set, the size of the resulting electronic Hamiltonian is of the order of millions. It is well known that the iterative eigensolvers used for such large scale problems converge poorly for higher eigenvalues, restricting their usage only to the lowest few electronic states.91 Hence, as of now, large scale computations of full electronic spectra across a chemical space dataset are feasible only at the time-dependent (TD) DFT level,92,93 which shows an $\mathcal{O}(N_o N_v)$ scaling.
While DFT offers a suitable high-throughput data generation rate, its accuracy for geometries and properties is dependent on the exchange-correlation (XC) functional. The chemical space dataset, QM9, was designed using the hybrid generalized gradient approximation (hGGA), B3LYP, with 6-31G(2df, p) basis set because of their use in the Gn family of composite wavefunction methods.94 For thermochemistry energies, B3LYP has an error of 4–5 kcal mol−1.95 A recent benchmark study96 has shown the range-separated hGGAs from the ωB97 family61 to have errors in the 2–3 kcal mol−1 window; their performance is second only to the Gn methods. While curating the QM9 dataset, the dispersion corrected variant of ωB97, namely ωB97XD, predicted high-veracity geometries less prone to rearrangements in automated high-throughput workflows.66 Hence, we resort to ωB97XD for geometry optimization and its time-dependent variant for modeling the complete electronic excitation spectra.
The electronic excitations of the molecules in our dataset are predominantly in the deep-ultraviolet (deep-UV) to X-ray region. Since the popular flavors of TDDFT depend on the adiabatic approximation where the orbitals are relaxed to first-order as a linear response, they often fail to describe the electronic wavefunction of high-lying excited states that can substantially differ from that of the ground state.97 Such effects may be anticipated especially for excitations of long-range charge-transfer character, Rydberg-type98 or excitations of core electrons.99 Additionally, electronic states of doubly excited character are not accessible to the linear-response formalism of TDDFT.100 However, as yet, remedies for improving TDDFT for pathological situations have not been tested over chemical space datasets. Furthermore, some of the new methods such as the orbital optimized DFT also suffer from algorithmic errors resulting in variational collapse to a low-lying state.97
To probe the effect of basis sets on the TDDFT-level excited state properties, we selected the smallest 33 molecules with up to 3 heavy atoms as a benchmark set. Accurate modeling of oscillator strengths and high-lying electronic states requires basis sets augmented with diffuse functions in order to achieve semi-quantitative accuracy. Hence, in Fig. 2, we explore ωB97XD's performance for excitation properties computed at def2SVP (SVP), def2TZVP (TZVP), def2SVPD (SVPD), and def2TZVPD (TZVPD). We use the lowest two excitation energies (E1 and E2) and the corresponding oscillator strengths (f1 and f2) with the accurate STEOM-CCSD/aug-cc-pVTZ method as the reference. Unsurprisingly, def2SVP has the largest errors across all excitation properties followed by def2TZVP, def2SVPD, and def2TZVPD. Including diffuse functions results in errors that are almost half of those from basis sets devoid of diffuse functions. The errors for all four properties obtained with the def2SVPD basis set are very similar irrespective of whether the corresponding geometries were determined with def2SVP or def2TZVP basis sets. Even though def2TZVPD offers the best accuracies, we find the computational cost for determining the full spectra of all molecules in bigQM7ω to be very high. Hence, we resort to the def2SVPD basis set, which is cost-effective for the excited state calculations. The final target-level data used for training ML models were obtained at the TDωB97XD/def2SVPD level using geometries calculated at the ωB97XD/def2SVP level. While TDωB97XD/def2SVPD level excitation spectra are by no means quantitatively accurate, for high-throughput explorations of medium-sized molecules they still preserve broad trends that can be learned through structure–property relationships.
We also compare the performance of different methods for predicting the Thomas–Reiche–Kuhn (TRK) sum, ∑kfk. For an exact excited state method, this sum according to the TRK theorem must converge to the number of electrons.79 In quantum chemistry, unfortunately, this condition is satisfied only at the full-CI limit, when all excitations (singles, doubles, triples, and so on) are accounted for at the basis set limit. ZINDO and the TDωB97XD methods are not expected to satisfy the TRK limit. We illustrate this aspect in Fig. 3 where the TRK-sum is plotted as a function of total number of states accessible. ZINDO deviates the most from the TDωB97XD/def2SVPD target because the number of excited states available is limited by two factors. Firstly, core electrons are not included in ZINDO. Secondly, semi-empirical models are implicitly based on a minimal basis set.
TDDFT modeling with 3-21G improves ∑kfk and the total number of states compared to ZINDO. With the def2SVP and def2SVPD basis sets, ∑kfk quantizes at even numbers with a separation of about 2. For the larger basis set, def2SVPD, the number of accessible states increases, while the TRK-sum drops below the def2SVP values. We investigated the reason for this trend using methane and found the def2 basis sets to show somewhat oscillatory convergence with the basis set size. For methane, the def2SVP/def2SVPD values are 8.78/8.12, with the larger basis set value agreeing better with the aug-cc-pV5Z basis set limit value of 7.82. Residual errors in ZINDO/TDDFT TRK-sums from the total number of electrons are shown in the inset to Fig. 3.
of predictions from ZINDO, ωB97XD/3-21G, or ωB97XD/def2SVP levels with that of the ωB97XD/def2SVPD values (see Fig. 4). For atomization energies and low-lying excitation energies, these values are 3–4 kcal mol−1, and 0.2–0.3 eV,82 respectively. For oscillator strengths, such a threshold has not been established, especially for chemical space datasets.
Fig. 4 Accuracy metrics for binned oscillator strengths in the λ ≤ 120 nm range for all molecules in bigQM7ω: (a) mean absolute error, MAE(Δλ), in atomic units (a.u.) as defined in eqn (4), (b) mean accuracy metric for normalized spectra, Φ̄(Δλ), as defined in eqn (6). Results are shown for ZINDO, ωB97XD/3-21G and ωB97XD/def2SVP for approximating ωB97XD/def2SVPD level values.
In Fig. 4a, the MAE of ZINDO shows a smaller variation with Δλ. For the extreme case of Δλ = 120 nm, where the oscillator strengths of all states are summed in a bin, ZINDO's MAE saturates to about 27.5 a.u., implying a systematic error in ZINDO. For the desired resolution of 0.94 nm, ZINDO's error increases only slightly. The MAEs improve for the spectra calculated with ωB97XD/3-21G. For the single bin case, the 3-21G results also indicate a systematic error, albeit of a smaller magnitude compared to ZINDO. The errors are further quenched for the def2SVP basis set, which for a resolution of Δλ = 0.94 nm has an MAE of about 20 a.u. Overall, the MAE-vs-Δλ dependency becomes stronger in the order: ZINDO < ωB97XD/3-21G < ωB97XD/def2SVP. This trend is in agreement with the magnitude of the TRK-sum as predicted by these methods, see Fig. 3. In general, a similar trend is noted also for individual oscillator strengths.
As pointed out in Section-3, for the limiting case of Δλ = 120 nm, when all oscillator strengths are summed in one bin, Φ is 100 for ZINDO, ωB97XD/3-21G, and ωB97XD/def2SVP methods compared to the target ωB97XD/def2SVPD (see Fig. 4b). With increasing resolution, the methods diverge from the target, with ZINDO showing the largest deviation from ωB97XD/def2SVPD. For a desirable resolution of 1% of λspectrum, Δλ ≈ 1 nm, 3-21G and def2SVP predictions result in Φs of 30–50 compared to the target, while ZINDO has a worse score of ≈10. The reason for the poor Φs of ZINDO predictions at small resolution is that core states are absent in ZINDO, limiting the spectral range to >19.8 nm. In contrast, the density of states at the target TDωB97XD level is high in the short wavelength domain. Applying bin-specific systematic corrections can improve both accuracy metrics for all three methods: ZINDO, ωB97XD/3-21G, and ωB97XD/def2SVP. However, such corrections may not result in uniform improvement throughout the spectral range. For instance, at short wavelength regions where the TDωB97XD spectra are sharp, ZINDO lacks these lines, and systematic corrections may result in a vanishing MAE for the wrong reason. On the other hand, the effect of such corrections will be less severe for the normalized metric, Φ̄. Hence, we do not apply bin-specific systematic corrections in this analysis. Overall, at the desired resolution of 0.94 nm, among the methods inspected here, the one with the larger MAE has the smaller accuracy metric, Φ̄, and vice versa.
≈ 80 for 0.94 nm resolution FCHL model shows improved learning rates, suggesting its scope for modeling full-electronic spectra of larger datasets. These findings indicate that it is possible to employ ML modeling for reconstructing electronic spectra at a high-resolution. Since the ML models were trained on pi (binned oscillator strengths), the predicted spectra can be compared with the reference TDωB97XD spectra similarly binned. The prediction error of the reconstructed spectrum may be quantified either as a sum of absolute differences, or using the accuracy metric upon normalizing the binned intensities. The definitions of the error metric do not influence the ML-reconstruction of the spectra, but they serve merely to quantify the mean prediction accuracy.
The spectra reconstructed with these models do not contain any state-specific information, but rather indicate the intensity of dipole absorption in a finite wavelength window. At the limit of very small Δλ, these bins will correspond to individual transitions. It is worth noting that for a resolution of 0.94 nm, TDωB97XD/def2SVP spectra agree with those of the target level only with a score of ≈47. The Φ drops even further for TDωB97XD/3-21G (≈29) and ZINDO (≈9) levels. The learning rates in our evaluatory Δ-ML17 calculations using ZINDO, TDωB97XD/3-21G, or even TDωB97XD/def2SVP baseline spectra were inferior to those obtained by modeling directly on the TDωB97XD/def2SVPD target. Hence, all ML models were trained directly on the target.
In Fig. 6, we present the entire spectrum of an out-of-sample molecule, cyclohexanone, reconstructed using FCHL-ML models with 1 k training examples at three different wavelength resolutions: 3.75 nm, 1.88 nm, and 0.94 nm. Since the ML models were trained using geometries at the UFF level, these out-of-sample predictions were performed within a matter of seconds. As a part of the supplementary material, we provide a sample code for generating the spectrum using trained FCHL models (see Data Availability). For Δλ = 3.75 nm, the ML-reconstructed spectrum agrees with the target TDωB97XD spectrum with a Φ of 86.5. This accuracy drops for higher resolutions due to the fine details present in the target spectrum. Also, with increasing resolution, we note a reduction in the spectral heights in order to conserve the total area under the spectrum. For the desired value of Δλ = 0.94 nm, the spectrum of cyclohexanone is reconstructed with a score of 72.6, which is slightly lower than the mean score reported for out-of-sample predictions in Fig. 5.
Fig. 6 Electronic excitation spectrum of cyclohexanone, reconstructed at 3.75 nm, 1.88 nm, and 0.94 nm resolutions using a 1 k FCHL-KRR-ML model trained on binned oscillator strengths (pi in eqn (3)) at the TDωB97XD/def2SVPD target-level. Accuracy metrics for normalized spectra, Φ, compared to TDωB97XD reference values calculated according to eqn (5) are also given.
Further, for the highest resolution explored here, we present the ML reconstructed spectra for three more randomly drawn out-of-sample molecules in Fig. 7. For all these molecules, the predictions are better than for cyclohexanone and are illustrative of the model's mean out-of-sample performance. While the reference TDωB97XD-level binned oscillator strengths are always >0, the predicted values are not bounded; hence, we notice small negative intensities for 5,5-dimethyl-4,5-dihydro-1H-pyrazole. For all four out-of-sample molecules considered here, the spectral intensities are low for λ > 100 nm because the corresponding excitations in this region are sparse. We believe that the ML strategy for spectral reconstruction reported in this study will hold even in the interesting long-wavelength domain when these models are trained on adequate examples.
Fig. 7 Electronic excitation spectrum of three randomly selected molecules, (3Z)-5-fluoro-4-methylpenta-1,3-diene, 1-fluoropentan-3-ol, and 5,5-dimethyl-4,5-dihydro-1H-pyrazole, reconstructed at 0.94 nm resolution using a 1 k FCHL-KRR-ML model. The model was trained on TDωB97XD/def2SVPD electronic spectra in the λ ≤ 120 nm wavelength range. Accuracy metrics for normalized spectra, Φ, compared to TDωB97XD reference values calculated according to eqn (5) are also given.
880 molecules obtained at the ωB97XD/def2SVP & ωB97XD/def2TZVP levels at https://moldis.tifrh.res.in/datasets.html with both ground-state and excited-state properties.
In Fig. 8, we present a representative property query in the MolDis platform and the corresponding results. On accessing the def2SVP tab in the bigQM7ω Datasets page, we arrive at the corresponding query page. As noted in Fig. 8a, there are 11 ground state properties (dipole moment, polarizability, EHOMO, ELUMO, ELUMO–HOMO, zero-point vibrational energy, zero-Kelvin internal energy (U0), room temperature internal energy (U), room temperature enthalpy (H), room temperature Gibbs free energy (G), and constant volume heat capacity (Cv)) with available property ranges reported next to them. For a query, users need to enter values within the property range with appropriate units selected and click on the Query button. The search can be further customized by including multiple properties in the query and displaying them in ascending or descending order with respect to any property from the corresponding drop-down window. We have also enabled an option to query based on composition. In the bottom half of Fig. 8a, users can select either a set of atoms or any valid stoichiometry as listed on the right side of the query page. Upon making a successful query, users are presented with results (Fig. 8b), where the Cartesian coordinates, vibrational and electronic spectra are provided along with the magnitudes of queried properties in desired units. A JSMol applet enables visitors to visualize the structures in their browser upon clicking the “View in JSMol” button. Further, upon a successful query, both ground-state and excited-state properties for every molecule are presented to the visitor as downloadable files on the results page (Fig. 8b). This platform gives the community user-friendly access to ab initio properties collected via high-throughput chemical space investigations, hence widening the applicability scope of the bigQM7ω dataset.
880 molecules with up to 7 atoms of CONF. Geometry optimizations of the bigQM7ω molecules have been performed with the ConnGO workflow ensuring veracity in the covalent bonding connectivities encoded in their SMILES representation. Minimum energy geometries and harmonic vibrational wavenumbers are reported at the accurate, range-separated hybrid DFT level ωB97XD using def2SVP and def2TZVP basis sets. This level was selected because it has been previously shown to result in efficient geometry predictions for chemical space datasets.66 We report electronic excited state results at the TDωB97XD level using the def2SVPD basis set containing diffuse functions that are necessary for improved modeling of oscillator strengths, and high-lying states in general. Even for the low-lying excited states of the bigQM7ω molecules, we found TDωB97XD/def2SVPD to deliver more accurate results than the ωB97XD/def2TZVP combination when benchmarked against STEOM-CCSD/aug-cc-pVTZ reference values. For all molecules, full electronic spectra are calculated covering all possible excitations allowed by the TDωB97XD framework. For the small molecules H2O, NH3, and CH4, the resulting number of excited states modeled amounts to 188, 156 and 136, respectively, while for large molecules such as toluene or n-heptane the total number of excited states reported is 3222 and 5258, respectively. Our preliminary findings have shown that generating the TDωB97XD results with the even larger basis set def2TZVPD requires a several-fold increase in CPU time. However, when aiming at only a few low-lying states, our results can be improved by using approximate correlated methods such as DLPNO-STEOM-CCSD(T) or RI-CC2.
For ML modeling of the full electronic spectra, we propose an approach using locally integrated spectral intensities at various wavelength resolutions. We illustrate the existence of a resolution-vs.-accuracy dilemma for comparing full electronic spectra from different methods. The mapping between the electronic spectra and the global molecular structure-based representations improves only when the intensities are binned at a finite resolution. Semi-quantitative agreement between methods is reached only at the expense of resolution. Compared to this, ML models deliver better accuracies at a sub-nm resolution when training on a fraction of the dataset. For accurate reconstruction of full electronic spectra across chemical space with a resolution of <1 nm, we recommend FCHL-KRR-ML. Further, it may be possible to improve the ML model's performance in the long wavelength region using varying resolutions at different spectral regions. However, testing this idea requires new datasets comprising adequate data at the desired wavelength domain.
Our goal is to provide a proof-of-concept for ML modeling of binned electronic spectra and demonstrate accurate spectral reconstruction. Unfortunately, the size of the dataset limits the rigor of quantum mechanical methods and basis sets used to estimate the target spectra for ML models. While we used range-separated hybrid DFT with moderately large basis sets containing diffuse functions, inherent deficiencies in the method challenge the accuracy of the target. Further, the small size of the molecules in bigQM7ω implies that the excitations modeled are in the far UV region. However, ML modeling reproduced target spectra with errors lower than those arising from deficiencies in the quantum mechanical methods. This suggests that target spectra estimated with high-fidelity methods will also be adequately captured through ML modeling.
Improving ML modeling of excited states requires the development of new local descriptors that can map to the chromophores responsible for excitation. For this, an automated protocol to characterize electronic excited states should be developed for high-throughput chemical space design frameworks. This offers the opportunity to explore chemically diverse, photochemically interesting molecules, such as dyes, active in the UV/visible domain and to investigate the influence of chromophores and auxochromes on spectra. Another possibility is to cluster the electronic spectral data according to chromophores33,51 or by unsupervised learning.103 However, to generate accurate models, one must ensure that each cluster is adequately represented in the training set. In order to facilitate further studies, we provide all data generated for this study in public domains.
Footnote
† These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2022