Mahmoud A. E. Ibrahim,abc X. Liud and J. Pérez-Ríos*ab
aDepartment of Physics and Astronomy, Stony Brook University, Stony Brook, New York 11794, USA. E-mail: jesus.perezrios@stonybrook.edu
bInstitute for Advanced Computational Science, Stony Brook University, Stony Brook, New York 11794, USA
cDepartment of Physics, Faculty of Science, Assiut University, Assiut, 71515, Egypt
dFritz-Haber-Institut der Max-Planck-Gesellschaft, D-14195 Berlin, Germany
First published on 6th November 2023
We present a machine-learning approach to predicting spectroscopic constants based on atomic properties. After collecting spectroscopic information on diatomics and generating an extensive database, we employ Gaussian process regression to identify the most efficient characterization of molecules to predict the equilibrium distance, vibrational harmonic frequency, and dissociation energy. As a result, we show that it is possible to predict the equilibrium distance with an absolute error of 0.04 Å and the vibrational harmonic frequency with an absolute error of 36 cm−1 using only atomic properties. These results can be improved by including prior information on molecular properties, leading to absolute errors of 0.02 Å and 28 cm−1 for the equilibrium distance and vibrational harmonic frequency, respectively. In contrast, the dissociation energy is predicted with an absolute error ≲0.4 eV. Alongside these results, we prove that it is possible to predict spectroscopic constants of homonuclear molecules from the atomic and molecular properties of heteronuclear ones. Finally, based on our results, we present a new way to classify diatomic molecules beyond chemical bond properties.
On the other hand, in 1939 Newing proposed a theoretical justification for observed empirical relationships between spectroscopic constants given by
c f(Re) = μωe²   (1)
In the 1960s and 1970s, a number of authors employed the virial theorem, perturbation theory, and the Hellmann–Feynman theorem29–32 to develop a better understanding of the nature of the relationship between Re and ωe via electron densities.33–40 Most notably, Anderson and Parr established a relationship between Re, ωe, and the atomic numbers Z1 and Z2, as
(2)
(3)
(4)
Alongside these developments, several authors attempted to connect the dissociation energy, D00, with ωe and Re of diatomic molecules.19,41–45 However, these relationships received little attention due to the lack of reliable experimental data.9,41,46–48 Most of the relationships are given by
D00 = A′μωe²Re^l   (5)
Thanks to machine learning (ML) techniques and the development of extensive spectroscopic databases,49 it has been possible to study the relationship between spectroscopic constants from a heuristic perspective, i.e., from a data-driven approach,26 to find optimal potentials based on spectroscopy data,50 and to improve ab initio potentials to match experimental observations.51 In particular, Gaussian process regression (GPR) models have been used on a large dataset of 256 heteronuclear diatomic molecules. As a result, it was possible to predict Re from the atomic properties of the constituent atoms. Similarly, ωe and the binding energy D00 were predicted using combinations of atomic and molecular properties. However, the work of Liu et al. only studied heteronuclear molecules. Hence, the universality of the relationships between spectroscopic constants remained to be tested. On the other hand, ML techniques can be used to extend density functional theory approaches to larger systems at low computational cost.52–55 Hence, ML techniques enlarge the capabilities of quantum chemistry methods. However, if sufficient data and information are available, could ML substitute quantum chemistry methods?
In this work, we present a novel study of the relationships between spectroscopic constants via ML models, including homonuclear molecules in a dataset of 339 molecules: the largest dataset of diatomics ever used. As a result, first, we show that it is possible to predict Re and ωe with mean absolute errors of ∼0.026 Å and ∼26 cm−1, an improvement by a factor of 2 in predictive power and accuracy with respect to previous ML models. Furthermore, the dissociation energy, D00, is predicted with a mean absolute error of ∼0.4 eV, in accordance with previous ML models, although our model would benefit from a more accurate and extensive database. Second, we show that it is possible to accurately predict the molecular properties of homonuclear molecules from heteronuclear ones. Finally, since we use the same ML model throughout this work, we are in a unique position to define similarity among molecules. Thus, we propose a data-driven classification of molecules. The paper is organized as follows: in Section 2, we introduce the database and analyze its main properties; in Section 3, we present the ML models with a particular emphasis on Gaussian process regression; in Section 4, we present our results; and in Section 5, the conclusions.
To assess the variation of the spectroscopic constants in the dataset, we display the histograms and box plots of Re, ωe, and D00 in Fig. 1. This figure shows that the histograms of the spectroscopic constants are nearly unimodal. However, Re and ωe show heavy-tailed distributions. In the case of Re, the tail is due to the presence of van der Waals molecules, whereas light molecules are responsible for the tail in the histogram of ωe. The box plot of D00 shows almost no outliers and only an incipient peak for molecules with binding energies smaller than 0.75 eV, due to the presence of van der Waals molecules. We then investigate the relationship between pairs of spectroscopic constants in panels (d)–(f) of Fig. 1. For example, panel (d) displays Re versus ωe, showing an exponential trend similar to the one suggested by eqn (2) or a power law (Morse relationship). In contrast, plotting Re versus D00 and D00 versus ωe in panels (e) and (f), respectively, we notice a large dispersion of the points with no observable trend in either panel.
Next, we analyze the chemical properties of the molecules under consideration, employing the van Arkel–Ketelaar triangle (also known as the Jensen triangle), which qualitatively separates covalent, ionic, and van der Waals molecules. The triangle plots the absolute value of the electronegativity difference between the constituent atoms, |χa − χb|, versus their average electronegativity, as shown in Fig. 2, where χa and χb denote the electronegativities of the molecule's constituent atoms. The average electronegativity on the x-axis quantifies van der Waals versus covalent bonding, whereas the difference in electronegativity on the y-axis quantifies the ionic character. The triangle shows that the data set comprises chemically diverse diatomic molecules, with bonding characters ranging from covalent to ionic alongside many van der Waals molecules. This chemical diversity strongly manifests itself in the range of the ground-state spectroscopic constants depicted in Fig. 1.
We are interested in mapping features to target labels via a regression model yi = f(xi) + εi, where f(xi) is the regression function and εi is an additive noise term. We further assume that εi follows an independent, identically distributed (i.i.d.) Gaussian distribution with variance σn²,

εi ∼ 𝒩(0, σn²)   (6)
One approach to tackle the regression problem is to specify a functional form of f(xi) and then set the free parameters of the regression model by fitting the data. Alternatively, one can avoid specifying a functional form of f(xi) and instead place a prior distribution over a space of functions, inferring the posterior predictive distribution following a Bayesian non-parametric approach. Within this group of methods, we find Gaussian process regression (GPR), which assumes a Gaussian process prior over the space of functions.128,129
f(x) ∼ 𝒢𝒫(m(x), k(x, x′))   (7)
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. A Gaussian process is specified by a mean function m(x) and a covariance function (kernel) k(x, x′); we describe both below. The posterior distribution of the value of f(x*) at some point of interest, x*, is determined through Bayes' theorem as

μ* = m(x*) + k*ᵀ(K + σn²I)⁻¹(y − m(X))   (8)

Σ* = k(x*, x*) − k*ᵀ(K + σn²I)⁻¹k*   (9)

where K is the covariance matrix with entries Kij = k(xi, xj), k* is the vector with entries k(xi, x*), and y is the vector of training labels.
The mean of the resulting predictive posterior distribution, μ*, is used to obtain a point estimate of the value of f(x*), and its covariance Σ* provides a confidence interval.
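To make eqns (8) and (9) concrete, the following minimal NumPy sketch (our illustration, not the authors' code) computes the predictive posterior for a zero prior mean, using the closed-form Matérn ν = 3/2 kernel introduced in the next subsection:

```python
import numpy as np

def matern32(XA, XB, length_scale=1.0, variance=1.0):
    """Matérn nu = 3/2 kernel: k(r) = s^2 (1 + sqrt(3) r/l) exp(-sqrt(3) r/l)."""
    r = np.linalg.norm(XA[:, None, :] - XB[None, :, :], axis=-1)
    a = np.sqrt(3.0) * r / length_scale
    return variance * (1.0 + a) * np.exp(-a)

def gpr_posterior(X_train, y_train, X_star, sigma_n=0.1, **kern):
    """Predictive mean (eqn (8)) and covariance (eqn (9)) with m(x) = 0."""
    K = matern32(X_train, X_train, **kern) + sigma_n**2 * np.eye(len(X_train))
    K_s = matern32(X_train, X_star, **kern)    # k(x_i, x*)
    K_ss = matern32(X_star, X_star, **kern)    # k(x*, x*')
    mu_star = K_s.T @ np.linalg.solve(K, y_train)          # eqn (8)
    Sigma_star = K_ss - K_s.T @ np.linalg.solve(K, K_s)    # eqn (9)
    return mu_star, Sigma_star

# Toy usage: five noisy samples of a smooth curve, one query point.
X = np.linspace(0.0, 5.0, 5).reshape(-1, 1)
y = np.sin(X).ravel()
mu, Sigma = gpr_posterior(X, y, np.array([[2.5]]))
```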
In GPR, the regression model is completely specified by the kernel k(x, x′). The kernel is a similarity measure that specifies the correlation between a pair of values f(x) and f(x′) using only the distance between the feature vectors x and x′ as its input. By specifying a kernel, we encode high-level structural assumptions (e.g., smoothness, periodicity, etc.) about the regression function. Here, we focus on the Matérn class of kernels, given by
k(x, x′) = σf² [2^(1−ν)/Γ(ν)] (√(2ν)‖x − x′‖/l)^ν Kν(√(2ν)‖x − x′‖/l)   (10)

where Γ is the gamma function, Kν is the modified Bessel function of the second kind, l is the length scale, σf² is the signal variance, and the smoothness hyperparameter ν controls the differentiability of the regression function.
We can encode a physical model via the relationships between spectroscopic constants by specifying the Gaussian process prior mean function m(x). A common choice of the prior mean function is m(x) = 0. This choice is satisfactory in most cases, especially in interpolation tasks. However, selecting an appropriate prior mean function can simplify the learning process (delivering better results using fewer data). The mean function can also guide the model for better predictions as k(xp, xq) → 0; this is necessary for extrapolation and interpolation among sparse data points. Further, a model with a specified mean function is more interpretable.
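Since common GPR implementations (e.g., scikit-learn's GaussianProcessRegressor) assume a zero or constant prior mean, one way to realize a linear prior mean of the kind used below is to fit the mean first and let the GP model the residuals. This is a sketch under that assumption, not necessarily the authors' implementation:

```python
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

class GPRWithLinearMean(BaseEstimator, RegressorMixin):
    """GPR with a linear prior mean m(x): the linear coefficients are refit
    on each training split, and a Matérn GP (plus a learned noise term)
    models the residuals y - m(x)."""

    def __init__(self, nu=1.5):
        self.nu = nu

    def fit(self, X, y):
        self.mean_ = LinearRegression().fit(X, y)
        kernel = Matern(nu=self.nu) + WhiteKernel()  # WhiteKernel learns sigma_n^2
        self.gp_ = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
        self.gp_.fit(X, y - self.mean_.predict(X))
        return self

    def predict(self, X, return_std=False):
        if return_std:
            res, std = self.gp_.predict(X, return_std=True)
            return self.mean_.predict(X) + res, std
        return self.mean_.predict(X) + self.gp_.predict(X)
```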
To design a model, we choose a feature set X suitable for learning y through a GPR. We then choose a convenient prior mean function m(X) based on physical intuition. The last hyperparameter, ν ∈ {1/2, 3/2, 5/2, ∞}, is determined by running four models, one for each possible value of ν, and choosing the one that performs best on the training data as the final model. Precisely, a cross-validation (CV) scheme is used to evaluate the performance of each model iteratively: we split Dtv into a training set Dtrain (∼90% of Dtv) and a validation set Dvalid. We use Dtrain to fit the model and determine its parameters by maximizing the log-marginal likelihood. The fitted model is then used to predict the target labels of Dvalid. We repeat the process with a different split in each iteration, such that each element in Dtv is sampled at least once in both Dtrain and Dvalid. After many iterations, we can determine the average performance of the model. We compare the average performance of the four models after the CV process and set ν to the value used by the best-performing model.
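The selection loop just described might look as follows (a sketch reusing the GPRWithLinearMean class from the previous sketch; X and y stand for the feature matrix and target labels, and ShuffleSplit stands in for the stratified MC splitter introduced next):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score

def select_nu(X, y, nus=(0.5, 1.5, 2.5, np.inf), n_splits=100):
    """Run one model per candidate nu; keep the nu with the best mean CV MAE."""
    cv = ShuffleSplit(n_splits=n_splits, test_size=0.1, random_state=0)
    mean_mae = {}
    for nu in nus:
        scores = cross_val_score(GPRWithLinearMean(nu=nu), X, y, cv=cv,
                                 scoring="neg_mean_absolute_error")
        mean_mae[nu] = -scores.mean()
    return min(mean_mae, key=mean_mae.get)
```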
We adopt a Monte Carlo (MC) splitting scheme to generate the CV splits. Using the MC splitting scheme, we expose the models to various data compositions, and thus, we can make more confident judgments about our models' performance and generality.26 To generate a single split, we use stratified sampling.129,130 First, we stratify the training set into smaller strata based on the target label. Stratification will be such that molecules in each stratum have values within some lower and upper bounds of the target label (spectroscopic constant) of interest. Then, we sample the validation set so that each stratum is represented. Stratified sampling minimizes the change in the proportions of the data set composition upon MC splitting, ensuring that the trained model can make predictions over the full range of the target variable. Using the Monte Carlo splitting scheme with cross-validation (MC-CV) allows our models to train on Dtv in full, as well as make predictions for each molecule in Dtv. In each iteration, Dvalid simulates the testing set; thus, by the end of the MC-CV process, it provides an evaluation of the model performance against ∼90% of the molecules in the data set before the final testing stage.
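One possible realization of a single stratified MC split (a sketch; the number of strata and the validation fraction are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def stratified_mc_split(y, n_strata=10, valid_frac=0.1):
    """Bin the target into quantile strata, then sample the validation set
    so that every stratum is represented."""
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_strata - 1)
    valid = []
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        n_val = max(1, int(round(valid_frac * len(idx))))
        valid.extend(rng.choice(idx, size=n_val, replace=False))
    valid = np.asarray(valid)
    train = np.setdiff1d(np.arange(len(y)), valid)
    return train, valid
```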
We use 1000 MC-CV iterations to evaluate each model's performance. Two estimators evaluate the models' performance at each iteration, the mean absolute error (MAE) and the root mean squared error (RMSE), given by
MAE = (1/n) Σi |f(xi) − yi|   (11)

RMSE = [(1/n) Σi (f(xi) − yi)²]^(1/2)   (12)

where n is the number of molecules in Dvalid. The validation errors we report are these estimators averaged over the N MC-CV iterations,

⟨MAE⟩ = (1/N) Σj MAEj   (13)

⟨RMSE⟩ = (1/N) Σj RMSEj   (14)

together with their standard error of the mean (SEM).
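In code, aggregating these estimators over the MC-CV iterations might look like this (a sketch reusing stratified_mc_split from above; model is any estimator with fit/predict):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def mc_cv_metrics(model, X, y, n_iter=1000):
    """Mean MAE/RMSE over the MC-CV splits (eqns (13)-(14)) and their SEM."""
    maes, rmses = [], []
    for _ in range(n_iter):
        tr, va = stratified_mc_split(y)
        model.fit(X[tr], y[tr])
        pred = model.predict(X[va])
        maes.append(mean_absolute_error(y[va], pred))
        rmses.append(np.sqrt(mean_squared_error(y[va], pred)))
    maes, rmses = np.asarray(maes), np.asarray(rmses)
    sem = lambda a: a.std(ddof=1) / np.sqrt(len(a))
    return {"MAE": maes.mean(), "SEM(MAE)": sem(maes),
            "RMSE": rmses.mean(), "SEM(RMSE)": sem(rmses)}
```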
We use learning curves to evaluate the performance of the models as a function of the size of Dtrain, generating them from 500 CV splits at each training set size. The validation and training errors are plotted as a function of the size of Dtrain.
Models that have the lowest validation ⟨MAE⟩, ⟨RMSE⟩, and SEM are selected for the testing stage. In the testing stage, we fit the models to Dtv and make predictions of the target labels of Dtest. Finally, we report the validation and test MAE and RMSE as our final evaluation of the model.
In all seven newly developed models, we use the groups g1 and g2 and periods p1 and p2 of the molecules' constituent atoms and the square root of the reduced mass of the molecule, μ1/2, as features. Therefore, the set of properties {p1, g1, p2, g2, μ1/2} uniquely defines each molecule in the dataset. Additional spectroscopic properties are added to these five features for models within category (iii). Furthermore, we choose the models' features and prior mean functions using physical intuition based on the discussion in the introduction and observations from the data (Fig. 1); ν was set to 3/2 using the CV scheme discussed in the last section. The models' characteristics are given in Table 1.
Target | Model | Molecules | Features | Prior mean | ν |
---|---|---|---|---|---|
Re (Å) | rlr1 | 314 | p1 + p2, g1 + g2, ln(Z1Z2) | — | —
rlr2 | 308 | ln(ωe), p1 + p2, g1 + g2, ln(Z1Z2), ln(μ) | — | —
svmr1 | 314 | p1, g1, p2, g2 | — | 3/2
svmr2 | 314 | p1, g1, p2, g2, μ1/2 | — | 3/2
svmr3 | 308 | ln(ωe), p1, g1, p2, g2, μ1/2 | — | 3/2
r1 | 314 | p1, g1, p2, g2 | mr1 | 1/2
r2 | 314 | p1, g1, p2, g2, μ1/2 | mr2 | 3/2
r3 | 308 | p1, g1, p2, g2, μ1/2 | mr3 | 3/2
r4 | 308 | ln(ωe), p1, g1, p2, g2, μ1/2 | mr3 | 3/2
ln(ωe) | wlr1 | 308 | p1 + p2, g1 + g2, ln(Z1Z2), ln(μ) | — | —
wlr2 | 308 | Re, p1 + p2, g1 + g2, ln(Z1Z2), ln(μ) | — | —
svmw1 | 308 | p1, g1, p2, g2, μ1/2 | — | 3/2
svmw2 | 308 | Re, p1, g1, p2, g2, μ1/2 | — | 3/2
w1 | 308 | Re−1, p1, g1iso, p2, g2iso, ḡ | 0 | 5/2
w2 | 308 | p1, g1, p2, g2, μ1/2 | mw2 | 3/2
w3 | 308 | p1, g1, p2, g2, μ1/2 | mw3 | 3/2
w4 | 308 | Re, p1, g1, p2, g2, μ1/2 | mw4 | 3/2
ln(D00) | dlr1 | 244 | ln(Re), ln(ωe), p1 + p2, g1 + g2, ln(μ) | — | —
svmd1 | 244 | p1, g1, p2, g2 | — | 3/2
d1 | 244 | p1, g1, p2, g2, μ1/2 | md1 | 3/2
For all the nine implemented models, we permute the groups and periods in Dtrain in the training and validation stage and in Dtv in the testing stage to impose permutational invariance.26 That is, the models should not differentiate between x = (p1, g1, p2, g2,…) and x′ = (p2, g2, p1, g1,…) upon exchanging the numbering of the two atoms in a molecule. Eight of the models use linear prior mean functions, the linear coefficients of which are determined by fitting the linear model to Dtrain in each CV iteration in the training and validation stage and by fitting to Dtv in the testing stage.
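One way to impose this permutational invariance in practice is to augment every training example with its atom-swapped copy; a sketch assuming the feature columns are ordered (p1, g1, p2, g2, μ1/2), which may differ from the authors' exact implementation:

```python
import numpy as np

def swap_atoms(X):
    """Exchange (p1, g1) with (p2, g2); the reduced-mass column is symmetric."""
    Xs = X.copy()
    Xs[:, [0, 1, 2, 3]] = X[:, [2, 3, 0, 1]]
    return Xs

def augment_with_permutations(X, y):
    """Train on both atom orderings so the model cannot distinguish
    x = (p1, g1, p2, g2, ...) from x' = (p2, g2, p1, g1, ...)."""
    return np.vstack([X, swap_atoms(X)]), np.concatenate([y, y])
```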
For the sake of comparison with baseline ML models, we have implemented linear regression (LR) models based on eqn (3)–(5): models rlr1 and rlr2 to predict Re, wlr1 and wlr2 to predict ln(ωe), and dlr1 to predict ln(D00). The same MC-CV scheme used to train the GPR models was used to train the LR models. Further, we train support vector machine (SVM) models for regression tasks to predict Re, ωe, and D00. The hyperparameters of the Matérn 3/2 kernels for each SVM model are tuned via 1000 MC-CV steps using Bayesian optimization.131 A description of these models is given in Table 1, and a statistical summary of their performance is given in Table 2.
Target | Model | Validation ⟨MAE⟩ | Validation ⟨RMSE⟩ | Test MAE | Test RMSE
---|---|---|---|---|---|
Re (Å) | rlr1 | 0.33 | 0.54 | — | —
rlr2 | 0.112 | 0.146 | — | — | |
svmr1 | 0.043 | 0.069 | 0.044 | 0.068 | |
svmr2 | 0.039 | 0.059 | 0.046 | 0.068 | |
svmr3 | 0.025 | 0.038 | 0.025 | 0.037 | |
r1 | 0.060* | 0.100* | 0.047 | 0.070 | |
r2 | 0.041* | 0.060* | 0.046 | 0.066 | |
r3 | 0.027* | 0.039* | 0.027 | 0.038 | |
r4 | 0.026* | 0.038* | 0.027 | 0.040 | |
ωe (cm−1) | wlr1 | 218 | 316 | — | —
wlr2 | 118 | 197 | — | — | |
svmw1 | 39.4 | 65.2 | 36.4 | 53.7 | |
svmw2 | 25.8 | 42.3 | 24.7 | 31.8 | |
w1 | 33.2 ± 0.3 | 64.8 ± 1.0 | 33.5 | 61.2 | |
w2 | 40.3 ± 0.3 | 66.3 ± 0.6 | 37.9 | 53.4 | |
w3 | 27.7 ± 0.2 | 44.8 ± 0.4 | 31.3 | 39.35 |
w4 | 25.9 ± 0.2 | 41.6 ± 0.3 | 26.9 | 33.6 | |
D00 (eV) | dlr1 | 0.98 | 1.25 | — | —
svmd1 | 0.36 | 0.57 | 0.79 | 0.83 | |
d1 | 0.37 ± 0.002 | 0.52 ± 0.003 | 0.55 | 0.72 |
mr1−r2 = β0 + β1(p1 + p2) + β2(g1 + g2),   (15)
A comparison between models r1 and r2 is displayed in Fig. 3. The scatter plots show a more significant dispersion of the predictions for model r1 than for model r2. Both models show the same outliers: homonuclear and van der Waals molecules. However, for model r2, the number of outliers is smaller than for model r1, and their dispersion from the true line is significantly suppressed. As a result, model r2 performs substantially better, especially in predicting molecules with Re ≥ 3 Å (mainly van der Waals molecules). The learning curves of models r1 and r2, displayed in panels (d) and (e) of Fig. 3, respectively, show a validation curve that converges towards the training result as the size of the training set increases, indicative of the learning capability of the model, although model r2 displays a faster convergence, indicating that it learns more efficiently. Overall, model r2 improves the prediction of Re by ∼20% with respect to r1, as shown in Table 2, leading to a validation ⟨MAE⟩ and ⟨RMSE⟩ of 0.041 Å and 0.060 Å, respectively.
Motivated by previously proposed relationships between Re and ln(ωe), we introduce models r3 and r4. Model r3 employs the same features as model r2 but incorporates spectroscopic information in the prior mean function. In contrast, model r4 uses ln(ωe) as a feature. Both models have a prior mean given by
mr3−r4 = β0 + β1(p1 + p2) + β2(g1 + g2) + β3 ln(μ1/2) + β4 ln(ωe),   (16)
To highlight a few of the common outliers of the four models, we consider Li2, B2, LiCs, and LiCa. Models r1–r4 underestimate Re for Li2 by 6–10%, and they underestimate Re for B2 by 14%, 15%, 7%, and 8%, respectively, which could be connected to B2 being the only homonuclear molecule from group 13 in the data set. For LiCs, Re = 3.67 Å (ref. 87) and r2 predicts Re = 3.49 ± 0.15 Å; that is, the experimental value is 1.2 standard deviations away from the mean of the predictive posterior distribution of model r2 for LiCs, although most of the theoretical Re values of LiCs are within one standard deviation.86 For LiCa, the experimental value found by Krois et al. is Re = 3.34 Å.84 In contrast, the r4 model predicts Re = 3.20 ± 0.05 Å, almost three standard deviations away from the experimental value. However, model r2 predicts Re = 3.33 ± 0.09 Å, with only a 0.3% relative error. In addition, high-level ab initio results are within one standard deviation of the mean of the predictive posterior distribution of model r2 for LiCa: CASPT2 predicts Re = 3.40 Å,134 QCISD(T) gives Re = 3.41 Å,135 MRCI leads to Re = 3.40 Å,135 and CIPI predicts Re = 3.40 Å.136
mw2 = β0 + β1(p1 + p2) + β2(g1 + g2) + β3 ln(μ1/2),   (17)
Motivated by the relationship between ωe and Re, both w3 and w4 use the same prior mean function
mw3−w4 = β0 + β1(p1 + p2) + β2(g1 + g2) + β3Re + β4 ln(μ1/2),   (18)
Fig. 4 compares w1, w2, and w4 (the plots of w3 are similar to those of w4). We notice from panel (a) that model w1 struggles against hydrides and against hydrogen and hydrogen fluoride isotopologues. Indeed, the model significantly overestimates ωe for H2. On the other hand, panel (b) shows that w2 performs much better against hydrides and against hydrogen and hydrogen fluoride isotopologues. The w2 predictions for H2 and HF are accurate and even better than those of models w3 and w4, as shown in panel (c). Panels (a) and (b) clearly show that model w2 outperforms model w1 for molecules with larger values of ωe. Looking at the learning curves in panels (d) and (e), we see that model w2 is far more consistent than model w1, as indicated by the shade around the validation curves of both models. From Table 2, the validation SEM(RMSE) of models w2 and w1 shows that model w2 is 40% more consistent in its performance than model w1 when both models are validated using the same 1000 MC-CV splits. Furthermore, the test RMSE of w2 is 20% lower than that of w1. Model w2 has lower dimensionality than model w1 and only implements atomic properties; nevertheless, it performs similarly to model w1.
From Table 2, we see that although model w3 has a test MAE almost equal to that of model w1, models w3 and w4 have validation ⟨MAE⟩ values 15–21% lower than that of w1, indicating an overall better average performance of the newly developed models. Furthermore, w3 and w4 have validation and test RMSEs 28–36% lower than w1, showing the robustness of the two new models. Panel (c) of Fig. 4 shows minimal scatter around the true line. A few hydrides, along with BN and C2, still challenge the model; however, their absolute errors are significantly suppressed compared to w1 and w2. The validation curve of model w4 in panel (f) shows a much higher learning rate than those of w1 and w2, with a much shallower gap between the validation and learning curves. Moreover, the shadow around the validation curve is minimal at all training sizes. From Table 2, we see that w3 and w4 are far more consistent than w1, with STD(RMSE) 60–70% lower than that of w1.
On the other hand, the lower three panels in Fig. 4 show that the validation and training curves can still converge towards lower error values. Hence, all the models might benefit from training on a more extensive dataset. The training MAEs of w1, w2, w3, and w4 range between 7 and 8 cm−1, so it might be possible to reach near-spectroscopic accuracy (∼10 cm−1) by training these models on larger datasets. In the case of w2, if the validation curve's decaying trend persists upon further training, near-spectroscopic accuracy might be achieved solely through knowledge of atomic positions in the periodic table. Similarly, these models, trained on a larger database, could outperform state-of-the-art ab initio quantum chemistry methods.132,133
We highlight some of the outliers that are common to several of the models. All the models overestimate ωe for HgH by at least 12%. For IrC, w1 and w2 overestimate ωe by 30% and 25%, while w3 and w4 overestimate it by only 4% and 7%, respectively. The observed overestimation might be because HgH and IrC are the only molecules containing mercury or iridium in the dataset.
We have found two values of ωe for AuF in the literature: Saenger et al. reported ωe = 560 cm−1 in 1992 (ref. 63), while Andreev et al. reported ωe = 448 cm−1 in 2000.59 All our models predict values closer to 560 cm−1: w2 predicts ωe = 529 ± 87 cm−1, while w3 and w4 are almost in exact agreement with Saenger's value, with ωe = 568 ± 54 cm−1 and ωe = 565 ± 45 cm−1, respectively.† Our predictions agree with relativistic density functional and ab initio methods. Namely, a first-order relativistic density functional calculation predicts ωe = 491 cm−1, while the zeroth-order regular approximation within the Kohn–Sham density functional scheme, ZORA(MP), predicts ωe = 526 cm−1.61 Along the same lines, the relativistic MP2 approach predicts ωe = 590 cm−1,138 while relativistic MR-CI predicts ωe = 525 cm−1.139 A similar situation occurs in the case of ZnBr, as shown in Table 3. For 30 years, there was a discrepancy in the value of ωe of ZnBr: Gosavi et al. reported ωe ≈ 319 cm−1 in 1971,140 whereas Givan et al. reported ωe ≈ 198 cm−1 in 1982.141 The MRCI calculations by Elmoussaoui and Korek predicted ωe ≈ 267 cm−1 in 2015.142 Finally, Burton et al. experimentally reported ωe = 284 cm−1 in 2019.118 Here, w2 predicts ωe = 271.2 ± 21.7 cm−1, w3 predicts ωe = 289.5 ± 15.4 cm−1, and w4 predicts ωe = 281.0 ± 12.0 cm−1. Therefore, our predicted values are in excellent agreement with the most recent theoretical and experimental values.
Molecule | Models for Re, ωe | Predicted Re (Å) | Experimental Re (Å) | Predicted ωe (cm−1) | Experimental ωe (cm−1) | Ref. |
---|---|---|---|---|---|---|
HCl | r4, w4 | 1.267 ± 0.029 | 1.274 | 2939 ± 114 | 2990 | 56 |
r2, w2 | 1.275 ± 0.046 | 3020 ± 209 | ||||
2HCl | r4, w4 | 1.286 ± 0.027 | 1.274 | 2172 ± 80.0 | 2145 | 56 |
r2, w2 | 1.285 ± 0.0425 | 2123 ± 136 | ||||
RuC | r4, w4 | 1.614 ± 0.039 | 1.600 | 1106 ± 59.4 | 1100 | 96 |
r2, w2 | 1.644 ± 0.074 | 1066 ± 119 | ||||
WO | r4, w4 | 1.667 ± 0.046 | 1.657 | 1049 ± 65.5 | 1067 | 143 and 144 |
r2, w2 | 1.708 ± 0.088 | 994.9 ± 131 | ||||
MoC | r4, w4 | 1.652 ± 0.037 | 1.676 | 982.5 ± 49.2 | 1008 | 96 and 97 |
r2, w2 | 1.714 ± 0.057 | 1011 ± 106 | ||||
WC | r4, w4 | 1.746 ± 0.0547 | 1.714 | 1065 ± 78.3 | 983.0 | 145 |
r2, w2 | 1.645 ± 0.099 | 1097 ± 178 | ||||
NbC | r4, w4 | 1.739 ± 0.041 | 1.700 | 1019 ± 58.3 | 980.0 | 102 |
r2, w2 | 1.664 ± 0.057 | 967.7 ± 115 | ||||
NiC | r4, w4 | 1.621 ± 0.048 | 1.627 | 857 ± 55.8 | 875.0 | 104 |
r2, w2 | 1.668 ± 0.093 | 825.3 ± 114 | ||||
PdC | r4, w4 | 1.736 ± 0.032 | 1.712 | 872.0 ± 37.9 | 847.0 | 108 |
r2, w2 | 1.720 ± 0.057 | 866.6 ± 74.0 | ||||
UO | r4, w4 | 1.863 ± 0.022 | 1.838 | 888.1 ± 27.2 | 846.0 | 121 |
r2, w2 | 1.839 ± 0.033 | 893.7 ± 45.3 | ||||
NiO | r4, w4 | 1.585 ± 0.038 | 1.627 | 785.2 ± 40.2 | 839.0 | 105 |
r2, w2 | 1.667 ± 0.055 | 796.9 ± 82.9 | ||||
YC | r4, w4 | 1.907 ± 0.076 | 2.050 | 649.2 ± 70.8 | 686.0 ± 20 | 122 and 123 |
r2, w2 | 1.824 ± 0.094 | 834 ± 185 | ||||
ZnF | r4, w4 | 1.756 ± 0.029 | 1.768 | 603 ± 24.2 | 631.0 | 146 |
r2, w2 | 1.801 ± 0.053 | 580.4 ± 45.8 | ||||
NiS | r4, w4 | 1.940 ± 0.044 | 1.962 | 482 ± 28.6 | 512.0 | 106 |
r2, w2 | 1.999 ± 0.081 | 479.1 ± 58.6 | ||||
ZnCl | r4, w4 | 2.136 ± 0.028 | 2.130 | 384.8 ± 15.4 | 390.0 | 147 |
r2, w2 | 2.164 ± 0.053 | 371.0 ± 29.0 | ||||
ZnBr | r4, w4 | 2.299 ± 0.029 | 2.268 | 284.9 ± 11.7 | 284.0 | 118 |
— | — | — | 319.0 | 140 | ||
— | — | — | 198.0 | 141 | ||
r2, w2 | 2.321 ± 0.0542 | 271.1 ± 21.7 | ||||
ZnI | r4, w4 | 2.499 ± 0.030 | 2.460 | 235.5 ± 10.1 | 223.0 | 56 |
r2, w2 | 2.484 ± 0.057 | 228.0 ± 19.2 | ||||
SnI | r4, w4 | 2.722 ± 0.035 | 2.732 | 193.3 ± 9.48 | 197.0 | 107 |
r2, w2 | 2.725 ± 0.068 | 198.0 ± 19.9 | ||||
PbI | r4, w4 | 2.784 ± 0.030 | 2.798 | 156.8 ± 6.54 | 160.0 | 107 |
r2, w2 | 2.814 ± 0.056 | 154.1 ± 13.0 | ||||
CoO | r2 | 1.543 ± 0.056 | 1.628 | — | — | 72–74 |
CrC | r2 | 1.517 ± 0.099 | 1.630 | — | — | 77 |
IrSi | r2 | 2.084 ± 0.171 | 2.09 | — | — | 78 |
UF | r2 | 2.002 ± 0.081 | 2.02 | — | — | 119 |
ZrC | r2 | 1.846 ± 0.058 | 1.740 | — | — | 124 |
md1 = β0 + β1(p1 + p2) + β2(g1 + g2) + β3Re + β4 ln(μ1/2) + β5 ln(ωe),   (19)
During the development of this work, we have realized that, historically, uncertainties in the experimental values of the dissociation energy restrained the development of empirical relations connecting it to other atomic and molecular properties and led several authors to focus their efforts on the ωe–Re relation.9,41,47 More recently, Fu et al. used an ML model to predict dissociation energies of diatomic molecules, exploiting microscopic and macroscopic properties.150 They tested their model against CO and highlighted that the reported experimental dissociation energy in the literature had increased by 100 kcal mol−1 over the course of 78 years, from 1936 to 2014 (ref. 150–152) (see Table 1 of ref. 150). The data used to train model d1 are primarily collected from Huber and Herzberg's constants of diatomic molecules, first published in 1979.56 Unlike the experimental values of Re and ωe, a significant number of D00 values have been updated since 1980.48 To name a few, MgD, MgBr, MgO, CaCl, CaO, SrI, SrO, TiS, NbO, AgF, AgBr, and BrF all have their experimental values updated with at least a ±2.3 kcal mol−1 difference from their values in Huber and Herzberg.57 Moreover, for some molecules, the uncertainties in the D00 experimental values are not within chemical accuracy. For instance, MgH, CaCl, CaO, CaS, SrH, BaO, BaS, ScF, TiF, NbO, and BrF have uncertainties ranging from ±1 kcal mol−1 up to ±8 kcal mol−1.48
Based on the previous discussion, we can connect the unsatisfactory performance of model d1, in comparison to our Re and ωe models, to noise modeling. Unlike for Re and ωe, it is most likely that the uncertainties in D00 experimental values derive from various systematic effects. Therefore, modeling the errors in D00 experimental values as identically distributed, as in eqn (6), might not be a proper treatment. Thus, to develop better models for predicting D00, more sophisticated error-modeling techniques might be required. To this end, gathering more reliable data with experimental uncertainties within ±1 kcal mol−1 might be sufficient. This is something we are currently working on, and it will be published elsewhere.
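As a minimal illustration of such error modeling (our sketch, not the authors' method), per-molecule experimental variances can be passed to the GP through scikit-learn's per-sample alpha argument, relaxing the shared-σn² assumption of eqn (6):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def fit_d0_with_known_uncertainties(X, y, sigma_exp):
    """Heteroscedastic GP for D0: sigma_exp holds one (hypothetical)
    experimental uncertainty per training molecule."""
    gpr = GaussianProcessRegressor(kernel=Matern(nu=1.5),
                                   alpha=np.asarray(sigma_exp) ** 2,
                                   n_restarts_optimizer=5)
    return gpr.fit(X, y)
```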
Molecule | Ab initio Re (Å) | r2 predicted Re (Å) | Ab initio ωe (cm−1) | w2 predicted ωe (cm−1)
---|---|---|---|---|
LiFr | 3.691 | 3.709 ± 0.123 | 180.7 | 198.9 ± 35.9 |
KFr | 4.284 | 4.483 ± 0.173 | 65.2 | 64.0 ± 16.2 |
RbFr | 4.429 | 4.389 ± 0.145 | 46.0 | 48.8 ± 10.4 |
CsFr | 4.646 | 4.403 ± 0.221 | 37.7 | 42.7 ± 13.7 |
Fig. 7 A heat map quantifying the degree of similarity among molecules in the data set, from 0 (white, not similar) to 1 (black, identical) on a grayscale. The heat map was generated by computing the elements of a similarity matrix; each element quantifies the similarity between a pair of molecules p (on the x-axis) and q (on the y-axis) via eqn (10), with ν = 3/2 and parameters determined as described in the text.
To further explore the quantified similarity among molecules, we consider three subsets of molecules and show their heat maps in the upper panels of Fig. 8. The lower panels of Fig. 8 show the corresponding network representation of the similarity among these subsets of molecules. Black squares in the heat map plots of Fig. 8 indicate that a pair of molecules is highly similar, whereas white squares indicate 0% similarity. The network representation depicts each molecule as a node, and the similarity between two molecules is shown as a line joining their corresponding nodes. The networks only show similarities above the 80% level: a line joins two nodes only if they are at least 80% similar, and its length indicates the degree of similarity above that level, with a short line indicating a high degree of similarity and a long line a lower one.
From panel (a) of Fig. 8, we see noble gas dimers clustering around Xe2, and alkali metal–alkaline earth metal molecules clustering around NaRb. Both clusters are isolated from each other and from VF, indicating a lower degree of similarity between these clusters and VF. A similar scenario is observed in panel (b), where alkaline earth metal hydrides cluster among themselves with tight interconnections, indicating high similarity. On the other hand, ZnH is remotely connected to the cluster, indicating a lower degree of similarity. The upper right cluster shows an interconnection among diatomics of reactive nonmetals, including halides and oxygen; notably, AgAl is connected to these molecules. Panel (c) displays a more involved clustering scheme involving transition metal hydrides (MnH and AgH), connected to metalloid hydrides (TlH and InH) and, to a lower degree, to alkaline earth metal hydrides (LiH and BeH). The right-hand-side cluster consists of various transition metal diatomics, dihalides, and others, all closely related except for MgS. Note that all the molecules in the right-hand-side cluster consist of atoms from the right side of the periodic table, whereas MgS combines one atom from group 2 and one from group 16. Notably, homonuclear and heteronuclear molecules are firmly embedded together within all the clusters, emphasizing the importance of including homonuclear data in our models.
Since only atomic properties are required to find elements of the matrix representation of the kernel, the similarity matrix can guide us in our data-gathering efforts. For example, we can determine which molecules can fill the gaps and connect clusters to build more robust models. More interestingly, we can systematically classify molecules based on the similarity matrix. Such classification would help develop potential energy surfaces (PES) for diatomic molecules. As pointed out by Newing, similar molecules will have similar potential energy surfaces, at least around Re.28
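A sketch of how such a similarity network can be assembled from a fitted kernel (assuming NumPy and networkx; for the residual-mean model sketched earlier, the fitted kernel is model.gp_.kernel_):

```python
import numpy as np
import networkx as nx

def similarity_network(kernel, X, names, threshold=0.8):
    """Normalized kernel matrix as the similarity measure; one node per
    molecule, with an edge wherever the similarity exceeds the threshold."""
    K = kernel(X)                    # fitted kernel evaluated on all pairs
    d = np.sqrt(np.diag(K))
    S = K / np.outer(d, d)           # scale so that self-similarity is 1
    G = nx.Graph()
    G.add_nodes_from(names)
    for p in range(len(names)):
        for q in range(p + 1, len(names)):
            if S[p, q] >= threshold:
                G.add_edge(names[p], names[q], weight=float(S[p, q]))
    return S, G
```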
• Models in category (i) only employ information from the periodic table and thus can predict spectroscopic properties of any combination of two elements. These models can be used to systematically classify molecules made up of any two elements in the periodic table (Section 4.6). While spectroscopic data availability does not limit these models' ability to predict spectroscopic constants of any molecule, it affects the models' accuracy. These models are characterized by a relatively larger gap between validation and learning curves than models in categories (ii) and (iii). Thus, we would expect a better performance of category (i) models upon training on larger datasets.
• Models in category (ii) use spectroscopic information only in their mean function but not in the kernel, and are robust against noise in input variables. In this case, since the mean function is a linear function, we can apply standard errors-in-variables methods.153 This might be advantageous if we would like to use uncertain experimental data or predictions from (i) models or ab initio methods to train our models.
• Models in category (iii) include our most flexible, accurate, and consistent models (r4, w4). These models benefit from a high learning rate and a narrow gap between validation and learning curves. Apart from their outstanding performance, we can train these models using ground and excited states simultaneously since each state will be labeled by its spectroscopic constant values Re or ωe along with other properties that define the molecule {p1, g1, p2, g2, μ1/2}.
In summary, the newly developed models in this work show outstanding performance in all metrics in comparison to previous ML models and other empirical and semiempirical models, with mean absolute errors ranging between 0.02 Å and 0.04 Å for Re, and 26 cm−1 to 40 cm−1 for ωe. We have been able to predict homonuclear spectroscopic properties with good accuracy upon training our models on heteronuclear molecules' data. Indeed, our models are almost as accurate as state-of-the-art ab initio methods for diatomics.132,133 In addition, our models only require data, whereas ab initio quantum chemistry methods require specific expertise from the user.
On the other hand, since we use the same kernel for all models under consideration, we are uniquely positioned to study a way to classify diatomic molecules beyond the traditional one based on the nature of the chemical bond. We expect such classification to enhance the performance and facilitate the development of ML models predicting spectroscopic and molecular properties of diatomic molecules. Further, the classification of diatomic molecules should help develop potential energy surfaces (PES).
Finally, we have shown that for molecules with large ionic character and containing heavy atoms (e.g., LiCs, LiCa, AuF, and ZnBr), our predictions are comparable to DFT and even the state-of-the-art ab initio methods. Moreover, two of our models (r2 and w2) offer a promising opportunity to predict spectroscopic properties from atomic positions in the periodic table with high accuracy. This is a stepping stone towards closing the gap between atomic and molecular information; more spectroscopy data is required to do so. More extensive, open, and user-friendly data will help the field of data-driven science to understand the chemical bonding and spectroscopy of small molecules. Indeed, that is something that we are currently working on in our group: we need more spectroscopic data in the big data era. Finally, it is worth mentioning that we are approaching a period in which machine learning techniques are as accurate as ab initio quantum chemistry methods for calculating spectroscopic constants of diatomics with almost no computational effort.
Footnote
† The w2, w3, and w4 predictions for AuF in the main text were obtained with HgCl, HgI, and HgBr included in the training set. To test the robustness of the models, we removed those three molecules from the training set, since their Re values might be related to HgCl2, HgI2, and HgBr2.56,137 Indeed, those molecules could affect the model predictions because they are closely related to AuF: Au (group 11) and Hg (group 12) are members of the sixth period, and F, Cl, I, and Br are all halogens. However, in this case, w2, w3, and w4 predict ωe ∼ 530 cm−1, ωe ∼ 600 cm−1, and ωe ∼ 590 cm−1, respectively, in good agreement with the predictions in the main text, experimental results, and ab initio methods.