Xiangyue Liu,
Gerard Meijer and
Jesús Pérez-Ríos*
Fritz-Haber-Institut der Max-Planck-Gesellschaft, Faradayweg 4-6, 14195 Berlin, Germany. E-mail: jperezri@fhi-berlin.mpg.de
First published on 19th April 2021
Through a machine learning approach, we show that the equilibrium distance, harmonic vibrational frequency and binding energy of diatomic molecules are related, independently of the nature of the bond of a molecule; they depend solely on the group and period of the constituent atoms. As a result, we show that by employing the group and period of the atoms that form a molecule, the spectroscopic constants are predicted with an accuracy of <5%, whereas for the A excited electronic state it is necessary to include other atomic properties, leading to an accuracy of <11%.
The development of quantum chemistry helped to shed some light on the physics behind empirical relationships between spectroscopic constants. In particular, thanks to the application of the Hellmann–Feynman theorem, it was possible to connect ωe directly with the electronic density at Re.19–22 As a result, a first-principles-based explanation (containing a few free parameters) of the observed empirical relations between spectroscopic constants appeared.23–30 Nevertheless, the obtained relations based on the electronic density were only valid for subsets of molecules. To date, it has not been possible to find general relations for spectroscopic constants of diatomic molecules in terms of the properties of their constituent atoms.
The accuracy of quantum chemistry methods relies on (finite) basis sets optimized for each element under certain bounds. At the same time, an accurate description of the system's electronic structure is required, which is achieved through a hierarchy of different treatments of the electron correlation.31,32 On the other hand, the widely-used (Kohn–Sham) density functional theory (DFT) methods require accurate electron exchange–correlation density functionals. The non-empirical density functionals are derived under certain constraints, some with several free parameters,33–36 while the semi-empirical density functionals employ more flexible functional forms with (sometimes even several tens of) coefficients fitted to various experimental or theoretical reference properties.33,37 Machine learning (ML) methods, on the other hand, discover the underlying relationships from data (the so-called “training set”) and build up models on top of them. These models can be quantitatively predictive for other systems that follow similar underlying physics. More importantly, they provide possibilities for discovering relationships between the different properties of the system under consideration.38,39
This work shows that the relationship between spectroscopic constants of heteronuclear diatomic molecules is general for most kinds of molecules at hand. Our findings rely upon applying state-of-the-art ML models to a well-established dataset of experimental spectroscopic constants for diatomic molecules. In particular, we apply the Gaussian process (GP) regression model40 to predict Re, ωe, and the binding energy, D0, as a function of the group and period of the constituent atoms. Similarly, our model can predict Re and ωe for the A excited electronic state of a given molecule. Our findings generalize the idea that some of a system's chemical properties depend on the group and period of its atoms. Indeed, the periodicity of elements has long been used to predict the properties of chemical compounds intuitively, at a qualitative level. However, the correlations between chemical properties and the periodicity of the constituent atoms are not always straightforward, and such predictions can hardly be quantitative in most cases. On the contrary, our main result is quantitatively meaningful: it is possible to predict those spectroscopic constants with an accuracy of <5% for ground electronic states and <11% for the A excited electronic state. More interestingly, by analyzing our models' outliers, we show that molecules with a non-chemical bond nature, like bi-alkali molecules, and molecules containing first-row elements, such as HF, are more difficult to predict. However, the spectroscopic constants of molecules containing transition metals, which are challenging for quantum chemistry methods, can be adequately described.
Fig. 1 Distribution and box plots of Re^a ωe^b with different powers a and b, combined with the reduced mass m and the number of valence electrons n.
At the same time, more spectroscopic information on molecules became available, and more advanced and accurate quantum chemistry tools were developed. Therefore, it was possible to search for a first-principles explanation of the empirically observed relationships between spectroscopic constants. In that endeavor, Parr and coworkers took the lead by looking at the electron density within a molecule as the source of the relationship between spectroscopic constants. The model assumes that the electron density each atom creates at the position of the other is equal at the equilibrium distance, i.e., at the sum of the two atomic radii. In particular, the electron density of atom 1 at the position of atom 2, within a molecule, is given by27
$$\rho_1(2) = C Z_1 \exp(-\xi R_1), \qquad (1)$$
$$Z_1 Z_2 = A \exp(\xi R_e), \qquad (2)$$
$$R_e = \xi^{-1}\log(Z_1 Z_2) - \xi^{-1}\log A. \qquad (3)$$
However, the performance of this relationship has only been checked for molecules with atoms coming from the same group of the periodic table.29
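Eqn (3) is a linear relation between Re and log(Z1Z2), so the parameters ξ and A can be obtained by a least-squares fit; the sketch below illustrates this with hypothetical (Z1Z2, Re) values, not the paper's data.

```python
import numpy as np

# Hypothetical (Z1*Z2, Re) pairs, for illustration only
Z1Z2 = np.array([9.0, 45.0, 153.0, 318.0])
Re = np.array([1.4, 1.9, 2.3, 2.6])          # Angstrom

# Fit Re = slope * log(Z1Z2) + intercept; by eqn (3), slope = 1/xi
slope, intercept = np.polyfit(np.log(Z1Z2), Re, 1)
xi = 1.0 / slope
Re_pred = slope * np.log(Z1Z2) + intercept   # fitted equilibrium distances
```

This is essentially the linear-regression (LR) baseline against which the GP models are later compared.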
Anderson, Parr and coworkers also suggested a relationship between ωe and Re29 as
$$m\,\omega_e^2 = 4\pi C Z_1 Z_2\, e^{-2R_e}, \qquad (4)$$
(5)
In the same vein, following the relationship between the equilibrium distance and the harmonic vibrational frequency, it is possible to find a relationship between the atomic number Zi, Re, and the dissociation energy De, as27,29,30,45
(6)
(7)
For the derivation of eqn (6) it must be assumed, without any further justification, that $D_e = A m\omega_e^2 R_e^{\,l}$.30 In eqn (7), l = 3 and ξ′ = 0.97. Eqn (7) has been tested on a dataset of 150 molecules with good results, although no further characterization of the model performance was reported from which to judge its quality objectively. Finally, using the relation between the dissociation energy, De, and the binding energy, D0,
$$D_0 = D_e - \frac{\omega_e}{2}. \qquad (8)$$
As far as we know, this is the most extensive dataset of experimental ground state properties of heteronuclear diatomic molecules. Fig. 2 shows the distribution of the equilibrium distance and of its ratio to the sum of the atomic radii of the constituent atoms, R1 + R2, for molecules within the dataset. Most molecules show an equilibrium distance between 1.4 Å and 3.8 Å, with a most probable value of 1.7 Å. Looking at the values of Re/(R1 + R2), it is clear that the molecules within the dataset exhibit different types of bonds: covalent, van der Waals, and ionic.
Fig. 2 Ratio of the equilibrium distance, Re, to the sum of the atomic radii of the atoms forming a molecule, R1 + R2, vs. Re. The background color indicates the nature of the molecular bond in each of the molecules. The density in the upper part of the figure shows the kernel density distribution of Re. The box plot shows the minimum, the maximum, the sample median, and the first and third quartiles of Re. The empirical atomic radii of the atoms are taken from ref. 46.
We have classified the dataset based on the types of constituent atoms within a molecule, and the results are shown in Fig. 3. As a result, we notice that the dataset mainly consists of various metal and non-metal halides, hydrides, and metalloid compounds. It is worth noticing that more than 20% of the dataset contains transition metal compounds, including f-block elements. Therefore, the present dataset is general since it goes beyond the main-group diatomic molecules and deals with some of the more intriguing and complex atoms from a chemistry standpoint.
In addition to the above-mentioned dataset of ground state properties, we also study 131 molecules whose Re and ωe are available for the A excited electronic state. The A-state dataset mainly consists of metal and non-metal compounds, including transition metal compounds and several f-block compounds.
(9)
In this work, the spectroscopic properties y are modeled as
$$y(\mathbf{x}) = h(\mathbf{x})^{\mathrm{T}}\boldsymbol{\beta} + f(\mathbf{x}), \quad f(\mathbf{x}) \sim \mathcal{GP}\big(0, k(\mathbf{x},\mathbf{x}')\big), \qquad (10)$$
where h(x) denotes the explicit basis functions with coefficients β, and k(x, x′) is the covariance function.
The present dataset is relatively small from an ML perspective. When the dataset is split into training and test sets, the training set may not be representative, which may bias the performance on the test set. To solve this problem, we have employed a Monte Carlo (MC) approach, in which the dataset is stratified into 25 strata based on the level of the true values of the labels (Re, ωe, and D0 in the present work).
As shown in panel (a) of Fig. 4, the training and evaluation of the models involve two loops. In the outer loop, we split the dataset into a training set and a test set: the training set is used to learn from the data, and the test set is used for model evaluation. In the inner loop, we train the models with the training set, which is further split to perform a stratified 5-fold cross validation (CV) for the hyperparameter optimization. In particular, as shown in panel (b) of Fig. 4, the training/test splittings in the outer loop are done by a Monte Carlo (MC) approach. Specifically, we randomly select 25 test molecules from the dataset, which is stratified into 25 strata based on the levels of the true values of the labels. The stratification helps to minimize changes in the composition of the dataset upon splitting.55 In each MC step, a regression model is trained and gives predictions for the training and test sets. Therefore, in this work we report the mean and standard deviation of the predictions for each molecule over all the MC steps in which it appears in the training or test set. In total, we evaluate our models with 1000 MC steps for the training/test splittings in the model performance evaluation, and 500 MC steps for generating the learning curves.
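The stratified MC splitting scheme described above can be sketched as follows, using synthetic data in place of the experimental dataset; here scikit-learn's `train_test_split` with a `stratify` array plays the role of the stratified test-set selection, and fewer strata are used so each keeps enough members (the fitted "model" is a trivial baseline placeholder).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))          # stand-in for (g1, g2, p1, p2)
y = rng.normal(size=256)               # stand-in label, e.g. Re

# Stratify by label level: bin y into equal-count strata (25 in the paper;
# fewer here so that every stratum keeps at least 2 members).
n_strata = 8
edges = np.quantile(y, np.linspace(0, 1, n_strata + 1)[1:-1])
strata = np.digitize(y, edges)

maes = []
for step in range(100):                # 1000 MC steps in the paper
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=25, stratify=strata, random_state=step)
    # ... fit a regression model on (X_tr, y_tr) here ...
    maes.append(np.mean(np.abs(y_te - y_tr.mean())))  # dummy baseline "model"

print(f"baseline MAE over MC steps: {np.mean(maes):.3f} +/- {np.std(maes):.3f}")
```

Reporting the mean and spread over MC steps, as done in the paper, makes the evaluation robust to any single unlucky split.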
The performance of the models is evaluated by three different estimators. The first estimator is the mean absolute error (MAE) defined as
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|, \qquad (11)$$
where $y_i$ and $\hat{y}_i$ are the true and predicted values, and $N$ is the number of molecules.
The second estimator is the root-mean-square error (RMSE), defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}. \qquad (12)$$
The last estimator is the normalized error rE, defined as the ratio of the RMSE to the range of y,
$$r_E = \frac{\mathrm{RMSE}}{y_{\max} - y_{\min}}. \qquad (13)$$
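The three estimators above can be written compactly as NumPy functions (the function names are ours, not the paper's):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, eqn (11)."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root-mean-square error, eqn (12)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r_e(y_true, y_pred):
    """Normalized error rE, eqn (13): RMSE over the range of the true labels."""
    return rmse(y_true, y_pred) / (np.max(y_true) - np.min(y_true))

# toy check
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(mae(y_true, y_pred))   # ≈ 0.15
```

Normalizing by the label range, as in rE, allows errors on quantities with very different scales (Å, cm−1, eV) to be compared directly.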
The performance of the GP regression model for the ground state Re as a function of the input features (g1, g2, p1, p2) is shown in Fig. 5, where the MAE associated with each of the distinct types of molecules is reported. Most of the molecules are well described by our GP model, as confirmed in the inset of Fig. 5, which shows little dispersion of the predicted values with respect to the true values, except for a handful of molecules (transition metal–metal and bi-alkali molecules). To further quantify the GP regression model performance, we calculate the average RMSE of the predicted Re on 1000 randomly selected test sets, leading to 0.0968 ± 0.0070 Å (Table 1) and rE = 2.80 ± 0.20%. Our results confirm that the model performance improves as the number of molecules in the training set, N, grows, as shown in the learning curve in panel (a) of Fig. 8. Indeed, the curve is not yet converged for N = 231, suggesting that the GP regression model can be further improved by learning from more data in the training set.
| Property | Model | Feature | Test MAE | Test RMSE | Test rE (%) |
|---|---|---|---|---|---|
| Re (Å) | GPR | (g1, g2, p1, p2) | 0.0662 ± 0.0037 | 0.0968 ± 0.0070 | 2.80 ± 0.20 |
|  | LR | log(Z1Z2) | 0.2605 ± 0.0018 | 0.3591 ± 0.0006 | 10.41 ± 0.01 |
| ωe (cm−1) | GPR | (Re−1, g1, g2, p1, p2) | 126.7 ± 2.1 | 207.2 ± 2.6 | 5.07 ± 0.06 |
|  |  | (R̃e−1, g1, g2, p1, p2)a | 152.5 ± 3.6 | 227.5 ± 4.6 | 5.56 ± 0.11 |
|  |  | (Re−1, giso1, giso2, p1, p2) | 61.5 ± 2.9 | 142.8 ± 7.0 | 3.49 ± 0.17 |
|  |  | (R̃e−1, giso1, giso2, p1, p2) | 96.9 ± 2.9 | 176.0 ± 13.1 | 4.30 ± 0.32 |
|  |  | (Re−1, giso1, giso2, p1, p2, p̄) | 67.5 ± 1.0 | 151.8 ± 9.5 | 3.71 ± 0.2 |
|  |  | (R̃e−1, giso1, giso2, p1, p2, p̄) | 101.8 ± 5.4 | 188.7 ± 25.4 | 4.61 ± 0.62 |
|  |  | (Re−1, giso1, giso2, p1, p2, ḡ) | 46.7 ± 0.6 | 73.4 ± 0.2 | 1.80 ± 0.005 |
|  |  | (R̃e−1, giso1, giso2, p1, p2, ḡ) | 81.0 ± 0.82 | 121.8 ± 0.8 | 2.98 ± 0.02 |
|  | LR |  | 376.5 ± 6.6 | 529.4 ± 1.2 | 12.95 ± 0.03 |
|  |  | Re−2 | 209.6 ± 5.4 | 297.3 ± 1.4 | 7.27 ± 0.03 |
| D0 (eV) | GPR | (Re, ḡ, p̄) | 0.249 ± 0.008 | 0.357 ± 0.007 | 3.52 ± 0.07 |
|  |  | (R̃e, ḡ, p̄) | 0.270 ± 0.006 | 0.451 ± 0.007 | 4.45 ± 0.07 |
|  | LR | Re | 0.833 ± 0.004 | 1.018 ± 0.014 | 10.03 ± 0.14 |

a R̃e is the Re value predicted from (g1, g2, p1, p2).
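A GP regression on periodic-table features of the kind reported above can be sketched with scikit-learn as follows; the synthetic integer features and the anisotropic RBF kernel are placeholders for the actual experimental (g1, g2, p1, p2) → Re dataset and the paper's kernel choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rng = np.random.default_rng(1)
# Synthetic (g1, g2, p1, p2)-like integer features and a toy linear target
X = np.column_stack([rng.integers(1, 19, 200), rng.integers(1, 19, 200),
                     rng.integers(1, 8, 200), rng.integers(1, 8, 200)]).astype(float)
y = 0.1 * X.sum(axis=1) + rng.normal(scale=0.05, size=200)

kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(4))  # one length scale per feature
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                               n_restarts_optimizer=2, random_state=0)
gpr.fit(X[:175], y[:175])
pred, std = gpr.predict(X[175:], return_std=True)  # mean and per-point uncertainty
mae_test = np.mean(np.abs(pred - y[175:]))
```

A practical advantage of GP regression over plain fits is the predictive standard deviation `std`, which flags molecules whose features lie far from the training data.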
In learning ωe, we find (R̃e−1, giso1, giso2, p1, p2, ḡ) to be the best combination of features, where R̃e is the equilibrium distance predicted from (g1, g2, p1, p2), gisok encodes the information about the hydrogen isotopes of the k-th atom in the molecule, and ḡ is the average of the groups of the two atoms. However, a much better performance is found when the true Re value is employed. The GP model's performance is shown in the inset of Fig. 6, where the predicted values agree very well with the true values. Indeed, the test set MAE and RMSE are 46.7 ± 0.6 cm−1 and 73.4 ± 0.2 cm−1, respectively, while rE = 1.80 ± 0.005%, as shown in Table 1. Despite the outstanding performance of our GPR model, some molecules are still not well described, as shown in Fig. 6. These outliers include HF, DF, and HgH. The large errors in predicting ωe of HF and DF can be attributed to their unique bond mechanism compared to other halogen hydrides.
Within the features (Re−1, giso1, giso2, p1, p2, ḡ), it is interesting that the average of the groups helps in learning ωe. In particular, with ḡ, the MAE of the model is reduced by around 20% compared with the predictions using (Re−1, giso1, giso2, p1, p2) as the input features, as summarized in Table 1. Analogously, the standard deviation of the predictions over the MC training/test splittings becomes much smaller, suggesting that the model is more robust for the different kinds of molecules within the dataset. Indeed, by introducing ḡ, the most significant improvement occurs in the description of bi-alkali molecules, where the MAE is reduced by a factor of 3. The errors in predicting HF and DF are also reduced by a factor of 2, although they remain tricky cases for the model. On the contrary, introducing the average of the periods does not improve the model, suggesting that ωe depends on the total number of valence electrons of the two atoms rather than on the number of electron shells.
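For concreteness, a hypothetical construction of the (Re−1, g1, g2, p1, p2, ḡ) feature vector might look like the sketch below; the paper's actual encoding of hydrogen isotopes (giso) is not reproduced here, and the function name is ours.

```python
import numpy as np

def omega_features(Re, g1, g2, p1, p2):
    """Hypothetical feature vector (Re^-1, g1, g2, p1, p2, g_bar)."""
    g_bar = 0.5 * (g1 + g2)            # average group of the two atoms
    return np.array([1.0 / Re, g1, g2, p1, p2, g_bar])

# HF: H is in group 1, period 1; F is in group 17, period 2; Re = 0.917 Angstrom
x_hf = omega_features(Re=0.917, g1=1, g2=17, p1=1, p2=2)
```

For HF the average group ḡ = 9 differs strongly from both g1 and g2, which is the kind of additive information the model exploits.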
Motivated by the pioneering work of Anderson, Parr, and coworkers,27,29,30,45 we study the prediction of D0 based on GP regression, and the results are shown in Fig. 7. In particular, in the figure's inset, we show the GP regression model prediction of D0 versus its true value, which shows a good performance with an RMSE of 0.357 ± 0.007 eV and an rE equal to 3.52 ± 0.07%, as shown in Table 1. In this case, the GP is fed with (Re, ḡ, p̄) as input features, and it shows a fast convergence with respect to the size of the training set, around N = 150, as shown in panel (c) of Fig. 8. The most significant outlier for D0 is NaK, which is a van der Waals molecule. D0 of NaK is overestimated, which may be attributed to the fact that NaK is the only bi-alkali molecule in the dataset for which D0 is available. There are also some outliers containing first-row elements and 3d transition metals.
A summary of our GP regression model performance for the different combinations of the ground state spectroscopic constants considered in this work is shown in Table 1, compared against the models proposed by Parr, Anderson et al.27,29,30,45 The GP regression model shows a superior performance with respect to the linear model (LR in the table) based on a particular functional form of the electron density within the molecule. Indeed, the GP performance is, in some cases, five times better than that of the linear model (in terms of the relative error). Therefore, the group and period (correlated with the number of valence electrons and the number of electron shells, respectively) of the constituent atoms within a molecule encapsulate more valuable information regarding spectroscopic constants than simple functional forms for the electron density within the framework of ref. 27, 29, 30 and 45. It is also interesting to notice that, when predicting Re and ωe, one needs the group and period of each atom in the molecule, whereas D0 can be well described with only the average group and period of the two atoms. Therefore, D0 is correlated with additive properties of the groups and periods rather than with differences between the two atoms caused by their different groups.
To further examine whether our ML approach is generalizable, we have selected 26 molecules outside the dataset, unseen by the ML algorithm, including CoO,56 CrC,57 InBr,58 IrSi,59 MgD,60 MoC,61 NbC,61 NiBr,62 NiC,63 NiO,64 NiS,65 PbI,66 PdC,61 RuC,61 RuF,67 ScBr,62 SnI,66 TiBr,62 UF,68 UO,69 WC,70 YC,61 ZnBr,62 ZrC,61 ZrCl,71 and ZrF.71 The MAE of the GP regression model predicting the ground state Re of this extra test set is 0.066 Å. The average relative error (defined as the absolute error of each molecule divided by its true Re) is 3.3%. Indeed, for CrC, InBr, MgD, ZnBr, and ZrCl the relative errors are <1%. Within this extra test set, experimental ground state ωe values are also available for 14 molecules: InBr, MoC, NbC, NiC, NiO, NiS, PbI, PdC, RuC, SnI, UO, WC, YC and ZnBr. The MAE of the GPR model predictions is 30 cm−1 (4%). For RuC and ZnBr, the relative errors are below 1%, and for NiS and MoC, below 2%. For MoC, NbC, PbI, SnI, YC and ZrC, the experimental binding energy has been reported, and the MAE of our GPR model in predicting D0 is 0.32 eV (7.6%). Therefore, our models perform fairly well on this extra test set.
| Property | Model | Feature | Test MAE | Test RMSE | Test rE (%) |
|---|---|---|---|---|---|
| Re (Å) | GPR | (Re(X), g1, g2, p1, p2) | 0.0783 ± 0.0018 | 0.107 ± 0.0026 | 5.81 ± 0.14 |
|  |  | (Re(X), g1, g2, p1, p2, D(IP, EA)) | 0.0691 ± 0.0062 | 0.098 ± 0.0097 | 5.32 ± 0.53 |
| ωe (cm−1) | GPR | (ωe(X), Re−1(X), Re−1(A), giso1, giso2, p1, p2, ḡ) | 71.8 ± 1.4 | 107.9 ± 4.4 | 11.3 ± 0.46 |
|  |  | (ωe(X), Re−1(X), Re−1(A), giso1, giso2, p1, p2) | 70.4 ± 0.9 | 105.1 ± 1.5 | 11.0 ± 0.15 |
|  |  | (ωe(X), Re−1(X), Re−1(A), g1, g2, p1, p2) | 70.6 ± 0.9 | 105.1 ± 1.1 | 11.0 ± 0.12 |
For learning ωe of the A excited electronic state, in addition to the ground state Re−1(X), it is also necessary to include the A state Re−1(A). Furthermore, it is better to include the ground state ωe(X) as an input feature. The results are shown in Fig. 10, in which (ωe(X), Re−1(X), Re−1(A), g1, g2, p1, p2) leads to an RMSE of 105.1 ± 1.1 cm−1 and rE = 11.0 ± 0.12%. We also find that including the average of the groups ḡ or the isotope information does not further improve the model performance. This is expected, since this information has already been encoded in the ground state ωe.
The performance of our models predicting the A excited electronic state Re and ωe is summarized in Table 2. Compared to the ground state predictions, the errors in predicting the A excited electronic state spectroscopic constants are around two times larger, reflecting the difficulty of predicting excited state properties. However, we notice that ωe is correlated with the inverse of Re(A), as for ground state molecules. Our findings corroborate the relationship between Re and ωe hypothesized in the early times of molecular spectroscopy, as introduced in Section 2.
Despite the present GP models' outstanding performance, machine learning methods may be considered mere fitting techniques or black-box algorithms from which one can hardly learn anything new. This view is not accurate. As an example, we emphasize here what we have learned from the present machine learning approach:
• It is generally assumed that some molecular properties can be predicted based on the positions of the constituent atoms in the periodic table.72 However, such predictions are usually qualitative rather than quantitative. For instance, it is possible to anticipate the nature of a molecule's bond, but not to accurately guess its dissociation energy. Thanks to ML, we now know that it is possible to predict reasonably accurate spectroscopic constants using the constituent atoms' group and period.
• We have learned that ωe and Re depend strongly on the number of valence electrons and electron shells of the atoms forming a molecule, whereas the average number of valence electrons also plays an important role in describing ωe. D0 depends on the average number of valence electrons and the average number of electron shells of the molecule.
• The capability of learning excited electronic state properties of diatomic molecules may open the possibility of predicting Franck–Condon factors for transitions relevant to the direct cooling of molecules.47,73–76
Finally, we would like to emphasize that there are around 7000 heteronuclear diatomic molecules, and we only utilize 256 of these for our GP regression model. The limited availability of spectroscopic data (only around 3% of the possible heteronuclear diatomic molecules) shows the vast amount of spectroscopy that can still be done within the realm of diatomic molecules. The more data we have, the more accurate the GP regression model predictions will be, until the learning curve converges, and the more knowledgeable the community will be about the fundamental properties of diatomic molecules. From our perspective, the present work may motivate data science-driven studies in the field of spectroscopy of diatomic molecules. In particular, it will help to evolve the field of spectroscopy towards the current information era and to achieve a better understanding of spectroscopic properties. Furthermore, our results may also bring some insight into the development of features and geometry representations in materials science.
(14) |
In learning ωe, we use the Matérn class of covariance functions40
$$k(\mathbf{x}_i, \mathbf{x}_j) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\,r}{l}\right)^{\nu} K_{\nu}\!\left(\frac{\sqrt{2\nu}\,r}{l}\right), \qquad (15)$$
where $r = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert$, $l$ is the length scale, $\nu$ controls the smoothness, and $K_{\nu}$ is a modified Bessel function.
(16) |
The explicit basis functions in learning Re are linear, while in learning ωe and D0 the basis functions are set to be constant.
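As an illustration, the Matérn covariance of eqn (15) can be evaluated with scikit-learn; ν = 5/2 below is our choice for the sketch, not necessarily the value used in the paper.

```python
import numpy as np
from sklearn.gaussian_process.kernels import Matern

# Matern covariance between two points one length-scale apart;
# nu controls the smoothness of the functions the GP can represent.
k = Matern(length_scale=1.0, nu=2.5)
val = k(np.array([[0.0]]), np.array([[1.0]]))[0, 0]
```

For ν = 5/2 the kernel has the closed form (1 + √5 r/l + 5r²/3l²) exp(−√5 r/l), which interpolates between the rough exponential kernel (ν = 1/2) and the infinitely smooth RBF kernel (ν → ∞).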
This journal is © The Royal Society of Chemistry 2021 |