This Open Access Article is licensed under a Creative Commons Attribution 3.0 Unported Licence.

On the relationship between spectroscopic constants of diatomic molecules: a machine learning approach

Xiangyue Liu, Gerard Meijer and Jesús Pérez-Ríos*
Fritz-Haber-Institut der Max-Planck-Gesellschaft, Faradayweg 4-6, 14195 Berlin, Germany. E-mail: jperezri@fhi-berlin.mpg.de

Received 15th March 2021, Accepted 1st April 2021

First published on 19th April 2021


Abstract

Through a machine learning approach, we show that the equilibrium distance, harmonic vibrational frequency and binding energy of diatomic molecules are related, independently of the nature of the molecular bond; they depend solely on the group and period of the constituent atoms. As a result, using only the group and period of the atoms that form a molecule, the spectroscopic constants are predicted with an error of <5%, whereas for the A-excited electronic state other atomic properties must be included, leading to an error of <11%.


1 Introduction

Early in the history of molecular spectroscopy, when it became a discipline within chemical physics in the 1920's,1 some intriguing empirical relationships between different spectroscopic properties were observed.2–4 In particular, it was found that the equilibrium distance, Re, and the harmonic vibrational frequency, ωe, were correlated in diatomic molecules. As the field evolved, the relationship between Re and ωe became more evident, and more empirical relations between spectroscopic constants were identified.5–12 However, these empirical relationships were typically only valid for specific atomic numbers or groups of the constituent atoms. These results motivated the development of realistic diatomic molecular potentials4,13–17 and triggered the physical chemistry community to think about the “periodicity” of diatomic molecules.18

The development of quantum chemistry helped to shed some light on the physics behind empirical relationships between spectroscopic constants. In particular, thanks to the application of the Hellmann–Feynman theorem, it was possible to connect ωe directly with the electronic density at Re.19–22 As a result, a first principles-based explanation (containing a few free parameters), of the observed empirical relations between spectroscopic constants appeared.23–30 Nevertheless, the obtained relations based on the electronic density were only valid for subsets of molecules. To date, it has not been possible to find general relations for spectroscopic constants of diatomic molecules in terms of the properties of their constituent atoms.

The accuracy of quantum chemistry methods relies on (finite) basis sets optimized for each element under certain bounds. At the same time, an accurate description of the system's electronic structure is required, which is achieved through a hierarchy of different treatments of the electron correlation.31,32 On the other hand, the widely-used (Kohn–Sham) density functional theory (DFT) methods require accurate electron exchange–correlation density functionals. The non-empirical density functionals are derived under certain constraints, some with several free parameters,33–36 while the semi-empirical density functionals employ more flexible functional forms with (sometimes even several tens of) coefficients fitted to various experimental or theoretical reference properties.33,37 Machine learning (ML) methods, on the other hand, discover the underlying relationships from data (the so-called “training set”) and build up models on top of them. These models can be quantitatively predictive for other systems that follow similar underlying physics. More importantly, they provide possibilities for discovering relationships between the different properties of the system under consideration.38,39

This work shows that the relationship between spectroscopic constants of heteronuclear diatomic molecules is general for most kinds of molecules at hand. Our findings rely upon applying state-of-the-art ML models to an orthodox dataset of experimental spectroscopic constants for diatomic molecules. In particular, we apply the Gaussian process (GP) regression model40 to predict Re, ωe, and the binding energy, D0, as a function of the group and period of the constituent atoms. Similarly, our model can predict Re and ωe for the A-excited electronic state of a given molecule. Our findings generalize the idea that some of the system's chemical properties depend on the atoms' group and period. Indeed, the periodicity of elements has long been used to predict chemical compounds' properties intuitively at a qualitative level. However, the correlations between the chemical properties and the constituent atoms' periodicity are not always straightforward, and such predictions can hardly be quantitative in most cases. In contrast, our main result is quantitatively meaningful: it is possible to predict those spectroscopic constants with an error of <5% for ground electronic states and <11% for the A-excited electronic state. More interestingly, by analyzing our models' outliers, we show that molecules whose bond is not of a chemical nature, such as bi-alkali molecules, and molecules containing first-row elements, such as HF, are more difficult to predict. However, the spectroscopic constants of molecules containing transition metals, which are challenging for quantum chemistry methods, can be adequately described.

2 The quest of relationships between spectroscopic constants of diatomic molecules

As soon as molecular spectroscopy became an essential tool to analyze molecules' unique fingerprints and more spectra of molecules were taken, approximate relationships were found between spectroscopic constants. As a result, it was postulated on empirical grounds that the molecules' spectroscopic constants might be correlated. In particular, it was observed that the equilibrium distance and the harmonic vibrational frequency are related as $R_e^2\omega_e^2 m = \mathrm{const}$ in hydrogen halides,2,41–43 where m is the reduced mass of the molecule. This relationship was generalized as $R_e^i\omega_e^2 m = \mathrm{const}$, the precursor of the well-known Badger's rule,6 where i is a natural number. On the other hand, after studying the spectra of 16 molecules, including homonuclear molecules and molecular ions, Mecke and Birge found that the expression $R_e^2\omega_e = \mathrm{const}$ described the observed spectra better.3,44 Along the same lines, but using a given functional form for the interatomic interaction of a molecule, Morse proposed the relationship $R_e^3\omega_e = \mathrm{const}$.4 Finally, more involved relationships between the equilibrium distance and the harmonic vibrational frequency were proposed17 as $m R_e^6\omega_e^2 n^a$, where n stands for the number of valence electrons, and a is a rational number. The results for a variety of the proposed empirical rules are shown in Fig. 1, where it is seen that for a larger dataset, such as the present one, none of the empirical relationships holds.
Fig. 1 Distribution and box plots of $R_e^a\omega_e^b$ with different powers combined with the reduced mass m and number of valence electrons n.
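
A simple way to probe whether a rule of the form Re^a ωe^b m^c = const holds over a set of molecules is to look at the relative spread of that product. The following is a minimal sketch under stated assumptions: the function name `rule_spread` is hypothetical, and the input values are synthetic numbers constructed for illustration, not the data behind Fig. 1.

```python
import numpy as np

def rule_spread(r_e, omega_e, mass, a, b, c=0.0):
    """Coefficient of variation of R_e**a * omega_e**b * m**c over a set of molecules.

    A value close to 0 means the product is nearly constant, i.e. the rule holds.
    """
    r_e = np.asarray(r_e, dtype=float)
    omega_e = np.asarray(omega_e, dtype=float)
    mass = np.asarray(mass, dtype=float)
    product = r_e**a * omega_e**b * mass**c
    return product.std() / product.mean()

# Synthetic (hypothetical) molecules built to obey R_e^2 * omega_e = 1000 exactly.
r_e = np.array([1.0, 1.5, 2.0, 2.5])      # arbitrary length units
omega_e = 1000.0 / r_e**2                 # arbitrary frequency units
mass = np.array([1.0, 5.0, 10.0, 20.0])   # arbitrary mass units (unused when c = 0)

spread_mecke = rule_spread(r_e, omega_e, mass, a=2, b=1)  # R_e^2 * omega_e
spread_morse = rule_spread(r_e, omega_e, mass, a=3, b=1)  # R_e^3 * omega_e
print(spread_mecke, spread_morse)
```

On real data, a rule that holds would give a small spread for every class of molecule; a broad distribution, as in the box plots of Fig. 1, signals that the rule is not general.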

At the same time, more spectroscopic information on molecules became available, and more advanced and accurate quantum chemistry tools were developed. Therefore, it became possible to search for a first-principles explanation of the empirically observed relationships between spectroscopic constants. In that endeavor, Parr and coworkers took the lead by looking at the electron density within a molecule as the source of the relationship between spectroscopic constants. The model assumes that the electron densities that the two atoms create at each other's position are equal at the equilibrium distance, i.e., at the sum of the two atomic radii. In particular, the electron density of atom 1 at the position of atom 2, within a molecule, is given by27

 
$$\rho_1(2) = C Z_1 \exp(-\xi R_1), \tag{1}$$
where C is a fitting parameter and ξ represents the decay constant of the electron density. Within this model, one finds a relationship between the atomic numbers of the two atoms, Z1 and Z2, and the equilibrium internuclear distance Re of a diatomic molecule as27,29,30,45
 
$$Z_1 Z_2 = A \exp(\xi R_e), \tag{2}$$
where A is a free parameter. According to this relationship, Re depends linearly on log(Z1Z2) as
 
$$R_e = \xi^{-1}\log(Z_1 Z_2) - \xi^{-1}\log A. \tag{3}$$

However, the performance of this relationship has only been checked for molecules with atoms coming from the same group of the periodic table.29
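
The linear dependence of Re on log(Z1Z2), that is, Re = (1/ξ) log(Z1Z2) − (1/ξ) log A, can be checked with an ordinary least-squares fit. The sketch below is only illustrative: `fit_decay_constant` is a hypothetical helper, and the values of ξ and A used to generate the synthetic data are arbitrary assumptions, not fitted values from the literature.

```python
import numpy as np

def fit_decay_constant(z1, z2, r_e):
    """Least-squares fit of R_e = (1/xi) * log(Z1*Z2) - (1/xi) * log(A).

    Returns the decay constant xi and the prefactor A.
    """
    x = np.log(np.asarray(z1, dtype=float) * np.asarray(z2, dtype=float))
    slope, intercept = np.polyfit(x, np.asarray(r_e, dtype=float), 1)
    xi = 1.0 / slope                 # slope = 1/xi
    a = np.exp(-intercept * xi)      # intercept = -log(A)/xi
    return xi, a

# Synthetic check: generate R_e from the model with assumed xi and A, then recover them.
xi_true, a_true = 2.0, 0.5           # arbitrary illustrative values
z1 = np.array([1, 3, 11, 19, 37])
z2 = np.array([9, 17, 35, 53, 9])
r_e = (np.log(z1 * z2) - np.log(a_true)) / xi_true

xi_fit, a_fit = fit_decay_constant(z1, z2, r_e)
print(xi_fit, a_fit)
```

With experimental equilibrium distances in place of the synthetic ones, the residuals of this fit indicate how well the single-exponential density model describes a given family of molecules.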

Anderson, Parr and coworkers also suggested a relationship between ωe and Re29 as

 
$$m\omega_e^2 = 4\pi C Z_1 Z_2\, e^{-2R_e}, \tag{4}$$
based on the Born–Oppenheimer approximation, the electron density of eqn (1) and the Hellmann–Feynman theorem. From eqn (4) it is possible to express the harmonic vibrational frequency in terms of the equilibrium distance and atomic properties as
 
image file: d1ra02061g-t1.tif(5)

In the same vein, following the relationship between the equilibrium distance and the harmonic vibrational frequency, it is possible to find a relationship between the atomic number Zi, Re, and the dissociation energy De, as27,29,30,45

 
image file: d1ra02061g-t2.tif(6)
which can be rewritten as
 
image file: d1ra02061g-t3.tif(7)

For the derivation of eqn (6) it must be assumed that $D_e = A m \omega_e^2 R_e^l$ without any further justification.30 In eqn (7), l = 3 and ξ′ = 0.97. Eqn (7) has been tested on a dataset of 150 molecules with good results, although no further characterization of the model performance was reported to judge its quality objectively. Finally, using the relation between the dissociation energy, De, and the binding energy, D0,

 
$$D_e = D_0 + \frac{\omega_e}{2} - \frac{\omega_e x_e}{4}, \tag{8}$$
where ωexe represents the first anharmonic correction to the harmonic vibrational frequency, it should be possible to find a linear regression model for D0.
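
The link between De and D0 is the zero-point vibrational energy, which for an anharmonic oscillator truncated at first order is ωe/2 − ωexe/4. As a hedged illustration, the helper below converts De to D0 assuming this standard truncation; the function name `binding_energy` and the numerical inputs are hypothetical placeholders rather than values from the dataset.

```python
CM1_TO_EV = 1.239841984e-4  # energy of 1 cm^-1 expressed in eV

def binding_energy(d_e_ev, omega_e_cm1, omega_e_x_e_cm1=0.0):
    """Return D0 (eV) from De (eV) by subtracting the zero-point energy.

    The zero-point energy is taken as omega_e/2 - omega_e*x_e/4 (in cm^-1),
    i.e. the vibrational term value G(0) truncated after the first anharmonic term.
    """
    zero_point_cm1 = 0.5 * omega_e_cm1 - 0.25 * omega_e_x_e_cm1
    return d_e_ev - zero_point_cm1 * CM1_TO_EV

# Hypothetical molecule: De = 4.0 eV, omega_e = 2000 cm^-1, omega_e*x_e = 40 cm^-1.
d0 = binding_energy(4.0, 2000.0, 40.0)
print(d0)
```

For most molecules the anharmonic term shifts D0 by far less than the harmonic zero-point term, so omitting ωexe is often an acceptable first approximation.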

3 The dataset

In this work, we focus on heteronuclear molecules due to their relevance to laser cooling of molecules, with applications in ultracold chemistry.47–49 The employed dataset contains the main spectroscopic constants, Re, ωe, and D0, for the ground electronic state of heteronuclear diatomic molecules. In particular, it contains the experimental values of Re and ωe for 256 heteronuclear diatomic molecules taken from ref. 50–53, whereas experimentally determined values of D0 are available for only 197 of them.

As far as we know, this is the most extensive dataset for experimental ground state properties of heteronuclear diatomic molecules. Fig. 2 shows the equilibrium distance's distribution and its ratio to the sum of the atomic radii of the constituent atoms, R1 + R2, for molecules within the dataset. Most molecules show an equilibrium distance between 1.4 Å and 3.8 Å, with a most probable value of 1.7 Å. Looking at the values of Re/(R1 + R2), it is clear that the molecules within the dataset have different bonds: covalent, van der Waals, and ionic.


Fig. 2 Ratio of the equilibrium distance, Re, to the sum of the atomic radii of the atoms forming a molecule, R1 + R2, vs. Re. The background color indicates the nature of the molecular bond in each of the molecules. The density in the upper part of the figure shows the kernel density distribution of Re. The box plot shows the minimum, the maximum, the sample median, and the first and third quartiles of Re. The empirical atomic radii of the atoms are taken from ref. 46.

We have classified the dataset based on the types of constituent atoms within a molecule, and the results are shown in Fig. 3. As a result, we notice that the dataset mainly consists of various metal and non-metal halides, hydrides, and metalloid compounds. It is worth noticing that more than 20% of the dataset contains transition metal compounds, including f-block elements. Therefore, the present dataset is general since it goes beyond the main-group diatomic molecules and deals with some of the more intriguing and complex atoms from a chemistry standpoint.


Fig. 3 Molecules in the dataset classified by the types of their constituent atoms.

In addition to the dataset mentioned above of the ground state properties, we also study 131 molecules whose Re, ωe are available for the A-excited electronic state. The A-state dataset mainly consists of metal and non-metal compounds, including transition metal compounds and several f-block compounds.

4 Machine learning method

The quest for universal relationships between spectroscopic constants is related to the problem of how atomic and molecular properties describe a spectroscopic property of a molecule, y = f(x). Here, x = (x1, x2, …, xn), consists of different atomic properties of the constituent atoms or molecular properties, where n denotes the number of input features relevant for the problem at hand. Unlike traditional (non-)linear regression models, which assume a fixed form of function f(x), GP embraces a Bayesian perspective and presumes a prior distribution over the space of functions
 
$$f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), K(\mathbf{x}, \mathbf{x}')\right), \tag{9}$$
with a joint multivariate-Gaussian distribution, centered at m(xi) and characterized by the covariance function K(xi, xj), which specifies the correlation (or “similarity”) between data points.40

In this work, the spectroscopic properties y are modeled as

 
$$y_i = h(\mathbf{x}_i)^{T}\beta + f(\mathbf{x}_i) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma_y^2), \tag{10}$$
where the basis functions, h(xi), project {xi} to a new (higher dimensional) feature space with coefficients β, and σy accounts for the noise in the observations.40,54 The training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, with N observations, constrains the available distribution of functions through Bayes' theorem, and the mean of the posterior distribution is used for prediction. The functional forms of K(xi, xj) and h(x) can be selected according to the cross-validation performance of the models.
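
To make the GP prediction step concrete, the sketch below computes the posterior mean for a GP with a constant basis function and a squared-exponential covariance. It is not the implementation used in this work: the kernel choice, the fixed hyperparameters (`length`, `amplitude`, `noise`) and the function names are assumptions made for illustration, and the data are a synthetic one-dimensional example.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length=1.0, amplitude=1.0):
    """Squared-exponential covariance K(x_i, x_j) between two sets of points."""
    d2 = ((xa[:, None, :] - xb[None, :, :]) ** 2).sum(axis=-1)
    return amplitude**2 * np.exp(-0.5 * d2 / length**2)

def gp_posterior_mean(x_train, y_train, x_new, length=1.0, amplitude=1.0, noise=1e-3):
    """Posterior mean of a GP with constant basis h(x) = 1 and Gaussian noise.

    The constant coefficient beta is estimated by generalized least squares;
    the remaining signal is modeled by the zero-mean GP.
    """
    k = sq_exp_kernel(x_train, x_train, length, amplitude)
    k = k + noise**2 * np.eye(len(x_train))          # add observation noise
    ones = np.ones(len(x_train))
    beta = (ones @ np.linalg.solve(k, y_train)) / (ones @ np.linalg.solve(k, ones))
    alpha = np.linalg.solve(k, y_train - beta)       # K^-1 (y - h beta)
    k_star = sq_exp_kernel(x_new, x_train, length, amplitude)
    return beta + k_star @ alpha

# Synthetic one-dimensional example: learn sin(x) from 20 points.
x_train = np.linspace(0.0, 5.0, 20)[:, None]
y_train = np.sin(x_train[:, 0])
prediction = gp_posterior_mean(x_train, y_train, np.array([[2.5]]))
print(prediction[0], np.sin(2.5))
```

In practice the hyperparameters would be optimized, for instance by cross-validation as described in the text, rather than fixed by hand as in this sketch.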

4.1 Model performance evaluation

In training and evaluating the regression models, as customary in ML, the ground state dataset is divided into training and test sets. The training set represents the set of molecules used for learning a given spectroscopic constant from the atomic properties of the constituent atoms. The test set is the set of molecules that have not been included in the learning procedure and hence are new to the regression algorithm. In learning the equilibrium internuclear distance, Re, and the harmonic vibrational frequency, ωe, the training and test sets consist of 231 and 25 molecules, respectively. In learning D0, the training/test splitting is 172/25. For learning Re and ωe for the A-excited electronic state, the training set consists of 106 molecules and the test set consists of 25 molecules.

The present dataset is relatively small from an ML perspective. When the dataset is split into training and test sets, the training set may not be representative. This may lead to a bias in the performance of the test set. To mitigate this problem, we have employed a Monte Carlo (MC) approach, in which the dataset is stratified into 25 strata based on the level of the true values of the labels (Re, ωe, and D0 in the present work).

As shown in panel (a) of Fig. 4, the training and evaluation of the models involve two loops. In the outer loop, we split the dataset into a training set and a test set. The training set is used to learn from the data and the test set is used for model evaluation. In the inner loop, we train the models with the training set, which is further split to perform a stratified 5-fold cross validation (CV) for the hyperparameter optimization. In particular, as shown in panel (b) of Fig. 4, in the outer loop, the training/test splittings are done by a Monte Carlo (MC) approach. Specifically, we randomly select 25 test molecules from the dataset, which is stratified into 25 strata based on the levels of the true values of the labels. The stratification helps to minimize the change of the proportions of the dataset compositions upon splitting.55 In each MC step, a regression model is trained and yields predictions for the training set and the test set. Therefore, in this work we report the mean and standard deviation of the predictions for each molecule when it is used in the training and test sets over all the MC steps. In total, we evaluate our models with 1000 MC steps for the training/test splittings for the model performance evaluation, and 500 MC steps for generating the learning curves.


Fig. 4 Scheme of the training/test set splitting in the model evaluation. (a) There are two loops: the outer loop for the model performance evaluation, and the inner loop for the training of model and hyperparameter optimization. (b) In the outer loop, the data are stratified based on the true values of the labels, and each stratum is randomly split into training and test sets. In learning the properties, the training sets are further split into training and validation sets to perform a stratified 5-fold cross-validation.
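
The outer loop of this scheme can be sketched as follows. This is an illustration under assumptions, not the code used in this work: it assumes that one test molecule is drawn from each of the 25 label-ordered strata, the function name `stratified_mc_split` is hypothetical, and the labels are random numbers standing in for measured values.

```python
import numpy as np

def stratified_mc_split(labels, n_strata=25, rng=None):
    """Draw one test index at random from each of n_strata label-ordered strata."""
    rng = np.random.default_rng(rng)
    order = np.argsort(labels)                  # molecule indices sorted by label value
    strata = np.array_split(order, n_strata)    # nearly equal-sized strata
    test_idx = np.array([rng.choice(stratum) for stratum in strata])
    train_idx = np.setdiff1d(order, test_idx)   # everything not selected for testing
    return train_idx, test_idx

# Synthetic stand-in for 256 label values (e.g. equilibrium distances).
labels = np.random.default_rng(0).uniform(1.0, 4.0, size=256)

# A few MC steps; each step would train a model on train_idx and score it on test_idx.
splits = [stratified_mc_split(labels, n_strata=25, rng=step) for step in range(5)]
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))
```

Because every stratum contributes to the test set, each split covers the full range of label values, which is the property the stratification is meant to preserve.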

The performance of the models is evaluated by three different estimators. The first estimator is the mean absolute error (MAE) defined as

 
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - y_i^{\mathrm{true}}\right|, \tag{11}$$
where $y_i^{\mathrm{true}}$ are the true values, yi are the predictions, and N is the number of observations. The second estimator is the root mean square error (RMSE), which is given by

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - y_i^{\mathrm{true}}\right)^2}. \tag{12}$$

The last estimator is the normalized error rE, defined as the ratio of the RMSE to the range of y,

$$r_E = \frac{\mathrm{RMSE}}{y_{\max} - y_{\min}}. \tag{13}$$
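
The three estimators are straightforward to compute. The sketch below is a minimal version that assumes the range in rE is taken over the true values; the function names and the toy numbers are illustrative only.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between predictions and true values."""
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

def rmse(y_true, y_pred):
    """Root mean square error between predictions and true values."""
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

def normalized_error(y_true, y_pred):
    """RMSE divided by the range of the true values (assumed definition of the range)."""
    y_true = np.asarray(y_true)
    return rmse(y_true, y_pred) / float(y_true.max() - y_true.min())

# Toy numbers for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 5.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])
print(mae(y_true, y_pred), rmse(y_true, y_pred), normalized_error(y_true, y_pred))
```

Reporting rE alongside the MAE and RMSE makes errors comparable across properties with different units, such as Å and cm−1.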

4.2 The learning curves

The learning curves show the training and test performance of a model as a function of the training set size N. From the learning curves it is possible to infer the performance of a model by looking at its bias and variance. Similarly, it is possible to understand if the model performance improves with the training set size. For each of the points in the learning curve, the training is performed with 500 different training/test splittings by the MC approach.

5 Results and discussion

5.1 Learning ground state spectroscopic constants

Fueled by the idea of periodicity of molecules (see, e.g., ref. 18 and references therein), we use the group, gk, and period, pk, of the atoms within a molecule, i.e., k = 1, 2, as input features for a GP regression model to predict different combinations of spectroscopic constants: Re, ωe and D0, as presented in Section 2. The training sets are permuted before being fed to the learning algorithm so that the GP regression models reproduce the permutational invariance of the relevant properties upon exchanging the two atoms in a molecule.
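
One way to read the permutation step is as a data augmentation: each training molecule is also presented with its two atoms exchanged, so that the model sees (g1, g2, p1, p2) and (g2, g1, p2, p1) with the same label. The sketch below illustrates this reading; it is an assumption about the procedure, the helper name `augment_with_swaps` is hypothetical, and the target values are placeholders.

```python
import numpy as np

def augment_with_swaps(features, targets):
    """Append the atom-exchanged copy of every molecule to the training data.

    `features` has columns (g1, g2, p1, p2); the swapped copy is (g2, g1, p2, p1)
    and keeps the same target value.
    """
    features = np.asarray(features)
    targets = np.asarray(targets)
    swapped = features[:, [1, 0, 3, 2]]
    return np.vstack([features, swapped]), np.concatenate([targets, targets])

# Two molecules described by group and period: LiF -> (1, 17, 2, 2), HCl -> (1, 17, 1, 3).
features = np.array([[1, 17, 2, 2],
                     [1, 17, 1, 3]])
targets = np.array([1.0, 2.0])  # placeholder labels, not measured values

x_aug, y_aug = augment_with_swaps(features, targets)
print(x_aug)
```

An alternative design would be to sort the two atoms into a canonical order before training, which enforces the same invariance without enlarging the training set.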

The GP regression model performance for the ground state Re as a function of the input features (g1, g2, p1, p2) is shown in Fig. 5, where the MAE associated with each distinct type of molecule is reported. Most of the molecules are well described by our GP model, as confirmed in the inset of Fig. 5. In particular, it shows little dispersion of the predicted values with respect to the true values except for a handful of molecules (transition metal–metal and bi-alkali molecules). To further quantify the GP regression model performance, we calculate the average RMSE of the predicted Re on 1000 randomly selected test sets, leading to 0.0968 ± 0.0070 Å (Table 1) and rE = 2.80 ± 0.20%. Our results confirm that the model performance improves as the number of molecules in the training set, N, grows, as shown in the learning curve in panel (a) of Fig. 8. Indeed, it is not yet converged for N = 231, suggesting that the GP regression model can be further improved by including more data in the training set.


Fig. 5 GP regression performance on predicting Re using (g1, g2, p1, p2) as input features classified by the types of the constituent atoms. In particular, the MAE of the test set is reported. The inset shows the test set predictions of Re versus the true values. The values shown are the average of predictions from 1000 MC sampled training/test splittings. The GP regression model gives predictions of the test and training sets. Shown are the mean and standard deviation of each molecule's predictions when used as training data (green symbols) and test data (orange symbols).
Table 1 Regression model predictions of Re, ωe, and D0. gi and pi represent the group and period of the i-th atom, respectively. gisoi stands for the group encoding the information of isotopes of hydrogen, and ḡ and p̄ are the averages of the groups and periods of the two atoms, respectively

| Property | Model | Feature | Test MAE | Test RMSE | Test rE (%) |
|---|---|---|---|---|---|
| Re (Å) | GPR | (g1, g2, p1, p2) | 0.0662 ± 0.0037 | 0.0968 ± 0.0070 | 2.80 ± 0.20 |
| | LR | log(Z1Z2) | 0.2605 ± 0.0018 | 0.3591 ± 0.0006 | 10.41 ± 0.01 |
| ωe (cm−1) | GPR | (Re−1, g1, g2, p1, p2) | 126.7 ± 2.1 | 207.2 ± 2.6 | 5.07 ± 0.06 |
| | | (R̃e−1, g1, g2, p1, p2) a | 152.5 ± 3.6 | 227.5 ± 4.6 | 5.56 ± 0.11 |
| | | (Re−1, giso1, giso2, p1, p2) | 61.5 ± 2.9 | 142.8 ± 7.0 | 3.49 ± 0.17 |
| | | (R̃e−1, giso1, giso2, p1, p2) | 96.9 ± 2.9 | 176.0 ± 13.1 | 4.30 ± 0.32 |
| | | (Re−1, giso1, giso2, p1, p2, p̄) | 67.5 ± 1.0 | 151.8 ± 9.5 | 3.71 ± 0.2 |
| | | (R̃e−1, giso1, giso2, p1, p2, p̄) | 101.8 ± 5.4 | 188.7 ± 25.4 | 4.61 ± 0.62 |
| | | (Re−1, giso1, giso2, p1, p2, ḡ) | 46.7 ± 0.6 | 73.4 ± 0.2 | 1.80 ± 0.005 |
| | | (R̃e−1, giso1, giso2, p1, p2, ḡ) | 81.0 ± 0.82 | 121.8 ± 0.8 | 2.98 ± 0.02 |
| | LR | image file: d1ra02061g-t38.tif | 376.5 ± 6.6 | 529.4 ± 1.2 | 12.95 ± 0.03 |
| | | Re−2 | 209.6 ± 5.4 | 297.3 ± 1.4 | 7.27 ± 0.03 |
| D0 | GPR | (Re, ḡ, p̄) | 0.249 ± 0.008 | 0.357 ± 0.007 | 3.52 ± 0.07 |
| | | (R̃e, ḡ, p̄) | 0.270 ± 0.006 | 0.451 ± 0.007 | 4.45 ± 0.07 |
| | LR | Re | 0.833 ± 0.004 | 1.018 ± 0.014 | 10.03 ± 0.14 |

a R̃e is the value of Re predicted from (g1, g2, p1, p2).


In learning ωe, we find (R̃e−1, giso1, giso2, p1, p2, ḡ) to be the best combination of features, where R̃e is the equilibrium distance predicted from (g1, g2, p1, p2), gisok encodes the information about the hydrogen isotopes of the k-th atom in the molecule, and ḡ is the average of the groups of the two atoms. However, a much better performance is found when the true Re value is employed. The GP model's performance is shown in the inset of Fig. 6, where it is seen that the predicted values agree very well with the true values. Indeed, the test set MAE and RMSE are 46.7 ± 0.6 cm−1 and 73.4 ± 0.2 cm−1, respectively, while rE = 1.80 ± 0.005%, as shown in Table 1. Despite the outstanding performance of our GPR model, some molecules are still not well described, as shown in Fig. 6. These outliers include HF, DF, and HgH. The large errors in predicting ωe of HF and DF can be attributed to their unique bonding mechanism compared to other hydrogen halides.


Fig. 6 GP regression performance based on the MAE predicting ωe for molecules in the test set using (Re−1, giso1, giso2, p1, p2, ḡ) as input features classified by the types of the constituent atoms. The inset shows the test set predictions of ωe compared with the true values. The values shown are the average of predictions from 1000 MC sampled training/test splittings. The GP regression model as learned from the training set gives predictions of the test and training set. Shown are the mean and standard deviation of each molecule's predictions when used as training data (green symbols) and test data (orange symbols).

Within the features (Re−1, giso1, giso2, p1, p2, ḡ), it is interesting that the average of groups, ḡ = (g1 + g2)/2, helps in learning ωe. In particular, with ḡ, the MAE of the model is reduced by around 20% compared with the predictions using (Re−1, giso1, giso2, p1, p2) as the input feature, as summarized in Table 1. Analogously, the standard deviation of the MC training/test splitting predictions becomes much smaller, suggesting that the model is more robust for different kinds of molecules within the dataset. Indeed, by introducing ḡ, the most significant improvement happens in the description of bi-alkali molecules, where the MAE can be reduced by a factor of 3. The errors in predicting HF and DF can also be reduced by a factor of 2, although they are still tricky cases for the model. On the contrary, introducing the average of periods, p̄ = (p1 + p2)/2, does not help improve the model, suggesting that ωe depends on the total number of valence electrons of the two atoms rather than on the number of electron shells.
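
As an illustration of how such a feature vector can be assembled, the sketch below builds (Re−1, g1, g2, p1, p2, ḡ) for one molecule. The isotope-aware group encoding giso is not specified in the text, so the plain group numbers are used here as a stand-in; the function name `omega_features` and the value of Re are hypothetical.

```python
import numpy as np

def omega_features(r_e, g1, g2, p1, p2):
    """Feature vector (1/R_e, g1, g2, p1, p2, g_bar) for predicting omega_e.

    g_bar is the average of the two group numbers. Plain group numbers are used
    in place of the isotope-aware encoding, which is not specified in the text.
    """
    g_bar = 0.5 * (g1 + g2)
    return np.array([1.0 / r_e, g1, g2, p1, p2, g_bar])

# NaCl-like example: Na is group 1, period 3; Cl is group 17, period 3.
# The equilibrium distance of 2.0 (in Å) is a round placeholder, not a measured value.
x = omega_features(2.0, 1, 17, 3, 3)
print(x)
```

The average ḡ carries no information beyond g1 and g2, so any gain from including it reflects how the kernel measures similarity between molecules rather than new physical input.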

Motivated by the pioneering work of Anderson, Parr, and coworkers,27,29,30,45 we study the prediction of D0 based on GP regression and the results are shown in Fig. 7. In particular, in the figure's inset, we show the GP regression model prediction of D0 versus its true value, which shows a good performance with an RMSE = 0.357 ± 0.007 and an rE equal to 3.52 ± 0.07%, as shown in Table 1. In this case, the GP is fed with (Re, ḡ, p̄) as input features and it shows a fast convergence with respect to the size of the training set, around N = 150, as shown in panel (c) of Fig. 8. The most significant outlier for D0 is NaK, which is a van der Waals molecule. D0 of NaK is overestimated, which may be attributed to the fact that NaK is the only bi-alkali molecule in the dataset with a D0 value. There are also some outliers containing first-row elements and 3d transition metals.


Fig. 7 GP regression performance on predicting D0 using (Re, ḡ, p̄) as input features classified by the types of the constituent atoms. In particular, the MAE of the test set is reported. The inset shows the test set predictions of D0 compared with the true values. The values shown are the average of predictions from 1000 MC sampled training/test splittings. The GP regression model gives predictions of the test and training set. Shown are the mean and standard deviation of each molecule's predictions when used as training data (green symbols) and test data (orange symbols).

A summary of our GP regression model performance for the different combinations of the ground state spectroscopic constants considered in this work is shown in Table 1, compared against the proposed models of Parr, Anderson et al.27,29,30,45 The GP regression model shows a superior performance against the linear model (LR in the table) based on a particular functional form of the electron density within the molecule. Indeed, the GP performance is, in some cases, five times better than the linear model (in terms of the relative error). Therefore, the group and period (correlated to the number of valence electrons and the number of electron shells, respectively) of the constituent atoms within a molecule encapsulate more valuable information regarding spectroscopic constants than simple functional forms for the electron density within the framework of ref. 27, 29, 30 and 45. Indeed, it is interesting to notice that, when predicting Re and ωe, one needs the groups and periods of each atom in the molecule, whereas D0 can be well described with only the averages of the groups and periods of the two atoms. Therefore, D0 is correlated to the additive properties of the groups and periods rather than to the differences between the two atoms caused by their different groups.

To further examine whether our ML approach is generalizable, we have selected 26 molecules outside the dataset, and hence unseen by the ML algorithm, including CoO,56 CrC,57 InBr,58 IrSi,59 MgD,60 MoC,61 NbC,61 NiBr,62 NiC,63 NiO,64 NiS,65 PbI,66 PdC,61 RuC,61 RuF,67 ScBr,62 SnI,66 TiBr,62 UF,68 UO,69 WC,70 YC,61 ZnBr,62 ZrC,61 ZrCl,71 ZrF.71 The MAE of the GP regression model predicting the ground state Re of the extra test set is 0.066 Å. The average relative error (defined as the absolute error of each molecule divided by its true Re) is 3.3%. Indeed, for CrC, InBr, MgD, ZnBr and ZrCl the relative errors are <1%. Within this extra test set, experimental ground state ωe values are also available for 14 molecules: InBr, MoC, NbC, NiC, NiO, NiS, PbI, PdC, RuC, SnI, UO, WC, YC and ZnBr. The MAE of the GPR model predictions is 30 cm−1 (4%). For RuC and ZnBr, the relative errors are below 1%, and for NiS and MoC, the relative errors are below 2%. For MoC, NbC, PbI, SnI, YC and ZrC, the experimental binding energy has been reported and the MAE of our GPR model predicting D0 is 0.32 eV (7.6%). Therefore, our models perform fairly well on this extra test set.

5.2 Learning the first excited state spectroscopic constants

To learn the equilibrium internuclear distance Re of the A excited electronic state for different molecules, we need to employ atomic features of the two constituent atoms, including g1, g2, p1, p2, D(IP, EA), and the ground state Re(X) when constructing the GP regression models. Interestingly, including D(IP, EA) improves the predictions (Table 2); it is defined as
image file: d1ra02061g-t28.tif
where IPi, EAi and χi are the ionization potential, electron affinity and electronegativity of atom i, respectively. Therefore, D(IP, EA) qualitatively measures the electron transfer between the two constituent atoms. The resultant test set MAE, RMSE and rE are 0.0691 ± 0.0062 Å, 0.098 ± 0.0097 Å and 5.32 ± 0.53%, respectively. As shown in Fig. 9, similar to the results for the ground state Re, the transition metal–metal compounds are the most difficult ones to predict.
Table 2 Regression model predictions of the A excited electronic state Re and ωe. gi and pi are the group and period of the i-th atom, respectively, whereas gisoi stands for the group encoding the information of isotopes of hydrogen. ḡ and p̄ are the averages of the groups and periods of the two atoms, respectively. Re(X) and Re(A) refer to the ground state and A-state Re, respectively. ωe(X) refers to the ground state ωe

| Property | Model | Feature | Test MAE | Test RMSE | Test rE (%) |
|---|---|---|---|---|---|
| Re (Å) | GPR | (Re(X), g1, g2, p1, p2) | 0.0783 ± 0.0018 | 0.107 ± 0.0026 | 5.81 ± 0.14 |
| | | (Re(X), g1, g2, p1, p2, D(IP, EA)) | 0.0691 ± 0.0062 | 0.098 ± 0.0097 | 5.32 ± 0.53 |
| ωe (cm−1) | GPR | (ωe(X), Re−1(X), Re−1(A), giso1, giso2, p1, p2, ḡ) | 71.8 ± 1.4 | 107.9 ± 4.4 | 11.3 ± 0.46 |
| | | (ωe(X), Re−1(X), Re−1(A), giso1, giso2, p1, p2) | 70.4 ± 0.9 | 105.1 ± 1.5 | 11.0 ± 0.15 |
| | | (ωe(X), Re−1(X), Re−1(A), g1, g2, p1, p2) | 70.6 ± 0.9 | 105.1 ± 1.1 | 11.0 ± 0.12 |



Fig. 8 Performance of the GP regression models as a function of the training set size N. (a) Learning curve of Re as a function of the size of the training set, predicted with the groups and periods of the two atoms, (g1, g2, p1, p2). (b) Learning curve of ωe as a function of the size of the training set, using the equilibrium internuclear distance Re, as well as the groups and periods and the average of groups of the two atoms (Re−1, giso1, giso2, p1, p2, ḡ) as the input feature. (c) Learning curve of D0 as a function of the size of the training set, using the equilibrium internuclear distance Re, as well as the averages of groups and periods of the two atoms (Re, ḡ, p̄) as the input feature. The shade around the points denotes the variance of the errors over the MC sampling.

Fig. 9 The test set MAE predicting the A excited electronic state Re by GP regression, using (g1, g2, p1, p2, Re(X), D(IP, EA)) as input features, classified by the types of the constituent atoms. The inset shows the test set predictions of the A-excited electronic state Re compared with the true values. The values shown are the average of predictions from 1000 MC sampled training/test splittings. The GP regression model as learned from the training set gives predictions of the test and training set. Shown are the mean and standard deviation of each molecule's predictions when used as training data (green symbols) and test data (orange symbols).

To learn ωe of the A excited electronic state, it is necessary to include the A state Re−1(A) in addition to the ground state Re−1(X). Including the ground state ωe(X) as an input feature improves the model further. The results are shown in Fig. 10, in which the features (ωe(X), Re−1(X), Re−1(A), g1, g2, p1, p2) lead to an RMSE of 105.1 ± 1.1 cm−1 and rE = 11.0 ± 0.12%. We also find that including the average of the groups or the isotope information does not further improve the model performance. This is expected, since this information has already been encoded in the ground state ωe.
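The feature vector described above for the A-state ωe model can be assembled as in the following minimal Python sketch. The class name, field names and function name are hypothetical, the groups are assumed to be given as integers, and the numerical values are placeholders chosen only to exercise the code rather than measured spectroscopic constants.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AStateRecord:
    """Hypothetical container for one molecule; field names are illustrative."""
    omega_e_X: float  # ground-state harmonic frequency, cm^-1
    R_e_X: float      # ground-state equilibrium distance, angstrom
    R_e_A: float      # A-state equilibrium distance, angstrom
    g1: int           # group of atom 1
    g2: int           # group of atom 2
    p1: int           # period of atom 1
    p2: int           # period of atom 2


def omega_e_A_features(rec: AStateRecord) -> Tuple[float, ...]:
    """Return (omega_e(X), 1/R_e(X), 1/R_e(A), g1, g2, p1, p2)."""
    if rec.R_e_X <= 0 or rec.R_e_A <= 0:
        raise ValueError("equilibrium distances must be positive")
    return (rec.omega_e_X, 1.0 / rec.R_e_X, 1.0 / rec.R_e_A,
            float(rec.g1), float(rec.g2), float(rec.p1), float(rec.p2))


# Placeholder numbers, NOT measured data.
example = AStateRecord(omega_e_X=1000.0, R_e_X=2.0, R_e_A=2.5,
                       g1=1, g2=17, p1=3, p2=3)
features = omega_e_A_features(example)
```

Keeping the inverse distances as explicit features, rather than the distances themselves, mirrors the feature set quoted in the text; any rescaling of the features before training is left out of this sketch.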


Fig. 10 Test set MAE for predicting ωe of the A excited electronic state by GP regression, using (ωe(X), Re−1(X), Re−1(A), g1, g2, p1, p2) as input features, classified by the types of the constituent atoms. The inset compares the test set predictions of ωe for the A excited electronic state with the true values. The values shown are averages of the predictions over 1000 MC-sampled training/test splits. The GP regression model learned from the training set gives predictions for both the test and training sets. Shown are the mean and standard deviation of each molecule's predictions when it is used as training data (green symbols) and as test data (orange symbols).

The performance of our models in predicting Re and ωe of the A excited electronic state is summarized in Table 2. Compared with the ground state predictions, the errors for the A excited electronic state are around two times larger, which reflects the difficulty of predicting excited state properties. Nevertheless, we notice that ωe is correlated with the inverse of Re(A), as it is for ground state molecules. Our findings corroborate the relationship between Re and ωe that was hypothesized in the early days of molecular spectroscopy, as introduced in Section 2.
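The error statistics of Table 2 can be illustrated with a short Python sketch of Monte Carlo cross-validation. Several points here are assumptions rather than statements about the procedure actually used: rE is taken to be the RMSE divided by the range of the target values (the ratios RMSE/rE in Table 2, about 1.84 Å for Re and about 955 cm−1 for ωe, are at least consistent with a fixed normalisation), the test fraction of 10% is arbitrary, the quoted uncertainties are taken to be standard deviations over the splits, and a trivial mean predictor stands in for the GP regression model. The data are synthetic.

```python
import math
import random
import statistics


def split_metrics(y_true, y_pred, y_range):
    """MAE, RMSE and RMSE / y_range in % (assumed definition of rE)."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse, 100.0 * rmse / y_range


def mc_cross_validation(x, y, fit_predict, n_splits=1000,
                        test_fraction=0.1, seed=0):
    """Return (mean, stdev) of each metric over random training/test splits."""
    rng = random.Random(seed)
    y_range = max(y) - min(y)
    n_test = max(1, round(test_fraction * len(y)))
    results = []
    for _ in range(n_splits):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        y_pred = fit_predict([x[i] for i in train], [y[i] for i in train],
                             [x[i] for i in test])
        results.append(split_metrics([y[i] for i in test], y_pred, y_range))
    return [(statistics.mean(col), statistics.stdev(col))
            for col in zip(*results)]


def mean_predictor(x_train, y_train, x_test):
    """Stand-in model: predicts the training mean (a GP model would go here)."""
    mu = statistics.mean(y_train)
    return [mu for _ in x_test]


# Synthetic targets for illustration only.
xs = [float(i) for i in range(40)]
ys = [1.5 + 0.02 * xi for xi in xs]
(mae, mae_sd), (rmse, rmse_sd), (r_e, r_e_sd) = mc_cross_validation(
    xs, ys, mean_predictor, n_splits=200)
```

Any regression model that exposes the same `fit_predict(x_train, y_train, x_test)` signature can be substituted for `mean_predictor` without changing the rest of the sketch.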

6 Conclusions

In summary, using GP regression models, we have shown that the main spectroscopic constants of diatomic molecules are related. This result confirms the scenario that Kratzer and Mecke envisioned a century ago.2,3 The relationships are mostly independent of the nature of the chemical bond of the diatomic molecule. In particular, we have demonstrated that the groups and periods of the constituent atoms alone, used as input features, suffice to predict particular combinations of spectroscopic constants with an error rE < 5%. In other words, the spectroscopic constants of diatomic molecules can be efficiently learned from an appropriate dataset by GP regression models, and their values can be accurately predicted. Furthermore, we have shown that GP regression can efficiently learn spectroscopic relationships for excited electronic states of molecules with an error rE < 11%.

Despite the outstanding performance of the present GP models, machine learning methods may be regarded as mere fitting techniques, or as black-box algorithms from which one can hardly learn anything new. This view is not accurate. As an example, we emphasize here what we have learned from the present machine learning approach:

• It is generally assumed that some molecular properties can be predicted from the positions of the constituent atoms in the periodic table.72 However, such predictions are qualitative rather than quantitative. For instance, it is possible to anticipate the nature of a molecule's bond, but not to estimate its dissociation energy accurately. Thanks to ML, we now know that it is possible to predict reasonably accurate spectroscopic constants using only the groups and periods of the constituent atoms.

• We have learned that ωe and Re depend strongly on the number of valence electrons and the number of electron shells of the atoms forming a molecule, whereas the average number of valence electrons also plays an important role in describing ωe. image file: d1ra02061g-t29.tif depends on the average number of valence electrons and the average number of electron shells of the molecule.

• The capability of learning excited electronic state properties of diatomic molecules may open the possibility of predicting Franck–Condon factors for transitions of interest for the direct cooling of molecules.47,73–76

Finally, we would like to emphasize that there are around 7000 heteronuclear diatomic molecules, of which we utilize only 256 for our GP regression models. The limited availability of spectroscopic data (only around 3% of the possible heteronuclear diatomic molecules) shows the vast amount of spectroscopy that remains to be done within the realm of diatomic molecules. The more data we have, the more accurate the GP regression model predictions will be, until the learning curve converges, and the more knowledgeable the community will be about the fundamental properties of diatomic molecules. From our perspective, the present work may motivate data science-driven studies in the field of spectroscopy of diatomic molecules. In particular, it will help the field of spectroscopy to evolve towards the current information era and to achieve a better understanding of spectroscopic properties. Furthermore, our results may also bring some insight into the development of features and geometry representations in materials science.

Appendix: details about the GP regression models

The choice of covariance function defines the smoothness assumed for the data. In learning Re, the covariance function employed is the exponential kernel, defined as

$$k(x_i, x_j) = \sigma_f^{2} \exp\left(-\frac{r}{l}\right) \qquad (14)$$

where σf² is the signal variance, l is the characteristic length scale, and r is the Euclidean distance between xi and xj.
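As an illustration of the exponential kernel, the following minimal Python sketch builds a covariance matrix from a few feature vectors. It assumes the common form k = σf² exp(−r/l), with σf² the signal variance; the function names and hyperparameter values are hypothetical, and the (g1, g2, p1, p2) tuples are illustrative inputs that assume groups numbered 1 to 18, not entries taken from the dataset.

```python
import math


def euclidean(xi, xj):
    """Euclidean distance r between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))


def exponential_kernel(xi, xj, sigma_f=1.0, length=1.0):
    """k(xi, xj) = sigma_f^2 * exp(-r / length), assumed kernel form."""
    return sigma_f ** 2 * math.exp(-euclidean(xi, xj) / length)


def covariance_matrix(points, kernel, **hyper):
    """Evaluate the kernel for every pair of points."""
    return [[kernel(a, b, **hyper) for b in points] for a in points]


# Illustrative (g1, g2, p1, p2) feature vectors; hyperparameters are arbitrary.
points = [(1, 17, 3, 3), (1, 17, 4, 3), (2, 16, 3, 2)]
K = covariance_matrix(points, exponential_kernel, sigma_f=2.0, length=1.5)
```

The diagonal of `K` equals σf², and the off-diagonal elements decay with the distance between feature vectors at a rate set by the length scale.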

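In learning ωe, we use the Matérn class of covariance functions,40

$$k_{\mathrm{Matern}}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{l}\right)^{\nu} K_{\nu}\left(\frac{\sqrt{2\nu}\, r}{l}\right) \qquad (15)$$

with ν = 5/2. Here Kν is a modified Bessel function, l is the characteristic length scale, and r is the Euclidean distance between x and x′. The Matérn 5/2 kernel function is then

$$k_{\nu=5/2}(r) = \left(1 + \frac{\sqrt{5}\, r}{l} + \frac{5 r^{2}}{3 l^{2}}\right) \exp\left(-\frac{\sqrt{5}\, r}{l}\right) \qquad (16)$$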
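To make the role of the Matérn 5/2 kernel concrete, the sketch below evaluates its standard closed form, (1 + √5 r/l + 5r²/(3l²)) exp(−√5 r/l), and uses it in a deliberately simplified GP regression. The simplifications are assumptions of this sketch and not a description of the models used here: the inputs are one-dimensional, the prior mean is zero (no explicit basis functions), the hyperparameters are fixed by hand instead of being optimized, a small noise term is added for numerical stability, and the data are synthetic.

```python
import math


def matern52(r, length=1.0):
    """Matern covariance with nu = 5/2 as a function of the distance r >= 0."""
    s = math.sqrt(5.0) * r / length
    return (1.0 + s + s * s / 3.0) * math.exp(-s)


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward and back substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior_mean(x_train, y_train, x_new, length=1.0, noise=1e-6):
    """Zero-mean GP posterior mean: k(x*, X) (K + noise * I)^-1 y."""
    K = [[matern52(abs(a - b), length) + (noise if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    alpha = solve_cholesky(cholesky(K), y_train)
    return [sum(matern52(abs(x - xi), length) * ai
                for xi, ai in zip(x_train, alpha)) for x in x_new]


# Synthetic one-dimensional data for illustration only.
x_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [0.0, 0.8, 0.9, 0.1]
y_fit = gp_posterior_mean(x_obs, y_obs, x_obs)
y_mid = gp_posterior_mean(x_obs, y_obs, [1.5])[0]
```

With a negligible noise term the posterior mean passes through the training values, while predictions between training points are controlled by the length scale.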

In learning Re, the explicit basis functions are linear, whereas in learning ωe and image file: d1ra02061g-t33.tif the basis functions are set to be constant.

Conflicts of interest

There are no conflicts to declare.

Acknowledgements

We thank Dr Matthias Rupp for his comments and suggestions and Drs Daniel Thomas and Uwe Hergenhahn for carefully reading the manuscript.

Notes and references

  1. G. Herzberg, Annu. Rev. Phys. Chem., 1985, 36, 1.
  2. A. Kratzer, Z. Phys., 1920, 3, 289.
  3. R. Mecke, Z. Phys., 1925, 32, 823.
  4. P. M. Morse, Phys. Rev., 1929, 34, 57.
  5. C. D. Clark, London, Edinburgh Dublin Philos. Mag. J. Sci., 1934, 18, 459–470.
  6. R. M. Badger, J. Chem. Phys., 1933, 2, 128.
  7. C. D. Clark, London, Edinburgh Dublin Philos. Mag. J. Sci., 1935, 19, 476–485.
  8. C. D. Clark and J. L. Stoves, Nature, 1934, 133, 873.
  9. W. Gordy, J. Chem. Phys., 1946, 14, 305–320.
  10. K. M. Guggenheimer, Proc. Phys. Soc., 1946, 58, 456–468.
  11. C. H. D. Clark, Trans. Faraday Soc., 1941, 37, 299–302.
  12. C. H. D. Clark and K. R. Webb, Trans. Faraday Soc., 1941, 37, 293–298.
  13. J. W. Linnett, Trans. Faraday Soc., 1940, 36, 1123–1134.
  14. R. Newing, London, Edinburgh Dublin Philos. Mag. J. Sci., 1940, 29, 298–301.
  15. J. W. Linnett, Trans. Faraday Soc., 1942, 38, 1–9.
  16. Y. P. Varshni, Rev. Mod. Phys., 1957, 29, 664.
  17. Y. P. Varshni, J. Chem. Phys., 1958, 28, 1081.
  18. R. Hefferlin, Periodic systems and their relation to the systematic analysis of molecular data, The Edwin Mellen Press, Queenston, Canada, 1989.
  19. L. Salem, J. Chem. Phys., 1963, 38, 1227–1236.
  20. H. J. Kim and R. G. Parr, J. Chem. Phys., 1964, 41, 2892–2897.
  21. R. F. Borkman and R. G. Parr, J. Chem. Phys., 1968, 48, 1116–1126.
  22. W. T. King, J. Chem. Phys., 1968, 49, 2866–2867.
  23. R. F. Borkman, G. Simons and R. G. Parr, J. Chem. Phys., 1969, 50, 58–65.
  24. P. Politzer, J. Chem. Phys., 1970, 52, 2157–2158.
  25. A. B. Anderson, N. C. Handy and R. G. Parr, J. Chem. Phys., 1969, 50, 3634–3635.
  26. A. B. Anderson and R. G. Parr, J. Chem. Phys., 1970, 53, 3375–3376.
  27. A. B. Anderson and R. G. Parr, J. Chem. Phys., 1971, 55, 5490–5493.
  28. G. Simons and R. G. Parr, J. Chem. Phys., 1971, 55, 4197–4202.
  29. A. B. Anderson, J. Mol. Spectrosc., 1972, 44, 411–424.
  30. J. Gazquez and R. G. Parr, Chem. Phys. Lett., 1979, 66, 419–422.
  31. K. Raghavachari, G. W. Trucks, J. A. Pople and M. Head-Gordon, Chem. Phys. Lett., 1989, 157, 479–483.
  32. R. J. Bartlett, J. Watts, S. Kucharski and J. Noga, Chem. Phys. Lett., 1990, 165, 513–522.
  33. N. Mardirossian and M. Head-Gordon, Mol. Phys., 2017, 115, 2315–2372.
  34. J. P. Perdew, A. Ruzsinszky, J. Tao, V. N. Staroverov, G. E. Scuseria and G. I. Csonka, J. Chem. Phys., 2005, 123, 062201.
  35. J. P. Perdew, K. Burke and M. Ernzerhof, Phys. Rev. Lett., 1996, 77, 3865.
  36. J. Tao, J. P. Perdew, V. N. Staroverov and G. E. Scuseria, Phys. Rev. Lett., 2003, 91, 146401.
  37. Y. Zhao and D. G. Truhlar, Theor. Chem. Acc., 2008, 120, 215–241.
  38. X. Liu, G. Meijer and J. Pérez-Ríos, Phys. Chem. Chem. Phys., 2020, 22, 24191–24200.
  39. M. J. Willatt, F. Musil and M. Ceriotti, Phys. Chem. Chem. Phys., 2018, 20, 29661–29668.
  40. C. K. Williams and C. E. Rasmussen, Gaussian processes for machine learning, MIT Press, Cambridge, MA, 2006, vol. 2.
  41. S. Glasstone, Recent advances in physical chemistry, J. & A. Churchill, London, 2nd edn, 1933, p. 498.
  42. M. Davies, J. Chem. Phys., 1949, 17, 374–379.
  43. D. F. Heath, J. W. Linnett and P. J. Wheatley, Trans. Faraday Soc., 1950, 46, 137–146.
  44. R. T. Birge, Phys. Rev., 1925, 25, 240.
  45. A. Anderson and R. Parr, Chem. Phys. Lett., 1971, 10, 293–296.
  46. J. C. Slater, J. Chem. Phys., 1964, 41, 3199–3204.
  47. J. Pérez-Ríos, An Introduction to Cold and Ultracold Chemistry, Springer International Publishing, 2020.
  48. N. Balakrishnan, J. Chem. Phys., 2016, 145, 150901.
  49. L. D. Carr, D. DeMille, R. V. Krems and J. Ye, New J. Phys., 2009, 11, 055049.
  50. X. Liu, S. Truppe, G. Meijer and J. Pérez-Ríos, The Diatomic Molecular Spectroscopy Database, https://rios.mp.fhi.mpg.de/index.php, accessed February 1, 2020.
  51. K. P. Huber and G. Herzberg, Molecular Spectra and Molecular Structure, Springer-Verlag, Berlin, Germany, 1979.
  52. B. M. Smirnov, Reference Data on Atomic Physics and Atomic Processes, Springer-Verlag, Berlin, Germany, 2008.
  53. X. Liu, S. Truppe, G. Meijer and J. Pérez-Ríos, J. Cheminf., 2020, 12, 31.
  54. MATLAB, 9.7.0 (R2019b), The MathWorks Inc., Natick, Massachusetts, 2019.
  55. S. Raschka, 2018, arXiv preprint arXiv:1811.12808.
  56. S. McLamarrah, P. Sheridan and L. M. Ziurys, Chem. Phys. Lett., 2005, 414, 301–306.
  57. D. J. Brugh, M. D. Morse, A. Kalemos and A. Mavridis, J. Chem. Phys., 2010, 133, 034303.
  58. S. Mishra, R. K. Yadav, V. Singh and S. Rai, J. Phys. Chem. Ref. Data, 2004, 33, 453–470.
  59. M. A. Garcia, C. Vietz, F. Ruipérez, M. D. Morse and I. Infante, J. Chem. Phys., 2013, 138, 154306.
  60. T. C. Steimle, R. Zhang and H. Wang, J. Chem. Phys., 2014, 140, 224308.
  61. R. S. DaBell, R. G. Meyer and M. D. Morse, J. Chem. Phys., 2001, 114, 2938–2954.
  62. M. Burton and L. M. Ziurys, J. Chem. Phys., 2019, 150, 034303.
  63. D. J. Brugh and M. D. Morse, J. Chem. Phys., 2002, 117, 10703–10714.
  64. R. Ram and P. Bernath, J. Mol. Spectrosc., 1992, 155, 315–325.
  65. R. Ram, S. Yu, I. Gordon and P. Bernath, J. Mol. Spectrosc., 2009, 258, 20–25.
  66. C. J. Evans, L.-M. E. Needham, N. R. Walker, H. Köckert, D. P. Zaleski and S. L. Stephens, J. Chem. Phys., 2015, 143, 244309.
  67. T. C. Steimle, W. L. Virgo and T. Ma, J. Chem. Phys., 2006, 124, 024309.
  68. I. O. Antonov and M. C. Heaven, J. Phys. Chem. A, 2013, 117, 9684–9694.
  69. L. A. Kaledin, J. E. McCord and M. C. Heaven, J. Mol. Spectrosc., 1994, 164, 27–65.
  70. S. M. Sickafoose, A. W. Smith and M. D. Morse, J. Chem. Phys., 2002, 116, 993–1002.
  71. A. Martinez and M. D. Morse, J. Chem. Phys., 2011, 135, 024308.
  72. C. A. Coulson, The shape and structure of molecules, Clarendon Press, Oxford, 1973.
  73. M. V. Ivanov, F. H. Bangerter and A. I. Krylov, Phys. Chem. Chem. Phys., 2019, 21, 19447–19457.
  74. M. D. Di Rosa, Eur. Phys. J. D, 2004, 31, 395–402.
  75. B. L. Augenbraun, J. M. Doyle, T. Zelevinsky and I. Kozyryev, Phys. Rev. X, 2020, 10, 031022.
  76. S. Truppe, S. Marx, S. Kray, M. Doppelbauer, S. Hofsäss, H. C. Schewe, N. Walter, J. Pérez-Ríos, B. G. Sartakov and G. Meijer, Phys. Rev. A, 2019, 100, 052513.

This journal is © The Royal Society of Chemistry 2021