On the relationship between spectroscopic constants of diatomic molecules: a machine learning approach

Through a machine learning approach, we show that the equilibrium distance, harmonic vibrational frequency, and binding energy of diatomic molecules are related, independently of the nature of the bond of a molecule: they depend solely on the group and period of the constituent atoms. As a result, we show that by employing the group and period of the atoms that form a molecule, the spectroscopic constants are predicted with an accuracy of <5%, whereas for the A excited electronic state other atomic properties must be included, leading to an accuracy of <11%.


Introduction
Early in the history of molecular spectroscopy, when it became a discipline within chemical physics in the 1920s,1 some intriguing empirical relationships between different spectroscopic properties were observed.2-4 In particular, it was found that the equilibrium distance, R_e, and the harmonic vibrational frequency, ω_e, were correlated in diatomic molecules. As the field evolved, the relationship between R_e and ω_e became more evident, and more empirical relations between spectroscopic constants were identified.5-12 However, these empirical relationships were typically only valid for specific atomic numbers or groups of the constituent atoms. These results motivated the development of realistic diatomic molecular potentials4,13-17 and triggered the physical chemistry community to think about the "periodicity" of diatomic molecules.18 The development of quantum chemistry helped to shed some light on the physics behind empirical relationships between spectroscopic constants. In particular, thanks to the application of the Hellmann-Feynman theorem, it was possible to connect ω_e directly with the electronic density at R_e.19-22 As a result, a first-principles-based explanation (containing a few free parameters) of the observed empirical relations between spectroscopic constants appeared.23-30 Nevertheless, the obtained relations based on the electronic density were only valid for subsets of molecules. To date, it has not been possible to find general relations for spectroscopic constants of diatomic molecules in terms of the properties of their constituent atoms.
The accuracy of quantum chemistry methods relies on (finite) basis sets optimized for each element under certain bounds. At the same time, an accurate description of the system's electronic structure is required, which is achieved through a hierarchy of different treatments of the electron correlation.31,32 On the other hand, the widely used (Kohn-Sham) density functional theory (DFT) methods require accurate electron exchange-correlation density functionals. The non-empirical density functionals are derived under certain constraints, some with several free parameters,33-36 while the semi-empirical density functionals employ more flexible functional forms with (sometimes even several tens of) coefficients fitted to various experimental or theoretical reference properties.33,37 Machine learning (ML) methods, on the other hand, discover the underlying relationships from data (the so-called "training set") and build up models on top of them. These models can be quantitatively predictive for other systems that follow similar underlying physics. More importantly, they provide possibilities for discovering relationships between the different properties of the system under consideration.38,39 This work shows that the relationship between spectroscopic constants of heteronuclear diatomic molecules is general for most kinds of molecules at hand. Our findings rely upon applying state-of-the-art ML models to an orthodox dataset of experimental spectroscopic constants for diatomic molecules. In particular, we apply the Gaussian process (GP) regression model40 to predict R_e, ω_e, and the binding energy, D_0, as a function of the group and period of the constituent atoms. Similarly, our model can predict R_e and ω_e for the A excited electronic state of a given molecule. Our findings generalize the idea that some of the system's chemical properties depend on the atoms' group and period.
Indeed, the periodicity of elements has long been used to predict chemical compounds' properties intuitively at a qualitative level. However, the correlations between chemical properties and the constituent atoms' periodicity are not always straightforward, and such predictions can hardly be quantitative in most cases. On the contrary, our main result is quantitatively meaningful: it is possible to predict those spectroscopic constants with an accuracy of a few percent.


The quest for relationships between spectroscopic constants of diatomic molecules
As soon as molecular spectroscopy became an essential tool to analyze molecules' unique fingerprints and more spectra of molecules were taken, approximate relationships were found between spectroscopic constants. As a result, it was postulated that the molecules' spectroscopic constants might be correlated based on empirical grounds. In particular, it was observed that the equilibrium distance and the harmonic vibrational frequency are related as R_e² ω_e² μ = const in hydrogen halides,2,41-43 where μ is the reduced mass of the molecule. This relationship was generalized as R_e^i ω_e² μ = const, the precursor of the well-known Badger's rule,6 where i is a natural number. On the other hand, after studying the spectra of 16 molecules, including homonuclear molecules and molecular ions, Mecke and Birge found that the expression R_e² ω_e = const described the observed spectra better.3,44 In the same line, but using a given functional form for the interatomic interaction of a molecule, Morse proposed a relationship given as R_e³ ω_e = const.4 Finally, more involved relationships between the equilibrium distance and the harmonic vibrational frequency were proposed,17 of the form μ R_e⁶ ω_e² ∝ n^a, where n stands for the number of valence electrons and a is a rational number. The results for a variety of the proposed empirical rules are shown in Fig. 1, where it is noticed that for a larger dataset, such as the present one, none of the empirical relationships hold.
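The oldest of these rules is easy to check numerically. The sketch below evaluates the Kratzer-type product R_e² ω_e² μ for the four hydrogen halides using standard experimental constants (values taken from common spectroscopic compilations; they are illustrative inputs, not the paper's dataset):

```python
# Check the Kratzer-type rule Re^2 * we^2 * mu ~ const for hydrogen halides.
# Re in Angstrom, we in cm^-1 (experimental values from standard compilations);
# mu is the reduced mass in atomic mass units.
molecules = {
    #        Re (A)   we (cm^-1)  m1 (u)  m2 (u)
    "HF":  (0.9168, 4138.3, 1.008, 18.998),
    "HCl": (1.2746, 2990.9, 1.008, 34.969),
    "HBr": (1.4144, 2648.9, 1.008, 78.918),
    "HI":  (1.6092, 2309.0, 1.008, 126.904),
}

products = {}
for name, (re, we, m1, m2) in molecules.items():
    mu = m1 * m2 / (m1 + m2)        # reduced mass
    products[name] = re**2 * we**2 * mu

# Relative spread of the product across the family
spread = max(products.values()) / min(products.values()) - 1.0
print({k: round(v / 1e6, 2) for k, v in products.items()}, f"spread = {spread:.1%}")
```

Within this family the product varies by only a few percent, which is precisely why the rule was first noticed in hydrogen halides; Fig. 1 shows that it breaks down once a broader set of molecules is considered.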
At the same time, more spectroscopic information on molecules became available, and more advanced and accurate quantum chemistry tools were developed. Therefore, it was possible to search for a first-principles explanation of the empirically observed relationships between spectroscopic constants. In that endeavor, Parr and coworkers took the lead by looking at the electron density within a molecule as the source of the relationship between spectroscopic constants. The model assumes that the electron density mutually created by one atom in the other atom is equal at the equilibrium distance, i.e., at the sum of the two atomic radii. In particular, the electron density of atom 1 at the position of atom 2, within a molecule, is given by eqn (1),27 where C is a fitting parameter and x represents the decay constant of the electron density. Within this model, one finds a relationship (eqn (2)) between the atomic numbers of the two atoms, Z_1 and Z_2, and the equilibrium internuclear distance R_e of a diatomic molecule,27,29,30,45 where A is a free parameter. According to this relationship, R_e depends linearly on log(Z_1Z_2) (eqn (3)). However, the performance of this relationship has only been checked for molecules with atoms coming from the same group of the periodic table.29 Anderson, Parr and coworkers also suggested a relationship between ω_e and R_e (eqn (4)),29 based on the Born-Oppenheimer approximation, the electron density of eqn (1), and the Hellmann-Feynman theorem. From eqn (4) it is possible to express the harmonic vibrational frequency in terms of the equilibrium distance and atomic properties (eqn (5)). In the same vein, following the relationship between the equilibrium distance and the harmonic vibrational frequency, it is possible to find a relationship (eqn (6)) between the atomic numbers Z_i, R_e, and the dissociation energy D_e,27,29,30,45 which can be rewritten as eqn (7). For the derivation of eqn (6) it must be assumed that D_e = A μ ω_e² R_e^l, without any further justification.30 In eqn (7), l = 3 and x_0 = 0.97. Eqn (7) has been tested on a dataset of 150 molecules, leading to a good result, although no further characterization of the model performance was reported to objectively judge its quality. Finally, using the relation between the dissociation energy, D_e, and the binding energy, D_0, D_e = D_0 + ω_e/2 − ω_e x_e/4, where ω_e x_e represents the first anharmonic correction to the harmonic vibrational frequency, it should be possible to find a linear regression model for log[D_0 R_e^l/(Z_1Z_2)].
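The last point amounts to an ordinary least-squares fit. The sketch below recovers the slope of such a linear law from synthetic data; the data are generated from an assumed linear relation with slope −x_0 = −0.97 plus noise, purely to illustrate the fitting procedure, and are not the paper's dataset:

```python
import numpy as np

# Synthetic illustration: assume log[D0*Re^3/(Z1*Z2)] = b - x0*Re with x0 = 0.97,
# then recover the slope by ordinary least squares. The data are NOT the paper's
# dataset; they are drawn from the assumed law plus a little noise.
rng = np.random.default_rng(0)
r_e = rng.uniform(1.0, 4.0, size=200)                 # equilibrium distances (Angstrom)
log_target = 2.0 - 0.97 * r_e + rng.normal(0.0, 0.02, size=r_e.size)

slope, intercept = np.polyfit(r_e, log_target, 1)     # degree-1 polynomial fit
print(f"fitted slope = {slope:.3f} (expected close to -0.97)")
```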

The dataset
In this work, we focus on heteronuclear molecules due to their relevance to laser cooling of molecules, with applications in ultracold chemistry.47-49 The employed dataset contains the main spectroscopic constants R_e, ω_e, and D_0 for the ground electronic state of heteronuclear diatomic molecules. In particular, it contains the experimental values of R_e and ω_e for 256 heteronuclear diatomic molecules taken from ref. 50-53, whereas experimentally determined values of D_0 are only available for 197 of them. As far as we know, this is the most extensive dataset of experimental ground state properties of heteronuclear diatomic molecules. Fig. 2 shows the distribution of the equilibrium distance and of its ratio to the sum of the atomic radii of the constituent atoms, R_1 + R_2, for molecules within the dataset. Most molecules show an equilibrium distance between 1.4 Å and 3.8 Å, with a most probable value of 1.7 Å. Looking at the values of R_e/(R_1 + R_2), it is clear that the molecules within the dataset feature different bonds: covalent, van der Waals, and ionic.
We have classied the dataset based on the types of constituent atoms within a molecule, and the results are shown in Fig. 3. As a result, we notice that the dataset mainly consists of various metal and non-metal halides, hydrides, and metalloid compounds. It is worth noticing that more than 20% of the dataset contains transition metal compounds, including f-block elements. Therefore, the present dataset is general since it goes beyond the main-group diatomic molecules and deals with some of the more intriguing and complex atoms from a chemistry standpoint.
In addition to the ground state dataset described above, we also study 131 molecules whose R_e and ω_e are available for the A excited electronic state. The A-state dataset mainly consists of metal and non-metal compounds, including transition metal compounds and several f-block compounds.

Machine learning method
The quest for universal relationships between spectroscopic constants is related to the problem of how atomic and molecular properties describe a spectroscopic property of a molecule, y = f(x), where the feature vector x = (x_1, ..., x_n) consists of different atomic properties of the constituent atoms or molecular properties, and n denotes the number of input features relevant for the problem at hand. Unlike traditional (non-)linear regression models, which assume a fixed form of the function f(x), GP regression embraces a Bayesian perspective and presumes a prior distribution over the space of functions: a joint multivariate Gaussian distribution, centered at m(x_i) and characterized by the covariance function K(x_i, x_j), which specifies the correlation (or "similarity") between data points.40 In this work, the spectroscopic properties y are modeled as y_i = h(x_i)^T β + f(x_i) + ε_i, where the basis functions, h(x_i), project {x_i} to a new (higher dimensional) feature space with coefficients β, f(x) follows the GP prior, and the term ε_i, of variance σ_y², accounts for the noise in the observations.40,54 The training set D = {(x_i, y_i) | i = 1, ..., N}, with N observations, constrains the available distribution of functions through Bayes' theorem, and the mean of the posterior distribution is used for prediction. The functional forms of K(x_i, x_j) and h(x) can be selected according to the cross-validation performance of the models.
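The mechanics of GP prediction can be illustrated with a minimal numpy sketch using an RBF (squared-exponential) covariance. This is a simplification of the model above (zero prior mean, no explicit basis functions h(x)); the kernel choice and hyperparameters here are assumptions, not the authors' settings:

```python
import numpy as np

def rbf_kernel(X1, X2, length=1.0, sigma_f=1.0):
    """Squared-exponential covariance K(x_i, x_j)."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq / length**2)

def gp_fit_predict(X_train, y_train, X_test, length=1.0, sigma_f=1.0, sigma_y=1e-3):
    """Posterior mean and variance of a zero-mean GP at the test points."""
    K = rbf_kernel(X_train, X_train, length, sigma_f) + sigma_y**2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train, length, sigma_f)
    alpha = np.linalg.solve(K, y_train)           # K^-1 y
    mean = K_s @ alpha                            # posterior mean
    var = sigma_f**2 - np.einsum("ij,ji->i", K_s, np.linalg.solve(K, K_s.T))
    return mean, var

# Toy usage: learn y = x^2 from three points and query a training location.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 4.0])
mean, var = gp_fit_predict(X, y, np.array([[1.0]]))
print(mean, var)
```

At a training point the posterior mean reproduces the observation (up to the noise level σ_y) and the posterior variance collapses, which is the basic behavior exploited when interpolating spectroscopic constants.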

Model performance evaluation
In training and evaluating the regression models, as is customary in ML, the ground state dataset is divided into training and test sets. The training set is the set of molecules used for learning a given spectroscopic constant from the atomic properties of the constituent atoms. The test set is the set of molecules that have not been included in the learning procedure and hence are new to the regression algorithm. In learning the equilibrium internuclear distance, R_e, and the harmonic vibrational frequency, ω_e, the training and test sets consist of 231 and 25 molecules, respectively. In learning log[D_0R_e³/(Z_1Z_2)] the training/test splitting is 172/25. For learning R_e and ω_e of the A excited electronic state, the training set consists of 106 molecules and the test set consists of 25 molecules. The present dataset is relatively small from an ML perspective, so when it is split into training and test sets, the training set may not be representative, which may bias the performance on the test set. To mitigate this problem, we have employed a Monte Carlo (MC) approach, in which the dataset is stratified into 25 strata based on the level of the true values of the labels (R_e, ω_e, and log[D_0R_e³/(Z_1Z_2)] in the present work).
As shown in panel (a) of Fig. 4, there are two loops in the training and evaluation of the models. In the outer loop, we split the dataset into a training set, used to learn from the data, and a test set, used for model evaluation. In the inner loop, we train the models on the training set, which is further split to perform a stratified 5-fold cross validation (CV) for the hyperparameter optimization. In particular, as shown in panel (b) of Fig. 4, the training/test splittings in the outer loop are done by the MC approach: we randomly select 25 test molecules from the dataset, which is stratified into 25 strata based on the levels of the true values of the labels. The stratification helps to minimize the change in the proportions of the dataset composition upon splitting.55 In each MC step, a regression model is trained and gives predictions for the training set and the test set; therefore, in this work we report the mean and standard deviation of the predictions for each molecule over all the MC steps in which it appears in the training or the test set. In total, we evaluate our models with 1000 MC training/test splittings for the model performance evaluation, and 500 MC steps for generating the learning curves.
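One MC splitting step can be sketched as follows. This is a plausible implementation of the procedure just described (equal-occupancy strata over sorted label values, one test molecule drawn per stratum), not the authors' exact code:

```python
import numpy as np

def mc_stratified_split(labels, n_strata=25, seed=None):
    """One Monte Carlo training/test split: stratify the molecules into
    n_strata by the level of the true label, then draw one test molecule
    per stratum, so the test set spans the whole label range."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                      # indices sorted by label level
    strata = np.array_split(order, n_strata)        # contiguous, near-equal strata
    test = np.array([rng.choice(stratum) for stratum in strata])
    train = np.setdiff1d(np.arange(len(labels)), test)
    return train, test

# Usage: 256 molecules -> 231 training / 25 test, as when learning Re and we.
labels = np.random.default_rng(0).uniform(1.4, 3.8, size=256)
train, test = mc_stratified_split(labels, seed=1)
print(len(train), len(test))  # -> 231 25
```

Repeating this for many seeds and averaging per-molecule predictions reproduces the reported mean-and-standard-deviation evaluation.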
The performance of the models is evaluated by three different estimators. The first is the mean absolute error (MAE), defined as MAE = (1/N) Σ_i |y_i* − y_i|, where y_i* are the true values, y_i are the predictions, and N is the number of observations. The second is the root mean square error (RMSE), given by RMSE = [(1/N) Σ_i (y_i* − y_i)²]^{1/2}. The last estimator is the normalized error r_E, defined as the ratio of the RMSE to the range of y, r_E = RMSE/(y_max − y_min).
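The three estimators are straightforward to implement; the sketch below contains nothing beyond the definitions just given:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def normalized_error(y_true, y_pred):
    """r_E: RMSE divided by the range of the true values."""
    y_true = np.asarray(y_true)
    return rmse(y_true, y_pred) / float(y_true.max() - y_true.min())

# Usage on a tiny example (range of y_true is 4.0).
y_true, y_pred = [0.0, 2.0, 4.0], [1.0, 2.0, 3.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), normalized_error(y_true, y_pred))
```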

The learning curves
The learning curves show the training and test performance of a model as a function of the training set size N. From the learning curves it is possible to assess the performance of a model by looking at its bias and variance. Similarly, it is possible to understand whether the model performance improves with the training set size. For each of the points in the learning curve, the training is performed with 500 different training/test splittings by the MC approach.

Learning ground state spectroscopic constants
Fueled by the idea of the periodicity of molecules (see, e.g., ref. 18 and references therein), we use the group, g_k, and period, p_k, of the atoms within a molecule, k = 1, 2, as input features for a GP regression model to predict different combinations of spectroscopic constants: R_e, ω_e, and log[D_0R_e³/(Z_1Z_2)], as presented in Section 2. The training sets are permuted before being fed to the learning algorithm, so that the GP regression models reproduce the invariance of the relevant properties upon exchanging the two atoms in a molecule.
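The permutation step can be sketched as follows: each training sample is duplicated with the two atoms swapped, so the model sees both atom orderings with the same label (a plausible implementation of the described augmentation; the feature tuples shown are hypothetical):

```python
def augment_with_atom_swap(samples):
    """Duplicate each sample with the two atoms exchanged, so the model sees
    (g1, g2, p1, p2) and (g2, g1, p2, p1) with the same label value."""
    augmented = []
    for (g1, g2, p1, p2), label in samples:
        augmented.append(((g1, g2, p1, p2), label))
        augmented.append(((g2, g1, p2, p1), label))
    return augmented

# Usage: a hypothetical molecule with atoms from (group 1, period 2)
# and (group 17, period 3), labeled with some spectroscopic constant.
data = [((1, 17, 2, 3), 2.1)]
print(augment_with_atom_swap(data))
# -> [((1, 17, 2, 3), 2.1), ((17, 1, 3, 2), 2.1)]
```

Averaged features such as ḡ and p̄ are already symmetric under this swap and need no augmentation.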
The GP regression model performance for the ground state R_e as a function of the input features (g_1, g_2, p_1, p_2) is shown in Fig. 5, where the MAE associated with each of the distinct types of molecules is reported. Most of the molecules are well described by our GP model, as confirmed in the inset of Fig. 5: it shows little dispersion of the predicted values with respect to the true values, except for a handful of molecules (transition metal-metal and bi-alkali molecules). To further quantify the GP regression model performance, we calculate the average RMSE of the predicted R_e on 1000 randomly selected test sets, leading to 0.0968 ± 0.0070 Å (Table 1) and r_E = 2.80 ± 0.20%. Our results confirm that the model performance improves as the number of molecules in the training set, N, grows, as shown in the learning curve in panel (a) of Fig. 8. Indeed, it is not yet converged for N = 231, suggesting that the GP regression model can be further improved by learning more data in the training set.
In learning ω_e, we find (R_e*⁻¹, g_1^iso, g_2^iso, p_1, p_2, ḡ) to be the best combination of features, where R_e* is the equilibrium distance predicted from (g_1, g_2, p_1, p_2), g_k^iso encodes the information about the hydrogen isotopes of the k-th atom in the molecule, and ḡ is the average of the groups of the two atoms. However, a much better performance is found when the true R_e value is employed. The GP model's performance is shown in the inset of Fig. 6, where the predicted values agree very well with the true values. Indeed, the test set MAE and RMSE are 46.7 ± 0.6 cm⁻¹ and 73.4 ± 0.2 cm⁻¹, respectively, while r_E = 1.80 ± 0.005%, as shown in Table 1. Despite the outstanding performance of our GP regression model, some molecules are still not well described, as shown in Fig. 6. These outliers include HF, DF, and HgH. The large errors in predicting ω_e for HF and DF can be attributed to their unique bond mechanism compared with the other halogen hydrides.
Within the features (R_e⁻¹, g_1^iso, g_2^iso, p_1, p_2, ḡ), it is interesting that the average of the groups, ḡ = (g_1 + g_2)/2, helps in learning ω_e. In particular, with ḡ, the MAE of the model is reduced by around 20% compared with the predictions using (R_e⁻¹, g_1^iso, g_2^iso, p_1, p_2) as input features, as summarized in Table 1. Analogously, the standard deviation of the predictions over the MC training/test splittings becomes much smaller, suggesting that the model is more robust for the different kinds of molecules within the dataset. Indeed, by introducing ḡ, the most significant improvement occurs in the description of bi-alkali molecules, where the MAE is reduced by a factor of 3. The errors in predicting HF and DF are also reduced by a factor of 2, although they remain tricky cases for the model. On the contrary, introducing the average of the periods, p̄ = (p_1 + p_2)/2, does not help improve the model, suggesting that ω_e depends on the total number of valence electrons of the two atoms rather than on the number of electron shells.

Motivated by the pioneering work of Anderson, Parr, and coworkers,27,29,30,45 we study the prediction of log[D_0R_e³/(Z_1Z_2)] based on GP regression, and the results are shown in Fig. 7. In this case, (R_e, ḡ, p̄) turn out to be the best input features, and the model shows a fast convergence with respect to the size of the training set, around N = 150, as shown in panel (c) of Fig. 8. The most significant outlier is NaK, which is a van der Waals molecule. D_0 of NaK is overestimated, which may be attributed to the fact that NaK is the only bi-alkali molecule in the dataset with an available D_0. There are also some outliers containing first-row elements and 3d transition metals.

Table 1 Regression model predictions of R_e, ω_e, and D_0. g_i and p_i represent the group and period of the i-th atom, respectively. g_i^iso stands for the group encoding the information of the isotopes of hydrogen, and p̄, ḡ are the averages of the periods and groups of the two atoms, respectively.
A summary of our GP regression model performance for the different combinations of ground state spectroscopic constants considered in this work is shown in Table 1, compared against the models proposed by Parr, Anderson et al.27,29,30,45 The GP regression model shows a superior performance with respect to the linear model (LR in the table) based on a particular functional form of the electron density within the molecule. Indeed, the GP performance is, in some cases, five times better than that of the linear model (in terms of the relative error). Therefore, the group and period (correlated with the number of valence electrons and the number of electron shells, respectively) of the constituent atoms within a molecule encapsulate more valuable information regarding spectroscopic constants than simple functional forms for the electron density within the framework of ref. 27, 29, 30 and 45. It is also interesting to notice that, when predicting R_e and ω_e, one needs the groups and periods of each atom in the molecule, whereas log[D_0R_e³/(Z_1Z_2)] can be well described with only the averages of the groups and periods of the two atoms. Therefore, log[D_0R_e³/(Z_1Z_2)] is correlated with additive properties of the groups and periods rather than with the differences between the two atoms caused by their different groups.
To further examine whether our ML approach is generalizable, we have selected 26 molecules outside the dataset, unseen by the ML algorithm, including CoO,56 for which the relative error is 4%. For RuC and ZnBr, the relative errors are below 1%, and for NiS and MoC, the relative errors are below 2%. For MoC, NbC, PbI, SnI, YC and ZrC, the experimental binding energy has been reported, and the MAE of our GP regression model in predicting D_0 is 0.32 eV (7.6%). Therefore, our models perform fairly well on this extra test set.

Learning the rst excited state spectroscopic constants
To learn the equilibrium internuclear distance R_e of the A excited electronic state for different molecules, we need to employ atomic features of the two constituent atoms, including g_1, g_2, p_1, p_2, Δ(IP, EA), and the ground state R_e(X), when constructing the GP regression models. Interestingly, including Δ(IP, EA) improves the predictions (Table 2); it is defined in terms of the ionization potential IP_i, electron affinity EA_i, and electronegativity χ_i of atom i, and qualitatively measures the electron transfer between the two constituent atoms. The resulting test set MAE, RMSE and r_E are 0.0691 ± 0.0062 Å, 0.098 ± 0.0097 Å, and 5.32 ± 0.53%, respectively. As shown in Fig. 9, similar to the results for the ground state R_e, the transition metal-metal compounds are the most difficult to predict. For learning ω_e of the A excited electronic state, in addition to the ground state R_e⁻¹(X), it is also necessary to include the A-state R_e⁻¹(A). Furthermore, it is better to include the ground state ω_e(X) as an input feature.

Table 2 Regression model predictions of the A excited electronic state R_e and ω_e. g_i and p_i are the group and period of the i-th atom, respectively, whereas g_i^iso stands for the group encoding the information of the isotopes of hydrogen. p̄, ḡ are the averages of the periods and groups of the two atoms, respectively. R_e(X) and R_e(A) refer to the ground state and A-state R_e, respectively, and ω_e(X) refers to the ground state ω_e.

Fig. 8 Performance of the GP regression models as a function of the training set size N. (a) Learning curve of R_e, predicted with the groups and periods of the two atoms, (g_1, g_2, p_1, p_2). (b) Learning curve of ω_e, using (R_e⁻¹, g_1^iso, g_2^iso, p_1, p_2, ḡ) as input features. (c) Learning curve of log[D_0R_e³/(Z_1Z_2)], using (R_e, ḡ, p̄) as input features. The shade around the points denotes the variance of the errors from the MC method.
The results are shown in Fig. 10, in which (ω_e(X), R_e⁻¹(X), R_e⁻¹(A), g_1, g_2, p_1, p_2) leads to an RMSE of 105.1 ± 1.1 cm⁻¹ and r_E = 11.0 ± 0.12%. We also find that including the average of the groups ḡ or the isotope information does not further improve the model performance. This is expected, since this information has already been encoded in the ground state ω_e.
The performance of our models in predicting the A excited electronic state R_e and ω_e is summarized in Table 2. Compared with the ground state predictions, the errors in predicting the A excited electronic state spectroscopic constants are around two times larger, reflecting the difficulty of predicting excited state properties. However, we notice that ω_e is correlated with the inverse of R_e(A), as for ground state molecules. Our findings corroborate the hypothesized relationship between R_e and ω_e from the early times of molecular spectroscopy, as introduced in Section 2.

Conclusions
In summary, we have shown, using GP regression models, that the main spectroscopic constants of diatomic molecules are related. This result confirms the scenario that Kratzer and Mecke envisioned a century ago.2,3 The relationships are mostly independent of the nature of the chemical bond of the diatomic molecule. In particular, we have demonstrated that, merely using the atoms' group and period within a molecule as input features, particular combinations of spectroscopic constants can be predicted with an error r_E < 5%. In other words, the spectroscopic constants of diatomic molecules can be efficiently learned from an appropriate dataset by GP regression models, and their values can be accurately predicted. Furthermore, we have shown that GP regression can efficiently learn spectroscopic relationships for excited electronic states of molecules with an error r_E < 11%.

Fig. 9 The test set MAE predicting the A excited electronic state R_e by GP regression, using (g_1, g_2, p_1, p_2, R_e(X), Δ(IP, EA)) as input features, classified by the types of the constituent atoms. The inset shows the test set predictions of the A excited electronic state R_e compared with the true values. The values shown are the averages of predictions from 1000 MC-sampled training/test splittings. The GP regression model learned from the training set gives predictions for the test and training sets. Shown are the mean and standard deviation of each molecule's predictions when used as training data (green symbols) and test data (orange symbols).

Fig. 10 The test set MAE predicting the A excited electronic state ω_e by GP regression, using (ω_e(X), R_e⁻¹(X), R_e⁻¹(A), g_1, g_2, p_1, p_2) as input features, classified by the types of the constituent atoms. The inset shows the test set predictions of the A excited electronic state ω_e compared with the true values. The values shown are the averages of predictions from 1000 MC-sampled training/test splittings. Shown are the mean and standard deviation of each molecule's predictions when used as training data (green symbols) and test data (orange symbols).
Despite the present GP models' outstanding performance, machine learning methods are sometimes considered mere fitting techniques, or black-box algorithms from which one can hardly learn anything new. This statement is not accurate. As an example, we emphasize here what we have learned from the present machine learning approach. It is generally assumed that some molecular properties can be predicted based on the constituent atoms' positions in the periodic table.72 However, such predictions are only qualitative rather than quantitative: for instance, it is possible to anticipate the nature of a molecule's bond, but not to accurately guess its dissociation energy. Thanks to ML, we now know that it is possible to predict reasonably accurate spectroscopic constants using the constituent atoms' group and period.
We have learned that ω_e and R_e depend strongly on the number of valence electrons and electron shells of the atoms forming a molecule, whereas the average number of valence electrons also plays an important role in describing ω_e. Finally, we would like to emphasize that there are around 7000 heteronuclear diatomic molecules, and we only utilize 256 of these for our GP regression model. The limited availability of spectroscopic data (only around 3% of the possible heteronuclear diatomic molecules) shows the vast amount of spectroscopy that remains to be done within the realm of diatomic molecules. The more data we have, the more accurate the GP regression model predictions will be before reaching convergence of the learning curve, and the more knowledgeable the community will be about the fundamental properties of diatomic molecules. From our perspective, the present work may motivate data science-driven studies in the field of spectroscopy of diatomic molecules. In particular, it will help the field of spectroscopy evolve towards the current information era and achieve a better understanding of spectroscopic properties. Furthermore, our results may also bring some insight to the development of features and geometry representations in materials science.