A data-driven approach to determine dipole moments of diatomic molecules

We present a data-driven approach for the prediction of the electric dipole moment of diatomic molecules, which is one of the most relevant molecular properties. In particular, we apply Gaussian process regression to a novel dataset to show that dipole moments of diatomic molecules can be learned, and hence predicted, with a relative error<5%. The dataset contains the dipole moment of 162 diatomic molecules, the most exhaustive and unbiased dataset of dipole moments up to date. Our findings show that the dipole moment of diatomic molecules depends on atomic properties of the constituents atoms: electron affinity and ionization potential, as well as on (a feature related to) the first derivative of the electronic kinetic energy at the equilibrium distance.


I. INTRODUCTION
The study of relationships between spectroscopic constants is a traditional topic in chemical physics since the pioneering work of Kratzer and Mecke, among others 1-6 and is beautifully summarized by Varshini. 7,8 Recently, we have shown that some spectroscopic constants are universally related, 9 i.e., the relationships between them are independent of the nature of the molecular bond. However, the electric dipole moment of a molecule, despite being an essential molecular property, has not been considered in previous studies about relationships between spectroscopic constants. Only recently, there have been some efforts towards the understanding of the dipole moment in terms of molecular spectroscopic constants. As a result, it has been found by Hou and Bernath that the expression for the dipole moment, d, taught in elementary chemistry courses where q is the effective charge and R e denotes the equilibrium bond length of the molecule, does not capture the proper physics of the dipole moment in many molecules. 10,11 They also demonstrated that the dipole moment of some molecules can be predicted from the effective charge (obtained from quantum chemistry calculations) and spectroscopic constants of molecules.
In the 2000s the big data-driven science paradigm emerged in the scientific community. 12 In this new paradigm, machine learning techniques are among the most prominent tools to assess scientific knowledge. To be precise, adequately formatted data are used to identify unexpected correlations and to predict observables based on patterns and trends of the data. When applied to physics, this novel paradigm lets nature speak up through hidden and intriguing correlations that lead to the formulation of new questions beyond a specific physical model. In particular, in chemical physics, as recently shown, datadriven approaches bring a new perspective to solve some of the most delicate problems of the field. [13][14][15][16] In this paper, we present a data-driven approach to dipole moments of diatomic molecules and its relationship with spectroscopic constants. We show that, after compiling the most exhaustive list of dipole moments for diatomics up to date (to the best of our knowledge) into a dataset, it is possible to learn the dipole moment of diatomic molecules based upon atomic and molecular properties with an error 5%. The number of molecules in our dataset, classified by the type of the constituent atoms, is given in Fig. 1. Our results reveal that it is not possible to predict the dipole moment of a molecule solely from atomic properties, although this is feasible for the spectroscopic constants, 9 but that it is necessary to include molecular features. The molecular spectroscopic constants are needed in a combination that describes the force on the electrons at the equilibrium distance, i.e. in a combination that has the same functional dependence as the first derivative of the electronic kinetic energy at the equilibrium distance.

II. AN OVERVIEW ON THE NATURE OF THE ELECTRIC DIPOLE MOMENT OF MOLECULES
The study of the nature of the electric dipole moment of molecules is a traditional topic in quantum chemistry that has fascinated the chemical physics community for almost a century by now. The first explanation of the nature of the electric dipole moment of molecules is due to Pauling in the 1930s 17 . In particular, after studying hydrogen halide molecules, Pauling proposed that the dipole moment of a molecule is correlated with the relevance of the ionic structure with respect to the covalent one at the equilibrium bond length of the molecule. In this model, the dipole moment is a consequence of the charge transfer between the atoms within the molecule. Therefore, the larger the charge transfer, the bigger the dipole moment is. The charge transfer is quantized by the ionic character (IC), which is given by where e is the electron charge. Comparing Eqs. (2)  slight inaccuracy of Pauling's model in predicting dipole moments, it is worth emphasizing that Pauling realized that the dipole moment of a molecule must be related to other molecular properties through the molecular bond. The next step towards understanding the electric dipole moment was the introduction of a new concept: the homopolar dipole moment, d h , by Mulliken. In particular, Mulliken realized that because the atomic orbitals are different in size, the overlap between those leads to a charge displacement with respect to the midpoint of the equilibrium bond length, which affects the electric dipole moment of the molecule 18 . Furthermore, Mulliken noticed that the asymmetry in the charge distribution of hybrid orbitals causes the so-called atomic dipole moment, d a . The models of Mulliken and Pauling were summarized and further developed by Coulson 19 , who proposed the ultimate expression for the dipole moment of a diatomic molecule as where d p is the contribution due to the polarization of the atomic orbitals to the dipole moment of the molecule. One has to realise that Eq. (3), although being more precise than Eq.
(1), requires the input from quantum chemistry calculations. For a summary on the Pauling and Mulliken models, we recommend the comprehensive review of Klessinger. 20 The models of Pauling and Mulliken have been accepted by the physical chemistry community and taught in elementary chemistry courses for a long time, despite the fact that neither one of those is fully satisfactory. Recently, Hou and Bernath 10,11 , after studying the experimentally determined dipole moments of an extensive group of molecules and using quantum chemistry calculations, have suggested that the electric dipole moment of a molecule should be given as where q is the effective charge and R d is an effective length that depends on fundamental spectroscopic constants of the molecule with R d < R e . Both Eq. (4) and Eq.
(3) rely on the input of quantum chemistry calculations, in particular on the results from a natural bond orbital analysis. Therefore, the electric dipole moment of diatomic molecules still lacks a satisfactory and accurate explanation in terms of fundamental spectroscopic constants.

A. Gaussian process regression
Finding relationships of the dipole moment with spectroscopic constants can be viewed as a regression problem, where the goal is to learn the mapping from the input atomic and molecular features x onto the target property, y, which in this case is the electric dipole moment, by a function y = f (x). In the present work, we use Gaussian process regression (GPR) to approximate the function f (x). As a non-parametric probabilistic method, GPR does not presume a functional form of f (x) before observing the data. Instead, it infers a Gaussian distribution of functions over function space by a Gaussian process 21,22 f (x) ∼ GP(m(x), k(x, x )), determined by a mean function, m(x), and a kernel (covariance) function, k(x, x ). The prior, p( f |x), spanning in the function space, after exposed to the observations, is constrained into a posterior, p( f |x, y), based on the Bayes theorem. The predictions, y * , can then be made for new input atomic and molecular features, x, through the posterior. The kernel function, k(x, x ), captures the smoothness of the response and intrinsically encodes the behaviour of the model acting on the input. The kernel functions can be chosen by presuming the behaviour of the response to the input feature by observing the data. Its functional form and the possible hyperparameters can also be determined by a cross-validation (CV) 23 .

B. Model evaluation
In learning the dipole moments, the dataset is divided into training and test sets. As a data-driven approach, GPR learns the relationship between the input features and dipole moments by observing the training set, while the predictive performance of the GPR models is examined with the test set.
In this work, 20 molecules are used in the test set, while the rest are used in the training set. For the training/test splitting, the dataset is first stratified into 20 strata based on the dipole moments' true values. A Monte Carlo (MC) approach is then performed to select the 20 test data from the dataset randomly. In each MC step, a GPR model is trained based on the training set with 5-fold cross-validation. The generalization performance of the model is then evaluated with the test set. In the end, the mean and standard deviation (STD) of the test-set errors are reported in this work, obtained from 1000 MC training/test splittings. Details about this MC approach will be discussed elsewhere. 24 The performance evaluation of the GPR models is carried out through three different estimators: • The mean absolute error (MAE) defined as where y i are the true values of dipole moments, y * i are the predictions, and N is the number of observations in the dataset.
• The root mean square error (RMSE), which reads as • The normalized error, r E , defined as the ratio of the RMSE to the range of the data

IV. THE DATASET
The dataset employed in this work consists of ground-state dipole moments of 162 polar diatomic molecules, 139 of which have both information on the equilibrium bond length, R e , and the harmonic vibrational frequency, ω e . The dataset is presented in Table IV of the Appendix and it constitutes the most extensive dataset for dipole moments of diatomic molecules that we are aware of. Nevertheless, for more efficient scrutiny of our dataset's generality, we show in Fig. 2 the equilibrium bond length, R e , versus the electric dipole moment of diatomic molecules. The density plots and the box plots show the distribution of R e (right) and dipole moment, d, (top), respectively. The equilibrium bond length of the molecules is distributed between 0.9 and 3.9 Å with a median of around 1.5 Å, although most of the molecules show an equilibrium bond length between 1.2 and 3.2 Å. The dipole moment values in the dataset range from 0.0043 D to 11.69 D with a median of around 2.45 D, which shows the large variety of molecules included in the dataset. The dataset can also be categorized in terms of the type of atoms constituting the molecules, as it is shown in Fig. 1. In this figure, it is noticed that most of the molecules in the dataset present a highly ionic bond resulting from a transition metal and a nonmetal atom. The second most prominent group of molecules contains a halogen atom and an alkaline atom, which shows an ionic bond. The rest of the molecules exhibit a bond from partially ionic to highly ionic, which shows the diversity of the dataset.

V. RESULTS AND DISCUSSION
We have used a GPR approach to learn the diatomic molecules' dipole moment employing features coming from different atomic and molecular properties. The atomic properties considered are the electron affinity (EA), ionic potential (IP), electronegativity (χ) and polarizability (α) whereas the molecular properties are the reduced mass, µ, equilibrium bond length, R e , and the harmonic vibrational frequency, ω e . The atomic properties employed are related to the intrinsic chemical nature of the dipole moment due to the polarity of a molecular orbital in the molecular-orbital bond theory or to the ionic character of the molecular bond within the valencebond theory. 19 The GPR performance for different features is summarized in Table I, where we employ 118 out of the 139 molecules from the dataset having values for both R e and ω e .
After using different combinations of atomic and molecular properties, we find that the dipole moment of a di-  Fig.3, The predicted values reproduce the true values very well with a small deviation that leads to a normalized error r E < 5% (RMSE= 0.56±0.02 D). We have also computed the learning curve of the cited GPR model, which gives an intuitive idea about the model's learning and generalization performance concerning the size of the training set. The results are shown in the inset of Fig.3. The training RMSE and test RMSE are shown as a function of the number of training data points. The learning curve's shade shows the variance of training/test RMSE, obtained for each point from a MC approach of 500 training/test splittings. The mean test error decreases with increasing training data. In particular, with 80 training data, the learning curve is almost converged, suggesting that this model can not benefit from more data of the same dataset. The error's variance shows the ability of the model to be employed in different subgroups of molecules. In this case, the variance of test RMSE becomes smaller as the number of training data increases and converges to < 0.02 D with 60 training data.
In previous work, we have shown that R e , ω e , and the binding energy of a diatomic molecule can be learned through groups and periods of the constituent atoms as features 9 . However, the same features dramatically fail in learning the dipole moment. In particular, we find that the test errors are RMSE= 1.25 ± 0.02 D and r E = 10.8 ± 0.1%, respectively. In our view, this is an indication of the more intricate nature of the dipole moment compared to the spectroscopic constants of diatomic molecules.
In Ref. 25 it is shown that the dipole moment of diatomic alkali-alkaline earth molecules can be empirically calculated from the difference in the electronegativity of the constituent atoms |χ 1 − χ 2 |, the mean atomic polarizabilitiesᾱ = (α 1 + α 2 )/2 and the dissociation energy D e . We have generalized this idea trough a GPR model by using ( |χ 1 − χ 2 |,ᾱ, D −1 0 ) as features and applied it to the present dataset, despite the fact that alkaline earth-alkaline molecules are absent in the dataset. We have used the binding energy, D 0 , instead of the dissociation energy, as the former is tabulated more frequently. As a result, the normalized error is r E = 10.5 ± 0.3%, which indicates that some of the physics behind the dipole moment  24 . The test set contains 20 molecules, while the training set contains 98 molecules. The mean and standard derivation of the predictions are shown for each molecule when they are used as training data (shown in blue) and test data (shown in orange). The inset shows the learning curve, which shows the training and test RMSE of the model with respect to the number of training data. The shade in the learning curve shows the variance of training/test RMSE, obtained for each point from a MC approach of 500 training/test splittings. function of alkali-alkaline earth molecules is applicable to any other molecule. This is an unexpected result that shows the underlying universality of the physics behind the dipole moment.
The outstanding performance of (EA 1 ,EA 2 ,IP 1 ,IP 2 , µR e ω 2 e ) as the set of descriptors implies that the accepted picture in chemistry in which the difference of the electronegativity of the atoms within a molecule establishes the ionic character of the molecular bond 17,19,26 is not sufficient to characterize the dipole moment of a molecule. When using the electron affinity and the atoms' ionization potential as features, the performance improves by 25%. However, only if µR e ω 2 e is included as a feature, the dipole moment is predicted with a RMSE below 0.7 D. Therefore, we find that it is essential to add µR e ω 2 e as a feature in describing the dipole moment of a diatomic molecule. It can be shown that this feature is related to the derivative of the electronic kinetic energy, T (R), at the equilibrium bond length as 27 which represents a force within the molecule. When equating this force to the pure electrostatic force, one obtains R d and, through Eq.(4), it is then possible to define the ionic character as where the value of IC is given in percent. It is seen that IC does not directly depend upon the electronegativity differences of the atoms, contrary to the accepted picture in chemistry. The feature µR e ω 2 e was first introduced by Hou and Bernath 10,11 as an empirical relationship, and we use this here to define the ionic character of a molecular bond.
Alternatively, the ionic character can be defined in terms of the electronegativity difference between the two atoms forming a molecule as following Hannay and Smyth 26 . Surprisingly, Eqs. 10 and 11 lead to different results for the ionic character of the molecules in the database, as shown in Fig. 4, where it is noticed that the distribution of the ionic character following Eq. 11 appears to the complement to the one obtained from Eq. 10. Furthermore, the model of Hou and Bernath (Eq. 10) systematically leads to a larger ionic character than the model of Hannay an Smyth. The GPR model with (EA 1 ,EA 2 ,IP 1 ,IP 2 , µR e ω 2 e ) as input features shows several outliers. To see the importance of these outliers we have compared the distribution of the ionic character and dipole moment of the molecules in Fig. 4 (shown in grey) with the same magnitudes for the subset of 118 molecules that can be learned in this work (shown in blue). The ML-learned subset has similar overall distributions of dipole moments and ionic characters in comparison with the whole dataset. Therefore the outliers do not significantly modify the underlying distribution that the molecules follow.
In Table II, it is shown a classification of the outliers as a function of its molecular bond and constituent atoms. The effective atomic charges of these molecules are also calculated with a density functional theory (DFT) approach, which is shown in Table III utilizing different charge partitioning methods. The calculations are performed with the B3LYP functional 28 and def2-TZVP basis set [29][30][31] , with the Gaussian 16 package 32 . We have noticed that for these outliers, the natural bond orbital (NBO) method gives larger effective atomic charges compares to the Mulliken population. Furthermore, all the molecules showing a NBO charge larger than 1.0 are the ones showing an ionic character in virtue of Eq.10 above 100%. For the outliers within the van der Waals molecules, we find LiNa and NaCs. LiNa has the smallest R e and dipole moment of the bialkaline molecules in this dataset, while NaCs has the largest R e and dipole moment.
To understand the effect of different bonding types on the dipole moment, we plot in Fig. 5 the relationships between R e and dipole moments for different kinds of molecules in the current dataset, where the outliers are shown in red circles. We observe that the relationships between R e and dipole moments depend on the type of molecule under consideration. As shown in panel (a) of Fig. 5, R e and dipole moments show linear relationship for metal-nonmetal molecules, in which the nonmetals atoms are from the same group in the periodic table. Similarly, linear behaviors have also been observed for the group IV/VI diatomic molecules in Ref. 33 . For the oxygen halides shown in panel (b), R e increases almost linearly TABLE II. Outliers for learning the electric dipole moment of diatomic molecules. These molecules are labeled in Fig. 2 and classified with the types of constituent atoms and the molecular bonds.

VI. CONCLUSION
In summary, we have shown that through a GPR model, the ground state dipole moments of diatomic molecules can be FIG. 5. The equilibrium bond lengths R e as a function of dipole moments, classified by the type of the constituent atoms. The molecules that can be described by the GPR models from (EA 1 , EA 2 ,IP 1 ,IP 2 , µR e ω 2 e ) are shown in blue circles, while the outliers are shown in red circles. related to spectroscopic constants, namely R e and ω e . More specifically, without any quantum chemistry calculation, the dipole moments of 118 molecules have been predicted with an error 5% by using both atomic features, including electron affinity and ionic potential, and a combination of molecular spectroscopic constants, µR e ω 2 e . In addition, we find that the difference in the electronegativity of the constituents atoms is not sufficient to describe the dipole moments of the diatomic molecules in stark contrast with what is generally assumed in general chemistry. Therefore, our data-driven approach shows that the nature of the dipole moment is more intricate than other spectroscopic constants, and it is clearly correlated with the very fundamental nature of the chemical bond. Finally, it is worth emphasizing that our findings have been possible thanks to the development of a complete and orthodox dataset.

ACKNOWLEDGMENTS
We thank Dr. Stefan Truppe for reading the manuscript and for useful comments and discussion regarding the nature of the electric dipole moment.
The kernel function employed in this work, which gives the best CV scores, is the ration quadratic kernel 22 defined by where σ l is the length scale, and α is a scale-mixture parameter, r is the Euclidean distance between x i and x j defined as Appendix B: The dataset for dipole moment of diaotmic molecules The dataset is summarized in Table IV, which consists of dipole moments µ 0 of 162 polar diatmonic molecules, 156 of which have information about equilibrium bond length R e while 139 also have harmonic vibrational frequency ω e . The references of the dipole moments are also listed in the table.