The equilibrium molecular structures of 2-deoxyribose and fructose by the semiexperimental mixed estimation method and coupled-cluster computations

Fructose and deoxyribose (24 and 19 atoms, respectively) are too large for determining accurate equilibrium structures, either by high-level ab initio methods or by experiments alone. We show in this work that the semiexperimental (SE) mixed estimation (ME) method offers a valuable alternative for equilibrium structure determinations in moderate-sized molecules such as these monosaccharides or other biochemical building blocks. The SE/ME method proceeds by fitting experimental rotational data for a number of isotopologues, which have been corrected with theoretical vibration–rotation interaction parameters ($\alpha_i$), and predicate observations for the structure. The derived SE constants are later supplemented by carefully chosen structural parameters from medium-level ab initio calculations, including those for hydrogen atoms. The combined data are then used in a weighted least-squares fit to determine an equilibrium structure ($r_e$). We applied the ME method here to fructose and 2-deoxyribose and checked the accuracy of the calculations for 2-deoxyribose against the high-level ab initio $r_e$ structure fully optimized at the CCSD(T) level. We show that the ME method allows determining a complete and reliable equilibrium structure for relatively large molecules, even when experimental rotational information includes a limited number of isotopologues. With a moderate computational cost the ME method could be applied to larger molecules, thereby improving the structural evidence for subtle orbital interactions such as the anomeric effect.


Introduction
The determination of accurate equilibrium structures for moderately large molecules remains a challenge, both from the experimental and theoretical points of view. 1 Structure optimization by high-level ab initio methods yields accurate structures, but it rapidly becomes too expensive as the size of the molecule increases. However, it is possible to obtain equilibrium structures more easily by using the semiexperimental (SE) method, which is generally considered the most accurate one for equilibrium structures ($r_e^{SE}$) of small molecules. 2-4 This method derives the equilibrium rotational constants from experimentally determined (effective) ground-state rotational constants and theoretical corrections based on an ab initio cubic force field. The most complex molecule for which the rotational spectroscopy method has been tested is the amino acid proline (C5H9NO2: 17 atoms, 45 degrees of freedom). 5 However, it was noticed for proline that the set of experimental rotational constants, although extensive, could not fix the molecular structure satisfactorily. This conclusion is quite general for molecules with many degrees of freedom because of the problem of statistical ill-conditioning. For this reason, ab initio constraints are required to analyze larger molecules. The drawback of these constraints is that they may induce systematic errors in the calculation, making it difficult to estimate the uncertainty of the resulting molecular structure.
Recently, the predicate-regression mixed estimation (ME) method 1,6,7 has proved successful in determining very accurate equilibrium structures for several medium-sized molecules. 8,9 In the ME method the structure fitting uses simultaneously equilibrium moments of inertia together with bond lengths and bond angles from medium-level quantum chemical calculations.
In this paper, we will demonstrate that it is possible to use this method for molecules larger than proline. We will first apply the ME method to the lowest-energy conformer of c-β-2-deoxy-D-ribopyranose-$^1C_4$-1 10 (Fig. 1, later abbreviated as deoxyribose), a 19-atom ($C_1$) molecule with 51 degrees of freedom. The validity of the method will be checked for this molecule against high-level CCSD(T) ab initio calculations. Then, we will apply the ME method to the lowest-energy conformer of cc-β-D-fructopyranose-$^2C_5$ 11,12 (Fig. 1, later abbreviated as fructose), a larger 24-atom ($C_1$) molecule with 66 degrees of freedom. Both molecules represent the largest molecular systems for which equilibrium structures have been determined so far.
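The degrees of freedom quoted for proline, deoxyribose and fructose are consistent with the usual count of 3N − 6 internal coordinates for a non-linear molecule of N atoms. A minimal sketch of that arithmetic follows; the function name is illustrative and not taken from the original work.

```python
def internal_degrees_of_freedom(n_atoms: int) -> int:
    """Number of internal (structural) coordinates of a non-linear molecule."""
    if n_atoms < 3:
        raise ValueError("the 3N - 6 rule needs at least three atoms")
    # 3N Cartesian coordinates minus 3 translations and 3 rotations
    return 3 * n_atoms - 6


# Atom counts as stated in the text
molecules = {"proline": 17, "deoxyribose": 19, "fructose": 24}
dof = {name: internal_degrees_of_freedom(n) for name, n in molecules.items()}
print(dof)
```

With only three rotational constants per isotopologue, the number of independent structural parameters quickly exceeds what a handful of isotopologues can determine, which is why additional information is needed for molecules of this size.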
Deoxyribose and fructose are representatives of 5/6-carbon-atom aldose/ketose monosaccharides, which make up carbohydrates. Carbohydrates constitute one of the most versatile biochemical constituents, playing important roles as energy resources, structural bio-scaffolds and signal transducers. 13 In particular, deoxyribose is notably present in nucleotides forming DNA, while fructose is commonly attached to glucose to form sucrose. Both molecules exhibit dominant pyranose (six-membered) ring structures in the solid, liquid and gas phases, in contrast with the furanose (five-membered) ring observed for deoxyribose in DNA and other biologically active molecules or fructose in sucrose. The solid-state structure is known for both compounds, 14,15 but there is no reliable gas-phase structure with which to assess the quality of the theoretical models used for other monosaccharides.
A final argument for selecting these target molecules is that the rotational spectra have been observed for both compounds. Thus, experimental moments of inertia are available for the application of the ME method. The detection of the rotational spectra for the sugars used supersonic-jet microwave spectroscopy combined with picosecond UV laser desorption. For deoxyribose the experiment detected six different pyranoside forms in the gas phase. 10 For the lowest-energy species the inertial data span the parent, all five monosubstituted $^{13}$C species and the endocyclic $^{18}$O species, which were observed in natural abundance. For fructose two pyranoside rotamers were detected and rotational data were available for the parent, all six monosubstituted $^{13}$C species and two singly deuterated species of the lowest-energy conformation. However, data for the important endocyclic $^{18}$O species were missing. 11,12

Experimental
Previous experiments on fructose 11,12 missed the detection of the endocyclic $^{18}$O6 isotopologue because it was too weak to be measured in natural abundance (ca. 0.2%). Since the coordinates of this ring atom are critical for the determination of the pyranose structure, we extended the rotational measurements to this species. For this purpose we used an enriched sample (>90%) of [$^{18}$O6]-D-fructose (Omicron Biochemicals, USA) that was pressed into a cylindrical pellet. The solid target was vaporized by pulsed picosecond UV (355 nm) laser desorption, and the jet-cooled microwave spectrum was recorded in the region 6-18 GHz. 11,12 Details of the Balle-Flygare-type Fourier transform microwave spectrometer (FT-MW) at the UPV-EHU have been reported before. 16 The experimental rotational frequencies are given in Table S1 (ESI †).

Computational
Different ab initio calculations were required for this work. The geometry optimizations were performed at the frozen-core (FC) and all-electron (AE) MP2 level 17 with the cc-pVTZ, cc-pVQZ, 18 cc-pwCVTZ 19 and 6-311+G(3df,2pd) 20 basis sets. Calculations were also performed with density functional theory (B3LYP) 21-23 using the 6-311+G(3df,2pd) basis set and with the coupled-cluster method with single and double excitations (CCSD-FC) 24 using the cc-pVTZ basis set. Moreover, the structure optimization for deoxyribose was possible at the level of the coupled-cluster method with a perturbative treatment of connected triples (CCSD(T)-FC) 25 using the cc-pVTZ basis set. In order to determine the rovibrational contributions for both molecules, the anharmonic force field up to semidiagonal quartic terms was calculated at the MP2-FC/cc-pVTZ level of theory. This calculation was repeated for each isotopologue, as different isotopes require distinct vibrational corrections. The MP2, B3LYP and CCSD calculations were performed with the Gaussian 09 package, 26 whereas the MolPro program 27,28 was used for the CCSD(T) calculations.

Results and discussion
It is well established that the quality of the structural fit is sensitive to the true accuracy of the ground-state rotational constants. 1,29,30 For this reason, we first redetermined these parameters with the method of predicate observations, combining the experimental rotational frequencies with quartic centrifugal distortion constants derived from the ab initio force field. 6,7 The uncertainty used for weighting the predicates was 10% of their value. The results are given in Tables S2 and S3 (ESI †) for deoxyribose and fructose, respectively. In order to obtain the semiexperimental equilibrium rotational constants, the experimental ground-state rotational constants were corrected using the vibration-rotation interaction constants ($\alpha_i$) derived from the ab initio MP2-FC/cc-pVTZ cubic force field. The derived rotational constants and the rovibrational corrections are given in Tables 1 and 2 for both molecules.
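As a sketch of the correction described here, the standard first-order relation for an asymmetric top, $B_e = B_0 + \frac{1}{2}\sum_i \alpha_i$ (and likewise for A and C), can be written as follows. The numerical inputs are hypothetical and are not the values of Tables 1 and 2; the conversion factor between a rotational constant and a moment of inertia is the commonly used approximate value, and the function names are illustrative.

```python
# Approximate conversion factor h / (8 pi^2), in MHz u Å^2
CONVERSION_MHZ_U_A2 = 505379.0


def equilibrium_constant(b0_mhz, alphas_mhz):
    """First-order semiexperimental constant: B_e = B_0 + (1/2) * sum(alpha_i)."""
    return b0_mhz + 0.5 * sum(alphas_mhz)


def moment_of_inertia(b_mhz):
    """Moment of inertia (u Å^2) corresponding to a rotational constant (MHz)."""
    return CONVERSION_MHZ_U_A2 / b_mhz


# Hypothetical ground-state constant and alpha values (MHz), for illustration only.
b0 = 1000.0
alphas = [2.0, -0.5, 1.5]

be = equilibrium_constant(b0, alphas)
print(be, moment_of_inertia(be))
```

In the real calculation the sum runs over all vibrational modes of each isotopologue, which is why the force field has to be evaluated separately for every isotopic species.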
The methodology used for determining the predicates was described before. 31 Briefly, the CH bond lengths are computed at the MP2-FC/cc-pVTZ level of theory. Due to a compensation of errors, they are usually very close to the accurate equilibrium values. The CC bond lengths are also calculated at the same level. When the double bond character is negligible, these values are also a good choice for the predicates. The CO bond lengths are calculated at the B3LYP/6-311+G(3df,2pd) level and a small correction is applied to the calculated value. 32 All these computed bond lengths are expected to have an accuracy of about 0.002 Å. The bond angles are first calculated at the MP2-FC level with the cc-pVTZ and 6-311+G(3df,2pd) basis sets with an expected accuracy of about 0.3-0.4°. From our previous work, it was found that the 6-311+G(3df,2pd) basis set gives slightly more accurate results. 9 This outcome is confirmed here by comparing with the Born-Oppenheimer equilibrium structure, $r_e^{BO}$ (alternatively named in the literature as best estimated ab initio or CCSD(T)-based structure), determined below. The median absolute deviation (MAD) is 0.18° with the cc-pVTZ basis set and 0.09° with the 6-311+G(3df,2pd) basis set. For the dihedral angles, the CCSD-FC/cc-pVTZ level was used because the MP2 method has sometimes been found inaccurate. 8,9,33,34 The estimated accuracy of the predicate dihedral angles is 0.7°. Comparison with the $r_e^{BO}$ structure confirms this value, the MAD being 0.51°. For the bond angles, the accuracy of the MP2 and CCSD methods is similar. However, when the CCSD values are used for the predicates of the bond angles, the standard deviation of the fits is slightly smaller. For this reason, the CCSD-FC/cc-pVTZ values were also used for the predicates of all angles, but this choice has a negligible effect on the values of the fitted parameters. Actually, for deoxyribose, the CCSD-FC/cc-pVTZ and MP2-FC/6-311+G(3df,2pd) levels have the same MAD when compared to the $r_e^{BO}$ structure. The structures calculated at these different levels of theory are given in Tables S4 and S5 (ESI †) for deoxyribose and fructose, respectively.
The ME method was applied in several steps. In the first step, the bond lengths and bond angles involving hydrogen atoms were held at their predicate values, while the parameters for the heavy atoms were fitted to the equilibrium rotational constants. This fit is a standard least-squares one. In the second step, a structure was fitted to both the equilibrium rotational constants and the full set of predicate values with their estimated uncertainties. This step leads to a considerable improvement in the accuracy of the structure. However, an inspection of the leverage values shows that they are close to unity for the predicates of many bond lengths, whereas they are distributed rather uniformly and are significantly below unity for the moments of inertia. It is obvious that the structural parameters of the hydrogen atoms (unsubstituted in most of the isotopologues) are almost exclusively determined by their predicate values. This outcome is not a problem because the predicates are expected to be accurate for these light atoms. To check that the predicates for the heavy atoms are compatible with the semiexperimental equilibrium moments of inertia, the uncertainties of the predicates for the heavy-atom bond lengths of deoxyribose were increased in a third step from 0.002 Å to 0.005 Å. This relaxation gives a fit compatible with the previous one, albeit with larger standard deviations (up to a factor of two) for some bond lengths. The results are given in Table 3 (Cartesian coordinates in Table S6, ESI †). The good agreement of the derived (non-fitted) parameters with their predicate values indicates that the fit is likely to be of good quality. The exception is the C5-O6 bond length, worsened by an unfavorable propagation of errors. However, this problem is easy to identify because, in this case, the derived value is far from its predicate. This situation can be explained by underweighted predicates relative to the moments of inertia, so the fitted parameters remain sensitive to inaccuracies in the moments of inertia. In this particular case, a careful analysis indicates that the problem is mainly due to the small a coordinate of atom C5, $a_{SE}$(C5) = −0.447(2) Å, to be compared to $a_{BO}$ = −0.430 Å. As a confirmation, an increase in the weight of the predicates increases the standard deviation of $a_{SE}$(C5). Furthermore, there are different ways to circumvent this difficulty, the simplest one being to use another set of fitted parameters including C5-O6. In that case, the fitted C5-O6 bond length is 1.428(2) Å.
Table 1 Ground-state and equilibrium rotational constants and rovibrational corrections for deoxyribose (c-β-2-deoxy-D-ribopyranose-$^1C_4$-1), all values in MHz
Table 2 Ground-state and equilibrium rotational constants and rovibrational corrections for fructose (cc-β-D-fructopyranose-$^2C_5$), all values in MHz
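The idea of fitting experimental data and predicates together, and of inspecting leverages afterwards, can be sketched with a deliberately small linear toy problem. This is an illustration under stated assumptions, not the fitting program used in this work: the real fit is nonlinear in the moments of inertia and is iterated, and all numbers and names below are hypothetical. Two bond lengths are constrained by one "experimental" observation of their sum plus one predicate per bond length; each datum is weighted by the inverse square of its uncertainty.

```python
def mixed_estimation_fit(rows):
    """Weighted least squares for two parameters.

    rows: list of (x1, x2, y, sigma) meaning x1*p1 + x2*p2 = y +/- sigma.
    Returns ((p1, p2), leverages), one leverage per row.
    """
    n11 = n12 = n22 = b1 = b2 = 0.0
    for x1, x2, y, sigma in rows:
        w = 1.0 / sigma ** 2          # weight = 1 / variance
        n11 += w * x1 * x1
        n12 += w * x1 * x2
        n22 += w * x2 * x2
        b1 += w * x1 * y
        b2 += w * x2 * y
    det = n11 * n22 - n12 * n12       # determinant of the 2x2 normal matrix
    i11, i12, i22 = n22 / det, -n12 / det, n11 / det
    p1 = i11 * b1 + i12 * b2
    p2 = i12 * b1 + i22 * b2
    # Leverage h_i = w_i * x_i^T N^{-1} x_i: how strongly datum i fixes its own fit
    leverages = [
        (x1 * x1 * i11 + 2.0 * x1 * x2 * i12 + x2 * x2 * i22) / sigma ** 2
        for x1, x2, _, sigma in rows
    ]
    return (p1, p2), leverages


# Hypothetical data (Å): one "experimental" sum and two predicates.
rows = [
    (1.0, 1.0, 2.946, 0.001),   # experimental-type datum: r1 + r2
    (1.0, 0.0, 1.520, 0.002),   # predicate for r1
    (0.0, 1.0, 1.420, 0.002),   # predicate for r2
]
params, leverages = mixed_estimation_fit(rows)
print(params, leverages)
```

Without the two predicate rows the normal matrix would be singular, which is the toy analogue of the ill-conditioning discussed in the Introduction. The leverages sum to the number of fitted parameters; a predicate for a parameter that the experimental data do not sense at all would have a leverage of one, as noted in the text for the hydrogen-atom parameters.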
To further check the accuracy of the equilibrium structure of deoxyribose, it was also calculated at the CCSD(T)-FC/cc-pVTZ level of theory. The small effect of further basis set enlargement (cc-pVTZ → cc-pVQZ) was then estimated at the MP2 level. The core-core and core-valence correlation correction was computed at the MP2 level using the cc-pwCVTZ basis set. The resulting $r_e^{BO}$ estimate was:
$$r_e^{BO} = r_e(\text{CCSD(T)-FC/cc-pVTZ}) + r_e(\text{MP2-FC/cc-pVQZ}) - r_e(\text{MP2-FC/cc-pVTZ}) + r_e(\text{MP2-AE/cc-pwCVTZ}) - r_e(\text{MP2-FC/cc-pwCVTZ})$$
The accuracy of the estimate in this equation, which is based on the CCSD(T) structure and additivity of small corrections, estimated at the less expensive MP2 level, has been confirmed many times; see, for instance, ref. 30 and 35-38.
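The additive estimate can be applied parameter by parameter. The following minimal sketch shows the arithmetic for a single bond length; the input values are hypothetical and the function name is illustrative, neither being taken from Table S4.

```python
def composite_r_bo(ccsdt_fc_tz, mp2_fc_qz, mp2_fc_tz, mp2_ae_wcvtz, mp2_fc_wcvtz):
    """Additive estimate of a Born-Oppenheimer equilibrium parameter.

    CCSD(T)-FC/cc-pVTZ value plus an MP2 basis-set correction (QZ - TZ)
    plus an MP2 core-correlation correction (AE - FC with cc-pwCVTZ).
    """
    basis_correction = mp2_fc_qz - mp2_fc_tz
    core_correction = mp2_ae_wcvtz - mp2_fc_wcvtz
    return ccsdt_fc_tz + basis_correction + core_correction


# Hypothetical bond lengths in Å, for illustration only.
r_bo = composite_r_bo(
    ccsdt_fc_tz=1.4100,
    mp2_fc_qz=1.4060,
    mp2_fc_tz=1.4085,
    mp2_ae_wcvtz=1.4050,
    mp2_fc_wcvtz=1.4075,
)
print(round(r_bo, 4))
```

The scheme relies on both corrections being small and approximately independent of the correlation treatment, so that they can be evaluated at the cheaper MP2 level.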
The results of the different theoretical calculations are given in Table S4 (ESI †), and the derived $r_e^{BO}$ structure is compared in Table 3 to the $r_e^{SE}$ structure. For the bond lengths the largest difference is 0.002 Å for the C1-O1 bond. The largest differences in the bond and dihedral angles are 0.56° for C2-C3-C4 and 0.90° for C1-C2-C3-C4. The standard deviations (calculated from the MAD) are 0.0011 Å, 0.17°, and 0.75° for bond lengths, angles and dihedrals, respectively. This calculation confirms that the uncertainties chosen for the predicates are correct and that the $r_e^{SE}$ structure is accurate. It has to be noted that for the angles C2-C3-C4 and C2-C3-C4-C5, the predicate values are closer to the $r_e^{BO}$ structure than to the $r_e^{SE}$ structure. This finding means that the small discrepancy is due to the semiexperimental rotational constants, not to the predicates.
The same procedure was used to calculate the semiexperimental structure of fructose. In the final fit the predicates for the bond distances connecting two substituted atoms in the set of experimental isotopologues were given a larger error of 0.005 Å instead of 0.002 Å. The predicates for the bond angles defined by three substituted atoms were given an error of 1.5° instead of 0.5°. Finally, the predicates for the torsional angles defined by four substituted atoms were given an error of 2.0° instead of 0.7°. This final fit is almost identical to the fit where the predicates have a larger weight. As a further check, the uncertainties of the predicates for the bond lengths of the heavy atoms have been increased by a factor of 1.5. Introducing this change decreases the leverages but has no significant effect on the values of the fitted parameters. This observation gives us confidence in the accuracy of the derived results. The final structural parameters are given in Table 4 (Cartesian coordinates in Table S7, ESI †).
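The standard deviations quoted for deoxyribose are stated to be calculated from the MAD. A common way of doing this, assumed here because the text does not give the formula, is to scale the median absolute deviation by the normal-consistency factor 1.4826. The residuals below are hypothetical and the function names are illustrative.

```python
from statistics import median

NORMAL_CONSISTENCY = 1.4826  # assumed scale factor for normally distributed errors


def mad(values):
    """Median absolute deviation about the median."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def robust_sigma(values):
    """Robust standard-deviation estimate derived from the MAD."""
    return NORMAL_CONSISTENCY * mad(values)


# Hypothetical differences between two sets of angles (degrees).
residuals = [0.1, -0.2, 0.3, 0.0, -0.1]
print(mad(residuals), robust_sigma(residuals))
```

The appeal of this estimator is that a single poorly determined parameter has little influence on it, unlike the ordinary root-mean-square deviation.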
The determined structures for the two sugars are regarded as highly accurate. The standard deviation of the fitted parameters is a reliable indicator of their precision provided that the weights were correctly chosen and systematic errors were insignificant. From the present analysis and from our previous work, 8,9 it is highly likely that the weights of the predicates have reasonably correct values. On the other hand, it is much more difficult to estimate the accuracy of the semiexperimental rotational constants. Furthermore, it is known that they are affected by a non-negligible systematic error. 39,40 For these reasons, a conservative estimate of the accuracy of the fitted parameters can be stated as 0.002 Å for the bond lengths, 0.2-0.6° for the bond angles, and 0.5-0.9° for the dihedral angles.
The empirical substitution structures ($r_s$) are also given in Tables 3 and 4 for comparison. As the range of the rovibrational corrections is quite small (0.37 MHz for A, 0.19 MHz for B, and 0.10 MHz for C, see Table 2 for fructose), the $r_s$ structure might be expected to be relatively accurate. Inspection of Tables 3 and 4 shows that this is not the case. This observation is confirmed by the examination of the Cartesian coordinates of fructose given in Table S8 (ESI †). This result is common for large molecules for which the isotopic shift of the rotational constants is generally small. We note that the results remain inaccurate even when the equilibrium rotational constants are used in the Kraitchman equations. 8,9,31
It is also instructive to examine the quality of the effective structure ($r_0$). In these molecules the number of ground-state rotational constants is not sufficient to determine a complete structure without multiple structural assumptions that render the results unreliable. On the other hand, using the same predicates as for the $r_e^{SE}$ fits, there is no difficulty in performing structural least-squares fits. The results are given in the last column of Tables 3 and 4. Obviously, the quality of the fits is only moderately good: the standard deviations of the fits and of the fitted parameters are about three times larger than in the $r_e^{SE}$ fits. Furthermore, an analysis of residuals shows that, contrary to the $r_e^{SE}$ fit, the predicates and the ground-state rotational constants are not fully compatible and the distances between the heavy atoms are rather inaccurate. Nevertheless, the angles, although not very precise, are in fair agreement with the $r_e^{SE}$ structures. In conclusion, the $r_0$ structure permits the determination of approximate values for the bond and dihedral angles. However, interest in these structures is limited because they are not much more accurate than the predicates.
Fig. 2 shows the deviations of semiexperimental and experimental parameters of the deoxyribose ring relative to the computed values, $r_e^{BO}$. It can be seen that there is an excellent agreement between the $r_e^{SE}$ and $r_e^{BO}$ structures, whereas the discrepancies between the $r_e^{BO}$ and experimental structures, both $r_s$ and previously determined $r_0$, 10 denoted as $r_0$(old), as well as between the $r_s$ and improved $r_0$ (denoted as $r_0$(new)) structures are very large. The ME method thus allows us to improve the fit of the experimental data and to considerably increase the accuracy of the experimental structure determination.
The accurate determination of the molecular structure allows us to obtain information on subtle electronic effects that are reflected in the molecular structure but are usually very difficult to notice, such as the anomeric effect. The anomeric effect is known to be present in both molecules: the hydroxy substituent on the anomeric carbon atom adjacent to the endocyclic oxygen atom prefers the axial orientation. 41 Furthermore, the anomeric CO bond length is shorter than the standard single bond length, which is 1.417 Å in methanol. 42 This parameter is 1.407 Å in deoxyribose and 1.410 Å in fructose. Finally, in the case of fructose, the C2-O6 bond adjacent to the anomeric C1-O1 bond is shorter (1.412 Å), whereas the O6-C6 bond is longer (1.426 Å). This result is in good agreement with the X-ray study of crystalline fructose 15 and the ab initio calculations on methoxymethanol by Jeffrey et al. 43 The structures of the title compounds are known to be further stabilized by intramolecular hydrogen bond networks.
There are many different ways to establish the existence of a hydrogen bond. 44,45 It may be defined on the basis of interaction geometries (short distances, fairly linear angles) or certain properties of the electron density distribution. Following the definition of Jeffrey 44 These results are in agreement with the conclusions about the low stability of the five-membered quasi-ring formed by a hydrogen bond due to an unfavorable geometry of this ring (in comparison to the six-membered quasi-ring); see, for example, ref. 46.
Using this criterion, two weak H···O hydrogen bonds are present in deoxyribose, and in fructose there are five weak hydrogen bonds (see Fig. 1 and Table 5). Another consequence of the hydrogen bond is that the r(D-H) bond is lengthened. Indeed, there is a correlation between r(D-H) and d(H···A), the correlation coefficient being −0.86. This observation is consistent with r(O-H) bond lengths being longer than in methanol (0.957 Å). 42 The d(H5···O4) distance in fructose is not determined accurately, and its value is likely to be too small; if this datum is eliminated, the correlation coefficient increases (in absolute value) to −0.93.
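The correlation coefficient referred to here is presumably the usual Pearson coefficient. The sketch below shows how such a value, and its change when one doubtful datum is dropped, could be computed. The distances are hypothetical and are not the values of Table 5.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical H...A distances and D-H bond lengths (Å), for illustration only.
d_HA = [2.10, 2.25, 2.40, 2.55, 2.30]
r_DH = [0.966, 0.964, 0.961, 0.959, 0.960]

r_all = pearson_r(d_HA, r_DH)                 # all hydrogen bonds
r_trimmed = pearson_r(d_HA[:-1], r_DH[:-1])   # last (doubtful) datum removed
print(r_all, r_trimmed)
```

A negative coefficient expresses that shorter H···A contacts go together with longer D-H bonds; removing an inaccurate point moves the coefficient closer to −1.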
Bader's quantum theory of atoms in molecules (AIM) is frequently used to prove the existence of a hydrogen bond. 47,48 According to this theory, the bond exists if there is a point with minimal electron density along the bond path. This point is called a (3, −1) bond critical point (BCP). For detection of BCPs in deoxyribose, the required wave functions were generated for optimized geometries at the MP2 and B3LYP levels of theory with the cc-pVTZ basis set. The molecular graphs were computed with the AIM2000 49,50 program package, but neither a BCP nor an associated ring critical point (RCP) could be found for the hydrogen bonds (see Fig. S1, ESI †). On the one hand, this might be explained by the fact that all the hydrogen bonds are weak. On the other hand, as has been noted by Deshmukh et al. 51,52 in studies of alkanediols and sugars, the AIM method sometimes conflicts with experimental data. The explanation of this phenomenon requires further investigation that is beyond the scope of the present study. We note that the stabilizing effects of the hydrogen bonds in fructose have recently been discussed from a theoretical point of view. 53
Most of the C-C bond lengths are only slightly shorter than the value found for ethane, 1.522 Å. 54 They are thus typical single bonds. 55 However, the C4-C5 bond in deoxyribose at 1.513 Å and the C5-C6 bond in fructose at 1.514 Å are rather short, as seems to be the rule in aldohexoses for bonds that involve a C atom next to the ring O atom. 15
Fig. 2 Histogram of absolute deviations of the $r_e^{SE}$, $r_s$, $r_0$ (old data 10) and $r_0$ (new data, present work) parameters relative to the $r_e^{BO}$ values for deoxyribose.

Conclusions
We have demonstrated that the mixed regression method is more suitable for the accurate determination of the equilibrium structure of a moderately large molecule than either the pure high-level ab initio methods or the classical semiexperimental method. Another typical example of the superiority of this method is the structure of tropinone (34 degrees of freedom). 31 The ME method combines two steps. First, high- or medium-level ab initio calculations furnish accurate values for the X-H bond lengths (X = C, N, O) and for bond angles, and more approximate values for the dihedral angles and for the distances between heavy atoms. Then, these data are supplemented by semiexperimental equilibrium rotational constants in a least-squares fit that allows us to check that the predicates are accurate and to improve their accuracy.
Further work on the ME method will be directed to larger molecular systems, exploiting the synergy between experimental high-resolution rotational data and quantum chemical calculations.