Open Access Article
R. E.
Skyner
a,
J. B. O.
Mitchell
*a and
C. R.
Groom
b
aSchool of Chemistry, University of St Andrews, Purdie Building, North Haugh, St Andrews, Fife KY16 9ST, UK. E-mail: jbom@st-andrews.ac.uk
bCambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, UK
First published on 19th December 2016
The abundance of crystal structures of solvated organic molecules reflects the common role of solvent in the crystallisation process. An understanding of solvation is therefore important for crystal engineering, with solvent choice often affecting polymorphism as well as influencing the crystal structure. Of particular importance is the role of water, and a number of approaches have previously been considered in the analysis of large datasets of organic hydrates. In this work we attempt to develop a method suitable for application to organic hydrate crystal structures, in order to better understand the distribution of water molecules in such systems. We present a model aimed at combining the distribution functions of multiple atom pairs from a number of crystal structures. From this, we can comment qualitatively on the average distribution of water in organic hydrates.
000 structures of organic and organometallic systems found to contain water in some form2,3 (2010).
The abundance of hydrates reflects the common role of solvent in the crystallisation process. An understanding of this is therefore of paramount importance for crystal engineering, with solvent choice often influencing the crystal structure and properties; either by formation of a solvate or hydrate, by directing the molecular conformation, or by favouring a particular crystal packing.
The systematic analysis of hydrates was, until recently, often confined to inorganic structures.4 Such investigations have been complemented by surveys of organic hydrates, which have served primarily as a tool for the classification of the role of water within the crystallisation process, and in overall structure. A commonly accepted classification system organises water sites within crystal structures into three main categories: isolated lattice sites, lattice channels and metal-ion coordinated water.5 Other survey studies have also considered the driving force for hydrate formation.6–8
A recent discussion9 considers novel coordination environments, specifically in relation to hydrates. Emphasis is placed on the abundance of hydrates within crystal structures, implying that any discussion of hydrates should first consult the CSD. It is suggested that existing work directed toward characterisation of water motifs adequately describes the variety of possible motifs to an appropriate standard of notation.10,11 This assumption is supported by the classification of apparently novel motifs by the authors' own classification system, and the classification of organic hydrates seems possible in the forms of either a three-category or a cluster-based approach. These methods of characterisation are commonly accepted and often cited within the literature.
van de Streek and Motherwell noted that “statistical surveys into the behaviour of hydrates are difficult due to the severe bias that is introduced at many levels8”, however there may be scope within similar surveying techniques for the building of predictive models. For example, Galek et al.12 have utilised data available in the CSD to develop statistical models for hydrogen-bond coordination behaviour (not limited to the study of hydrates). Their work describes the hydrogen bonding behaviour of over 70 unique atom types, and begins to make assessments of structural stability of hydrogen bonding environments in known crystal structures, showing potential for application of empirically or statistically derived models.
In this work we develop a method for the statistical analysis of organic hydrate crystal structures.
Our model combines the radial distribution functions (RDFs) of multiple atom pairs from numerous organic hydrate crystal structures. We also compare water oxygen (OW) and water hydrogen (HW) RDFs to the work of Soper.13 Soper evaluated neutron diffraction data for water and ice at a range of temperatures (220 K to 673 K) and pressures (up to 400 MPa) in the form of OO, OH and HH partial structure factors. Fourier transformation of these partial structure factors produces site–site RDFs. However, the presence of systematic uncertainties arising from diffraction experiments means that this transformation is not as intuitively straightforward as expected. Soper uses empirical potential structure refinement (EPSR) in order to fit a 3D computational water model as closely as possible to the pre-determined experimental structure factors, improving the reliability of the extracted RDFs. Preliminary comparison of our own data with all of Soper's water and ice functions showed that our functions fit best (from visual overlay) with ice at 220 K, and water at 298 K, both under ambient pressure. Thus, comparisons between these two models and our own RDF will be discussed in depth.
The profile of a liquid RDF differs greatly. The function represents an average of particle locations, conversely to the precise positions depicted in crystal structures. When a crystal melts to liquid, long-range order is lost, and at large distances there is an equal probability of finding a second particle in any shell of equal volume. However, at short distances close to the reference particle there may be some remaining order, a vestige of that found in the crystal phase. The nearest neighbours of the reference particle may still approximately occupy their original positions. Thus, it is often possible to identify an average sphere of nearest neighbours in the first and perhaps the second shell r1 and r2 from the reference particle.14
A useful description of the energetics of a solution can be extracted from the potential of mean force15 (PMF), which describes free energy changes of the system as a function of a coordinate or coordinates. A popular choice for the coordinate is the distance r, due to the simplicity of calculation.
For a given r between two molecules, the PMF describes an average over all orientations of the surrounding solvent molecules. RDFs are directly related to the PMF w2(r) by;
w(2)(r) = −kT ln g(r) |
The Helmholtz free energy A(r) can be expressed as;
A(r) = −kT ln g(r) + a |
The PMF can be used to describe the energetics of the whole system. An appropriate weighting scheme applied to empirically parameterised RDFs can then be utilised within computational algorithms for the simulation of systems in solution. This reduces the computational cost associated with explicit solvent models, whilst improving some of the inaccuracies that implicit solvation models suffer due to their inherent approximations. A theoretical example of how RDFs and PMFs could be applied to predictive models in the future is given in the Future Application section below.
The dataset for building of RDFs was obtained from a search for any structure containing water as an independent entity in the CSD (CSD version 5.34, 2013).16 Structures included in the dataset were selected with the following restrictions; 3D coordinates determined, R ≤ 0.05, not disordered, no errors, not polymeric, no powder structures, and only organic. All hydrogen positions were normalised according to the following criteria; C–H = 1.089 Å, N–H = 1.015 Å, O–H = 0.993 Å. The final dataset contained 5922 structures in total.
We developed a programmatic approach within MATLAB in order to automate the processing of the dataset, and to collate the results effectively for the building of RDFs.
The developed program's primary operation can be summarised as follows;
• Determine atom types according to AMBER forcefield definitions for a crystal structure .pdb file with Antechamber.17,18
• Apply all crystallographic algorithms necessary to produce symmetry equivalent atom positions and to expand the lattice by one unit cell in each direction.
• Sort all atoms for each structure into individual arrays.
• Move the structure coordinate system origin to a target atom nucleus position (either water oxygen or hydrogen).
• Convert to a spherical polar coordinate system.
• Calculate distance, azimuth and elevation for all atom pairs within a specified cut-off distance (15 Å).
• Repeat, moving origin for every target atom in the system.
• Save data as a MATLAB workspace for manipulation with further routines.
The libraries for all information relating to symmetry operations were developed from the existing Fortran library CrysFML,19 the Bilbao Crystallographic Server,20–22 and the International Tables.23 Routines for RDF calculations were developed from I.S.A.A.C.S24 and from Allen and Tildesley.25 Atom type assignment is performed as an external routine through Antechamber.17,18 Schematic representations of the atom types used in this study are shown in Fig. 1.
The identification of motifs (as defined by Infantes and Motherwell10) was conducted using the CSD-Materials module, available in the current release of Mercury.26 The selected motifs are represented in Fig. 2. The motifs can be separated into: infinite chains, discrete chains, discrete rings, and infinite tapes in one dimension.
![]() | ||
| Fig. 2 The 15 water motifs used in this work. The motifs can be separated into: infinite chains (C1, C2, C3, C4; the number represents the number of unique waters present before the motif is repeated), discrete chains (DC1, DC2, DC3, DC4; the number represents the number of contacts between waters in the chain), discrete rings (R3, R4, R5, R6; the number represents the number of waters in the ring), and infinite tapes in one dimension involving rings (T4(1), T4(2)6(2), T6(1); a number outside of brackets represents the number of waters in the ring motif, and a number inside of brackets represents the number of waters from this ring also involved in a neighbouring ring). Nomenclature taken, and figure adapted from Infantes and Motherwell.10 | ||
The search criteria for water motifs ignore specific hydrogen bonding interactions, and simply define a network by an O⋯O distance < sum vdW radii + 1 Å. Therefore, quantification of the intermolecular pair distances (HW⋯OW) is not directly possible from the search results themselves. In order to assess these interactions, the pair count histograms were selected from the original dataset, and a new RDF calculated for each motif.
It is important to determine whether the discrete features observable in the RDF are in fact noise, or signal. There are two possible scenarios: A) the features present are noise, due to an insufficient amount of data, meaning the distribution is not entirely representative of a smooth and average distribution within hydrates; B) the features present are signal, comprising a number of discrete peaks occurring due to the complexity of the water networks or motifs found in organic hydrates.
Fig. 3 (bottom) shows Soper's EPSR model for ice at 220 K parameterised from neutron diffraction data (red), and our RDF (original: dotted black line, smoothed function: blue) resulting from all water oxygen to water hydrogen pair distances found within our dataset (5922 structures). It can be seen that there is a shift of the first two observable peaks to higher values of r, and the absence of the third peak observable in Soper's function. The peaks and troughs of the RDF profile also occur at different values of g(r). This difference is highly relevant if the model data from our RDF data are to be applied to predictive models in the future, particularly in the conversion of RDFs to PMFs, as the logarithmic relationship between g(r) and w(r) means that a small change in free energy (a small multiple of kT) can correspond to a change in g(r) of an order of magnitude from its expected or most likely value. However, one structural feature unique to the Soper ice RDF, which doesn't occur in the Soper water RDF, also appears to be present in our RDF; namely, the presence of a small peak in the trough between the two large peaks representing the first and second hydration shells, between 2–3 Å.
Overlaying the OW⋯HW RDF with Soper's model of water (298 K) provides a better fit in terms of peak positions, as shown in Fig. 3 (top; original: dotted black line, smoothed function: blue). However, discrete features unique to the solid state of ice are not present in Soper's liquid water function.
If the RDF model is compared to this subtle peak in Soper's water model, it can be seen that the maxima of the peaks in its profile, although quite noisy, fit the shape of the water profile well. No smoothing function has been applied as part our own method, however Soper fitted his data to inherently smooth computational models of water and ice.
A visual comparison of the short-range interactions discussed above is also summarised in Fig. 3. In both images, we have applied the Savitzky–Golay smoothing algorithm28 to our data (shown as a blue line, with the original data as a black dotted line) simply for the purpose of producing this figure, in order to increase the signal-to-noise ratio without unduly distorting the original data. In the top image, we compare this to Soper's 298 K water model, and highlight three areas where our own RDF displays features that are not explained by the water model. Namely, a large shoulder on the right of the first interaction peak, at ∼2.15 Å, a smaller shoulder on the left of a second interaction peak, at ∼2.85 Å, and a third small but independent peak at ∼4.16 Å. We have also indicated peaks that are explained by the water RDF, as indicated by the blue and red arrows, highlighting the peaks in their respective plot colours.
In the bottom image, we compare our smoothed profile (blue) to Soper's ice RDF (red), and attempt to indicate sources for the unexplainable peaks from the ice profile, as indicated above. The first shoulder, indicated by the only black arrow in the bottom image, is not confidently explained by either of Soper's distributions, and is probably due to the broad distribution of data in the first solvation shell, and between the first solvation shell and the second solvation shell.
The overall shape of our profile correlates well to that of Soper's water profile. However, certain features present in Soper's ice RDF also appear in our RDF; i) a peak at 2.9 Å that becomes a shoulder on the peak at 3.3 Å when a smoothing algorithm is applied, corresponding to a similar feature of Soper's 220 K ice function, at 2.8 Å and ii) a peak at 4.1 Å, which is emphasised upon the application of a smoothing algorithm, corresponding to the third solvation shell, present in Soper's 220 K ice function at 3.8 Å. This suggests that some order found in a typical ice model is also present in the overall structure of water in organic hydrates. In liquid water, this order is lost, meaning that Soper's water model no longer contains these interactions. However, the peak positions in our RDF correspond more closely to those present in Soper's liquid water model than to the ice model.
The presence of peaks in similar positions to Soper's water function in our RDF may suggest that our data are most representative of systems at 298 K, implying that water networks within hydrates have similar interaction distances to liquid water. This may result from the measurement temperature of the original data; over half of the contributing structures (3659) were measured above 261 K. However, it could also be an indication of peak broadening in the RDF due to the diversity of structures within our dataset. Beyond the second solvation shell, the RDF appears to be noisy.
Additional consideration was given to the measurement temperature at which the crystallographic data were obtained. The data were separated into three 50 K temperature intervals, and one interval where the temperature was above 261 K. These intervals were chosen based upon the distribution of measurement temperatures across the whole dataset, with a large number of structures (over half of the dataset) being measured at ∼298 K. Next, the HW⋯OW RDFs were recalculated for each temperature interval. The resulting functions are shown in Fig. 4.
![]() | ||
| Fig. 4 The HW⋯OW RDFs for water, separated by temperature ranges, as indicated by the legend (bottom) with the functions stacked in order of increasing temperature. | ||
The positions of the peak maxima representative of the first and second solvation shells do not change, unlike the Soper functions. This is because of the normalisation of hydrogen bond lengths, done because hydrogen positions are notoriously difficult to assign in crystal structure solution and refinement. Unfortunately, this means that subtle differences in the data, reflecting the variation in lengths of covalent bonds to hydrogen, may occasionally be lost. However, it is unlikely that the data would be any more accurate or reliable should the hydrogen bond lengths not be normalised, and perhaps more errors would be incorporated into the data from unreliable bond lengths due to the unreliable assignment of hydrogen positions in the experimental data.
The only observable difference between the measurement temperature separated data are the values of g(r) at which the peak maxima occur, although there is no observable pattern to explain this. The number of contributing data were considered as a cause, but recalculating the functions with the same number of contributing structures for each temperature range produced similar results. The larger oscillations seen in the results at 211–260 K are due to there being fewer data in this range than in other intervals.
In order to determine whether discrete features at both short and long range were due to specific arrangements of water, further analyses of specific motifs were carried out.
We observe a better fit of the long-range pair distances to Soper's water model in comparison to the ice model. However, there is still a considerable amount of ‘noise’ present at long-range distances. This was investigated further by the overlay of the RDF with an RDF (calculated in I.S.A.A.C.S24) for Bernal's hexagonal ice structure.27 However, statistical analysis of the long range pair distances (>4 Å) for both of the Soper functions and also for the hexagonal ice function (Table 1) showed that the profile of water (298 K) fits best, followed by ice (220 K) and finally hexagonal ice.
| Hexagonal ice | Water (298 K) | Ice (220 K) | |
|---|---|---|---|
| RMSE | 8.7 | 0.57 | 0.62 |
| ln(L) | −640 | −154 | −170 |
| AIC | 1287 | 314 | 345 |
| BIC | 1297 | 324 | 355 |
Log of Likelihood (ln(L)), the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) were used as statistical measures for goodness of fit (GOF). The AIC is a measure that aims to select the best approximating model from a group of non-linear models.29 Given a collection of models for the data, the AIC estimates the quality of each model, relative to all of the models being tested. It offers a relative estimate of the information lost when a model is used to mimic the process that generates the data. AIC is calculated by;
AIC = 2p − ln(L) |
BIC = p(ln(N)) − 2 ln(L) |
| Motif type | Motif | Frequency (%) | Number of structures | RMSE | R 2 | ln(L) | AIC | BIC | Rank |
|---|---|---|---|---|---|---|---|---|---|
| Infinite chain | C1 | 2.9 | 169 | 2.0 | 0.99 | −361 | 727 | 736 | 13 |
| C2 | 3.9 | 229 | 1.6 | 0.99 | −325 | 655 | 665 | 12 | |
| C3 | 1.8 | 106 | 1.0 | 1.00 | −249 | 505 | 514 | 6 | |
| C4 | 1.9 | 112 | 1.3 | 0.99 | −285 | 576 | 585 | 10 | |
| Discrete chain | DC1 | 10.5 | 623 | 0.1 | 1.00 | 213 | −420 | −410 | 1 |
| DC2 | 2.8 | 164 | 0.4 | 1.00 | −77 | 160 | 169 | 2 | |
| DC3 | 2.6 | 155 | 1.0 | 1.00 | −236 | 478 | 487 | 5 | |
| DC4 | 1.5 | 89 | 1.1 | 0.99 | −252 | 511 | 520 | 7 | |
| Discrete ring | R3 | 0.4 | 24 | 2.3 | 0.98 | −382 | 770 | 780 | 15 |
| R4 | 3.1 | 184 | 1.2 | 0.99 | −273 | 551 | 560 | 9 | |
| R5 | 0.8 | 49 | 1.2 | 0.99 | −265 | 537 | 546 | 8 | |
| R6 | 1.7 | 103 | 1.4 | 0.99 | −296 | 597 | 607 | 11 | |
| Infinite tapes | T4(1) | 0.2 | 13 | 0.6 | 1.00 | −163 | 332 | 341 | 3 |
| T4(2)6(2) | 0.6 | 33 | 0.9 | 1.00 | −229 | 463 | 473 | 4 | |
| T6(1) | 0.2 | 11 | 2.3 | 0.98 | −379 | 764 | 774 | 14 |
The purpose of recalculating RDFs for specific water motifs was to identify whether discrete features within the overall HW⋯OW RDF could be specific to a particular arrangement of water in organic hydrates observable in RDF plots. Initial analysis of the likelihood of this was performed by a simple overlay of each recalculated motif RDF with the original HW⋯OW RDF. It was found that peaks unique to the profile of particular motifs were also distinctly present in the original function. An example of this is shown in Fig. 5.
In order to quantify the likelihood of these distinct features correlating to the features present in the original RDF (omitting r < 1.6 Å), a statistical analysis of the goodness-of-fit (GOF) of each motif to the original RDF was conducted. The results of this analysis are shown in Table 1. The following statistical measures were employed; root mean squared error (RMSE), R2, ln(L), the AIC, and BIC. Here, we treat the original RDF as the ‘true’ model, and the motif RDFs as approximating models.
From the results of AIC and BIC analysis, the GOF for each motif was ranked (the same ranking applies for both AIC and BIC), as shown in Table 1. It was found that the DC1 motif fitted most closely with the overall RDF. It might be expected that this would be the case, as DC1 motifs appear most frequently in our original dataset. However, a regression of the AIC and BIC scores against the frequency of occurrence for all motifs found no correlation to suggest this.
Comparison of the most prominent peak positions for each atom type with OW vs. each atom type with HW identifies whether, on average, the atom type is in closer proximity to the OW or HW of water. Comparison of the relative values of g(r) also gives an indication of which atom types are most likely to be in close proximity to water.
A comparison of the profiles of the CC and the CK atom types (Fig. 8) gives an example of how using a sophisticated atom-typing algorithm may offer an advantage over using traditional element labels. Both atom types represent a carbon adjacent to a nitrogen in a five-membered ring. The CC atom type can have any substituent, whereas the CK atom type has a hydrogen substituent (see Fig. 1). The first immediate difference between the CC and CK RDFs is the overall likelihood of finding carbon to water pairs. The addition of a non-hydrogen substituent (i.e. in the CK RDF) produces a significant peak for CK⋯HW pairs that is not present in the CC⋯HW profile (r = 2.95 Å, g(r) = 2.63), as indicated by the peak highlighted in Fig. 8. This difference may seem intrinsic; however these results exemplify how the atom-typing method is able to describe the major differences in water distribution introduced in the average case of substituent changes. This again corroborates the postulate that atom typing algorithms are useful in a quantitative survey of hydrate distributions, as conventional atom labels based on atomic number alone would not have identified this change in distribution.
![]() | ||
| Fig. 8 CC⋯HW RDF (blue) and CK⋯HW RDF (red) with a much larger peak apparent at ∼3 Å in the CK⋯HW profile (outlined in purple). | ||
Where substituent effects are not considered, there is little more to be learned from the RDFs of carbon atom types, as the distribution of water around such atoms is expectedly broad, and does not show significant patterns which cannot be observed within the RDFs describing substituent atoms of terminal ligands.
The RDF profile for the OH (Fig. 11) atom type, representing alcohol oxygen, differs somewhat from the O and O2 atom types, reflecting the ability of an alcohol group to participate in both H-bond donor and acceptor interactions with water.
The first obvious difference in the OH RDF occurs for OH⋯HW pairs, where a definite intermolecular interaction is represented by a sharp and narrow peak. This peak represents the alcohol oxygen participating in H-bond acceptor behaviour, O⋯HW. Two further peaks are also present at r similar to those found in the O and O2⋯HW pair RDFs (r = 1.86 Å and 3.21 Å). These peaks are increasingly broadened, suggesting less definite positions and orientations of water as r increases. A high g(r) value peak occurs in the OH⋯OW RDF at r = 2.81 Å, which is the same r for the highest peak found in the O2⋯OW RDF, suggesting a similar mode of interaction.
Interestingly, for the OS atom type, the largest peak in the RDF for HW is found at a distance (∼4.6 Å) not indicative of hydrogen bond formation. The OS atom type represents an ether or ester oxygen. It is known that there few examples of ester hydrogen bonding in the CSD.30 A study31 into ether and ester hydrogen bond formation found that ester oxygen hardly participates in hydrogen bonding. For (E)-esters, this is because of competition with the adjacent carbonyl group. For (Z)-esters, this is because of destabilisation due to a repulsive electrostatic interaction by the carbonyl group. Ethers were found to form hydrogen bonds at longer distances than expected, suggesting the bond is readily elongated by competing interactions.
Only two of the nine investigated hydrogen atom types showed profiles with distinct narrow peaks; H, representing hydrogen in an amide or imino group, and HO, representing hydroxyl hydrogen. Both profiles indicate distinct H⋯OW pairs for interactions, characterised by the appearance of a peak in the hydrogen HW RDF before a hydrogen OW peak.
Consider the following expression for HFE, as given by the RISM equations, employing a hypernetted chain closure (RISM-HNC);34
An appropriate weighting scheme, such as the atomic contribution to the solvent accessible surface area (SASA), can be applied to estimate the contribution of the RDF per atom to a total function g(r). It is worth noting that using the RDFs we have described here assumes that the solvent structure in solution is analogous to that in hydrate structures. This estimated distribution function could be used to solve the HNC closure to find the total and direct correlation functions. This could be implemented with either the typical Lennard-Jones type intermolecular pair potentials, or the PMF calculated from a SASA-weighted g(r). If the PMF is used for the pair potential, the energy expression simplifies to;
Footnote |
| † Electronic supplementary information (ESI) available: All MATLAB scripts used to calculate the RDFs described, and the resulting RDFs, a list of refcodes for the crystal structures used, and a list of their measurement temperatures. See DOI: 10.1039/c6ce02119k |
| This journal is © The Royal Society of Chemistry 2017 |