Jacco van de Streek* and Sam Motherwell
Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, UK. E-mail: jaccovandestreek@yahoo.co.uk
First published on 20th October 2006
In this paper we describe several new pieces of software that allow the Cambridge Structural Database (CSD) to be searched for solvated and unsolvated crystal structures. One program finds all pairs of solvated and unsolvated crystal structures of the same chemical compound for a given solvent; another program finds all crystal structures that were grown from a particular solvent. In addition, an algorithm was implemented to determine the stereochemistry of a crystal structure from its 3-D atomic coordinates. Results for water as the solvent molecule are presented. It is found that about 25% of all crystal structures grown from water form hydrates. If a chemical compound is soluble in water, it seems almost impossible to predict if it will form a hydrate or not, although presence of both donors and acceptors in the host compound appears to be required. Chiral molecules produce significantly more hydrates, presumably including water to achieve close packing in spite of the lack of inversion centres.
In previous studies, Görbitz and Hersleth quantified the number of solvent-containing CSD entries.2 Unsurprisingly, water was found to be the most abundant solvent in organic crystal structures, water being incorporated 8 times more often than the second most abundant solvent, methanol. Nangia and Desiraju3 took this one step further and attempted to include the frequencies with which the solvents had been used for crystal growth. The amount of work involved necessarily limited the number of publications from which these frequencies were derived; water was not considered in their study. Gillon et al.4 and Infantes and Motherwell5–7 used the CSD to study the environment of water molecules in hydrates. As early as 1991, Desiraju8 attempted to rationalise hydrate formation in terms of the presence of hydrogen bonding groups. Desiraju concluded, based on a random sample of hydrates from the CSD, that hydrate formation occurs when the number of hydrogen-bond acceptors outnumbers the number of donors, i.e. when the donor/acceptor ratio is small. This rationale is based on the Etter hydrogen bonding rule9 that “all good proton donors and acceptors are used in hydrogen bonding”. When a molecule shows an imbalance of strong donors and acceptors there may be some unsatisfied donors and acceptors, and the addition of water to the crystal structure reduces the imbalance. Unfortunately, a comparison against a control sample of anhydrated structures was not included. Infantes et al.10 used a QSAR-type approach to predict hydrate formation based on an analysis of the CSD.
Although the CSD is complete as far as coverage of the published literature is concerned, the type of data that can be extracted from the CSD is limited by the options available in the interface of the search program provided (ConQuest11). In the current paper, we will describe several in-house pieces of software that can be used to search the CSD for solvated and unsolvated crystal structures in novel ways.
A computer program was developed to find those chemical compounds for which both a solvated and an unsolvated form exist, allowing us to investigate the role the solvent molecules play by analysing the differences between the solvated and the unsolvated crystal structures.
An unbiased way of analysing unsolvated crystal structures is to only include those crystal structures that were grown in the presence of the solvent of interest. A program was written for this purpose.
The CSD does not store information on the stereochemistry of the entries, and an algorithm was implemented to derive this information from the 3-D atomic coordinates.
All this software can be used with any solvent molecule, but the special case of water as solvent molecule is presented here because of its wide-ranging importance in industry, in particular the pharmaceutical industry; furthermore, its abundance allows for better statistics. A more comprehensive survey covering 51 solvents (including water) has been conducted, for which the manuscript is in preparation.
Fig. 1 An example of a hydrate versus anhydrate pair: the 2-D connectivities of CSD entries ELEVOG and UNOGOT are shown. There must be a 1-to-1 match between all molecules (other than the water molecule); in this case, the methanol was required to be present in both structures.
The program first locates all CSD entries that contain the solvent molecule. In the second step, the solvent molecule is removed from each of these entries, and the CSD is searched for the chemical compound that remains. There are no restrictions on the complexity of the chemical compound that is left: counter ions, co-crystals and any remaining solvent molecules are all included in the comparison (see Fig. 1). The number of pair-wise comparisons in this second step is potentially huge: for water, the most abundant solvent in the CSD, this number is 11 billion for the November 2005 release. Carrying out this many comparisons of 2-D connectivities would be prohibitively slow, while generating a successful hit for only a small fraction of these comparisons. Therefore, an intermediate screening step that is much faster but less reliable is used to reduce the number of comparisons. During the screening step, a 2-D connectivity is represented by its topological index: a close to unique single number that attempts to encapsulate the full 2-D connectivity.12,13 Comparison of full 2-D connectivities is only necessary for those pairs that matched during the screening step; for water, this reduces the number of comparisons from 11 billion to four thousand.
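The two-stage search described above can be sketched as follows. This is only an illustration under stated assumptions: the toy graph representation, the `cheap_index` key (a sorted list of element/degree pairs) and the brute-force `same_connectivity` check are hypothetical stand-ins, not the topological index or the matching code used by the Solvates program.

```python
from collections import defaultdict
from itertools import permutations

# Assumed toy representation of a 2-D connectivity: (elements, bonds), where
# elements is a tuple of element symbols and bonds is a set of
# frozenset({i, j}) atom-index pairs. This is not the CSD's own format.


def cheap_index(elements, bonds):
    """Hypothetical screening key: the sorted (element, degree) pairs.

    Identical connectivities always share a key; different connectivities
    usually, but not always, have different keys.
    """
    degree = [0] * len(elements)
    for bond in bonds:
        for atom in bond:
            degree[atom] += 1
    return tuple(sorted(zip(elements, degree)))


def same_connectivity(a, b):
    """Exact but slow comparison by brute force (tiny graphs only)."""
    (elems_a, bonds_a), (elems_b, bonds_b) = a, b
    if len(elems_a) != len(elems_b) or len(bonds_a) != len(bonds_b):
        return False
    n = len(elems_a)
    for perm in permutations(range(n)):
        if all(elems_a[i] == elems_b[perm[i]] for i in range(n)):
            mapped = {frozenset(perm[i] for i in bond) for bond in bonds_a}
            if mapped == bonds_b:
                return True
    return False


def find_pairs(stripped_solvates, candidates):
    """Return (solvate_id, candidate_id) pairs with identical connectivity."""
    buckets = defaultdict(list)
    for cid, graph in candidates.items():
        buckets[cheap_index(*graph)].append(cid)
    pairs = []
    for sid, graph in stripped_solvates.items():
        # Screening step: only candidates sharing the cheap key are examined.
        for cid in buckets.get(cheap_index(*graph), []):
            # Full comparison, needed because the key is not strictly unique.
            if same_connectivity(graph, candidates[cid]):
                pairs.append((sid, cid))
    return pairs
```

The design point is that the key is computed once per structure, so the expensive comparison runs only within buckets rather than over every possible pair.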
The 2-D connectivities in the CSD do not store stereochemical information, and pairs of diastereomers would be included in the list, as would enantiomerically pure compounds versus their racemates. This is undesirable, as stereochemically different compounds are chemically different species, and the Solvates program therefore contains an algorithm to check the stereochemical consistency of the hits based on the 3-D atomic coordinates and space group stored in the CSD. The algorithm is restricted to stereocentres with four neighbours, but space-group symmetry and the presence of enantiomers in the asymmetric unit are taken into account. In practice, this means that the algorithm performs poorly for organometallic coordination compounds, but is satisfactory for most organic compounds.
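The paper does not spell out how the handedness of a stereocentre is obtained from the coordinates. One common approach, shown here purely as an assumed sketch, is the sign of the scalar triple product of the four neighbour positions taken in a fixed priority order. The function names and the requirement that the caller supplies a consistent neighbour ordering are assumptions of this example.

```python
def signed_volume(a, b, c, d):
    """Scalar triple product (a-d) . ((b-d) x (c-d)) for four 3-D points."""
    ax, ay, az = (a[i] - d[i] for i in range(3))
    bx, by, bz = (b[i] - d[i] for i in range(3))
    cx, cy, cz = (c[i] - d[i] for i in range(3))
    return (ax * (by * cz - bz * cy)
            - ay * (bx * cz - bz * cx)
            + az * (bx * cy - by * cx))


def handedness(neighbours, tol=1e-6):
    """Return +1 or -1 for a four-neighbour centre, 0 if (nearly) planar.

    The four neighbours must be listed in the same priority order for every
    structure that is compared; that ordering is assumed to be given.
    """
    volume = signed_volume(*neighbours)
    if abs(volume) < tol:
        return 0
    return 1 if volume > 0 else -1


def invert(points):
    """Apply an inversion through the origin, as an inversion centre would."""
    return [tuple(-x for x in p) for p in points]
```

Because an inversion centre (or any other improper symmetry operation) flips the sign, a space group containing such operations implies that both hands are present in the crystal, which is why the space group has to be considered alongside the coordinates.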
CSD entries containing different amounts of the same solvent, e.g. paracetamol monohydrate (CSD refcode HUMJEE) and paracetamol dihydrate (WAFNAT), are both included as hydrates of paracetamol anhydrate (HXACAN), but the dihydrate is not considered to be a hydrated form of the monohydrate.
The result is a list of CSD refcodes of hydrate–anhydrate pairs (Fig. 2); an example of the 3-D structure of a pair is given in Fig. 3. The sample of molecules covered shows a very wide range of chemical diversity; some examples are metal complexes such as (adeninato-N9)-methyl-mercury(II) (ADMEHH/CUYCUU), pharmaceutical organic molecules, e.g. olanzapine (AQOMAU/UNOGIN), salts of organic molecules, e.g. L-arginine hydrochloride (ARGHCL10/LARGIN02), very small molecules, e.g. ammonium hydrogen oxalate (AHOXLH/MOYHAJ), and very large molecules, e.g. enniatin B (BICMEF/DESYIJ). We note also that the chemical diversity covers a wide range of hydrogen-bonding ability, which is studied in detail below. A final point to note is that it is quite common for there to be incomplete sets of hydrogen coordinates, particularly for water.
Fig. 2 Start of the list of CSD structures as pairs of reference codes for water as the solvent molecule. The first column identifies the compound with water present, the second is the corresponding unsolvated compound.
Fig. 3 Example of a structural pair: CSD refcode UNOGIN (left) is the anhydrous crystal, AQOMAU (right) is a dihydrate. Hydrogen bonds are shown as dotted lines. There are three unsatisfied acceptors in the anhydrate, two in the hydrate.
The Solvates program also writes out those solvated entries for which no corresponding unsolvated crystal structure could be found in the CSD, i.e. a list of “solvates-only” structures.
This suggests the preparation of a third list: a list of unsolvated structures for which no solvate exists. In order to ensure that the absence of water in the crystal is not due to water not having been available during the crystal growth process, the CSD was searched for all entries for which it was explicitly recorded that the crystal was grown in the presence of water. Information on the solvent from which the crystal was grown is available for about 80 000 CSD entries, and an in-house computer program was written to find those CSD entries where the solvent description field included at least one of the words “water”, “H2O”, “deuteriumoxide”, “aqueous” or “dilute” (as in “dilute hydrochloric acid”). Extensive manual examination of the results indicates that these search terms, though leaving some room for false hits and missed hits, give a fairly accurate list of (published) crystal structures grown in the presence of water.
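The keyword filter on the solvent description field can be sketched as below. The search terms are those quoted in the text; the case-insensitive substring matching and the function name `grown_from_water` are assumptions of this example, since the matching rules of the in-house program are not given.

```python
import re

# Search terms as listed in the text.
WATER_TERMS = ("water", "H2O", "deuteriumoxide", "aqueous", "dilute")

# Assumption: case-insensitive substring match on the free-text field.
_PATTERN = re.compile("|".join(re.escape(t) for t in WATER_TERMS), re.IGNORECASE)


def grown_from_water(solvent_field):
    """True if the solvent description mentions any of the water terms."""
    return bool(solvent_field) and _PATTERN.search(solvent_field) is not None
```

A filter of this kind admits false hits (for example "dilute" applied to a non-aqueous solution) and misses unusual spellings, which is consistent with the need for the manual examination mentioned above.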
Readers interested in the functionality that the Solvates program provides should contact the Cambridge Crystallographic Data Centre.
Organometallic compounds are difficult to process for water because it is often not clear if the water is a free solvent or a metal-coordinating ligand; the six-coordinated metal centres also present problems for the stereochemistry-comparison code. CSD entries containing a metal were therefore removed from the set.
Fig. 4 shows the effect of the preprocessing step on the list from Fig. 2: the number of pairs of structures was reduced from 2602 to 374. In Fig. 4, JAYPUU and JAYPUU01 are polymorphs. AQOMAU and AQOMAU01 are two polymorphs of the dihydrate of UNOGIN, whereas AQOMEY is the trihydrate of UNOGIN.
Fig. 4 As Fig. 2, after preprocessing to eliminate questionable structures, redeterminations and organometallics.
We use the concept of the number of possible donors, NPD, and the number of possible acceptors, NPA, per structure. The NPD is simply defined as the number of hydrogen atoms on a group; C–NH2 is counted as NPD = 2, for instance. Each O, N, S, F, Cl, Br and I atom is counted as a possible acceptor, NPA = 1, irrespective of its chemical context. The only exceptions are (positively) charged N, planar N with three connections and halogens covalently bonded to carbon, which never accept and have NPA = 0. Thus, a water molecule, H2O, has NPD = 2, NPA = 1, COOH has NPD = 1, NPA = 2, and C–NH3+ has NPD = 3, NPA = 0. This pays no attention to the observed frequency of hydrogen-bond donor or acceptor bonds: it is a measure of the possibility for a donor or acceptor bond.
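The counting rules can be written down compactly. The sketch below is an illustration only: the `Atom` record and its fields are hypothetical, and it assumes that hydrogens bonded to carbon are not counted as donors, which the text implies through its examples but does not state.

```python
from dataclasses import dataclass

ACCEPTOR_ELEMENTS = {"O", "N", "S", "F", "Cl", "Br", "I"}
HALOGENS = {"F", "Cl", "Br", "I"}


@dataclass
class Atom:
    """Hypothetical per-atom record used only for this example."""
    element: str
    n_hydrogens: int = 0                  # hydrogens attached to this atom
    charge: int = 0                       # formal charge
    planar_three_connected: bool = False  # relevant for N only
    bonded_to_carbon: bool = False        # relevant for halogens only


def npd(atoms):
    """Number of possible donors: hydrogens on non-carbon atoms (assumption)."""
    return sum(a.n_hydrogens for a in atoms if a.element != "C")


def npa(atoms):
    """Number of possible acceptors, applying the three exceptions in the text."""
    total = 0
    for a in atoms:
        if a.element not in ACCEPTOR_ELEMENTS:
            continue
        if a.element == "N" and (a.charge > 0 or a.planar_three_connected):
            continue  # charged or planar three-connected N never accepts
        if a.element in HALOGENS and a.bonded_to_carbon:
            continue  # halogen covalently bonded to carbon never accepts
        total += 1
    return total
```

Applied to the examples in the text, this gives NPD = 2, NPA = 1 for water, NPD = 1, NPA = 2 for COOH and NPD = 3, NPA = 0 for C–NH3+.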
CSD reference code lists, the defined functional groups, and the RPluto command script are given in the ESI.†
These files were analysed with a program CONTAN, written for this study, to calculate various statistics for each CSD data sample. This program creates statistics on the counts of each group, and the average number of contacts for each group. This takes into account that some groups such as SO₄²⁻ and COOH consist of more than one atom. There are also facilities to select entries to form statistics for sub-sets, selected on groups with specified occurrence per compound, e.g. exactly 2 H2O, or 1 NHR3+, 1 Cl– and 1 to 10 water molecules. One is also able to select structures for simple group contact environments, e.g. OH2⋯Cl–⋯H2O, or H2O with 3 contacts to H2O, CONH and CONH.
First, density depends on the atomic weights of the atoms in the unit cell, and adding a light molecule such as water to, say, a bromine-containing compound always reduces the density of that crystal structure, even if the water allows the molecules to pack more closely. Therefore, rather than working with densities, we must use molecular volumes, and in order to isolate the contribution of the main molecule in the hydrated structure, the contribution of the water molecules must be subtracted.
Second, our pairs of crystal structures were not necessarily determined at the same temperature, which has to be corrected for.
The volume of a water molecule in the solid state at room temperature, $V_{\mathrm{mol}}^{\mathrm{est}}(\mathrm{water}, 298\,\mathrm{K})$, can be calculated to be 21.55 Å³ using the average atomic volumes derived from CSD data by Hofmann.15 Beaucamp et al.16 derived a volume of 22.951 Å³ for a molecule of H2O from CSD data, but this slightly bigger volume already implicitly incorporates the hydrogen-bonding effect that we are trying to measure and we therefore prefer to use Hofmann's unbiased values. Hofmann's atomic volumes were derived such that their sum equals the unit-cell volume, i.e. the empty space present when packing spheres has been absorbed into the atomic volumes. Hofmann also determined the temperature dependence of the volumes, which allows us to correct for temperature differences. We can now quantify the close-packing of the main molecule, corrected for temperature and for the presence of waters, as follows:

$$V_{\mathrm{mol}}^{\mathrm{est}}(\mathrm{compound}, 298\,\mathrm{K}) = \frac{V_{\mathrm{cell}}^{\mathrm{obs}}(T_{\mathrm{exp}})\,\bigl(1.0 - 0.95 \times 10^{-4}\,\mathrm{K}^{-1}\,(T_{\mathrm{exp}} - 298\,\mathrm{K})\bigr) - V_{\mathrm{mol}}^{\mathrm{est}}(\mathrm{water}, 298\,\mathrm{K})\,N_{\mathrm{water}}}{Z} \qquad (1)$$
![]() | (2) |
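As a worked illustration of eqn (1), the sketch below reads the temperature term as a linear correction of the observed cell volume to 298 K and takes the water count as the number of water molecules in the unit cell, since their volume is subtracted before dividing by Z. Both readings, and the function and parameter names, are assumptions of this example.

```python
V_WATER_298K = 21.55   # Å^3, volume of one water molecule from Hofmann's atomic volumes
ALPHA = 0.95e-4        # K^-1, thermal expansion coefficient appearing in eqn (1)


def v_mol_est(v_cell_obs, t_exp, n_water, z,
              v_water=V_WATER_298K, alpha=ALPHA):
    """Estimated volume of the main molecule at 298 K, in Å^3.

    v_cell_obs: observed unit-cell volume (Å^3) at temperature t_exp (K)
    n_water:    number of water molecules in the unit cell (assumed meaning)
    z:          number of formula units in the unit cell
    """
    v_cell_298 = v_cell_obs * (1.0 - alpha * (t_exp - 298.0))
    return (v_cell_298 - v_water * n_water) / z
```

For example, a hypothetical room-temperature cell of 1000 Å³ with Z = 4 and four waters gives (1000 − 4 × 21.55)/4 = 228.45 Å³ for the main molecule.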
Before preprocessing, some of the changes in the molecular volume between the hydrate and the anhydrate were rather large, and some of these outliers were examined manually. For several of these pairs, the unit-cell volumes of the hydrate and the anhydrate were virtually identical, as were the crystal structures except for the presence of the water molecules. In several of these cases the R-factor of the “anhydrate” was >10%, and the R-factor was always higher than for the corresponding hydrate structure. We therefore suspect that most of these “anhydrates” are in fact hydrates, and that the water was missed when the structure was determined. The preprocessing described above removes most of these outliers.
Chiral molecules can be expected to be biased towards biologically relevant compounds, i.e. those compounds that either were obtained from a natural source or show biological activity. Because biologically relevant molecules tend to contain higher than average numbers of oxygen and nitrogen and tend to be water soluble, they would bias the sample. These biologically relevant compounds are flagged as such in the CSD, and a simple ConQuest search sufficed to remove these compounds from the sample.
Sample | Number of entries left after preprocessing | Number of entries used for analysis with RPluto |
---|---|---|
Anhydrates whole CSD | 93 440 | 5146 |
Anhydrates from water | 1123 | 1093 |
Hydrates-of-anhydrate | 374 | 364 |
Anhydrates-of-hydrate | 374 | 372 |
Hydrates-only | 5267 | 5232 |
Out of all 100 864 organic-only entries in the preprocessed subset of the CSD, 5972 are hydrates, which is 5.9%. There were 40 000 entries grown from water in the CSD; filtering out disordered or suspicious structures, duplicates and organometallics left 1440 structures. 1123 of these were anhydrates, 317 were hydrates; thus, when growing a crystal from water, there appears to be a 22% chance of forming a hydrate. This figure is probably a slight underestimate, because disordered structures are eliminated in the preprocessing step and solvated structures are more often disordered; the real figure is therefore probably closer to 25%.
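The percentages quoted in this paragraph follow directly from the counts; a short check, using only the numbers given above (the variable names are illustrative):

```python
organic_entries = 100_864        # organic-only entries after preprocessing
hydrates = 5_972                 # hydrates among them
grown_from_water = 1_440         # entries grown from water after filtering
anhydrates_from_water = 1_123
hydrates_from_water = 317

# The two water-grown classes should account for every filtered entry.
assert anhydrates_from_water + hydrates_from_water == grown_from_water

pct_hydrates_csd = 100 * hydrates / organic_entries
pct_hydrates_from_water = 100 * hydrates_from_water / grown_from_water
enrichment = pct_hydrates_from_water / pct_hydrates_csd

print(f"hydrates in whole sample:  {pct_hydrates_csd:.1f}%")
print(f"hydrates grown from water: {pct_hydrates_from_water:.1f}%")
print(f"enrichment factor:         {enrichment:.1f}")
```

On these counts, hydrates are between three and four times more frequent among crystals grown from water than in the preprocessed sample as a whole.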
Surprisingly, for 11 of the anhydrates grown from water the corresponding hydrate was present in the CSD. All 11 structures were closely examined, but all 11 appeared to be genuine anhydrates and no indication was found that the water molecules had simply been missed. Their CSD refcodes are BERYAZ, BISMEV04, EVODUO, GLUCIT02, HXACAN12, IYAWAG, IYAXAH, IZAJUO, KAMPIY, LARGIN02 and LSERIN20. For most of these, the solvent was a mixture of water and, say, ethanol. In these cases, for the water to remain mixed with the ethanol is entropically favoured over getting trapped in a crystal structure. This suggests that it is possible to find an ethanol/water ratio where both the hydrate and the anhydrate can grow concomitantly.
We are fully aware that these samples only reflect the reported experimental literature, and that there is the possibility that the hydrate list could have non-hydrate pairs as a result of future experiments. Similarly, the fact that there is no hydrate reported in the CSD does not mean that it cannot easily be made.
Fig. 5 Frequency of number of included water molecules per asymmetric unit for the Hydrates-only sample.
Fig. 6 Average molecular weight (MW), number of possible donors (NPD) and number of possible acceptors (NPA) as a function of the number of waters per asymmetric unit in the crystal structures. These data are for the Hydrates-only sample.
Fig. 7 Ratios of the average numbers of functional groups in the Hydrates-only sample and the Anhydrates grown from water sample. (a) Functional groups that occur more often in hydrates than in anhydrates (grown from water); SO₄²⁻ through X-I have a value of infinity, i.e. never occur as an anhydrate; (b) (note the change of scale) functional groups that occur equally often in both samples (NH4+ through X-Br) or that occur more often in anhydrates (grown from water) than in hydrates.
The groups that occur more frequently in the Hydrates-only set than in the anhydrate set (ratio >1.0) are predominantly polar. Further study of these distributions is beyond the scope of this paper, and has been discussed elsewhere under the concept of “hydrate affinity”.5 However, it is noteworthy that certain groups are populated much more frequently in Hydrates-only than in Anhydrates-only samples, such as R2PO2–, Cl– and C–NH3+. The converse applies to hydrophobic groups such as CF3, C–Cl, and, interestingly, –O–CO–NH. Note that the preference of the OH group for hydrates depends strongly on the chemical context of the group: HO–CR3 > HO–CR2 > HO–CHR2 > HO–CH2R.
Because the sample size is so small, the further subdivision into the 110 functional groups leaves many groups with fewer than 20 occurrences; these were discarded as not statistically significant. For groups with 20 or more instances, the average number of contacts in the hydrates and the anhydrates is shown in Fig. 8.
Fig. 8 Average number of contacts per functional group for the anhydrous (red) and the hydrated (blue) crystal structures, for functional groups with count >20.
These results show that in almost every case the average number of contacts per group increases when we move from the anhydrous to the hydrated crystal structure. This is not surprising, as generally there must be more hydrogen bonds in the hydrated structure, but it is interesting that increases are not only seen for polar groups, but several of the non-polar groups also show an increase in contacts. It must be remembered that the contacts are not entirely hydrogen bonds, and include other short contacts such as NO2⋯NO2.
We see the most significant increase in contacts for the Cl– group, which moves towards its preferred number of contacts of 3. Other groups showing notable increases are CONH, COO–, NH3R+, NR2, CO and C–Br.
Fig. 9 The hydrogen-bonding pattern of the anhydrous form of 4-dimethylaminopyridinium chloride (top, CSD refcode JUBKAS), and the hydrogen-bonding pattern of the corresponding dihydrate (bottom, CSD refcode DMAPYC, Z′ = 2). The anhydrous salt has Cl– atoms rather below their preferred level of hydrogen bonding (3 contacts). The addition of water allows Cl– to form 3 contacts, and the water is comfortably hydrogen bonded, with at least two strong donor bonds to the Cl–.
Fig. 10 CSD refcodes AMEPOX and AMEQAK. Packing problems of the anhydrous form cause three unsatisfied donors and acceptors; water forms two donor and two acceptor bonds, and now all donors and all acceptors but one are satisfied.
Examination of the average counts of unsatisfied donor and acceptor atoms, including water molecules if present, shows that anhydrates in the CSD have a low unsatisfied donor count of 0.2, but a relatively high unsatisfied acceptor count of 2.8. This reflects the fact that many of the anhydrate entries have few donors, or even zero donors. However, the Hydrates-only sample shows an unsatisfied donor average that is slightly higher, 0.3, but an unsatisfied acceptor average that is considerably lower, 1.8. This suggests that a primary role for water is to reduce the number of unsatisfied acceptors. This is in accord with the observations of Gillon et al.,4 and Infantes et al.,10 that the average hydrogen-bond count for water is 1.9 donor bonds and 1.0 acceptor bonds. Thus addition of water to a crystal structure has an average effect of reducing the number of unsatisfied acceptors.
If we subtract the contacts to water molecules in the Hydrates-only sample structures we can count the resulting unsatisfied donor and acceptor atoms. This has the effect of increasing the number of unsatisfied acceptors from 1.8 to 3.3 (difference +1.5) and unsatisfied donors from 0.3 to 1.1 (difference +0.8), indicating that the water molecules frequently contribute two donor H-bonds, and one acceptor bond. (We should not pretend, of course, that such an artificial removal of water molecules from hydrate structures will represent real stable crystal structures, as there are other rearrangements of hydrogen bonds often possible to lower the lattice energy.)
This leaves the question whether it is the number of unsatisfied donors or the number of unsatisfied acceptors that determines if a hydrate will form. The figures in Tables 2 and 3 lend support to either interpretation, and we can only draw a tentative conclusion. A striking feature of the anhydrate crystal structures taken randomly from the CSD is that they contain relatively high numbers of possible hydrogen-bond acceptors, and relatively high numbers of unsatisfied hydrogen-bond acceptors. This suggests that crystal structures can cope with relatively high levels of unsatisfied hydrogen-bond acceptors. This is in stark contrast to the hydrogen-bond donors: there are significantly fewer possible donors present in the anhydrates, and in none of the samples does the average number of unsatisfied donors ever exceed 0.4; but when the water is removed, the number of unsatisfied donors increases to three times this value. These observations strongly suggest that unsatisfied hydrogen-bond donors are highly unfavourable for crystal structures, and satisfying these unsatisfied donors appears to be the main driving force for incorporating water. However, this cannot be the complete picture: every water molecule comes with its own two donors, which in turn need to be satisfied. That this is indeed the case is clearly demonstrated by the fact that removing the water from the hydrated structures on average leaves 1.5 acceptors unsatisfied. We seem to have the apparent contradiction then, that water is incorporated to satisfy an unsatisfied donor of the host molecule, but in doing so creates two new unsatisfied donors… which are satisfied by the host molecule.
However, this is not necessarily a contradiction. Spatial packing restrictions can prevent donors and acceptors from achieving the geometric conditions for a hydrogen bond: the water molecule, being freely mobile with six degrees of freedom, can act to resolve such spatial conflicts by forming bridges between such unsatisfied donors and acceptors. Furthermore, hydrogen bonds mediated via water are by definition cooperative in nature, and are therefore stronger than the direct hydrogen bonds between two groups from the host would have been.
Fig. 11 The distribution of $\Delta V_{\mathrm{mol}}^{\mathrm{anhydrate} \rightarrow \mathrm{hydrate}}$ (see eqn (2) for definition). 0% is indicated by a red line; it can be seen that on average, the hydrate forms pack less closely than the anhydrate forms.
Fig. 12 The crystal structure of tert-butanol heptahydrate (refcode LEBKEI). The molecular volume of the anhydrate (described as “very hygroscopic”) is increased by 20% with the uptake of seven water molecules. Hydrogen bonds are shown as dotted red lines.
Fig. 13 Scatter plot of $\Delta V_{\mathrm{mol}}^{\mathrm{anhydrate} \rightarrow \mathrm{hydrate}}$ (see eqn (2) for definition) as a function of the molecular volume of the anhydrate (in Å³). The least-squares fit through the points is included as a red line. The plot suggests that the role of the water as space-filler is more noticeable when the molecular volume of the anhydrate is greater.
| | All | Hydrates | Non-bio | Non-bio hydrates | % Non-bio hydrates |
---|---|---|---|---|---|
Chiral | 24 874 | 2247 | 19 789 | 1594 | 8.1% |
Racemic | 18 244 | 546 | 17 127 | 455 | 2.7% |
Achiral | 57 746 | 3179 | 54 490 | 2842 | 5.2% |
Total | 100 864 | 5972 | 91 406 | 4891 | 5.4% |
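The last column and the Total row of the table can be reproduced from the raw counts. The snippet below is a simple consistency check using only the tabulated values; the variable and function names are illustrative.

```python
# Columns: all entries, hydrates, non-bio entries, non-bio hydrates.
table = {
    "Chiral":  (24_874, 2_247, 19_789, 1_594),
    "Racemic": (18_244,   546, 17_127,   455),
    "Achiral": (57_746, 3_179, 54_490, 2_842),
}


def pct_non_bio_hydrates(row):
    """Percentage of non-bio entries that are hydrates."""
    _, _, non_bio, non_bio_hydrates = row
    return 100 * non_bio_hydrates / non_bio


# Column-wise sums give the Total row.
totals = tuple(sum(column) for column in zip(*table.values()))

for label, row in {**table, "Total": totals}.items():
    print(f"{label:8s} {pct_non_bio_hydrates(row):.1f}%")
```

On these counts, non-bio chiral compounds form hydrates roughly three times as often as racemic ones (about 8.1% against 2.7%).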
25% of crystals grown from water are hydrates. The group of chemical compounds that can be dissolved in water in the first place can be expected to be highly biased with respect to the presence of hydrogen-bond donors and acceptors, so this high number is perhaps hardly surprising.
We tentatively conclude that unsatisfied hydrogen-bond donors appear to be the main driving force behind hydrate formation. However, at the same time, hydrates tend not to form if no acceptors are present.
If a chemical compound crystallises as a hydrate, the number of water molecules incorporated per structure increases with the number of possible hydrogen-bond donors and acceptors, and with the molecular weight of the compound.
The relative volume change associated with hydration shows a vague trend suggesting that on average larger molecules use water more often as space filler than smaller molecules, whereas the smaller molecules use the hydrogen-bonding capabilities of the water. However, the trend is vague and lacks predictive power.
Molecular flexibility does not seem to affect hydrate formation, at least not for the definition of molecular flexibility that we have chosen.
There is a strong tendency for chiral compounds to form hydrates. This inclusion of water molecules by chiral compounds presumably serves the purpose of improving the close packing of these compounds in the absence of inversion centres.
Footnote
† Electronic supplementary information (ESI) available: Table of chemical groups used by RPluto for processing of H-bond contacts and CSD refcode lists of hydrate/anhydrate structures, structures grown from water and hydrates-only structures. See DOI: 10.1039/b613332k
This journal is © The Royal Society of Chemistry 2007