Resolving alternative organic crystal structures using density functional theory and NMR chemical shifts†‡

Alternative (‘repeat’) determinations of organic crystal structures deposited in the Cambridge Structural Database are analysed to characterise the nature and magnitude of the differences between structure solutions obtained by diffraction methods. Of the 3132 structure pairs considered, over 20% exhibited local structural differences exceeding 0.25 Å. In most cases (about 83%), structural optimisation using density functional theory (DFT) resolved the differences. Many of the cases where distinct and chemically significant structural differences remained after optimisation involved differently positioned hydroxyl groups, with obvious implications for the correct description of hydrogen bonding. 1H and 13C chemical shifts from solid-state NMR experiments are proposed as an independent methodology in cases where DFT optimisation fails to resolve discrepancies.


S2
a Column headings represent the tolerances when comparing each of the a, b, and c values for the corresponding reduced unit cells of a structure pair. Structure pairs must be within these tolerances for all three reduced unit cell length values to be accepted. Row headings represent the tolerance when comparing R-factors between structure pairs. The entry in bolded red indicates the combination of parameter tolerances selected for the next phase of the study.  Figure S1. Variation in the number of matched structure pairs as a function of the R factor tolerance (vertical axes) and the tolerance in the reduced unit cell axis lengths (horizontal axes) for three ΔT values (ΔT being the tolerance in experimental temperatures). A: ΔT = 2 K; B: ΔT = 5 K; C: ΔT = 10 K. Each plot was generated by performing discrete calculations at 64 points in the parameter space (data in Tables S1 -S3), followed by interpolation to create a continuous plot. The point on plot B indicated with a cross specifies the tolerances used to select the structure pairs that were used in subsequent RMSD calculations (4240 matched pairs).

Quantification of Alternative Organic Structure Determination Differences/Similarities
For each of the 4 238 pairs of structures identified above, an overlay of both the asymmetric unit(s) and unit cell contents were performed (the latter when possible) to quantify differences in structure for each pair. The overlay process was semi-automated using the CSD Python API (versions 0.7 and 1.0), with API calls being made using variations of the Python script overlay.py (included in data archive). Due to the developmental nature of the CSD Python APIs, it was not possible to cleanly execute this Python script across all structure pairs, and in a significant minority of cases, it was not possible to compare/overlay the structures using an automated approach. In these cases, RMSD information was determined manually using the Mercury software [2] (version 3.7 or 3.8).
While the most significant results derived from the structure pair comparisons are discussed in the main manuscript, here we briefly discuss other details. To produce quantitative results from the comparisons, it was required that both structures: (i) contained the same number of atoms; (ii) represented the same chemical structure; (iii) had fully-specified coordinates for all atoms; and (iv) had S5 atomic occupancies always equal to 1 (i.e., no disorder). After applying these criteria, 933 pairs of structures were rejected due to missing one or more atoms, while 173 structure pairs were rejected due to either possessing disorder, being otherwise ambiguous in the placement of at least one atom in the structure, or were found to not correspond to the same chemical structure (the "difference type" for these cases is given as "undefined" in the spreadsheet described below). While some of these problems could potentially be individually addressed (e.g. adding missing hydrogen atoms), the loss of less than 25% of the comparisons is unlikely to distort the statistics. Removing these pairs of structures from consideration, 3 132 structure pairs remained that we could overlay to calculate the heavy atom (i.e., non-H atom) RMSD, the all-atom RMSD, and the largest local difference. After determining the heavy atom RMSD, the pair of atoms (one atom from each structure) with the largest local difference was determined using outputs from the CSD Python APIs and the above-mentioned Python script. This script also tried classify this pair of atoms according to familiar chemical motifs.
This classification into "difference types" was accomplished using the CSD API to output (irrespective of the literal atomic labels) the atom types, and number of each atom type, that were bonded to a given heavy atom. For many comparisons, assigning differences to a type was then relatively straightforward (e.g., OH groups); however, for other difference types, further explanation is required. For example, to compute the largest difference between hydrogen atoms associated with two methyl groups (one group equals three H atoms from each of the structure pairs), all hydrogen atomic coordinates were first expanded into six 3D arrays. Three vectors (each with three elements) were then formed by calculating the magnitude of the difference between all possible permutations of the hydrogen atoms under consideration. For example, the first vector would be composed of three elements, with the first element being the magnitude between the "first" H atom in the "first" structure and the "first" H atom in the "second" structure; the second element being the magnitude between the "first" H atom the "first" structure and the "second" H atom in the "second" structure and the third element being the magnitude between the "first" H atom the "first" structure and the "third" H atom in the "second" structure. Overall, this process would produce nine elements for the two methyl groups. Using the smallest value, one atom from each structure was notionally removed, and the process was iterated to leave one pair of H atoms (one from each structure) corresponding to the maximum distance between H atoms for these two methyl groups. Summaries of the results from the overlay process for each pair can be found in the Excel spreadsheet in the data archive (overlay_results_and_DFT_summaries.xlsx), an extract of which is provided in Figure S2. Figure S3 shows the number of structure pairs possessing specified amounts of local and overall structural differences, while the most common types of structure pair differences are summarized in Table S4. Figure S2. Screenshot capture of the data archive file: overlay_results_and_DFT_summaries.xslx, which summarizes the outputs produced by the Python script: overlay.py. Each row corresponds to the structural comparison of one pair of alternative structures. The column headings are: PAIR_ID, the unique structure pair identification number assigned by the Python script filter1.py; XTAL1 and XTAL2, CSD refcodes associated with the "first" and "second" structures of the pair, respectively; NEUTRON?, identifies structures obtained using neutron diffraction data; POWDER?, identifies structures obtained from powder diffraction data; ZVALUE, number of "components" (components can be molecules or ions) in the unit cell; ZVALUE_RMSD, heavy atom (non-H) RMSD value when comparing the components contained within the unit cells of the pair; Z_PRIME, number of components in each asymmetric unit; Z_RMSD, heavy atom (non-H) RMSD value when comparing (typically) 2 times the number of components specified by Z_PRIME (this value was used in various plots, vide infra); RMSDH_int, an estimate of the all-atom RMSD within the overlay frame of reference generated when determining Z_RMSD; MAX_ATOM1 and MAX_ATOM2, atom labels of the pair of corresponding atoms making the largest contribution to RMSDH_int; DIST, distance (in Å) between MAX_ATOM1 and MAX_ATOM2; DIS_TYPE, classification of maximum atomic difference.   (8)

S13
Extremely large local difference/RMSD data points (i.e., outliers) were not included in Figure 1a in the main manuscript. Figure S12 presents the same data as in Figure 1a, but includes the 6 outliers (out of 3132). These outliers involve cases such as asymmetric rings which differ in orientation between alternative structure solutions (e.g. GOKREE/GOKREE01 and ZITZUX/ZITZUX01), and other groups with alternative positions e.g., sulphonamide group orientation (e.g. QQQAUG01/QQQUAG02). As these are individual cases, which may be complicated by disorder, they were not investigated further. Figure S12. Full results from the structural overlay of alternative structure determinations found in the CSD prior to DFT structural relaxation. The horizontal axis specifies the non-H RMSD value, while the vertical axis denotes the maximum atomic separation for any pair of corresponding atoms. The horizontal dashed line at 0.25 Å indicates the boundary between structure pairs considered sufficiently different to warrant further analysis, and those which were not.

Selection of Structure Pairs with Significant Differences, and Attempting to Resolve These Differences Using Dispersion-Corrected DFT Structural Relaxations
After the structure overlay and difference type classification, dispersion-corrected DFT structural relaxations ("geometry optimizations") were used to see if the pairs of structures relaxed to similar structural/energetic minima. As noted above, structure pairs whose maximum atom-atom difference was ≤ 0.25 Å were deemed to be sufficiently similar and were not considered further, leaving 658 structure pairs to be investigated.
To perform the structure relaxations, the Cambridge Serial Total Energy Package (CASTEP) software was used (version 8.0) [3] . This code uses a projector-augmented wave (PAW) method, with plane waves describing valence electrons and pseudopotentials representing core electrons. All calculations used the exchange-correlation functional of Perdew, Burke, and Ernzerhof (PBE) [4] , with dispersion corrections specified according to Tkatchenko and Scheffler (TS) [5] . Input files required for these calculations were generated using a Python script, castepgen.py, which is included in the data archive. The script makes calls to the program CIF2Cell [6] , which in turn generates the .param and .cell S14 input files required for CASTEP. The main purpose of using this Python script was the ability to loop over all crystal structures (in .cif format) in a given directory. The following (selected) parameter values were used in the .cell files: The following (selected) parameter values were used in the input .param files: xc_functional: PBE # use of PBE DFT exchange correlation functional sedc_apply: true # enables use of a dispersion correction scheme sedc_scheme : TS # specified TS dispersion corrections cut_off_energy: 520 eV # plane wave kinetic energy cutoff value geom_disp_tol : 3.0e-003 # geometry displacement tolerance (in Å) geom_energy_tol : 1.07788e-005 # geometry energy tolerance (in eV atom −1 ) geom_force_tol : 0.030367 # maximum allowed residual force (in eV Å −1 ) The geometry energy and geometry force tolerance values are derived from the parameters used by van de Streek et al. in Ref. [7]. Although copies of all .cell and .param files used as inputs are included in the data archive, the files ABINOS.cell and ABINOS.param are given in the data archive (computational subdirectory) as representative examples of input files. Outputs from the calculations (files having general formats of abcdef01.castep and abcdef01-out.cif, with abcdef01 representing the associated CSD refcode) are also contained in the data archive.
After fixed-cell structure relaxations were completed, the pairwise overlay process was repeated on the remaining structure pairs. The protocols associated with this overlay process mirrored those described earlier (the Python script used, overlay2.py, is in the data archive), and key information can be found in the Excel spreadsheet in the data archive (overlay_results_and_DFT_summaries.xlsx, spreadsheet tab: structures above 0.25 A). Structural differences were deemed to be resolved if all atom-atom differences had reduced to below the threshold of 0.25 Å. Of the 658 structure pairs considered at this stage, the structural differences of 521 pairs were resolved, while 134 structure pairs remained significantly different. A further 3 structure pairs (EGUJEV/EGUJEV01; BUYRUI01/BUYRUI02; BUYRUI01/BUYRUI03) diverged to clearly different polymorphic forms (i.e., the overlay process could not be meaningfully performed). As before with the set of structure pairs (i.e., prior to structure relaxations), the number of structure pairs possessing specified amounts of local and overall structural differences are provided ( Figure S13), as are the most common types of local differences (Table S5). An additional analysis was performed, concerning how far in space each structure moved in relation to its starting geometry. The information generated, along with the Python script used to generate it (analyseGOstats.py), can be found included in the data archive (fixed_cell_stats.csv).  a Percentages are with respect to the 655 structure pairs that could be meaningfully compared. As mentioned above, 3 structure pairs diverged structurally to such an extent that they could not be meaningfully overlaid. To provide greater flexibility for the convergence towards a common structure, the 134 structure pairs that remained significantly different after the fixed-cell structural relaxations (and also the 3 pairs that had diverged to different polymorphs) were subjected to a second structural relaxation in which the unit cell parameters were allowed to vary (i.e., in the .cell input file, the 'FIX_ALL_CELL' parameter was set to 'false'). All input and output files for these calculations are included in the data archive subdirectory /computational/variable_cell.
These variable-cell relaxations resolved an additional 20 structure pairs (15% of 134), with 1 structure pair (TETCYH01/TETCYH12) diverging into different polymorphic forms, thus making a total of 4 pairs incomparable at this point). As a result, 113 structure pairs remained significantly distinct. The differences are again concentrated on systems with methyl-or hydroxyl-group differences (Table S6). While there appear to several systems having CH2 differences, 9 of the 11 structure pairs in this group belong to the same form of a molecule with several alternative structure determinations.

Calculating 1 H and 13 C Nuclear Magnetic Resonance Spectra for Unreconciled Structure Pairs
For the 113 structure pairs whose differences remained unresolved after the two rounds of dispersion-corrected DFT calculations (plus the 4 pairs that diverged into different polymorphs), the magnetic shielding values at the 1 H and 13 C nuclei were calculated using the gauge-including PAW (GIPAW) approach as implemented in CASTEP [8] . All input and output files for these calculations are included in the data archive, subdirectory: /computational/NMR. Synthetic 1 H and 13 C NMR spectra were generated from these calculated magnetic shielding values using the Python script shieldings_compare.py (included in the data archive), with 0.25 ppm of Lorentzian line broadening being applied in all cases.
Although metrics exist in the literature for distinguishing chemical structures from one another via their 1 H and 13 C NMR spectra [9] , these assume a reasonably complete assignment has been performed. Since it was not viable to match the computed shielding values arising from the geometryoptimized alternative structures with different site labels in an automated fashion using currently available tools, comparisons between pairs of calculated NMR spectra were done without assignment using shieldings_compare.py. Where required, the three 1 H shielding values associated with any methyl group were averaged prior to computing the virtual spectrum. At this point, for each peak associated with one of the structures in a given structure pair, this script determined the smallest distance (in ppm) relative to all the calculated NMR signals of the other structure in the structure pair. More conservative values for spectral differences (in ppm) are used here when compared against those used in the literature with assigned datasets. Specifically, we considered a pair of unassigned 1 H NMR spectra as distinguishable from one another if there existed at least one calculated peak position in one spectrum that was at least 0.5 ppm away from every peak in the other calculated 1 H NMR spectrum. For the 13 C NMR spectral pairs, this value was chosen to be 2.3 ppm. Results of this process were summarized as whisker plots and can be found in the file: alternative_structures_NMRsummary.xlsx in the data archive. The numbers of distinguishable structure pairs, sorted according to difference type, are shown in Table S7. Based on the above, most structure pairs having Me groups as their principal structural difference would not be expected to be distinguishable using either 1 H or 13 C NMR experiments at ambient temperature; using low-temperature measurements to "freeze out" the averaging of the methyl resonances can be expected to improve distinguishability. On the other hand, structure pairs having OH groups as their main structural difference are predicted to be regularly distinguishable (distinguishable in 14 out of 17 pairs). As only 17 structure pairs belonged to the OH group, it was deemed worthwhile to manually inspect and assign the 1 H and 13 C NMR spectra associated with these structure pairs. Subsequently, RMSD calculations were performed (Table S8 below), and using the literature metrics for 1 H and 13 C, the likelihood of differentiating these pairs was determined (discussed in the main manuscript). a Data result from GIPAW DFT calculations using fully-optimized input structures and assigned spectra. The RMSD data columns are also depicted in Figure 2 of the main manuscript. S20 Figure S14. Depiction of the intermolecular packing interactions present in the fully geometry optimized structures of (A) SANYIP and (B) SANYIP02. Figure S15. Depiction of the "four-fold symmetry" present in the hydrogen bonding networks of the fully geometry optimized structures of (A) IPRPOL and (B) IPRPOL03. Figure S16. Depiction of the intermolecular interactions present in the fully geometry optimized structures of EDENEH and EDENEH02. When comparing these two structures the main structural differences are in the S21 positioning of two hydrogen atoms (one from an NH group and another from an OH group). At these sites of difference, the hydrogen atom for EDENEH has been coloured magenta.