Leen N. Kalash a, Jason C. Cole b, Royston C. B. Copley a, Colin M. Edge a, Alexandru A. Moldovan b, Ghazala Sadiq b and Cheryl L. Doherty *a
a Medicinal Science & Technology, GlaxoSmithKline, GSK Medicines Research Centre, Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, UK. E-mail: Cheryl.x.Doherty@GSK.com
b The Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, UK
First published on 14th July 2021
Information gleaned from crystal structure databases has previously been reported on several pharmaceutically relevant compounds to make knowledge-based predictions of polymorphism. Access to a large dataset that is highly relevant to the molecules under study is considered to be essential for these studies. We present a survey of the GlaxoSmithKline (GSK) database of small molecule crystal structures containing X-ray diffraction results from GSK and heritage companies from the past 40 years for this purpose. These structures were collected at different stages of the pharmaceutical pipeline and are not limited to marketed products. We found that the GSK database matches the CSD Drug Subset in terms of crystal descriptors, but not in the diversity of solid form space. Applying the hydrogen bond propensity model to GSK polymorphs has demonstrated the increased value in using combined published and proprietary data sources to build the training data sets. Within GSK, we have also shown the value of applying knowledge-based predictions in the de-risking of active pharmaceutical ingredient forms of development candidates. The work described here illustrates the importance of database curation to improve the accuracy of the results obtained.
Typically, experimental solid form screening to identify the thermodynamically stable and relevant metastable polymorphs is performed initially during the pre-formulation phase of drug development, with more comprehensive screens being carried out later on. The solid form which is most stable under conditions relevant to the manufacturing and storage of the drug product is typically preferred for development, so a wide range of crystallisation conditions and storage protocols are commonly explored to identify the most suitable stable solid form for each candidate.2,7 Despite these efforts, unexpected new polymorphs can still appear, even in well-screened systems. Notable examples such as ritonavir8 and rotigotine9,10 show that the late discovery of a new stable polymorph can result in significant challenges to providing a safe and efficacious drug product.2,11,12 Similar bioavailability issues were also encountered with other polymorphic drugs such as chloramphenicol palmitate, oxytetracycline, carbamazepine, atorvastatin calcium, axitinib, phenylbutazone, and rifaximin.13
It is clear that the best chance of finding the most stable solid form comes from designing the widest experimental screen possible, but it is not feasible to explore every possible experimental condition in a reasonable timescale. It is therefore crucial to understand the potential solid form landscape for each candidate and to use that information to do the right subset of experiments that will allow the isolation of the stable form for every case.
One route for the investigation of solid form landscapes is the application of structural chemistry knowledge derived from the Cambridge Structural Database (CSD), which is a collection of every organic, organometallic and metal–organic crystal structure published and now totals over one million entries. Data mining these structures using bespoke solid form informatics tools14–16 developed by the Cambridge Crystallographic Data Centre (CCDC) allows a deeper understanding of the solid form, helps identify weaknesses that may relate to the risk of alternate forms, reveals opportunities for intervention and provides validation and reassurance. Solid form informatics is not a replacement for experimental work but a complementary tool which allows a more informed experimental design to probe risks and opportunities. Solid form informatics is now well established and widely used within the pharmaceutical industry in drug development.17–20
By evaluating a structure in the context of existing knowledge in the CSD, it is relatively straightforward to identify both common and unusual structural features. As examples, an unusual conformation of a molecule, ring or functional group, a geometrically unusual hydrogen bonded interaction, or an unusual donor–acceptor combination may suggest that alternative crystal forms without these compromises could potentially exist.14–16 Statistics and comparative CSD analysis can give answers easily and quickly which can influence and advance the decision-making process with respect to risk mitigation in solid form selection.21
To draw useful conclusions from structural informatics analyses, the available training data must be relevant to the compound of interest. Bryant et al.,22 as part of the Advanced Manufacturing Supply Chain Initiative (AMSCI) funded Advanced Digital Design of Pharmaceutical Therapeutics (ADDoPT) project, created the CSD Drug Subset (CSD-DS), consisting of every published small molecule crystal structure containing an approved drug molecule (8632 entries). A strong overlap in molecular features, including size and flexibility, between this drug subset and in-house crystal structure data23 for AstraZeneca and Pfizer was demonstrated, providing support for the use of statistical informatics models.
As the application of the models described here increases it is valuable to demonstrate the relevance of the publicly available data to modern drug candidates. This allows the accuracy of these models to be assessed for pharmaceutical solid form development. We report herein the first global analysis of the GSK database of small molecule crystal structures. Unlike the CSD-DS, the structures have been obtained across the various stages of the pharmaceutical pipeline and were originally collected to answer structural problems. As such, the database is not limited to marketed products but also includes medicinal chemistry leads, candidate molecules, intermediates and impurities. As of November 2019, there were 2473 entries in the database. It is worth noting that as the structures were collected to answer specific problems, there is no guarantee (indeed it is highly unlikely) that all solid-state forms of a particular material are contained within the database. In addition, the morphology of many of the crystals studied was accurately recorded in the GSK database, with many samples having been routinely face-indexed.
Our analysis described here: offers insights into the difference between proprietary and public domain data; illustrates the relevance of including both CSD-DS and proprietary structural data to build models for GSK candidates; and demonstrates the structural informatics methodology to explore the polymorph landscape of real pharmaceutical candidates.
Term | Definition |
---|---|
Free drug | A structure containing a single type of component. Note that this classification can include structures with multiple symmetry independent copies of a given molecule in the asymmetric unit, i.e. any value of Z′. It includes zwitterionic systems. This term has been applied whether or not the material is formally a proven drug in its own right, i.e. it includes all organics, such as intermediates. |
Salt | A structure where any of the components are charged (not including zwitterions) |
Hydrate | A structure containing more than one type of component where at least one of the components is a water molecule |
Solvate | A structure containing more than one type of component where at least one of the components is a solvent molecule other than water. A solvent was defined as being a liquid whose role is to dissolve the drug in any stage of the synthetic process |
Co-Crystal | A structure containing more than one type of component where at least one of these components is charge-neutral and not a solvent molecule, a water molecule or the free drug |
Note that, other than free drugs, no class of crystal structure is mutually exclusive; indeed, it is possible to have a crystal structure that is in all the other classes at once. For completeness whilst discussing definitions, we use the term API in this paper to describe the biologically active substance of a structure.25 This term is not necessarily interchangeable with the free drug, since the API will include any salt counter ions or co-crystal co-formers.
In this phase of the analysis, we built crystallisation families from the database. Any solid forms that might reasonably be produced from a solvent crystallisation of an API were regarded as being in such a family. To illustrate this, an API in isolation or a hydrate and/or solvate of this would be in a crystallisation family but a different salt for instance would not; that salt would have its own family. These groupings of solid state forms and their interconversion either by crystallisation or desolvation are important to understand in a pharmaceutical context since the hydrated and solvated forms are often not desirable; a better understanding of these families could help improve the risk profile of APIs under development.
Crystallisation families in the GSK database were identified by sorting and grouping the chemical formulas and then manually inspecting the entries to ensure the same API was present in each case. The initial intention was to do this in an automated fashion by pulling together entries with the same canonicalised SMILES for the highest molecular weight component. This was not a successful approach owing to a number of factors, the most notable being the inability to discern the stereochemistry present. Recognising different enantiomers and diastereomers proved difficult without manual intervention, as did separating structures that were single enantiomers versus racemates. Although care was taken to ensure stereochemical differences were taken into consideration, potential atropisomers were considered to be the same entity. This was on the basis that information on the conversion rate was not available within the systems being investigated. SMILES were helpful to identify molecules with the same heaviest component formula and different atomic connectivity but even this was hampered as it is dependent on the correct and identical assignment of bond orders in the database. The discovery that some of the entries within the GSK database were in error or inconsistent was a disappointment but perhaps one of the greatest learnings from this exercise. As a result of this work, a detailed list of database corrections has been drawn up.
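The two-step procedure described above (group by formula, then inspect by hand) can be illustrated with a minimal Python sketch. It is not the script used in this work: the record layout, the entry identifiers and the `stereo` labels are hypothetical, and the sketch only shows how a formula-based first pass can flag groups whose stereochemistry needs manual review.

```python
from collections import defaultdict

# Hypothetical records: identifiers, formulas and stereo labels are invented
# for illustration and are not entries from the GSK database.
entries = [
    {"id": "ENTRY-1", "api_formula": "C20 H24 N2 O3", "stereo": "R"},
    {"id": "ENTRY-2", "api_formula": "H24 C20 N2 O3", "stereo": "R"},
    {"id": "ENTRY-3", "api_formula": "C20 H24 N2 O3", "stereo": "racemate"},
    {"id": "ENTRY-4", "api_formula": "C15 H12 F1 N3", "stereo": None},
]


def normalise_formula(formula):
    """Sort element tokens so that differently ordered formulas compare equal."""
    return " ".join(sorted(formula.split()))


def group_candidates(records):
    """First pass: group entries by the normalised formula of the API."""
    groups = defaultdict(list)
    for record in records:
        groups[normalise_formula(record["api_formula"])].append(record)
    return dict(groups)


def flag_for_review(groups):
    """Second pass: return the groups that mix stereo labels.

    A shared formula does not prove a shared API, so these groups
    still need the manual inspection described in the text.
    """
    return {
        key
        for key, records in groups.items()
        if len({record["stereo"] for record in records}) > 1
    }


groups = group_candidates(entries)
needs_review = flag_for_review(groups)
```

Here the first three entries fall into one candidate group, which is flagged because it mixes a single enantiomer with a racemate.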
Based on the above methodology, 137 crystallisation families were identified in the GSK database. Fig. 1A shows the count of families in the database based on the type of members present in each, where an API could be a free drug or salt (there are no co-crystal examples). We explain the content of this Venn diagram in more detail by means of an example. The intersection of all four sets contains the number 5. This means there are five crystallisation families in the database that contain at least: one structure of the API alone, one hydrated form, one solvated form and one hydrate/solvate. Note that an individual hydrate/solvate does not occur at the intersection of hydrates and solvates since this diagram describes the whole family, not individual structures. By far the largest subset are the families that are composed of only API polymorphs (43 families), followed by the API and solvates (30 families) and the API and hydrates (17 families). These proportions are not representative of family compositions generally, since the search for polymorphic structures and their differences is a primary deliverable for X-ray diffraction studies in GSK. Since the largest fraction of the crystallisation families corresponded to polymorphic systems, these were reviewed separately in more detail. The polymorphs were identified using the procedure outlined in the Experimental section. Based on this analysis, 141 structures or 5.83% of the GSK database were polymorphs, which is notably smaller than the proportion reported in the CSD-DS (approximately 25%).22 The difference is likely to be a result of the different primary deliverables behind the two databases.
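To make the reading of the Venn diagram concrete, the following Python sketch assigns each family to a region according to the set of member types it contains. The family names and contents are hypothetical. The final lines reproduce the quoted polymorph percentage from the 141 polymorphic structures and the total of 2417 structures used in this analysis.

```python
from collections import Counter

# Hypothetical families: each value lists the type of every member structure.
families = {
    "family_1": ["API", "API"],
    "family_2": ["API", "solvate"],
    "family_3": ["API", "hydrate", "solvate", "hydrate/solvate"],
    "family_4": ["API", "API", "API"],
}


def venn_region(member_types):
    """A family's region is the set of distinct member types it contains."""
    return frozenset(member_types)


def count_regions(fams):
    """Count the families that fall in each region of the diagram."""
    return Counter(venn_region(types) for types in fams.values())


regions = count_regions(families)

# Percentage of structures that are polymorphs, using the counts in the text.
polymorph_percentage = round(100 * 141 / 2417, 2)
```

Because the region is defined for the whole family, a family containing only API polymorphs lands in the "API only" region however many polymorphs it has.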
Fig. 1 A. Distribution of crystallisation families in the GSK database B. Percentage distribution of categories of hydrogen bonding interactions in polymorphs. |
The hydrogen bonding interactions in polymorphic structures were obtained using a Python script (see Experimental section for details). Fig. 1B contains a breakdown of these interactions. The percentage of compounds with hydrogen bonding interactions that are identical in the polymorphs (in terms of the identity of donors and acceptors) is 39.7%. The hydrogen bonding arrangement for polymorphs is different in 47.5% of cases. The difference in hydrogen bonding across polymorphs supports the use of the CCDC's hydrogen bond propensity (HBP) tool,26–28 in which the identification of possible polymorphs is based on the likelihood of obtaining different hydrogen bonding arrangements. The limitation of the HBP methodology is that polymorphs with the same hydrogen bonding cannot be distinguished from one another. The finding presented here based on the GSK database is that polymorphs with different hydrogen bonding are sufficiently common to support the use of the HBP tool in the pharmaceutical industry. 12.8% of the polymorphs were found to have no hydrogen bonding interactions. Comparison of the percentage of polymorphs from the GSK database that exhibit hydrogen bonding (87.2%) with the percentage of the total number of GSK structures (all 2417 structures) that are involved in hydrogen bonding (77.3%) indicates agreement with the analysis done by Cruz-Cabeza et al.,29 where it was found that compounds that are able to hydrogen bond have a higher tendency to form polymorphs than those which are not. Although polymorphs that do not exhibit hydrogen bonding are observed less frequently than those that do, they should not be overlooked, and developments in the CCDC's solid form tools (such as the aromatic analyser and the full interaction maps) are beginning to address these cases.30,31
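The three categories in Fig. 1B can be expressed as a simple comparison of donor–acceptor sets. The sketch below is illustrative only and is not the script described in the Experimental section: the atom labels are hypothetical, and each hydrogen bond is assumed to be reduced to a (donor, acceptor) label pair.

```python
def compare_hbonds(form_a, form_b):
    """Classify a polymorph pair by the identity of its donor-acceptor pairs.

    Each argument is an iterable of (donor_label, acceptor_label) tuples.
    """
    a, b = frozenset(form_a), frozenset(form_b)
    if not a and not b:
        return "no hydrogen bonding"
    return "same" if a == b else "different"


# Hypothetical atom labels for three polymorph pairs.
pair_same = compare_hbonds([("N1", "O2")], [("N1", "O2")])
pair_different = compare_hbonds([("N1", "O2")], [("N1", "O5")])
pair_none = compare_hbonds([], [])
```

A comparison of this kind is blind to geometry, which mirrors the limitation noted above: forms that use the same donors and acceptors are treated as indistinguishable.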
Another factor relating to polymorphism is chirality. It was found by Cruz-Cabeza et al.29 that chiral molecules are less prone to polymorphism than their achiral counterparts. The number of chiral arrangements was computed for the molecules that form polymorphs (71 molecules). Only 49.3% of these have chiral arrangements, compared with 59% of all unique molecules in the GSK database (2009 molecules). These percentages agree with the work of Cruz-Cabeza et al., suggesting that chiral molecules are less prone to polymorphism; however, this percentage might not be representative of the actual polymorphs obtained in experimental screens.
The different balance between the free drug and salt structures in the two databases is highly likely to be a result of the GSK collection including pharmaceutically relevant molecules from all parts of the pipeline, not just marketed products: medicinal chemistry samples generated as part of lead optimisation studies would rarely be prepared as salts for instance. Another marked difference between the GSK database and the CSD-DS is the number of co-crystal structures (0.3% vs. 31.0% respectively). Historically, GSK has not actively prioritised the development of co-crystal APIs and this is reflected in the numbers. The above co-crystal definition removes a number of GSK structures from this category, in the case of salts where the drug-like molecule is present as both a charged and a neutral molecule.
The total number of hydrated structures is found to be 17.2% in GSK and 20.4% in the CSD-DS. Despite the overall differences in the composition already discussed, these numbers for the hydrates are reasonably similar. This might suggest that hydration as a phenomenon is not unduly influenced by whether the pharmaceutical material is charged and/or multicomponent. This seems a little counterintuitive at first, as one might expect the salts more prevalent in the CSD-DS to be more hydrated since they are charged.7 There is a slight bias in that direction in the figures, but it is not as pronounced as was expected and more work may be needed to rationalise this finding further. By comparison, there is a bigger difference in the number of solvates found in the GSK and CSD-DS, 15.1% and 10.0% respectively. This difference again probably comes down to the composition of the two databases, mainly due to GSK strategy, where questions (particularly from Discovery) could often be answered using solvates whereas their existence in marketed drugs would be seen as a clear disadvantage.
Fig. 4 A. Percentage distribution of space groups. B. Z′ for small molecule crystal structures in the GSK database and the CSD-DS. |
The relative Z′ distribution profiles obtained for both databases are shown in Fig. 4B. By far the most frequent structures are those with Z′ = 1 in both the GSK database (77.8%) and the CSD-DS (72.5%). There are far more structures containing symmetry (Z′ < 1) in the CSD-DS than in the GSK database. In particular, Z′ = 0.5 represents 12.06% of the CSD-DS but just 2.40% of the GSK database. Closer inspection of Z′ = 0.5 structures in the CSD-DS is illuminating. 98% of structures that are Z′ = 0.5 in the CSD-DS have more than one distinct component in the asymmetric unit, so the high occurrence of Z′ = 0.5 structures in the CSD-DS compared to GSK is an artefact of the high number of co-formers that often straddle centres of symmetry in the CSD set, and of how Z′ is defined in these cases. The proportion of Z′ = 2 structures in the GSK database (17.05%) is slightly greater than in the CSD-DS (12.53%). Both these findings fit with the larger number of Sohncke space groups in GSK, since inversion symmetry is impossible in these and chiral molecules tend to mimic a centrosymmetric arrangement,33 which requires two independent molecules to achieve. Otherwise the broad Z′ profiles are similar for both databases.
The packing distributions of these two databases are very similar, as shown in Fig. 5A, and also agree well with the values originally reported for aromatic molecules by Kitaigorodskii (0.6–0.8).35 The difference in the median values for these packing coefficients was nevertheless investigated with a Mann–Whitney test and shown to be significant (Table S2†). This analysis uses highly relevant datasets to confirm the typical range of packing coefficients for drug-like crystal systems. This allows future low density materials to be identified clearly as a risk early in development, further allowing mitigating development activities to be targeted.
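For readers unfamiliar with the Mann–Whitney test, the following self-contained Python sketch computes the U statistic and a two-sided p-value. It uses the normal approximation with a tie correction and no continuity correction, which is an assumption of this sketch and may differ from the settings behind Table S2. The packing coefficients in the example are invented for illustration.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the tie-corrected normal approximation."""
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:  # every pooled value identical: no evidence of a shift
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical packing coefficients, not values from either database.
set_a = [0.60, 0.62, 0.64]
set_b = [0.70, 0.72, 0.74]
u_stat, p_value = mann_whitney_u(set_a, set_b)
```

With samples as large as the two databases, even a small shift in the median can be statistically significant, which is consistent with distributions that look very similar by eye.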
As part of the packing coefficient analysis, an attempt was made to understand structures at the extremes (in the tails) of the GSK distribution. The most obvious reason for structures with low packing coefficients was the use of SQUEEZE procedures.36 These have not generally been used for GSK structures (with a preference for modelling disordered solvent whenever possible) but in some cases there was no alternative. A ConQuest search was performed on the GSK database in an attempt to identify structures that used SQUEEZE and to see the effect of this on the packing coefficient. Fourteen structures were identified as using SQUEEZE, nine of which have packing coefficients between 0.46 and 0.59, which is within the tail of the distribution.
To gain a better understanding of GSK structures with low packing coefficients the percentage void volume density distributions (relative ratio) were obtained for both databases as illustrated in Fig. 5B. 9.40% of the CSD-DS has a non-zero percentage void volume (see Experimental section for the definition used to calculate void space) whereas the GSK database has a higher percentage (15.39%) of structures with a non-zero percentage void volume, which could be attributed to the CSD-DS containing a larger proportion of marketed products in which the drug substances have optimised solid state properties.
The number of hydrogen bond pairs (HBP) in each structure was computed with a script written in Python (refer to Experimental section for details). The HBP distributions of the two databases are very different as shown in Fig. 5C, where the distribution of the CSD-DS is shifted towards higher HBP values. The mean value for the CSD-DS is 6, compared to just 3 in the GSK database. This could be explained by the predominance of free drugs in the GSK database (58% of the total solid form distribution) in comparison to the CSD-DS that has fewer of these forms and more salts, which are generally capable of exhibiting more hydrogen bonding interactions. The lack of co-crystals in the GSK entries may also have an impact here. Another interesting finding is that 22.7% of the structures in the GSK database exhibit no H-bonding, which is a much higher percentage than the CSD-DS structures where only 4.8% of the structures do not display any H-bonding.
For the molecular descriptor space analysis, unique molecules in the GSK database (2099 molecules) and the CSD-DS (778 molecules) were considered in an attempt to draw out relations to observed crystal descriptors. The increased percentage of GSK structures with no H-bonding interactions in comparison to the CSD-DS is also in line with the logP distribution (Fig. 6A), which for the GSK database is shifted towards higher values, with a mean value of 3.17 ± 2.24, indicating that the molecules are more hydrophobic in nature than those in the CSD-DS, which has a mean value of 1.79 ± 2.65.
Furthermore, the molecular weight, flexibility, and branching density distribution profiles were obtained for both databases (Fig. 6B–D respectively). Flexibility is defined as integer * (100*rotatable bonds/total bonds), where the rotatable bonds percentage distribution was obtained for each database. Branching37 is the number of walks of order 2 that start and end at the same atom, which is an indication of the branching within a molecule. The mean MW value for the GSK molecules is 399 g mol−1, compared with 324 g mol−1 for the CSD-DS molecules. The mean value of the flexibility for GSK molecules is 14 ± 8, compared with 15 ± 13 for the CSD-DS molecules. Finally, the mean branching value for GSK molecules is 61, compared with 41 for the CSD-DS molecules. Mann–Whitney tests were performed on each of the GSK and CSD-DS distributions in Fig. 6, showing no significant difference in flexibility but a significant difference in each of the others (see Table S2 in ESI†). The molecular weight and extent of branching distribution profiles for the GSK dataset are shifted towards higher values, which might contribute to the decreased values of the packing coefficient distribution.38 Flexibility does not correlate with these observed trends since the medians were not shown to differ significantly. However, it has been previously reported that crystalline molecules tend to have low molecular weight and are generally simpler structures with a lower number of rotatable bonds. By contrast, uncrystallisable molecules tend to be more structurally complex and flexible, which makes them harder to crystallise.39 Walk count/branching descriptors are also used in machine learning methods for the prediction of crystallisability of small organic molecules.40,41
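The two descriptors defined above can be sketched in a few lines of Python. Two assumptions are made here that the text does not confirm: "integer" is read as truncation to a whole number, and the walk count is evaluated on a simple molecular graph supplied as an adjacency list (whether hydrogen atoms are included is not specified). The example graph is hypothetical.

```python
def flexibility(rotatable_bonds, total_bonds):
    """Percentage of rotatable bonds, truncated to an integer (assumed reading)."""
    return int(100 * rotatable_bonds / total_bonds)


def closed_walks_of_order_two(adjacency):
    """Count walks atom -> neighbour -> same atom over the whole molecule.

    For a simple graph this equals the trace of the squared adjacency
    matrix, i.e. the sum of the atom degrees.
    """
    return sum(len(neighbours) for neighbours in adjacency.values())


# Hypothetical four-atom graph: atom 0 bonded to atoms 1, 2 and 3.
star_graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

example_flexibility = flexibility(rotatable_bonds=3, total_bonds=20)
example_walks = closed_walks_of_order_two(star_graph)
```

Read literally, the walk count grows with the number of bonds, so under this reading it is expected to track molecular size as well as branching.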
In this section, we show the importance of having a good coverage of functional groups when using HBP to build a reliable model. An interesting example that illustrates this is the molecule, GW825964X. We investigate whether the statistical analysis of hydrogen bonds in the publicly available CSD-DS alone is sufficient to predict the likelihood of the hydrogen bonding for GW825964X, or if there is value in incorporating the proprietary data in the GSK database. The GSK database contains a much smaller number of structures but they are highly relevant, and it often includes close analogues from the same chemical series. Training datasets were built from GSK, CSD-DS, and combined GSK/CSD-DS data to build three comparative models, which were used to evaluate the performance and predictive power of these models.
The HBP model performance was evaluated using different training data sets of GSK, CSD-DS, and a combination of the two by comparing the area under the curve (AUC) (Table 2). When mined for relevant structures, the GSK database yields a larger training set than the CSD-DS. However, two functional groups were not well represented using either the GSK training set or the CSD-DS alone, whereas a combination of GSK and CSD-DS ensured that all functional groups are well represented. This indicates there is value in collecting more structures for compounds containing certain functional groups that are poorly represented in the CSD-DS, in order to build better models for compounds containing these features in the future.
Polymorphs of GW825964X | |||
---|---|---|---|
Training set | AUC | Number of well represented functional groups | Data size |
GSK | 0.93 | Two not well represented | 729 |
CSD-DS | 0.90 | Two not well represented | 450 |
GSK + CSD-DS | 0.92 | All well represented | 1179 |
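As a reminder of what the AUC values in Table 2 measure, the sketch below computes the area under the ROC curve as the probability that a randomly chosen observed hydrogen bond receives a higher predicted propensity than a randomly chosen unobserved one. This is a generic illustration with invented scores and is not the evaluation code of the HBP tool.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.

    scores: predicted propensities; labels: 1 if the donor-acceptor
    pair is observed to hydrogen bond, 0 otherwise. Ties count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both observed and unobserved pairs are required")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical propensities and outcomes for four donor-acceptor pairs.
example_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the three models in Table 2 all discriminate well and differ mainly in functional group coverage.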
Plots of the mean hydrogen bond (H-bond) co-ordination versus mean H-bond propensity were generated by the HBP models using the three different types of training sets (GSK, CSD-DS, and GSK + CSD-DS) for two polymorphs of GW825964X and are shown in Fig. 7A–C respectively. Using two of the training sets (GSK and GSK + CSD-DS) as shown in Fig. 7A and C respectively, the more stable form exhibits higher mean H-bond co-ordination and mean H-bond propensity values. With the CSD-DS training set alone (Fig. 7B), the more stable form exhibits a lower H-bond propensity value.
The cause for the lower propensity in the CSD-DS data set is the low representation of the amide carbonyl acceptor compared to the sulfonyl acceptor. The more stable structure contains a hydrogen bond to this carbonyl, whereas the less stable structure contains a hydrogen bond to the sulfonyl. The model underestimates the likelihood of the hydrogen bond to the carbonyl as compared to the sulfonyl (see Tables S3–S8†).
In the GSK training set there are also two under-represented groups, but neither forms hydrogen bonds in the two structures, and so the relative impact of these being under-represented in the model is small (see Tables S9–S11 ESI†).
The coordination score does not change from model to model as it is based on pre-calculated information from the CSD. The coordination model only considers the donors or acceptors in isolation, and so is less affected by a lack of data in the CSD for the more chemically specific functional groups used to build the training dataset. The scores differ here (see Coordination_Score Table S12 in ESI†) because the amide carbonyl is generally more likely to accept the observed number of hydrogen bonds than the sulfonyl acceptor in the two structures.
This analysis highlights the benefit of combining our in-house database with the CSD-DS to build more statistically certain HBP models. This shows the value of incorporating as much relevant data as possible into knowledge-based modelling methods.
The hydrogen bond dimensionality (HBD) (see Experimental section: protocol for generating molecular descriptor distributions) is a useful descriptor to assess the networks formed, and its distribution has been investigated in relation to the morphology of the small molecule crystals studied as part of the GSK database. This analysis seeks a correlation between the dimensionality of hydrogen bonding networks and the morphology of the crystals. HBD can be categorised as: discrete (when molecules do not possess hydrogen bonds or hydrogen bonds are formed in discrete units without any long-range extension); chains (when there are chains of hydrogen bonded molecules); sheets (when there are hydrogen bonded sheets); and networks (with extensive interconnected hydrogen bonding in all directions). As shown in Fig. 8C, for the 1D and 2D morphologies, the HBD has a similar distribution, with the chains HBD being the most frequent, followed by discrete, sheets and finally, networks. This does not indicate any clear correlation between HBD and morphology. For the 3D morphology crystal structures however, the discrete HBD is clearly the most frequent, followed by chains, sheets, then networks. This is clearly different and warrants further consideration. The count of 3D structures with discrete HBD was shown to be different from the count of non-3D structures with discrete HBD, and we used a chi-squared test to show that this was significant. This is not only due to a lack of any hydrogen bonding, because for discrete HBD structures hydrogen bonds are found in 51.7% of 3D structures and in a similar proportion (45.3%) of all structures.
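A chi-squared test of this kind can be illustrated on a 2 × 2 contingency table (3D versus non-3D morphology, discrete versus other HBD). The Python sketch below is a generic Pearson chi-squared test without Yates' continuity correction. The counts are invented for illustration, since the underlying counts are not quoted in the text, and the exact form of the test used in this work is not assumed.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2 x 2 table [[a, b], [c, d]].

    Returns (chi2, p). With one degree of freedom the p-value is
    erfc(sqrt(chi2 / 2)). All row and column totals must be non-zero.
    """
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    n = a + b + c + d
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: rows are 3D / non-3D, columns are discrete / other HBD.
chi2_value, p_value = chi_squared_2x2(20, 10, 10, 20)
```

In this invented example the expected count in every cell is 15, giving a statistic of 20/3 and a p-value below 0.05.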
One hypothesis is that the lack of long range hydrogen bonding in these discrete structures leads to the greater influence of van der Waals' interactions and more even growth in all directions, subject to BFDH considerations. Work to investigate this further might include a closer inspection of the unit cell dimensions and indeed the difference between observed and BFDH calculated dimensions. Having accurately face-indexed crystals in GSK would help in this regard.
Crystallisation families of structures containing the same API have been identified in the GSK database and structures for related polymorphs identified. The number of the latter is generally quite low, however, since most GSK structures have arisen from individual requests where a single structure was used to answer the problem in hand. The resulting study of the hydrogen bonding in these polymorphs was very informative. A large number of polymorph pairs in the GSK dataset have been shown to have different hydrogen bonding arrangements. This is highly encouraging for the use of the CCDC's HBP tools for the prediction of alternative polymorphs.
The R-factor distributions for the GSK database and the CSD-DS were comparable. On the basis of this broad metric, the structures in the GSK database and the CSD-DS are regarded as being of a similar quality. A greater difference was observed when considering the space groups and Z′ values in the two databases. The GSK database favours Sohncke space groups that can contain chiral molecules and consequently, there are fewer Z′ = 0.5 and more Z′ = 2 structures.
A similar distribution of packing coefficients is found in the two databases. Structures treated with the SQUEEZE procedure occupy the lower coefficient GSK tail. The number of hydrogen bonded pairs in the GSK database is much lower than in the CSD-DS whereas percentage void volume, molecular weight, and flexibility density distributions are higher for the GSK database, which are factors that could contribute to the shift of the packing coefficient distribution towards lower values for the GSK database.
A study of the morphology of GSK's crystals involved segregating these into 1D, 2D and 3D categories. The 3D morphology crystals, which are most suited for the pharmaceutical industry due to their enhanced handling and processing characteristics, were the most abundant in the GSK database. However, some caution is required when considering this as they are also the easiest to measure experimentally. Perhaps more interestingly, an analysis of the hydrogen bond dimensionality shows that structures with no directionality (hence where van der Waals' forces play a larger role) clearly favour the 3D morphologies.
The work described in this paper greatly increases the understanding of the contents of the GSK database and demonstrates the value of this small number of chemically relevant additional structures in models assessing polymorph stability. It has highlighted the need for large amounts of accurate and relevant crystal structure data to build more reliable hydrogen bond propensity (HBP) models, and it demonstrates the value of using them in de-risking API forms of pharmaceutical candidates. In order to extract useful information in this way, particularly using automated scripts, the contents of the database need to be carefully curated to ensure consistency. The work also emphasises the advantages of the database such as the accurate recording of the morphology of many of the crystals that have been routinely face-indexed. This would allow for the successful analysis of the factors impacting the formation of crystal structures with different morphologies, which would also be invaluable to the development of modelling and prediction algorithms for crystal habits and strategies for habit modification. Overall, the data analysed could be used for building more reliable models for the prediction of materials properties and improvement of the performance of GSK materials, aiding crystal engineering and de-risking materials selection earlier in the drug development process to the benefit of patients.
Footnote |
† Electronic supplementary information (ESI) available: Hydrogen bond propensity model tables, protocols for hydrogen bonding and morphology analyses, python scripts and CSD-DS gcd files. See DOI: 10.1039/d1ce00665g |
This journal is © The Royal Society of Chemistry 2021 |