Jan-Joris
Devogelaer
,
Sander J. T.
Brugman
,
Hugo
Meekes
,
Paul
Tinnemans
,
Elias
Vlieg
and
René
de Gelder
*
Radboud University, Heyendaalseweg 135, 6525AJ Nijmegen, The Netherlands. E-mail: r.degelder@science.ru.nl
First published on 3rd October 2019
Cocrystallization is an attractive formulation tool for tuning the physicochemical properties of a compound while not altering its molecular structure and has gained interest from both industry and academia. Although the design strategy for cocrystals has marked several milestones over the past few decades, a holistic approach that utilizes as much cocrystal data as possible is still lacking. In this paper, we describe how information contained in the Cambridge Structural Database (CSD) can be used to construct a data-driven cocrystal prediction method, based on a network of coformers and link-prediction algorithms. Experimental validation of the method leads to the discovery of ten new cocrystal structures for its top ten predictions. The prediction method is not restricted to compounds present in the CSD: by combining the information of only a few cocrystals of an unknown coformer (e.g. an API in development) together with the information contained in the database, a set of relevant cocrystal candidates can be generated.
Cocrystals are single-phase solid complexes, consisting of two or more neutral molecules that are solid under ambient conditions (called coformers) with a well-defined stoichiometric ratio, for which no charge transfer is observed in the resulting crystal structure.5,6 A subclass of cocrystals that is often encountered are pharmaceutical cocrystals, where one of the constituents is an API and the other a pharmaceutically acceptable coformer found in the GRAS‡ list. However, any crystal that contains multiple molecules and conforms to the definition above is considered to be a cocrystal. The presence of an additional component modifies the intermolecular interactions in the underlying crystal structure, making it possible to alter several mechanical and physicochemical properties (e.g. solubility, permeability, taste and hygroscopicity).1,7,8 Because the molecular structure of the constituents remains unchanged, the FDA classifies cocrystals of APIs in the same category as polymorphs and salts.9 This drastically reduces the risks and steps to be taken from a regulatory perspective, as previously determined safety and efficacy tests remain valid for cocrystalline products. Additionally, the preparation of cocrystals gives various opportunities regarding intellectual property rights.10
Chirality, or more specifically homochirality, plays an important role in today's industry that demands enantiomerically pure products.11 For chiral coformers that crystallize as a racemic compound, the presence of an additional component can give rise to the formation of a racemic conglomerate.§ Crystallization of conglomerates is a key requirement for various post-synthetic separation processes based on crystallization,13–17 enabling the enantiopure production of one stereoisomer. Although only a few examples of conglomerate cocrystals are known so far,18–20 the number of available coformers largely exceeds the number of counter-ions and solvents.21 Furthermore, it has recently been shown how enantiospecific cocrystals (i.e. a cocrystal that is formed preferentially with one of the enantiomers by the addition of a chiral coformer) can be used to separate a racemic compound forming molecule,22,23 essentially being the neutral analogue of the widely used resolution via diastereomeric salts. Hence, cocrystallization also has large potential for applications regarding chirality, and could become a key element in enabling resolution.
The design and prediction of new cocrystals are typically performed using the concept of supramolecular synthons.24 A common strategy in crystal engineering is to first investigate the crystal structure of the target compound and to evaluate which non-covalent interactions (mostly hydrogen bonding motifs, but also halogen bonds, π–π and van der Waals interactions) could aid in the formation of new supramolecular synthons between the target compound and coformer (e.g. Espinosa-Lara et al.25 and Kuminek et al.7). Although this approach is rational from a chemical point of view and has been shown to be valuable in cocrystal screening protocols, it remains generally impossible to reliably predict cocrystal formation. One of the method's prime shortcomings is its focus on isolated molecular features (i.e. only the presence of functional groups), whose interactions are not necessarily decisive for the resulting molecular architecture. Moreover, subtle factors, such as steric hindrance, packing issues or even experimental difficulties (e.g. mismatch in coformer solubilities), are generally not taken into account. Furthermore, it has been shown that cocrystallization is not governed by the presence of hydrogen and halogen bonds alone,26 again advocating an approach beyond functional group matching.
Because cocrystallization experiments can be laborious and time-intensive, computational techniques based on molecular modelling,26–28 molecular descriptors,29 hydrogen bond propensity30–32 and machine learning33 have been developed to guide the search for new cocrystals. While these approaches have definitely succeeded in broadening the understanding of the principles behind cocrystallization, a major pitfall is their dependence on small subsets of cocrystals. Therefore, the results tend to lack generality and may be biased towards cocrystals of highly popular coformers, such as caffeine or nicotinamide.
In this paper, we introduce a new knowledge-based cocrystal prediction method based on a network of coformers and link-prediction algorithms. In our previous work,34 we have demonstrated how cocrystals in the Cambridge Structural Database35 (CSD) can be transformed into a network of coformers and shown how clusters, a quantification of the so-called popularity bias and the type of aggregation behavior, can be extracted from this network. Here, we combine several techniques from network science and classification to predict new cocrystals based on the information contained in the coformer network. By including all binary cocrystals present in the CSD, we do not constrain the tool to small or possibly biased data sets, but attempt to include as much relational information as possible. The performance of the method is evaluated on the data of the network itself (through cross-validation) and by analyzing the scoring behavior of cocrystals that were added to the CSD in the last 3 years. Predictions of new cocrystals with the highest likelihood of existence are experimentally verified. Finally, we indicate how our prediction method can be used for target compounds not present in the database.
While the network is built from existing cocrystals, it may safely be assumed that an abundance of coformer combinations has not yet been experimentally verified and are missing. Such combinations are labeled as 0 in the adjacency matrix, and could in principle be predicted and synthesized. By using the information that is contained in the coformer network, the aim of link prediction is to estimate the likelihood of the existence of these missing cocrystals. This likelihood is expressed as a value or score, calculated from the structural features of the network with parameters derived from the adjacency matrix (rather than from their molecular structure). An important advantage of link prediction is that the methods to score coformer combinations are fairly simple to use and that, in this case, the relevant chemistry and physics of cocrystal formation is implicitly contained in the network itself. Therefore, link prediction has the potential to significantly speed up the development of cocrystal screening protocols, bypassing either local interaction predictions or lengthy calculations. The choice of a scoring method is, however, not trivial, and is selected here on the basis of the network properties and through validation on the known cocrystal data (with cross-validation). The performance of the chosen method was further evaluated by analyzing the time-evolution of the coformer network and by experimentally confirming that new, high-scoring coformer combinations indeed yield new cocrystals.
This section is therefore structured as follows. Several network features, required for scoring coformer couples, are introduced in section 2.1 and the scoring methods themselves are described in section 2.2. Section 2.3 covers the various techniques that were used for validation of the approach.
Fig. 1 schematically shows the structural properties of a given coformer couple that can be derived from the network. Each coformer is characterized by a set of direct neighbors n, equivalent to the set of coformers it has successfully formed cocrystals with. Using the adjacency matrix A, the set of neighbors of coformer i is found as follows:
ni = {a ∈ N|A(i,a) = 1}. | (1) |
A property derived from n is its cardinality or degree k, defined as
ki = |ni| | (2) |
The type of the network can be classified as mono- or bipartite. A monopartite network is characterized by a single group of nodes, and combinations between any two nodes are possible (e.g. connections between users in a social network). On the other hand, a bipartite network consists of two separate groups of nodes, and connections appear only between nodes of the different groups. An example of a bipartite network is the network formed by salts: the network consists of a set of cations and anions, and salts can only be formed by combining opposite ions. We have recently analyzed the type of the coformer network34 and have found that it implicitly behaves in a bipartite way. Therefore, the link-prediction methods were selected or adapted to this network type, which requires the formulation of two additional bipartite properties. Originally proposed by Daminelli et al.,36 a combination of nodes (i,j) can be characterized by two sets of bipartite common neighbors, bi,j and bj,i (see Fig. 1), defined as:
bi,j = {a ∈ ni|∃b ∈ nj ∧ A(a,b) = 1} | (3) |
bj,i = {a ∈ nj|∃b ∈ ni ∧ A(a,b) = 1} | (4) |
li,j = {(a, b) ∈ E|a ∈ bi,j, b ∈ bj,i ∧ A(a,b) = 1} | (5) |
Method | Expression |
---|---|
Common neighbors index (CN) | s i,j = |bi,j ∪ bj,i| |
Jaccard index36,37 | |
Hub promoted index (HPI)38 | |
Hub depressed index (HDI)39 | |
Salton index40 | |
Sörenson index41 | |
Leicht–Holme–Newman index (LHN)42 | |
LCL common neighbors index (CN LCL)36 | s i,j = |bi,j ∪ bj,i| × |li,j| |
Preferential attachment index (PA)43,44 | s i,j = ki × kj |
Adamic–Adar index (AA)36,45 | |
Resource allocation (RA)36,46 |
A common approach is to compute the scores for every unconnected pair of nodes and rank them in decreasing order. Coformer combinations with the highest scores are then expected to result in new cocrystals. An alternative approach is to rank the scores for one specific coformer, for instance when screening cocrystals for a specific target compound, such as an API.
The capability of the method to repredict the test items is then evaluated by comparing the predicted labels of the test set, determined at a chosen threshold (i.e. an arbitrary score value), to their true labels. Because of the way the network is constructed, it is only possible to certainly know the true labels of existing cocrystals (i.e.A(i,j) = 1). On the other hand, the labels of unknown combinations (A(i,j) = 0) are uncertain, as zeros in the adjacency matrix are more likely to emerge from untested coformer pairs than from unsuccessful cocrystallization experiments, and could therefore represent existing but not yet discovered cocrystals (A(i,j) = 1). The result of the comparison is used to construct a so-called confusion matrix, consisting of four elements (Fig. 2):
Fig. 2 An example illustrating the concept of validation and the calculation of the evaluation metrics. At a certain threshold, the scores of the test set combinations are determined using one of the link-prediction or scoring methods from Table 1 and are ranked in decreasing order (s1 to s9). Combinations for which the score s ≥ t (threshold) are predicted to exist, and vice versa (red labels), and the test set's predicted labels are compared to their true labels, resulting in a confusion matrix. The matrix is used to calculate the precision (eqn (7)) and recall (eqn (6)), and the trio (p, r, t) is added to the precision–recall curve. |
1. True positives (TP): the number of positive labels correctly labeled as positive;
2. True negatives (TN): the number of negative labels correctly labeled as negative;
3. False positives (FP): the number of negative labels wrongly labeled as positive;
4. False negatives (FN): the number of positive labels wrongly labeled as negative.
The performance of the scoring method at the chosen threshold is then summarized using the evaluation metrics recall (r) and precision (p), since the reprediction of positive labels is more relevant than that of negative labels:
(6) |
(7) |
The recall (r) represents the retrieval success or sensitivity for a chosen threshold. The precision (p) is a measure for the relevance of the result and summarizes how many of the positive predictions are actually true. The latter can be understood as a success rate: the score of cocrystal prediction corresponds to a certain precision value, indicating the probability of the coformer combination to actually form a cocrystal.
It is clear that the confusion matrix and its derived evaluation metrics are directly related to the choice of threshold, the value of which will be different for each method from Table 1 and is hard to physically justify. In order to compare the various scoring methods to each other, it is common to use threshold curves, which reveal the method's total performance over the entire score spectrum. In the case of a large class imbalance such as for the link-prediction problem (more unknown than known links), it has been shown that the precision–recall curve and its associated area-under-the-curve metric provide a better overview of performance than the alternative receiver operating characteristic curve (ROC),47 and were therefore chosen to compare the various methods and to select the suitable scoring method for link prediction. By varying the threshold from the lowest to the highest score and computing the precision and recall from the resulting confusion matrix at each threshold, a comprehensive curve is obtained of which the enclosed area-under-the-curve is used as a measure for comparison. Alternatively, the data used to construct the precision–recall curve may be shown as a precision–threshold curve, from which the threshold at a specified precision can be extracted (or vice versa).
Fig. 3 (a) Precision–recall curves for the scoring methods shown in Table 1 constructed using 100 validation sets. The corresponding area-under-the-curve (AUC) metrics are also calculated, resulting in RA as the best performing method. (b) The precision–threshold curve for RA (black, left axis) and the total number of true positives (TP) over all validation runs as a function of the threshold (red, right axis). |
The largest value for the AUC is obtained using the bipartite resource allocation scoring method (RA) and is close to the value obtained using the bipartite Adamic–Adar method (AA). This is not surprising, as the degrees k of the coformers in the network vary within a range of two orders of magnitude, and therefore the effect of the logarithm in AA for higher degree neighbors is relatively small, resulting in a similar scoring behavior. From Table 1 and Fig. 1, it is clear that both scoring methods express the score using a measure for the density of the intermediate network between two coformers: only relationships between common neighbors contribute, and neighbors with a more diverse cocrystallization profile are penalized (via the degree k in the denominator). The reason that these methods perform so well can be attributed to the structure of the network itself: a comprehensive analysis34 has shown that existing cocrystals are characterized by the presence of local bipartite communities between them. Hence, methods capable of condensing this structural phenomenon effectively into their score value are expected to perform better. Although other methods such as CN and CN LCL also take properties derived from these local bipartite communities into account, they perform slightly worse than RA and AA, as they appear to miss the crucial formulation of the interlying density. Since the AUC of RA is larger than AA's, RA was selected as the prediction method of choice.
Methods that suppress the score of new combinations with higher coformer degrees, such as the Jaccard, HDI, HPI, Salton, Sörenson and LHN indices, perform much worse. A plausible explanation can again be found in the structure of the network: the degree distribution exhibits linearity in a log–log graph and can be fitted with a power law. A peculiar feature of power-law distributed networks is their dependence on nodes with larger degrees for their internal structure, which also plays a crucial role in their evolution. It is interesting to note that the numerator of these methods is equivalent to that of the CN scoring method (except for LHN where it is doubled), which, on its own, does show moderate prediction performance. Thus, scoring methods that penalize nodes for their degree cannot describe the network's underlying structure, and are therefore not suitable to predict new cocrystals.
Simply combining any node with higher degree nodes, leading to an unorganized collection of cocrystals as proposed by the preferential attachment index (PA) also does not lead to a satisfactory performance. Additionally, repeating a similar validation procedure with monopartite expressions for the score indices in Table 1 (for example by replacing |bi,j ∪ bj,i| with |ni ∩ nj|) did also not result in any good validation results. Therefore, while the network may contain an inherent bias towards high degree coformers due to its power-law degree distribution, it is still a coherent and bipartitely organized structure. This bipartiteness again stresses the importance of complementarity for the design of cocrystals, whether it is through hetero- or homosynthons (i.e. combining different or identical functional groups, respectively), and was successfully included in and confirmed by the selected link prediction method.
As shown in Fig. 3(a), the overall area under the precision-recall curve is generally low. Because the coformer network is built only with successful cocrystallization attempts in the CSD and does not include information on failures, it is impossible to distinguish whether an unknown combination (A(i,j) = 0) is truly non-existing or was in fact never experimentally verified. The correct number of false positives is therefore debatable, and the values of the precision values disclosed here should be seen as absolute lower limits for their actual values. The same combination of coformers may be assigned a different score value depending on the scoring method, and undiscovered cocrystals (or missing links) would therefore not always contribute to the number of false positives in a comparable way for a certain threshold. Yet, the precision–recall curves remain an adequate comparison method: the entire score spectrum of each scoring method is evaluated and normalized, accounting for the presence of missing links. We will show later that the coformer network is indeed heavily unsaturated, and many predicted combinations of coformers turn out to yield new cocrystals.
Table 2 summarizes the top-ten cocrystal predictions, scoring highest for the RA method.|| These predictions correspond to a validated precision of 95% or higher (see Fig. 3(b)), making the corresponding cocrystals very likely to exist. A recurring feature in these coformer pairs is their hydrogen bonding complementarity, combining coformers containing hydrogen bond donor functional groups (mostly carboxylic acid groups, but also alcohols) with coformers having hydrogen bond acceptor properties (pyridine rings). The link prediction method also recognizes combinations where the strongly favored carboxylic acid⋯pyridine and hydroxyl⋯pyridine synthons emerge,48 and ranks them among the most likely new cocrystals. Hence, from a supramolecular synthon point of view, it can be anticipated that these cocrystals are possible. Additionally, an aromatic group is present in half of the donor coformers, which implies that multiple types of intermolecular interactions will possibly co-exist in the final cocrystal structure of these coformers.
All the coformer couples shown in Table 2 were combined using equimolar amounts and ten new successful combinations were discovered (more details in the ESI†) with relative ease.** It is interesting to note that while all the coformers used in the experiments are either weak bases or acids, for some of the structures, the proton is not completely assigned to the carboxylic acid groups (e.g.7 and 9), making their classification as cocrystals not entirely correct. Moreover, the proton transfer for structure 5 is complete, hence indicating a salt. It may have been anticipated that some of the coformer combinations in the network actually fall into the salt–cocrystal continuum and that the coformer network does not solely contain information about possible cocrystals, but also about (serendipitous) salts. Besides the strength of the respective acid and base, the extent to which the proton is transferred is also dependent on the crystal packing,49 as a polymorph of the same combination of coformers may adopt a different protonation state. Therefore, the occurrence of salts in the top ten predictions of cocrystals is not completely unexpected. Besides salt formation, a cocrystal dihydrate and salt dihydrate were synthesized during the cocrystal screening (see the ESI†). However, water-free structures were obtained by changing the experimental conditions.
With the right analysis of the network's structure, a bipartite link prediction algorithm such as RA can complement the commonly used synthon approach. Additionally, it can extract rules from the network unknown or unclear to experimentalists, which may be decisive for the successful formation of cocrystals (e.g. matching of solubilities in the solvent, feasible interaction patterns, and the absence of steric hindrance). The cocrystals synthesized here are in fact missing links: their cocrystal formation profile is so well-defined (i.e. large k) that combining them may seem obvious in view of the internal bias of the network. Their determination is nonetheless the first step towards the experimental validation of the method. Further validation can be obtained by testing the prediction method for individual coformers, and one should realize that the prediction values are heavily underestimated. We are currently performing such experiments, and the results will be the subject of a subsequent paper.
A possible example is (RS)-ibuprofen, a non-steroidal anti-inflammatory drug, of which the structures of six cocrystals are present in the CSD. If one would assume that this API is not present in the database, then plugging the cocrystal information into the network yields several useful cocrystal candidates (Fig. 6), as the other coformers present in ibuprofen cocrystals are present in the database. For instance, the highest scoring combination with 1,2-di(4-pyridyl)ethylene (structure p1) corresponds to a precision of 89%. Its structure was in fact already determined by Elacqua18 but was not included in the CSD and is therefore part of the predicted combinations. The prediction with caffeine (structure p2, p = 50%) is also reasonable, as ibuprofen's known neighbors nicotinamide and isonicotinamide (n3 and n4) were found to exhibit a similar tendency to form cocrystals as caffeine (i.e. they are present in the same coformer cluster).34
Fig. 6 RS-Ibuprofen, the coformers it forms cocrystals with (neighbors n) and the five new combinations that are scored highest using the RA method (predictions p). |
It is thus clear that information of only a limited number of cocrystals can be sufficient to obtain new valuable coformer combinations. With the emergence of new connections, the adjacency matrix may be updated and whole new classes (or clusters) of coformers could be addressed by the scoring method, guiding the cocrystal screening process towards new combinations that might not have been considered otherwise.
A prerequisite of the network approach is the information of at least one cocrystal, either present in the CSD or determined elsewhere. When this is not the case, it is advised to set up a cocrystal screening method that uses information from the network (as described in the study of Devogelaer et al.34). For instance, by selecting a single coformer from each cluster, the initial screen attempts to include as much structural variation amongst the coformers as possible, thus preventing it to be biased towards one cluster or class of compounds. The discovery of one or several hits can then be followed up by our link-prediction approach described above. Alternatively, the target compound can be mapped onto the network by comparing its chemical fingerprint to those of known coformers through for instance the Tanimoto similarity measure.50 Coformers similar to the target compound are presumed to exhibit a similar tendency for cocrystal formation, and coformers present in cocrystals with these similar compounds are consequently chosen for screening. Such an approach lies, however, outside of the scope of this article.
Footnotes |
† Electronic supplementary information (ESI) available: Overview of the starting materials, experimental details of the synthesis of the cocrystals (S1) and their crystallographic data (S2), the scoring method's top 100 predictions (S3) and predefined lists of solvents and gases used for cocrystal classification (S4). CCDC 1940949–1940959. For ESI and crystallographic data in CIF or other electronic format see DOI: 10.1039/c9ce01110b |
‡ Generally recognized as safe. |
§ This is when crystallization of a racemic mixture results in crystals that separately contain only right- or left-handed enantiomers. Unfortunately, this behavior is more rare than racemic compound formation, where both enantiomers reside in the same crystal lattice.12 |
¶ Due to several improvements made to our classifier algorithm (including a better neutrality check) described in the study of Devogelaer et al.,34 this number is slightly different from our earlier work. |
|| The scoring method's top 100 predictions can be found in the ESI.† |
** Except for structures 1 and 5, crystals suitable for single-crystal X-ray diffraction were obtained by simply dissolving the equimolar mixture in a single solvent, which was subsequently evaporated. Resulting in adequate single crystals, the structure of cocrystal 6 is modulated and will be the topic of a future publication. |
This journal is © The Royal Society of Chemistry 2019 |