Will they co-crystallize? †

A data-driven approach to predicting co-crystal formation reduces the number of experiments required to successfully produce new co-crystals. A machine learning algorithm trained on an in-house set of co-crystallization experiments results in a 2.6-fold enrichment of successful co-crystal formation in a ranked list of co-for-mers, using an unseen set of paracetamol test experiments.

Co-crystals are multi-component crystalline materials that can be assembled via hydrogen bonds, 1-5 halogen bonds [6][7][8] and/ or π⋯π stacking. 9,10 Co-crystallization has attracted academic and industrial interest because it allows the physiochemical properties of active pharmaceutical ingredients (APIs) to be altered, for example bioavailability and solubility, 11,12 compressibility, 13 hygroscopic stability, 14 intrinsic dissolution rate, 15 and thermal properties. 16 The four most common hydrogen bond supramolecular synthons used in the design-phase of co-crystallization studies are shown in Fig. 1. 17 For a hydrogen-bonded co-crystal to form, there must be a degree of complementarity between the two components (co-formers), thus, careful co-former selection is crucial. 18,19 The hierarchical nature of supramolecular synthons is considered a key factor in accessing heteromeric interactions in the solid state. The work of Margaret Etter is fundamental in our understanding of hydrogen bond hierarchy. 20,21 In addition to 'Etter's rules', Hunter reported a set of numerical guidelines for quantifying the molecular interactions of organic molecules in the solid state, by assigning hydrogen bond strength parameters based on calculations of the mo-lecular electrostatic potential surface to hydrogen bond donors and acceptors. 22 Since the strongest donors and acceptors are likely to hydrogen bond with each other, 21 a hierarchical list of these can be used to predict the interaction energy for each pairing until all possible contacts have been made, with excess donors or acceptors ignored. The sum of these interaction energies gives a measure of the stability of a co-crystal relative to the pure components without any knowledge of three-dimensional structure. 23 This approach provides an estimate of the probability of a co-crystal forming, allowing a set of potential crystal co-formers to be ranked and has been shown to be able to identify new cocrystals.
Previous methods to predict co-crystal formation have focussed on comparison of melting points of the co-crystal and the pure components, 16 or co-crystal structure prediction. 24 However, such methods are computationally expensive and require significant calculation for each new set of potential co-crystal components, since the method requires generation of trial structures. It has also been reported that effective cocrystal screening can be achieved using a fluid-phase thermodynamics model to calculate the excess enthalpy of the interactions between a mixture of API and co-former relative to the pure components in a virtually supercooled liquid mixture (which can be approximated to the mixed solid phase crystal). 25 Experimentally, determination of new co-crystals involves systematic screening with a large range of co-formers, typically analysed via IR, DSC, and powder X-ray diffraction (PXRD), with single crystal X-ray diffraction used if possible. Although this approach is effective, it is costly in both time, effort and laboratory resources, particularly when considering the number of failed attempts in a traditional co-crystal screen.
Statistical analysis of the descriptors of components of cocrystal structures extracted from the Cambridge Structural Database (CSD) has uncovered correlations between the shape and polarity of co-formers, 26 but the lack of data on failed co-crystallization experiments mean that no predictive model has been derived from this to date.
A knowledge-based method based on hydrogen bond propensity (HBP) analysis 27 of the CSD 28 to determine the likelihood of co-crystal formation by assessing the probability of the homo-and hetero-interactions has been shown to successfully identify some co-crystals of paracetamol. 29 Machine learning algorithms can be used to produce predictive empirical models from suitable data and have previously been applied to successfully predict whether small organic molecules will or will not crystallize, using molecular descriptors calculated from a two-dimensional representation of a molecule. 30 More recently, artificial neural networks have been successfully applied to predict the melting points of cocrystals. 31 Such data-driven approaches have not been used to directly predict co-crystal formation.
In this paper, we report the use of a machine learning algorithm to generate a predictive model which can classify pairs of co-formers as successful or unsuccessful with respect to co-crystallization using simple descriptors of the co-former molecules. Estimates of the potential success of a particular pair of co-formers can be used to generate a ranked list that will enrich the identification of experimentally-determined successful co-formers at the top of the list.
Since the literature contains almost exclusively positive results for co-crystallization studies, it was necessary to generate a landscape of both positive and negative experimental data to support the training of the machine learning algorithm. A set of 20 target molecules was initially selected to be screened against 34 substituted aromatic acid and amide coformers ( Fig. 2 and 3). Careful consideration was given in the selection process to incorporate the potential for the four main supramolecular synthons (Fig. 1). Assessment of cocrystal formation was based upon changes in the PXRD pattern when accompanied in IR by a shift of the characteristic peaks traditionally involved in hydrogen bonding. In some cases, DSC was also used to assess co-crystal formation. Experimental data for all three techniques are in the ESI. † 191 standard molecular descriptors were calculated for each compound using the RDKit cheminformatics toolkit, version Q1 2016. 32 The descriptors for certain amine, amide and ether functional groups were slightly modified to uniquely identify the functional groups in more complex molecules, while additional descriptors were added to account for aryl halides. This gave a total of 195 descriptors for each co-former compound, see ESI. † Co-crystal descriptors were created by concatenating the co-former descriptors and the acid or amide descriptors along with an extra descriptor implementing the values developed by Hunter 22 This gave a total of 391 descriptors for each potential co-crystal.
The label "1" was assigned to known or experimentally determined co-crystals or salts and the label "0" assigned to those experiments that did not result in a new solid form. Those combinations for which the outcome was uncertain were removed from the dataset, giving a total of 657 training data points (403 unsuccessful, 254 successful). Of these 254 successful data points, 44 were already reported in the literature (39 co-crystals and 5 salts) and the remaining 210 represent novel solid forms. We are confident that this training data is a useful set for predicting co-crystal formation due to the low number of salts found in the CSD, and the low success of salt formation from dry grinding, although the likelihood of co-crystal over salt formation can be assessed by comparing co-former and API pK a values. 33 Machine learning algorithms and performance metrics from version 17.0 of the scikit-learn package were used. 34 Support vector machines (SVMs) were used as the machine  learning algorithm to create the predictive model using the molecular descriptors, having previously been found to give the best performance for a similar classification problem. 30 The best-performing parameters for the algorithm were determined by a grid-search of parameters using a five-fold crossvalidation on the entire training dataset 35 and were subsequently fixed at values of C = 10 and gamma = 0.001. A balanced class weighting was used to account for the potential bias caused by the greater number of unsuccessful co-crystal experiments in the training set.
An external validation set was created from Wood et al., which collated paracetamol studies from a variety of sources. 29,[36][37][38] This gave a total of 34 potential co-formers, of which 13 successfully formed co-crystals. ‡ The descriptors were calculated in the same way as for the training set.
In addition to the classification accuracy on the validation set, the capability of the approach to "enrich" the number of successful hits in a ranked list was calculated, since this has practical application in reducing the number of experiments required to find co-crystal forms. For the external validation set, the co-crystals were ranked according to the probability estimate given by the predictive model. This was compared to the actual hits in order to quantify how many successful co-crystals were identified by this selection method. From this ranking, an enrichment factor (EF) provides a numerical score that quantifies the observed success rate at the top of the list relative to randomly sampling the list. For the top x% of the ranked list: where N hits is the number of hits in the top x%, N x is the total number of successful and unsuccessful co-crystals in the top x%, N total hits is the number of hits in the whole list and N total is the total number of successful and unsuccessful coformers in the whole list.
The probability estimates from the predictive models were also used to generate a receiver operating characteristic (ROC) curve for the validation set classification, which measures the ability of the model to rank the successful cocrystals relative to the unsuccessful ones. 39 The area under the curve (AUC) provides a quantitative measure of the accuracy of the ranking. 39,40 The predictive accuracy of the model in classifying the coformers as forming successful or unsuccessful co-crystals with paracetamol was poor at 64.7%, much lower than the crossvalidation accuracy on the training set (75.0% ± 1.4%). The confusion matrix (Table 1) shows that the model has a ten-dency towards predicting more of the combinations as being successful co-crystals, which reduces the number of false negatives (successful co-crystals that are incorrectly marked as unlikely to form and so would be missed). This may be a result of differences between paracetamol and the co-formers making up the training set, which could affect the reliability of the probability cut-off that the model uses to make its predictions.
However, Fig. 4 shows that the list of co-formers ranked by the probabilities obtained from the model successfully identifies 9 of the 13 co-crystals of paracetamol within the top 11 suggestions in the list. The AUC of 0.85 is significantly better than the AUC of 0.66 obtained from the HBP method, 29 as shown in Fig. 5. This gave an EF 25 of 2.6, corresponding to 100% successful prediction in the top 8 (25%). This suggests that although the probability cut-off between successful and failed co-crystals used by the algorithm to perform the binary classification may be wrongly positioned, the probability ranking itself provides a way of identifying co-formers which are likely to form co-crystals, reducing the number of experiments required to successfully identify co-former pairings.
The importance of ensuring that co-former molecules lie in a similar area of chemical space to the molecules used for training the model is illustrated by salsalate, for which the current model predicts the same probability of co-crystal formation regardless of the co-former paired with it. On examination of the scaled descriptors, 39 salsalate descriptors are found to have values greater than 3 standard deviations from the mean of the training set (compared to 5 descriptors for paracetamol). This is an indicator that the distance between salsalate and the training molecules in chemical space is too great for the model to provide a sensible prediction. Consequently the descriptors of test molecules need to be examined carefully after scaling to ensure that the existing model is suitable for use with that particular The ability of the model to successfully rank co-formers other than those included in the training dataset indicates that the co-formers and descriptors used to train the model provide enough information to allow the model to be applied to a wide range of co-formers. Increasing the scope of the model can be envisaged by retraining the algorithm using a larger range of APIs and co-formers.
In comparison to other methods used for predicting the ability of molecules to co-crystallize together, 26,29 this approach requires a large amount of experimental work to be undertaken to generate the initial training set. Academic and industrial researchers in the field will have access to previous experimental data containing both successful and unsuccessful results, meaning that this will not hinder the implementation of this methodology.
In summary, we have demonstrated that a machine learning algorithm trained on an in-house set of co-crystallization experiments using simple descriptors as the input can be used to guide selection of co-formers for a particular API. This is likely to assist industry by saving both time and resources on experimental screens, particularly in the early stages of co-former selection.

Conflicts of interest
There are no conflicts to declare.