Reducing crystal structure overprediction of ibuprofen with large scale molecular dynamics simulations

Reduction of a large dataset of computationally predicted structures of ibuprofen by employing molecular dynamics and biased simulations at finite temperature and pressure.

Introduction structure calculations and introducing entropic effects through free energy calculations. [1][2][3][4][5] While the quality of the methods employed in the final refinement and ranking stage is constantly improving, their computational cost is typically prohibitive and approaches to achieve a rational reduction of the number of putative structures predicted by lattice-energy based methods (CSP 0) are needed. 4,6 Both in industry and academia, CSP methods are becoming increasingly popular given their success in predicting experimental crystal forms starting from only the molecule geometry. 4,[6][7][8][9][10] For the vast majority of these methods, the different crystal packings are generated ignoring thermal motion and assuming that the lattice energies are a reasonable approximation of the thermodynamic stability of experimental forms. These approaches, identified as CSP 0, 11 have been widely used in the 6th CCDC CSP blind test, 12 successfully reproducing the experimental forms. A limitation in using local minima of the lattice energy landscape as candidate polymorph structures is that the lattice energy landscape is rugged, and its local minima grossly outnumber the experimentally known polymorphs. This limitation makes it impossible to distinguish possible new polymorphs from artefacts of the CSP 0 static models. In fact, those states characterised by small barriers should coalesce to significantly more stable structures or melt at finite temperature and pressure. 1,11,13 Furthermore, it has been shown that the ensemble of configurations accessible corresponding to a single polymorph at ambient temperature can correspond to multiple lattice energy minima. 14 By introducing finite temperature and pressure effects, we have recently proposed a workflow able to reduce the number of putative polymorphs drastically. The procedure described in Ref. 15, applied to the cases of urea and succinic acid, consists of: • Molecular dynamics (MD) simulations to equilibrate all structures at 300 K and 1 bar • Automatically identify those states that are unstable or cluster those that belong to the same dynamic ensembles using structural probabilistic fingerprints • Perform enhanced sampling simulations on the cluster centres to overcome MD limits and assess their stability In this work, we apply the approach introduced in Ref. 15 to a dataset of 555 crystal structures of ibuprofen: an application of the size and complexity typical of modern CSP studies. Ibuprofen is a conformationally flexible, chiral molecule possessing two enantiomers with different pharmacological properties. S-Ibuprofen is the biologically active one while R-Ibuprofen needs to be transformed in the body to its S-counterpart. 16,17 This popular pain-relieving API is commonly available in its racemic Form I. 18 A more expensive enantiopure form, here labelled Form E, contains only the S-Ibuprofen. 19,20 More recently, a second lessstable racemic Form II was observed in a differential scanning calorimetry experiment. 21 Here, in addition to demonstrating the method's applicability to a dataset one order of magnitude larger than previously attempted, we describe new tools implemented in the analysis to handle sets of more than 500 CSP 0-generated structures. In particular, we introduced a fast molecule-dependent classification to reduce the time needed to compare crystal structures and cluster equivalent geometries. This improvement results in a rapid clustering analysis, which was repeatedly deployed at regular steps during biased simulations to systematically detect structural transitions without following molecular trajectories one at a time, an impractical task when dealing with several hundreds of finitetemperature dynamic simulations. Moreover, the application of the simulation workflow to a large dataset of mostly hydrogenbonded CSP 0 crystal packings has allowed us to quantitatively analyse the emergence conformational and orientational disorder at finite temperature, and to assess the persistence of H-bond interaction motifs. Finally, we investigate the dependence of the unsupervised clustering used to identify analogous structures on the choice of collective descriptors at the basis of the probabilistic fingerprints used to define a dissimilarity metric between finite temperature crystal supercells. With the improvements in efficiency introduced in this work, the reduction workflow introduced in Ref. 15 can be deployed efficiently to reduce CSP 0 sets of the size and complexity approaching those of real-world applications.
Methods CSP 0 lattice energy landscape. The ibuprofen search was carried out using CrystalPredictor1.9, 24 which allows the molecule to assume different conformations. In particular, we considered the two torsional groups of angles (t 1 , t 2 ) and (t 3 , t 4 ), shown in Fig. 1, that were separately varied from 0 to 360 degrees in 20 degree steps. The search used molecular fragments taken from the ab initio optimised molecule at the PBE0/6-31G(d,p) level of theory. The fixed point charges used in this initial step are obtained using the larger basis set aug-cc-pVDZ. The parameters from the FIT potential 25 with polar hydrogens 26 were used for the repulsion-dispersion contributions to the energy. The search was performed in 59 space groups with one molecule in the asymmetric unit cell. After removing the duplicates, the structures are labelled as their rank order at this stage, using the prefix R for racemic and E for enantiopure depending on their symmetry.
The resulting unique structures were then optimised with DFTB3-D3 27 to relax atomic positions and remove the possible unfeasible geometries derived from the use of rigid fragments of the molecule. The accurate evaluation of the lattice energies were performed with a single step DMACRYS 28 calculation, using distributed multipoles obtained from the PBE0/aug-cc-pVDZ charge density with GDMA2.2 29 and the repulsion-dispersion potential described in the previous paragraph. 25,26 CrystalOptimizer2.4.7.1 30 was used to finally optimise the 555 crystal structures within 10 kJ mol 1 of the global minimum. Both the crystal structure and molecular conformation are optimised in a two-level method, with the intramolecular energies and hessian evaluated at the PBE0/6-31G(d,p) level of theory and the intermolecular energy calculated from the distributed multipoles (extracted from the charge density at the PBE0/aug-cc-pVDZ level of theory) and the repulsion-dispersion parameters described above. The smaller 6-31G(d,p) basis set was found to accurately assess the conformation of the molecule, but it was not sufficient to model the electrostatic forces of molecules in the experimentally observed crystal structures, thus justifying the use of the augmented basis set.
Structure Preparation and Atom Typing. The General Amber Force Field 31 has been used to describe the ibuprofen molecule. Atom types are assigned with the AmberTools suite 32 while point charges are assigned with the AM1-BCC model. 33 Simulations are performed with the Gromacs MD package. 34,35 This requires atoms coordinate files to be written in the order specified in the forcefield. All atoms in crystals must then be rearranged to match the forcefield index. This is done by transforming molecules in graphs and applying the VF2 graph match algorithms 36 available in the Python library NetworkX. 37 Finally, in order to see possible transitions or formation of orientational disorder in a relatively small computational time, for each crystal we generated a supercell of at least 200 molecules. The simulation boxes are chosen to have a nearly cubic shape with each cell edge around 4.5 nm.
Energy minimisation. We optimised the atoms positions using the steepest descent algorithm. The neighbour lists are updated every 10 steps using the Verlet cutoff scheme. Electrostatic and van der Waals interactions are calculated using a cutoff of 1.0 nm while long-range interactions are treated with the smooth Particle Mesh Ewald (PME) 38 and Lennard-Jones PME. After a first atoms' position optimisation, we used LAMMPS to relax the cell For the relative position of molecules, we calculate the radial distribution function of the centres of mass. For the relative orientation, we define two sets of vectors connecting atoms C6-C8 and C7-C4 of each molecules, and calculate the angles, q 1 and q 2 , between them. The possible conformations are detected by looking at the C1-C2-C10-C11 and C7-C10-C11-H11 torsional angles, here labelled with f 1 and f 2 . The resulting distributions form the fingerprint of each structure. In the interest of simplicity, a supercell of 48 molecules is shown here but typical simulation boxes contain more than 200 molecules. (B) The similarity between each pair of structures is given by the norm of the Hellinger distances between distributions. This is calculated only between structures that share the same chirality and molecules conformation resulting in a distance matrix that avoids negligible comparisons and saves computational time.
parameters (feature not available in Gromacs), using InterMol 39 to convert the molecular forcefield. A second energy minimisation with Gromacs is performed to take into account differences between the two packages 39 . Finally, the GAFF lattice energies are estimated with the equation: where E vacuum is the energy of the isolated molecule and n mols is the number of molecules in the supercell. Equilibration at finite temperature and pressure. We performed a 3 ns simulation in the canonical ensemble at 300 K, followed by 4 ns in the isothermal-isobaric ensemble at 300 K and 1 bar. We used the velocity Verlet integrator with a 1 fs timestep. We controlled the temperature with the Bussi-Donadio-Parrinello thermostat 40 and equilibrated the systems at 1 bar for the first 1 ns using the Berendsen anisotropic barostat 41 and then switched to the Parrinello-Rahman barostat 42 for the following 3 ns.
Probabilistic Fingerprints. Effective descriptors of the different geometries generated should be able to handle the displacement of atomic positions from equilibrium in finite-temperature simulations and efficiently capture the differences between crystal packings. In this context, we previously proposed 15 a set of system-dependent probability densities that describes the relative position, relative orientation and possible conformations, , p i (f )}, as the fingerprint of each crystal when dealing with flexible molecules. PLUMED 2 43 has been extensively used to analyse trajectories and generate distributions.
An example of the inputs used to generate the components of the structural fingerprints described here are available on PLUMED-NEST, the public repository of the PLUMED consortium 44 , as plumID:21.019.
In the case of ibuprofen, the term p i (r COM ) represents the radial distribution function of centres of mass of molecules in the i th crystal structure. The relative orientation of molecules in the i th crystal structure is described by the 2D probability density distribution p i (q ), a function of the intermolecular angles q 1 and q 2 , obtained from two orthogonal vectors connecting the atoms C6-C8 and C7-C4 of the aromatic ring of the molecule, as shown in Fig. 2A. Finally the conformational contribution to F i , p i (f ), was defined following the conformational analysis reported in Ref. 45, which employs the 2D distribution of the global (f 1 ) and local (f 2 ) torsional angles shown in Fig. 2A. The former represents the relative orientation of the two substituents of the aromatic ring while the latter captures the relative position of the methyl groups. In this approximation, molecules can adopt six possible conformational states. In order to assess the generality of the choice of relatively coarse conformational descriptors, we compared it with an alternative, more fine-grained representation, making use of two 2D distributions (p i (t 1 , t 2 ) and p i (t 3 , t 4 ), Fig.  1). The two different approaches, despite the difference in level of detail, and of the associated computational cost, lead to similar results. Finally, the probabilistic fingerprints are complemented by an additional parameter used to classify structures based on their chirality. Structural (dis)similarity and Clustering The similarity between two fingerprints, F i and F j , is quantitatively determined by computing the Hellinger distance, H i j , between each equivalent distribution, defined as: where BC (p i , p j ) = R q p i (x )p j (x )dx is the Bhattacharyya coefficient and x is the vector variable used. The distance between structures i and j, D i j , is finally defined as the norm of the vector of Hellinger distances: However, prior to the clustering analysis, we want to remove those structures that are unstable at finite temperature and pressure and melt or develop into a disordered packing. Two strategies have been adopted in this context in order to take into account the emergence of both orientational and conformational disorder. Firstly, orientational disorder is considered by comparing the distribution of the intermolecular torsional angle, p(q 1 ), of the structures with an uniform distribution typical of the liquid state, p U (q ) = 1/2p, hence: In a second step the emergence of conformational disorder is assessed. To this aim we consider the torsional angle space , which presents six basins corresponding to stable conformers. 45 We can thus identify all the possible conformations the molecules adopt in a structure by detecting which of these basins are populated. We define as conformationally disordered structures that contain molecules conformations in 3 or more different basins. Note that point defects such as single molecule undergoing conformational transitions in the crystal bulk which can occur in stable polymorphs 45 at finite temperature, do not distort the probability density p i (f 1 , f 2 ) and would not lead to a classification of conformational disorder. We can now group together the finite-temperature putative polymorphs that coalesce to the same geometry. In order to reduce the number of comparisons needed, we can exploit two properties that we have already determined for each structure, namely the chirality and the conformations the molecules adopt in the crystal. D i j is therefore calculated only between structures that share the same chirality and conformer composition, drastically reducing the number of comparisons necessary to perform a full clustering of the trajectories, and resulting in the distance matrix in Fig. 2B. To each of the resulting subgroups, corresponding to square sub-regions of the dissimilarity matrix D in which the value D i j is defined (see Fig. 2B), we applied the Fast Search and Find of Density Peaks (FSFDP) clustering algorithm. 46 The structure with the lowest potential energy in each cluster is taken for the next step.
Metadynamics. In order to overcome MD timescale limitations and sample possible slow transitions, we performed Well Tempered Metadynamics (WTmetaD) simulations 47 on the cluster centres. The choice of CVs is motivated by the need to enhance structural fluctuations without specifically leading the transformation along a specific pathway. To this aim we used density and potential energy, a choice that has the advantage of being general and computationally efficient and therefore suitable for large sets of structures. Given their generality, these CVs can be applied to every crystal but are expected to enhance transitions only between similar structures and have a reduced accuracy in computing the free energy differences between two specific crystal structures. The bias potential is updated every 1 ps with Gaussians characterised by an initial height of 2 kJ mol 1 and a width of 10 kg m 3 for the density and 2 kJ mol 1 for the potential energy. These simulations are performed using Gromacs patched with PLUMED 2 43 . The work performed on the system through the introduction of a bias potential at a time t is represented by the re-weighting factor, C(t), 48 defined as: where b = 1/k B T , G(s) is the Gibbs free energy and V (s,t) the bias potential. We searched for possible transitions by looking at the distance RMSD (DRMSD) between pairs of atoms of different molecules using the initial structure as reference. By using a committor analysis in conjunction with DRMSD, both available with PLUMED, we stopped those simulations that exceed the value of 0.3Å. This value guarantees stopping the simulation and saving computational time in those trajectories that show a large distortion of the crystal packing, usually associated with the melting In addition, to automatically detect transitions between similar geometries not observed by the distance RMSD, we perform a cluster analysis every 0.5 kJ mol 1 of work. Finally, persistent structures are ranked based on their energy in the unbiased simulations.

Results
The CSP 0 analysis identified the global minimum, structure R227, as the experimental Form I with an RMSD 30 of 0.014Å between the CSP 0 structure and the experimental structure minimised with the same computational method. The search was performed considering Z 0 = 1. Hence the enantiopure Form E, which has Z 0 = 2, cannot be found. The high-energy structure R5596 approximately reproduces the geometry of Form II with an RMSD 15 of 0.66Å. However, the packing similarity analysis revealed a poor overlap between the two structures. We included in the finite-temperature analysis also the experimental structures IBPRAC16 18 (Form I), JEKNOC12 19 (Form E) and IBPRAC04 22 (Form II) available in the Cambridge Structural Database (CSD) 49 in order to monitor their evolution and predicted persistence throughout the different steps of the reduction process. From a lattice energy perspective, the difference in stability of the experimentally known polymorphs predicted at the CSP 0 stage is significant. Form E is found to be at +5.02 kJ mol 1 from Form I, while Form II is at +16.87 kJ mol 1 . The large scale set of crystal structures simulated at finite temperature and pressure displays a significant variability attested by the 14 different hydrogen-bonding motifs identified in the CSP 0 dataset. The motifs search was carried out using the CSD-Material module in Mercury. 50 The ring R 2 2 (8) motif 51 is the dominant intermolecular interaction motif, recorded in more than half of the structures, including all the experimental ones. H-bonded chains also account for a significant proportion of the structures in the initial dataset, with 151 unit cells displaying the C 1 1 (4) motif and 41 unit cells stabilised by the C 1 1 (2) one. Lattice energies are very sensitive to the method used, so when comparing the CSP 0 energies to GAFF, differences are expected. In general, GAFF tends to overestimate the lattice energy differences. Despite this, Form I is found to be among the most stable structures, 4th in the ranking. Form II was confirmed to be very high in energy, at +28.15 kJ mol 1 from the global minimum. Form E is located between them at +8.77 kJ mol 1 , confirming the relative ranking obtained at the level of theory deployed in the CSP 0 step.
The reduction process starts by equilibrating all structures at 300 K and 1 bar. Fig. 3, shows that around 40% of the structures melt or present disorder after 4 ns of dynamic simulation in the NPT ensemble. The remaining structures are then clustered based on their chirality and molecule conformations.
In the largest group of racemic structures, molecules show conformational flexibility along the local torsional angle (See Fig.  2), producing two peaks in the conformational component of the structural fingerprint. Experimental Form II is among these with less than 10% of the molecules in the supercell assuming a distorted conformation.
The clustering analysis shows that only a few states coalesce into common finite-temperature crystal structures while most of them preserve their geometry. The stable Form I is one of the few systems that produce a cluster, and it is among the most populated ones. On the other hand, Form II formed a small cluster with the CSP 0 structure that best matches its geometry while no structure transformed to a configuration compatible with Form E.
Cluster centres are then subject to WTmetaD simulations. In order to automatically analyse trajectories and detect transitions as a function of the work performed by the WTmetaD bias, fingerprints are generated at every increment of 0.5 kJ mol 1 of C(t) or by looking at the last frames of those trajectories stopped due to large fluctuations of the DRMSD (see the methods section). Every time fingerprints are generated, clustering is performed, identifying structures that convert and are thus removed from the count of independent, persistent structures.
As shown in Fig. 3, the number of persistent putative polymorphs decreases throughout the workflow. By the end of the analysis, from the initial set of 555 CSP 0 lattice energy minima, we retain 205 persistent structures, corresponding to a 65% of reduction. All experimental structures came out as thermodynamically stable, preserving their geometry during finite-temperature biased simulations. In Fig. 4A, we show the lattice energy landscape at 0 K and depict them based on their behaviour at 300 K.
Orientationally disordered structures at finite-temperature are on average, located at higher energies in the 0 K landscape than the structures exhibiting conformational disorder. Persistent crystal structures at 300 K span over the entire lattice energy range. The overlap in lattice energy between the distributions of labile and persistent crystal structures highlights how a reduction of the lattice energy landscape based solely on lattice energy is insufficient and would actually miss high energy experimental structures like Form II. The resulting finite-temperature crystal energy landscape, in Fig. 4B, shows a general decrease in the potential energy difference for those structures that survive.
Among the dominant hydrogen-bonding motifs, the ring motif R 2 2 (8) and the chain motif C 1 1 (4) were shown to be more persistent than the average, with a decrease in number of structures of 29% and 41%, considering all stable structures. While being present in Figure 4: Comparison between the crystal energy landscape at 0K and 300K. (A) CSP0 lattice energy landscape. The symbols highlight how the static states at 0K will behave at 300 K and 1 bar. Structures that are persistent and thermodynamically stable at finite temperature and pressure are represented with blue dots. Structures that develop a disorder are shown with a green triangle or orange square whether it is orientational or conformational. The plot on the right shows how these three groups are distributed over the energy axis. (B) Final finite-temperature crystal energy landscape obtained with GAFF with the experimental forms I, II and E highlighted in red, orange and green. (C) Behaviour of the surviving H-bonding motifs at finite temperature and pressure, with the four most common shown in the blue boxes. For each of them, we show the number of structures that preserve or convert to that motif at the end of the analysis (in blue), those that result in a disordered structure (in green or orange whether the disorder is conformational or orientational) or transform to another motif (in red), rescaled by their initial occurrence in the CSP0 set. the final set, the chain motif C 1 1 (2) tended to convert to the more stabilising C 1 1 (4). Looking at the rare motifs, 9 of them disappear during the analysis. Motifs D 3 3 (10), R 3 3 (12), R 3 3 (6), R 4 4 (8), R 6 6 (12) and R 6 6 (24) all result in melted structures while motifs C 2 2 (6), R 2 2 (6) and R 4 4 (12) transform to other motifs.

Discussion
Through an MD-based reduction of the lattice energy landscape we drastically reduced the number of putative polymorphs of Ibuprofen. Form I, the most stable experimentally known polymorph came out as second in the final ranking with structure R4124 being the global minimum (see the ESI † for a complete list of crystal structures, energies and labels). However, many structures converted to the experimental form, suggesting that form I could act as kinetic trap for labile states. Form I and the enantiopure form E were able to preserve their CSP 0 geometry with little variation due to the molecular motion. In form II, a few molecules dynamically change conformation during biased and unbiased simulations, showing the possibility of dynamic disorder in the crystal at standard conditions. This is evidenced also by the presence of two conformationally disordered structures, R2315 and R6595, that resemble form II. Structure R5596, which presents geometrical similarities with form II at 0 K, effectively convert to the experimental one at 300 K. The R 2 2 (8) motif is the most frequent intermolecular interaction motif in the final set being present in 119 structures and is dominant among the low-energy structures. From Fig. 4C, we can see that structures associated with this motif, are most likely to preserve their H-bonds. However, other motifs seem to be favoured at 300 K compared to their initial number in the CSP 0 set. In particular, the chain motif C 1 1 (4) has a similar persistence to R 2 2 (8) and is present in 10 of the 18 enantiopure structures, including the most stable one (E6134). This could indicate that the hydrogenbonded carboxylic acid dimer of the R 2 2 (8) pattern is favoured during the nucleation or growth process. The rotation of the carboxyl group is associated with the inter-conversion between mo-tifs C 1 1 (2) and C 1 1 (4) with the balance shifted towards the latter. Interestingly, 12 structures are shown to be persistent at finitetemperature and pressure despite the lack of H-bonding motifs and their high potential energy. The use of highly polar solvents that preclude H-bonding interactions could favour the formation of these structures. 52 In Fig. 4A, the states that are effectively persistent at finite temperature and pressure are shown as blue dots. From the probability density on the right of the same figure, we can see that some of these are high energy structures. This implies that the use of energy cutoff, although often necessary, can lead to remove relevant geometries from the analysis. In this case, the ibuprofen form II could have been ignored being higher in energy than the typical energy cutoffs used, usually in the range 5-15 kJ mol 1 .

Conclusions
In this work, we have tackled the systematic reduction of a largescale dataset of CSP 0 crystal structures, including 555 putative crystal structures of ibuprofen by systematically applying MD simulations. 15 To scale up one order of magnitude in the number of crystal structures considered, compared to previous studies, we implemented new strategies to further increase the efficiency of the clustering and analysis protocols, drastically reducing the need for manual inspection of the trajectories in different steps. In particular, through a systematic conformational analysis, we could automatically detect disorder formation in the simulation box. Moreover, partitioning a-priori the distance matrix in subsets based on the number and type of conformers present in the crystal structure, small and fast to manage, allowed us to efficiently repeat the clustering analysis at regular intervals during the metadynamics simulations. This procedure allowed us to detect transitions under progressively enhanced fluctuations of the supercells density and lattice energy. The systematic setup and analysis of 555 trajectories, which altogether amounts to 8 µs across multiple MD protocols, is made possible by an ad hoc Python library, available at github.com/mme-ucl/pypol.
Applying this approach to a set of 555 CSP 0-generated structures of ibuprofen resulted in a 65% reduction of the number of predicted structures, leading to a group of 205 persistent lattice structures. All the experimentally known structures persisted throughout the analysis with minor variations from their original geometry. Interestingly, despite the significant variability in the intermolecular interaction motifs present in the initial dataset (14), we find that the motifs R 2 2 (8) and C 1 1 (4) are dominant in the final set.
By taking advantage of this implementation, we can now study sets of structures of the typical scale of real-world CSP 0generated crystal energy landscapes. We believe that this method is a valid alternative to the simple application of an energy cutoff criterion to select crystal structures as promising putative polymorphs.