Enrico
Berardo
,
Lukas
Turcani
,
Marcin
Miklitz
and
Kim E.
Jelfs
*
Department of Chemistry, Imperial College London, South Kensington, London, SW7 2AZ, UK. E-mail: k.jelfs@imperial.ac.uk; Tel: +44 (0)207 594 3438
First published on 11th September 2018
The chemical and structural space of possible molecular materials is enormous, as they can, in principle, be built from any combination of organic building blocks. Here we have developed an evolutionary algorithm (EA) that can assist in the efficient exploration of chemical space for molecular materials, helping to guide synthesis to materials with promising applications. We demonstrate the utility of our EA to porous organic cages, predicting both promising targets and identifying the chemical features that emerge as important for a cage to be shape persistent or to adopt a particular cavity size. We identify that shape persistent cages require a low percentage of rotatable bonds in their precursors (<20%) and that the higher topicity building block in particular should use double bonds for rigidity. We can use the EA to explore what size ranges for precursors are required for achieving a given pore size in a cage and show that 16 Å pores, which are absent in the literature, should be synthetically achievable. Our EA implementation is adaptable and easily extendable, not only to target specific properties of porous organic cages, such as optimal encapsulants or molecular separation materials, but also to any easily calculable property of other molecular materials.
Potential applications of porous molecular materials include as encapsulants,6 in catalysis,7 molecular separations,8–11 and as sensors.12,13 Most promising are applications in molecular separations, where some molecules have been found to be effective for gas separation,11 separation of aromatics,9 separation of alkanes/alkenes8,14 and chiral separations.11 Several of these applications occur in solution rather than in the solid-state, a unique possibility due to the molecular nature of the material. More recently, an intriguing new application has been reported – the use of molecular cages to form porous liquids.15
Porous molecular materials still only number in their hundreds, compared to, for example, the many hundreds of thousands of MOFs reported.16 Computation has been playing a significant role in the discovery of new porous molecular materials.17 Evans et al. screened the Cambridge Structural Database to identify possible porous molecular materials from previously reported crystal structures.18 Using machine learning, they identified that a large molecular surface area was a predictor of high crystal porosity. For molecular cages, one of the challenges of a priori prediction is the emergent behaviour in the reaction outcome from the molecular precursors. Simple changes to the precursors can double the mass of the product, changing its molecular topology and consequently its properties, for example losing any potential porosity through loss of shape persistency.19 We have previously outlined the 20 most likely topologies for molecular cages, split across four families, which are distinct based on the number of precursor reactive end groups.20 Furthermore, the potential for computational prediction of the reaction outcome by comparing the relative energy of the different assemblies20,21 or considering their reaction pathways has been shown.22 Recently, we tested this hypothesis on a larger scale in a combined robotic and computational screening, where 33 new porous organic cages were discovered.23 This demonstrated the value of computation; the topological outcome could be predicted where there were sufficiently large thermodynamic driving forces, and further, formation energies of the cages could be correlated with the likelihood that a molecular cage as successfully formed on the robotic platform. Calculations can therefore prevent synthetic effort being wasted on unpromising reactions, and also, through further computational property calculations, allow full scale-up and characterisation to only go ahead on the most promising materials.
At the solid-state level, it has been shown that the low energy polymorphs and therefore crystal packing of porous molecular materials can be predicted using crystal structure prediction techniques originally developed for pharmaceuticals.24–27 Most recently, this approach has been used to calculate energy–structure–function maps and thus guide the discovery of targeted properties in extrinsically porous triptycenes.5 For intrinsically porous molecular materials, there have been many computational studies demonstrating the possibility of predicting separation performance in both the crystalline and amorphous solid-state, through, for example, calculation of the pore network, sorption uptake and diffusion barriers for kinetic separations.10,11,28,29 In all cases, a consideration of the flexibility of the porous molecular host has been shown to be critical, particularly in understanding the sorption of guests that look too large to diffuse through the systems from inspection of static crystal structures alone.30 Recently, we have demonstrated the potential for rapid property screening of porous molecular materials through a molecular analysis of the molecules alone, without considering the bulk structure or influence of crystal packing.31 This approach allowed us to discover a promising previously reported material, noria, whose potential for Xe/Kr separation we then validated. In the area of metal–organic polyhedra, Hay et al. have developed the software HostDesigner, which designs molecular receptors for guests.32–37
The chemical and structural space of hypothetical porous molecular materials is enormous, given they are formed from building blocks of organic molecules. Most porous organic cage systems to date have been synthesised using dynamic covalent chemistry (DCC), where the reversible nature of the reaction allows error correction towards high symmetry, discrete structures. If we consider the most commonly used chemistry for cage synthesis, imine condensation, then inspection of only the online Reaxys database of previously reported molecules,38 finds on the order of 105 total aldehyde and amine precursors. If you combine these into two-component cages in all possible topologies, then you already have 107 potential host molecules to consider. For this reason, we have developed an evolutionary algorithm (EA) for the prediction of molecular materials, and demonstrate here its application to the discovery of porous molecular materials. EAs are inspired by biological evolution, where candidates are evaluated for their ‘fitness’, which determines their likelihood of proceeding to a subsequent generation. At each generation, several candidates will undergo random mutations and pairs of candidates will ‘reproduce’, with their chromosomes undergoing crossover. EAs, such as genetic algorithms, have been widely applied in chemistry, including for structure prediction39 or determination and in drug discovery.40 EAs efficiently search disparate regions of multidimensional phase space with multivariable optimisation and are particularly well suited to problems where a computationally cheap calculation can be used to determine a candidate's fitness, as can be the case for porous molecular materials.
Evolutionary methods have been developed both for experimental41–43 and computational44–55 materials discovery and optimization. To date, very few evolutionary approaches have been employed for the study of molecular materials, as this involves multiple obstacles, such as (i) the huge diversity in chemical bonds and flexibility in regards to electronic and structural properties, (ii) the need for a framework for the assembly and the correct generation of the molecular structure, and (iii) the lack of cheap and accurate descriptors that link structure with the potential properties of the final material in many cases.
In this work, we report on the extension of our previously reported software, the supramolecular toolkit (stk),56 which can assemble a variety of (supra)molecular materials, including porous cages, and then automate their property calculation. stk is written in python and makes use of utilities within the RDKit cheminformatics libraries.57 Within stk, each EA individual is defined as a Python object, which allows for the labelling and tracking of its compositional (e.g. precursors, atoms, bonds) and structural (e.g. cavity size, diameter, total energy) properties throughout the evolutionary run. Whilst here we develop the EA for the discovery of cages, it is easily extendable to other (supra)molecular materials in the future. We focus on the molecular prediction of cages, having previously demonstrated the utility of this approach for property prediction,31 and as this significantly decreases the calculation cost of the fitness function compared to solid-state calculations. After first showing that our approach can efficiently rediscover a previously reported cage, we then apply the software to two distinct case studies: (i) targeting elusive shape persistent cages and (ii) targeting a specific pore size, 16 Å, which has not been previously synthetically reported. This will show the utility of our software, which could easily be targeted at specific properties, such as encapsulation or molecular separation, in the future for porous molecular materials, or extended to the broader field of (supra)molecular materials.
Once all the individuals in the initial population have been generated, each cage in the population has its fitness value evaluated (as described below). The individuals are ranked according to their fitness value and, depending on the specific selection algorithm, a set of parent structures is chosen. New offspring structures are generated starting from the selected parents by applying the genetic operations of crossover and mutation, shown in Fig. 2A and B. Both genetic operations modify the chromosome of a molecular cage by the substitution of its constituent genes. In the case of crossover, the genes of two parent cages are mixed to generate two new offspring, for example by switching BBs or topologies. For mutation, one of the existing genes is replaced by a new random one from a database of BBs or a list of topologies, respectively. The newly generated offspring structures have their fitness value calculated and they are used to replace the worst performing individuals from the current population, thus maintaining a constant population size. As shown in Fig. 1A, this cycle continues until a convergence criterium is met or the EA reaches the maximum number of allowed generations. Specific convergence criteria could be the appearance of a particular individual in the current population or that the top 5 cages remain unchanged for 20 generations, suggesting that the EA run has reached a plateau.
The 3D coordinates of each molecule within the chemical database were obtained starting from SMILES codes by using the ETKDG59 algorithm as implemented in RDKit, which allows for the efficient generation of reasonable conformations of small molecules. Differently from other recent applications of EAs for computational chemistry and materials discovery,47,48,53,60 we do not allow for the chemical modification of the original precursors during the EA search. In our implementation, the mutation step can only perform a substitution of one of its two building blocks of a cage with a completely new one from the chemical database. From an experimental point of view, we believe that this approach helps us to circumvent a common problem in EAs, where chemically infeasible candidate structures are generated that can not be synthetically accessed. BBs included from Reaxys have been previously synthetically reported and all cage structures built here can hypothetically be synthesised in a one-pot imine condensation reaction. Within this work, we do not investigate the synthetic accessibility of the final candidates, instead we allow any possible combination of the precursors available from the original database. However, synthetic accessibility can be addressed in the future by including synthetic rules or scoring in the EA's fitness function and refining the original chemical database or using a custom database.
By completely substituting one of the three genes of a chromosome, a mutation step strongly modifies the chemical/structural properties of a cage, possibly losing all the evolutionary advantage that has been developed up to that specific generation in a EA run. For this reason, we developed the similarity mutation, as shown in Fig. 2C, where the current gene (or building block) within a cage is replaced by the most similar BB from the database, as calculated with Dice similarity.57,61 Applying the similarity mutation multiple times to the same molecule yields the next most similar molecule in the list every time. For example, running the similarity mutation for the molecular target in Fig. 2C will lead to molecule #1 if run a single time, to molecule #2 the second time, and so on. In a usual EA run, we alternate between the random and similarity mutation, as we have found that this specific combination leads to a faster convergence of the run, allowing an efficient exploration of the overall chemical space.
Fig. 3 Schematics of the Tri4Di6 and Tri8Di12 topologies. Tri-topic and di-topic chemical precursors are represented by blue and purple sticks, respectively. |
Our fitness function, shown in eqn (1), can be specified from a series of three parameters that we believe are important for the design of reasonable candidate porous organic cages. We chose each parameter after a detailed analysis of the structures of the porous organic cages recently synthesized in the literature.1,2,20,62 The parameters allow for the evaluation of the candidate's porosity, flexibility and degree of symmetry. Each parameter, which will be addressed in the next section, defines the penalty that is going to be applied to the fitness value of each candidate, the larger the penalty, the lower the final fitness function of the candidate. These various contributions can be combined inside the fitness function by using a series of coefficients (a, b, c in eqn (1)), so that the same general fitness function can be tuned for a variety of different applications or targets. More details on the specific implementation of the fitness function can be found in the ESI.† The fitness function is calculated as:
fitness = (aΔPore + bΔWindow + cAsymmetry)−1 | (1) |
In a similar way, we can compare the window size of the candidate, calculated through the use of the software pywindow,63 against an ideal window size, the Window parameter in eqn (1). The cavity diameter is defined as two times the distance between the center of mass of the molecule and the closest atom (i.e. the diameter of largest sphere that can fit in the centre of the host molecule), whereas the diameter of a window is determined by the largest circle that can fit in the window.
The cage structures were assembled using our stk software, as previously described.56 In brief, a molecular cage is generated by placing the tri-topic BBs on the vertices of the topology, whereas the di-topic BBs are placed on the topology edges, which connect two adjacent vertices. Once the BBs are placed on the hypothetical topology, the relevant atom groups are then linked through imine bonds, which replace the existing functional groups of the original BB (e.g. aldehydes and amines). The newly formed bonds and the atoms directly linked to them are initially relaxed while constraining the remaining atoms within the molecular cage. Following this, the geometry of the whole molecular cage is optimised. Finally, we employ high-temperature molecular dynamics (MD) to probe the flexibility of the optimized individual and its tendency to retain its original cavity – also known as its “shape persistency”. In this step, for each candidate, we perform the MD simulation and extract a series of conformers along the trajectory. We then geometry optimize all the extracted conformers and select the lowest energy conformer, which represents the best guess of the experimental geometry upon desolvation (more information on this step can be found in the ESI†). In all stages of the geometry optimisation, we employ the OPLS3 force field,64 which we have previously shown effectively predicts the structure of flexible porous imine cages.20
When facing this excess of functions and options, it can be difficult to know which settings, or combinations of settings, will lead the EA to have better or worse performance. To address this question, we ran the EA with 360 different input files, where each input file contained a distinct combination of functions available within our EA implementation. For each input file, 2500 individual runs of the EA were completed. We have run each setup multiple times as EAs are stochastic algorithms and the generation by which the global minimum is found is dependent on the initial population. Since the initial population and mutation operations are random – the generation by which the global minimum is found is given by a distribution of values. A run was considered complete whenever the global minimum structure was obtained or whenever 100 generations were reached. Each run employed a fitness function developed for the re-discovery of Covalent Cage 3 (CC3),68 a well-known example of a highly symmetric Tri4Di6 imine cage, with a pore size of 5.72 Å and average window sizes of 3.91 Å when OPLS3 optimized.
To allow for the consecutive execution of 900000 EA runs (360 × 2500), we defined a smaller chemical space, for which we could pre-generate and optimize all the possible molecular cages. The fitness value of each individual was then pre-calculated with the fitness function targeting the CC3 cage (pore size and window size used in eqn (1)). This means that for each EA run, the entire search space was loaded into memory and no calculations needed to be performed. Using this setup, 100 EA generations could be completed in approximately 1 s. It also meant that the individual with the largest fitness value, the global minimum CC3 could be found ahead of time. The selected mock chemical space contained 142 trialdehyde BBs and 350 diamine BBs selected randomly from the initial chemical database, as well as the BBs for CC3. For this mock chemical space, we assembled, optimized and calculated the properties of all the possible Tri4Di6 cages, which amounted to a total of 49700 individuals.
In each EA run, we explored an initial population of 25 individuals, allowing 20 possible crossover operations and 5 mutations per generation for up to 100 generations. This means that in a EA run which successfully reaches 100 generations, 2500 possible individuals are explored, corresponding to only the 5% of the total chemical space. As shown in Table S2,† the different EA setups have a probability of finding the CC3 cage that ranges between 0 and 59% of the times (over the 2500 runs). The best performing setups show a huge improvement when the EA is compared to randomly picking BBs from the chemical database, which only gives a probability of 5% of finding the global minimum. Among the best performing setups, we chose the one including the diverse initial population as we believe that this type of initialization function will perform even better with the large chemical space used for the following case studies. All the functions used for the EA setup can be found in the ESI.†
The fitness function for this investigation employed the coefficients a = 10, b = 10, c = 1 from eqn (1), where we defined ideal pore and window diameters of 5.72 and 3.91 Å, respectively. We wanted to put a strong emphasis on the pore and window sizes in order for the global minimum to precisely match the properties of the OPLS3 optimized CC3 cage. We also gave some importance to the level of symmetry of the cage.
Fig. 4 shows the distribution of the fitness value for the 49700 molecular cages obtained, plotted as a function of molecular similarity with the BBs used for CC3 (triformylbenzene and cyclohexane diamine). Molecular similarity is calculated by means of Dice similarity through the use of Morgan fingerprints with a standard radius of 1 within RDKit.57,69 The plot clearly shows that the individual with the highest fitness value is CC3, located on the top-right point of the plot (Fig. 4A). No other solution exhibits such a high fitness value, confirming that the fitness value was tightly linked to the global minimum structure (CC3). However, a few other individuals have medium fitness values (dark-orange, larger dots). For most of those cases, at least one of the BBs shows high Dice similarity with the CC3 BBs (Fig. 4B and C). Both B and C cages have very similar (equivalent to the first decimal place) pore and window sizes compared to CC3, but show overall lower fitness value due to a decreased level of symmetry. Cage B is obtained by mixing a cyclohexane diamine and a substituted (–OH) triformylbenzene, whereas for cage C, triformylbenzene is mixed with a different diamine. Some interesting cages still exhibit a medium fitness value even if their precursors are very different to the ones employed for CC3 (Fig. 4D). In this case, one BB is highly functionalized compared to the CC3 diamine, but this functionalisation is external to the cage–core, and thus barely effects the window and size parameters that we used for the fitness function. The use of chemically different BBs does however lead to a more flexible cage with a slightly larger pore, window sizes and overall lower fitness value.
This example highlighted the efficiency of our EA implementation. The EA search is clearly more efficient than a brute force approach, as only a small fraction of all the possible solutions is selected for computational investigation. However, we stress that although EAs are among the best performing methodologies to probe these vast spaces, the restricted length of each EA run and its stochastic character, do not allow for a comprehensive exploration of the solution space. Solutions will generally be high fitness local minima from very complex multidimensional surfaces, and are very unlikely to be the global minimum.51 By aiming to accelerate the discovery of new materials, we are not only interested in the global minimum for a fitness function, but instead any high-fitness solution is a candidate for experimental realisation. We show in the next two case studies how high-fitness candidates, which do not necessarily represent the global minimum, can be used to extrapolate design patterns for new materials.
All the other parameters (such as selection functions and population size) are equivalent to the ones used for the CC3 rediscovery, with the only difference being the convergence criteria. An EA run was considered to have reached convergence whenever its top 5 candidates remained unchanged for more than 20 generations. This setting allows us to avoid the continuation of a EA run whenever a stable plateau is observed and the probability of finding better performing candidates is very low. Due to the stochastic nature of EAs, we ran this setup 3 different times.
In Fig. 5, we show the evolutionary progress plots for the three different EA runs performed for this case study. All of the runs converged in less than 100 generations (34, 62 and 50 generations for A, B and C, respectively). The total fitness plots (black) show that the average fitness value of the population, in general, increases as a function of the generation number. The best candidates within the population consistently display a much higher fitness value compared to the average. The middle plots (red) show that the best individuals quickly reach the target value of 5.0 Å, even if the pore size penalty is much weaker compared to the asymmetry penalty. This likely reflects the abundance of cages with suitable pore sizes, versus those with low asymmetry values. The weak penalty for pores that are not at the target value is the likely reason why a larger variation is observed in the average pore diameters compared to average asymmetry between generations, as the pore size does not significantly affect the fitness value. The right plots (green) show that the asymmetry parameter converges steadily towards 0 (when all the cage windows are equivalent in size). Both the asymmetry of the best individual and of the population average are very low at the end of the EA runs.
It is of interest whether a certain topology dominates when targeting a specific cage property. Fig. 6 shows the percentage of the two topologies in the populations for the three different runs. All three runs show a strong preference for the Tri4Di6 topology, suggesting that this topology offers better control of the asymmetry when dealing with smaller cages (pore diameter of 4–11 Å), and thus offering better shape persistency. A non-negligible percentage of Tri8Di12 is only observed in run B, where around 20% of the individuals in the final population possess the Tri8Di12 topology.
Fig. 6 Percentage of different topologies observed during the different EA runs (A–C). The Tri4Di6 and Tri8Di12 topological contributions are represented as blue and green bars, respectively. |
Fig. 7 shows the top 5 cages obtained for each EA run with their corresponding pore diameter and asymmetry value. In the ESI† we provide the properties of the individuals selected for the last EA generation, and the structures of the top five for each run. The three different EA runs converged towards three different local minima, where for each run, multiple interesting BBs and therefore cages were explored. From Fig. 7, it can be seen that the best performing individuals are characterized by a combination of good pore size and a symmetric core. For all three runs, the top five cages are Tri4Di6 individuals with pore diameters that fall within 0.5 Å of the target pore size of 5.0 Å. All cages share a very similar shape to that of CC3, typically with some external functionalisation. Some of the external functionalisation or heteroatom substitution on the rings may increase the complexity of the synthesis, however, these hypothetical cages could be simplified prior to any synthesis attempts. In this instance, the external functionalisation has limited or no influence on the fitness function, as this only depends on the pore size, window size and window symmetry. We only observed Tri8Di12 individuals for two of the runs, at rank #16 and #19 within the runs (total of 24 candidates in a generation). This suggests that Tri4Di6 cages are preferred when looking for rigid cages with smaller (5 Å) pore diameters.
To conclude this case study, we analysed the emerging patterns for the molecular properties of the BBs for the individuals in the population. In Fig. 8, we plot the behaviour of the percentage of rotatable bonds, double bonds, and the size of the BBs. The percentage of rotatable bonds and double bonds of a molecule was calculated as the ratio of the bonds of interest over the total number of bonds in the molecule. In all three runs the percentage of rotatable bonds converges towards ∼17% for both di-topic and tri-topic BBs, although there is considerable noise in this value. All three runs display a difference in the percentage of double bonds between the di-topic and tri-topic precursors, with the number for the tri-topic molecules being significantly larger (about 30% compared to 15%). The difference in the percentage of double bonds between the di-topic and tri-topic BBs highlights the more important role that double bonds have on tri-topic precursors in imparting rigidity, and thus shape persistence, to the cage. The mean distance between the reactive end groups (i.e. amine nitrogen to amine nitrogen distance in a diamine and average aldehyde carbon to aldehyde carbon distance in a trialdehyde) is shown on the right-hand side of Fig. 8. This measure defines the approximate size of the BB and gives us an idea of how the size of the precursors is progressing as the EA moves towards the best solutions. For the di-topic BBs, the distance converges on a value of ∼3.5 Å, which is the typical distance between nitrogens in the 1,2-diamines that dominate in the runs and in the reported syntheses of Tri4Di6 cages of this size in the literature. The mean distance for the tri-topic BBs is larger, at ∼5 Å.
The plots in Fig. 10 show that for runs A and C, the EA converged in less than 100 generations (48 and 62 generations, respectively). The middle plots (red) indicate that in the first two cases, a constant increase is observed in the average value of the pore size within the population and the best cage found during the run has a pore size of approximately 16 Å, as requested. For run C, the initial population had much larger pore sizes on average and as a result the generation number does not lead to significant changes in the pore size. The plots on the right (green) correspond to the behaviour of the asymmetry parameter. While generation by generation the value of this property is rather random, there is a clear downward trend (towards more symmetric cages) over the course of the EA in all three cases, reflecting the evolutionary pressure placed on this parameter. We note that there is some conflict between attempting to generate cages with a large pore size and minimizing the asymmetry. As the asymmetry factor is calculated as the sum of the differences between all the topologically equivalent windows within a cage, it is clear that as the size of the cage increases (larger pore diameter and window diameter are correlated), we would expect the asymmetry factor to increase as well. This means that as the individuals are approaching the 16.0 Å target diameter, finding cages with lower asymmetry becomes progressively more challenging.
Fig. 11 shows the top five cages obtained for each EA run, with their corresponding pore diameters and asymmetry values. In the ESI,† we provide properties of the individuals in the last EA generation and structures of the top 5 candidates in each run. Run C found a series of minima containing a boronate tri-aldehyde BB, whereas both run A and B share many cage individuals containing a triphenylamine. For the most part, however, the individuals are constructed from different precursors, with the most common feature being their similar size. As with the previous case study, identified candidates could be simplified, for example removing unnecessary functionalisation to make synthetic realisation more facile. Fig. 12 analyses further the emerging features of the BBs over the runs. The percentage of rotatable bonds for both di-topic and tri-topic BBs is very similar to that of the previous case study that sought smaller shape persistent cages, although slightly smaller at ∼14% compared to ∼17%, but again there is a lot of noise in these values suggesting this is not the sole critical consideration. Similarly, the percentage of double bonds in the BBs has decreased slightly for the tri-topic BBs by about 5%, and increased very slightly for the di-topic BB, although the values vary across the runs (∼17–22%), suggesting no strong changes in these characteristics of the BBs for targeting shape persistency in larger rather than small pores.
As would be expected, the size of the BBs required to target a large pore (16 Å) shape persistent porous organic cage has considerably increased compared to the previous case study of a small pore cage (5 Å). The mean distance between end groups here is 9–11 Å for tri-topic BBs and 6–9 Å for di-topic BBs (compared to ∼5 and 3.5 Å in case study 1). It is also interesting to see that each of the runs evolves towards slightly different size BBs (hence the range of values quoted), suggesting that there are multiple different solutions for shape persistent cages of 16 Å from the database of BBs we explored. It is likely this is due to the interplay between the size of the two BBs, where a larger di-topic BB could compensate for a small tri-topic BB than an alternative solution. We believe that this is an encouraging sign that porous organic cages with an internal cavity diameter of 16 Å are synthetically achievable, despite the absence of synthetic reports to date.
Our two case studies allowed us to not only generate both specific targets, but further to explore the emerging trends of candidates with the desired properties, so we can identify key chemical features for a given property. We have identified that shape persistent porous organic cages require BBs with a very low percentage of rotatable bonds (<20%), and high percentages of double bonds, with tri-topic BBs in Tri4Di6 cages requiring a higher percentage (25–30%) than the di-topic BBs (15–20%). The best candidate from each of the two case studies is shown in Fig. 13. Although there is no guarantee that the predicted compounds can be synthetically realised, we believe that our approach can be used to extract interesting structural motives from the high-fitness individuals of the last generations, which are worthwhile to study experimentally. Similarly, the targets could be simplified, for instance removing unnecessary external functionalisation to increase the ease of synthesis. In our first case study, we targeted cages that were shape persistent, with a small intrinsic internal cavity. The evolutionary pressure for symmetric small pore cages meant that Tri4Di6 topology cages were strongly selected for, and all bore a visual similarity to the CC3 series of cages, and were built using di-topic BBs with the typical nitrogen–nitrogen separation of a 1,2-diamine.
In our second case study, we targeted a cage with a cavity of 16 Å diameter, as we found that this was a cavity size absent in previously synthesised cage molecules. The best candidates among Tri4Di6 cages typically chose a large tritopic building block, for instance the boronate structure shown in Fig. 13. In general, the high-scoring cages contained large building blocks with mean distances between end groups of 9–11 Å for tri-topic BBs and 6–9 Å for di-topic BBs. The range of different solutions found in different runs suggests that there are multiple different solutions for a shape persistent 16 Å pore, and that this should encourage attempts to experimentally explore this target. For these case studies and for future use of our EA, narrowing down the size range and other properties of BBs required to synthesise molecular materials with a given property provides the opportunity for online libraries of BBs to be data-mined for such criteria, or for such BBs to be designed.
The evolutionary algorithm we have included into our supramolecular toolkit, stk, is designed to be flexible. The fitness function can be adapted for new design challenges by adding additional parameters. For porous cages, the next obvious step would be to use our EA for targeting specific properties, such as encapsulation of a particular guest, by, for example, seeking to maximise the host–guest binding energy, or improving the separation performance by optimising window size or the diffusion barrier. If a given precursor is known to have a desirable property, such as fluorescence, which could be used for sensing, then the EA could be used to find a BB partner that forms a cage molecule of desired structure. Any designed molecules could also be used as a first step for crystal structure prediction, in order to produce optimal packing and properties in the solid-state. The EA can easily be applied to alternative molecular materials – indeed stk can already automatically assemble linear polymers, covalent organic frameworks (COFs), and small molecules built from multiple BBs. These molecular materials could then be explored with the EA if a fitness function for them is defined.
In the future, we also wish to extend the EA to make the predictions more facile to synthetically realise. Already, by using a database of known BBs and combining them in what would equate experimentally to a one-pot synthesis, we increase the likelihood of the candidates being synthetically achievable. Of course, much of the Reaxys and eMolecules databases feature molecules of high chemical complexity, this could be countered by including a score of synthetic accessibility72 or a synthetic chemist's scoring73 as part of the fitness function, or by developing a target library specifically designed for the synthesis of a given class of materials. As artificial intelligence is set to revolutionise materials discovery, there is the potential to couple our EA with machine learning (ML),74 through either using the EA to provide training data for a ML algorithm or to maximise both the computational efficiency of a property calculation, whilst using the EA to effectively sample the enormous chemical and structural space of molecular materials.
Footnote |
† Electronic supplementary information (ESI) available: Further details about the fitness function calculation, setup of optimization runs, and the properties of the cages found in the final population for all the case studies. The structure files of the top five candidates in each run are also provided. See DOI: 10.1039/c8sc03560a |
This journal is © The Royal Society of Chemistry 2018 |