Evolutionary chemical space exploration for functional materials: computational organic semiconductor discovery

Computational methods, including crystal structure and property prediction, have the potential to accelerate the materials discovery process by enabling structure prediction and screening of possible molecular building blocks prior to their synthesis. However, the discovery of new functional molecular materials is still limited by the need to identify promising molecules from a vast chemical space. We describe an evolutionary method which explores a user specified region of chemical space to identify promising molecules, which are subsequently evaluated using crystal structure prediction. We demonstrate the methods for the exploration of aza-substituted pentacenes with the aim of finding small molecule organic semiconductors with high charge carrier mobilities, where the space of possible substitution patterns is too large to exhaustively search using a high throughput approach. The method efficiently explores this large space, typically requiring calculations on only ∼1% of molecules during a search. The results reveal two promising structural motifs: aza-substituted naphtho[1,2-a]anthracenes with reorganisation energies as low as pentacene and a series of pyridazine-based molecules having both low reorganisation energies and high electron affinities.

When starting the evolutionary search algorithm an initial population of randomly generated molecules are created using the input variables and transformation operations. Randomised molecules are created by randomly selecting one of the base molecules from the smiles list, to which the addition operation is applied using a second, randomly selected fragment from the same list. Further application of the addition operation with further fragments is carried out until a randomly selected size within the limits given by molsize is reached. A large number (500 in this work) of mutation operations using the smarts variable are then applied to the molecule.
New generations of molecules are created using a elitism rate of 10% so that the new population is made from the top 10% best performing molecules from the previous population. The remaining 90% is made using child molecules created with the following procedure. Two 2-way tournament selections are carried out where in each tournament a molecule is selected out of a set of two randomly selected molecules from the previous population with a probability of 75% to select the fitter molecule. The two molecules from the two 2-way tournaments are then used to create two child molecules using the Crossover transformation. Each child molecule then has a probability of 5% to undergo Mutation and a probability of 5% to undergo Recombination.
The search algorithm therefore runs by creating a initial population where in this report we have used a population size of 100. The fitness of each molecule in the initial population is evaluated and the next generation is created from this. Newer generations are created continually until a desired number of generations or a convergence criteria is reached. In this report we have chosen to run all searches for a total of 30 generation. Since the selection and replication for the creation of new molecules in the next generation favour fitter molecules the search algorithm is therefore driven to a global minimum or maximum.

S2 Full details of the chemical search space
In this report we have chosen to explore the region of chemical space of aza-pentacene type molecules.
To carried this out with algorithm described in Section S1 requires the input variables written in Listing 1. We were able to determined the size of the chemical space that can be explored to contain 68064 unique molecules which was obtained by running the randomised molecule creation in a infinite loop and converting each molecule generated into its canonical InChi string [4] and was then added into a Python set. The program was left to run until no further changes to the number of elements in the set had occurred over a few days. [5,5] Listing 1: Input variables used in this report which define a chemical space of aza-pentacene type molecules to be searched.

S3 Fitness functions
In this report we have designed two different fitness functions, where λ − is the reorganisation energy for electron transport between two molecules approximated using the four-point scheme using isolated molecule energies. [5] The penalty function where Φ = W − A s is the equation for the Schottky barrier from the Schottky-Mott rule for the injection of an electron from an electrode with a work function W into the semiconductor material with a the solid-state electron affinity A s . [6,7] Where the penalty is only applied for cases below the target work function to favour less reactive higher work function metals.
For this report we will be using W = 4.1 eV which we have chosen to match more closely the work function of metals such as Ag, Cu and Au which have the values 4.26, 4.65, and 5.1 eV.
The fitness function F A was used to search for molecules with the best probabilities to produce crystal structures with high electron mobilities when using the transport hopping models. Additionally we use F A to evaluate the performance and reproducibility of the evolutionary algorithm since the global minimum was expected to correspond to pentacene as any aza-substitution or nonlinearity is expected to disrupt delocalisation of the excess electron and increase its reorganisation energy. The fitness function F B includes the additional penalty function to ensure larger electron affinities and therefore a smaller Schottky barrier when using higher work function electrodes in OFETs. Therefore F B is used to minimise both the barrier for injection of an electron into the semiconductor and the barrier for hopping across the semiconductor in hopping transport models.
Both fitness functions were evaluated by taking each molecule generated by the evolutionary algorithm and creating 3D molecular geometries using the RDKit inbuilt initial coordinates generation and UFF optimisation functions. [8] The UFF geometries are then taken for a further optimisation step carried out at the B3LYP/6-311+G** level of theory to generate the neutral geometries and then used to determine the geometries of the charged species. Solid-state electron affinities were approximated from gas phase adiabatic electron affinities which were calculated using energies extracted from the the geometry optimisation calculations. Electron reorganisation energies are obtained by carrying out two additional single point energy calculation at the same level of theory where all DFT calculations for the fitness evaluation were carried out using GAUSSIAN09. [9] Solid-state electron affinities were approximated from calculated gas phase adiabatic electron affinities by taking advantage of the known correlation between the two values. [10,11] Here we calculated the gas phase electron affinities for 12 molecules to produced a linear regression fit against experimental low-energy inverse photoemission spectroscopy (LEIPS) values for thin-films organic semiconductors with constants (m = 1.00, c = 1.11) and an R 2 = 0.97, see Section S4.
Therefore solid-state electron affinity can be obtained from gas phase calculations by making a very simple correction of 1.1 eV to the gas phase adiabatic electron affinities calculations.
Using this simple adjustment with W = 4.1 eV will mean the fitness function F B will also be equivalent applying a penalty function similar to Φ against molecules with gas phase electron affinities greater then 3.0 eV.

S5 Comparison of calculated mobilties between Marcus theory and non-adiabatic molecular dynamics for a set of tetracene systems.
We carried out comparisons of calculated mobilities using Marcus theory, as outlined in our paper, against non-adiabatic molecular dynamics simulations. [18][19][20][21] This method has been shown to obtain excellent correlation with experimental mobilities [22] which therefore will be a good reference to compare against. For the test set, we use a series of substituted tetracene systems whose hole mobilities were recently evaluated. [23] We started with the experimentally determined crystal structures of each crystal structure.
Hydrogen bond lengths are typically underestimated in experimental crystal structures, and an equilibration step is carried out in the non-adiabatic molecular dynamics simulations. Therefore, we give a fair comparison to our method by taking molecular geometries from the experimental crystal structures and reoptimise them using B3LYP/6-311G** with heavy atom positions fixed so that the correct conformations are maintained. These molecular geometries are then pasted back into the experimental crystal structures which are then optimised using the lattice energy minimisation method as outlined in Section S7. Mobilities for these optimised structures are then obtained by running kinetic Monte Carlo simulations using Marcus theory rates as outlined in the paper. Results are shown in Table S2    In general, we see fairly good ranking for these systems with small/large mobilities predicted by Marcus theory corresponding to small/large mobilities with non-adiabatic molecular dynamics.
There are two outliers: the systems TiBuT and TMT. For TiBuT, the hole mobility predicted by Marcus theory is dramatically overestimated. This could be due to the large variation in its electronic coupling ±56.3 eV through the molecular dynamics, indicating an important effect of phonons on hole transport. The hole mobility in TMT, with a smaller deviation in its electronic coupling of ±21.8 eV, is underestimated. The mobility in TMT from non-adiabatic molecular dynamics relies on a complex pathway, where the mobility may depend strongly on dynamics connecting 'columns' of molecules. [23] These errors are probably acceptable for our use case since each molecule may be predicted to include up to a hundred or more crystal structures in its low energy region and the likelihoods of a molecule to produce a high performance organic semiconductor is assessed using a weighted average and deviation over its crystal structure landscape.
In future work it will however be favourable to replace Marcus theory with models which include effects due to the non-local coupling, a parameter which causes the deviation in the electronic coupling. These effects are taken into account in, for example, the transient localisation theory. [24][25][26] However these methods incur an increased computational expense and future work is required to see if they can be used as a part of a high throughput screening program. This not only requires that these methods are computationally inexpensive but that they are easily automated and require no human intervention at any point. There are therefore significant challenges to replacing Marcus theory especially since it will be necessary to scale up the material discovery program we have developed here by at least an order of magnitude.

S6 Definition of molecular non-linearity
We define the amount of non-linearity by the number of bonds that connect two rings together that are not the intersecting bonds. Figure S5: Example chemical structures with increasing non-linearity from TOP left to bottom right. Where we have define the amount of non-linearity by the number of bonds (red) that connect two rings together that are not the intersecting bonds (blue).

S7 Full details of crystal structure prediction calculations
Crystal structure prediction calculations were performed for the most promising molecules identified from the evolutionary search, using the Global Lattice Energy Explorer (GLEE) program. [27] A low-discrepancy, quasi-random sampling of crystal packing variables was used to uniformly sample the lattice energy surface of each molecule. Trial crystal structures were generated and lattice energy minimised until a total of 34,000 successfully lattice energy minimised crystal structures were produced for each molecule in the most commonly observed space groups for organic molecules: 4000 structures in each of (P2 1 /c, P1, C 2/c, P2 1 2 1 2 1 , P2 1 , Pbca) and 2000 in each of (P1, C 2, Cc, Pna2 1 , Pca2 1 ) with one molecule in the asymmetric unit (Z = 1).
Lattice energies were calculated with the W99 intermolecular atom-atom potential [28] combined with an atomic multipole electrostatic model based on the molecular charge densities calculated from a distributed multipole analysis [29] of the B3LYP/6-311G** density, with multipoles up to hexadecapole on each atom. Ewald summation was used for charge-charge, charge-dipole and dipole-dipole interactions, while all higher order electrostatics and repulsion-dispersion interactions were calculated to a 35 Å cutoff. All lattice energy minimisations were performed using the DMACRYS software, [30] with space group symmetry constrained throughout the optimisations.
Molecular geometries were kept rigid, constrained at their neutral isolated molecule optimised geometries using the B3LYP/6-311G** level of theory. The removal of duplicate crystal structures from the final structure sets was performed across all space groups by comparing predicted X-ray diffraction patterns calculated using PLATON. [31] S8 Energy-Structure-Function maps of electron mobility in the optimised molecules