The Flexible Unit Structure Engine (FUSE) for probe structure-based composition prediction

Computationally led materials discovery requires e ﬃ cient methods to generate either exact or approximate crystal structures that span the composition range of a chosen phase space. Here we present a new tool, the Flexible Unit Structure Engine (FUSE), for the generation of approximate ‘ probe structures ’ to predict regions of composition space where compounds can be experimentally realised. We then test FUSE by applying it to 42 compositions in the Y 3+ – Sr 2+ – Ti 4+ – O 2 (cid:1) phase ﬁ eld. FUSE correctly identi ﬁ es all of the target compounds in the regions of stability and identi ﬁ es the exact crystal structure for 8 out of the 10 compositions.


Introduction
The computational search for new materials at its simplest involves a method for choosing an initial structure to test with a particular choice of elements, and a methodology for evolving this structure 1-3 or choosing an alternative from a suitable database. 4,5 There have been notable successes from such approaches 6,7 to identifying missing elemental compositions or new high-pressure phases of materials. Much effort in inorganic materials chemistry has focused on the discovery of new materials with complex compositions or on the prediction of new compositions that can adopt previously known crystal structuresin either case, the problem becomes very complex once we consider systems containing more than three elements due to the number of possible phase elds and the size of the compositional phase space.
In materials discovery our aim is to experimentally discover materials, guided by computational composition prediction. We do not seek to predict the properties of the materials computationally a priori, instead constructing phase elds from combinations of elements or compounds based on an understanding of the likely physical and chemical properties of the constituents, for example combining elements which are known to produce high levels of ionic conductivity. We then use well-established substitution and materials processing techniques to alter or enhance the desired properties of materials as they are experimentally realised.
While compositional complexity has the prospect of yielding multi-functional materials and offers more possibilities to tune functional behaviour, it has the drawback of greatly increasing the computational effort required to explore the range of structures and compositions. The initial choice of elements to investigate is of course determined by chemical knowledge and understanding. For example, Ga-containing oxides have been known to produce high oxide ion conductivity; 8,9 it would therefore be desirable to seek new Ga-containing oxides.
Recently we developed a methodology for addressing the complex materials discovery problem, which has been successfully applied to realise two new materials of previously unknown composition and structure. To make the search through complex structure space tractable, we build crystal unit cells from fragments (we refer to these as modules) found in materials of simpler composition containing the chosen elements. These modules are then combined in a chemically sensible fashion (e.g. with the stacking rules of the cubic perovskite structure type) to construct a unit cell containing ions with plausible coordination environments. Relaxation to the local potential energy minimum determines the energy ranking of a structure. The modules are then permuted to generate new structures to be energy ranked. This Extended Module Materials Assembly (EMMA) method 10 was employed to aid in the determination of the structure of a large layered perovskite, Y 2.24 Ba 2.28 Ca 3.48 Fe 7.44 Cu 0.56 O 21 , where the modular approach worked well to impose chemically sensible co-ordination chemistry on all of the atoms.
We extended EMMA with the addition of a Monte Carlo basin hopping algorithm to the permutation of modules (MC-EMMA) to obtain a structure searching tool that is sufficiently exible and fast to permit exploration of the composition space in addition to the structure at given compositions. 11 Building crystals by EMMA imposes restrictions on the possible outcomes. In particular, the modules are of xed composition and they cannot be updated on-the-y. They also have the same xed periodicity perpendicular to the stacking direction, although during relaxation the unit cell shape and volume and all of the ionic positions can vary, resulting in crystal structures very different to the initial MC-EMMA constructions.
The best structures generated by MC-EMMA at any given composition in a completely unexplored phase space are typically approximations to the true global energy minimum structure, which we refer to as probe structures. If the probe structures contain physically realistic local environments for the constituent ions, they should then be close enough in energy to a local minimum in comparison to the energies of all the known competing phases (i.e. the convex hullthe energy surface derived from stable compositions) to indicate the possibility of obtaining a new compound with that composition. We can then initiate an experimental synthesis programme targeting the indicated composition, with a successful outcome being the experimental identication of a previously unknown phase.
We chose to explore combinations of Y 3+ , Sr 2+ , Ca 2+ , Ga 3+ and O 2À to search for possible new materials. These elements have different charge states, ionic radii and bonding characteristics, giving a range of coordination geometries. Although there are related gallate materials with functional optical and transport properties, there were no known materials in this chosen phase-eld. With MC-EMMA, we generated probe structures based upon large layered cubic perovskite or melilite structure types. Coarse energy ranking with force-elds and renement with Density Functional Theory indicated a target region of composition space, and detailed experimental synthesis and structural characterisation led to the identication of two new crystal structures of previously unknown composition. One material is related to the melilite structure and the other is a 4 Â 4 Â 4 supercell of perovskite. We subsequently modied the perovskite structure to generate a new down conversion phosphor host.
In the present study we introduce a new approach to generating/representing candidate structures that operates below the module level of MC-EMMA, choosing building blocks and the unit cell size and shape within the modules before assembling the full structure with chemically sensible ionic environments. We illustrate this method, the Flexible Unit Structure Engine (FUSE), by applying it to the generation of probe structures to identify the compositions of the stable phases in the experimentally explored Y 3+ -Sr 2+ -Ti 4+ -O 2À phase eld. [12][13][14][15][16][17][18][19][20][21] Although we have used FUSE to generate probe structures and implemented structure search by a generalization of the MC basin hopping steps similar to those used in MC-EMMA, we anticipate that the basic construction approach could also be implemented in other structure evolution approaches, such as genetic algorithms or particle swarm optimisation, as used in other inorganic crystal structure prediction methods. [22][23][24][25] 2 Building the unit cell with the Flexible Unit Structure Engine (Fig. 1) FUSE has been implemented in Python (2.7/3.6 or higher) and is dependent on the atomic simulation environment (ASE) 26 and its subsequent dependencies. We generate an initial random seed structure for a given composition in the phase space under consideration in a series of steps outlined in detail below. First we choose randomly from ve approximate lattice types, e.g. cubic or trigonal (Table 1), but without imposing internal symmetry, there is no triclinic approximation since we restrict two of the three unit cell angles to 90 . This step determines the number of units (sub-modules) and the number of layers (modules) that these should be spread across. A particular sub-module set is then generated by assigning the in-plane coordination of the cations. This set is split into equalsized groups and spread across the modules that are used to assemble the 3D crystal structure, with stacking rules appropriate to the chosen lattice type, as implemented previously in EMMA. Before moving to geometry optimisation, we carry out a nal error check to reduce the number of cations that are in physically unreasonable environments. Should a structure fail at this point, a fresh structure is generated. The geometry optimisation can be performed with an external Density Functional Theory (DFT) or force-eld code.

2.1
The generation of sub-modules (Fig. 1a) The sub-modules are composed of one cation position accompanied by 0-3 nearest-neighbour anions, as shown in Fig. 1a. This denes eight basic motifs for the sub-modules based upon a single unit of AX 3Àd , where A is a cation species, X is an anion species and d ¼ 0, 1, 2 or 3. For d ¼ 1 or 2, there are multiple nonequivalent ways in which anions can be arranged, giving one motif containing three anions, three containing two anions, three containing one anion and one motif with zero anions (Fig. 1a). We select randomly from this set, subject to the constraint that the total numbers of each ion type give the correct stoichiometry.
Within the computer code, each sub-module is stored as three descriptors: the rst labels the cation species, the second stores the motif label, and the nal descriptor labels the anion species accompanying the cation in the sub-module. When all of the sub-modules have been generated they are stored in three lists, each containing all the descriptors of one type (e.g. one list contains all of the cation labels). We describe the integer number of sub-modules in a given crystallographic direction using the symbols x, y and z, referring to the a, b and c crystallographic axes, respectively. The chosen lattice type also affects the total number of submodules, e.g. for a cubic lattice if we choose to have 2 sub-modules in the xplane (Table 1), then we should have 2x 3 ¼ 16 sub-modules in total. During optimisation, the size of the unit cell, i.e. the number of modules, can grow, so it is not necessary for the initial structures to be very large. In this study we have imposed an arbitrary upper limit of 50 atoms in total in the unit cells of the initial structures. In these small unit cells it might prove impossible to t the correct numbers of modules to satisfy stoichiometric requirements into the cubic lattice choice, for example. If FUSE is unsuccessful in populating a lattice of a particular type, it will try a different lattice type until a unit cell can be generated. The monoclinic lattice type is the default which can always be populatedin this lattice it will always be possible to at least assemble the sub-modules into a 1D chain.

2.2
The formation of modules from sub-modules ( Fig. 1b) The order in which the sub-modules are stored is randomised and the set is divided into z groups, where z is the number of modules determined by the choice of lattice. x Â y sub-modules are selected and stored in separate descriptor arrays for each module. The positions are converted from fractional coordinates toÅ scaling by: Table 1 Initial choices of lattice, with the dimensions (in integer units of sub-modules) of the unit cells and cell angles; z is restricted to being even for tetragonal and orthorhombic by cubic stacking and the requirement that each adjacent layer must have a different translation to assemble the structure. As we only vary the angle g in our approximations, monoclinic unit cells are created in a non-standard setting (for the standard setting b s 90 ) Approximate lattice x, y, z 120 This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.
where R cat is the average cation radius and R an is the average anion radius. This gives lattice parameters for the module:  example, for the second module in the a direction, all of the positions will be translated by l p along a. The whole process is then repeated for each of the modules within the structure.

2.3
Stacking the modules to produce the 3D structure (Fig. 1c) To avoid stacking modules with like-charged ions packed on top of each other, each module is assigned a translation to be applied to all of its constituent ions. The translations are selected such that each adjacent module has a different translation instruction. If the selected lattice has g ¼ 90 the translations are based upon cubic packing, such that they alternate between [0l p , 0l p , 0l p ] and [0.5l p , 0.5l p , 0.0l p ]. When g ¼ 120 the translations are based on a close packed structure, with possible translations being [0l p , 0l p , 0l p ], [1/3l p , 2/3l p , 0l p ] or [2/3l p , 1/3l p , 0l p ]. Applying the translations to the module list completes the generation of an initial structure.
2.4 Error checking the co-ordination chemistry (Fig. 1d) With a fully assembled structure, it is possible to determine the number of nearest-neighbour anions for each cation, giving an initial co-ordination number.
As each sub-module is of the same size, nearest-neighbour cations and anions are separated by less than l p . To determine the initial co-ordination of a cation, we need to count the number of anions within 0.72l p . If the calculated co-ordination number of a cation differs by more than one from the value listed in the Shannon radii tables, 27,28 it is counted as an 'error'. If the fraction of these co-ordination errors exceeds the threshold b tol , then this structure is rejected and a new one is generated until the fraction of errors is acceptable. For the systems investigated in this study, we optimised b tol to maintain a high percentage of structures converging successfully in the geometry optimisation, while minimising the time taken to generate a structure with the number of errors below the threshold. We found that a b tol value of 0.25 resulted in typically 95% of structures converging, with minimal impact on the time taken to generate structures.

Geometry optimisation
Before relaxing the assembled structure it is important to apply a random displacement to each ion to prevent the optimisation routine failing to relax ions out of articially high symmetry sites. We use the ASE 'rattle' function, which selects random displacements from a normal distribution of a specied standard deviation (in this study we have arbitrarily used 0.025Å). Having completed the initial structure generation, we generate input les for the chosen geometry optimisation. We have implemented this for the DFT code VASP 29 and the force-eld based GULP 30 code. When the external code has completed its geometry optimisation calculation(s), the co-ordinates and energy of the relaxed structure are recorded by FUSE. During structure evolution, we generate lattices of different sizes containing different numbers of ions; therefore we need to use the energy per atom for ranking structures. If geometry optimisation fails (e.g. due to a physically unreasonable starting geometry, or because the structure cannot relax within the allotted run time), then the energy per atom is articially set to a very high value to prevent any subsequent search routines from selecting it as a structure to evolve. 3 Using FUSE to search for a probe structure at a given composition Structure modications are performed on the unassembled module representation, e.g. we swap ions between modules by simply swapping the appropriate labels within the arrays describing the modules, and then repeat the module stacking and error checking steps above. In principle, the basic FUSE construction could be paired with many existing search algorithms. In this study we implement a Monte Carlo basin-hopping algorithm, building on our earlier study with MC-EMMA 11 (Fig. 2). We begin the search for an optimum structure with an initialisation stage; we build structures until we are able to successfully perform geometry optimisation on 1000 structures. The lowest energy structure is then selected for evolution. The initialisation step ensures that the search starts from a physically reasonable structure. Since the number of atoms in the initial structures is limited (to a maximum of 50 in this study), the initialisation has a small computational cost compared to the geometry optimisation of the larger evolving structures. For the systems studied here, we limited the size of structures in the Monte Carlo search to no more than 150 atoms.
In each step of basin-hopping the current evolving structure (CES) can be altered by six different permutations:

Swap atoms within the structure
We rst select whether to swap cations or anions. If there is only one anion or one cation species, FUSE selects to swap the other, i.e. if the only anion in a system is O 2À , FUSE will only swap cations. When swapping cations, for example, we simply select two cations of different types from the cation arrays and swap their locations before the modules are assembled into a crystal. Both intra-module and inter-module swapping are allowed. This is a key improvement on our original MC-EMMA basin-hopping, in which the module compositions are xed. For anions we currently have to keep both the anion type and the motif type together in order to preserve the overall composition of the unit cell. Note that if only one cation and one anion species are present (e.g. for SrO) then this permutation type cannot be used as no swaps will be possible.

Alter the stacking sequence
Two modules within the sequence are switched before the structure is assembled.

Grow the lattice by doubling the structure
If the current number of atoms in the unit cell is less than or equal to half of the permitted maximum, FUSE doubles the structure along a randomly chosen crystallographic axis.

Alter the lattice by changing g
The lattice angle, g, is swapped from 90 to 120 or vice versa. This is performed prior to assembling the modules into a structure since the lattice angle determines which translation vectors are used in stacking.

Generate a new random structure of similar volume
A new random structure is generated from the sub-module motifs, as detailed above, but with the maximum number of permitted atoms equal to that for the current structure.

Generate a new random structure
A new random structure is generated with a random number of sub-modules. This differs from the generation of the initial structures in that we do not constrain the unit cell to small values, but allow a choice up to the maximum of number of atoms permitted in structure evolution. During the MC search, we use the simpler permutations more frequently. For the results presented here we use permutation 1 53% of the time. Permutations 2 and 3 each occur 21% of the time and the remaining permutations occur much less frequently, only $1% of the time (permutations are rounded to the nearest whole percent).
Once a new trial structure has been generated, geometry optimisation is performed to obtain its energy. The new trial structure replaces the CES if its energy is lower than that of the CES, or randomly if the usual Monte Carlo condition is fullled, i.e. if e ÀDE/q > x, where x is a random number between 0 and 1, DE is the energy difference between the CES and the trial structure, and q is the MC temperature parameter. Higher values of q give higher probabilities of accepting an uphill energy step. In this study we use a value of q ¼ 0.02 eV unless the basinhopping gets trapped in a particular minimum.
The MC loop contains a break condition, at which point the search is considered converged. This occurs when r max structures have been rejected since nding the structure with the lowest known energy. If the number of structures rejected since the last structure acceptance, r, is greater than 0.1 Â r max the MC temperature parameter is gradually increased. q is increased by a random number (<0.001) for every ten steps that r is above 0.1 Â r max . While the system is 'heating up', FUSE only uses permutation types 5 and 6 (in a 2 : 1 ratio), i.e. the entire crystal structure is replaced. For the example presented in this study, these values were found to be effective in helping the MC search escape into a new basin (as can be seen in Fig. 3a). When the system escapes into a new basin, q and the permutation frequencies return to the original values. During probe structure generation, we performed three independent searches at each composition, taking the lowest energy structure from the three runs as the probe structure for that composition.

Testing the methodology: the Y 3+ -Sr
We used FUSE to explore the Y 3+ -Sr 2+ -Ti 4+ -O 2 phase eld. Using the Inorganic Crystal Structure Database (ICSD) 31 we identied 10 compounds with well-dened structures including rutile, rock-salt, perovskite, pyrochlore and Ruddlesden-Popper phases. This diverse range provided a strong test of our ability to identify energy minima corresponding to different ionic environments using the simple construction encoded in FUSE to generate approximate probe structures.

Geometry optimisation
For the calculations presented within this study, we used the General Utility Lattice Program (GULP) 30 for geometry optimisation. Buckingham potentials were used for the short-range component of the potential, with a cut-off radius of 12Å. For all structures, the unit cell parameters and atomic positions were optimised until the norm of the gradient was lower than 0.001 (a.u.). Parameters for the O 2À -O 2À and Sr 2+ -O 2À interactions were obtained from the literature. 32,33 The parameters for Ti 4+ -O 2À and Y 3+ -O 2À were based upon literature values, 22,34 but the A parameters were adjusted so that the lowest energy experimental structures were obtained as the energetic ground states for TiO 2 and Y 2 O 3 in comparison to other polymorphs (for TiO 2 ) and random structures generated using FUSE. The nal potential parameters are presented in Table 2.

5.1
The ground-state structure for a single composition: Y 2 Ti 2 O 7 As an example of a FUSE calculation from within the Y-Sr-Ti-O phase eld, we focus on the composition Y 2 Ti 2 O 7 . Experimentally, Y 2 Ti 2 O 7 crystallises in the pyrochlore structure, 16 containing a mixture of octahedrally co-ordinated Ti 4+ and eight co-ordinate Y 3+ ions in the space group Fd 3m. The full unit cell contains 88 ions, with 22 in the primitive cell, with four unique sites (Fig. 3d). Obtaining the ground-state structure in the absence of symmetry therefore presents a substantial challenge. Fig. 3a shows the energies of all the structures generated and tested for this composition. The energy of the CES is shown by the blue line. Aer the initiation  stage the CES cycled up and down in energy before dropping into a deep minimum at MC step 2715. This minimum actually corresponds to the experimental crystal structure and is the lowest energy structure obtained for this composition. Since the energy could not go lower, the MC routine was stuck in this basin until the MC temperature (also shown) was ramped up and the full structure replacement permutations, 5 and 6, were employed. Several other basins were explored, with temperature jumps required to exit two of them. The structure evolution ended aer $10 000 MC steps. The structure in the lowest energy basin was found to be a small, low symmetry unit cell (Fig. 2b) containing 22 atoms. This was then passed through the FINDSYM program, 35 which returned the correct space group and structure, with four unique sites (Fig. 3c), although with an origin shi relative to the reported crystal structure (Fig. 3d). (Fig. 4) Across the phase eld we sampled 42 different compositions from the binary tie lines and within the ternary space. For each composition we performed three runs, taking the lowest energy structure as the probe structure. The control parameters for FUSE were determined by nding a set for which the correct ground-state structure could be reliably obtained for the Y 2 Ti 2 O 7 composition. The initialisation loop was run until 1000 structures converged during geometry optimisation or until 1500 structures had been tried. The error tolerance parameter was set to 0.25, the default q to 0.02, and the run termination parameter r max was set to 7000 for each composition. The weightings given to the MC moves were not optimised and therefore we hypothesise that given optimisation, the MC search routine would converge on the lowest energy structure more quickly.

Exploring the composition space
Using the results from the FUSE calculations, we assembled the convex hull using pymatgen. 36 Accounting for the approximate nature of the probe structures, energies lying slightly above the convex hull could still identify islands of stability. In the previous phase diagrams explored using probe structures, we found experimentally realisable materials lying $35 meV per atom (ref. 11) above the convex hull and so we used this as a cutoff value to identify stable compositions. All of the known stable compositions were found to lie within this cutoff, i.e. the probe structures correctly identied the compositions of all the known phases. The highest energy above the convex hull was found to be 34 meV per atom for Y 2 TiO 5 . 15 Plotting the compositional phase space, we observed that only these known compositions were found under the 35 meV per atom threshold. Of these, FUSE obtained the exact ICSD crystal structures for: TiO 2 (anatase 18 Fig. 5f and g), which both form Ruddlesden-Popper phases, there are multiple stacking sequences which differ by less than 0.2 meV per atom. With the force-eld used within this study, the experimental crystal structures were found to be 0.1 meV per atom above the lowest energy structure obtained in FUSE.
For Y 2 TiO 5 , the probe structure obtained was found to be 19 meV per atom lower in energy than the experimental structure (Fig. 5h), although the experimental structure was obtained during the FUSE runs, but not identied as the global minimum. This highlights the need to recalculate the energies of the probe structures with more accurate methods (DFT), as we have done previously. 11 The compositions SrY 2 O 4 and Y 2 O 3 were correctly identied by the probe structures as lying in regions of stability, however the present MC search was unable to obtain the exact ICSD crystal structures ( Fig. 5i and j). The FUSE probe structures for SrY 2 O 4 and Y 2 O 3 were found to be +9 and +40 meV per atom above the ICSD structures. Although we do not set out to fully predict structures, but rather the regions of stable composition, we believe that FUSE will correctly obtain the structures of these two outliers once adjustments are made to the search/construction routine. To demonstrate that FUSE can assemble a model that will relax to the correct structure for Y 2 O 3 , we used the FUSE rules to build it by hand. Building a cell for Y 2 O 3 from the mixed anion-cation sub-modules used in the present version of FUSE requires stacking along the [110] axis of the experimental cell. This yields a unit cell containing 160 ions, a greater number than was allowed in the MC search in this study. However, assembling this structure by hand, geometry optimisation (Fig. 6a) followed by symmetry searching did yield the ICSD structure ( Fig. 6b and c).

Conclusions
In this study, we have presented a new method, FUSE, for probe structure-based inorganic materials discovery, based upon the construction of unit cells from randomly generated modules of varying shape, size and composition. FUSE can generate a wide range of physically plausible structures, with only modest restrictions applied to the type of crystal structure being created. The method was tested by exploring the known Y 3+ -Sr 2+ -Ti 4+ -O 2À phase eld, and successfully identied all of the compositions where structures are known to form as lying in low energy regions of the phase diagram. Additionally, in the search we found the exact crystal structure of 8 out of the 10 compositions. FUSE provides a new tool specically designed for the efficient generation of probe structures across the compositional phase space. It has the potential to accelerate the computational prediction of regions of the phase space that can be protably explored to experimentally realise new materials.

Data availability
The data used within this study is available from the authors on request.

Conflicts of interest
There are no conicts of interest to declare.