What is the best or most relevant global minimum for nanoclusters? Predicting, comparing and recycling cluster structures with WASP@N †‡

To address the question posed in the title, we have created, and now report details of, an open-access database of cluster structures with a web-assisted interface and toolkit as part of the WASP@N project. The database establishes a map of connectivities within each structure, the information about which is coded and kept as individual labels, called hashkeys, for the nanoclusters. These hashkeys are the basis for structure comparison within the database, and for establishing a map of connectivities between similar structures (topologies). The database is successfully used as a key element in a data-mining study of (MX)12 clusters of three binary compounds (LiI, SrO and GaAs) of which the database has no prior knowledge. The structures are assessed on the energy landscapes determined by the corresponding bulk interatomic potentials. Global optimisation, using a Lamarckian genetic algorithm, is used to search for low lying minima on the same energy landscape to confirm that the data-mined structures form a representative sample of the landscapes, with only very few structures missing from the close energy neighbourhood of the respective global minima.


Introduction
The application of structure prediction in the field of clusters and nanoparticles has resulted in literally millions of structures being discovered for different compounds, systems with different magnetic ordering, systems containing different dopants, or simply systems of different sizes. 1-4 Crucially, each system can be described as an energy landscape and the initial target or targets are the location of the global minimum (GM) or the locations of low energy local minima (LM). 5 Today when one wants to study a new compound of interest within certain sets of parameters, including stoichiometry, size, environment, etc., a key question springs to mind: is it worth running new simulations that employ one or several contemporary global structure optimisation algorithms? We argue: not necessarily! Thoughtful exploitation of the available data that can be found in the literature presents a viable alternative that turns out to be the most efficient way to discover new structures, materials, and their physics and chemistry. 6-14 Similar considerations, apart from size, can be applied to crystal structures, including molecular, metallic, ionic, covalent, or hybrid organic and inorganic frameworks. 15-18 Another problem encountered by practically every practitioner of global optimisation for structure prediction is how to ascertain that the newly discovered configuration of a particular compound is not already known from competitors' studies, for example, does not exist out there under the guise of a different compound of similar stoichiometry, and is not an unpublished but known lower ranked local energy minimum (i.e. data that has a rank that is beyond a chosen set threshold for publication).
The use of slightly different energy functions, unintentional effects of tolerances both in energy definition and local optimisation, or possibly an intentional bias to match measurable properties (for example, infrared data) will all muddy the waters further.
The choice of the best (or most suitable for the investigator's purposes) cost (or fitness) function is uncertain, and could be quite different in different studies even on the same system.
To address these challenges, we have developed a database complemented by a toolkit that includes structure comparison as a key element. Aggregating structures and their properties into one place also enables the sophisticated exploration of structural motifs and particular properties and the discovery of structure-property relationships. Databases are not a new concept in materials modelling, 19-29 even in the field of nanoclusters. 30,31 Crucially, our searchable database generates a map of connections relating different structures. In this article, we describe both the database and the algorithms that generate these mappings, followed by simple showcase examples.

Web-assisted structure prediction at the nanoscale (WASP@N)
In the development of the database, our Hive of knowledge, we aimed to arm the scientific community and general public, from professional researchers to school pupils, with a new intelligent tool to search, discover and disseminate structures and properties of new nanoclusters. To allow access and interaction with the Hive, we built a web interface, which we refer to as the WASP toolkit. The mapping between structures and various properties, an essential feature of the Hive database, is generated by algorithms that form part of a separate piece of code that we refer to as the Bee software. The Bee software runs on dedicated computing facilities. The WASP interface links the user, the Hive and the Bee software; see Fig. 1. With open access to the Hive, a number of security measures have been employed in order to protect the integrity of the data and the computing facilities from malicious attacks (to complete the analogy, we refer to unwanted visitors to the Hive as hornets). Datasets within the Hive are organised as follows: (a) published atomic structures, the atomic coordinates of which were originally used to generate a figure (e.g. ball and stick models) or were explicitly given in a table as part of a published paper (or electronic supplementary information) that has a DOI; and (b) atomic structures generated using the Bee software. For the former, the atomic structures are labelled using the DOI of the published article they were taken from, and are uploaded as one or more concatenated xyz file(s) using an extended format that contains both the metadata saved on the comment line and the atomic structure, which includes atomic labels, Cartesian coordinates, and one additional scalar and one vector record per atom (for example, charges, spin, dipole on atom). Searchable metadata are vital for the use of a database. Values for metadata that can be provided include the definition of energy and software, total charge, energy ranking, total spin, etc.
For example, the comment line: "Name=drum; Symmetry=D3h; Definition={FHI-aims, PBE0/PBE, tight}; Energy=210Hartree; Size=6; Atoms=12; Charge=0; Spin=0; Dipole=(0,0,0)" for the cluster (ZnO)6 indicates that the user refers to the local minimum configuration as a "drum", the atomic coordinates of which have D3h point group symmetry after geometry relaxation using the FHI-aims software with the generalised gradient approximation in density functional theory in the form of the PBE exchange and correlation density functional and the tight basis set, an energy of 210 Ha with the same basis set and the hybrid PBE0 exchange and correlation density functional, a total charge and spin of zero, and no resultant dipole. If not specified upon upload to the Hive, some of these will be calculated along with, for example, stoichiometry, topology, total mass, centre of mass, and principal moments of inertia. Non-searchable metadata, for example thumbnail ball and stick images, are generated on the fly. The dataset for each DOI string will also contain timestamp metadata (when it was uploaded or last modified) and publication metadata (authors and journal name, volume and page numbers). Generated datasets are given a DOI string by the Bee software that is based on the chosen energy definition, and the atomic configurations result from structural relaxations of all the published datasets. The essential search and comparison features of WASP enable the user to investigate structural motifs and physical properties. The comparison of clusters can be quite expensive and, therefore, comparison-based pre-searches are performed by the Bee software upon the upload of new datasets, both published and generated. A description of the algorithms employed in these comparisons is provided in the next section. The results of pre-searches are saved as links between (thus establishing) related structures.
These links, or new metadata generated by the Bee software, form a map linking different structures in the database. The map can be readily exploited by the user through the WASP interface to ascertain the uniqueness of newly found configurations of clusters of a certain compound and size or to compare clusters of different compounds. Moreover, as we will demonstrate below, this map can also help to reduce the effort needed to explore the energy landscapes of a compound that has yet to be investigated. The computational work and the interaction of the three complementary codes (WASP, Bee, and the Hive) are supported by appropriate hardware solutions, as illustrated in Fig. 1, and related operating system server software (including task scheduler, etc.). In the near future, we plan to expand the solution shown in Fig. 1 to include the exploitation of third party computing platforms.
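As an illustration of how the metadata on the comment line of the extended xyz format might be read, the following is a minimal sketch. It assumes that fields are separated by semicolons and that keys and values are separated by "="; the function names are hypothetical and are not part of the WASP or Bee software.

```python
import re


def parse_comment_line(line):
    """Split 'Key=value; Key=value' metadata into a dict of strings.

    Assumption: ';' only ever separates fields, so values may contain commas.
    """
    metadata = {}
    for field in line.split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition("=")
        metadata[key.strip()] = value.strip()
    return metadata


def parse_energy(text):
    """Separate a value such as '210Hartree' into (210.0, 'Hartree')."""
    pattern = r"\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*([A-Za-z]*)\s*"
    match = re.fullmatch(pattern, text)
    if match is None:
        raise ValueError(f"cannot parse energy: {text!r}")
    return float(match.group(1)), match.group(2)


def parse_vector(text):
    """Turn a string such as '(0,0,0)' into a tuple of floats."""
    return tuple(float(x) for x in text.strip("()").split(","))


# The example comment line quoted in the text for the (ZnO)6 drum.
comment = (
    "Name=drum; Symmetry=D3h; Definition={FHI-aims, PBE0/PBE, tight}; "
    "Energy=210Hartree; Size=6; Atoms=12; Charge=0; Spin=0; Dipole=(0,0,0)"
)
meta = parse_comment_line(comment)
energy, unit = parse_energy(meta["Energy"])
dipole = parse_vector(meta["Dipole"])
```

Keeping every value as a string until it is needed leaves room for metadata keys that the parser does not know about.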

Uniqueness and similarity
Being able to quickly recognise similar structures, or measure their similarity, has always been a challenge in materials modelling. 32 Consider comparing the atomic structures of two nanoclusters that are essentially the same but have either small random perturbations (noise) resulting from the applied optimisation tolerances or slight differences because of the different, but similar, density functionals employed. In the comparison procedure, the first task is to correctly align these two configurations: the translation and rotation of each cluster is fixed by positioning the centre of mass at the origin and aligning the principal axes of rotation with the chosen Cartesian axes. Hopefully, upon alignment, a one-to-one match is found for each atom in one configuration with the equivalent atom in the other. If not, then there is a combinatorial problem to solve: which combination of atom pairs minimises the sum of the distances between all pairs (a sum of zero implies a perfect match, with each atom in one configuration positioned exactly on top of the equivalent atom in the other configuration). Minimising this measure of likeness for two dissimilar nanoclusters may also require optimising the relative rotation and translation of the two nanoclusters.
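The translation and pairing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the two clusters are taken to be already rotationally aligned (the principal-axis step is omitted), pairs are restricted to like species, and the pairing is found by brute force over all permutations, which is only practical for a handful of atoms. The function names and the toy (MgO)2 coordinates are hypothetical.

```python
from itertools import permutations
from math import dist


def centre(coords, masses):
    """Translate coordinates so that the centre of mass sits at the origin."""
    total = sum(masses)
    com = [sum(m * r[k] for m, r in zip(masses, coords)) / total for k in range(3)]
    return [tuple(r[k] - com[k] for k in range(3)) for r in coords]


def best_pairing(species_a, coords_a, species_b, coords_b):
    """Return (minimum summed distance, permutation) over like-species pairings.

    perm[i] is the index of the atom in B paired with atom i in A.
    Brute force: factorial cost, so only suitable for very small clusters.
    """
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(coords_a))):
        if any(species_a[i] != species_b[j] for i, j in enumerate(perm)):
            continue  # skip pairings that mix species
        cost = sum(dist(coords_a[i], coords_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


# Toy example: a planar (MgO)2 rhombus and a translated, re-ordered copy.
mass = {"Mg": 24.305, "O": 15.999}
species_a = ["Mg", "O", "Mg", "O"]
coords_a = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
species_b = ["O", "Mg", "O", "Mg"]
coords_b = [(5.0, 6.0, 5.0), (4.0, 5.0, 5.0), (5.0, 4.0, 5.0), (6.0, 5.0, 5.0)]

a = centre(coords_a, [mass[s] for s in species_a])
b = centre(coords_b, [mass[s] for s in species_b])
cost, perm = best_pairing(species_a, a, species_b, b)
```

A summed distance close to zero indicates that the two configurations coincide once the atoms are re-ordered; for larger clusters an assignment solver would replace the permutation loop.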
The efficiency of stochastic search algorithms (particle swarm, basin hopping, and genetic or evolutionary algorithms) that are employed to locate local minima (LM) on the energy landscape can be improved if there is a computationally cheap method that provides a measure of how similar two structures are. For example, this could be used to check whether a newly found/generated configuration is unique, whether the starting points are sufficiently spread apart for different random walkers on the energy landscape, or whether the candidate structures in the current population are sufficiently diverse for the evolutionary algorithm (otherwise inbreeding results in the population not evolving, or improving, any further). One may also want to distinguish between enantiomorphic clusters: two clusters that are mirror images of each other. One half of such a pair can easily be lost if the comparison of nanoclusters is simply based on their relative energy of formation (since both enantiomorphic clusters have identical energies). There are several approaches in the literature designed to measure the similarity between structures, 33-45 which can be classified in two groups: direct one-to-one comparison or an indirect approach that requires the generation of labels, also known as fingerprints or hashkeys, which are then compared.
One-to-one comparison algorithms are typically based around a cost function that measures the degree of similarity between two structures. As introduced above, the cost function will depend on the successful superimposition of the two structures, i.e. the translation and rotation of one cluster with respect to the other. Where Dirac delta functions are used to describe the position of an atom, the cost function will also depend on the matching of atomic pairs between the structures. This in itself can pose a formidable task (see for example ref. 33, which employs the Hungarian algorithm). 34-36 This problem is reduced for compounds or alloys if pairs are restricted between like species. Alternatively, where a Gaussian, or a similar function, is centred on each atom, the cost function is typically based on the degree of overlap of atom-centred Gaussians between the two clusters. For compounds and alloys, the overlap of Gaussians can be determined for each species type; there is no explicit need to match pairs of atoms. Goedecker employed a similar scheme, but based on atomic orbitals (see ref. 37). Both types of cost function can also be employed to find out whether, or how well, a smaller cluster matches a fragment of a larger cluster.
In this article, we only compare pairs of clusters that have the same composition, and use only the species type and atomic coordinates as the input. One of the most straightforward and widely used metrics for the comparison of molecular structures is the root-mean-square deviation (RMSD) of the coordinates of equivalent atoms. 38,39 Following a similar idea, the metrics suggested by Sadeghi et al. 37 use configurational fingerprints based on eigenvalues of matrices of interatomic distances. The structural fingerprints are then compared by measuring the distances between them, as small fingerprint-based distances correspond to small RMSD distances. The H-FORMS (a hierarchical algorithm for molecular similarity) 46 approach estimates a rigid transformation that aligns structures and computes rotation-invariant descriptors, which are then used to match atoms. Similarly, R. Hundt et al. implemented an algorithm in the analysis program KPLOT 40 based on the mapping of atomic patterns constructed using three-atom frame matches. An alternative approach to the problem of structure comparison exploits the properties of the nanoclusters, 41 such as radial distribution functions, vibrational frequencies 42 or principal moments of inertia.
Whichever method is used, when a structure needs to be efficiently compared with vast data for thousands or millions of configurations, the chosen approach needs to be both robust and computationally affordable. The second class of comparison methods, based on comparing unique labels that are generated for every configurationally unique structure, may address this big data challenge.
Within our database, we implemented the approach first adopted in the KLMC software 47 to address the challenge of maintaining the diversity of structures during a genetic algorithm search. The approach relies on the NAUTY software package (No AUTomorphisms, Yes?) written by McKay and Piperno, 48 which can generate canonical labels for graphs and compute automorphisms between them. NAUTY labels graphs canonically by providing a string consisting of three 8-digit hexadecimal numbers depending on the graph, i.e. a set of vertices and edges, and, in general, every unique graph will have a unique NAUTY string, also known as a hashkey, or fingerprint. By exploiting the feature of uniqueness, we have incorporated NAUTY in the Bee software in the following way: each cluster is converted to a coloured graph by treating the atoms as vertices and the bonds between them as edges. The number of colours of vertices (atoms) is determined by the number of species in the structure. Thus, (MgO)n clusters will have two different colours (species), whereas Ti n clusters will have only one. It is important to note that (KF)n clusters will also have two different colours, therefore graphs of (MgO)n and (KF)n clusters of the same size can be compared explicitly. The edges of the clusters' graphs are generated from the calculated interatomic distances between the atoms (vertices) of a cluster and can be thought of as "bonds" between atoms. The radial cut-off by which the "bonds" are determined depends on the species and is slightly longer than the expected actual bond length. A flowchart of the implemented hashkey generation is given in Fig. 2, where the (MgO)5 GM cluster is used as an example. Here, the (MgO)5 GM cluster (shown as a ball and stick model in Fig. 2a) is transformed into a coloured graph (shown in Fig. 2b). This graph is then processed using the "NAUTY" software package, which in turn generates a unique hashkey for the cluster. An example of a hashkey is shown in Fig. 2d.
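The conversion of a cluster into a coloured graph and then into a label can be illustrated with the sketch below. It is not the NAUTY algorithm: as a stand-in, it uses a few rounds of Weisfeiler-Lehman colour refinement followed by a hash, which is independent of atom ordering but, unlike a true canonical labelling, can assign the same key to some non-isomorphic graphs. The cut-off values, colour assignments (cation 0, anion 1), coordinates and function names are all assumptions made for illustration; truncating the digest to 24 hexadecimal characters merely mimics the length of three 8-digit hexadecimal numbers.

```python
import hashlib
from math import dist


def build_graph(species, coords, cutoffs):
    """Adjacency lists: atoms i and j are bonded if within the pair cut-off."""
    n = len(coords)
    neighbours = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            pair = frozenset((species[i], species[j]))
            if pair in cutoffs and dist(coords[i], coords[j]) <= cutoffs[pair]:
                neighbours[i].append(j)
                neighbours[j].append(i)
    return neighbours


def wl_hashkey(colours, neighbours, rounds=3):
    """Order-independent graph label from Weisfeiler-Lehman refinement."""
    labels = [str(c) for c in colours]
    for _ in range(rounds):
        labels = [
            hashlib.sha256(
                (labels[i] + "|" + ",".join(sorted(labels[j] for j in nb))).encode()
            ).hexdigest()
            for i, nb in enumerate(neighbours)
        ]
    return hashlib.sha256(",".join(sorted(labels)).encode()).hexdigest()[:24]


def hashkey(species, coords, cutoffs, colour_of):
    graph = build_graph(species, coords, cutoffs)
    return wl_hashkey([colour_of[s] for s in species], graph)


# A (MgO)2 rhombus, a larger (KF)2 rhombus with different atom order,
# and a linear Mg-O-Mg-O chain (all coordinates hypothetical, in angstrom).
mgo_cut = {frozenset(("Mg", "O")): 2.3}
kf_cut = {frozenset(("K", "F")): 3.0}
key_mgo = hashkey(["Mg", "O", "Mg", "O"],
                  [(1.3, 0, 0), (0, 1.4, 0), (-1.3, 0, 0), (0, -1.4, 0)],
                  mgo_cut, {"Mg": 0, "O": 1})
key_kf = hashkey(["F", "K", "F", "K"],
                 [(0, 1.9, 0), (1.8, 0, 0), (0, -1.9, 0), (-1.8, 0, 0)],
                 kf_cut, {"K": 0, "F": 1})
key_chain = hashkey(["Mg", "O", "Mg", "O"],
                    [(0, 0, 0), (1.9, 0, 0), (3.8, 0, 0), (5.7, 0, 0)],
                    mgo_cut, {"Mg": 0, "O": 1})
```

In this toy case the two rhombi share a key because they have the same connectivity and colouring, whereas the chain does not.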
Given that the comparison of hashkeys is orders of magnitude faster than comparing atomic structures explicitly, each cluster within the Hive database is labelled with a hashkey. As described above, the hashkeys enable a rapid check of the database for duplicate structures by both the WASP and Bee software and are used in the generation of maps connecting similar structures (the network of links between clusters entered into the database is updated as soon as the atomic coordinates of generated and published LM nanoclusters are uploaded to the Hive), a feature that is not currently implemented in other structural databases. This feature has proven to be essential when the WASP interface is used to find out whether a newly discovered cluster is already within the Hive. To demonstrate one of the utilities our database provides, we have used the generated hashkeys to identify unique structural motifs for a particular stoichiometry (1 : 1) and size (24 atoms). We then data-mined from this set, rather than a set of LM configurations of one or all compounds in the Hive.

Data normalisation
Published LM cluster structures, which can be uploaded to the database, are, by definition, dependent upon the level of theory, and its accuracy, employed in the calculation of energy as a measure of stability. Moreover, the measure of fitness may also be based on the deviation from some geometric, physical or chemical observable(s). When LM on a potential energy landscape are targeted, energy calculations at different levels of theory (quantum mechanical (all-electron or pseudopotential), semi-empirical, Hartree-Fock, DFT, tight-binding, semi-classical, or atomistic simulations) yield values that may scatter across a few orders of magnitude. Even if a similar method is chosen, e.g. DFT with identical basis sets and, possibly, effective core potentials, employing different exchange and correlation density functionals could still lead to substantially different values. The situation is just as problematic if semi-classical simulations are employed, as there are often many different sets of parameterised interatomic potentials for the same material or compound. One trick commonly used across the field of materials chemistry is to switch from total to binding or cohesion energies, which can be expected to behave better, and do in practice. 49 The scatter in the calculated binding energy values obtained using different approaches is usually, however, still greater than the energy separating low ranking energy minima on the same energy landscape (definition of energy). In practice, the WASP interface lets users upload their data without any restrictions on how the data were obtained, but encourages the users to provide details of the adopted computational approach as metadata. To support the comparison of individual structures obtained using different energy definitions, we introduced an internal standard attained by a data normalisation routine.
In particular, when data are uploaded to the Hive database, they are automatically refined by the Bee software, using the all-electron, full potential electronic structure code FHI-aims 50 with the PBEsol functional 51-53 and the light basis set (which is variationally equivalent to split valence double-zeta Gaussian plus polarisation basis sets but can obtain energies that are much closer to the basis set limit). Further computational parameters are provided in the ESI.† After normalisation, the newly obtained structure is automatically uploaded to the Hive database with a two-way link between the original and normalised configurations, along with similarity links to the whole dataset in the database.
Hence, the user can search for structures that refine to the same LM on our normalised energy landscape (particularly useful for the investigation of nanoclusters of the same compound) or structures of any compound with the same connectivity (structural motif), as explained in the previous section.

Data mining
Starting from a known set of atomic configurations with the target stoichiometry and total number of atoms, the Data Mining (DM) module of the KLMC software package 54 rescales each configuration to obtain an estimate of the expected nearest neighbour interatomic distances for the target compound, and then, using third party software, relaxes the rescaled atomic structures to LM. In the results shown below, we employ GULP 55 as the third party software, i.e. a semiclassical level of theory is used for the calculation of energies (and atomic forces). After the rescaling and refinement procedure, KLMC is also employed to analyse the resulting configurations in terms of their energy ranking, uniqueness and geometrical properties.
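The rescaling step can be pictured with the following minimal sketch. The exact rescaling rule used in the KLMC DM module is not given here, so this example assumes a uniform scaling about the centroid chosen so that the mean nearest neighbour distance matches a target value; the function names, the toy cube and the target distance are hypothetical, and the subsequent relaxation by third party software is not shown.

```python
from math import dist


def nearest_neighbour_distance(coords):
    """Mean, over all atoms, of the distance to the closest other atom."""
    n = len(coords)
    return sum(
        min(dist(coords[i], coords[j]) for j in range(n) if j != i)
        for i in range(n)
    ) / n


def rescale(coords, target_distance):
    """Scale uniformly about the centroid to reach the target distance."""
    n = len(coords)
    centroid = [sum(r[k] for r in coords) / n for k in range(3)]
    factor = target_distance / nearest_neighbour_distance(coords)
    return [
        tuple(centroid[k] + factor * (r[k] - centroid[k]) for k in range(3))
        for r in coords
    ]


# Toy template: an eight-atom cube with 2.1 angstrom edges, rescaled so that
# nearest neighbours sit 3.0 angstrom apart (both values are illustrative).
template = [(x, y, z) for x in (0.0, 2.1) for y in (0.0, 2.1) for z in (0.0, 2.1)]
rescaled = rescale(template, 3.0)
```

A uniform scaling preserves the connectivity of the template, so the rescaled structure starts the local relaxation with the same structural motif.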

Global optimisation
A Lamarckian genetic algorithm (GA) approach implemented in the KLMC software package 47 was also used to locate LM on the energy landscape defined by the same set of interatomic potentials (semi-classical level of theory) as those used in the data-mining investigation. We note that the ability of the KLMC GA 47 to locate LM and GM efficiently has been proven for various types of system, and thus it is chosen here as a method for providing reliable data that we can use to assess the results obtained using the data-mining approach. The population of each GA run was set to 200 candidate structures, with the initial random structures generated within a 15 Å × 15 Å × 15 Å cubic simulation box. Default values, as given in ref. 47, were used for the remaining simulation parameters.

Isomorphic structures, or structural motifs
As an illustration of how the connectivity maps are employed, we consider the case of a GM nanocluster reported in ref. 56 for (MgO)7 that has the symmetry point group C3v; see Fig. 3a. The topological analysis tool finds that this structure has "7Mg3-7O3" topology, i.e. seven Mg and seven O atoms, each with a coordination number of three. When selected using the WASP interface for the Hive, beneath the rotatable ball and stick model of this structure are two lists; one showing the standardised entry for this configuration (as described earlier), and another showing all the "isomorphic structures" found in the Hive based on matching hashkeys (as also described above). A snapshot of the second list is shown in Fig. 4. In our chosen example, the (MgO)7 GM structure currently has eleven isomorphic structures: eleven atomic configurations within the Hive have the same hashkey as our chosen example. The inclusion of a DOI in the entry for a candidate structure in this list indicates that it is a published LM. Six of the eleven entries carry a DOI; the remaining five are, therefore, standardised LM (using FHI-aims). As more entries are submitted to the Hive, we would expect many more matches to be found. The six published LM show that this structural motif is also reported 54,56,57 to be the GM for (KF)7, (CaO)7, (SrO)7, (BaO)7 and (CdSe)7. There is also another (MgO)7 configuration, which has a different DOI 54 to that of the original chosen structure. Given that there are six different compounds with the same structural motif, we would expect six standardised LM. The two published LM entries for (MgO)7, the same compound, relax to the same standardised LM. To find all the nanoclusters within the Hive that relax to the same standardised LM, the user only needs to click on the thumbnail of the standardised nanocluster. In our example, the missing standardised LM results from the standardised configuration for (CdSe)7 relaxing to a different LM.
Therefore, it has a different hashkey as it is a different structure (in fact, it has C 1 point symmetry).
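The two look-ups described above, a topology label built from coordination numbers and a list of entries sharing a hashkey, can be sketched as follows. The label format is inferred from the single "7Mg3-7O3" example, and the record layout, placeholder DOIs and hashkey strings are hypothetical; they do not reflect the actual schema of the Hive. A small (MgO)4 cube is used instead of the (MgO)7 cluster to keep the adjacency list short.

```python
from collections import Counter


def topology_label(species, neighbours):
    """Build a label such as '4Mg3-4O3': count, species, coordination number."""
    counts = Counter((s, len(nb)) for s, nb in zip(species, neighbours))
    first_seen = {s: i for i, s in reversed(list(enumerate(species)))}
    ordered = sorted(counts, key=lambda sc: (first_seen[sc[0]], sc[1]))
    return "-".join(f"{counts[sc]}{sc[0]}{sc[1]}" for sc in ordered)


def split_isomorphic(entries, hashkey):
    """Return (published, standardised) entries that share the given hashkey.

    Assumption: published entries carry a DOI, standardised ones do not.
    """
    matches = [e for e in entries if e["hashkey"] == hashkey]
    published = [e for e in matches if e["doi"] is not None]
    standardised = [e for e in matches if e["doi"] is None]
    return published, standardised


# (MgO)4 cube: vertex i is bonded to the three vertices differing in one bit.
cube_species = ["Mg" if bin(i).count("1") % 2 == 0 else "O" for i in range(8)]
cube_neighbours = [[i ^ 1, i ^ 2, i ^ 4] for i in range(8)]
label = topology_label(cube_species, cube_neighbours)

# Toy records mimicking the situation in the text: two published entries for
# one compound share a standardised LM, and one standardised entry has
# relaxed to a different motif and therefore has a different hashkey.
entries = [
    {"compound": "MgO", "doi": "placeholder-doi-1", "hashkey": "motif-A"},
    {"compound": "MgO", "doi": "placeholder-doi-2", "hashkey": "motif-A"},
    {"compound": "KF", "doi": "placeholder-doi-2", "hashkey": "motif-A"},
    {"compound": "CdSe", "doi": "placeholder-doi-3", "hashkey": "motif-A"},
    {"compound": "MgO", "doi": None, "hashkey": "motif-A"},
    {"compound": "KF", "doi": None, "hashkey": "motif-A"},
    {"compound": "CdSe", "doi": None, "hashkey": "motif-B"},
]
published, standardised = split_isomorphic(entries, "motif-A")
```

With these toy records, three compounds are published with motif A but only two standardised entries match it, which is the pattern that reveals a standardised structure relaxing to a different LM.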

Efficient structure prediction
The Hive contains the LM atomic structures for numerous binary compounds with 1 : 1 stoichiometry and a total charge of 0. We now concentrate on one particular size, clusters composed of 12 cations and 12 anions. To investigate a compound that is missing from the Hive database, one could data-mine structures already in the Hive for a similar compound. The success of this approach would rely on the chosen set of initial configurations; the more extensive this set, the greater the probability of finding the target LM. To maximise this probability one could data-mine all the compounds; however, this would generate many copies of each LM. Using the hashkey, which provides a unique identifier for each structural motif, we were able to reduce this initial set to just over 100 unique structural motifs (which we will refer to as the DM-set). If the database contained entries for alkali halides, (XY)12, and alkaline earth oxides, (ZO)12, for X = Li to Cs, Y = F to I, and Z = Mg to Ba, then potentially there would be a maximum reduction of 96%. The determination of this reduced set (calculation and comparison of hashkeys) is orders of magnitude faster to perform than the additional structural relaxations (using standard algorithms within an electronic structure code) that would have been necessary if we could not determine equivalent structures. Moreover, data-mining requires the evaluation of far fewer candidate structures than is typically performed in a stochastic approach. It is expected that the number of datasets within the Hive will grow, and that important unique structural motifs may be missed given our search has been performed soon after we have created this database. Stochastic approaches may also miss important LM, and the number of unique motifs is likely to increase much more slowly than the number of entries for clusters of any particular size, charge and stoichiometry.
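The quoted maximum reduction can be checked with a short calculation. The sketch below assumes the element ranges are Li, Na, K, Rb, Cs for X; F, Cl, Br, I for Y; and Mg, Ca, Sr, Ba for Z, and that, in the limiting case, every compound has an entry for every motif. The record layout and motif names are hypothetical.

```python
def unique_motifs(entries):
    """Keep the first entry seen for each hashkey."""
    seen = {}
    for entry in entries:
        seen.setdefault(entry["hashkey"], entry)
    return list(seen.values())


alkali = ["Li", "Na", "K", "Rb", "Cs"]
halides = ["F", "Cl", "Br", "I"]
alkaline_earth = ["Mg", "Ca", "Sr", "Ba"]

# 5 x 4 alkali halides plus 4 alkaline earth oxides.
compounds = [x + y for x in alkali for y in halides]
compounds += [z + "O" for z in alkaline_earth]

# Limiting case: each compound contributes one entry per structural motif.
motifs = ["motif-A", "motif-B"]
entries = [{"compound": c, "hashkey": m} for c in compounds for m in motifs]

reduced = unique_motifs(entries)
reduction = 1.0 - len(reduced) / len(entries)
```

With 24 compounds, each motif appears 24 times, so removing duplicates leaves 1/24 of the entries, a reduction of about 96% regardless of how many motifs there are.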
Table 1 Parameters for the Buckingham potential, $A\exp(-r/\rho) - C/r^{6}$, applied between ions X and Y. Columns: X-Y, $A$ (eV), $\rho$ (Å$^{-1}$), $C$ (Å$^{6}$ eV).

Table 2 Parameters for the shell model for ions X, where Q and Y are the point-charges of the core and shell, which are connected by a spring with constants $k_2$ and $k_4$. The Coulomb contribution to the energy between point-charges of an individual ion X is replaced with the energy associated with the spring, $\frac{1}{2}k_2 x^2 + \frac{1}{4}k_4 x^4$, where $x$ is the distance between the core and shell. Note that the strontium cation is treated as a rigid ion and therefore only has one parameter.

Using our DM-set of unique LM, we now investigate three different compounds that were not included in the initial dataset taken from the Hive, namely (LiI)12, (SrO)12 and (GaAs)12. As the main focus of this article is the methodology as opposed to the physical and electronic properties of the predicted nanoclusters, we have chosen to present new IP-LM structures, i.e. the atomic configurations and ranks of local minima on the energy landscape are defined using interatomic potentials (IP), the parameters of which are given in Tables 1 and 2. For each compound we also perform a search of low energy IP-LM using an evolutionary algorithm; details of both methods are described in the previous section. We note that the potential parameters for LiI were taken from ref. 58. The small spring constant for the lithium cation caused problems during the global optimisation runs; during the relaxations of new candidate structures (particularly the random structures used in the initial population), the initial electric fields were sometimes strong enough that during structural relaxation the shell was stripped away from the cation. It is known that the polarisability of an ion is dependent upon the electric field, which is much stronger for our clusters than that experienced within the bulk. Thus, in our simulations, we doubled the value of the spring constant for lithium cations, which corresponds to an apparent reduction in their coordination number compared to the bulk. The results from data-mining our DM-set of unique LM are shown in Fig. 5-7.
For strontium oxide, lithium iodide and gallium arsenide, 47, 50 and 41 LM structures were generated, respectively, i.e. not all the structural motifs of one compound were locally stable for another. Moreover, a different global minimum was found for each compound. Labelled DM01 in Fig. 5, the D3d barrel was found to be the IP-GM for (SrO)12, whereas for (LiI)12 and (GaAs)12 it was ranked fourth and second, respectively. The 2 × 2 × 6 D2d configuration of alternating atoms, labelled DM01 in Fig. 6, was found to be the IP-GM for (LiI)12. One can imagine that this cuboid configuration could be cut from the NaCl rock salt structure, and thus it is not surprising that this structural motif was not generated for (GaAs)12. The Th sodalite cage, so named as it is a basic building block of the sodalite bulk structure (given the abbreviation SOD by the zeolite community), was found to be the IP-GM for (GaAs)12. This configuration was ranked fifth and thirty-eighth for (SrO)12 and (LiI)12, respectively. Comparing the ball and stick models for different compounds but for the same structural motif, one noticeable difference between the LM for lithium iodide and those of the other two compounds is the sharper (more acute) bond angles that directly result from the greater polarisability of the iodide anion. Essentially, the iodide anions are further out from the cluster's centre of mass than the lithium cations. To check the current success of data-mining the Hive for these three compounds, we also conducted global optimisation on each of the three IP-energy landscapes for low lying LM. We present the results as three densities of LM graphs; see Fig. 8. In the panel insert for each compound it is very clear that the data-mined LM present only a sample of all the possible LM. In terms of ranking, fortunately, the missing LM tend to be mid-range rather than at the more stable end (which, typically, is where there is most interest).
Looking more closely at the top ranked LM, we identified which IP-LM structures are missing; these are shown in Fig. 9.
For strontium oxide clusters, the first six missing LM were ranked 6, 7, 8, 9, 13 and 16. The first three of these are basic rock-salt cuts that could have been included in our data-mined set if we had included the structures from ref. 54 (we did not as this paper includes data-mined structures for alkaline oxides, one of which is one of the compounds we chose to investigate). The GA08 cuboid configuration was in fact found as the IP-GM for (LiI)12. Generating this LM during the data-mining process was fortuitous given that this structural motif was not included in the DM-set of unique LM. GA09 and GA13 are composed of an n = 6 drum (typically the IP-GM for (XY)6) and 2 × 2 × m cuboids. More interesting is the GA16 configuration, which we have previously seen; it has an unusual distorted planar four-coordinated oxygen anion site. For lithium iodide clusters, the first six missing LM structures were ranked 3, 4, 5, 7, 8 and 9. Unlike our DM-set, these configurations, which we will refer to as HC, have at least one highly coordinated (greater than 4) anion site and are not one of the possible cuboid cuts from the NaCl rock salt phase. Given the stability of this type of structure, quite a few of the better ranked structures were missed. As already seen, any unstable LM in the DM-set can lead to new structural LM and thus we did not miss all of the HC structures; the enantiomer of GA03 was found (labelled as DM03 in Fig. 6 and ranked equal third). For gallium arsenide clusters, data-mining the DM-set was much more successful in that only four additional IP-LM structures were found in the top thirty; the first four missing LM structures were ranked 7, 14, 21 and 29. Of these, GA07 is the result of merging IP-GM for n = 6 (a drum) and n = 9 (bubble) across a hexagonal face; GA14 is very similar to the GA16 LM that was missed for (SrO)12; GA21 has the same structural motif as DM18, but with all the anions switched for cations, and vice versa, cf. DM23 and DM24 and also GA06 and GA07 for strontium oxide. We note that the DM and GA runs found different chiral versions of DM23 and DM24.
Finally, we should reiterate that the structures reported above for LiI, SrO and GaAs were obtained on the interatomic potential landscape. These potentials were originally parameterised for bulk compounds, where atoms are typically in higher coordinated environments, and therefore such parameterisations are very limited in scope. For example, arsenide anions are highly polarisable, and more realistic structures should be expected to have more buckled shapes, as seen above in LiI configurations. The latter proved to be easier to optimise due to the relatively low charges on Li and I. Notwithstanding this, the structures obtained here will be uploaded to the Hive and refined using our chosen ab initio approach, which will both give the actual findings more credence for future applications and allow the parameters of the interatomic potentials to be refined. The latter is an important element of machine-learning techniques that have been particularly successful in studies of metallic clusters. 59,60
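For reference, the two energy terms that define this interatomic potential landscape, as written in the captions of Tables 1 and 2, can be evaluated with the minimal sketch below. The parameter values are placeholders chosen only to exercise the functions; they are not the published parameters, and the function names are hypothetical.

```python
from math import exp


def buckingham(r, A, rho, C):
    """Short-range pair energy A*exp(-r/rho) - C/r**6 at separation r.

    rho must be in the same length unit as r for the exponent to be
    dimensionless.
    """
    return A * exp(-r / rho) - C / r**6


def spring(x, k2, k4=0.0):
    """Core-shell spring energy 0.5*k2*x**2 + 0.25*k4*x**4 at separation x."""
    return 0.5 * k2 * x**2 + 0.25 * k4 * x**4


# Placeholder parameters (not taken from Tables 1 and 2).
e_pair = buckingham(r=2.0, A=1000.0, rho=0.3, C=0.0)

# Doubling the harmonic spring constant, as done for the lithium cation,
# doubles the energy cost of a given core-shell displacement.
e_soft = spring(x=0.1, k2=50.0)
e_stiff = spring(x=0.1, k2=100.0)
```

A stiffer spring makes it harder for a strong local electric field to pull the shell away from its core, which is the failure mode described for the lithium cation during relaxations.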

Conclusions
Fig. 9 Ball and stick models of (XY)12 IP-LM configurations obtained by the genetic algorithm that were missing from the IP-LM found using the data-mining approach. The colour scheme is shown in the lower right hand panel and is the same as that employed in previous figures. The numbers in the GA** labels indicate the rank found for the nanocluster, where 01 indicates the IP-GM, whereas in the previous labels, DM**, they indicate the rank before the missing IP-LM were found using the GA.

We have presented, for the first time, details of our database of published atomic configurations of nanoclusters. We have described the algorithms employed within this database to establish whether two entries are equivalent LM for a particular compound and whether configurations of different compounds are equivalent when judged using connectivity arguments, and have shown how to exploit these data in order to predict structures for three new compounds. The database provides initial model structures that were traditionally obtained from experiments, configurations that can be employed in structure prediction using a data-mining approach, and a way of checking whether a candidate structure is indeed new. Data-mining the set of configurations for (XY)12 structures that have a unique hashkey proved relatively successful in that the top two LM configurations for each of three compounds were found. However, global optimisation techniques are still required for compounds that are chemically distinct enough that their low energy LM structures do not match configurations already in the database, using our connectivity arguments. This will of course change with time, as more data are entered into the database. Lessons learnt in the creation of the Hive and the associated WASP interface as a toolkit will be of direct use for further work on nucleation and crystallisation processes, 61 crucially the nucleation and growth of small particles on or in solid supports and liquid environments.
The LM atomic configurations in the database are also readily usable as secondary building units (SBU) for constructing crystal structures. 6,8,10,62-68 Here, using low energy SBUs that do not resemble cuts from the main phases of the chosen compounds will produce more interesting results.

Conflicts of interest
There are no conflicts to declare.