Open Access Article
Scott M.
Woodley
*a,
Tomas
Lazauskas
a,
Malcolm
Illingworth
b,
Adam C.
Carter
b and
Alexey A.
Sokol
a
aUniversity College London, Department of Chemistry, 20 Gordon Street, London WC1H 0AJ, UK. E-mail: Scott.Woodley@ucl.ac.uk
bEPCC, James Clerk Maxwell Building, Peter Guthrie Tait Road, Edinburgh, EH9 3FD, UK
First published on 1st August 2018
To address the question posed in the title, we have created, and now report details of, an open-access database of cluster structures with a web-assisted interface and toolkit as part of the WASP@N project. The database establishes a map of connectivities within each structure, the information about which is coded and kept as individual labels, called hashkeys, for the nanoclusters. These hashkeys are the basis for structure comparison within the database, and for establishing a map of connectivities between similar structures (topologies). The database is successfully used as a key element in a data-mining study of (MX)12 clusters of three binary compounds (LiI, SrO and GaAs) of which the database has no prior knowledge. The structures are assessed on the energy landscapes determined by the corresponding bulk interatomic potentials. Global optimisation, using a Lamarckian genetic algorithm, is used to search for low lying minima on the same energy landscape to confirm that the data-mined structures form a representative sample of the landscapes, with only very few structures missing from the close energy neighbourhood of the respective global minima.
Another problem encountered by practically every practitioner of global optimisation for structure prediction is how to ascertain that the newly discovered configuration of a particular compound is not known from competitors’ studies, for example, or exists out there under the guise of a different compound of similar stoichiometry, or is not published but is known as a lower ranked local energy minimum (i.e. data that has a rank that is beyond a chosen set threshold for publication). The use of slightly different energy functions, unintentional effects of tolerances both in energy definition and local optimisation, or possibly an intentional bias to match measurable properties (for example, infrared data) will all muddle the waters further.
The choice of the best – or most suitable for the investigator’s purposes – cost (or fitness) function is uncertain, and could be quite different in different studies even on the same system.
To address these challenges, we have developed a database complemented by a toolkit that includes structure comparison as a key element. Aggregating structures and their properties into one place also enables the sophisticated exploration of structural motifs and particular properties and the discovery of structure–property relationships. Databases are not a new concept in materials modelling,19–29 even in the field of nanoclusters.30,31 Crucially, our searchable database generates a map of connections relating different structures. In this article, we describe both the database and the algorithms that generate these mappings, followed by simple showcase examples.
![]() | ||
| Fig. 1 Schematic of the hardware and software solution for web-assisted structure prediction at the nanoscale. | ||
Datasets within the Hive are organised as follows: (a) published atomic structures, the atomic coordinates of which were originally used to generate a figure (e.g. ball and stick models) or were explicitly given in a table as part of a published paper (or electronic supplementary information) that has a DOI; and (b) atomic structures generated using the Bee software. For the former, the atomic structures are labelled using the DOI of the published article they were taken from, and are uploaded as one or more concatenated xyz file(s) using an extended format that contains both the metadata saved on the comment line and the atomic structure, which includes atomic labels: Cartesian coordinates; one additional scalar and one vector record per atom (for example, charges, spin, dipole on atom). Searchable metadata are vital for the use of a database. Values for metadata that can be provided include the definition of energy and software, total charge, energy ranking, total spin, etc. For example, the comment line:
| “Name=drum; Symmetry=D3h; Definition={FHI-aims, PBE0/PBE, tight}; Energy=210Hartree; Size=6; Atoms=12; Charge=0; Spin=0; Dipole=(0,0,0)” |
The essential search and comparison features of WASP enable the user to investigate structural motifs and physical properties. The comparison of clusters can be quite expensive and, therefore, comparison-based pre-searches are performed by the Bee software upon the upload of new datasets, both published and generated. A description of the algorithms employed in these comparisons is provided in the next section. The results of pre-searches are saved as links between (thus establishing) related structures. These links, or new metadata generated by the Bee software, form a map linking different structures in the database. The map can be readily exploited by the user through the WASP interface to ascertain the uniqueness of newly found configurations of clusters of a certain compound and size or to compare clusters of different compounds. Moreover, as we will demonstrate below, this map can also help to reduce the effort needed to explore the energy landscapes of a compound that has yet to be investigated. The computational work and the interaction of the three complementary codes (WASP, Bee, and the Hive) are supported by appropriate hardware solutions – as illustrated in Fig. 1 – and related operating system server software (including task scheduler, etc.). In the near future, we plan to expand the solution shown in Fig. 1 to include the exploitation of third party computing platforms.
The efficiency of stochastic search algorithms – particle swarm, basin hopping, and genetic or evolutionary algorithms – that are employed to locate local minima (LM) on the energy landscape can be improved if there is a computationally cheap method that provides a measure of how similar two structures are. For example, this could be used to check whether a newly found/generated configuration is unique, whether the starting points are sufficiently spread apart for different random walkers on the energy landscape, or whether the candidate structures in the current population are sufficiently diverse for the evolutionary algorithm (otherwise inbreeding results in the population not evolving, or improving, any further). One may also want to distinguish between enantiomorphic clusters – two clusters that are mirror images of each other. One half of such a pair can easily be lost if the comparison of nanoclusters is simply based on their relative energy of formation (since both enantiomorphic clusters have identical energies).
There are several approaches in the literature designed to measure the similarity between structures,33–45 which can be classified in two groups: direct one-to-one comparison or an indirect approach that requires the generation of labels, also known as fingerprints or hashkeys, which are then compared.
One-to-one comparison algorithms are typically based around a cost function that measures the degree of similarity between two structures. As introduced above, the cost function will depend on the successful superimposition of the two structures, i.e. the translation and rotation of one cluster with respect to the other. Where Dirac delta functions are used to describe the position of an atom, the cost function will also depend on the matching of atomic pairs between the structures. This in itself can pose a formidable task (see for example ref. 33, which employs the Hungarian algorithm).34–36 This problem is reduced for compounds or alloys if pairs are restricted between like species. Alternatively, where a Gaussian, or a similar function, is centred on each atom, the cost function is typically based on the degree of overlap of atom-centred Gaussians between the two clusters. For compounds and alloys, the overlap of Gaussians can be determined for each species type; there is no explicit need to match pairs of atoms. Goedecker employed a similar scheme, but based on atomic orbitals (see ref. 37). Both types of cost function can also be employed to find out whether, or how well, a smaller cluster matches a fragment of a larger cluster.
In this article, we only compare pairs of clusters that have the same composition, and use only the species type and atomic coordinates as the input. One of the most straightforward and widely used metrics for the comparison of molecular structures is the root-mean-square deviation (RMSD) of the coordinates of equivalent atoms.38,39 Following a similar idea, the metrics suggested by Ali Sadeghi et al.37 use configurational fingerprints based on eigenvalues of matrices of interatomic distances. The structural fingerprints are then compared by measuring the distances between them, as small fingerprint-based distances correspond to small RMSD distances. The H-FORMS (a hierarchical algorithm for molecular similarity)46 approach estimates a rigid transformation that aligns structures and computes rotation-invariant descriptors, which are then used to match atoms. Similarly, R. Hundt et al. implemented an algorithm in the analysis program KPLOT40 based on the mapping of atomic patterns constructed using three-atom frame matches. An alternative approach to the problem of structure comparison exploits the properties of the nanoclusters,41 such as radial distribution functions, vibrational frequencies42 or principal moments of inertia.
Whichever method is used, when a structure needs to be efficiently compared with vast data for thousands or millions of configurations, the chosen approach needs to be both robust and computationally affordable. The second class of comparison methods – based on comparing unique labels that are generated for every configurationally unique structure – may address this big data challenge.
Within our database, we implemented the approach first adopted in the KLMC software47 to address the challenge of maintaining the diversity of structures during a genetic algorithm search. The approach relies on the NAUTY software package (No AUTomorphisms, Yes?) written by McKay and Piperno,48 which can generate canonical labels for graphs and compute automorphisms between them. NAUTY labels graphs canonically by providing a string consisting of three 8-digit hexadecimal numbers depending on the graph, i.e. a set of vertices and edges, and, in general, every unique graph will have a unique NAUTY string, also known as a hashkey, or fingerprint. By exploiting the feature of uniqueness, we have incorporated NAUTY in the Bee software in the following way: each cluster is converted to a coloured graph by treating the atoms as vertices and the bonds between them as edges. The number of colours of vertices (atoms) is determined by the number of species in the structure. Thus, (MgO)n clusters will have two different colours (species), whereas Tin clusters will have only one. It is important to note that (KF)n clusters will also have two different colours, therefore graphs of (MgO)n and (KF)n clusters of the same size can be compared explicitly. The edges of the clusters’ graphs are generated from the calculated interatomic distances between the atoms (vertices) of a cluster and can be thought of as “bonds” between atoms. The radial cut-off by which the “bonds” are determined depends on the species and is slightly longer than the expected actual bond length. A flowchart of the implemented hashkey generation is given in Fig. 2, where the (MgO)5 GM cluster is used as an example. Here, the (MgO)5 GM cluster (shown as a ball and stick model in Fig. 2a) is transformed into a coloured graph (shown in Fig. 2b). This graph is then processed using the “NAUTY” software package, which in turn generates a unique hashkey for the cluster. An example of a hashkey is shown in Fig. 2d.
![]() | ||
| Fig. 2 Flowchart of hashkey (fingerprint) generation illustrated using the (MgO)5 global minimum configuration. | ||
Given that the comparison of hashkeys is orders of magnitude faster than comparing atomic structures explicitly, each cluster within the Hive database is labelled with a hashkey. As described above, the hashkeys enable a rapid check of the database for duplicate structures by both the WASP and Bee software and are used in the generation of maps connecting similar structures (the network of links between clusters entered into the database is updated as soon as the atomic coordinates of generated and published LM nanoclusters are uploaded to the Hive) – a feature that is not currently implemented in other structural databases. This feature has proven to be essential when the WASP interface is used to find out whether a newly discovered cluster is already within the Hive. To demonstrate one of the utilities our database provides, we have used the generated hashkeys to identify unique structural motifs for a particular stoichiometry (1
:
1) and size (24 atoms). We then data-mined from this set, rather than a set of LM configurations of one or all compounds in the Hive.
In practice, the WASP interface lets users upload their data without any restrictions on how the data were obtained, but encourages the users to provide details of the adopted computational approach as metadata. To support the comparison of individual structures obtained using different energy definitions, we introduced an internal standard attained by a data normalisation routine. In particular, when data are uploaded to the Hive database, they are automatically refined by the Bee software, using the all-electron, full potential electronic structure code FHI-aims50 with the PBEsol functional,51–53 the light basis set (which is variationally equivalent to split valence double-zeta Gaussian plus polarisation basis sets but can obtain energies that are much closer to the basis set limit). Further computational parameters are provided in the ESI.† After normalisation, the newly obtained structure is automatically uploaded to the Hive database with a two-way link between the original and normalised configurations, along with similarity links to the whole dataset in the database.
Hence, the user can search for structures that refine to the same LM on our normalised energy landscape (particularly useful for the investigation of nanoclusters of the same compound) or structures of any compound with the same connectivity (structural motif), as explained in the previous section.
![]() | ||
| Fig. 4 Snapshot of the lower part of the WASP interface showing 11 results (5 generated by the Bee software) returned from the Hive for structures that have the same hashkey as the GM structure for (MgO)7 as reported in ref. 56. | ||
:
1 stoichiometry and a total charge of 0. We now concentrate on one particular size, clusters composed of 12 cations and 12 anions. To investigate a compound that is missing from the Hive database, one could data-mine structures already in the Hive for a similar compound. The success of this approach would rely on the chosen set of initial configurations; the more extensive this set, the greater the probability of finding the target LM. To maximise this probability one could data-mine all the compounds; however, this would generate many copies of each LM. Using the hashkey, which provides a unique identifier for each structural motif, we were able to reduce this initial set to just over 100 unique structural motifs (which we will refer to as the DM-set). If the database contained entries for alkali halides, (XY)12, and alkaline earth oxides, (ZO)12, for X = Li to Cs, Y = F to I, and Z = Mg to Ba, then potentially there would be a maximum reduction of 96%.
The determination of this reduced set (calculation and comparison of hashkeys) is orders of magnitude faster to perform than the additional structural relaxations (using standard algorithms within an electronic structure code) that would have been necessary if we could not determine equivalent structures. Moreover, data-mining requires the evaluation of far fewer candidate structures than is typically performed in a stochastic approach. It is expected that the number of datasets within the Hive will grow, and that important unique structural motifs may be missed given our search has been performed soon after we have created this database. Stochastic approaches may also miss important LM, and the number of unique motifs is likely to increase much more slowly than the number of entries for clusters of any particular size, charge and stoichiometry.
Using our DM-set of unique LM, we now investigate three different compounds that were not included in the initial dataset taken from the Hive, namely (LiI)12, (SrO)12 and (GaAs)12. As the main focus of this article is the methodology as opposed to the physical and electronic properties of the predicted nanoclusters, we have chosen to present new IP-LM structures, i.e. the atomic configurations and ranks of local minima on the energy landscape are defined using interatomic potentials (IP), the parameters of which are given in Tables 1 and 2. For each compound we also perform a search of low energy IP-LM using an evolutionary algorithm; details of both methods are described in the previous section. We note that the potential parameters for LiI were taken from ref. 58. The small spring constant for the lithium cation caused problems during the global optimisation runs; during the relaxations of new candidate structures (particularly the random structures used in the initial population), the initial electric fields were sometimes strong enough that during structural relaxation the shell was stripped away from the cation. It is known that the polarisability of an ion is dependent upon the electric field, which is much stronger for our clusters than that experienced within the bulk. Thus, in our simulations, we doubled the value of the spring constant for lithium cations, which corresponds to an apparent reduction in their coordination number compared to the bulk.
exp(−r/ρ) − C/r6, applied between ions X and Y
| X–Y | A (eV) | ρ (Å−1) | C (Å6 eV) |
|---|---|---|---|
| Li–Li | 1153.80 | 0.13640 | 0.000 |
| Li–I | 8894.70 | 0.26170 | 0.000 |
| I–I | 5502.50 | 0.30660 | 0.000 |
| Sr–O | 1952.39 | 0.33685 | 19.220 |
| O–O | 22 764.00 |
0.14900 | 27.879 |
| Ga–Ga | 470.18 | 0.13490 | 0.000 |
| Ga–As | 1544.69 | 0.42310 | 0.000 |
| As–As | 484.31 | 0.24810 | 0.000 |
| X | Q (e) | Y (e) | k 2 (eV Å−2) | k 4 (eV Å−4) |
|---|---|---|---|---|
| Li | 0.295 | 0.705 | 15.979 | 0 |
| I | 3.087 | −4.087 | 39.950 | 0 |
| Sr | 2.000 | |||
| O | 0.869 | −2.869 | 74.920 | 0 |
| Ga | 3.436 | −0.436 | 2418.361 | 0 |
| As | 0.809 | −3.809 | 7.722 | 300 000 |
The results from data-mining our DM-set of unique LM are shown in Fig. 5–7. For strontium oxide, lithium iodide and gallium arsenide, 47, 50 and 41 LM structures were generated, respectively, i.e. not all the structural motifs of one compound were locally stable for another. Moreover, a different global minimum was found for each compound. Labelled DM01 in Fig. 5, the D3d barrel was found to be the IP-GM for (SrO)12, whereas for (LiI)12 and (GaAs)12 it was ranked fourth and second, respectively. The 2 × 2 × 6 D2d configuration of alternating atoms, labelled DM01 in Fig. 6, was found to be the IP-GM for (LiI)12. One can imagine that this cuboid configuration could be cut from the NaCl rock salt structure, and thus it is not surprising that this structural motif was not generated for (GaAs)12. The Th sodalite cage, so named as it is a basic building block of the sodalite bulk structure (given the abbreviation SOD by the zeolite community), was found to be the IP-GM for (GaAs)12. This configuration was ranked fifth and thirty-eighth for (SrO)12 and (LiI)12, respectively. Comparing the ball and stick models for different compounds but for the same structural motif, one noticeable difference between the LM for lithium iodide and those of the other two compounds is the sharper (more acute) bond angles that directly result from the greater polarisability of the iodide anion. Essentially, the iodide anions are further out from the cluster’s centre of mass than the lithium cations.
To check the current success of data-mining the Hive for these three compounds, we also conducted global optimisation on each of the three IP-energy landscapes for low lying LM. We present the results as three densities of LM graphs; see Fig. 8. In the panel insert for each compound it is very clear that the data-mined LM present only a sample of all the possible LM. In terms of ranking, fortunately, the missing LM tend to be mid-range rather than at the more stable end (which, typically, is where there is most interest). Looking more closely at the top ranked LM, we identified which IP-LM structures are missing; these are shown in Fig. 9.
For strontium oxide clusters, the first six missing LM were ranked 6, 7, 8, 9, 13 and 16. The first three of these are basic rock-salt cuts that could have been included in our data-mined set if we had included the structures from ref. 54 (we did not as this paper includes data-mined structures for alkaline oxides, one of which is one of the compounds we chose to investigate). The GA08 cuboid configuration was in fact found as the IP-GM for (LiI)12. Generating this LM during the data-mining process was fortuitous given that this structural motif was not included in the DM-set of unique LM. GA09 and GA13 are composed of a n = 6 drum (typically the IP-GM for (XY)6) and 2 × 2 × m cuboids. More interesting is the GA16 configuration, which we have previously seen; it has an unusual distorted planar four-coordinated oxygen anion site.
For lithium iodide clusters, the first six missing LM structures were ranked 3, 4, 5, 7, 8 and 9. Unlike our DM-set, these configurations, which we will refer to as HC, have at least one highly coordinated (greater than 4) anion site and are not one of the possible cuboid cuts from the NaCl rock salt phase. Given the stability of this type of structure, quite a few of the better ranked structures were missed. As already seen, any unstable LM in the DM-set can lead to new structural LM and thus we did not miss all of the HC structures; the enantiomer of GA03 was found (labelled as DM03 in Fig. 6 and ranked equal third).
For gallium arsenide clusters, data-mining the DM-set was much more successful in that only four additional IP-LM structures were found in the top thirty; the first four missing LM structures were ranked 7, 14, 21 and 29. Of these, GA07 is the result of merging IP-GM for n = 6 (a drum) and n = 9 (bubble) across a hexagonal face; GA14 is very similar to the GA16 LM that was missed for (SrO)12; GA21 has the same structural motif as DM18, but with all the anions switched for cations, and vice versa, cf. DM23 and DM24 and also GA06 and GA07 for strontium oxide. We note that the DM and GA runs found different chiral versions of DM23 and DM24.
Finally, we should reiterate that the structures reported above for LiI, SrO and GaAs were obtained on the interatomic potential landscape. These potentials were originally parameterised for bulk compounds, where atoms are typically in higher coordinated environments, and therefore such parameterisations are very limited in scope. For example, arsenide anions are highly polarisable, and more realistic structures should be expected to have more buckled shapes, as seen above in LiI configurations. The latter proved to be easier to optimise due to the relatively low charges on Li and I. Notwithstanding this, the structures obtained here will be uploaded to the Hive and refined using our chosen ab initio approach, which will both give the actual findings more credence for future applications, but will also allow the parameters of the interatomic potentials to be refined. The latter is an important element of machine-learning techniques that have been particularly successful in studies of metallic clusters.59,60
Lessons learnt in the creation of the Hive and the associated WASP interface as a toolkit will be of direct use for further work on nucleation and crystallisation processes,61 crucially the nucleation and growth of small particles on or in solid supports and liquid environments. The LM atomic configurations in the database are also readily usable as secondary building units (SBU) for constructing crystal structures.6,8,10,62–68 Here, using low energy SBUs that do not resemble cuts from the main phases of the chosen compounds will produce more interesting results.
Footnotes |
| † Electronic supplementary information (ESI) available. See DOI: 10.1039/c8fd00060c |
| ‡ Free access viahttp://hive.chem.ucl.ac.uk. |
| This journal is © The Royal Society of Chemistry 2018 |