Hugh F. Wilson*ab and
Amanda S. Barnarda
aCSIRO Virtual Nanoscience Laboratory, 343 Royal Pde, Parkville, Victoria, Australia
bDepartment of Applied Physics, RMIT University, Melbourne, Victoria, Australia. E-mail: hugh.wilson@rmit.edu.au
First published on 11th March 2016
Data science methods hold enormous promise for enhancing the efficiency of multiple aspects of theoretical materials science. Here we demonstrate an approach for the use of data science methods for a structural search for high-stability atomic structures in ab initio simulations, via the analysis of a large set of candidate structures. By partitioning a large data set of structures over an appropriate set of variables, we are able to identify a small fraction of structural space into which all low-energy structures are concentrated. Structural search methods may then be applied in the identified area. The method is demonstrated on the problem of finding stable structures of the water bilayer on the ZnO(100) surface.
Finding the global optimum structure of an atomistic system is equivalent to finding the global minimum on the potential energy surface (PES) that describes the energy of the system as a function of atomic coordinates. Searching for the global minimum of a general arbitrary PES is an NP-hard problem for which no general shortcuts exist.5 However, since we are interested exclusively in PESes that represent atomistic configurations, we may take advantage of certain properties of chemical PESes to make the search easier. On an atomistic PES, the maximum curvature is limited by chemical considerations. As a particular consequence, it is known that low-energy minima tend to occur in the vicinity of other low-energy minima. It is also known that the basins which terminate at lower-energy minima tend to have a larger hyper-area than those with higher minima6 and that the hyper-areas of the basins of attraction appear to follow a power law distribution,7 meaning that low-energy minima are both few in number and surrounded by a relatively large hypervolume. The global minimum is thus almost certain to be concentrated within an area of the PES with many low minima.
Methods such as AIRSS and genetic algorithms, combined with the rapidly growing availability of massively parallel but low-communication bandwidth computing resources, allow the generation of large numbers of local minima on the potential energy surface. With the growing sophistication of data science methods to allow the extraction of insight from large datasets, it is reasonable to wonder if the use of such techniques on datasets of possible structures can be used to accelerate the process of structural search by identifying, in an appropriate basis set, targeted regions of the PES which contain low-lying minima and identifying structural characteristic of this region. It is not necessarily obvious what form such an algorithm should take, nor which variables in which the PES should be expressed in order to best localise the interesting regions of the PES in as few regions as possible. Statistical analysis of the potential energy surfaces of chemical systems is likely to produce new insights that can be used for understanding the general properties of such PESes, allowing efficient data-driven structural search algorithms to be developed.
In this work, we explore the use of data science methods in conjunction with both classical and ab initio simulation to extract insight from atomistic PESes and accelerate the discovery of ab initio optimal structures, by studying as an initial example the potential energy surface of the water bilayer on the ZnO(100) surface, a technologically important system offering a reasonably high degree of complexity. We take advantage of the existence of a good classical potential for the system of interest, which allows a large number of candidate structures to be generated and analysed quickly. We identify structural features of the PES of this system by a partitioning of the space of local minima on the surface over an appropriate set of variables. The analysis of the classical PES can then be used as a map of the ab initio PES, allowing the identification of small portions of the PES (in an appropriate basis set) in which all low-energy structures are concentrated, allowing a structural search concentrated in the regions of interest.
The (100) surface of ZnO is the most stable13 and best characterised surface facet. Water on this surface is known to form as stable monolayers, bilayers, and higher coverages depending on conditions, however only the monolayer's structure has been well characterised so far.14–19 Water coverage on ZnO(10
0) of one monolayer has attracted attention with a variety of experimental14,19,20 and ab initio theoretical methods.15–19 Isolated water molecules on ZnO adsorb non-dissociatively,16 while the monolayer has been shown to form a “half-dissociated” phase in which undissociated H2O and singly dissociated OH + H motifs exist on alternate Zn–O surface dimers. The monolayer is known to dynamically change between undissociated and fully dissociated structures at room temperature.19 Quantum chemical methods have also been used to study the interaction of water with small ZnO clusters.15,21
We generated a structural data set consisting of 9073 unoptimised random starting structures, which we label Dataset A. Each surface geometry studied consists of a ZnO slab with boundary conditions periodic in the x and y directions combined with four randomly positioned O and eight randomly positioned H atoms. We set up a wurtzite ZnO slab, oriented in the (100) direction, consisting of sixteen Zn and sixteen O atoms, with two surface dimers along the dimer row (11
0) direction and four bilayers (approx 10 Å) thick. The size of the periodic supercell in the x and y directions was 5.30 by 6.57 Angstroms, which was chosen to be optimal in the DFT rather than the classical calculations. The four oxygen and eight hydrogen atoms were added to the system at random x and y coordinates and at a z coordinate randomly chosen from a uniform distribution between 0.9 Å and 3.4 Å above the top layer. The cells used had periodic boundary conditions in all three dimensions, with a vacuum spacing of 14.7 Å between the bare slab and its periodic image, ensuring at least 10 Å of vacuum after termination; this is of no importance for the ReaxFF calculations but is important in the later ab initio calculations. The unit cell is shown in Fig. 1.
![]() | ||
Fig. 1 Side and end views of ZnO slab used in all calculations. The green shaded region depicts the volume in which the four O and eight H atoms are randomly placed. |
Each starting structure in Dataset A was subjected to a geometry optimisation using the Hessian-Free Truncated Newton method and the ReaxFF potential as implemented in the LAMMPS code22 to generate a new data set of points representing local minima on the ReaxFF energy surface, which we call Dataset B†. The distribution of the energies of the structures in Dataset B is shown in Fig. 1. The minimum energy was −140.5 eV with a mean of −135.5 eV and a standard deviation of 2.4 eV. The distribution of energies in Dataset B exhibits multiple peaks, in contrast to the single-peaked distribution of energies in Dataset A.
Several criteria for such a set of variables are obvious: geometrically similar structures should have similar variables, each unique structure should be characterised by a unique set of variables, and that these variables should not vary when obvious structure-preserving symmetry operations are applied. The PES should also be reasonably smooth and have as few minima as possible when expressed in the new basis. One possible set of variables for describing structures and which exhibits many of these properties which has received some recent attention is the bispectrum representation of the atomic neighbourhoods, as used in the GAP25 and SNAP26 general interatomic potential methods. The bispectrum, however presents a high degree of implementation and interpretational complexity.
We propose instead a set of variables which combines predictive power with computational simplicity. We describe each structure in terms of the sorted shortest distances between atoms of each pair of species. We label the nth-shortest distance between two atoms of species X and Y (taking periodic boundary conditions into account) as L(X,Y)n−1, using a zero-origin numbering convention. We will refer to this basis set as a basis of Sorted Neighbour Distances, or SNDs. In this particular implementation we ignore all lattice atoms except for those in the top layer, since distances between bulk-like atoms do not differ meaningfully between structures, and compute distances only for the four dimer and twelve adsorbate atoms. Since the surface-layer and adsorbate-layer oxygen species play different roles in the structure, we label them as separate species, Os and Oa respectively. This gives a total of 120 SND values which characterise each structure. The 120 variables contain some redundancy of information, which is desirable, since we are seeking the most efficient way to subdivide the data.
By representing the set of structures in the SND basis set we may immediately extract insight into the chemistry of the system. Fig. 3 shows the ReaxFF energy of the 9073 structures in the dataset plotted with respect to four selected SND descriptors. Plotting energy against L(H,H)2 shows a clear distinction between several clumps of clusters – one with a small value of L(H,H)2 and a high classical energy representing structures with H2 molecules, and several other clumps with larger L(H,H)2 values; the lowest-energy structures invariably have a value around 1.7 Å. The plots of L(H,Oa)4, L(Oa,Oa)0 and L(Oa,H)1 likewise show clustering of structures into groups with clear relationships between SND values and energy (Fig. 4).
![]() | ||
Fig. 3 Classical energies of the structures in Dataset B plotted with respect to four descriptors in the SND basis set: L(H,H)2, L(H,Oa)4, L(Oa,Oa)0, and L(Os,H)1. |
Of the 9073 local optima in Dataset B, it is sensible to wonder how many are duplicates. Owing to the large number of possible structures it is not possible to distinguish duplicates by the similarity of their energies, however comparison of the SND values can identify duplicates. If we define two structures to be identical if the supremum norm of the difference between the SNDs of the two structures is less than 0.02 Å, we find just 88 structures, exclusively concentrated in the lowest 1 eV, are identical to some other structure, including one structure found five times, two structures found four times, and many structures found two or three times. The higher duplication rate among lower-lying minima is consistent with the well-known tendency of low-lying structures to lie in larger basins.1 Our 9073 starting structures have thus reached in excess of 9000 individual local minima. The low level of duplication suggests that not every possible minimum of the PES has been reached, however the level of duplication among the lowest-energy structures suggests that a reasonable level of coverage has been achieved among these structures.
Fig. 3(a) shows the first few nodes in a regression tree fitted to Dataset B as expressed in the SND basis set. The regression tree was generated using the rpart package28 in R.29 At the first node of the tree, it is found that the best possible split in the data is L(H,H)1 ≥ 1.33 Å, i.e. the third-shortest hydrogen–hydrogen distance in the sample is greater than this threshold. In physical terms, this distinguishes those structures which have two or more H2 molecules (higher in energy) from those which do not (lower in energy). The next split for structures on the low-energy side is L(H,H)0 ≥ 1.324 Å. The third split for those in the more stable set is on the criterion L(Oa,Oa)0 > 2.39 Å. The tree is constructed using a complexity cost parameter28 of 10−8. Fig. 3(b) shows the full regression tree, consisting of 794 classification bins containing 11.4 structures on average. Each structure is judged by up to 28 criteria.
Fig. 5(a) shows the distribution of classical energies for the structures in Dataset B as a function of partition bin. Bin indices run from 1 to 1587 with some omitted due to the numbering of non-terminal nodes in the tree. The bin identity predicts the energy to within 0.29 eV (one standard deviation) on average. For lower-energy bins the energies are better constrained, with those bins whose average energy lies below −138 eV exhibiting a standard deviation in energy of 0.17 eV.
From each of the 794 classification bins we chose the three lowest-lying structures as bin representatives, and conducted a DFT geometry optimisation on each to the nearest minimum on the ab initio PES. For each DFT geometry optimisation, we used the VASP code with pseudopotentials of the PAW type30 and the exchange-correlation functional of Perdew, Burke and Ernzerhof.31 Wavefunctions were expanded in a basis set with a cutoff of 300 eV and only the gamma point was used; these parameters are somewhat coarse but are suitable for screening the large data set. Geometry optimisation used a conjugate gradient algorithm, and proceeded until the change in total energy at two subsequent steps was less than 10−7 eV. The set of 2382 DFT-optimised bin representative geometries is labelled as Dataset C.
Fig. 6 shows the DFT energies for the classically-optimal and DFT-optimal structures as a function of the classical energy of the starting structure. For the starting geometries (blue) the classical and DFT energies are similar and close to normally distributed, with a standard deviation of 1.24 eV. Notably we see a clustering of post-optimisation DFT energies into five distinct clumps.
The post-DFT-optimisation structures in Dataset C may be sorted into bins using the same decision tree used for the classical structures. Fig. 5(b) shows the UDFT values and bin locations as a function for all structures in Dataset C. Notably, many structures from Dataset B have moved to a different bin during the optimisation, and 243 of 794 bins are occupied. Also notable in Fig. 5(b) is the exclusive segregation of all structures into a small number of bins in which all energies lie below −200 eV, and the remainder of bins in which all structures have energies above −200 eV. These two sets of structures are separated by a wide gulf in excess of −200 eV.
There are a total of 396 optimised structures whose final energy is less than −200 eV. These come from 187 origin bins, but have final structures in only 27 destination bins. It thus appears that only 27 out of the original 794 bins contain structures which are likely to be competitive. This suggests that concentration of the structural search into these particular regions of the PES is likely to produce energetically competitive structures in an efficient way.
Fig. 7 depicts a decision tree by which we may determine whether a given structure exists in one of the 27 low-energy destination bins.
![]() | ||
Fig. 7 Decision tree for determining whether a structure lies within one of the 27 bins of interest. At each node, we split to the left if the condition is false and to the right if it is true. |
As anticipated, structures optimised from the targeted bins exhibit a high degree of stability, with a mean energy of −201.7 eV and a standard deviation of 0.24 eV. All but two of the optimised structures have an energy below −201 eV, and the most stable structure has an energy of −202.1 eV. Fig. 8 plots the final DFT energy against the starting classical energy for each of the structures in Dataset D.
![]() | ||
Fig. 8 Post-DFT-optimisation DFT energy plotted against pre-DFT-optimisation classical energy for the 328 targeted-region structures in Dataset D. |
We find several geometrically distinct structures with extremely similar energies to be the most stable structures in the dataset. Fig. 9 shows the geometries of the three lowest-energy structures. The second- and third-lowest energy structures have DFT energies which are just 0.009 eV and 0.01 eV higher than those of the first. Four additional structures were found to be within 0.025 eV (kBT at room temperature) of the lowest-energy structure. Distinguishing which of these structures as the true ground state is well beyond the capabilities of density functional theory. Moreover, the question is not of importance, since the energetic differences between structures are less than kBT at room temperature, implying that under realistic conditions the water bilayer will exist in a mix of structures.
Several common characteristics of these stable structures may be identified, however; in particular, the lower-lying monolayer of all structures have the same “half-dissociated” character as is present in the water monolayer, with one out of every two surface Zn atoms being bonded to a hydrogen atom. It is notable that the three lowest-energy structures show different structural characteristics to the lowest classical energy structure in Dataset B (Fig. 9(d)) in which the water molecules lie substantially closer to the surface.
Due to near-degeneracy of the distinct lowest-energy structures in the data set, we do not concern ourselves with identifying the true ground state structure of the water bilayer, and satisfy ourselves instead that we have a procedure for reliably generating geometrically distinct structures out of the set of low-energy water bilayer structures. Further work could generate a larger set of these structures and study them with more accurate methods.
The SND basis set has been shown to be hold a great deal of predictive power for the water bilayer system. It is to be anticipated that other surface and interfacial systems such as crystal surface reconstructions, self-assembled monolayers and small clusters may benefit from expression and partitioning in this basis set. The utility of the SND basis for the description of solid crystalline, defective or amorphous systems is a possible object of future study.
The use of the regression tree algorithm represents a relatively simple yet powerful way of partitioning the set of structures. We have not explored other possibilities which may involve making splits on multiple variables simultaneously. The combination of regression trees with clustering algorithms is another potential area of interest. The incorporation of principal component analysis into the description of the structural data32 may also prove useful by providing an alternative basis set in which the partitioning may be more efficient.
The implementation in this case requires the existence of an efficient empirical potential. With the growing sophistication of general-form empirical potentials previous work on high-accuracy low-transferability empirical potentials to be used for ab initio thermodynamic calculations provides additional motivation for improved methods for on-the-fly fitting of empirical potentials,33 which could be achieved with new methods such as SNAP.26
For illustrative purposes we have in this case computed potential energies of a much larger set of bin-representative candidate structures than would be required in practice. Once the degree of mismatch between the classical and ab initio PESes has been established it is not necessary to compute representative energies of bins we may be confident will not be competitive. For an exhaustive structural search, however, it would be desirable to continue generating and DFT-optimising candidate structures within the optimal structural bins until all low-lying structures have been found multiple times.
Data-driven structural search is best viewed as a complement to existing methods rather than a replacement. In this case, for instance, we have used data-driven searching to enhance ab initio random structure search. We may make some remarks about the utility, efficiency and range of applicability of the present method in comparison to the straightforward implementations of existing algorithms including ab initio random structure search and genetic algorithms.
The method presented in this paper involves the production of a large dataset of local minima on a classical potential energy surface, the classification of the data set into structural bins and the ab initio optimisation of bin representative structures. The time taken by the analysis is insignificant, but the time taken to generate the classical minima and evaluate the ab initio energies are significant. Compared to ab initio random structure search, this method will be worthwhile if the time spent on these steps is less than the time which would otherwise be spent generating and optimising points in the non-preferred region, which in the present example is around 99% of structures. The present implementation is ideal for structural search on chemical PESes which contain an extremely large number of minima of roughly similar energy. For this reason, systems such as surface terminations and reconstructions are likely to be most suited for this method.
An important question is the relationship between the size of the random datasets and the probability of finding the correct ground state structure. As with other structural search methods, however, this question defies an easy answer. In this particular case, the probability depends on details of the shapes of both the classical and ab initio PESes and the relationship between the two which are not easily accessible. Future work may assist in developing heuristics for determining the appropriate choice of these parameters.
All existing structural search methods produce a large amount of data about the structure of the potential energy surface as they run, which in the future could be put to better use. In ab initio random structure search, the locations and energies of the minima evaluated so far are not used to inform the generation of future structures, resulting in a vast amount of data left unused. Particle swarm optimisation methods take the global structure of the PES into account in a way, by letting points during a single optimisation communicate information with each other. Genetic algorithms take the global structure of the PES into account in a different way, by cross-breeding existing structures in ways likely to produce new low-energy structures. Data-driven partitioning of the PES in an appropriate basis set to identify regions of greater interest may be able to enhance implementations of each of these methods.
The use of data science methods for structural search provides a promising new avenue of research which may allow greatly enhanced efficiency and automation of structural search in the future. Further research may reveal better descriptors and better algorithms. Ultimately, ab initio simulation gives us an enormous amount of data about the shape of potential energy surfaces which has not previously been taken advantage of, and by mining this data we may be able to produce predictors which understand chemistry better than we do.
Footnote |
† The HFTN algorithm implemented in LAMMPS was chosen as the method for structural optimisation after tests showed it to produce a lower-energy distribution of structures than three alternative methods: steepest descent, conjugate gradient and two different damped dynamics methods (quickmin23 and fire24). |
This journal is © The Royal Society of Chemistry 2016 |