Data-driven learning and prediction of inorganic crystal structures

Crystal structure prediction algorithms, including ab initio random structure searching (AIRSS), are intrinsically limited by the huge computational cost of the underlying quantum-mechanical methods. We have recently shown that a novel class of machine learning (ML) based interatomic potentials can provide a way out: by performing a high-dimensional ﬁ t to the ab initio energy landscape, these potentials reach comparable accuracy but are orders of magnitude faster. In this paper, we develop our approach, dubbed Gaussian approximation potential-based random structure searching (GAP-RSS), towards a more general tool for exploring con ﬁ guration spaces and predicting structures. We present a GAP-RSS interatomic potential model for elemental phosphorus, which identi ﬁ es and correctly “ learns ” the orthorhombic black phosphorus (A17) structure without prior knowledge of any crystalline allotropes. Using the tubular structure of ﬁ brous phosphorus as an example, we then discuss the limits of free searching, and discuss a possible way forward that combines a recently proposed fragment analysis with GAP-RSS. Examples of possible tubular (1D) and extended (3D) hypothetical allotropes of phosphorus as found by GAP-RSS are discussed. We believe that in the future, ML potentials could become versatile and routine computational tools for materials discovery and design.


Introduction
Exploring and cataloguing new crystal structures is one of the principal tasks in chemistry.Decades of careful experimental work are collected in the Inorganic Crystal Structure Database (ICSD) 1 and the Cambridge Structural Database (CSD), 2 which have been impressive examples of "big data" since long before the phrase became fashionable.While structural space has traditionally been explored by synthesis, more recent work has shown that ab initio structure searching can play an important and complementary role in this regard.Computational tools, including genetic algorithms, [3][4][5] particle swarm optimisation, 6,7 or ab initio random structure searching (AIRSS), 8,9 can predict structures that are (sometimes) far from what chemical intuition would suggest.Many of these predictions have subsequently been validated by experiments [10][11][12][13][14] or have given rise to databases of their own. 15][18][19] Despite their predictive power, these structure searching methods are normally driven by quantum-mechanical density-functional theory (DFT) computations, and therefore they are limited to systems with relatively small unit cells.The allotropes of elemental phosphorus, [20][21][22] which are the topic of the present paper, directly illustrate the problem at hand.The thermodynamically stable "black" (orthorhombic) form, as well as the high-pressure As-type (rhombohedral) allotrope, exhibit simple crystal structures with only one symmetry-independent atom each (Table 1).4][25][26][27][28][29][30][31][32] On the other hand, "violet" (Hittorf's) phosphorus 33 has a notoriously complex structure that was solved more than 100 years aer the initial synthesis 34 and contains no fewer than 21 symmetry-independent atoms in the unit cell.Systems of this size have hitherto been out of reach for ab initio crystal structure prediction.
6][37][38] Such potentials are tted to reference databases of DFT energies and forces and, once generated, they allow one to perform atomistic simulations with close to DFT quality but at a computational cost that is orders of magnitude lower. 39Indeed, we have recently shown Table 1 Experimentally known crystal structures of phosphorus. 20Z is the number of atoms in the conventional unit cell; Z 0 gives the number of symmetry-inequivalent atoms in the cell, and is therefore a measure of structural complexity.The molecular ("white") and amorphous ("red") forms are omitted from this study for simplicity a Occasionally referred to as "blue P" in recent literature, 46 especially in the monolayer form.
b To explore whether our GAP would be able to nd molecular ("white") P, we carried out a set of preliminary searches at low density (1.0 g cm À3 ).The resulting structures did include distorted P 4 units, but also other small fragments, and we expect that the GAP will need to have "seen" these in iterative training to distinguish them more clearly.We also expect that fully capturing the intricate structural details of white P, including its packing variants, 74 will require additional reference data; this is the topic of ongoing work.Regarding amorphous forms, our searches are restricted to relatively small periodic unit cells which cannot fully represent the amorphous allotrope(s).
that an ML-based Gaussian approximation potential (GAP), initially developed for amorphous carbon, 40 can be used to identify crystalline allotropes. 36This has added around 150 entries to the Samara Carbon Allotrope Database (SACADA, ref. 15).Very recently, the use of ML to speed up global searching and crystal structure optimization itself has been reported, 41,42 but we focus here on the use of explicit interatomic potentials ("force elds") for this purpose.What is more, in recent work, 37 we have argued that RSS can be used to construct the interatomic potential from scratch, exploring and tting a complex potential energy surface (PES) at the same time.This points towards a more general "data-driven" strategy for atomistic materials modelling.
In this Discussion paper, we aim to take new steps in this direction and further develop our emerging method (which we call "GAP-RSS").Aer briey summarising its components, we present results for elemental phosphorus, generating and applying the rst GAP-RSS potential for this material.Our protocol "discovers" the crystal structure of black P during the iterations and, by construction, adds it to the reference database without prior knowledge of any existing allotropes.We then use this potential for some exercises in crystal structure prediction: we show that a brous allotrope of P (Table 1) appears to be prohibitively hard to nd in free searches, and we outline the use of an alternative approach, showing exemplary predicted structures in 1D and 3D.We also discuss open questions and expected future directions.

Methodology
The protocol for GAP-RSS potential tting and structure searching consists of three components: single-point DFT computations, GAP tting to an updated reference database, and structural optimisation using GAP.We give details of these in sequence, and an overview is provided in Fig. 1.

DFT computations
These provide the input data for GAP tting: initially, single-point DFT computations are done for randomised atomic congurations, later, for intermediates or local minima of GAP-RSS searches.We obtain these data using standard DFT procedures, with high numerical accuracy to minimise noise in the input for tting.In this work, we used the PBEsol functional, 47 which has been validated for black P before, 48 and on-the-y ultra-so pseudopotentials as implemented in CASTEP. 49Reciprocal space was sampled on dense k-point meshes (maximum spacing 0.02 ÅÀ1 ).The cut-off energy for plane-wave expansions was 600 eV, and an extrapolation scheme was used to counteract nite-basis errors. 50In Fig. 1, all parts that involve single-point DFT computations (viz.the construction of the reference database) are enclosed by dashed lines.

GAP tting
With reference data available, an interatomic potential is tted to these using GAP.The framework was introduced in 2010 (ref.3][54] A high-dimensional t to reference energy and force data is performed based on structural similarity or kernel functions, comparing atomic environments one by one.The initial choice for these have been many-body descriptors, most importantly the Smooth Overlap of Atomic Positions (SOAP), which includes all neighbours of an atom up to a cut-off radius. 55To improve the stability of the t, similar to our previous work on amorphous materials modelling 40 and structure searching, 36,37 we combine the manybody SOAP expansion with non-parametric two-body ("2b") and three-body ("3b") terms that encode interatomic distances and bond angles, respectively.The 2b and SOAP descriptors have radial cut-offs of 5.0 Å, whereas that for the 3b term is 2.6 Å (to include only "true" bond angles involving nearest-neighbour contacts).The nal GAP uses 10 sparse points (that is, tting coefficients) for the 2b term, 100 for the 3b term, and 3000 for SOAP.The convergence and smoothness parameters for the SOAP expansion are n max ¼ l max ¼ 8 and s at.¼ 0.75 Å, respectively, which were ; later, we only add selected structures.As a criterion to select cells, here we use coordination numbers, demanding that all atoms in a candidate structure must have three nearest neighbours, and discarding any structures which do not comply.The GAP-RSS iterations are deemed "converged" when the resulting potential shows satisfactory performance.Here, we do so after generation 4, as that version of the potential has "found" black P from scratch (see Fig. 3 and 4).
found to be suitable for GAP-RSS in ref. 37.For a more detailed walk-through of the underlying ML framework, we refer the reader to ref. 56.

GAP-RSS relaxation
This is the heart of the technique: structural space is explored by optimising random structures, akin to the well-established AIRSS technique, but now using GAP.In AIRSS (and consequently in GAP-RSS), it is important to make a sensible choice of randomised initial structures.For example, a reasonable minimum interatomic distance ("hard-sphere" criterion) is imposed.Furthermore, exploiting space-group symmetry can signicantly reduce the number of attempts required.Here we search with 2-16 atoms in the unit cell, allowing for either 1, 2, 4, or 8 symmetry operations to be present.To some extent, this penalises rhombohedral space groups and their subgroups, but our data below show that the rhombohedral A7 structure is found by our protocol nonetheless.We choose the initial densities to be distributed around 2.5 g cm À3 , in between the black and red forms of phosphorus, and we constrain the P-P distances in the initial structures to at least 1.8 Å.An independent set of randomised input structures is generated for initialisation and for each new generation of the potential.

Exploring the potential energy surface of elemental phosphorus
To showcase the principle of GAP-RSS, in this paper we explore the potential energy surface (PES) of elemental phosphorus.This is summarised in Fig. 2, where we show the energy-volume data for the reference database that is built during construction of the potential.We start with randomised input structures, the data for which are added to the database without relaxation (light green points in Fig. 2; generation 0).We then extend the database by three rounds (generations 1-3), where snapshots of both intermediate (teal) and relaxed (purple) structures are added; nally, in generation 4, we perform a larger search and keep only the unique structures with all-threefold coordinated atoms ("3c", determined using a 2.4 Å bond-length cut-off; light grey points).Aer this, we deem the database "converged" for the purpose of the present study, by the (subjectively chosen) criterion that the GAP has "learned" black P; see below.Future work will be concerned with less heuristic criteria for convergence.The distributions shown on the right-hand side of Fig. 2 illustrate how the different parts of the database, and thus of the PES, comprise progressively lower-lying structures.We will show in the following that, concomitant with increasing exploration of the phase space, the GAP-RSS interatomic potential becomes more accurate, ultimately describing crystal structures without prior knowledge.We stress that in most of the previous literature on tting ML potentials, reference databases are constructed by including known crystal structures (and distorted variants thereof). 39,40,53o assess the quality of the potential, we use it to compute energies for independent test sets of structures not included in the training, for which the DFT energies are known.The results are shown in Fig. 3a.The description of the higher-energy regions ("RSS intermediates", open symbols) appears to converge quite early on, but it carries a residual root-mean-square prediction error of the order of 0.05-0.10eV per at.that does not improve with further iterations.The same is observed, and is even more pronounced, for a test set of randomised seed structures, where the error is practically constant at z0.10 eV per at.for all the GAP generations (not shown in Fig. 3a for clarity).This is intuitively understandable: the PES is smooth, and since the high-energy structures are so diverse, Fig. 3 Quality of the evolving interatomic potential, assessed by computing the energies for independent test sets (configurations outside the database) with different generations of the GAP, and reporting the root-mean-square error against the DFT data.(a) Errors for sets of structures from an independent GAP-RSS run without symmetry, taken after 5 CG steps ("intermediates") and 200 CG steps ("minima"); (b) the same for sets of distorted unit cells of known allotropes, viz.Hittorf's and black P. In generation 4, our GAP has "seen" black P and therefore the predicted error for this allotrope falls close to zero (see the arrow).
only a handful of them are required for acceptable sampling.A similar observation was made for liquid carbon in our earlier work: the high-temperature liquid contains many different environments and is thus straightforward to "learn" from a single ab initio MD trajectory.In contrast, the amorphous regions of the PES required several rounds of GAP-driven MD and iterative renement of the potential. 40or the purposes of GAP-RSS, these ndings are encouraging: the early steps of relaxation appear to be easily "learned" by the potential, and even a notable degree of residual error can be tolerated if the potential is successful in navigating the high-energy region of the PES (say, the green data points in Fig. 2).In contrast, a growing database is needed for the relaxed minima.Indeed, testing for a set of local RSS minima outside the database initially leads to a signicant prediction error above 0.5 eV per at.(lled symbols in Fig. 3a), but this gradually improves and falls below 0.1 eV per at.with increasing quality of the potential.
The latter accuracy turns out to be acceptable for exploratory searches, but it is still worse than what one would expect from an ML potential for the stable allotropes.Our long-term vision for GAP-RSS is therefore to nd stable minima (such as black or orthorhombic P) during the iterations and automatically include them in the reference database.We expect this to improve the accuracy substantially, which is supported by the following result.Fig. 3b shows test set errors as before, but now for ensembles of distorted unit cells of two experimentally known allotropes.The orthorhombic structure of black P is a particularly interesting test, as it competes with other, more highly symmetric structures that are very close in energy on the PES. 22Our potential, during generation 4, did succeed in nding black P, and hence includes this structure type in the nal reference database.This is reected in Fig. 3b: the energy errors for black P get progressively lower, but only in generation 4 do they drop to very close to zero.
Likewise, the computed energy-volume scans for black P (Fig. 4a) progressively improve during the GAP-RSS iterations, but the structure needs to have been "seen" and included in the tting (in generation 4) to achieve an accurate result.The same is true for computed structural properties (Fig. 4b): for this test, we fully relaxed the black P structure with each version of our potential, as well as with DFT.In generation 4, all three GAP-computed lattice parameters come very close to the DFT benchmark.
"Learning" high-pressure structures Two important phosphorus allotropes, As-type and simple-cubic, are obtained from the black form at high pressure (Table 1). 44With this in mind, we decided to explore what effect external pressure would have on the result of GAP-RSS for this element; doing so has already been suggested and proven to be benecial for exploring the PES of elemental boron. 37To sample higher-pressure structures, we apply hydrostatic external pressure during the GAP-RSS relaxation runs, with a randomised value in each run.The values are drawn from an exponential distribution with the probability density P of nding a given pressure p, 37 P(p) ¼ l exp(Àlp) Â p 0 where l is the rate parameter (we chose l ¼ 5, a narrower distribution, to sample more around the low-pressure region but still include some high-pressure points), and p 0 is an arbitrarily chosen reference.In this work, we tested the settings of p 0 ¼ 10 GPa and p 0 ¼ 100 GPa, respectively.For both, we generated new sets of potentials through iterative GAP-RSS tting, starting with the same ensembles of seed structures as used in the pressure-free searches reported above (p 0 ¼ 0 GPa).
The results are easily rationalised by looking at the evolving energy-volume scans again, but now for the high-pressure forms (Fig. 5).For reference, our pressure-free search (p 0 ¼ 0 GPa) already yields an acceptable description of both allotropes in generation 4 (purple lines), but it fails to reach quantitative accuracy, especially for the simple cubic form.Searching with moderate external pressure (p 0 ¼ 10 GPa) visibly improves the description of both high-pressure forms: the GAP and DFT data are now in better agreement.Note that due to the nature of the exponential distribution, most of the pressure values drawn are signicantly smaller than p 0 , and thus correspond to typical experimental conditions.When using external pressures that are an order of magnitude higher (p 0 ¼ 100 GPa), the description of the As-type form deteriorates visibly, whereas the simple cubic form is already "learned" very easily in generation 2 (bottom right panel in Fig. 5).All this is intuitively understood, as the experimentally observed sequence of pressure-induced phase transitions is black / As-type / simple cubic, with transition pressures around 5 GPa and 11 GPa, respectively. 44,57brous phosphorus "Fibrous" P, described by Ruck et al. in 2005, 45 is structurally reminiscent of Hittorf's P: both allotropes exhibit characteristic tubes built from covalently bonded cage motifs, reminiscent of organophosphorus compounds.58,59 The difference between the two allotropes is in the arrangement of the tubes, which is simpler (all tubes parallel) in brous P. Both allotropes contain 21 independent atomic positions (Table 1), which are repeated by space-group symmetry two and four times, respectively.We performed 100 000 attempts to nd the structure of brous P using GAP-RSS, seeding with 21 independent, randomised atomic positions in space group P 1.However, this search was unsuccessful, and none of the attempts led to the known structure aer minimisation. Weshow three of the resulting structures, as mere examples, in Fig. 6.This reects the more general problem that in global structure determination, the searching task becomes exponentially more complex with system size (and each independent atom adds three structural degrees of freedom), as discussed by Stillinger 60 and in ref. 9.While the GAP still provides an approximation of the DFT potential energy surface, and therefore some care must be taken when transferring conclusions from GAP-RSS to (DFTbased) AIRSS, we take the present ndings as clear evidence that the structure of brous P (and, by extension, Hittorf's P) is too complex to nd in free random searches.

Merging modular decomposition and GAP-RSS: towards fully data-driven crystal structure prediction
Inevitably, a fully unconstrained random search for more and more complex structures will fail at some point, as discussed above.Very recent work has shown that molecular network analysis can be used to great effect in this regard. 61g. 5 Energy-volume scans similar to those in Fig. 4a, but now assessing the highpressure forms, As-type (top) and simple cubic P (bottom).Data were obtained with three separate sets of GAP-RSS iterations that utilise, from left to right: no external pressure (i.e. the main dataset), 10 GPa, or 100 GPa of reference pressure p 0 , as defined in the text.In all cases, the respective lowest DFT energy point is set as zero.The data for generation 0 are outside the range of the energy axis.Inspired by Pauling's iconic rule of parsimony, stating that the number of constituents in a crystal structure tends to be small, 62 it was proposed in ref. 61 to decompose a crystal structure into building blocks such as to minimise the (mathematically quantiable) information content.Here we combine this with GAP-RSS searching, and argue that both together provide a useful way forward for the prediction of complex crystal structures.Fig. 7a summarises the procedure and illustrates how the structure of brous P is decomposed into two fragments using the approach of ref. 61.While this is a purely automated, data-driven procedure, its outcome is oen in line with chemical intuition.61 Indeed, in the original publication on brous P, the structure was described as a sequence of alternating [P 8 ] and [P 9 ] cages, interspersed with P 2 dumbbells.45 Similar fragments have been identied by Thurn and Krebs in the crystal structure of Hittorf's P; 34 they have been discussed in Baudler's seminal work on the nomenclature of phosphorus cages, 58 and have been the topic of comprehensive theoretical analyses both in the gas phase and in the bulk.22,59,64 Without chemical knowledge, but with an algorithm to nd the most simple and representative building units, the network analysis likewise "cut" the chains into fragments.61 In these, the well-known [P 9 ] cage was combined with P 2 dumbbells on either side, forming a [P 13 ] unit ("Fragment A"), whereas the [P 8 ] cage was recovered directly ("Fragment B").We believe that alternative ways of decomposition, e.g.isolating the [P 9 ] cage, may also be useful as starting points for structure searches-as will combinations of different fragments.We also note the advantage of a chemically "agnostic" approach: while there is an extensive body of literature on the building blocks of P allotropes, 22,58,59 future work might be concerned with new materials where the local atomic structure is a priori unknown.
Even with an exemplary and simplied approach that uses only one fragment or the other (Fig. 7a), our GAP-RSS searches readily identied several hypothetical phosphorus allotropes with different dimensionalities.6][67] We explored the effect of feeding structural data back into the GAP as before, but now for the more ordered structures coming from fragment-based searches, and generated two additional generations of the potential.An intermediate generation was used to nd the structures shown in Fig. 7 (generation 5a); another, nal one additionally contains all the minima found in that search (5b).We observed that this reduces the energy error distorted unit cells of Hittorf's P (cf.Fig. 3b) from 0.19 eV per at.(generation 4) to 0.16 eV per at.(5a), and further to 0.13 eV per at.(5b); note that in none of these generations has our GAP-RSS potential "seen" the actual crystal structure of Hittorf's P. A full account of this, including the results of much more diverse searches, will be published elsewhere, and for the sake of brevity we present only a few salient examples here.As expected, several 1D tubular motifs were found in our GAP-RSS searches, packed in different ways to form extended structures.One of these tubes, shown in Fig. 7b, retains some features of the [P 8 ] cage from which the initial seed is constructed: namely, fused ve-membered rings, of which there are four in the [P 8 ] cage (topology symbol [5 4 ]), and up to three in the chain structure (dashed box in Fig. 7b).Another structure found by GAP-RSS, shown in Fig. 7c, breaks up the initial seed fragment, forming a tube that consists of only six-membered rings.This corresponds to a sphere-packing graph for a hexagonal net ( 63 ) with coincidence vectors (m, n) ¼ (3, 2). 68In this, it is also equivalent to one of the smallest theoretically possible carbon nanotubes.
The same protocol also identied fully three-dimensional structures.One of them is a rare binodal net, bbe-3,3-Imma (Fig. 7d), which has occasionally been observed in MOFs. 69We nd another structure with a uninodal pbp net containing six-and eight-membered rings (Fig. 7e) that had originally been proposed for a hypothetical carbon structure dubbed "6.8 2 P" (ref.70) and also investigated as a hypothetical allotrope of P. 27 We nally nd the uninodal lig net (Fig. 7f), which corresponds to the anion network in the Zintl phase LiGe. 71Such a structure has not been proposed for P thus far, but it is in conceptual agreement with the Zintl-Klemm-Busmann framework (in which the group-14 element is viewed as "Ge À ", and thus takes the structural signature of a group-15 element due to its excess electron). 72The computed dispersion-corrected DFT energies [65][66][67] for the structures shown in Fig. 7d-f are z+2 kJ mol À1 (bbe-3,3-Imma) and z+8 kJ mol À1 (pbp and lig) above that of black P, placing them well within the experimentally and computationally derived stability range of known allotropes. 22n ongoing work, beyond the scope of this paper, we are performing much larger-scale searches, including attempts to nd "brous" P and possible related structures using GAP-RSS, and trying to understand the observed preference for the experimentally known form.It seems especially interesting to study such tubular forms since some of these structural units can be chemically extracted from phosphorus-rich compounds. 73

Conclusions
Machine learning-based interatomic potentials can speed up random structure searching by orders of magnitude.Therefore, they seem well poised to become useful tools for crystal structure prediction and materials discovery.We expect them not to replace DFT-driven searching, but to provide a valuable complement, especially for very large and complex structures that are outside the reach of DFT.In this paper, we have discussed and further developed our recently introduced technique, dubbed Gaussian approximation potential-driven random structure searching (GAP-RSS), which combines ideas from the elds of potential tting and structure prediction.We believe that this technique will be of interest not only for nding new structures, but also for the automated generation of fast, exible, and accurate potentials for diverse materials.

Fig. 1
Fig.1Overview of the GAP-RSS protocol as introduced in ref.37 and extended here by a selection step.We start from a set of randomised seed structures for a given chemical composition (in generations 0-3, 250 each; in generation 4, 5000).The reference database of energies and forces is then built by single-point DFT computations: first, for unrelaxed structures (generation 0), later, at various stages of structural relaxation (1-4), performed by fitting interim GAPs to the evolving database.Initially, we feed all structures back into the training (up to generation 3); later, we only add selected structures.As a criterion to select cells, here we use coordination numbers, demanding that all atoms in a candidate structure must have three nearest neighbours, and discarding any structures which do not comply.The GAP-RSS iterations are deemed "converged" when the resulting potential shows satisfactory performance.Here, we do so after generation 4, as that version of the potential has "found" black P from scratch (see Fig.3 and 4).

Fig. 2
Fig.2Left: energy-volume plot of single-point DFT data during the generation of a GAP-RSS model for phosphorus, given relative to the most stable structure.The different stages of building the database (cf.Fig.1) are highlighted in different colours.Right: binned distribution for energies in these datasets, drawn on the same vertical axis.

Fig. 4
Fig. 4 Quality of the evolving interatomic potential, assessed by computing properties of black P. (a) Energy-volume scans for the orthorhombic unit cell (scaling lattice vectors and atomic coordinates without further relaxation), comparing progressive GAP generations (lines) to DFT reference data (circles).(b) Lattice parameters, obtained by fully relaxing the crystal structure of black P with each generation of the GAP.DFT results, obtained at the same computational settings as for the reference database, are given as horizontal lines.

Fig. 6
Fig.6Unsuccessful attempts to find fibrous P in free searches.Exemplarily, we show three low-energy, DFT-relaxed structures resulting from 100 000 GAP-RSS attempts (21 symmetry-independent atoms in space group P1).One of these structures is a stacking variant of As-type phosphorus (left); the two other are low-symmetry configurations that are clearly different from the tubular structure of fibrous (and Hittorf's) P.

Fig. 7
Fig.7(a) Overview of the fragment-based approach to structure analysis and structure searching introduced in ref.61 that we combine here with GAP-RSS.Two fragments have previously been identified in the structure of fibrous P (see text),61 and here we use both separately to generate seed structures for GAP-RSS structure searches.(b) Example of a 1D-periodic structure identified by this procedure.The tube "inherits" some of the structural information from Fragment B, namely, fused five-membered rings (dashed box).Full topology information is given in the text.(c) The same for a 1D structure composed of six-membered rings only.(d-f) Examples of 3D-periodic structures from the same search, labelled according to their network topologies.63