Predicting the phase diagram of titanium dioxide with random search and pattern recognition

Predicting phase stabilities of crystal polymorphs is central to computational materials science and chemistry. Such predictions are challenging, however, because they require not only searching for potential energy minima,but also considering the effect of entropy at finite temperatures. Here, we develop a framework that applies pattern recognition and machine-learning algorithms in order to determine the phase behaviour of a complex system with many polymorphs, without relying on prior knowledge of known phases. The framework allows automated clustering, classification and visualisation of crystal structures, as well as estimation of their enthalpy and entropy by exploiting all the information obtained from random searches of crystal structures. We demonstrate the framework on the technologically important system of titanium dioxide. We find a number of new phases, and show that the machine-learning predictions satisfactorily capture the phase diagram and the metastabilities of crystal polymorphs at 1600K.


I. INTRODUCTION
Predicting the properties of solid materials requires an understanding of how atoms are arranged, which in turn necessitates the determination of the relative stabilities of possible crystal polymorphic phases under various conditions. Advances in crystal structure prediction [1], including random structure search (RSS) [2], particle-swarm optimisation [3], Monte Carlo simulations with variable box shapes [4] and basin hopping [5], allow us to find minima on the potential-energy surface (PES) and thus to identify potentially competitive polymorphs. However, in these approaches only the enthalpy H at 0 K is typically computed and used to determine phase stability, even though the difference in entropy S between polymorphs can play a significant role [6,7]. Relative Gibbs energies (G = H − T S), which in actuality dictate thermodynamic (meta)stability, are seldom computed, despite their importance for phases that can be synthesised at one thermodynamic condition but remain metastable under other conditions.
Although Gibbs energies can be computed using free-energy methods such as thermodynamic integration (TI) [6,8,9], such calculations are non-trivial and usually require long simulations, making them both expensive and tedious if one wants to use the PES computed from first-principles methods such as densityfunctional theory (DFT) [10]. Moreover, crystal-structure prediction schemes typically result in a multitude of structures. It can be challenging to rationalise how the structures are related to one another and which ones are promising candidates under given thermodynamic conditions.
In this study, we propose an approximate method that combines RSS and machine learning (ML) to estimate free energies * bc509@cam.ac.uk of solid phases and hence the thermodynamic phase behaviour of a system of interest with relatively little computational effort and without relying on prior knowledge of the known phases. We have selected titanium dioxide at ambient and high pressures to benchmark the method because it is a widely used metal oxide with a large number of polymorphs (see the Supplementary Information), many of which may have interesting optical [11], mechanical [12] and electrochemical [13] properties. Furthermore, high-pressure TiO 2 phases can also serve as analogues of the structures adopted by many other important AX 2 systems that are of particular interest in geology, such as high-pressure silicas [14].

A. Characterisation of structures found by RSS
We performed RSS using the empirical Matsui-Akaogi (MA) potential [15] at 0 GPa, 20 GPa, 40 GPa and 60 GPa (see Methods). At each pressure, RSS produced thousands of distinct TiO 2 structures with different atomic co-ordinates, cell shapes and numbers of formula units in the cell. Even though knowledge of the space groups, molar volumes and energies of the structures provides hints on how to classify them, it is still a formidable task to sort through them manually. In recent years, ML-inspired approaches have been used for the classification and visualisation of atomic structures [16][17][18][19][20], but they have not been systematically exploited in the context of recognising the metastable polymorphs from RSS, especially for the purpose of determining the phase behaviour of crystalline systems. We have therefore developed and employed a ML-based method to compare and cluster the structures automatically.
a. Similarity measurements This pattern recognition task is built around the construction of a kernel matrix {K(A, B)} that measures the similarity between each pair of structures A and B in the data set. The kernel matrix should be positivedefinite and normalised. If it is not already normalised by construction, then K(A, B) should be divided by K(A, A)K(B, B). The kernel function K(A, B) can be formulated as an inner product of the features of A and B, where Φ = {φ i } denotes the set of global fingerprints associated with the whole structure, rather than with individual atomic environments X A n that are centred on each atom n in the structure A. A straightforward way of obtaining these global fingerprints from the local ones is to take the average [18], where N A denotes the total number of atoms in the system A if it only contains one element. If the system contains multiple atomic species and if the local fingerprints are specific to atomic species, the average in Eq. (2) should be taken over atoms of the same atomic species α and the resulting vectors associated with each species are then concatenated. Apart from the average kernel used here, one may also use the MATCH or RE-MATCH kernel of Ref. [18], although we did not notice a significant improvement during the subsequent analysis. We employ the Smooth Overlap of Atomic Positions (SOAP) framework that was introduced in Ref. [21] to construct the local fingerprints Ψ (X). SOAP has been used together with Gaussian Process Regression in numerous applications, including metals, semiconductors, molecular crystals and small organic molecules [18,[22][23][24][25], as well for structural identifications of bulk materials, including ice [17,19] and TiO 2 [20]. The SOAP representation is specific to atom species. For atoms of species α inside a local environment X, it uses a smooth atomic density function by summing over Gaussians centred on each atom i of species α that has a displacement r i within a given cutoff r c of the central atom of the environment X. The density ρ α X (r) is invariant to translations and permutations of identical atoms, but not to rotations. The SOAP representation addresses this by first expanding with a set of orthonormal basis functions on radial direction g(|r |) and spherical harmonics of angular directionŝ r as and then taking the power spectra that characterise the rotational-invariant arrangement of atoms of species α inside the local environment X [18,21] k α nn l (X) = π The vector {k α nn l } constructed in this way up to certain cutoffs l max and n max can then be used as the local fingerprint Ψ (X) in Eq. (2), which in turn leads to the kernel matrix. In practice, we use the recently implemented DScribe Python package for constructing descriptors and kernels [26].
b. Low-dimensional maps The kernel matrix {K(A, B)} provides distance measurements between structures in a highdimensional space. To visualise such distances, we project them onto a two-dimensional map using kernel principal component analysis (KPCA) [27], which amounts to projecting {K(A, B)} onto its two eigenvectors with the largest eigenvalues. Fig. 1 shows such 2D maps for the structures found during the RSS at 20 GPa.
c. Clustering For the set of structures generated by RSS at each pressure, we clustered them based on the similarity measurements {K(A, B)} in order to find which structures belong to the same crystallographic family using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [28]. In this approach, the density of data points around a given data point is first estimated and then points with a density above a certain threshold are identified as clusters. Input data points that are not ascribed to regions of high data density are classified as noise. The most important DBSCAN parameter is the maximum distance between two samples for them to be considered neighbours. In our case, there is a very clear separation between similar and dissimilar structures obtained from RSS due to the absence of thermal noise, so the clustering outcome is insensitive to this DBSCAN parameter.
As an example, in Fig. 1 we show such clustering results. Even though the information on space groups, molar volume and enthalpy is not explicitly included in the construction of the kernel matrix {K(A, B)}, this agnostic pattern recognition approach is able to group together structures with similar properties extremely well.
d. Identification of known and new phases In the KPCA map of Fig. 1, we also project the location of previously found crystal structures of TiO 2 , and, in an ad hoc manner, the new structures found in the present study. If a cluster that contains several structures found in RSS also includes a previously known phase, it is given the label of this phase, otherwise it is regarded as a new phase. In this way, we found 15 phases which have not to our knowledge previously been considered in studies of titanium dioxide in this pressure range. In the remainder of this manuscript we label them with a standard format P-a-SYM, where P gives the pressure in gigapascals at which the structure was found in RSS, a is a lowercase letter purely used for labelling, and SYM gives the space group of the structure. These new structures are illustrated and described in the Supplementary Information (Fig. S1). One of the new phases (60-d-P62m) is in fact the Fe 2 P phase of TiO 2 that was previously considered at higher pressures than we have focused on here [11,29].
e. DFT calculations Although the MA potential satisfactorily predicts many properties of TiO 2 [30,31], it does not give a perfect description of interactions, so the fact that a new phase is found using the MA potential does not necessarily mean that the same holds for TiO 2 in experiment. To check whether the new structures may be of experimental relevance, we computed the lattice energies and enthalpies of all the phases at the DFT level over a pressure range of 0 GPa to 70 GPa (see Methods). We used three functionals, LDA, PBE and PBEsol, and the results are shown in the Supplementary Information (Fig. S3). Although a recent study [32] shows good agreement in the ranking of static-lattice energies of phases at low pressure between diffusion quantum Monte Carlo and a number of DFT functionals, the DFT calculations for TiO 2 should be approached with a degree of caution particularly at low pressure [29], since for example rutile is not predicted to be the stable phase at sufficiently low pressures for any of them, the results for the known phases are consistent with previous work [7,[33][34][35]. At high pressures (∼60 GPa), the three functionals gave consistent results, suggesting that the results are robust under such conditions. Although none of the newly reported phases have the lowest enthalpy, several phases are very competitive. In particular, 60-d-P62m and 60-e-P2 1 /m have enthalpies very close to the ground-state phase of cotunnite at high pressures (P ≥ 40 GPa), and the 60-a-R3 phase, which, as we discuss below, is particularly favoured for the MA potential, has an enthalpy only a few kcal mol −1 per formula unit larger than cotunnite. The difference in relative enthalpy between different levels of theory highlights the importance of capturing metastable phases, as in general the enthalpies predicted by empirical potentials or density-functional approximations may not be accurate enough to determine the true ground-state enthalpy.

B. Approximation of entropy using frequency of appearance
In this section, we investigate whether it is possible to approximate the entropies of different phases solely from the information one can obtain from RSS. The RSS scheme, which involves performing energy minimisation of random structures, is rather similar to the direct enumeration method [36] that can be used to find the basin of attraction of granular or glassy systems [36][37][38], although RSS is somewhat more complicated as variable cells with different numbers of atoms are used (see Methods). It is thus interesting to explore whether it is reasonable to assign a value for the entropy of the basin as S b = k B ln v for structures found in RSS, where v is a measure of its basin volume, and if this basin entropy is related to the thermodynamic entropy S that enters the Gibbs energy.
To estimate the entropy of the basin of the crystal structures found in RSS, we assume that (i) S b of a crystal structure is extensive with respect to its system size, S b = ns b , where n is the number of TiO 2 formula units; and (ii) when performing energy  (Fig. S4). Each data point is coloured according to the relative enthalpy of a structure compared to the ground state at the same pressure. As only the relative entropy under a given thermodynamic condition between different phases matters, we centred the data about the origin. minimisation with the same number of TiO 2 formula units, ns b is proportional to the logarithm of the frequency of finding a certain crystal structure. With these strong assumptions, we are able to infer the relative basin entropy for each crystal structure considered. The value of s b obtained in this way depends on the frequency of occurrence of each structure for a given choice of n. However, for structures whose unit cells do not have compatible numbers of formula units, we can nevertheless infer the difference in s b between two such polymorphs provided that they are found in searches together with other structures that have a well-defined s b for both values of n.
To test our assumptions, and to compare the basin entropy s b of the crystal structures and the thermodynamic entropy S, we plot them against each other in Fig. 2. Given that the two sets of entropies were estimated using completely different approaches, it is remarkable how similar their spread is. The Pearson correlation coefficient (PCC) is estimated to be r ≈ 0.3, and the root mean squared error (RMSE) is 0.2k B . Note that the entropy difference between different phases of this system is itself relatively small, so even a rather small RMSE can significantly reduce the PCC.
We notice that low enthalpy structures tend to have s b larger than S: low enthalpy phases are found more frequently in RSS than one might expect from their thermodynamic entropies. This may be because how the initial structures are prepared and how the subsequent enthalpy minimisation is performed during RSS can introduce a bias towards locating deeper minima, which is advantageous if one aims to find stable phases over high energy ones. To achieve a better estimate of S, one can thus use a linear combination such as s b − aH, where a is a parameter to the fit aH + constant = s b − S over all data points. We show a comparison between this adjusted entropy s b − aH and S in Fig. S4 of the Supplementary Information, which shows a much stronger correlation (r ≈ 0.7). Of course, determining the value of a requires prior knowledge of S, so such a correction scheme is not especially useful in practice.
Nevertheless, the results here demonstrate that the frequency of finding a structure in RSS does encode some information about its entropy. Furthermore, the correlation between s b − S and H means that the difference in s b is a better estimator of the actual entropy difference between structures with similar enthalpies than for pairs with large enthalpy differences. Since we are typically interested in phases in the vicinity of the most favourable structure, this may be sufficient.
For the sake of demonstrating the level of approximation one can achieve in estimating the free energy using only RSS results, whenever we discuss ML-based approximations of G, we use s b to estimate the true entropy. To obtain a more accurate estimate of S, one can use the harmonic approximation [6], or if resources allow, the TI method [6,8,9].

C. Machine-learning prediction of enthalpy and entropy
As the number of local minima in the PES grows exponentially with increasing system size, it is difficult to ensure that all the important structures have been found during RSS. Indeed, not all the previously characterised or new structures of TiO 2 have been found at each pressure considered. There may also be situations where a certain crystal structure is known for an analogous system, but may not have been considered for the system of interest. In all such cases, we may wish to predict the enthalpies as well as the free energies of such structures. Of course, for a certain lattice arrangement, one can choose to perform energy minimisations and free-energy calculations, but it would be convenient to have a surrogate model that can quickly assess its enthalpy and basin entropy.
We developed such an approximate model using the kernel ridge regression (KRR) approach. In the following, we only discuss estimating H; the estimation of S b follows the same workflow. For a given pressure, an enthalpy estimate function defined on a given structure A is represented as a linear sum over kernel functions which are summed over a representative set of structures M.
The M representative structures are selected from the total N structures that are generated from RSS at each pressure.
For this selection, we exploited the farthest-point sampling (FPS) approach, which is a greedy algorithm that aims to select reference structures that are as diverse as possible by successively selecting a new point that is farthest from the ones already selected. The set of weights {w M } can be determined where K M M , K N N , K M N denote the kernel matrix between the M representative set, between the N complete set, and between the representative and the complete set, respectively; Λ is an N × N diagonal matrix that is commonly used in the KRR framework to regularise the fit; and H N is the enthalpy per formula unit of the N structures found by RSS. To estimate the statistical error due to the finite size of the training set N, we employed a subsampling technique [39]: we created an ensemble of KRR models using a subset of the training data and used the variance of these model predictions to infer the uncertainties.
We plot the predictions of the KRR model alongside values obtained in direct energy minimisation simulations in the Supplementary Information [Fig. S2]. The agreement of the prediction with the calculated value is excellent, demonstrating that the ML model is able to capture the fundamentals of the interactions based on local environments, even though long-range Coulombic terms are present in the empirical potential.

D. Phase behaviour and phase diagram estimation
In Fig. 3(a), we show chemical potentials (i.e. Gibbs energies per formula unit) computed using TI at a relatively high temperature of 1600 K. We chose this temperature because entropic effects have become important, but the majority of the phases of interest are still metastable across the pressure range considered. The baddeleyite phase and several of the newly considered phases [namely 0-c-I4 1 /a, 0-e-C2/c, 60-c-Pc and 60e-P2 1 /m] do not appear in Fig. 3(a), because they are no longer metastable at such high temperatures. Baddeleyite, for example, spontaneously converts into pyrite at approximately 1150 K when heated at 20 GPa. One particularly interesting transition is that between rutile and the new phase 40-a-Pnnm (see Supplementary Information (Fig. S1)), which are structurally particularly similar. The transition between them appears to be continuous as the system is pressurised: while the space group changes, there does not seem to be any change in density or enthalpy. In a sense, the 40-a-Pnnm polymorph is thus merely a high-pressure analogue of the rutile phase. Intriguingly, Fig. 3(a) shows a new phase, 60-a-R3, is the most stable phase above ∼60 GPa, even though for some of this pressure regime, its enthalpy is not the lowest. This example highlights the importance of properly accounting for the entropy of the polymorphs. A number of other new phases also have free energies lower than some of the known structures of TiO 2 , demonstrating the power of the RSS approach in successfully locating possible candidate phases for consideration.
To circumvent extensive and laborious free-energy calculations, we can approximate the chemical potential using solely the data obtained from RSS. If a structure is found in RSS at a particular pressure, we use h(0 K) − T s b to approximate its chemical potential, where h is the enthalpy per formula unit. For structures that are not found, we first use the ML schemes outlined in Subsection II C to estimate the enthalpy and basin entropy, and then estimate the chemical potential using the same expression. These chemical potentials are plotted in Fig. 3(b); phases found in RSS are indicated by solid lines and phases not found by dashed lines. As RSS was performed at four pressures, linear interpolations were used between the intervals. Alternatively, if one has data at only one pressure, a simple linear approximation G(P + ∆P) ≈ G(P) + ∆PV for extrapolation to nearby pressures can be considered [2].
The agreement with the calculations of Fig. 3(a) is rather remarkable, with broad consistency of both the shape of the curves and the ordering of the structures (within error bars). In part, this is because of the excellent prediction of the enthalpy ( Supplementary Information, Fig. S2); in addition, we note that the approximate approach is also able to account for example for the fact that the OI phase has a low enthalpy, but a narrow basin of attraction, and becomes less favourable at sufficiently high temperatures.
We note that Fig. 3 encodes the information required for constructing phase diagrams: the phase with the lowest chemical potential at a given pressure is the most stable for this potential at this temperature, and the crossover of the chemical potential curves is a point of phase coexistence. Using either the full free-energy calculation approach or the MLbased approximation, we can repeat the procedure at several temperatures and can thus determine the phase diagram of this system [see Supplementary Information, Fig. S7]. However, it ought to be borne in mind that a phase diagram hides much of the underlying thermodynamic behaviour, particularly for metastable phases, so the prediction of chemical potentials, as illustrated in Fig. 3(b), is thus a rather more rigorous test of the ML approach. Furthermore, since the free energies of many of the phases considered are similar to one another and many phases are metastable under certain conditions, predicting the metastablities of all relevant polymorphs is particularly important.

III. CONCLUSIONS
In this paper we present a framework that uses a combination of ML toolkits to help extract information from a random search of crystal structures and to estimate the phase diagram and metastabilities of solid polymorphs. This framework does not rely on any prior knowledge of the phase behaviour of the system, and allows an automated and comprehensive exploration and characterisation of possible solid phases. We have implemented a pattern recognition approach that permits us to cluster similar structures with little manual input. Grouping the structures enables us to estimate their entropy by assuming that their basin of attraction is related to the number of times each structure is found. Moreover, we use a machine learning approach to predict the enthalpy of unknown structures based on ones that are observed. Combining all these, we can estimate the free energies of solid phases and hence the thermodynamic phase behaviour of a system of interest with relatively little computational effort.
We tested the methods on the rather complicated system of TiO 2 at ambient and high pressures. We predicted a number of possible phases and estimated their chemical potentials at finite temperatures. Even though we only explicitly showed the chemical potentials at a moderately high temperature, the MLbased framework can be used for any reasonable temperature. Pair potentials similar to the ones used for our RSS are widely used for other binary oxides [40,41], so the new phases we found and the methodology introduced may be relevant to other oxides as well.
The methods we propose are transferable to other systems with many solid polymorphs, such as high-pressure solid hydrogen [42], perovskites [43], different phases of water-ice [17] and molecular crystals [44]. A useful strategy for determining the chemical potentials of low-free-energy polymorphs may be to use the ML toolkit to predict the chemical potentials initially, and then to compute explicit free energies only of those phases which the ML prediction indicates can be competitive. Such an approach would make the determination of phase behaviour of systems exhibiting many solid polymorphs using accurate free-energy calculations considerably more straightforward and computationally tractable. In particular, if one wishes to study the phase behaviour of a PES computed from expensive quantum chemistry methods or sophisticated ML potentials [45], using the ML toolkit may become indispensable.

IV. METHODS
a. Calculations using empirical potential For RSS and accurate free energy calculations, we focus on the simple MA empirical pair potential for TiO 2 [15] because of its low computational cost [9].
We performed RSS at 0 GPa, 20 GPa, 40 GPa and 60 GPa using the AIRSS [46,47] package interfaced with Lammps [48]. For each RSS run, we first chose a reasonable cell shape at random, and added 2, 3, 4, 6, 8, 9 or 12 formula units of TiO 2 into the simulation cell at random positions while keeping the initial density of the cell close to the typical density range of this system. We set a lower bound on the interatomic distance for each pair of atomic species, but otherwise imposed no additional constraints or symmetries on the initial structures.
We then relaxed the structure using a second-order conjugate gradient algorithm until the forces on atoms and the difference between the target and actual pressures both became negligible.
To benchmark how well our approximate framework can predict the phase behaviour of TiO 2 at ambient and high pressures as described by the MA potential, we also performed free-energy calculations for both the known phases and the new phases obtained from RSS. Specifically, we computed the chemical potentials of the polymorphs using Frenkel-Ladd integration [8,49] and subsequent TI along isobars and isotherms, as detailed in Ref. 9.
All the necessary input files for performing the abovementioned calculations using AIRSS and Lammps are provided in the supporting data.
b. DFT calculations For the metastable polymorphs identified with the MA potential, we performed geometrical optimisation and computed the enthalpies at the DFT level over a pressure range of 0 GPa to 70 GPa in steps of 10 GPa. We separately employed three common functionals, LDA [50], PBE [51] and PBEsol [52], using the CASTEP ab initio simulation package [53]. Full details of the DFT set-ups and configurations can be found in the input files supplied in the supporting data.