Elena Zamaraeva a, Christopher M. Collins b, Dmytro Antypov a, Vladimir V. Gusev ac, Rahul Savani *cd, Matthew S. Dyer ab, George R. Darling b, Igor Potapov ac, Matthew J. Rosseinsky *ab and Paul G. Spirakis ac
aLeverhulme Research Centre for Functional Materials Design, University of Liverpool, Materials Innovation Factory, Liverpool, UK. E-mail: M.J.Rosseinsky@liverpool.ac.uk
bDepartment of Chemistry, University of Liverpool, Liverpool, UK
cDepartment of Computer Science, University of Liverpool, Liverpool, UK. E-mail: Rahul.Savani@liverpool.ac.uk
dThe Alan Turing Institute, London, UK
First published on 26th September 2023
Crystal Structure Prediction (CSP) is a fundamental computational problem in materials science. Basin-hopping is a prominent CSP method that combines global Monte Carlo sampling to search over candidate trial structures with local energy minimisation of these candidates. The sampling uses a stochastic policy to randomly choose which action (such as a swap of atoms) will be used to transform the current structure into the next. Typically hand-tuned for a specific system before the run starts, such a policy is simply a fixed discrete probability distribution of possible actions, which does not depend on the current structure and does not adapt during a CSP run. We show that reinforcement learning (RL) can generate a dynamic policy that both depends on the current structure and improves on the fly during the CSP run. We demonstrate the efficacy of our approach on two CSP codes, FUSE and MC-EMMA. Specifically, we show that, when applied to the autonomous exploration of a phase field to identify the accessible crystal structures, RL can save up to 46% of the computation time.
We focus on basin-hopping (BH),15 which is a popular choice for CSP.16–18 In the BH approach, Monte Carlo (MC) sampling is used to generate possible crystal structures at a given composition. For each generated structure, a local energy minimisation is performed to find the nearest minimum. Given the locally optimised crystal structure with its calculated energy, the code will have a set of possible actions that generate the next trial structure according to a set of deterministic or probabilistic rules. Intuitively, the choice of action determines how “far” the new trial structure is from the current structure on the PES of that composition. For example, a “local action” may swap two atoms, whereas a more transformative action may replace the current structure with a new randomly created one (i.e., this action “starts the search again”). A typical policy is a fixed discrete probability distribution over actions, from which an action is drawn at random every time BH generates a trial structure. While such policies can balance actions effectively if the action probabilities are well tuned (which requires historic data and is not trivial), they remain fixed during the run and are applied independently of the current structure or other contextual factors.
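A fixed policy of this kind can be sketched in a few lines. The action names and probabilities below are illustrative placeholders, not the actual action sets or tuned values of FUSE or MC-EMMA:

```python
import numpy as np

# Hypothetical action labels; the real FUSE/MC-EMMA action sets differ.
ACTIONS = ["swap_atoms", "swap_modules", "change_cell", "new_random"]

# A fixed policy: one hand-tuned probability per action, independent of
# the current structure and constant throughout the run.
FIXED_POLICY = np.array([0.5, 0.2, 0.2, 0.1])

def choose_action(rng: np.random.Generator) -> str:
    """Draw the next basin-hopping action from the fixed distribution."""
    return ACTIONS[rng.choice(len(ACTIONS), p=FIXED_POLICY)]

rng = np.random.default_rng(0)
print(choose_action(rng))
```

Every trial structure is generated by a draw from this same distribution, regardless of where the search currently sits on the PES.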
Compared to the typical case of a fixed policy that does not depend on the current structure, the RL approach that we develop and evaluate in this paper makes two changes. Firstly, our policies depend on the current structure, in particular the policy uses the calculated energy to help decide which action to take. Secondly, our policies are improved on the fly during a CSP run, thus removing the need for costly and composition-specific hand tuning.
RL algorithms adhere to the principles of Markov Decision Processes19 and operate on states, actions, and rewards. In a given state, a decision is chosen for this state and then the corresponding action is taken. By taking this action, a reward is obtained (positive, negative, or zero), and the state is updated according to the underlying dynamics of the RL problem. The goal of RL is to obtain policies, i.e., (possibly stochastic) mappings from states to actions, which maximise the expected long-term aggregated reward. The state space that we use to allow the policy to condition on the current structure is defined by the calculated energy of the structure. The calculated energy is clearly a very important “feature” of the structure, and other features could also be incorporated into the state space. The available actions that we consider are based on the existing CSP codes that we apply our approach to. The reward in the simplest case is the reduction of the calculated energy between the current and the trial structure. In addition to rewarding those actions that produce structures with lower energies, we also introduce penalties to disincentivise certain outcomes such as repeating structures.
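The state and reward just described can be made concrete. This is a minimal sketch: the paper uses the calculated energy as the sole state feature, but the binning of energies into discrete states (`n_bins`, the clipping) is an assumption made here for illustration:

```python
import numpy as np

def energy_state(energy: float, e_min: float, e_max: float,
                 n_bins: int = 20) -> int:
    """Map a calculated energy onto a discrete state index.
    The discretisation scheme is illustrative, not taken from RL-CSP."""
    frac = (energy - e_min) / max(e_max - e_min, 1e-12)
    return int(np.clip(frac * n_bins, 0, n_bins - 1))

def reward(e_current: float, e_trial: float) -> float:
    """Simplest reward: the energy reduction achieved by the action."""
    return e_current - e_trial
```

With this reward, actions that lower the energy receive positive reinforcement, while actions that raise it are discouraged.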
We apply RL to basin-hopping CSP by adapting and implementing the reinforcement learning algorithm REINFORCE,20,21 which is based on “policy gradients”. With this approach, the policy is parameterized, and the parameters of the policy are improved over time using gradient-based approaches, where the gradients are based on recent experience of the current policy so that actions that led to desirable outcomes are likely to be used again in similar situations. Intuitively, the RL algorithm leverages its past experience to select good actions when similar situations are encountered again as the run proceeds. We refer to this variation of REINFORCE in the CSP context as RL-CSP and illustrate it in Fig. 1.
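A tabular REINFORCE update of the kind described above might look like the following sketch. The softmax parameterisation and the learning rate are assumptions for illustration, not a transcription of the RL-CSP implementation:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over action preferences."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta: np.ndarray, state: int, action: int,
                     reward: float, lr: float = 0.001) -> np.ndarray:
    """One REINFORCE step for a tabular softmax policy.
    theta has shape (n_states, n_actions). For a softmax policy,
    the gradient of log pi(a|s) is one_hot(a) - pi(s)."""
    pi = softmax(theta[state])
    grad_log = -pi
    grad_log[action] += 1.0
    theta[state] += lr * reward * grad_log
    return theta
```

After a positive-reward update, the probability of the taken action in that state increases, which is exactly the "actions that led to desirable outcomes are likely to be used again" behaviour described above.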
Although the last penalty is more general than the one before, it is practically useful to control these penalties independently, e.g., by removing one completely, or by imposing different penalty levels for zero reward cases where the action does not alter the structure and non-unique energy structures where the action does not promote exploration.24 When penalties are used and one of the three outcomes above takes place, RL-CSP does not calculate the reward in a usual way as the energy drop. Instead, to penalize this action as undesirable, the algorithm calculates the reward for getting into a structure with the “mean high energy” Ep, which is the mean of the 10% of the highest energies found so far. The RL-CSP code allows the user to fix the sizes of the different penalties or turn on only some (or none) of them. In this work we either used all three types of penalties calculated via Ep or no penalties at all.
The RL-CSP routine can be divided into two alternating steps: action selection according to the current policy and updating the policy. We provide the descriptions of these steps below.
In this paper we used ε = 0.1.
The threshold on the entropy (0.8 in this study) below which the entropy regularization mechanism turns on can be tuned as a hyperparameter. The list of hyperparameters with the values used in this study is provided in ESI Note 1.†
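The entropy-triggered regularization can be sketched as below. The normalised-entropy threshold (0.8) and ε (0.1) follow the values quoted in the text, but the use of ε-mixing with a uniform distribution as the regularization mechanism is an assumption made for this sketch:

```python
import numpy as np

def entropy(pi: np.ndarray) -> float:
    """Normalised Shannon entropy of an action distribution, in [0, 1]."""
    p = np.clip(pi, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

def maybe_regularise(pi: np.ndarray, epsilon: float = 0.1,
                     threshold: float = 0.8) -> np.ndarray:
    """If the policy's entropy falls below the threshold, mix in a
    uniform distribution to maintain exploration (assumed mechanism)."""
    if entropy(pi) < threshold:
        uniform = np.full_like(pi, 1.0 / len(pi))
        return (1 - epsilon) * pi + epsilon * uniform
    return pi
```

A policy that has collapsed onto one action (low entropy) is nudged back towards exploration, while a policy that is still exploratory is left unchanged.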
The pseudocode of RL-CSP is shown in Fig. 2.
Fig. 3 Basin-hopping (BH) search routine with RL-CSP. BH (in black) is the basis for the searches used in both the FUSE and MC-EMMA structure prediction codes (described in ESI Notes 3 and 4†). “Rand” is a randomly generated number between 0 and 1, ΔE is the difference in energy between the trial structure and the current structure, and kT is the “temperature” parameter for the system (set by the user). The “Break” condition is triggered when the maximum allowable number of consecutive rejected structures is reached (defined by the user). The text in red shows how RL-CSP is embedded into BH; the RL-CSP policy is used instead of the fixed probability distribution to select the action to alter the structure that maximises the reward, and the policy is then updated based on the results of the energy calculation.
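The accept/reject stage of the BH routine described in the caption amounts to a Metropolis-style criterion; a minimal sketch, assuming the standard form (always accept downhill moves, accept uphill moves with probability exp(−ΔE/kT)):

```python
import math
import random

def accept(delta_e: float, kT: float, rng: random.Random) -> bool:
    """BH accept/reject step: accept if the trial lowers the energy,
    otherwise accept with probability exp(-dE/kT) ('Rand' in the figure
    is the draw from rng.random())."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / kT)
```

Larger kT makes uphill moves more likely to be accepted, trading off exploration of the PES against convergence to the current basin.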
A common practical use case of CSP in materials discovery is to identify candidate new compositions for synthesis. This amounts to approximating the convex hull by means of CSP calculations over a wide range of compositions in the phase field.3,26,29 As a measure of potential gain in this context, we aggregate performance improvement across all compositions within the same phase field to demonstrate that our RL implementation with FUSE outperforms the two fixed policies it is compared with, requiring fewer than 68% and 54% of the steps needed by the fixed policies, without the additional cost of having to manually tune a fixed policy.
Fig. 4 Actions in FUSE. (a) The probability distribution of actions 1–7 and 9–10 utilized in the FUSE study.4 Action 8, which is currently in development, is not used in this study. (b) Actions 1, 9 & 10 swap the positions of two, three to N − 1, or all of the atoms in the structure respectively (N is the total number of atoms in the structure); these actions were all labelled as action 1 in the FUSE study, but the proportions in which they were called match the segments in the pie chart. (c) Action 3 alters the instructions used to assemble the structure, for example by changing one of the unit cell angles. (d) Action 2 switches the positions of two sub-modules within the structure. (e) Action 4 switches the positions of two complete modules within the structure. (f) Action 5 doubles the structure along a chosen unit cell axis. (g) Actions 6 & 7 generate a new random structure. For action 6, the unit cell is limited to be of the same size or smaller; for action 7 it can be of any size, so long as it conforms to the limit on the number of atoms used within a run of FUSE.
FUSE was benchmarked on compositions from the Y3+–Sr2+–Ti4+–O2− phase field, containing the following known compounds: TiO2, SrO, Sr2TiO4, Sr3Ti2O7, SrTiO3, Sr4Ti3O10, Y2TiO5, Y2Ti2O7, SrY2O4, Y2O3 (other compounds are reported, but adopt modified versions of the structures of these ten compounds).4 In preliminary testing, we found that TiO2, SrO, Sr2TiO4, Sr3Ti2O7, and SrTiO3 are trivial for FUSE with the global minimum being found within ten BH steps, and so do not use these compositions in our experiments. We test RL-CSP integrated into FUSE on the remaining five compositions Sr4Ti3O10, Y2TiO5, Y2Ti2O7, SrY2O4, Y2O3, for which highly probable global minimum structures were reported.4 For the compositions: Sr4Ti3O10, Y2TiO5, Y2Ti2O7, SrY2O4, we allow FUSE to use up to 50 atoms per unit cell and for Y2O3 we allow 80, ensuring that for each composition it is possible to generate the observed global minimum structure. We contrast these results with those obtained with FUSE using two fixed policies.
In the first fixed policy, all possible actions are chosen with equal probability (Uniform). The other fixed policy was hand-tuned in the FUSE study4 in such a manner that the code efficiently locates the crystal structure of Y2Ti2O7 with a limit of 150 atoms per unit cell. This policy is the default search policy in the FUSE code (Original, Fig. 4a). Thus the comparison of three policies, Uniform, Original and the dynamic RL-CSP policy (RL-CSP) is made by computing the structures of materials from the Y3+–Sr2+–Ti4+–O2− phase field defined above.
For the test set of five compositions (Sr4Ti3O10, Y2TiO5, Y2Ti2O7, SrY2O4 and Y2O3), we compare performance on one key metric: the number of steps taken to identify the global minimum for each composition-policy pair (see ESI Note 5†), where each step consists of the structure modification via an action and subsequent local energy minimisation using GULP. Each step in the CSP run is followed by an accept/reject stage to determine whether to accept the new structure or retain the previous one (see Section 2.2). As CSP codes are stochastic, multiple repeat runs are required to build up statistics. While there is no common standard for how many repeat calculations to perform when testing a CSP code, typical benchmarking examples perform ∼30–70.5 In this work we perform 40 repeats for every composition and policy tested, for a total of 600 runs across the five compositions, and calculate the “mean number of steps” Nmean for each composition-policy pair. Nmean represents how many steps FUSE needs on average to find the structure with the global minimum energy (Fig. 5a and b). For the majority of compositions, namely Sr4Ti3O10, Y2TiO5, Y2Ti2O7 and SrY2O4, Nmean is calculated as the mean number of steps required to find the global minimum among all 40 runs. However, the Original policy is unable to attain the global minimum for SrY2O4 in 28 runs out of 40, given a minimum of 90000 steps in the search, so a lower bound on Nmean is provided for this composition-policy pair, calculated as the mean number of steps to completion among the 12 successful and 28 stopped runs.
For Y2O3, the calculation of the mean number of steps required to find the global minimum among all 40 runs presents a challenge as only 12 runs on average attain the global minimum within ∼90000 steps, with 15, 9 and 12 successful runs for Original, Uniform and RL-CSP policies respectively. As we aim to evaluate RL-CSP policy and only the steps taken to obtain the global minimum are relevant here, we use the fastest 12 runs to compute Nmean for Y2O3. The Uniform policy managed to find the global minimum in only 9 runs, so the mean number of steps among the 9 successful and the 3 fastest unsuccessful runs is used as the lower bound of Nmean in this case. Nmean for all composition-policy pairs is shown in Fig. 5a and coloured red if the lower bound is presented. We also display the number of steps as a distribution for each composition and policy in a box and whisker plot (Fig. 5c).
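The Nmean statistic, including the lower-bound treatment of runs that never reached the global minimum, can be computed as in this sketch (the step cap and input format are illustrative):

```python
import numpy as np

def n_mean(steps_to_minimum: list, step_cap: int):
    """Mean number of steps to the global minimum over repeat runs.
    Runs that never reached the minimum are recorded as None and counted
    at the step cap, so the returned mean is then a lower bound on the
    true Nmean (as reported for SrY2O4 with the Original policy)."""
    capped = [s if s is not None else step_cap for s in steps_to_minimum]
    is_lower_bound = any(s is None for s in steps_to_minimum)
    return float(np.mean(capped)), is_lower_bound
```

When `is_lower_bound` is true, the value corresponds to the red-coloured entries in Fig. 5a.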
Comparing the RL-CSP, Uniform and Original policies, Fig. 5 shows that RL-CSP is the fastest to locate the global minimum for three compositions (Y2TiO5, Y2Ti2O7 and SrY2O4), and the second fastest for the other two: Sr4Ti3O10 (behind Uniform) and Y2O3 (behind Original). The latter case is discussed in detail in the subsequent section; analysis of the former shows that, although RL-CSP does not improve upon the Uniform policy for Sr4Ti3O10, the difference between their Nmean values (144 steps) is relatively small. We therefore conclude that the efficiency of RL-CSP is comparable to that of Uniform, and that the lack of improvement from learning for Sr4Ti3O10 can be attributed to the limited duration of the runs in this case.
We also provide a comparison based on Nmean for all five compositions, which represents the average number of steps taken across 40 runs for each of Sr4Ti3O10, Y2TiO5, Y2Ti2O7, and SrY2O4, along with the 12 fastest runs for Y2O3, resulting in a total of 172 runs. When compared with either fixed policy, RL-CSP is faster over the whole test set of compositions within the phase field, as judged by Nmean for all of the compositions, using fewer than 68% and 54% of the steps required by the Uniform and Original fixed policies respectively (Fig. 5a). The effect of learning across the phase field is more pronounced for more challenging compositions (for the Uniform policy) based on the strictly decreasing numbers in the last column of Fig. 5a. This leads to improved absolute savings of computational resources for the compositions for which they matter the most.
Fig. 6 Efficiency of different FUSE policies at the compositions SrY2O4 and Y2O3. (a) and (d) show how many runs have found the global minimum in the given number of steps. The analogous plots for compositions Sr4Ti3O10, Y2TiO5 and Y2Ti2O7 are provided in ESI Fig. 1(a–c).† (b and e) show the final policies learnt by RL-CSP for SrY2O4 (b) and Y2O3 (e) at the end of a typical RL-CSP run. (b and e) confirm that, unlike the fixed policy (see Fig. 4a), the RL-CSP policy provides an action probability distribution that depends on the composition and the energy of the current structure. The typical final policies for Sr4Ti3O10, Y2TiO5 and Y2Ti2O7 are presented in ESI Fig. 1(d–f)† respectively. (c and f) show the full dynamic policy for action type 2 (swap the positions of two sub-modules within the structure), varying with both energy (the full energy range found for each composition is shown) and step number, for SrY2O4 and Y2O3 respectively, illustrating that RL-CSP uses very different policies for the same action type between compositions.
In contrast to Y2O3, for SrY2O4 the Uniform policy is the better of the two fixed policies, being at least 2.9 times faster than Original. RL-CSP is then able to learn a policy that improves upon the performance of the Uniform policy (Fig. 6a), resulting in a significantly further reduced Nmean, with RL-CSP using just 71% of the steps required by Uniform.
These cases show that a fixed policy, even when tuned by hand, can demonstrate particularly poor performance for some compositions, for example, in the original FUSE study4 the global minimum for SrY2O4 was not obtained. In this work, the Uniform policy for Y2O3 and the Original policy for SrY2O4 both have a low success rate at finding their respective energy minima. The dynamic RL-CSP policy reliably improves the Uniform starting policy for both materials, taking at most 56–71% of steps required by the initial fixed policy.
Finally, we observe a clear improvement in the dynamic policies used by RL-CSP during a CSP run, illustrated by plotting the probability of RL-CSP using action 2 (swapping the positions of two sub-modules) against both the step number and energy (Fig. 6c and f). Studying the dynamic policy for different compositions shows that the learned policy depends not only on the composition but also on the structure in which the CSP run is currently located, as in no case does RL-CSP converge to a single probability distribution that works best for all structures at a given composition. However, for both compositions action 10 (Fig. 4b) is assigned a greater probability in lower-energy structures, indicating its increasing significance at low energy levels (Fig. 6b and e).
The comparison of RL-CSP performance in SrY2O4 and Y2O3 with different fixed policies indicates that, while manually tuned fixed policies may be able to provide a performance advantage over RL-CSP for specific compositions, such as the Original policy for Y2O3, RL-CSP should be the default option when exploring a whole phase field because this necessitates performing CSP for a range of compositions, and RL tailors the policy dynamically to each composition, thus avoiding locked-in poor performance from a fixed policy, such as the Original policy for SrY2O4 or the Uniform policy for Y2O3.
MC-EMMA uses a BH routine that is similar to that of FUSE; however, it was designed for larger crystal structures with up to 160 atoms, while FUSE was tested with structures of up to 50–80 atoms depending on the composition. Although fewer steps are usually required to find the global minimum (∼1500–2000 steps for fixed policies) than for most FUSE compositions, the relaxation of the larger structures is more time-consuming, so MC-EMMA presents a different challenge for RL: to improve the starting policy using less data. As in the evaluation of RL for FUSE, we compare three versions of the policy in MC-EMMA: the original hand-tuned fixed policy (Original), the uniform policy (Uniform) and the RL-CSP policy that starts learning from Uniform. The action probability distributions for the Original policy are shown in ESI Fig. 2a.† We test the MC-EMMA policies at the composition YBa2Ca2Fe5O13, which the Original policy was tuned for. As in the FUSE benchmarking, we use 40 runs of MC-EMMA with each policy. However, in FUSE we observed that RL-CSP does not have enough time to learn the policy at compositions where Nmean is less than ∼10000 steps for the Uniform policy (Sr4Ti3O10 and Y2TiO5). Taking this into account, we boost the learning process in RL-CSP by increasing the learning rate to 0.01 (see Section 2.1) compared with the FUSE settings and by sharing the policy learning process across the 40 runs.
Policy sharing is organised via the policy vector θ, which is common to all runs. The 40 runs start simultaneously but choose their initial structures independently. Each run follows its own trajectory (sequence of structures and actions) and does not consider the trajectories of other runs. Given the current structure, a run selects an action according to the shared policy θ, observes a new structure and uses the obtained reward to update θ. When a run finds the global minimum, it stops, while the other runs continue following their own trajectories in which they have not yet found the global minimum. Fig. 7e demonstrates the shared policy learned by 40 runs of MC-EMMA with RL-CSP.
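The sharing scheme can be sketched as below: every run reads and updates the same parameter array θ while following its own trajectory. The table shape and learning rate are illustrative, and in a real multi-process setting the concurrent updates would need a lock or an atomic update scheme, a detail omitted here:

```python
import numpy as np

# Shared policy parameters: (energy states) x (actions).
# All runs read from and write to this one array.
theta = np.zeros((20, 10))

def run_step(run_state: int, action: int, reward: float,
             lr: float = 0.01) -> None:
    """One policy-gradient update contributed by a single run.
    Every other run immediately sees the change to theta."""
    prefs = theta[run_state]
    pi = np.exp(prefs - prefs.max())
    pi /= pi.sum()
    grad_log = -pi
    grad_log[action] += 1.0
    theta[run_state] += lr * reward * grad_log
```

Because all 40 runs contribute updates to the same θ, the policy accumulates roughly 40 times more experience per step than a single run would, which is what makes learning feasible within the ∼1500 steps typical of an MC-EMMA run.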
We compare the performances of the different policies using Nmean, the mean number of steps required to find the global minimum among 40 runs (Fig. 7). The numbers in Fig. 7a show a clear RL-CSP improvement over the starting policy, as it uses only 71% of the steps required by Uniform to find the global minimum. Due to the sharing of learning between runs, the RL-CSP policy (Fig. 7e) can be learned efficiently even though the average number of steps per run is relatively small (∼1500). Finally, RL-CSP essentially reproduces the performance of the Original policy, requiring just 4% more steps despite starting from Uniform.
Considering the specific practical case of calculating the convex hull for a phase field to identify candidate new compositions for synthesis, the ability of reinforcement learning to improve a fixed policy enables the development of policies tailored to each composition within the phase field and should favour the more rapid construction of the hull. Although it is possible that a single fixed policy performs best at some of the compositions studied, this study indicates that this is not likely to be the case in general, favouring further development of RL in CSP. Specifically, autonomous exploration across the compositions of a phase field can benefit from RL, as our work suggests at least a 32% improvement of RL-CSP when compared with the starting uniform policy.
Many such developments can be envisaged, for example in the type of reinforcement learning algorithm that could be applied beyond policy gradient methods. The learning model in RL-CSP operates only with energy as a feature to define the state space. This simplicity allows efficient operation. However, enriching the feature space with relevant structural characteristics would make learning quicker and more accurate. For this purpose, one could implement known structural characteristics such as the bond valence sum, or deploy neural networks to reveal the most relevant structural features, such as characteristic local coordination geometries for specific elements. More advanced RL algorithms can be adapted to the basin-hopping routine, for example, actor-critic methods were developed to deal with the high variance often encountered in policy gradient methods.31–33 In non-basin-hopping CSP algorithms containing decisions that constitute a policy, these decisions could also be amenable to reinforcement learning optimisation.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00063j
This journal is © The Royal Society of Chemistry 2023