Materials Discovery by Chemical Analogy: Role of Oxidation States in Structure Prediction

The likelihood of an element to adopt a speciﬁc oxidation state in a solid, given a certain set of neighbours, might often be obvious to a trained chemist. However, encoding this information for use in high-throughput searches presents a signiﬁcant challenge. We carry out a statistical analysis of the occurrence of oxidation states in 16,735 ordered, inorganic compounds and show that a large number of cations are only likely to exhibit certain oxidation states in combination with particular anions. We use this data to build a model that ascribes probabilities to the formation of hypothetical compounds, given the proposed oxidation states of its constituent species. The model is then used as part of a high-throughput materials design process, which signiﬁcantly narrows down the vast compositional search space for new ternary metal halide compounds. Finally, we employ a machine learning analysis of existing compounds to suggest likely structures for a small subset of the candidate compositions. We predict two new compounds, MnZnBr 4 and YSnF 7 , that are thermodynamically stable according to density functional theory, as well as four compounds, MnCdBr 4 , MnRu 2 Br 8 , ScZnF 5 and ZnCoBr 4 , which lie within the window of metastability.


I. INTRODUCTION
The idea of ascribing an oxidation state to a metal can be traced back almost 200 years. 1 As the phrase suggests, it was used to describe the amount of oxygen bound to an element that is known to form multiple oxides. Since then, oxidation states have helped in the formation of many fundamental chemistry concepts. For example, a plot of the periodicity of accessible oxidation states ( Figure 1) by Irving Langmuir was one of the key pieces of evidence that led to the adoption of the octet rule around 100 years ago. 2 The English term itself, "oxidation state" (or equally "oxidation number"), first came into common use in the realm of electrochemistry in the 1930s, 3 and in the 1940s it had gained widespread use 4 to replace the less-than-perfect system of appending -ous and -ic to the lower and higher oxidation states of metals, respectively. Ferrous became Fe(II), ferric became Fe(III), and transition metals with more than two oxidation states could now be unambiguously described. The term has remained an indispensable heuristic tool in almost all sub-disciplines of the physical sciences. It is integral to the way in which chemists think about the interaction of elements within molecules and solids. Linus Pauling first postulated that oxidation states could be determined by approximating bonds as 100% ionic according to the electronegativities of the elements involved. 5 This simple approach did not initially gain acceptance as the use of Pauling's electronegativity scale 6 resulted in some unusual assignments. Nevertheless, his approach is reflected in the modern definition given by IUPAC: "An atom's charge after ionic approximation of its heteronuclear bonds". 7 In practice, knowledge of the chemical formula is sufficient to assign formal oxidation states in many inorganic compounds; however, there are cases were ambiguities exist (e.g. mixed-valence compounds, electrides, polyanions and polycations). As highlighted in a recent essay by Karen, the "atom's charge", its "heteronuclear bonds" and the "ionic approximation" are all terms that need clarification, and there are choices to be made about how each is defined. 8 The subtlety of assigning oxidation states is still the subject of many lively discussions in both pedagogical and research contexts. [9][10][11][12][13] For practical purposes, we emphasise the insight of Jansen and Wedig, who point out that "It is a purely formal concept; nowhere within the definition is it claimed that a particular oxidation state can be associated with a real charge. Nevertheless, the term is certainly useful, since a specific oxidation state can be correlated to real properties." 14 It is this correlation to real properties that useful in a materials design context. Oxidation states have had a role to play in materials design for many decades. In the 1950's and 1960's, Goodman and Pamplin were able to systematically and exhaustively design superlattices of multicomponent compounds by substitution of the cations in simple binary semiconductors, while ensuring the octet rule remains satisfied. 15,16 This cation substitution (mutation) concept continues to inspire modern computational work on semiconductor design. 17,18 Knowledge of accessible oxidation states for each element is advantageous because we can generate many stoichiometric combinations while ensuring that there is overall charge neutrality. For example, the formal oxidation states q of any ternary combination A x B y C z must sum to zero: We have previously demonstrated that many chemically plausible formulas can be generated in this way. 19 For example, if the stoichiometry values (x, y and z in the above equation) are limited to integers ≤ 8, the search space for ternary combinations exceeds 1 × 10 8 , and for quaternary combinations it is over 2 × 10 11 . The resulting formulas can be fed into a high-throughput screening workflow that uses machine learning structure prediction models to screen for new functional materials. 20 In this study, we first carry out a statistical analysis on the occurrence of oxidation states in 16,735 stoichiometric, inorganic compounds in order to highlight trends and show that many elements only exhibit certain oxidation states in the presence of particular elements. We then go on to construct a screening model based on this data and apply it to the search space for ternary transition metal halides. The model we propose can be used as a general chemical filter when dealing with large composition search spaces, in order to remove those combinations of elements that are unlikely to form stable compounds. For example, we find that many transition metals are only likely to adopt their highest accessible oxidation states in the presence of sufficiently electronegative anions.

A. Data curation
We focus on the variation of oxidation states of metals in the presence of common anions. The anions we include are the first four group VI and VII elements in their most common oxidation states, i.e.: O 2 -, S 2 -, Se 2 -, Te 2 -, F -, Cl -, Br -, I -. These provide a reasonable range of electronegativities (Table I) and as such we do not include group V anions. This also allows us to avoid the metalloids As and Sb. The compounds included in the dataset originate from the Inorganic Crystal Structure Database (ICSD) and are downloaded from the Materials Project (MP) 21 using their API. 22 Full details of how the dataset was refined can be found in the Methods section. In broad terms, all the compounds meet the following criteria (total number of compounds remaining in the dataset shown in brackets): 1. Feature in both the ICSD and MP databases (34,913) 2. Calculated to be less than 100 meV/atom above the thermodynamic convex hull by the MP (30,781) 3. Oxidation states of all elements can be determined automatically using a bond valence analysis algorithm a (24,376) a This rules out those elements that were not included in the original study which proposed the algorithm 23 as well as intermetallic compounds.
4. Contain at least one anion (as defined above) and at least one metal (16,735) Figure 2 shows the resulting metals that are included after this refinement has been applied. In total, 16,735 different compounds are included.
FIG. 2: Periodic table illustrating the metals included in our statistical analysis (green) and the anions considered (purple).

B. Occurrence of oxidation states
In the first instance, we examine the occurrence of metal oxidation states as a function of the most electronegative anion present in each compound (see Table I). In each case, we normalise by the total number of compounds containing a given species (metal in a given oxidation state), i.e., we look at how the total number of instances of each species is distributed across the compounds. This is given by the ratio N SX N S , where N SX is the number of compounds containing the species S where the most electronegative anion is X, and N S is the total number of compounds containing the species S. These values are shown graphically for all species in the Supplementary Information.
Transition metal have the largest number of accessible oxidation states. Figure 3 shows the distribution of some first-row d-block species. Each of these exhibits the same general trend: The likelihood of finding a metal in a higher oxidation state increases when a more electronegative anion is present in the compound (increase in the relative heights of red bars in Figure 3). Meanwhile, metals are more likely to exhibit low oxidation states when the most electronegative anion present is of low electronegativity. More specific trends can also be extracted. For example, the higher oxidation states of Mn (Mn 5+ -Mn 7+ ) are exclusively exhibited in oxides. This is also the case for Cr 6+ , while Cr 5+ is limited to oxides and fluorides.
For higher oxidation states, the likelihood of finding the metal with an anion of moderate electronegativity, such as Cl -, Brand I -, often goes to zero before the likelihood of finding it with an anion of low electronegativity, such as S 2 -, Se 2 -, and Te 2 -. This is a trend that may not necessarily be expected, for example, going from V 2+ to V 4+ . It is also important to mention at this stage that oxides nearly always dominate each distribution as 10,546 of the 16,735 compounds contain oxygen. This point is addressed later when using the data predictively, in order to minimise bias. Figure 4 shows a similar trend in the distribution of some second-row d-block species. Again, compounds containing lower electronegativity anions, in the absence of any higher electronegativity anions, are more likely to contain lower oxidation state metal species. For the highest oxidation states of Ru (Ru 5+ and Ru 6+ ), the presence of For O 2is required.
The distribution of oxidation states is more even across moderate and low electronegativity anions for second-row transition metals compared to the first row. This is consistent with established principles of chemical hardness, b as applied to inorganic compounds by Pearson. 25 The order of chemical hardness for the halides is F − > Cl − > Br − > I − and, in general, the halide anions are harder than the chalcogenide anions, which is consistent with the electronegativity ordering in Table I. For cations, hardness increases with increasing charge. The species in the second row have consistently lower chemical hardness than the corresponding species above in the periodic table with the same oxidation state, so it should be expected that they form more compounds with softer halides.
For p-block metals, the trends are less pronounced. As shown in Figure 5, Fand O 2containing compounds are likely to exhibit the higher oxidation states of the metals. Moving from low to high oxidation states, there is less of a reduction in the fraction of compounds containing lower electronegativity anions compared with the transition metal compounds discussed so far. The reduction in the fraction of compounds containing moderate electronegativity anions is more pronounced for these metals. The general observation from this data that the oxidation states of these metals are more weakly correlated to the electronegativity of the counter-ions than transition metals, is expected based on the fact that transition metals have multiple, readily accessible oxidation states by virtue of their partially occupied d-bands. This is not the case for p-block metals, for which adding or removing electrons results in more significant energy differences.
The third row transition metals and lanthanide series display similar trends to the first b Chemical hardness is estimated by where I is the ionisation potential and A the electron affinity. This represents half the energy gap between the highest occupied orbital and lowest unoccupied orbital. Absolute electronegativity, 24 distinct from Pauling's definition, 6 is defined as − I+A 2 and represents the midpoint between the two orbitals. and second row transition metal series (see Supplementary Information). For completeness, we note that the alkali and alkali-earth metals only exhibit +1 and +2 oxidation states, respectively, for the vast majority of compounds. Similarly, Sc, Y, Zn and Cd are usually not strictly classified as transition metals as there is a strong energetic preference for them to adopt the oxidation states that lead to empty (Sc 3+ , Y 3+ ) or filled (Zn 2+ , Cd 2+ ) valence always reflect what is chemically accessible. Even assuming that the dataset is sufficiently diverse, heightened interest in particular compounds or compound classes can result in their over-representation, which is a general problem in data mining. Nevertheless, we have shown analysis of the dataset both recovers established chemical concepts and provides new insights. We now go on to develop a simple model that can be universally applied based on the dataset as a whole.

C. Probabilistic model of species combinations
There are more compounds where O is the most electronegative anion present than any other anion as shown in Table I. To use the information from our analysis, we must ensure that the occurrence of each anion does not bias the results. To this end, we define the probability that a species is present with a given anion as: where N M X is the total number of compounds containing the metal element where X is the most electronegative anion. We use this formula to construct a lookup table of 1,320 species-anion pair probabilities (P SA ). The table contains 411 probabilities that equal 0, and 195 probabilities that equal 1. The former represent all the pairings that do not occur within the dataset and the latter represent pairings whereby for a given anion, the metal only exhibits one particular oxidation state. The P SA values are also presented graphically in the Supplementary Information. We note that this still does not mitigate against limitations that are intrinsic to the dataset. For example, there are over 100 distinct CdI 2 crystal structures in the dataset (owing to the large number of distinct polymorphs) giving rise to an anomalously high probability for the Cd 2+ -Ipairing.
An overall compound probability can be calculated as the product of the individual P SA values. For example, for a ternary metal halide A a B b X x , the compound probability is calculated as: where M A and M B are the metal elements corresponding to species A and B. Stoichiometries are not factored in to the probability calculation, such that P AaB b Xx = P ABX . This ensures that compounds featuring elements that all have only one oxidation state are assigned a probability of 1.0. For example, Ca 2+ and Al 3+ are the only species in the database of Ca and Al, hence P CaAl 2 O 4 = 1.0. The number of compounds in the dataset that have compound probabilities above a given threshold, t, is shown in Figure 6. The number decreases steadily and linearly, before dropping off more rapidly as the threshold becomes more strict.

D. High-throughput compound design
We now use these compound probabilities to inform a high-throughput design workflow. Specifically, we explore the compositional landscape for ternary transition metal halide compounds. An overview of the workflow is shown in Figure 7. The smact code 19 is used to generate 54,484 A a B b X x compositions. Of these only 4,276 are in known chemical systems (A-B-X) within the MP database. The compositions are assigned probabilities as per Equation 3, and only 18,164 are non-zero, which represents an immediate three-fold reduction in the search space.
The number of compositions produced by smact that pass through this probability filter as the threshold, t, is increased from zero is also shown in Figure 6. Many compositions have FIG. 6: Total number of allowed compounds from the entire dataset (green triangles) and of allowed compositions for ternary metal halides only generated by smact (red crosses) as a function of compound probability threshold, t. Dotted vertical lines represent cut-offs that return 99%, 95% and 90% of the original dataset.
low probabilities, hence, contrary to the scenario for the compound dataset, the total number drops off rapidly as the threshold increases. This separation between the two curves would, in principle, allow for a threshold to be chosen that eliminates many suggested structures but is still inclusive of the majority of the structures in the dataset. For example, choosing a threshold that includes 90% of the structures in the dataset results in a further three-fold reduction of the search-space to < 6,000 compositions. If we set a probability threshold of t = 1, there are 346 compositions that pass through the filter and this equates to 88 distinct sets of three species. In order to demonstrate the rest of our workflow, we take 10 of these sets (Table II) to the next step, which is to assign them to likely crystal structures (first yellow box in Figure 7). For this, we adopt the structure substitution algorithm developed by Hautier et al. 26 This method also uses a statistical model and relies on a database of known compounds including oxidation state FIG. 7: Data-driven design workflow used to generate new stable compounds. P ABX is the compound probability from oxidation states analysis, which must be greater than the threshold, t. The structure prediction procedure has a separate threshold, σ. The structure with the highest σ is placed onto a phase diagram constructed using compounds from the MP database, and corresponding energies from the AFLOW-ML approach. Density function theory (DFT) is used to calculate the total energies of competing phases in order to determine the energy above the convex hull, E hull , of the new compound.
information: A combination of species is substituted onto lattice sites in known structures from the dataset of known materials. Each species substitution is associated with a certain probability, which comes from a model trained on the compounds that already exist in the ICSD. If the overall probability for a given set of substitutions is above a certain threshold, σ, it is added to a list of possible structures. This substitution process is performed, for each set of species, on each known crystal structure in the MP database. The structure with the highest overall probability for each of the set of 10 species is taken forward to the next stage of the workflow (second and third green diamonds in Figure 7). These are listed in Table II along with their parent compounds. Each structure is placed on a phase diagram in order to determine likely competing phases, which requires total energies as calculated using DFT. The key quantity of interest is the energy above the convex hull (E hull ) that is formed by drawing straight lines between thermodynamically stable phases. It was recently estimated by Sun et al. that around half of all known inorganic materials are metastable, 27 so to focus solely on thermodynamically stable compounds would be to potentially overlook kinetically stabilised, useful materials. The likelihood of existence drops off exponentially as E hull increases. The rate of decay depends on the chemistry of the system and we use 100 meV/atom as a guiding principle for the maximum E hull . The set of competing phases on which DFT calculations were performed was determined using a trained machine learning model (AFLOW-ML) in which structures are represented as property labelled fragments. 28 This stage reveals a key advantage of pursuing only those compositions with higher probabilities based on the oxidation states analysis: the parent binary compounds are well defined. Competing binary compounds exhibiting the metals in the same oxidation states as in the ternary (or multernary) compound are more likely to be known and amenable to total energy calculations to determine phase stability. Arbitrary combinations of species can result in stoichiometries that require energies of competitive gas or liquid phases which are subject to larger errors in DFT simulations. Figure 8 illustrates this point with a comparison between the phase diagram of three proposed ternary compositions. MnZnBr 4 has a probability of 1.0 as both MnBr 2 and ZnBr 2 are known and these decomposition products do not require a change of oxidation state of either metal. The ternary, therefore sits on the tie-line between the two binaries. The proposed compositions MnRuBr 6 and ScMnI 7 , however, both have probabilities of zero, as in each case one or more species-anion pair is not known to occur. These compositions sit in an equilibrium triangle as opposed to on a tie line, and the stability of the proposed compounds now depend on the chemical potential of the anion.  Table II. Two of the new compounds, MnZnBr 4 and YZrF 7 , are predicted to be thermodynamically stable with respect to competing phases. Of the remaining eight compounds, four sit within the metastability window of 0 < E hull < 100, while four are unlikely to form stable compounds. The crystal structures of the two compounds identified as stable are shown in Figure 9. By comparison with previous work where similar a workflow was employed, 20 this result provisionally indicates that the additional step of considering compound probabilities based on our oxidation states analysis increases the chance of identifying stable compounds.
The main limitation of the procedure outlined here is that it is based on analysis of known materials with extrapolation to new systems. This assumes that the range of structure types and chemistries found in current materials databases provide a complete basis for materials design. While this is a reasonable starting point, advances in materials synthesis -for example in the area of hybrid organic-inorganic solids -will require adaptations and the development of alternative approaches. We have noted that there are many instances were oxidation states themselves become ill-defined, which often is associated with interesting and important physical behaviour (e.g. superconductivity). Before tackling such challenge cases, we have highlighted 19 that a vast amount of "conventional" materials space remains unexplored.
FIG. 9: Two new stable ternary metal halides predicted using this workflow. a) YZrF 7 consists of vertex sharing irregular polyhedra of YBr 8 (red) and octahedra of ZnBr 6 (green). b) MnZnBr 4 consists of vertex sharing ZnF 4 (orange) and MnF 4 (purple) with both metals in a tetrahedral coordination environment.

III. CONCLUSION
We have performed a statistical analysis of the occurrence of oxidation states in 16,735 inorganic compounds and shown that qualitative trends in keeping with chemical intuition can be extracted from the data. Many of the highest oxidation states of transition metals are only observed in the presence of the most electronegative anions, Oand F -, whilst an absence of these anions are required for many of the lower oxidation states of transition metals. We go on to use the data to construct a model that is applied to inform a highthroughput search for new stable ternary halide materials. The application of the model results in an immediate three-fold reduction in the search space of 54,484 compositions. The search space is reduced to those compositions which are more likely to have known chemically similar compounds as competing phases, such as binary halides, thereby increasing the confidence we have in their calculated stability. Our workflow is able to identify two new stable compounds, YZrF 7 and MnZnBr 4 , using modest computing resources.

A. Dataset
The MP API 22 is used to download the structures of all the compounds that are associated with at least one ICSD entry and with a calculated energy above the hull of < 100 meV/atom. An attempt is made to add oxidation states to all species in each structure using the pymatgen 29 bond valence module (See oxidation state assignment subsection for details). Those compounds for which oxidation states cannot be assigned are discarded.

B. Oxidation state assignment
In order to assign integer numbers of electrons to atoms, the bond order must be determined. This task easily carried out for molecules but not for extended solids. The bond valence (BV) is a quantity similar to bond order that is used instead and, for atoms i and j, is calculated by where d is the distance between the atoms and B is a parameter usually fixed to 0.37. R 0 is the single bond length between the two atoms, although in practice it is a function of the coordination number and oxidation state of the approximated cation for a given approximated anion and is fitted to a set of structures. In the general implementation by Brese and O'Keeffe 23 it is calculated as where r and c are parameters related to the size and electronegativity of the atoms, respectively.
We use the maximum a posteriori (MAP) estimation method to determine oxidation states using the BV approach, as implemented within the pymatgen code 29 with a maximum nearest-neighbour radius of 4Å.

C. Compound design
Using as input the metal species for which we have P SA values, we use the smact package 19 to generate all charge neutral A a B b X x compositions where A and B are d-block metals, X is one of the first four halides, and the stoichiometries a, b, x are integers ≤ 8. For structure prediction, we use the structure substitution algorithm developed by Hautier et al., 26 as implemented in the Pymatgen framework 29 with a probability threshold, σ, of 0.00001. The structure with the highest probability that does not contain more than 40 atoms/unit cell is selected as the candidate compound for a given set of species.

D. Total energy calculations
For calculating E hull , first-principles calculations are carried out using Kohn-Sham DFT with a projector-augmented plane wave basis 30 as implemented in the Vienna Ab-initio Simulation Package (VASP). 31,32 We use the PBEsol exchange-correlation functional 33 and a k-point grid is generated for each calculation with a density of 120Å 3 in the reciprocal lattice. The kinetic-energy cut-off is set at 600 eV and the forces on each atom minimised to below 0.005 eVÅ −1 .
We note that no Hubbard +U parameters have been used in the calculations to correct for the self-interaction error present in the generalised gradient approximation (GGA) for some transition metals. 34,35 The use of GGA+U has been shown the improve stability estimates of ternary oxides, 36 however, in the absence of any reliable U parameters fitted to metal halides, we use GGA for all calculations for consistency.

V. DATA ACCESS STATEMENT
The smact package can be accessed from https://github.com/WMD-group/SMACT. Screening results from these calculations may be reproduced using the Python code available on-line from https://github.com/WMD-group/SMACT/tree/master/examples. Optimised structures are available on-line from https://github.com/WMD-group/Crystal_ structures/tree/master/TM_halides. All other data may be obtained from the authors on request.

VI. ACKNOWLEDGEMENTS
DWD gratefully acknowledges support from the Engineering and Physical Sciences Research Council (EPSRC) via the Centre for Doctoral Training in Sustainable Chemical Technologies (grant no. EP/L016354/1). Calculations were carried out on the Balena HPC cluster at the University of Bath, which is maintained by Bath University Computing Services. AW acknowledges support from the Royal Society and the Leverhulme Trust.