Mapping inorganic crystal chemical space

The combination of elements from the Periodic Table defines a vast chemical space. Only a small fraction of these combinations yields materials that occur naturally or are accessible synthetically. Here, we enumerate binary, ternary, and quaternary element and species combinations to produce an extensive library of over $10^{10}$ stoichiometric inorganic compositions. The unique combinations are vectorised using compositional embedding vectors drawn from a variety of published machine-learning models. Dimensionality reduction techniques are employed to present a two-dimensional representation of inorganic crystal-chemical space, in which each composition is labelled according to whether it passes standard chemical filters and whether it appears in known materials databases.


Introduction
The fundamental building blocks of materials are the chemical elements of the Periodic Table. Depending on the choice of elements and the interactions between them, the resulting material may be stable or unstable; crystalline or amorphous; insulating or conducting. The principles connecting chemical composition, crystal structure, and physical properties of materials remain a subject of longstanding interest 1,2 and ongoing study. Materials informatics has emerged as an important subject at the interface of traditional materials science and data science 3. It uses informatics techniques to understand, design, and discover materials. The underlying materials data may be drawn from experimental investigations (e.g. crystal structure databases populated from X-ray or neutron diffraction measurements) or from computer simulations (e.g. structure-property databases based on density functional theory calculations).
In this study, we consider the cartography of inorganic crystal chemical space. Specifically, we address the combination of 2-4 elements to form stoichiometric inorganic compounds. This builds upon our earlier work 4 by featurising each chemical composition using embedding vectors from machine learning models and labelling the entries to probe the distribution of known and unknown materials. The resulting hyperspace is reduced to two dimensions to produce visual representations that show hints of the innate separation between allowed and forbidden compounds.

Chemical enumeration
As the number of chemical components increases, there is a combinatorial explosion in the number of possible compounds. We previously reported the code Semiconducting Materials from Analogy and Chemical Theory (SMACT) to enable rapid screening over such large configurational spaces 5. This was inspired by early work on the exploration of new semiconducting materials based on electron counting principles 6. The Python library features element and species classes, with integrated iteration tools and adjustable chemical filters.
We can make the combinatorial space of multi-component compounds more tractable by introducing chemical constraints. We choose to work with the first 103 elements of the Periodic Table (from H to Lr). This pool of atomic building blocks is expanded into 421 species when the accessible oxidation states are considered. For instance, Fe(II) and Fe(III) are both formed from the element Fe, but exhibit distinct physicochemical properties, as seen in the black pigment Fe(II)O and the red antiferromagnet Fe(III)2O3.
We consider the set of binary ($A_wB_x$), ternary ($A_wB_xC_y$) and quaternary ($A_wB_xC_yD_z$) combinations where the stoichiometric factors $w, x, y, z < 9$. This approach yields a total of 225,879 unique compounds for binary combinations, 77,637,589 for ternary combinations, and 16,902,534,325 for quaternary combinations. We ensure that combinations with equivalent stoichiometry, such as MgO and Mg2O2, are counted only once in our analysis.
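The size of this enumeration can be checked with a short counting sketch. It assumes, as our own reading rather than a statement from the SMACT code, that each total is the number of unordered element combinations drawn from 103 elements multiplied by the number of stoichiometric ratios with coefficients from 1 to 8 that cannot be reduced further (overall greatest common divisor of 1). The function names are hypothetical. Under that assumption the three totals quoted above are reproduced.

```python
from functools import reduce
from itertools import product
from math import comb, gcd


def count_reduced_ratios(n_components: int, max_coeff: int = 8) -> int:
    """Count coefficient tuples in 1..max_coeff whose overall gcd is 1.

    Tuples such as (2, 2) are skipped because they reduce to (1, 1),
    mirroring the MgO / Mg2O2 equivalence described in the text.
    """
    coefficients = range(1, max_coeff + 1)
    return sum(
        1
        for ratio in product(coefficients, repeat=n_components)
        if reduce(gcd, ratio) == 1
    )


def count_compositions(n_elements: int, n_components: int, max_coeff: int = 8) -> int:
    """Unordered element combinations multiplied by irreducible ratios."""
    return comb(n_elements, n_components) * count_reduced_ratios(n_components, max_coeff)


for n_components in (2, 3, 4):
    print(n_components, count_compositions(103, n_components))
```

The brute-force loop visits at most 8^4 = 4096 ratios, so no cleverer number-theoretic counting is needed at this scale.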
We apply established chemical filters to distinguish between plausible ("allowed") and implausible ("forbidden") inorganic stoichiometries. The first filter is charge-neutrality, based on the sum of the formal charge ($q$) of each species:

$$ w q_A + x q_B + y q_C + z q_D = 0 $$

This chemical filter can be framed equivalently in terms of electron counting or valency, which is common in the study of semiconductors 6,7. The filter applies to a broad range of inorganic materials, particularly those classified as being formed of ionic and covalent interatomic interactions, where the charge neutrality principle holds. However, it may not be suitable for describing metallic alloys (e.g. Cu$_{1-x}$Zn$_x$), intermetallic compounds (e.g. Ni3Al), and nonstoichiometric compounds (e.g. YBa2Cu3O$_{7-\delta}$), as these materials often involve different chemical bonding and variable compositions, with electron counting rules that go beyond the scope of simple charge neutrality considerations. Special consideration would also be required for mixed-valence compounds, where a single element appears in the same compound with multiple oxidation states. For example, the binary compound magnetite Fe3O4 contains both Fe(II) and Fe(III) ions, and thus would be described as a ternary compound by SMACT based on its three distinct constituent species.
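As an illustration of the charge-neutrality filter, the following minimal sketch searches for irreducible stoichiometries whose formal charges sum to zero. It is written independently of SMACT, which provides its own screening routines; the function name `find_neutral_ratios` and the brute-force search are assumptions made for this example only.

```python
from functools import reduce
from itertools import product
from math import gcd


def find_neutral_ratios(charges, max_coeff=8):
    """Return irreducible stoichiometries for which the formal charges sum to zero.

    charges: formal charge q of each species, e.g. (3, -2) for Fe(III) and O(-II).
    """
    coefficients = range(1, max_coeff + 1)
    return [
        ratio
        for ratio in product(coefficients, repeat=len(charges))
        if sum(q * n for q, n in zip(charges, ratio)) == 0
        and reduce(gcd, ratio) == 1  # skip multiples such as (4, 6)
    ]


print(find_neutral_ratios((3, -2)))  # Fe(III) with O(-II)
print(find_neutral_ratios((2, -2)))  # Zn(II) with O(-II)
```

For Fe(III) and O(-II) the only surviving ratio is 2:3, corresponding to Fe2O3; a species combination with no neutral ratio in range returns an empty list and would be labelled forbidden by this filter.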
A second filter is the electronegativity balance, which requires that the most electronegative ion has the most negative charge in the compound. Using the Pauling electronegativity scale 8, $\chi_{\text{anion}} - \chi_{\text{cation}} > 0$. For example, the pnictide semiconductor GaSb is allowed by this filter ($\chi_{\text{anion}}(\text{Sb}) - \chi_{\text{cation}}(\text{Ga}) = 0.24$), while the oxide catalyst Sb2O3 is also allowed, where Sb is the cation ($\chi_{\text{anion}}(\text{O}) - \chi_{\text{cation}}(\text{Sb}) = 1.39$). This filter helps distinguish between allowed and forbidden inorganic stoichiometries based on electronegativity considerations, ensuring that the composition contains sensible combinations of species.
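The electronegativity balance can be sketched in a few lines. This is a simplified stand-in rather than the SMACT implementation: the small `PAULING` table holds only the Pauling values needed for the examples in the text, and the function name and the decision to reject inputs lacking a cation or an anion are assumptions of this sketch.

```python
# Pauling electronegativities for the elements used in the examples above.
PAULING = {"Ga": 1.81, "Sb": 2.05, "O": 3.44}


def passes_electronegativity_balance(species):
    """Check that every anion is more electronegative than every cation.

    species: list of (element symbol, oxidation state) pairs.
    """
    cations = [PAULING[symbol] for symbol, charge in species if charge > 0]
    anions = [PAULING[symbol] for symbol, charge in species if charge < 0]
    if not cations or not anions:
        return False  # assumption: the test needs at least one of each
    return max(cations) < min(anions)


print(passes_electronegativity_balance([("Ga", 3), ("Sb", -3)]))  # GaSb
print(passes_electronegativity_balance([("Sb", 3), ("O", -2)]))   # Sb2O3
print(passes_electronegativity_balance([("Sb", 3), ("Ga", -3)]))  # reversed roles
```

The third call assigns the charges the wrong way round for illustration, so the more electronegative Sb is the cation and the check fails.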
Each chemical composition can be assigned a label {'allowed', 'forbidden'} according to whether it passes these chemical filters for inorganic compounds. Each can also be labelled as {'known', 'unknown'} according to the presence of that composition in the Materials Project (MP) 9 database. The entries considered here were retrieved via the MP API (v2023.11.1) using an anonymised formula notation (e.g. AB2), ensuring a consistent approach to formula representation. We can then categorise each enumerated composition according to its combination of labels: standard {'allowed', 'known'}; missing {'allowed', 'unknown'}; interesting {'forbidden', 'known'}; and unlikely {'forbidden', 'unknown'}.
Table 1. Chemical compositions are labelled "standard", "missing", "interesting", or "unlikely" according to whether they pass the chemical filters implemented in SMACT and their presence in the Materials Project database. Examples are provided for metal oxides in the Li-Zn-Sn-O chemical space.
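The four-way labelling reduces to a lookup on two booleans. A minimal sketch follows; the function name `categorise` is hypothetical and the two flags are assumed to have been computed beforehand by the chemical filters and the database query.

```python
def categorise(allowed: bool, known: bool) -> str:
    """Combine the filter outcome and database presence into one label."""
    if allowed:
        return "standard" if known else "missing"
    return "interesting" if known else "unlikely"


# All four combinations of the two labels.
for allowed in (True, False):
    for known in (True, False):
        print(allowed, known, categorise(allowed, known))
```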

Chemical filter      Allowed    Allowed    Forbidden     Forbidden
Materials Project    Known      Unknown    Known         Unknown
Label                Standard   Missing    Interesting   Unlikely

This pattern is more extreme in ternary compounds, where only 0.03% are standard, and the number of interesting compounds is negligible. The quaternary compounds continue this trend, with a mere 0.00% being standard or interesting, and 82.8% falling into the unlikely category. Significantly, the data reveal an increase in compounds missing from the MP database across the complexity spectrum: 4.4% in binary, 13.9% in ternary, and 17.2% in quaternary. This escalation may suggest that, as the complexity of the compounds grows, the probability of their synthesis, or of their identification as novel stable crystalline materials, decreases. Concurrently, the potential for discovering new crystalline materials increases, as evidenced by the larger missing category in higher-order compounds. The statistics hint at unexplored territories in materials science, particularly for ternary and quaternary compounds.

A Periodic Table including the elements that commonly appear in binary compounds allowable by the chemical filters is shown in Figure 1. Elements with a greater number of oxidation states (accessible species) are more abundant. Among the non-metallic elements, carbon (C), nitrogen (N), oxygen (O), silicon (Si), phosphorus (P), sulfur (S), chlorine (Cl), and germanium (Ge) are notable for their multiple accessible oxidation states. This enables them to participate in a diverse range of chemical compositions while maintaining charge neutrality. The same is true for transition metals such as chromium (Cr), manganese (Mn) and iron (Fe). Furthermore, elements with high electronegativity values, such as fluorine (F) and oxygen (O), are also favoured by the filters. F, with an electronegativity of 3.98, and O, with 3.44, despite having only one and two negative oxidation states, respectively, are likely to pass the SMACT filters as stable anions. In summary, elements with either many oxidation states or high electronegativity are more likely to form a larger number of allowed inorganic compounds.

Materials embedding vectors
An integer representation of elements, in terms of atomic number, is straightforward and intuitive for human chemists to learn. However, machine learning models benefit from the descriptive power of a higher dimensional representation, often in the form of continuous element vectors $V_i$. To effectively represent elements, several types of element embedding have been developed. The Magpie 10 representation, for instance, incorporates diverse element properties such as atomic weights, electronegativity, and melting temperature. The Oliynyk 11 embedding comprises chemical descriptor vectors derived from elemental properties. Mat2vec 12, utilising natural language processing (NLP) techniques, learned material representations from an extensive text corpus, capturing the context and relationships of different materials as mentioned in scientific literature. This method effectively leverages unstructured textual data to enhance understanding of material properties. Skipatom 13 learned representations by predicting the surrounding atomic environment of a target atom based on structural information. It emphasises capturing the local chemical environments and their impact on material properties. Megnet16 14 utilises graph neural networks, where the embedding is based on graph attributes that include whole graph information. This method employs the weights of the neural networks to predict the formation energy of crystalline materials, treating the atomic structure of materials as a graph with detailed node and edge representations. We employ the Python package ElementEmbeddings 15 to compile the various embeddings.
To make compositional embeddings from element embeddings for compounds, a weighted sum of the constituent element embeddings is performed, i.e.

$$ V_{\text{Composition}} = w V_A + x V_B + y V_C + z V_D $$
This step is implemented as the CompositionalEmbedding function in the ElementEmbeddings package.
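The weighted sum can be illustrated without the library. The sketch below does not call ElementEmbeddings; the three-dimensional element vectors are invented toy values and the function name is hypothetical. Whether the library applies any further normalisation is not assumed here.

```python
def composition_embedding(composition, element_vectors):
    """Sum the element vectors, each weighted by its stoichiometric factor.

    composition: mapping of element symbol to stoichiometric factor.
    element_vectors: mapping of element symbol to an embedding vector.
    """
    dimension = len(next(iter(element_vectors.values())))
    total = [0.0] * dimension
    for symbol, amount in composition.items():
        for i, value in enumerate(element_vectors[symbol]):
            total[i] += amount * value
    return total


# Toy vectors for illustration only; real embeddings have 16 to 200 dimensions.
toy_vectors = {"Fe": [1.0, 0.0, 0.5], "O": [0.0, 1.0, 0.25]}
print(composition_embedding({"Fe": 2, "O": 3}, toy_vectors))  # [2.0, 3.0, 1.75]
```

Because the result is a plain sum over elements, two compositions built from the same elements differ only through their stoichiometric weights.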

Dimensionality reduction
To systematically map the inorganic crystalline chemical space in two dimensions, we utilise three primary dimensionality reduction techniques. These are Principal Component Analysis (PCA) 16 and t-distributed Stochastic Neighbour Embedding (t-SNE) 17, both implemented using the sklearn library 18, as well as Uniform Manifold Approximation and Projection (UMAP) 19, which is implemented using the UMAP Python library.
We consider five distinct element embeddings: Magpie, Mat2vec, Megnet16, Skipatom, and Oliynyk. A random embedding of 200 dimensions is used to act as a control with no embedded chemical information, while still providing a unique representation for each element. The dimensionality of the embedding vectors is 22, 200, 16, 44, and 200 for Magpie, Mat2vec, Megnet16, Oliynyk, and Skipatom, respectively. For a comprehensive analysis, 3000 data points were randomly selected for each of the four categories: standard, missing, interesting, and unlikely. These data points are transformed into two-dimensional vectors using the specified dimensionality reduction methods. The resulting embeddings are visually represented for binary, ternary, and quaternary compounds in Figures 2, 3 and 4, respectively.
For binary compounds, the distribution patterns of embedding vectors reveal distinct characteristics across different element embeddings. Vectors derived from Mat2vec, Skipatom, and Random element embeddings exhibit a dispersed distribution across the reduced space. In contrast, the embeddings generated using Magpie and Oliynyk show a more concentrated, clustered configuration. Figure 5 captures this phenomenon, presenting the reduced embedding vectors for binary compounds, consistent with those in Figure 2, but classified by compound type according to the anion present, such as pnictides, halides, chalcogenides, and oxides. Notably, the observed clustering patterns with the Mat2vec, Skipatom, and Random embeddings indicate a pronounced tendency for these vectors to group according to specific types. For instance, the oxide binary compounds (marked as green points) form isolated clusters. Such a tendency suggests that atom types play a significant role in the construction of compositional embeddings, which are derived from a weighted sum of individual element embeddings. On the other hand, the Magpie and Oliynyk embeddings, formulated based on a variety of atomic properties, indicate the presence of influential atomistic characteristics that extend beyond merely the types of atom species.
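The per-category sampling step can be sketched as follows. The text does not state how the random selection was seeded or what happens if a category holds fewer than 3000 entries, so the fixed seed and the fall-back to the full category are assumptions of this example, as is the function name.

```python
import random


def sample_per_category(labelled, n_per_category=3000, seed=0):
    """Draw up to n_per_category compositions at random from each category.

    labelled: iterable of (composition, category) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_category = {}
    for composition, category in labelled:
        by_category.setdefault(category, []).append(composition)
    return {
        category: rng.sample(items, min(n_per_category, len(items)))
        for category, items in by_category.items()
    }


# Small demonstration with placeholder composition names.
demo = [(f"{label}-{i}", label) for label in ("standard", "missing") for i in range(10)]
subset = sample_per_category(demo, n_per_category=4)
print({category: len(items) for category, items in subset.items()})
```

Sampling an equal number from each category keeps the sparsely populated categories visible in the reduced plots, rather than letting the dominant unlikely category swamp them.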
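Grouping binary compounds by anion type amounts to a lookup on the anion element. The sketch below is illustrative: the family names come from the text, but the element membership of each family is our assumption, since the text does not list it, and oxides are kept separate from the other chalcogenides to match the four classes named above.

```python
# Assumed family membership; oxygen is kept apart from the heavier chalcogens.
ANION_FAMILIES = {
    "oxide": {"O"},
    "chalcogenide": {"S", "Se", "Te"},
    "halide": {"F", "Cl", "Br", "I"},
    "pnictide": {"N", "P", "As", "Sb", "Bi"},
}


def anion_family(anion_symbol: str) -> str:
    """Return the compound family implied by the anion element."""
    for family, members in ANION_FAMILIES.items():
        if anion_symbol in members:
            return family
    return "other"


for symbol in ("O", "S", "Cl", "Sb", "C"):
    print(symbol, anion_family(symbol))
```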

Faraday Discussions Accepted Manuscript
The PCA plots of Mat2vec, Oliynyk, and Megnet16 in Figure 2 exhibit a separation of interesting from standard and missing compositions. This segregation indicates that interesting compounds, which are known stable materials yet excluded by chemical filters, possess unique and distinct characteristics that set them apart from other categories. This is expected as large families of metallic alloys and intermetallic compounds fall into this category.
For ternary systems, a similar trend is observed where standard materials demarcate themselves from those that are interesting and missing. This is particularly evident in the Mat2vec, Megnet16, and Oliynyk embeddings in Figure 3. It is worth highlighting that the standard and missing materials have a separate distribution in the quaternary space of Figure 4. It hints that navigating missing materials could unveil unexplored regions of the chemical space, potentially leading to the discovery of synthesisable materials with unique properties and applications. Indeed, the high fraction of empty space that exists for multi-component compounds has recently been exploited in a large-scale computational screening study that identified 2.2 million plausible inorganic crystals 20 and offers a fertile playground for generative machine learning models 21-24.
Overall, the degree of clustering across categories escalates from binary to quaternary systems with the increasing order of complexity and chemical diversity inherent in higher-order compounds. As we transition to quaternary compounds, the distinct characteristics of each class become more salient. It is worth highlighting that the dimension reduction, resulting from the unsupervised learning algorithms, demonstrates cohesive clustering that corresponds to our classification, with a striking clustering of the standard and missing compositions.

Conclusions
We have explored the vast expanse of inorganic crystal space, encompassing an array of $10^{10}$ binary, ternary, and quaternary compounds. While the uncharted space may be considered infinite, we tame it by introducing chemical constraints in the form of filters and limits on stoichiometric combinations. We label the resulting entries as standard, missing, interesting and unlikely, according to whether they pass these filters and whether they are present in the Materials Project database. This separates the proportion of discovered compounds that conform to standard chemical rules to form stable inorganic solids. Furthermore, we have visualised the inorganic crystal chemical space through the lens of these two filters, revealing that higher-order compounds exhibit pronounced distinctive characteristics. It hints that navigating complex spaces could unlock materials with novel properties in unexplored regions, offering new avenues for scientific exploration. The study thus serves as a foundational reference for future endeavours in data-driven materials discovery, emphasising the potential of unknown regions within the chemical space.

Data access statement
This study used several open-access tools, including the SMACT (https://github.com/WMD-group/SMACT) and ElementEmbeddings (https://github.com/WMD-group/ElementEmbeddings) packages. The associated scripts (or notebooks) to generate the plots in this paper are available in the SMACT examples directory. Interactive plots for binary combinations can be generated using CrystalSpace (https://github.com/WMD-group/CrystalSpace).

Conflicts of Interest
There are no conflicts to declare.

Figure 1. Periodic Table including s, p and d-block elements commonly found in binary ($A_wB_x$) compounds that are allowed by the chemical filters implemented in SMACT. The number below each element indicates its frequency of occurrence.
Oliynyk 11 embedding comprises chemical descriptor vectors derived from elemental properties. Mat2vec 12, utilising natural language processing (NLP) techniques, learned material representations from an extensive text corpus, capturing the context and relationships of

Open Access Article. Published on 16 April 2024. This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. DOI: 10.1039/D4FD00063C

Figure 2. Visualisation of embedding vectors for the space of binary compounds with six element embeddings across PCA, t-SNE, and UMAP dimension reduction methods. The data points are colour-coded to indicate the four categories of composition: standard (blue), missing (red), interesting (green), and unlikely (grey).
Binary (Zn-O) … Quaternary (Li-Zn-Sn-O) LiZnSnO …

Examples of chemical compositions generated from the screening procedure are given in Table 1. The metal oxide examples include binary (Zn-O), ternary (Li-Zn-O), and quaternary (Li-Zn-Sn-O) systems. Seven compounds, including ZnO, are present in the MP database within our stoichiometry limits. In the binary system, Zn(II)O and Zn(II)O2 are classified as standard, given that Zn can exhibit an oxidation state of +2 with electronegativity 1.65, and oxygen has oxidation states of -2 (oxide) and -1 (peroxide) with an electronegativity of 3.44. Interestingly, Zn2O passes the chemical filter with the less common +1 oxidation state of Zn (often associated with the presence of Zn-Zn bonds) but is not found in the MP database, so it is identified as missing. In the ternary Li-Zn-O system, LiZnO2 and Li6ZnO4 are standard materials, while various missing and unlikely compounds are identified.

Among binary, ternary, and quaternary compounds, 13,464, 10,779,441, and 2,909,434,982 compounds, respectively, passed the chemical filter. Within the MP database, there are 9,981 binary, 36,866 ternary, and 17,417 quaternary compounds identified. For binary compounds with a total of 225,879 unique combinations, 3,627 (1.6%) are standard, 9,837 (4.4%) are missing, 6,354 (2.8%) are interesting, and the vast majority, 206,061 (91.2%), are deemed unlikely to be formed. Even for the simple case of combining two components, the compositional space is sparsely populated.
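The four category sizes follow from simple set arithmetic once the total, the number passing the filters, the number in the database, and their overlap are known. A worked sketch using the binary figures quoted above (the function name is hypothetical):

```python
def category_counts(total, allowed, known, standard):
    """Split an enumeration into the four categories.

    total: all enumerated compositions.
    allowed: compositions passing the chemical filters.
    known: compositions present in the database.
    standard: compositions that are both allowed and known.
    """
    missing = allowed - standard        # allowed but not in the database
    interesting = known - standard      # in the database but forbidden
    unlikely = total - allowed - interesting
    return {
        "standard": standard,
        "missing": missing,
        "interesting": interesting,
        "unlikely": unlikely,
    }


binary = category_counts(total=225_879, allowed=13_464, known=9_981, standard=3_627)
print(binary)
```

The four values sum to the total by construction, which provides a quick consistency check on the reported statistics.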

Table 2. Number of binary, ternary, and quaternary compounds based on enumeration and chemical filtering of 421 chemical species in SMACT and their presence in the Materials Project database.
