Element similarity in high-dimensional materials representations

The traditional display of elements in the periodic table is convenient for the study of chemistry and physics. However, the atomic number alone is insufficient for training statistical machine learning models to describe and extract composition-structure-property relationships. Here, we assess the similarity and correlations contained within high-dimensional local and distributed representations of the chemical elements, as implemented in an open-source Python package ElementEmbeddings. These include element vectors of up to 200 dimensions derived from known physical properties, crystal structure analysis, natural language processing, and deep learning models. A range of distance measures are compared and a clustering of elements into familiar groups is found using dimensionality reduction techniques. The cosine similarity is used to assess the utility of these metrics for crystal structure prediction, showing that they can outperform the traditional radius ratio rules for the structural classification of AB binary solids.


I. INTRODUCTION
The periodic table offers an effective description of the elements in order of increasing atomic number. Its true power comes from the latent information that it contains. Chemists are educated to recall periodic trends in electronic configuration, atomic radius, electronegativity, accessible oxidation states, and related characteristics.[6][7] A critical factor in the performance of machine learning (ML) models for chemical systems is the representation of the constituent elements. The atomic number of an element can be augmented or replaced by a vector that may be built directly from standard data tables, trained from chemical datasets using a machine learning model, or even generated from random numbers. Such representations can be categorised as local (vector components with specific meaning) or distributed (vector components learned from training data).[10][11] Perhaps the simplest local representation is one-hot encoding, where a binary n-dimensional vector v is used to categorise the atomic number of the element, e.g. H can be represented as 1000... and He as 0100... . A single component is 'hot' for each element, thus providing an orthogonal and sparse description. A selection of other common representations from the literature is given in Table I.
a) Electronic mail: k.butler@qmul.ac.uk
b) Electronic mail: a.walsh@imperial.ac.uk
TABLE I. Summary of the element vector representations discussed in this work.
In this study, we are interested in the latent chemical information that can be distilled from such high-dimensional element representations. We consider the fundamental concept of element similarity, which can be defined here as the distance or correlation between elemental vectors. We explore various metrics and then apply them to data-driven structure classification for the case of binary solids. The underlying tools have been combined into an open-source and modular Python package, ElementEmbeddings, to support future investigations.

A. Element representations
We consider four vector representations of the chemical elements in the main text, but cover all seven mentioned in Table I in the final section for applications to crystal structure prediction, with additional analysis provided as Electronic Supplementary Information (ESI). The aim here is not to be exhaustive but to cover a set of distinct approaches that have been developed for chemical models. The analysis is performed on elements 1 (H) to 83 (Bi), as higher atomic number elements are not covered in all representation schemes. For SkipAtom, only 80 elements are considered as the noble gases Ar, He and Ne are not contained within the representation. The source of the training data for these vectors was the Materials Project, which is largely focused on inorganic crystals.
The Magpie12 representation is a 22-dimensional vector. It is a local representation where the vector components have specific meaning, as they are built from elemental properties including atomic number, effective radii, and the row of the periodic table. The Mat2vec14 representation is a 200-dimensional distributed representation built from unsupervised word embeddings18 of over 3 million abstracts of publications between 1922 and 2018. In contrast, the atomic weights from a crystal graph convolutional neural network trained to predict the formation energies of crystalline materials are used to generate the 16-dimensional MEGnet15 representation. The Random200 representation is simply a 200-dimensional vector generated randomly for each element, employed here as a control measure. Each vector component is generated from the standard normal distribution, N(0, 1).
The actual vectors were collected from various sources: the Magpie, Oliynyk and Mat2Vec representations were obtained as csv files from the cbfv repository19; the MatScholar and MEGnet16 were obtained from the lrcfmd/elmd repository20; the SkipAtom embeddings were obtained from the lantunes/skipatom repository; numpy21 was used to generate the Random200 vectors. We found that the original Oliynyk csv file had 4 columns with missing values: Miracle Radius [pm]; crystal radius; MB electronegativity; Mulliken EN. For Miracle Radius [pm], we used the mode to impute the missing values, and for the other 3 columns, we used k-nearest-neighbour (KNN) imputation with the default parameters in scikit-learn22. The choice of imputation was such that the overall distribution was preserved. All embedding vectors used in this work have been standardised prior to analysis.
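The imputation and standardisation steps described above can be sketched with scikit-learn. The array below is invented toy data standing in for the Oliynyk feature table; the column assignments and shapes are illustrative only, not the actual files used in this work.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Oliynyk feature table (values and shape are illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
X[2, 0] = np.nan  # a missing entry in the mode-imputed column
X[5, 1] = np.nan  # missing entries in the remaining columns
X[7, 3] = np.nan

# Mode (most-frequent) imputation for the first column
X[:, [0]] = SimpleImputer(strategy="most_frequent").fit_transform(X[:, [0]])

# KNN imputation with scikit-learn defaults (k=5) for the other columns
X[:, 1:] = KNNImputer().fit_transform(X[:, 1:])

# Standardise each feature to zero mean and unit variance
X = StandardScaler().fit_transform(X)
```

KNN imputation fills each missing entry from the rows most similar in the observed columns, which tends to preserve the overall feature distribution better than a global constant.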

B. Similarity measures
The distance between two vectors depends on the choice of measure in n-dimensional space. We assess the pairwise distances between element representations A and B. The Minkowski distance is a metric in the normed vector space, which is a generalisation of the common distance metrics Euclidean, Manhattan and Chebyshev:

d(A, B) = (Σ_i |A_i − B_i|^p)^(1/p)

Those three distance metrics can be derived from the Minkowski distance by appropriately choosing the exponent p.
For p = 2, we obtain the Euclidean (or L2) distance, which is the length of a line segment connecting A and B:

d_E(A, B) = (Σ_i (A_i − B_i)^2)^(1/2)

For p = 1, the Manhattan (or L1) distance is obtained, which can be defined from a sum of the absolute differences in each dimension:

d_M(A, B) = Σ_i |A_i − B_i|

In contrast, the Chebyshev distance is obtained from the limiting case of p → ∞ and takes account of the greatest one-dimensional separation across the n-dimensional space:

d_C(A, B) = max_i |A_i − B_i|

Taking the example of the separation between the elements Li and K in the Magpie representation, d_E = 4.09, d_M = 7.87 and d_C = 3.39, which shows the typical variation in absolute values. A larger difference between Li and Bi, expected due to their placement in the periodic table, is found with d_E = 9.85, d_M = 37.74 and d_C = 3.55. For completeness, the Wasserstein metric (earth mover's distance), which has been adapted for materials problems,23,24 is also included as a function in ElementEmbeddings and shown in Figure S5. Element separations are plotted for Euclidean and Manhattan distance in Figures 1 and 2, with other measures shown in the ESI. The elements are ordered in increasing atomic number along the x-axis and decreasing atomic number along the y-axis. This cuts across the groups in the periodic table. The leading diagonals in the distance plots are zero-valued as they correspond to d(A, A). The lighter blues correspond to elements whose vector representations are close to each other within the chosen metric space. These elements can be interpreted as similar to each other. Stripes are seen for the noble gas elements, such as Kr and Xe, which are very different from the neighbouring halogens and alkali metals. On a visual basis, the global structure of the heatmaps appears similar for the Euclidean and Manhattan distances, with the main difference being the absolute scale of the distances. Less structure is seen for the Random200 vectors, as expected for this control representation.
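The three special cases of the Minkowski distance can be computed in a few lines of numpy. The two vectors below are invented placeholders, not actual Magpie components.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 Euclidean, p=inf Chebyshev."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

# Illustrative (not actual Magpie) element vectors
li = np.array([3.0, 1.52, 0.98])
k = np.array([19.0, 2.27, 0.82])

d_M = minkowski(li, k, 1)       # Manhattan (L1)
d_E = minkowski(li, k, 2)       # Euclidean (L2)
d_C = minkowski(li, k, np.inf)  # Chebyshev (L-infinity)
# The ordering d_C <= d_E <= d_M holds for any pair of vectors.
```

The ordering comment follows from the standard norm inequalities, which is why the Euclidean and Manhattan heatmaps share the same global structure but differ in absolute scale.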
Alternatively, we can consider the angle between vectors using the cosine similarity based on the dot product:

cos(θ) = A · B / (||A|| ||B||)

For the case of Li and K, cos(θ) = 0.738 for Magpie and -0.095 for Mat2vec. These change to -0.603 and -0.001, respectively, for the Li and Bi pair. The pairwise cosine similarities for the four chosen representations are shown in Figure 3. The Pearson correlation coefficient provides a measure of the linear correlation:

ρ_{A,B} = cov(A, B) / (σ_A σ_B)

where the numerator and denominator refer to the covariance and standard deviations, respectively. For the same case of Li and K (Bi), ρ_{Li,K} = 0.717 (-0.533) for Magpie and -0.094 (0.005) for Mat2vec. The Pearson correlation between each element pair is plotted in Figure 4. We note that the cosine similarity is scale-invariant as it only depends on the angles between vectors. Some elemental representation schemes may be sensitive to bias in the training data, such as an abundance of certain metal oxides, that produce outliers in vector components. Therefore, we use cosine similarity in later sections.
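Both measures above are short numpy expressions, and the Pearson correlation is simply the cosine similarity of mean-centred vectors, which makes the relationship between the two metrics explicit. The vectors here are illustrative values only.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = A.B / (|A||B|); invariant to rescaling either vector."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    return cosine_similarity(a - a.mean(), b - b.mean())

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))  # ~1.0: rescaling leaves the angle unchanged
print(pearson(a, a + 5.0))          # ~1.0: a constant shift does not affect correlation
```

The scale-invariance shown in the first print is the property that motivates using cosine similarity when training-data bias may inflate individual vector components.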

C. Periodic trends
Beyond understanding the pairwise connection between elements, we can go deeper to investigate how the elements are distributed across the n dimensions in each representation. For this, we use dimensionality reduction techniques based on unsupervised machine learning analysis. These two-dimensional plots enable intuitive interpretations of the elemental representations and aid in determining the connection to standard elemental groupings.
The first method is principal component analysis (PCA). Here, two principal component axes are defined using a linear transformation of the original features that gives the greatest variance in the vector components. The PCA, generated using scikit-learn22, is shown in Figure 5 with each data point coloured by the group in the periodic table. The second approach is t-distributed stochastic neighbour embedding (t-SNE). Unlike PCA, this algorithm is a nonlinear dimensionality reduction technique that can better separate data which is not linearly separable. Here, a probability distribution is generated to represent the similarities between neighbouring points in the original high-dimensional space, and a similar distribution with the same number of points is found in a lower-dimensional space. The t-SNE, also generated using scikit-learn22, is shown in Figure 6 with each data point coloured by their group in the periodic table.
We observe that the element representations, with the exception of the random vectors, possess an insightful structure in the reduced dimensions, Figures 5 and 6. The lanthanoid elements cluster together in the non-random representations, independent of the choice of dimension reduction technique. In most of the representations Sr, Ba, Ca tend to group closely together, which reflects their common application in substitutional mixtures, for example in tuning ferroelectric solid solutions. Interestingly, the learned, distributed representations pick up some similarities which are obvious to a trained chemist but are not captured in the local Magpie representation, such as the similarity between Bi and Sb. In the Magpie representation, H tends to be considered more of an odd-one-out element, at the periphery of the distributions, whereas in the distributed representations it tends to be clustered with other elements, reflecting how it has been observed in training data from crystals such as HF and LiH.

D. Application to crystal structure prediction
We have established that chemical correlations are found within the various elemental representations. The next question is whether they can be useful beyond their original purpose. We consider a simple classification case in crystal structure prediction, a research topic of widespread importance in computational chemistry.4,25,26 The radius ratio rules were developed to rationalise the local coordination and crystal structure preferences of ionic solids.27 In this model, the coordination number of a cation is determined by the balance between the electrostatic attraction (cation-anion interactions) and repulsion (anion-anion interactions). A geometric analysis predicts that 8-fold (cubic) coordination should be obtained when the radius ratio ρ = r_cation/r_anion falls in the range 0.732 - 1.000. A 6-fold coordination environment is predicted for 0.414 < ρ < 0.732, while 4-fold coordination is predicted for 0.225 < ρ < 0.414. For binary AB solids, these regimes are typified by the CsCl (8-fold), rocksalt (6-fold), or zinc blende/wurtzite (4-fold) structures. While it is accepted that there are many cases where these rules fail, especially in the lower radius ratio regime,28 they are still commonly taught in undergraduate programs due to their instructive nature.
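The radius ratio windows above translate directly into a small classifier, which is the baseline the element representations are compared against. The Na+/Cl- radii in the example are standard Shannon values used for illustration.

```python
def radius_ratio_coordination(r_cation, r_anion):
    """Predicted coordination number from the classical radius ratio rules."""
    rho = r_cation / r_anion
    if 0.732 <= rho <= 1.0:
        return 8   # CsCl structure type
    if 0.414 <= rho < 0.732:
        return 6   # rocksalt
    if 0.225 <= rho < 0.414:
        return 4   # zinc blende / wurtzite
    return None    # outside the range covered by the rules

# Na+ (1.02 A) with Cl- (1.81 A): rho ~ 0.56, predicting 6-fold coordination
print(radius_ratio_coordination(1.02, 1.81))  # 6
```

NaCl does adopt the rocksalt structure, but as discussed above, the rules misclassify a substantial fraction of compounds, particularly at low radius ratios.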
In this approach,31,32 the likelihood that a new chemical composition (X) will adopt the crystal structure of a known chemical composition (X′) depends on the substitution probability function p(X, X′). The original pairwise substitution weights were learned from a training set of inorganic materials from the Inorganic Crystal Structure Database.33 However, we instead use the cosine similarity between element representations, i.e. we make the assumption that the preferred crystal structure is the one that maximises cos(X, X′).
Unary substitutions are considered here, i.e. where two compositions differ by one element. This allows us to approximate the probability function as p(X, X′) = e^λ / Z, where Z is the partition function and λ is the metric for chemical similarity. These are the pairwise substitution weights in the original model.29 In the SMACT implementation, these can be a user-defined pairwise metric for similarity, which here is defined as cos(X, X′). A related procedure has been employed by Wang et al. to predict new stable compounds,34,35 and an extension based on metric learning has been reported by Kusaba et al.36 To obtain a set of binary AB solids that adopt one of the four structure types as their ground-state structure, we queried the Materials Project (version: 2022.10.28)37 using pymatgen38. The query was carried out using the parameters: formula=*1*1; theoretical=False; is_metal=False. This query returned 494 binary AB solids. We chose to exclude metallic materials to focus on compositions where the bonding should be heteropolar. Some of the materials in this dataset contained polymorphs of the same composition. For example, 83 ZnS entries were returned. The data was filtered by keeping only the polymorph of a composition with the lowest energy above the convex hull, as an approximation for relative stability. This filter reduced the dataset from 494 materials to 233. The query data was further filtered by matching the structures to one of the four aforementioned structure types using the structure matcher module in pymatgen38 with the default parameters.
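The template-selection step, ranking known compositions by cosine similarity to the substituted element, can be sketched as follows. The embeddings and structure labels below are invented toy data, not the actual representations or Materials Project entries, and `predict_structure` is a hypothetical helper, not the SMACT API.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical element embeddings and known AB structure types (toy data)
embeddings = {
    "Na": np.array([0.9, 0.1, 0.2]),
    "K":  np.array([0.8, 0.2, 0.1]),
    "Zn": np.array([0.1, 0.9, 0.7]),
}
known_structures = {"K": "rocksalt", "Zn": "zinc blende"}

def predict_structure(new_element):
    """Assign the structure type of the most cosine-similar known element."""
    template = max(
        known_structures,
        key=lambda el: cosine(embeddings[new_element], embeddings[el]),
    )
    return known_structures[template], template

print(predict_structure("Na"))  # Na is closest to K -> ('rocksalt', 'K')
```

This is the maximisation of cos(X, X′) described above, restricted to unary substitutions so that only one element pair needs to be compared per candidate.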
Our process led to a dataset of 101 unique compounds. The final filter was to check that the remaining compounds could be assigned oxidation states, which led to a final dataset of 100 compounds. Taking the empirical Shannon radii39 for each ion, averaged over coordination environments, the radius ratio rules are found to correctly predict the ground-state crystal structures in 54% of cases. This assessment was performed on 81 of the 100 compounds, as Shannon radii are not available for all ions. For instance, oxygen is assigned a -1 oxidation state in AgO (mp-1079720), which has no available radius. The performance is lower than the 66% reported in a recent study of the predictive power of Pauling's rules, using Pauling's univalent radii, to assign the coordination preferences of metals in a dataset of around 5000 metal oxides.40 The differences likely arise from the use of averaged Shannon radii and sensitivity to the chosen dataset.
The measure of performance defined here is classification accuracy. It is determined by the number of compositions with correctly predicted ground-state structure, via the most probable substitution, over the total number of compositions in the dataset:

Accuracy = (Number of correct structure types) / (Total number of compositions) (7)

The performance of the elemental representations ranges from 68 to 81%. Each representation performed better at this task than the previous data-mined weights of Hautier et al., with Random200 performing the worst. The classification between structure types is compared in Figure 7, with confusion matrices shown in Figure 8 to further illustrate the breakdown in class predictions. We find that representations derived from literature word embeddings (MatScholar and Mat2Vec) have comparable performance, with their confusion matrices being almost identical. Both capture similar correlations from the dataset of abstracts on which they were trained. The poorer performance of the original weights from Hautier et al.29 can be attributed to the absence of particular oxidation states, which led to some compositions not being assigned to a structure. This is a limitation of species-based measures as compared to those based on the element identity alone. As materials databases have grown compared to a decade ago, there should be a greater diversity of compounds not included in the original training of these weights, which could extend their functionality. Finally, we note that while we cannot exclude data leakage due to structure environments being present in the training data for some of the chosen element vectors, this particular use case has not been explicitly targeted in the training of the distributed representations.
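Eq. (7) amounts to a one-line count; a minimal sketch with invented labels makes the metric concrete:

```python
def classification_accuracy(predicted, actual):
    """Eq. (7): fraction of compositions with correctly predicted structure type."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Toy labels, not the real 100-compound dataset
predicted = ["rocksalt", "CsCl", "rocksalt", "wurtzite"]
actual    = ["rocksalt", "rocksalt", "rocksalt", "wurtzite"]
print(classification_accuracy(predicted, actual))  # 0.75
```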

III. CONCLUSION
In summary, by exploring high-dimensional representations of chemical elements derived from diverse sources, we have demonstrated the potential for enhanced similarity and correlation assessments. These descriptions can complement and even outperform traditional measures, as shown in the case of crystal structure prediction and classification for binary solids. Effective chemical representations can enhance our understanding and prediction of material properties, and we hope that the associated Python toolkit provided will support these developments.
Data availability statement: A repository containing the element embeddings and associated analysis code has been made available on GitHub (https://github.com/WMD-group/ElementEmbeddings), with a snapshot on Zenodo (DOI: 10.5281/zenodo.8101633). The package is readily extendable to other elemental and material representations and similarity measures. The cosine distance is also included as a distance measure within the package. It is the complement of the cosine similarity:

d_cos(A, B) = 1 − cos(θ)

The heatmaps associated with this distance measure are shown in Figure S4.
Another metric included in the package is the Wasserstein distance, which can be defined as the minimum amount of work required to transform distribution u into v. The first Wasserstein distance for distributions u and v is:

W_1(u, v) = inf_{π ∈ Γ(u, v)} ∫ |x − y| dπ(x, y)

where Γ(u, v) is the set of distributions on R × R whose marginals are u and v on the first and second factors, respectively. The heatmap associated with this distance measure is shown in Figure S5.
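As a quick sanity check of the definition, SciPy's `wasserstein_distance` (which operates on 1D empirical samples) recovers the expected transport cost for a pure shift. The sample values below are illustrative only.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two equal-weight empirical distributions; v is u shifted by 5
u = np.array([0.0, 1.0, 3.0])
v = np.array([5.0, 6.0, 8.0])

# Shifting every sample by c costs exactly c units of "work"
print(wasserstein_distance(u, v))  # 5.0
```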
The Pearson correlation and cosine similarity measures which appear in the main text are also extended to seven representation schemes. Their maps are shown in Figures S6 and S8. An additional correlation coefficient included in the package is Spearman's rank correlation coefficient. This correlation measure assesses the monotonic relationship between two variables.
Figure S7 shows the heatmaps associated with this measure. The elements are ordered in increasing atomic number along the axes.
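Spearman's coefficient is the Pearson correlation of the ranks, so a perfectly monotonic but nonlinear relationship scores 1.0. The vectors below are illustrative values only.

```python
import numpy as np
from scipy.stats import spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = x ** 3  # nonlinear but strictly monotonic in x

rho, _ = spearmanr(x, y)
print(rho)  # 1.0: ranks agree perfectly even though the relationship is nonlinear
```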

FIG. 1. Map of the pairwise Euclidean distance between element vectors for four representation schemes. The elements are ordered in increasing atomic number along the axes from 1 (H) to 83 (Bi).
FIG. 2. Map of the pairwise Manhattan distance between element vectors for four representation schemes. The elements are ordered in increasing atomic number along the axes from 1 (H) to 83 (Bi).

FIG. 3. Map of the cosine similarity between element vectors for four representation schemes.

FIG. 4. Map of the Pearson correlation coefficient between element vectors for four representation schemes.

FIG. 5. Two-dimensional projection of four element representations using principal component analysis.
FIG. 6. Two-dimensional projection of four element representations using t-distributed stochastic neighbour embedding (t-SNE).

FIG. 7. Performance of element representations at classifying the crystal structures of binary AB solids. The Materials Project bar refers to the ground truth label (structure at the bottom of the thermodynamic convex hull) for the 100 compositions in the dataset.

FIG. 8. Confusion matrices for the classification of binary AB crystal structures for 8 element substitution (similarity) measures.
Author contributions have been defined following the CRediT system. Conceptualisation: A.O., A.W. Investigation and Methodology: A.O., A.V.H., K.N. Software: A.O. Data curation: A.O. Supervision: A.O., K.T.B., A.W. Writing - original draft: A.O., A.W. Writing - review and editing: all authors. Resources and funding acquisition: A.W.

Supplementary Note 1: Similarity measures

Within the main body of the text, only the Euclidean and Manhattan distances are shown, for four of the representation schemes. Here, we show the distance measures currently available in ElementEmbeddings for seven of the representation schemes. The distance measures mentioned in the main body of the text, Euclidean, Manhattan and Chebyshev, are depicted in Figures S1, S2 and S3, respectively.
FIG. S1. Map of the pairwise Euclidean distance between element vectors for seven representation schemes. The elements are ordered in increasing atomic number along the axes. The pronounced vertical and horizontal stripes in the SkipAtom representation, and features in others, correspond to the noble gases Kr and Xe.
FIG. S2. Map of the pairwise Manhattan distance between element vectors for seven representation schemes. The elements are ordered in increasing atomic number along the axes.
FIG. S3. Map of the pairwise Chebyshev distance between element vectors for seven representation schemes. The elements are ordered in increasing atomic number along the axes.
FIG. S4. Map of the pairwise cosine distance between element vectors for seven representation schemes. The elements are ordered in increasing atomic number along the axes.
FIG. S5. Map of the pairwise Wasserstein distance between element vectors for seven representation schemes. The elements are ordered in increasing atomic number along the axes.
FIG. S6. Map of the Pearson correlation coefficient between element vectors for seven representation schemes. The elements are ordered in increasing atomic number along the axes.
FIG. S9. Distribution of the Pearson correlation between the element vectors for the four representation schemes.
FIG. S10. Distribution of the cosine similarity between the element vectors for the four representation schemes.
FIG. S13. Two-dimensional projection of seven element representations using uniform manifold approximation and projection (UMAP).

TABLE II. Classification accuracy for the crystal structure preference of 101 binary AB solids. For comparison, the radius ratio rules, based on Shannon ionic radii, have an accuracy of 54%.

TABLE S1: Table of the target formulas from the Materials Project and the template materials used to predict the structure of the target material. The template materials can be considered the most similar under each representation.

TABLE S2: Table of the target formulas from the Materials Project and the template materials used to predict the structure of the target material. The template materials can be considered the most similar under each representation.
