Andrij Vasylenko a, Dmytro Antypov a, Sven Schewe b, Luke M. Daniels a, John B. Claridge a, Matthew S. Dyer a and Matthew J. Rosseinsky *a
a Department of Chemistry, University of Liverpool, Crown Street, L69 7ZD, UK. E-mail: rossein@liverpool.ac.uk
b Department of Computer Science, University of Liverpool, Ashton Building, L69 3DR, UK
First published on 3rd January 2025
Computational modelling of materials using machine learning (ML) and historical data has become integral to materials research across physical sciences. The accuracy of predictions for material properties using computational modelling is strongly affected by the choice of the numerical representation that describes a material's composition, crystal structure and constituent chemical elements. Structure, both extended and local, has a controlling effect on properties, but often only the composition of a candidate material is available. However, existing elemental and compositional descriptors lack direct access to structural insights such as the coordination geometry of an element. In this study, we introduce Local Environment-induced Atomic Features (LEAFs), which incorporate information about the statistically preferred local coordination geometry at an element in a crystal structure into descriptors for chemical elements, enabling the modelling of materials solely as compositions without requiring knowledge of their crystal structure. In the crystal structure of a material, each atomic site can be quantitatively described by similarity to common local structural motifs; by aggregating these unique features of similarity from the experimentally verified crystal structures of inorganic materials, LEAFs formulate a set of descriptors for chemical elements and compositions. The direct connection of LEAFs to the local coordination geometry enables the analysis of ML model property predictions, linking compositions to the underlying structure–property relationships. We demonstrate the versatility of LEAFs in structure-informed property predictions for compositions, mapping of chemical space in structural terms, and prioritisation of elemental substitutions. Based on the latter for predicting crystal structures of binary ionic compounds, LEAFs achieve the state-of-the-art accuracy of 86%. 
These results suggest that the structurally informed description of chemical elements and compositions developed in this work can effectively guide synthetic efforts in discovering new materials.
In this study, we explore a novel approach to explicitly incorporate geometrical local structural information for describing chemical elements and materials compositions, resulting in the creation of Local Environment-induced Atomic Features (LEAFs). The LEAFs maintain a direct and explicit relationship between chemical elements and preferred structural characteristics such as atomic coordination and local structure motifs and we demonstrate how this direct connection can help explain machine learning models for materials property predictions. We further employ this link to address other structure-induced challenges in materials science such as derivation of a metric for mapping chemical space in structural terms, and selecting elemental substitutions for novel materials design.
In the LEAFs approach, we hypothesise that atomic properties, and hence their descriptors, can be deduced from the nature of their local structure environments in crystalline inorganic compounds. To produce LEAFs, we collect statistics of the variations in the local geometries for chemical elements in crystal structures; these element-wise statistics can then be used as the unique identifiers of chemical elements in materials modelling. The determination of these descriptors hinges upon the definition of locality and atomic neighbourhood in a coordination environment (CN),49 where each atomic site in the crystal structure of a material can be described in terms of its similarity to the common structural motifs. The atomic site is first described by the interatomic distance-based algorithm for finding CN,34,50 which performs well among other algorithms for near-neighbour finding;51 then, for each CN, the geometrical arrangement of the neighbouring atoms determines the similarity to one of the common motifs, e.g., whether a CN2 arrangement is linear or water-like, a CN4 arrangement is tetrahedral or square planar, etc., up to CN12 (Fig. S1 and S2†). The similarity between local structure motifs is quantified by comparing interior and dihedral angles for each atomic site in a local structure with those of the 37 selected common structural motifs presented in ref. 50, using the angle-based similarity metrics34,50 (ESI, eqn (1) and (2)†). Thus, each atomic site can be represented with a set of 37 numbers, each determining similarity to one of the 37 common motifs; this set of numbers is further used as the atomic site's unique vector identifier. In Fig. 1, for the example of a Mg atom in MgO, the local structure environment is compared to the CN6 structural motifs. Concatenation (denoted as ‖ in Fig. 1) of the similarity values s(CN) to all common motifs within different CNs produces a Mg-site identifier in MgO, the vector a(Mg | MgO); in this particular vector, s = 0 for all coordination environments except three in CN6.
Fig. 1 Schematic calculation of local environment-induced atomic features (LEAFs). Similarities of the atomic local structure environments in crystal structures are calculated for the common structural motifs50 within different coordination numbers (CNs), using angle-based similarity metrics.34,41 For the example of the six-coordinate Mg atom in MgO, the similarity, s, is zero (s = 0.0) to all common motifs, except for the three structural motifs in CN6: hexagonal (s = 0.2), octahedral (s = 1), and pentagon pyramidal (s = 0.5) motifs. Concatenation (symbol ‖) of the similarity values for all structural motifs in all considered CNs produces a local environment vector for an atom in a crystal structure, e.g., for Mg in MgO, a(Mg | MgO). Collecting these vectors for the 86 most common chemical elements in the crystal structures reported in the Inorganic Crystal Structure Database (ICSD),42 and averaging them over the corresponding occurrences, N, of each element produces a set of LEAFs for chemical elements.
Using this approach, we examine the local structure environments for all atomic sites of the 86 most common chemical elements across the experimentally studied materials reported in the Inorganic Crystal Structure Database (ICSD).52 For each element, the 37 similarity values were collated for all individual atomic sites containing that element across all considered structures. The mean was then taken for each of the 37 coordination environments, resulting in 37 values which form a vector-descriptor for that element, e.g., a(Mg | ICSD) in Fig. 1. Carrying out this procedure for all 86 elements produces the LEAFs.
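The aggregation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the motif positions (`HEXAGONAL`, `OCTAHEDRAL`, etc.) are a hypothetical ordering, the second Mg site is invented, and the similarity values are assumed to have been computed already with the angle-based metrics.

```python
from collections import defaultdict

N_MOTIFS = 37  # number of common structural motifs used in the text


def site_vector(similarities_by_motif):
    """Concatenate per-motif similarities into one 37-component site vector.

    `similarities_by_motif` maps a motif index (0..36, an illustrative
    ordering) to its similarity s; motifs not listed get s = 0.
    """
    vec = [0.0] * N_MOTIFS
    for motif_index, s in similarities_by_motif.items():
        vec[motif_index] = s
    return vec


def average_by_element(sites):
    """Average site vectors over all occurrences N of each element."""
    sums = defaultdict(lambda: [0.0] * N_MOTIFS)
    counts = defaultdict(int)
    for element, vec in sites:
        counts[element] += 1
        for i, s in enumerate(vec):
            sums[element][i] += s
    return {el: [x / counts[el] for x in sums[el]] for el in sums}


# Hypothetical motif positions; the real ordering follows ref. 50.
HEXAGONAL, OCTAHEDRAL, PENTAGONAL_PYRAMIDAL, TETRAHEDRAL = 10, 11, 12, 5

# Mg in MgO, with the similarity values quoted in Fig. 1.
mg_in_mgo = site_vector({HEXAGONAL: 0.2, OCTAHEDRAL: 1.0, PENTAGONAL_PYRAMIDAL: 0.5})
# An invented second Mg site, included only to make the average non-trivial.
mg_elsewhere = site_vector({TETRAHEDRAL: 1.0})

leafs = average_by_element([("Mg", mg_in_mgo), ("Mg", mg_elsewhere)])
```

With real data, `sites` would hold one entry per atomic site in every considered ICSD structure, so each component of `leafs["Mg"]` is the mean similarity of Mg sites to one motif.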
[equation image not recovered] | (1) |
We employ the test for predicting crystal structures of binary compounds proposed in ref. 53 to compare the efficacy of the elemental descriptors. For this test, 494 binary ionic solids reported in the Materials Project (MP) were selected; metals were excluded to focus on heteropolar bonding, and polymorphs were represented by only their lowest-energy compositions, resulting in a final set of 100 AB ionic solids matching four structure types: CN8 CsCl (4 compositions), CN6 rock-salt (67 compositions), CN4 zinc blende (20 compositions) and CN4 wurtzite (9 compositions), using labels from the Materials Project.54 Because the distribution of structure types in this dataset is uneven, accuracy alone is a poor measure of model performance in this imbalanced classification task.55 To address this, we computed the Matthews correlation coefficient (MCC),56 which provides a more balanced evaluation of performance, for all elemental characteristics studied in ref. 53. In this task, the structure type of each composition in the test set is predicted from the most likely substitution, according to eqn (1), into the remaining 99 compositions in the test set, using the original classifier of ref. 53. LEAFs improve on the best values achieved to date (Table 1).
Features | Origin of descriptors | Acc., % | MCC |
---|---|---|---|
LEAFs | Local coordination geometry in ICSD | 86 | 0.72 |
MatScholar9 | ML-derived from literature | 81 | 0.63 |
Mat2Vec10 | ML-derived from literature | 80 | 0.60 |
Atom2Vec13 | ML-derived from compositional content | 79 | 0.59 |
GNoME40 | Prediction of elemental substitution based on frequency of elements occupying the same atomic sites in GNoME | 79 | 0.58 |
Magpie7 | Elemental physical characteristics | 78 | 0.54 |
Oliynyk8 | Elemental physical characteristics | 75 | 0.50 |
MEGNet11 | ML-derived from atom, bond and graph attributes in MP | 73 | 0.45 |
SkipAtom12 | ML-derived from atom connectivity graphs in MP | 68 | 0.35 |
Random | Random numbers | 58 | 0.22 |
Hautier33 | Prediction of elemental substitution based on frequency of elements occupying the same atomic sites in ICSD | 54 | 0.28 |
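To see why accuracy alone can mislead on this dataset, the sketch below computes accuracy and the multiclass MCC (in its standard confusion-matrix form) for a trivial baseline that always predicts rock-salt. The class counts are taken from the text; the function name and the zero-denominator convention are choices made for this illustration.

```python
from math import sqrt


def accuracy_and_mcc(true_labels, predicted_labels):
    """Return (accuracy, multiclass MCC) for two equal-length label lists."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    n = len(true_labels)
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    true_counts = [true_labels.count(k) for k in classes]
    pred_counts = [predicted_labels.count(k) for k in classes]
    numerator = correct * n - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = sqrt(
        (n * n - sum(p * p for p in pred_counts))
        * (n * n - sum(t * t for t in true_counts))
    )
    # Convention: MCC is taken as 0 when the denominator vanishes,
    # e.g. when a single class is predicted for every sample.
    mcc = numerator / denominator if denominator else 0.0
    return correct / n, mcc


# Class counts from the text: 4 CsCl, 67 rock-salt, 20 zinc blende, 9 wurtzite.
truth = ["CsCl"] * 4 + ["rock-salt"] * 67 + ["zinc blende"] * 20 + ["wurtzite"] * 9

# A trivial baseline that always predicts the majority class.
always_rock_salt = ["rock-salt"] * len(truth)
baseline_acc, baseline_mcc = accuracy_and_mcc(truth, always_rock_salt)
```

The baseline reaches 67% accuracy without learning anything, while its MCC is zero, which is why both columns are reported in Table 1.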
The enhanced crystal structure classification suggests LEAFs' capability to capture chemical trends. To illustrate this, we plot t-distributed Stochastic Neighbour Embedding (t-SNE) maps of LEAFs representations for chemical elements (Fig. 2a). Noteworthy trends include clustering of the elements belonging to the same group of the periodic table (colour-coding) or to specific families, such as halogens, chalcogens, metals, metalloids, and noble gases (symbols); the size of the markers corresponds to the atomic number.
Fig. 2 The LEAF representation reveals chemical trends for the elements (a) and for compositions in ICSD (b). (a) t-Distributed Stochastic Neighbour Embedding (t-SNE) map of elements reveals chemical trends: elements belonging to specific families, such as halogens, chalcogens, metals, metalloids, and noble gases (symbols) and different periodic table groups (colour-coding) cluster together; the marker size denotes atomic number; (b) compositions forming the ten most populous structure types in ICSD52 are represented with LEAFs as in eqn (2) and plotted in two principal dimensions of the t-SNE map, displaying clustering patterns based on structure type and crystal system: the Cu-like structure type within the fcc system (purple circles), the perovskite (LaAlO3) structure type within the trigonal system (mustard crosses), the Laves (MgZn2) structure type within the hexagonal system (raspberry diamonds), the ThCr2Si2 structure type in the tetragonal system (pink crosses) and the rock-salt structure type (blue circles) each occupy distinctive areas of the map. The observed patterns suggest that distance in multi-dimensional LEAFs can be used for structural comparison of compositions and design by similarity.
In contrast to the random number descriptors (Fig. S7†), elemental descriptors based on physical and chemical elemental characteristics,7,8 and data-derived vectors9–12 can effectively organise chemical elements,48 offering insights specific to their properties. In the case of LEAFs, the observed grouping of elements based on their local environments implies similarities in element-specific local structures across experimentally realised inorganic materials: similarity of 3d and 4f elements; of Li, Mg and 3d metals; of Ca, Y and 4f metals, etc. (Fig. S5†).
These qualitative insights into chemical similarity arising from purely geometrical description of local coordination align with the observations derived with ML of local structural topology.38 Furthermore, to confirm the LEAFs' ability to recognise chemical patterns beyond elemental grouping, we represent chemical compositions in ICSD as vectors, by summing the weighted elemental LEAFs according to stoichiometry in a chemical formula, e.g., Li0.375P0.125O0.5 can be represented with a vector:
$a_{\mathrm{Li}_{0.375}\mathrm{P}_{0.125}\mathrm{O}_{0.5}} = 0.375\,a_{\mathrm{Li}} + 0.125\,a_{\mathrm{P}} + 0.5\,a_{\mathrm{O}}$, | (2) |
The t-SNE map of the subset of the ten most common structure types in ICSD with compositions represented with LEAFs as in eqn (2) illustrates the organisational patterns of structure types and crystal systems (Fig. 2b). Notably, distinct densely packed clusters representing various structure types are evident: using notations from ICSD, the clusters include the Cu-like structure type within the fcc system (depicted by purple circles), the perovskite (LaAlO3) structure type within the trigonal system (represented by mustard crosses), and the Laves (MgZn2) structure type within the hexagonal system (depicted by raspberry diamonds). Broader distributions, such as the ThCr2Si2 (CeGa2Al2, BaAl4) structure type in the tetragonal system (marked by pink crosses) and the rock-salt structure type (represented by blue circles) are observed, each occupying distinctive areas of the map. Less represented structure types, omitted in Fig. 2b for clarity, also demonstrate clustering in analogous t-SNE maps, built with LEAFs (Fig. S8†). The observed patterns indicate that the multi-dimensional space distance defined by LEAFs, which can be measured, for example, as Euclidean, Wasserstein or other metric distance between compositions represented with LEAFs, can be a metric for structurally-informed comparison between materials defined only by their composition (eqn (S3†)), complementing other efforts for effective mapping of chemical space.15–17,19–21,57
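The composition vector of eqn (2) and a distance between two compositions can be sketched as follows. The three-component vectors in `toy_leafs` are invented stand-ins for the real 37-component LEAFs, and Euclidean distance is used here only as one of the metrics the text mentions.

```python
from math import sqrt


def composition_vector(fractions, element_vectors):
    """Stoichiometry-weighted sum of elemental vectors, as in eqn (2)."""
    length = len(next(iter(element_vectors.values())))
    vec = [0.0] * length
    for element, fraction in fractions.items():
        for i, value in enumerate(element_vectors[element]):
            vec[i] += fraction * value
    return vec


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy 3-component stand-ins for the 37-component LEAFs (values invented).
toy_leafs = {
    "Li": [0.6, 0.3, 0.1],
    "Na": [0.2, 0.7, 0.1],
    "P": [0.9, 0.0, 0.1],
    "O": [0.1, 0.1, 0.8],
}

# Li0.375P0.125O0.5 from the text, and its Na analogue for comparison.
li3po4 = composition_vector({"Li": 0.375, "P": 0.125, "O": 0.5}, toy_leafs)
na3po4 = composition_vector({"Na": 0.375, "P": 0.125, "O": 0.5}, toy_leafs)
distance = euclidean_distance(li3po4, na3po4)
```

Because only the composition enters, the same two functions apply to hypothetical compositions with no known crystal structure.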
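$a_{\mathrm{Li}_{0.291(6)}\mathrm{La}_{0.125}\mathrm{Zr}_{0.083(3)}\mathrm{O}_{0.5}} \sim 0.292\,a_{\mathrm{Li}} \,\Vert\, (0.125\,a_{\mathrm{La}} + 0.0833\,a_{\mathrm{Zr}})$, | (3) |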
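As a check on the coefficients in eqn (3), and assuming the composition is the garnet Li7La3Zr2O12 mentioned later in the text (24 atoms per formula unit), the fractional contents are:

```latex
\begin{align*}
x_{\mathrm{Li}} &= \tfrac{7}{24} \approx 0.2917, &
x_{\mathrm{La}} &= \tfrac{3}{24} = 0.125, \\
x_{\mathrm{Zr}} &= \tfrac{2}{24} \approx 0.0833, &
x_{\mathrm{O}}  &= \tfrac{12}{24} = 0.5 .
\end{align*}
```

The four fractions sum to one, and the Li fraction rounds to the 0.292 used for the lithium-only representation.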
Elemental descriptors | Compositional representation | Model | Accuracy, % | MCC |
---|---|---|---|---|
Subset (T = 300 K): 403 entries | ||||
LEAFs | Eqn (2) | Random forest | 81 | 0.62 |
LEAFs | Eqn (4) | CrabNet | 81 | 0.60 |
Mat2Vec | Eqn (2) | CrabNet | 81 | 0.47 |
LEAFs | Eqn (3) | Random forest | 75 | 0.47 |
LEAFs | Lithium only | Random forest | 72 | 0.42 |
Full dataset (all T): 756 entries | ||||
LEAFs | Eqn (4) | CrabNet | 77 | 0.52 |
Mat2Vec | Eqn (2) | CrabNet | 70 | 0.47 |
Fig. 3 Importance of structural environments for classifying materials' ionic conductivity. (a) Structural insights from LEAFs can highlight the local motifs that influence materials properties predictions: feature importance can be calculated using a random forest model in supervised classification, considering conductivity of chemical compositions in the Li-ion conductors database.60 The inset illustrates the contribution of all local structure environments in comparison to equal contribution (dashed line). (b) In Li-conducting materials, there is a wide distribution of Li local structure environments, demonstrating the absence of a specific preferred Li coordination associated with high Li-ion conductivity.
This may be explained through the feature importance, according to which the majority of top-contributing features are associated with lithium: 29 out of 34 above the equal, uniform contribution line in Fig. 3a, and nine out of top ten. This is consistent with 72% accuracy (MCC, 0.42) achieved for classification of Li-ion-conductivity, based on compositional representation solely with lithium content, i.e., Li7La3Zr2O12 is represented as 0.292aLi, according to fractional Li content. We analyse the crystal structures in the Li-ion conductors database in terms of the similarity of the Li atom sites to the top nine local structure motifs, rendered important for conductivity classification in Fig. 3a. The diverse array of Li site local structure environments in Li-conducting materials (Fig. 3b and S9†) challenges the notion of a specific Li coordination determining Li-ion conductivity, including the widely discussed tetrahedral coordination, as suggested in the literature.58,59 This observation underscores the significance of considering the collective influence of various local environments of the constituent atoms on materials properties.65
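The comparison against the equal-contribution line can be made concrete with a short sketch. It assumes feature importances are already available (for instance from a trained random forest) and that features carry a label saying whether they belong to the Li part of the representation; the names and values in `toy_importances` are invented for illustration.

```python
def above_uniform_line(importances):
    """Return (name, importance) pairs exceeding the equal-contribution level 1/n."""
    total = sum(importances.values())
    threshold = 1.0 / len(importances)
    normalised = {name: value / total for name, value in importances.items()}
    selected = [(n, v) for n, v in normalised.items() if v > threshold]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Invented importances for a handful of features; names are illustrative only.
toy_importances = {
    "Li: tetrahedral": 0.30,
    "Li: octahedral": 0.25,
    "Li: trigonal planar": 0.20,
    "other: octahedral": 0.15,
    "other: cuboctahedral": 0.06,
    "other: linear": 0.04,
}

top = above_uniform_line(toy_importances)
# Fraction of above-threshold features that describe lithium environments.
li_share = sum(name.startswith("Li:") for name, _ in top) / len(top)
```

In the paper's analysis the analogous count is 29 lithium-associated features out of the 34 that lie above the line.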
Furthermore, LEAFs can be integrated with neural network-based models for predicting properties of materials represented only as compositions, which instead of using a generic set of descriptors for every task, can learn elemental descriptors specific to predicting a particular property of materials from the local structure environments (Fig. 4). This alignment can be achieved through coupling and end-to-end training of the integrated models.29 To implement this, we utilise multi-hot encoding to represent the full information regarding elemental local environments across material structures in ICSD in a format easily interpretable by machine learning algorithms. One-hot encoding can represent real values by discretising continuous range values into predefined bins, where only one bin (hot) is set to 1, and the position of this bin indicates the value, for example, numbers 0.0, 0.5 and 1.0 can be represented as strings (1 0 0), (0 1 0) and (0 0 1), respectively, in the 3-bit one-hot encoding scheme. Similarly, we can represent each of the considered common motifs and each individual similarity value, s, ranging from 0.001 to 1, with three digits of precision as 1000-bit vectors. We note that the exact vector length does not appear to have a major effect on the results and re-doing the experiment with 100-bit vectors yielded similar results. In the considered example of the MgO crystal (Fig. 4a), the similarities of Mg in the octahedral environment to the CN6 motifs, s = 0.2, 0.5, 1, can be represented as 1000-bit binary vectors with 1s in positions 200, 500, and 1000, respectively; for the other 34 motifs, the Mg atom in MgO has similarity s = 0, and hence the corresponding binary vectors will have 1s in the first positions. Concatenating these binary vectors for all 37 motifs results in a sparse 37000-bit multi-hot vector with exactly 37 1s in the corresponding positions, encoding the similarity values of Mg in MgO. 
This representation also affords encoding of those materials where an atom is found in more than one coordination environment; such materials are represented with binary vectors with more than one bit per motif set to 1. We then use the binary vector to collect all occurring similarity values for Mg local environments in all Mg-containing materials reported in ICSD and populate the bins in the corresponding positions with 1s. Doing this for all chemical elements, we encode each element as a 37 000-bit binary string, where 0s denote the absence and 1s the presence of a similarity value to one of the motifs within the corresponding local environment in ICSD. We illustrate this matrix of local elemental environments conceptually by black and white pixels, representing ones and zeros, respectively, for the subsets of elements and their similarities to local environments in Fig. 4d and S1–S3,† where more detail is given. The full matrix of local elemental environments is 37 000 columns of binary strings by 86 rows of considered elements. This matrix is then pruned to remove all-zero columns and used as a source for nonlinear learning of LEAFs, e.g., with an unsupervised autoencoder12,28,29,38,67 (Fig. S4†), and for integration with the supervised models utilising property-specific elemental descriptors in a variety of downstream tasks for materials property prediction. Such integration can be performed as follows:
[equation image not recovered] | (4) |
Fig. 4 Schematic learning of local environment-induced atomic features (LEAFs) aligned with prediction of properties of materials represented solely by their compositions. Similarities of the local structure environments of the atomic sites in a crystal structure, exemplified by MgO (a), to the 37 selected common structural motifs50 (b) are calculated for the atomic sites of experimentally verified structures reported in ICSD. (c) For the example of the six-coordinate Mg octahedral environment in MgO, compared to planar hexagonal (similarity, s = 0.2), octahedral (s = 1), and pentagon pyramidal (s = 0.5) motifs, these similarity values, s, are discretised into a thousand bins spanning from 0.001 to 1, illustrated as 10-digit binary strings for the Mg example in (c) for simplicity. Such discretisation and subsequent concatenation of the binary strings for all 37 structural motifs form 37 000-bit vectors […]
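The discretisation and concatenation summarised in Fig. 4 can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the motif positions are a hypothetical ordering, the second Mg site is invented, and the binning rule (similarity s mapped to the 1-based position round(s × 1000), with s = 0 placed in the first position) is inferred from the positions 200, 500 and 1000 quoted in the text.

```python
N_MOTIFS = 37
N_BINS = 1000  # the text reports similar results with 100 bins


def bin_position(s, n_bins=N_BINS):
    """1-based bin for a similarity s in [0, 1]; s = 0 falls in the first bin."""
    return max(1, round(s * n_bins))


def encode_site(similarities, n_bins=N_BINS):
    """Multi-hot vector for one atomic site: one hot bit per motif."""
    bits = [0] * (N_MOTIFS * n_bins)
    for motif_index, s in enumerate(similarities):
        bits[motif_index * n_bins + bin_position(s, n_bins) - 1] = 1
    return bits


def encode_element(site_encodings):
    """Bitwise OR over all sites of an element: marks every observed value."""
    return [int(any(column)) for column in zip(*site_encodings)]


# Hypothetical motif positions; the real ordering follows ref. 50.
HEXAGONAL, OCTAHEDRAL, PENTAGONAL_PYRAMIDAL = 10, 11, 12

# Mg in MgO with the similarity values quoted in the caption.
mg_in_mgo = [0.0] * N_MOTIFS
mg_in_mgo[HEXAGONAL] = 0.2
mg_in_mgo[OCTAHEDRAL] = 1.0
mg_in_mgo[PENTAGONAL_PYRAMIDAL] = 0.5

site_bits = encode_site(mg_in_mgo)

# An invented second Mg site with a less ideal octahedron.
other_site = list(mg_in_mgo)
other_site[OCTAHEDRAL] = 0.9
element_bits = encode_element([site_bits, encode_site(other_site)])
```

A single site always yields exactly 37 set bits, whereas the element-level vector gains an extra bit for each distinct similarity value observed across sites, which is what distinguishes elements in the 86-row matrix.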
Notably, such integration, trained on the full dataset of 756 entries of conducting materials reported at all temperatures, achieves a higher accuracy of 77% and MCC of 0.53 in comparison to the 70% accuracy and MCC of 0.37 achieved with CrabNet with Mat2Vec (Table 2), demonstrating enhanced robustness of the proposed approach to noise in the data arising from label ambiguity, as the same compounds may have multiple conductivity entries at different temperatures. By employing the integration in eqn (4) to train models on other property datasets, such as dielectric, elasticity, formation energy, energy band gap, etc.,69 LEAFs demonstrate performance comparable to the state-of-the-art models for compositions (Table 3), while offering a route to improved interpretability through connection to the prevalent structural features affecting the properties.
Data set | Number of samples | CrabNet Mat2Vec | CrabNet LEAFs |
---|---|---|---|
Mean absolute error | |||
Perovskites form. energy (eV per unit cell) | 18… | 0.3473 | 0.3495 |
Dielectric (unitless) | 4764 | 0.4439 | 0.4254 |
Elasticity G_VRH (log10(GPa)) | 10… | 0.0994 | 0.0973 |
Elasticity K_VRH (log10(GPa)) | 10… | 0.0741 | 0.0761 |
JARVIS exfoliation energy (meV per atom) | 636 | 49.8551 | 52.8234 |
Experimental band gap (eV) | 4604 | 0.3463 | 0.343 |
Footnote
† Electronic supplementary information (ESI) available: Details of the data processing, similarity calculations for local structure environments and their discretisation; elemental similarity using LEAFs and LEAFs' distance for comparison of compositions; details of LEAFs' performance in crystal structure prediction – confusion matrices are available at https://www.github.com/lrcfmd/LEAF. See DOI: https://doi.org/10.1039/d4dd00346b
This journal is © The Royal Society of Chemistry 2025 |