Lucas Foppa* and
Matthias Scheffler
The NOMAD Laboratory at the Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany. E-mail: foppa@fhi-berlin.mpg.de
First published on 25th June 2025
Useful materials are often statistically exceptional and they might be overlooked by artificial intelligence (AI) models that attempt to describe all materials simultaneously. These global models perform well for the majority of materials, but they do not necessarily capture the useful ones. Subgroup discovery (SGD) identifies descriptions of subsets of materials associated with exceptional values of a chosen property. Thus, SGD can better capture exceptional materials compared to widely used AI techniques. Previous studies focused on the SG that maximizes an objective function establishing a tradeoff between the SG size and the exceptionality of the distribution of property values within the SG. However, this optimization does not give a unique solution, but many SGs typically have similar objective-function values. Here, we identify a “Pareto region” of SGD solutions presenting a multitude of size-exceptionality tradeoffs. The approach is demonstrated by learning descriptions of perovskites with a high bulk modulus.
AI methods often fail in describing the exceptional materials first because the training data are typically not well distributed over (or representative of) the huge, unknown materials space. Therefore, interpolation schemes are unable to generalize to potentially interesting portions of the materials space that were disregarded in the training data.10–14 This issue, which might be referred to as an “out of distribution” issue, can be alleviated by AI approaches that can better extrapolate compared to methods that are inherently interpolative.12,15 Besides, AI model training can be combined with the systematic acquisition of new data corresponding to portions of the materials space that were not covered by the initial training data using sequential-learning approaches such as active learning or Bayesian optimization.16–19 However, the efficiency of sequential learning often relies on the quality of uncertainty estimates, which is in some cases problematic.20–22 A second key reason that can explain the inability of current AI approaches to capture exceptional materials is the focus on global models. These models attempt to describe all materials simultaneously. They are obtained by optimizing an objective (loss) function that reflects the average performance, e.g., the mean prediction error. Thus, global models are designed to perform well in average for the majority of (uninteresting) materials, but do not necessarily perform well for exceptional ones.23 Objective functions can be adapted to give more importance to the description of specific property values, e.g., high values.24 However, different groups of materials could operate according to different mechanisms. This might render a global description not only inaccurate, but also inappropriate.
Alternative AI methods for materials discovery include strategies based on similarity among materials25,26 or among their constituents, e.g., ions in solids,27 and the subgroup-discovery (SGD)28,29 approach. In particular, SGD tackles the limitations of global descriptions and it has the potential to better capture exceptional materials. Indeed, SGD has been recently put forward in materials science.30–35 SGD is a supervised, descriptive rule-induction36 technique and it identifies subsets of a dataset associated with exceptional values of a target quantity of interest, for instance a materials property or performance indicator. Crucially, SGD identifies these subsets (SGs) of data along with the descriptions of these subsets, referred to as rules. The SGD analysis starts with the choice of many features that relate to possibly relevant mechanisms governing the target materials property. Then, SGD creates a number of statements about the features that are satisfied only for a part of the dataset. These statements are, for instance, inequalities constraining the values of the features. Finally, a search algorithm37–39 identifies the combination of typically few statements that results in a SG that maximizes an objective function. This objective (or quality) function is a product of the relative SG size and the so-called utility function. The relative SG size is the fraction of data points that satisfies the statements describing the SG. The higher the relative SG size, the more general the description. The utility function quantifies the “exceptionality” of the distribution of target values in the SG with respect to the entire dataset. The positive mean shift is one example of a utility function often utilized when the target values of interest are high. This utility function measures the shift of the mean value of the target in the SG with respect to the mean value of the target in the entire dataset. The higher the value of the positive-mean-shift utility function, the higher the target values in the identified SG. Thus, such a utility function favors the identification of SGs associated with high target values. We will discuss this utility function in more detail in the Results section. We will also discuss a second example of the utility function, namely the Jensen–Shannon divergence between the distributions of target values in the SG and in the entire dataset.
SG rules typically constrain the values of only a few key features, out of the many initially offered ones. Thus, SGD learns a (low-dimensional) representation. The SGD approach is illustrated in Fig. 1 (top). The aim of SGD is to find descriptions of portions of the materials space that are exceptional. Thus, it accepts that the mechanisms governing the materials' properties might vary across the materials space and that not all of these mechanisms need to be described for the discovery of useful materials. Indeed, SGD was used to identify rules associated with high-performance materials even based on datasets dominated by low-performance situations.35 Additionally, SGD is an exploratory analysis that can identify unexpected patterns and anomalies. We note that SGD is significantly different from clustering techniques because it does not aim at describing the entire dataset. Moreover, SGD is a supervised approach, and it explicitly identifies rules indicating why the data points belong to the SG. Clustering is an unsupervised technique that groups data points into clusters based on similarity, without considering any target quantity. Besides, clustering does not explicitly identify why the data points are clustered together.
Previous SGD studies30–33 focused on the identification of the SG (and rules) that maximizes the objective function, as illustrated in Fig. 1 (top). However, the SG that maximizes the objective function does not reflect all possible tradeoffs between the relative SG size and utility function that could be relevant for a given application. Additionally, the definitions of some utility functions assume that the distributions of target values in the entire dataset and in the SG are appropriately characterized by one single summary-statistics value, such as the mean value in the case of the positive-mean-shift utility function. However, this assumption may be questioned in materials science, as the distributions can significantly deviate from the normal distribution. Indeed, distributions related to materials can be skewed or even bimodal. Finally, utility functions often assume that the summary-statistics values appropriately reflect the huge, unknown materials space, e.g., the mean value in the dataset is a good approximation for the mean value in the entire materials space. This assumption does not hold if the datasets are created according to certain selection biases. Thus, the datasets can be highly unbalanced compared to the materials space.
In this manuscript, we introduce a multi-objective optimization of SGs that identifies coherent collections of SGs and rules in the “Pareto region” of optimal SGD solutions. These SGs present a multitude of tradeoffs between the relative SG size and the utility function, the two conflicting objectives in SGD. This concept is schematically shown in Fig. 1 (bottom). The multi-objective optimization of SGs is demonstrated for the identification of ABO3 perovskites with a high bulk modulus as an example of a target. We compare the SGs obtained with two different utility functions, the positive mean shift and the cumulative Jensen–Shannon divergence. The latter does not make assumptions on the shape of the distributions of target values. We also analyze the sensitivity of the results with respect to the offered set of features. Finally, we exploit the rules trained on a dataset of 504 single ABO3 perovskites to identify high-bulk-modulus perovskites out of a candidate space of 12096 single ABO3 and double A2BB′O6 perovskites. Our results show that rules focusing on perovskites with a high bulk modulus do not necessarily correspond to the single SG which maximizes the objective function, but they can be systematically derived with the Pareto-region concept. These rules identify perovskites of the candidate space that present the bulk modulus up to 13% higher than the highest value of the training set.
![]() | (1) |
The inputs to the SGD analysis are the datasets containing a target quantity of interest (e.g., a materials property) and the features that characterize the materials. Additionally, one has to choose an appropriate quality function, which determines the desired distribution of target values in the SGs. The outputs are the subsets of data (SGs) and the rules (statements) that describe these subsets of data. These rules typically depend only on key features, out of all initially offered features. In analogy to genes in biology, these key features might be called materials genes,9 as they correlate with the mechanisms governing the materials properties. The rules can be exploited to efficiently identify the few exceptional materials in the huge materials space P, for which the target property is unknown.
The dataset52 used to train the SG rules contains 504 perovskites composed of A elements from the alkali, alkaline-earth, and scandium groups and lanthanides. B elements include transition metals and main-group elements such as bismuth, antimony, and germanium. The choice of A and B elements reflects common elements reported in perovskites.44–46 We only consider the cubic, highly symmetric perovskite structure in our dataset, which is often only stable at high temperatures. Thus, our analysis focuses on diversity of the chemical elements entering the material rather than on the diversity of structures. However, it is straightforward to extend the SGD approach to other, less symmetric crystal structures. We used 24 features characterizing the perovskites (Table 1). Two of the features are properties of the solid perovskite materials (denoted S), the equilibrium lattice constant (a0) and the cohesive energy (E0). The equilibrium lattice constant is the only structural degree of freedom of the cubic structure. The cohesive energy corresponds to the energy required to atomize the materials' crystal. Ten of the features are atomic properties of free atoms of the elements A or B (denoted A), such as orbital radii, ionization potential, and electronegativity. Finally, we included two features that depend on the composition of the material (denoted C), the expected oxidation states of A and B elements in the compound (nA and nB, respectively). The bulk modulus and the features (except the atomic numbers of A and B, nA and nB) were calculated using density-functional theory (DFT) with the PBEsol exchange-correlation functional. The bulk modulus is evaluated by fitting the Birch–Murnaghan equation of state to a series of energies of the crystal calculated using structures that present slightly larger or smaller volumes than the equilibrium volume. Further calculation details are provided elsewhere.52 We note that some of the features in Table 1 are correlated with each other. This is not a limitation for SGD. However, the presence of correlated features might result in similar SGs defined by slightly different rules.
Type | Name | Symbol | Unit |
---|---|---|---|
a Properties of the solid material.b Properties of free atoms of elements constituting the material.c Properties of the composition of the material.d Evaluated using DFT-PBEsol.e Energy needed per atom to atomize the crystal.f Defined based on the periodic-table group of the A element and on the charge neutrality of the ABO3 composition, i.e., nA + nB = 6. | |||
Sa | Equilibrium lattice constantd | a0 | Å |
Sa | Cohesive energyde | E0 | eV per atom |
Ab | Radii of the valence-s orbitals of the A and B neutral atomsd | rs,A and rs,B | Å |
Ab | Radii of the valence-s orbitals of the A and B +1 cationsd | rcats,A and rcats,B | Å |
Ab | Radii of the highest-occupied orbitals of A and B neutral atomsd | rval,A and rval,B | Å |
Ab | Radii of the highest-occupied orbitals of the A and B +1 cationsd | rcatval,A and rcatval,B | Å |
Ab | Electron affinity of the A and B atomsd | EAA and EAB | eV |
Ab | Ionization potential of the A and B atomsd | IPA and IPB | eV |
Ab | Electronegativity of the A and B atomsd | ENA and ENB | eV |
Ab | Kohn–Sham single-particle eigenvalue of the highest-occupied orbital of the A and B atomsd | εH,A and εH,B | eV |
Ab | Kohn–Sham single-particle eigenvalue of the lowest-unoccupied orbital of the A and B atomsd | εL,A and εL,B | eV |
Ab | Atomic numbers of A and B elements | ZA and ZB | ![]() |
Cc | Expected oxidation states of the elements A and B in the perovskite formulaf | nA and nB | ![]() |
![]() | (2) |
The 5000 SGs with the highest objective-function values identified in the analysis using the positive-mean-shift utility function are shown as grey points in Fig. 2(A). The Pareto region is shown in blue in this plot: the 60 and 49 SGs belonging to the Pareto front and to the near-Pareto-front region are displayed in dark and light blue, respectively. This plot shows, in orange, the SG that maximizes the objective function, denoted as . The curve corresponding to the constant value of
is shown as a dashed orange line. Note that we assign the i indices to the SGs of the Pareto region SGmi according to increasing values of relative SG size. The star in
indicates that this is the SG associated with the maximum objective-function value. The Pareto region contains many SGs with objective-function values close to the maximum at relative SG sizes in the range [0.4, 0.6]. Conversely, SGs in the Pareto region with relative sizes lower than 0.4 and higher than 0.6 present relatively lower objective-function values compared to the maximum value.
![]() | ||
Fig. 2 Collections of SGs describing perovskites with a high bulk modulus (B0) obtained using the multi-objective optimization approach and the positive-mean-shift utility function. The results for the full feature set containing the 24 features in Table 1 are shown. (A) The 5000 SGD solutions with high values of the objective function are shown in grey and the Pareto region of SGD solutions is shown in blue. The SG associated with the maximum value of the objective function is shown in orange. (B) The 109 SGs of the Pareto region are clustered according to their similarity via hierarchical clustering. Each cluster is shown in a different color. (C) Distributions of B0 in the entire training dataset and in some examples of SGs of the Pareto region. The rules associated with these SGs are shown in Table 2. (D) SG rules for some examples of SGs of the Pareto region that constrain the values of the equilibrium lattice constant (a0) and cohesive energy (E0). |
The Pareto-region concept leads to the identification of not one, but a collection of 109 SGs presenting multiple generality-exceptionality tradeoffs. However, it is unclear how to choose which of these SGs should be considered for a detailed analysis of physical insights or for materials discovery in larger candidate materials spaces. Many of the SGs of the Pareto region might be similar to each other and they might contain redundant information. In order to assess the variability of the SG rules of the Pareto region and to facilitate further analysis of these rules, we established a measure of similarity between SGs and used it to identify clusters of SGs containing similar SGs.53 This analysis is described in detail in the ESI.† In summary, the similarity is assessed using Jaccard indices. These indices consider that the similarity between two SGs is proportional to the overlap of their elements, i.e., to the number of data points that satisfy the rules defining both SGs. Thus, this similarity will be high between SG rules that result in a similar selection of materials, even though the rules themselves might be different, e.g., due to correlated key features or due to different thresholds. To obtain the clusters, we applied agglomerative hierarchical clustering.54
The results of the similarity analysis and clustering of the Pareto region in Fig. 2(A) for a chosen number of four clusters are displayed in Fig. 2(B). In this figure, the four identified clusters are displayed in four different colors. In general, the clusters of SGs correspond to different ranges of relative sizes. The cluster shown in orange, for instance, can be related to the SGD solutions with objective-function values close to the maximum in the relative size range of [0.4, 0.6]. Interestingly, a small cluster containing only three SGs is identified at low relative sizes. This indicates that these three SG rules are unique compared to the remaining ones. We note that the SGs of the Pareto region are spread in a continuous manner in the utility-function vs. relative-size plot in Fig. 2(A). The aim of the clustering technique is not to identify clusters of SGs that preexist in the utility-function vs. relative-size space, but rather to partition the pool of SGs of the Pareto region into clusters containing similar rules. This partitioning aims to facilitate the analysis of the many SGD solutions identified with the multi-objective-optimization approach.
We analyzed in more detail one SG per identified cluster. SGm1, SGm6, , and SGm100 are examples of SGs belonging to the purple, red, orange, and magenta clusters, respectively. The distributions of bulk-modulus values in the entire dataset and in the mentioned SGs are shown in Fig. 2(C). As the relative SG size decreases, the SGs of the Pareto region have higher mean bulk-modulus values and narrower bulk-modulus distributions. For the goal of identifying perovskites with an extremely high bulk modulus, the rules associated with SGs with low relative SG size and high mean bulk-modulus values, e.g., associated with SGm1 or SGm6, are useful, since they provide a more focused description. Such SGs would not be detected based solely on the maximization of the objective function.
Next, we analyzed the rules defining SGm1, SGm6, , and SGm100, shown in Table 2. The rules constrain the values of 3 key features, out of the 24 offered features: the equilibrium lattice constant (a0), cohesive energy (E0), and the radius of valence-s orbitals of +1 cations (cat) of the B element (rcats,B). In particular, the a0 and E0 values are always constrained to maximum and minimum thresholds, respectively. Thus, perovskites with a short lattice constant and high cohesive energy tend to present a high bulk modulus. This reflects the inverse relationship of the bulk modulus with the lattice constant and the direct relationship of the bulk modulus with cohesive energy.52 This analysis illustrates how physical insights can be obtained from the key features identified by SGD.
Features | Indexb | Q(SG, ![]() |
u(SG, ![]() |
s(SG)/s(![]() |
ȳ(SG)c | ystd(SG)d | Rules |
---|---|---|---|---|---|---|---|
a S, A, and C correspond to solid, atomic, and compositional, respectively (see Table 1).b The star indicates the SG with the maximum value of objective (quality) function Q obtained with a given utility function and feature set. The superscripts “m” and “JS” correspond to the utility functions positive mean shift and cumulative Jensen–Shannon divergence, respectively.c Mean value of the target within the SG, in eV Å−3.d Standard deviation of the target within the SG, in eV Å−3. | |||||||
S, A, and Ca | SGm1 | 0.16 | 0.67 | 0.24 | 1.36 | 0.07 | E0 > 7.48 eV per atom ∧ rcats,B ≤ 1.44 Å |
S, A, and Ca | SGm6 | 0.19 | 0.63 | 0.31 | 1.34 | 0.08 | E0 > 7.48 eV per atom ∧ a0 ≤ 4.00 Å |
S, A, and Ca | ![]() |
0.25 | 0.53 | 0.48 | 1.28 | 0.12 | E0 ≥ 6.41 eV per atom ∧ a0 ≤ 4.07 Å ∧ rcats,B ≥ 0.94 Å |
S, A, and Ca | SGm61 | 0.25 | 0.45 | 0.55 | 1.27 | 0.13 | E0 ≥ 6.41 eV per atom ∧ a0 ≤ 4.07 Å |
S, A, and Ca | SGm100 | 0.20 | 0.26 | 0.77 | 1.19 | 0.18 | E0 ≥ 5.42 eV per atom ∧ a0 ≤ 4.16 Å |
S, A, and Ca | SGJS5 | 0.06 | 0.61 | 0.10 | 1.42 | 0.04 | −4.55 ≤ εL,B < −4.33 eV ∧ a0 < 3.85 Å ∧ nB < 3.5 |
S, A, and Ca | SGJS96 | 0.10 | 0.45 | 0.22 | 1.37 | 0.06 | rval,A ≤ 1.51 Å ∧ EAB ≥ −1.84 eV ∧ rcats,B ≤ 1.50 Å ∧ rval,B ≤ 1.14 Å ∧ E0 > 7.11 eV per atom |
S, A, and Ca | ![]() |
0.11 | 0.29 | 0.39 | 1.32 | 0.09 | EAB ≤ 0.31 eV ∧ εL,B ≤ −3.50 eV ∧ rcats,B ≥ 1.09 Å ∧ E0 ≥ 6.77 eV per atom |
S, A, and Ca | SGJS169 | 0.09 | 0.16 | 0.55 | 1.27 | 0.13 | E0 ≥ 6.41 eV per atom ∧ a0 ≤ 4.07 Å |
A and Ca | ![]() |
0.05 | 0.67 | 0.08 | 1.43 | 0.03 | EAB ≥ −1.03 eV ∧ −4.55 ≤ εL,B < −4.33 eV ∧ nB < 4 |
A and Ca | ![]() |
0.05 | 0.67 | 0.08 | 1.43 | 0.04 | IPB ≤ 7.82 eV ∧ εL,B < −4.33 eV ∧ rval,B < 0.68 Å ∧ nB < 3.5 |
A and Ca | ![]() |
0.06 | 0.65 | 0.09 | 1.42 | 0.04 | rval,A < 1.28 Å ∧ EAB ≤ 0.31 eV ∧ ZB < 36 ∧ rs,B ≥ 1.26 Å |
A and Ca | ![]() |
0.10 | 0.43 | 0.22 | 1.37 | 0.06 | −5.11 ≤ εL,B ≤ −3.50 eV ∧ rcatval,B ≤ 0.94 Å ∧ nA > 2 |
A and Ca | ![]() |
0.10 | 0.35 | 0.30 | 1.34 | 0.08 | rcatval,A ≤ 1.38 Å ∧ εL,B ≤ −3.49 eV ∧ rval,B ≤ 1.14 Å ∧ nA ≥ 1.5 |
A and Ca | ![]() |
0.06 | 0.12 | 0.52 | 1.26 | 0.15 | ENA ≥ 2.76 eV ∧ ENB ≤ 4.85 eV ∧ rs,B ≥ 1.09 Å |
The rules associated with SGm6 and SGm100 are presented in the coordinates of the key parameters a0 and E0 in Fig. 2(D). We also present, in this figure, the rules associated with SGm61, as an example of an SG that belongs to the orange cluster in Fig. 2(B) whose rules only depend on a0 and E0 – see Table 2. In this plot, the bulk-modulus values are indicated by the grey scale color of the circles. The figure shows graphically that a more focused description is achieved as the utility-function values increase (and the SG size decreases) within the Pareto region. This figure also highlights that more focused rules might arise at the expense of missing some high-bulk-modulus materials. For instance, some dark-grey circles corresponding to high-bulk-modulus materials are outside the limits of SGm6. Thus, these high-bulk-modulus materials are not captured by SGm6.
To understand which high-bulk-modulus materials might be “missed” in SGm6 as an example of an SG with a high utility function, we verified whether the 5% materials with the highest bulk moduli of the dataset are contained in SGm6. These are 26 perovskites presenting bulk moduli higher than 1.43 eV Å−3 and composed of the B elements chromium, manganese, iron, cobalt, and tungsten, and the A elements scandium, praseodymium, neodymium, cerium, promethium, yttrium, samarium, and beryllium. 24 of these 26 materials satisfy the rules associated with SGm6. The two high-bulk-modulus materials that do not satisfy the rules of SGm6 are BeMnO3 and BeWO3, with bulk moduli of 1.43 and 1.45 eV Å−3, respectively. These two materials present the lowest cohesive energies (6.63 and 7.47 eV per atom, respectively) among the 26 materials. They are shown as lime crosses in Fig. 2(D). Thus, they do not satisfy the inequality E0 > 7.48 eV per atom, which is part of the rules of SGm6 (Table 2). The bulk modulus for these two materials could be governed by a different mechanism compared to the materials that are part of SGm6. We will analyze this unexpected pattern in more detail in a dedicated subsection of the manuscript (see below).
We evaluated the variability of the identified SGs with respect to the dataset size by training the SGD with random selections of 75%, 50%, and 25% of the dataset. Even though the similarity between the SG identified based on the entire dataset and the SG identified based on a fraction of the dataset (measured using the Jaccard similarity index) decreases with decreasing data-set size, the SGs obtained with only 25% of the dataset and presenting large relative sizes are significantly similar compared to the SGs obtained with the entire dataset. The SGs obtained with 25% of the dataset and presenting low relative sizes, however, are significantly different from the SGs obtained with such relative sizes using the entire dataset. Thus, for the problem under consideration, SGD is efficient with up to 50% less data. More details of this analysis can be found in the ESI.†
![]() | (3) |
The results obtained with the cumulative Jensen–Shannon-divergence utility function and with the full set of the 24 features are shown in Fig. 3(A) and in Table 2. The Pareto front and region contain 101 and 189 SGs, respectively. The SGs identified in the Pareto region contain materials with a high bulk modulus and they present narrow distributions of target values. We analyzed the selectors defining some of the SGs of this Pareto region. The rules associated with SGJS5, SGJS96, , and SGJS169 constrain the values of the key features: the equilibrium lattice constant (a0), cohesive energy (E0), the radius of valence-s orbitals of +1 cations (cat) of the B element (rcats,B), the Kohn–Sham single-particle eigenvalue of the lowest-unoccupied orbital of the B atom (εL,B), the expected oxidation state of B in the perovskite (nB), the radius of the highest-occupied orbital of A and B neutral atoms (rval,A and rval,B, respectively), and the electron affinity of the element B (EAB). Therefore, the rules highlight that the lattice constant and the cohesive energy are important parameters for describing high-bulk-modulus materials, along with atomic and compositional properties that mainly reflect the nature of the B element of the ABO3 perovskites.
![]() | ||
Fig. 3 Collections of SGs describing perovskites with a high bulk modulus using the cumulative formulation of the Jensen–Shannon divergence as the utility function. The 5000 SGD solutions with high values of the objective function are shown in grey and the Pareto region is displayed in blue. The SG associated with the maximum value of the objective function is shown in orange or red. The two panels show the results for two different sets of features. (A) Results for the full feature set containing the 24 features in Table 1. (B) Results for the reduced feature set containing 22 atomic and compositional features (see Table 1). The rules associated with the SGs indicated in the figure are shown in Table 2. |
Let us now compare the results obtained with the positive-mean-shift with the results obtained with the cumulative-Jensen–Shannon-divergence utility functions. The Pareto region of SGs identified based on the positive-mean-shift utility function is associated with relative SG sizes in the approximate range [0.20, 0.80] (Fig. 2(A)). The range of relative SG sizes in the Pareto region identified for the case of the cumulative-Jensen–Shannon-divergence utility function is, in turn, ca. [0.10, 0.70] (Fig. 3(A)). Thus, optimal SGs with relative SG sizes below 0.20, i.e., SGs that contain less than 20% of the dataset, could only be obtained with the utility function based on the Jensen–Shannon divergence. These SGs with small size are associated with distributions of bulk-modulus values that are narrower and more shifted towards high values than those associated with the SGs identified with the positive-mean-shift utility function. For instance, the materials selected in SGJS5 and SGJS96 present standard deviations of bulk-modulus values of 0.04 and 0.06 eV Å−3, respectively. The mean values of the bulk modulus among the perovskites in these two SGs are 1.42 and 1.37 eV Å−3, respectively. None of the SGs identified with the positive mean shift present higher mean values or lower standard deviation values (see Table 2). Thus, the rules corresponding to small SGs identified with the cumulative Jensen–Shannon divergence are more focused on the high bulk modulus. This can be related to the fact that only this utility function explicitly favors narrow SGs.
The SG rules identified using the cumulative-Jensen–Shannon-divergence utility function contain, in general, more statements and more features compared to the SG rules identified using the positive mean shift (Table 2). However, we note that the equilibrium lattice constant, the cohesive energy, and the radii of B atoms are identified as key features by both approaches. The cumulative Jensen–Shannon-divergence utility function provides more focused rules for the present dataset and it is thus better than the positive-mean-shift for the purpose of identifying exceptional perovskites with very high bulk moduli. However, we stress that this utility function does not explicitly require low or high values of the target. This is a disadvantage compared to the positive-mean-shift utility function. Dispersion-corrected utility functions simultaneously take into account positive or negative shifts of the mean (or of the medians) of target values and the narrowness of the distributions of targets in the SG. These utility functions were proposed in order to simultaneously incorporate the requirements for a shift in a specific direction and for small narrowness.31
We identified the Pareto region of SGs using the reduced set of 22 features and the cumulative Jensen–Shannon-divergence utility function. The SGs identified with this approach are denoted as , where ′ indicates the reduced feature set. The results are shown in Fig. 3(B) and in Table 2. The identified Pareto front and near the Pareto front contain 66 and 136 SGs, respectively. The maximum value of the objective function obtained with the reduced feature set is slightly lower than that obtained with the full feature set (0.10 and 0.11, respectively, see orange and red dashed lines in Fig. 3(A) and (B)). This indicates that the quality of the SG description is lower with the reduced feature set. The SG identified based on the reduced set of features that displays the maximum objective function, denoted as
in Table 2 and Fig. 3(B), contains 150 materials. This SG contains fewer materials than the SG identified with all the features SGJS*129 (197 materials). However, the overlap between the two SGs is significant. Indeed, 125 materials are present in both SGs. Thus, there is a high similarity between both descriptions.
SG rules focusing on high-bulk-modulus materials in the low relative size portion of the Pareto region were also obtained with this reduced feature set. For instance, ,
, and
display mean bulk-modulus values of 1.43, 1.43, and 1.42 eV Å−3 and standard deviations of 0.03, 0.04 and 0.04 eV Å−3, respectively. These figures are similar to those associated with SGJS5, which was identified based on the entire set of the 24 features. The SGs
and SGJS5 contain, respectively, 45 and 48 materials. 40 materials are present in both SGs. This reflects a significant similarity between both descriptions. Thus, the SGs with small relative size identified in the Pareto region with the reduced set of features are comparable to those obtained with the full feature set. The rules associated with the SGs identified using the reduced set of 22 features depend on some key parameters that were also identified by the analysis of the entire set of the 24 features and the cumulative-Jensen–Shannon-divergence utility function, e.g., radii of A and B elements, Kohn–Sham single-particle eigenvalue of the lowest-unoccupied orbital of the B atom (εL,B), electron affinity of the B atom (ENB), and expected oxidation state of the B element in the perovskite. However, some additional key features are identified in the analysis of the reduced set of features, such as the ionization potential of the B atom (IPB), the atomic charge of the B element (ZB), the expected oxidation state of the A element in the perovskite (nA), and the electronegativity of the A element (ENA). This shows how SGD attempts to reconstruct the information contained in the important features a0 and E0 by using the information on other offered features. Overall, the results obtained with the reduced feature set are comparable to those obtained with all 24 features. The rules derived based on the reduced feature set can thus be applied to identify high-bulk-modulus materials in larger materials spaces compared to the training set.
The distribution of B0 for the materials that were randomly selected from the candidate materials space (Fig. 4, in grey) has practically the same mean value compared to that of the distribution of B0 in the training set (Fig. 4, in black), i.e., 1.09 eV Å−3. This indicates that the training data might be representative of this specific candidate materials space. The highest B0 among the materials randomly selected from the candidate materials space is 1.36 eV Å−3. This value is significantly lower than the highest B0 in the training dataset, 1.49 eV Å−3 for ScMnO3.
The distribution of B0 for the materials that were selected from the candidate materials space using the SG rules (Fig. 4, in orange) is concentrated in high B0, with a mean B0 value of 1.22 eV Å−3. One material suggested by these SG rules has B0 higher than the highest value in the training dataset, PaCoOsO6, with a bulk modulus of 1.52 eV Å−3, respectively. This value is slightly higher than the highest B0 in the training dataset (1.49 eV Å−3). However, we note that the rules suggested several materials that turned out to have relatively low values of B0.
The distribution of B0 for the materials that were selected from the candidate materials space using the SG rules ,
, and
(Fig. 4, in blue) is concentrated in higher values compared to the other distributions, with a mean B0 value of 1.35 eV Å−3. Additionally, four of the materials suggested by the SG rules have B0 higher than the highest value in the training dataset. These are PaMnO3, Pa2CrFeO6, Pa2VCrO6, and PaVO3, with a bulk modulus of 1.67, 1.63, 1.59, and 1.55 eV Å−3, respectively. Thus, the SG rules lead to the identification of materials with B0 up to 13% higher than any observation in the training dataset. This result indicates that the SGs identified by the Pareto-region analysis can point at more exceptional materials compared to the standard SGD approach. In particular, in the present use case, materials that present higher performance than the known compounds of the training set were more efficiently selected using the focused SG rules provided by the multi-objective optimization of SGs. We note that in previous studies SGD was compared with global approaches such as decision trees in the context of materials science applications.34,55 The results indicate that SGD is more efficient in describing exceptional situations compared to the decision-tree approach. This can be related to the fact that the objective function of the decision tree aims at a good performance on average, while the objective function of SGD eqn (1) can focus on statistically exceptional situations.
The dataset utilized to train SGD and the verification of SGD suggestions are based on DFT-PBEsol calculations. Even though DFT-PBEsol presents an overall good accuracy for describing properties of solid materials such as the bulk modulus,56 the errors of DFT-PBEsol should be taken into account when comparing the reported bulk moduli with experimental results.
We note that the rules identified with SGD will be valid as long as the physical processes governing the materials in the training dataset also govern the behavior of the materials in the materials space to be explored. This is a crucial aspect, since the choice of the materials in the training dataset is often influenced by bias and the number of materials in the training set is very small compared to the practically infinite space of possible materials. In order to cover portions of the materials space where underlying processes different from those present in the training set are important, the incorporation of new data points and retraining of SG rules will be required. Indeed, exceptional SGs and phenomena might emerge in regions of the data space that are not sufficiently covered by the training dataset. Additionally, the identified SGs can be associated with genuinely exceptional phenomena, but they might also correspond to measurement artifacts when the data are generated by an experiment or calculation subjected to noise.61 These two situations would not be distinguished by SGD. However, by analyzing the SG rules and key identified parameters, one might be able to judge whether SGD identified correlations that have a physical meaning. For instance, the rules derived in Fig. 2(D) reflect that stronger bonds between atoms in the crystal result in short lattice constants, high cohesive energy, and a high bulk modulus. Additionally, SGD models for materials rely on the fact that the offered features correlate with the underlying physical processes governing the materials. Thus, the choice of features is critical. The performance of SGD can be assessed by cross-validation, as described in ref. 35. Finally, useful materials might present unusual combinations of different materials properties. These could also be considered exceptional materials. Identifying such materials calls for multi-objective optimization of materials properties.62–65 SGD can be adapted for this scenario and this aspect will be addressed in an upcoming contribution.
Footnote |
† Electronic supplementary information (ESI) available: Comparison of approaches for defining the Pareto region and analysis of the Pareto region of SG solutions based on similarity between SG rules and hierarchical clustering. See DOI: https://doi.org/10.1039/d5dd00174a |
This journal is © The Royal Society of Chemistry 2025 |