Open Access Article
Akhil S. Nair,*ab Lucas Foppaa and Matthias Schefflera
a The NOMAD Laboratory at the Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany
b Institut für Chemie und Biochemie, Freie Universität Berlin, Arnimallee 22, 14195 Berlin, Germany. E-mail: akhil.sugathan.nair@fu-berlin.de
First published on 27th January 2026
Bayesian optimization (BO) efficiently explores vast design spaces using probabilistic surrogate models, enabling the guided discovery of materials with desired properties. However, most BO frameworks rely on the knowledge of a few key physical parameters (features) correlated with the materials property of interest. This is a challenge in heterogeneous catalysis, where the material properties are governed by an intricate interplay of multiple physical processes, and these parameters are typically unknown. Here, we introduce the Sparse Adaptive Representation-based Bayesian Optimization (SARBO) framework that utilizes the sure independence screening and sparsifying operator (SISSO) symbolic-regression method for on-the-fly selection of key physical parameters correlated with materials properties during BO. Crucially, SISSO takes into account nonlinear relationships and interactions between multiple parameters when selecting key features. We demonstrate that SARBO enables efficient navigation of the materials spaces and outperforms widely used feature-selection approaches for the simulated discovery of single- and dual-atom alloy surface sites capable of activating CO2, a critical step in the CO2 reduction reaction.
Despite its successes in materials discovery applications,12–14 the efficiency of BO decreases as the number of features increases. This is because the search domain grows exponentially with the number of features, making it difficult to identify optimal regions within an affordable number of property evaluations.15 This is a limitation in fields such as heterogeneous catalysis, where the catalytic performance of a material is governed by an intricate interplay of multiple physical processes, and the key features correlated with the catalyst performance are often unknown. In such cases, BO might expend significant resources exploring irrelevant portions of the design space. In order to mitigate this “curse of dimensionality” challenge in BO, several methods have been proposed in which the surrogate model parameters can be tuned to focus on the features that matter most, ignoring irrelevant ones.16,17 However, such methods (e.g., automatic relevance determination16) are typically tied to a specific surrogate model (e.g., GP) and therefore are not easily transferable to other model classes. Feature-selection (FS) methods, in turn, automatically identify a few relevant features out of the many potentially relevant features offered. Then, the surrogate model uses only these most relevant features during BO.18 However, most commonly used FS methods rely on linear correlation metrics and consequently fail to capture complex, non-linear feature interactions. Therefore, they may neglect features that appear redundant or weakly correlated with the target property when considered individually, but that might become relevant when they are combined with other features. In addition, the FS is typically performed based on the initial data set, and this fixed set of features is utilized throughout the BO iterations.
This might be an issue in materials science applications, since the relevance of features can change as new materials are acquired and included in the training data set. Indeed, the initial training set might not be representative of the entire design space. Thus, BO with a fixed feature set may overlook relevant materials. To address this issue, Rajabi-Kochi et al. introduced adaptive FS within each BO iteration, demonstrating improved performance over traditional BO approaches.19 However, this framework did not account for the non-linear interactions between features. As a result, even with adaptive representations, features which are potentially relevant to describe the target property might be excluded, causing FS methods to function as bad advisors, limiting the effectiveness of BO.
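The limitation of per-feature linear correlation discussed above can be illustrated with a toy numerical example (the data here are synthetic, not from the study): a target governed purely by the product of two features appears uncorrelated with each feature individually, yet is perfectly correlated with the combined (interaction) feature.

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.normal(size=2000)
x1 = rng.normal(size=2000)
y = x0 * x1  # target governed purely by a feature interaction

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

r0 = pearson(x0, y)        # each feature alone looks irrelevant (|r| ~ 0)
r1 = pearson(x1, y)
r01 = pearson(x0 * x1, y)  # the interaction feature is perfectly correlated
```

A linear-correlation FS method ranking x0 and x1 individually would discard both, whereas a method that considers feature combinations retains them.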
In this work, we present the Sparse Adaptive Representation-based Bayesian Optimization (SARBO) framework to address the curse of dimensionality faced by standard BO approaches (Fig. 1). This framework utilizes the Sure-Independence Screening and Sparsifying Operator (SISSO) symbolic-regression method for on-the-fly selection of features, taking into account nonlinear target-feature relationships and interactions between features.20 SISSO identifies, at each iteration of the BO, a small set of physico-chemical features correlated with the target property, also termed materials genes.21 Then, a surrogate model, e.g., GP, RF or the SISSO model itself, is trained only with the features selected by SISSO. This strategy refines the design space description on-the-fly. In addition, SISSO is an inherently interpretable method in the sense that it provides models as analytical expressions or descriptors. These expressions make the most relevant correlations between the features and the target property explicit. In prior work,22 we have demonstrated how SISSO can be directly used as a surrogate model for BO, combining its feature selection and predictive ability with uncertainty estimates obtained from ensembles of SISSO models. In the current study, we repurpose SISSO as a general FS method that can be utilized with any surrogate model. In particular, SISSO-based FS can be combined with GP, taking advantage of the principled uncertainty estimates provided by this approach. We critically evaluate the performance of the SARBO approach against standard BO integrating alternative FS strategies, demonstrating its superior performance in identifying the surface sites of single- and dual-atom alloys for CO2 activation, a critical step in the CO2 reduction reaction.
ΔdC–Omax = dC–Ochem − dC–Oeq  (1)
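As an illustration, eq. (1) can be evaluated in a few lines of Python. The function name and the bond-length values in the usage line are hypothetical, and taking the longer of the two C–O bonds as dC–Ochem is our reading of the “max” superscript:

```python
def delta_d_co_max(d_co_chem, d_co_eq):
    """Eq. (1): maximum C-O bond elongation upon CO2 chemisorption (Å).

    d_co_chem : iterable of the C-O bond lengths in the chemisorbed CO2
    d_co_eq   : equilibrium C-O bond length of gas-phase CO2
    """
    # the longer C-O bond gives the maximum elongation
    return max(d_co_chem) - d_co_eq

# hypothetical bond lengths, for illustration only
elongation = delta_d_co_max([1.30, 1.25], 1.16)
```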
The 26 offered features can be classified into five types: elemental properties of the (i) host and (ii) dopant, such as d-orbital radius and electronegativity; (iii) properties related to the CO2 adsorption, such as its adsorption energy; (iv) elemental and geometrical properties of the surface sites, such as averaged d-orbital radius and coordination number; and (v) properties of the surface sites plus nearest neighbours, such as the generalized coordination number42 (see Table S1 for the full list of features). It should be noted that adsorption-related properties are included as features to model ΔdC–Omax because they are required to achieve accurate SISSO models. For example, including these features increased the training performance from an R2 of 0.4 to 0.74. This approach, however, presupposes knowledge of adsorption properties that are only available after performing DFT calculations. This reliance could, in principle, be alleviated by increasing the model complexity or by adopting the hi-SISSO strategy,43 although we find that the latter only slightly improves the performance of the model in this case (see Fig. S1). Because our focus is on evaluating optimization performance using a dataset in which these properties are available for all surface sites, rather than predicting ΔdC–Omax for previously unseen surface sites, we did not explore further alternatives to circumvent the use of the adsorption-related properties. For the case of DAAs, we have only considered bridge sites between the two dopant atoms for CO2 adsorption and hence the elemental properties of the dopant are taken as the average of the properties of the two atoms. This dataset is particularly suitable for our study as ΔdC–Omax could be governed by multiple factors of various origins, such as dopant and host chemistry, surface structure and local coordination.
Thus, the different SAAs and DAAs may require descriptions based on different features, making this a useful test case for adaptive feature selection.
The SARBO framework operates on two sets of materials: (i) a set of materials for which the target property is known and (ii) a large pool of unlabelled candidate materials for which the target property is not known. Each material x is represented by a set of initially offered features ϕtotal.
The closed-loop SARBO proceeds as follows:
(1) Descriptor learning with SISSO: the core component of SARBO utilizes SISSO for adaptive feature selection. During each BO iteration, SISSO processes the current training data to construct an extensive space of candidate features Φ by applying a set of mathematical operations to ϕtotal. These operations are unary (log, exponential, square) and binary (addition, subtraction, multiplication). Thus, they capture nonlinearities and interactions between features. Then, the sure-independence screening procedure followed by ℓ0-regularized regression is applied to select a sparse linear model of the form y(x) = c0 + Σi ci di(x), where the descriptors di are a few features selected from Φ and the ci are fitted coefficients.
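A minimal Python sketch of this descriptor-learning step follows, under simplifying assumptions: a toy operator set, and an exhaustive search over low-dimensional linear models as a brute-force stand-in for ℓ0-regularized regression. The function names are ours for illustration, not the SISSO package's API:

```python
import itertools
import numpy as np

def build_feature_space(X, names):
    """Apply unary (square, log) and binary (+, -, *) operations
    to the primary features to build the candidate space Phi."""
    feats, labels = [], []
    n = X.shape[1]
    for i in range(n):
        x = X[:, i]
        feats += [x, x ** 2]
        labels += [names[i], f"({names[i]})^2"]
        if np.all(x > 0):  # log only defined for positive features
            feats.append(np.log(x))
            labels.append(f"log({names[i]})")
    for i, j in itertools.combinations(range(n), 2):
        feats += [X[:, i] + X[:, j], X[:, i] - X[:, j], X[:, i] * X[:, j]]
        labels += [f"{names[i]}+{names[j]}",
                   f"{names[i]}-{names[j]}",
                   f"{names[i]}*{names[j]}"]
    return np.column_stack(feats), labels

def sis(Phi, y, k=20):
    """Sure-independence screening: keep the k candidate features
    most correlated (in absolute value) with the target."""
    c = np.nan_to_num(np.abs(np.corrcoef(Phi.T, y)[-1, :-1]))
    return np.argsort(c)[::-1][:k]

def l0_select(Phi, y, idx, dim=2):
    """Exhaustive search over all 'dim'-feature linear models among the
    screened features -- a brute-force proxy for l0-regularized regression."""
    best, best_err = None, np.inf
    for combo in itertools.combinations(idx, dim):
        A = np.column_stack([Phi[:, list(combo)], np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        err = float(np.sum((A @ coef - y) ** 2))
        if err < best_err:
            best, best_err = combo, err
    return best
```

For a toy target y = a·b, this procedure recovers the product feature a*b as the one-dimensional descriptor, which per-feature correlation alone would miss.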
(2) Adaptive representation: the SISSO model is used to redefine the design space representation for subsequent BO iterations. The materials are represented using only the set of features ϕrelevant ⊂ ϕtotal that appear in the SISSO model. This creates a minimal, relevant subset of the original set of features.
(3) Surrogate modelling: a surrogate model f(x) is trained on the current training data, represented only by ϕrelevant, to approximate the objective function. The model provides a mean prediction µ(x) and an uncertainty estimate σ(x) for all candidate materials. This work demonstrates SARBO using GP and RF as surrogates, where σ(x) is estimated as the standard deviation of the posterior distribution and of the predictions across ensemble trees, respectively.
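For the GP surrogate, µ(x) and σ(x) can be sketched with a textbook Gaussian-process posterior. This is a minimal NumPy implementation with a fixed RBF kernel; the hyperparameter values are illustrative, whereas a practical implementation would optimize them:

```python
import numpy as np

def rbf_kernel(A, B, length=1.0, var=1.0):
    """Squared-exponential (RBF) kernel between two sets of points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X_train, y_train, X_pool, length=1.0, var=1.0, noise=1e-6):
    """Posterior mean mu(x) and standard deviation sigma(x) of a GP surrogate."""
    K = rbf_kernel(X_train, X_train, length, var) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_train, X_pool, length, var)
    Kss_diag = np.full(len(X_pool), var)  # diagonal of k(x, x) for the RBF kernel
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    sigma = np.sqrt(np.clip(Kss_diag - (v * v).sum(axis=0), 0.0, None))
    return mu, sigma
```

For the RF surrogate, σ(x) is analogously obtained as the standard deviation of the per-tree predictions of the ensemble.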
(4) Acquisition and evaluation: we employ expected improvement (EI)45 and upper confidence bound (UCB)46 as acquisition functions, as they balance exploitation and exploration in BO. For both acquisition functions, the next candidate(s) are selected by maximizing the acquisition value, xnext = arg maxx α(x), with α ∈ {EI, UCB}.
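Both acquisition functions admit closed forms under the surrogate's Gaussian posterior; a minimal NumPy sketch for a maximization objective (the σ = 0 jitter and the default κ are our choices, not the study's settings):

```python
import math
import numpy as np

def _pdf(z):
    """Standard normal probability density."""
    return np.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, y_best):
    """EI(x) = E[max(f(x) - y_best, 0)] under a Gaussian posterior."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    z = (mu - y_best) / np.maximum(sigma, 1e-12)  # jitter guards sigma = 0
    ei = (mu - y_best) * _cdf(z) + sigma * _pdf(z)
    return np.where(sigma > 0.0, ei, np.maximum(mu - y_best, 0.0))

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB(x) = mu(x) + kappa * sigma(x); kappa sets the exploration strength."""
    return np.asarray(mu, float) + kappa * np.asarray(sigma, float)

def select_batch(acq_values, b=5):
    """Indices of the b candidates with the highest acquisition value."""
    return np.argsort(acq_values)[::-1][:b]
```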
MRMR is performed with the corresponding software package.51 We trained GP and RF models for the target ΔdC–Omax by splitting the dataset into 80% train and 20% test data. The test-set performance of the models obtained with the features selected by the methods SISSO, RFE, MI, and MRMR is shown in Table 1, along with a baseline (labelled NoFS) corresponding to using the entire set of features.
Values in parentheses are ratios relative to the NoFS baseline.

| FS method | R² (GP) | R² (RF) | R² (SISSO-SM) | MAE (Å) (GP) | MAE (Å) (RF) | MAE (Å) (SISSO-SM) | RMSE (Å) (GP) | RMSE (Å) (RF) | RMSE (Å) (SISSO-SM) |
|---|---|---|---|---|---|---|---|---|---|
| NoFS | 0.763 | 0.702 | 0.753 | 0.012 | 0.013 | 0.013 | 0.017 | 0.019 | 0.017 |
| SISSO-FS | 0.844 (1.106) | 0.719 (1.024) | 0.753 (1.0) | 0.010 (0.835) | 0.013 (0.970) | 0.013 (1.0) | 0.014 (0.812) | 0.019 (0.971) | 0.017 (1.0) |
| MI | 0.644 (0.844) | 0.646 (0.920) | 0.695 (0.923) | 0.017 (1.411) | 0.015 (1.111) | 0.015 (1.141) | 0.021 (1.225) | 0.021 (1.091) | 0.019 (1.111) |
| RFE | 0.650 (0.852) | 0.655 (0.933) | 0.648 (0.861) | 0.016 (1.395) | 0.015 (1.087) | 0.017 (1.321) | 0.021 (1.215) | 0.021 (1.077) | 0.021 (1.193) |
| MRMR | 0.671 (0.880) | 0.597 (0.850) | 0.648 (0.861) | 0.015 (1.269) | 0.016 (1.157) | 0.017 (1.321) | 0.020 (1.177) | 0.022 (1.164) | 0.021 (1.193) |
Let us first examine the performance of the GP models across the different FS methods. GP models trained on SISSO-selected features (labelled SISSO-FS) show the best performance among all methods, achieving the highest R2 (0.844) and the lowest MAE (0.01 Å) and RMSE (0.014 Å) values. This represents an improvement over the full-feature baseline. In contrast, the MI, RFE, and MRMR approaches all lead to reduced GP performance, with consistently higher errors. A similar trend is observed for the RF models: SISSO-based FS again provides the highest performance (R2 = 0.719, MAE = 0.013 Å, RMSE = 0.019 Å). However, the relative gains are more modest than for GP, reflecting the robustness of RF to irrelevant or redundant features via ensemble averaging. The remaining FS methods once again underperform relative to the full-feature baseline, exhibiting lower R2 and higher MAE and RMSE values. When SISSO is used as a surrogate model (labelled SISSO-SM) with its default FS, it shows an intermediate performance between GP and RF. When combined with other FS methods, SISSO-SM generally performs better than RF for MI but is slightly worse than GP, and for RFE and MRMR, it tends to underperform compared to GP while being comparable to or slightly worse than RF. Table S2 shows the features selected by each FS method. While an overlap is observed for the selected features, only SISSO-FS captures effects associated with the host surface, single-atom dopant, surface site and the coordination environment, which are likely critical for describing CO2 activation. In contrast, the other FS methods systematically overlook at least one of these key factors. The enhanced performance observed with SISSO-FS compared to using the full set of features can be attributed to the reduction of redundancy and irrelevant information present in the full feature set.
By relying on the most informative features, surrogate models can more effectively capture the underlying relationships, leading to improved prediction accuracy and generalization.
Overall, these results highlight SISSO as the most effective FS strategy among the considered methods, and GP as the surrogate model achieving the highest predictive performance, particularly when combined with SISSO-selected features. This suggests that even when using non-SISSO surrogate models such as GP or RF, employing SISSO solely for feature selection can provide substantial gains, as it identifies the most informative features while leveraging the flexibility of other surrogate models to learn complex feature-target relationships. Hence, in the following discussion, we focus on GP and RF surrogate models.
A fraction ftrain of the dataset is randomly selected for training, while the remaining data constitute a hidden candidate pool. This approach mimics real-world materials discovery scenarios where the optimal candidates are unknown a priori and has been adapted for comparing and benchmarking various BO methodologies in previous studies.52–54 The optimization is performed for Niter iterations. At each iteration, the surrogate models are trained and provide predictions for the target ΔdC–Omax, along with uncertainty estimates σ(x). The acquisition functions (α ∈ {EI, UCB}) then select a batch (b) of candidates xnext = arg max α(x), whose values of the target ΔdC–Omax are added to the training dataset. Finally, the FS method selects the relevant features for the next iteration. Since the optimization objective is the maximization of ΔdC–Omax, we reserve a fraction (ftop) of the top candidates (highest ΔdC–Omax values) in the candidate pool, excluding them from the initial training set. We monitor the performance of the BO using two metrics: the highest observed ΔdC–Omax value at each iteration and the cumulative count of surface sites improving the target property, i.e., with ΔdC–Omax values larger than the best known value in the previous iteration. Each BO campaign is performed Ntrial times with different randomly selected initial training sets. Thus, confidence intervals are obtained for the BO campaigns. For all the results presented in the manuscript, we consider ftrain = 0.1, ftop = 0.2, b = 5, Niter = 40 and Ntrial = 10, unless stated otherwise.
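The campaign described above can be sketched end-to-end as follows. This is a deliberately simplified stand-in (a bootstrap ensemble of linear fits instead of the GP/RF surrogates, UCB acquisition, no adaptive FS and no ftop hold-out), meant only to make the loop and the two monitoring quantities concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_predict(X_tr, y_tr, X_pool, n_models=30):
    """mu(x), sigma(x) from an ensemble of linear fits on bootstrap
    resamples -- a lightweight stand-in for the GP/RF surrogates."""
    A_pool = np.column_stack([X_pool, np.ones(len(X_pool))])
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y_tr), len(y_tr))
        A = np.column_stack([X_tr[idx], np.ones(len(idx))])
        coef, *_ = np.linalg.lstsq(A, y_tr[idx], rcond=None)
        preds.append(A_pool @ coef)
    P = np.stack(preds)
    return P.mean(axis=0), P.std(axis=0)

def bo_campaign(X, y, f_train=0.1, b=5, n_iter=40, kappa=2.0):
    """Pool-based BO campaign returning the best-so-far curve and the
    cumulative count of target-improving acquisitions."""
    n = len(y)
    train = rng.choice(n, max(2, int(f_train * n)), replace=False).tolist()
    pool = [i for i in range(n) if i not in train]
    best_curve, n_improved = [max(y[train])], 0
    for _ in range(n_iter):
        if not pool:
            break
        mu, sigma = ensemble_predict(X[train], y[train], X[pool])
        ucb = mu + kappa * sigma                      # acquisition values
        picks = [pool[j] for j in np.argsort(ucb)[::-1][:b]]
        for i in picks:                               # "evaluate" the batch
            if y[i] > best_curve[-1]:
                n_improved += 1
            pool.remove(i)
            train.append(i)
        best_curve.append(max(y[train]))
    return best_curve, n_improved
```

On a synthetic linear target, the best-so-far curve is non-decreasing by construction and the campaign eventually recovers the global optimum once the pool is exhausted.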
We begin by comparing the performance of SARBO against standard BO under two distinct scenarios. The first is a baseline where BO operates on the full feature set without any selection (NoFS). The second, labelled Static, employs a SISSO-derived feature set determined once at the outset and held fixed throughout all BO iterations. This static condition represents a data-rich ideal, where sufficient data exists a priori for feature selection, a scenario uncommon in typical BO experiments but plausible when leveraging pre-existing data from low-fidelity simulations or a closely related property.
Fig. 3 shows the results for the GP surrogate model and the EI acquisition function. The evolution of the highest ΔdC–Omax values identified by BO with the number of iterations is shown in the left panel. The highest ΔdC–Omax value, around 0.25 Å, is identified at Niter = 17 by SARBO. Conversely, Niter = 37 iterations are needed for the standard BO approach utilizing the full set of features. While the static feature-selection method shows a marginal performance improvement over the standard BO in the initial and final stages, it still requires Niter = 38 to achieve the highest target value. We also analyzed the evolution of the number of surface sites with improved ΔdC–Omax identified by BO with the number of iterations (Fig. 3, right panel). SARBO exhibits a small number of early, high-impact improvements and quickly reaches the vicinity of the global optimum, after which few further improvements are possible. In contrast, both standard and static feature-set-based BO accumulate many incremental improvements over a larger number of iterations, reflecting slower progress through lower-quality regions of the design space. Overall, these results demonstrate that SARBO achieves substantially faster convergence toward the optimal region of the design space, indicating that the adaptively selected features from the SISSO models provide a more informative and efficient representation of the design space for the surrogate model. Although an initial, one-time feature selection can improve performance, it is ultimately outperformed by SARBO. This static approach also carries the prerequisite of a substantial initial dataset for the feature selection itself. These results collectively underscore the critical advantage of SARBO’s adaptive feature selection.
We now compare the performance of various FS methods when integrated into an adaptive BO loop like the one used in SARBO, for the case of the GP surrogate model with the EI acquisition function. Fig. 4 compares SARBO with BO approaches that employ MI and MRMR. Additionally, we also consider a random-selection baseline (RANDOM) where features are selected randomly without any measurement of feature importance. For consistency, all methods use the same number of features selected by SARBO. It can be seen that SARBO outperforms the other FS strategies in identifying surface sites with the highest ΔdC–Omax. Indeed, SARBO reaches the maximum mean value around 0.25 Å by iteration Niter = 17, whereas the other methods require Niter = 38 to achieve the same performance. This advantage is already evident at early iterations: at Niter = 10, the highest ΔdC–Omax values are 0.20 (RANDOM), 0.21 (MI), 0.19 (MRMR), and 0.24 Å (SARBO). A similar trend appears in the cumulative number of improved candidates. SARBO identifies improved candidates earlier and with larger mean values relative to the other methods. Although MI and MRMR show slight improvements over SARBO at later iterations, these gains occur only after substantially more evaluations, indicating that they require greater sampling effort for marginal benefit. These results are consistent with the ability of SISSO to explore large spaces of possibly nonlinear expressions, thus incorporating important nonlinear interactions when selecting relevant features.
To assess the generality of the performance trends observed with the GP-EI setup, we have also performed BO runs using RF as the surrogate model and UCB as the acquisition function (Fig. S2). Compared to GP-EI, the RF-UCB-based BO campaigns show slightly faster convergence in early iterations: the highest ΔdC–Omax values are identified at Niter = 7 instead of Niter = 17 for SARBO. Nevertheless, the performance ranking of BO with the different FS methods remains the same: SARBO > MI > MRMR > RANDOM. These results suggest that SARBO’s adaptive feature selection provides benefits for BO experiments involving alternative choices of surrogate models and acquisition functions.
To understand the origin of the difference in BO performance of the different FS methods, we investigated which types of features were selected by each FS method. The five types of features contained in the dataset are host (8 features), dopant (7 features), CO2 adsorption (2 features), surface site (4 features), and local coordination environment (5 features) – see Methods section for details. We computed the total frequency of selection for each feature category per trial, summing over all 40 iterations shown in Fig. 4. The box plots in Fig. 5 show the distribution of these aggregated counts from the 10 trials. Distinct selection patterns emerge across the different FS methods. The random feature selection exhibits the expected uniform distribution across categories, serving as an unbiased baseline. SARBO demonstrates a strong preference towards host- and adsorption-related features, followed by local-coordination-related ones. MI primarily selects surface-site-related features, whereas MRMR prefers SA-related features. All methods select adsorption-related features to some extent, consistent with their direct correlation to the target property ΔdC–Omax. The distinctive feature-selection pattern of SARBO suggests it identifies alternative physical relationships that may capture different aspects of the catalytic behavior. This unique feature-selection pattern enables SARBO to effectively guide the optimization toward promising regions of the design space, as evidenced by its superior performance in identifying CO2-activating sites.
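The per-trial aggregation behind this analysis reduces to a frequency count over categories; a small sketch follows (the feature names and the category map are illustrative, not the dataset's actual labels):

```python
from collections import Counter

# Hypothetical feature-to-category map, for illustration only.
CATEGORY = {
    "r_d_host": "host", "EN_dopant": "dopant", "E_ads_CO2": "adsorption",
    "CN_site": "surface site", "GCN": "coordination",
}

def category_counts(selected_per_iteration):
    """Total selection frequency per feature category for one BO trial,
    summed over all iterations."""
    counts = Counter()
    for feats in selected_per_iteration:
        counts.update(CATEGORY[f] for f in feats)
    return counts
```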
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d5fd00159e.
This journal is © The Royal Society of Chemistry 2026