Open Access Article
Wei-Ting Tangab,
Ankush Chakrabartyc and
Joel A. Paulson
*ab
aUniversity of Wisconsin–Madison, Department of Chemical and Biological Engineering, Madison, WI, USA. E-mail: joel.paulson@wisc.edu
bThe Ohio State University, Department of Chemical and Biomolecular Engineering, Columbus, OH, USA
cInsulet Corporation, Acton, MA, USA
First published on 23rd June 2026
Novelty search (NS) aims to uncover diverse system behaviors through simulation or experiment without requiring a pre-specified scalar objective. This capability is especially relevant to modern discovery problems in chemistry, materials science, and molecular design, where researchers often seek broad coverage of attainable property space rather than a single optimum and where each evaluation may require a costly computation or experiment. For such expensive black-box settings, we propose BEACON, a sample-efficient NS strategy inspired by Bayesian optimization principles. BEACON models the input-to-outcome mapping using multi-output Gaussian processes and selects new inputs by scoring how far plausible posterior outcomes lie from a denoised archive of previously observed outcomes. This gives a distance-based novelty acquisition that accounts for predictive uncertainty and observational noise while operating directly in continuous outcome space, rather than requiring direct optimization over a discretized partition of behaviors. By leveraging efficient posterior sampling together with scalable high-dimensional Gaussian process models, the proposed framework can be extended to settings with large data sets and high-dimensional design variables. We demonstrate BEACON on established benchmark problems together with real-world case studies in materials and molecular discovery. Across these settings, BEACON consistently discovers broader sets of distinct behaviors than several competing baselines under limited evaluation budgets.
To formalize this setting, consider a black-box map f: X → O that takes user-defined system inputs x ∈ X to outcomes (or behaviors) f(x) ∈ O. Classical derivative-free search methods, including simplex methods, meta-heuristics, and Bayesian optimization (BO), are designed primarily to optimize some prescribed objective defined over the outcomes; see, for example.5–8 This objective-centric viewpoint is highly effective when the user knows what should be maximized or minimized. However, if the real aim is to understand the range of outcomes that the system can realize, then reducing the problem to a single scalar objective can be restrictive and, in some cases, fundamentally misaligned with the scientific goal.
A simple example arises in molecular/materials discovery. Suppose f maps a candidate design to a vector of experimentally relevant properties, such as solubility, uptake capacity, stability, or selectivity. If the objective is to identify new types of property combinations rather than only the single best design under one fixed score, then standard optimization becomes awkward. Any scalarization will prioritize certain regions of the outcome space and can easily overlook scientifically useful intermediate or rare behaviors. More broadly, the corresponding search landscape is often highly multi-modal and deceptive, making exhaustive exploration prohibitively expensive when each evaluation involves a simulation, synthesis, or physical measurement.
Novelty search (NS) offers a natural alternative by shifting the emphasis from objective maximization to the discovery of previously unseen behaviors.9,10 Instead of rewarding progress toward a predefined target, NS promotes candidates whose outcomes differ from what has already been observed. In the classical archive-based view, a candidate is novel when its behavior is far from the archive of behaviors discovered so far. This is the sense in which we use the term NS throughout this work. The broader goal is to build a discovery campaign that covers a wide range of attainable behaviors under a limited evaluation budget. We later measure this coverage using reachability, but the acquisition function itself operates in the continuous outcome space rather than over a discrete coverage count. This distinction lets BEACON use a continuous, distance-based notion of novelty while still evaluating success in terms of behavior-space coverage.
This perspective has proven useful in automated scientific discovery,1,2 reinforcement learning,11 and more recently, autonomous experimentation workflows that explicitly seek previously unobserved phenomena rather than only improved objective values.12 Most established NS algorithms, however, rely on meta-heuristics such as evolutionary search. Genetic algorithms, for example, have been used to promote behavioral diversity in deceptive landscapes;13 swarm-based methods have also been adapted for novelty-driven exploration;14 and MAP-Elites15 has become a widely used framework for illuminating trade-offs between diversity and performance. While powerful, these methods are typically sample-inefficient and therefore poorly matched to settings in which each function evaluation is costly.
In this work, we draw inspiration from BO to develop a sample-efficient Bayesian novelty search strategy for expensive black-box systems. Rather than placing a surrogate model on a scalar objective, we model the full input-to-outcome relationship using Gaussian processes (GPs).16 This distinction is important since the surrogate is used to reason directly about the outcomes a user wishes to explore, not to collapse them into a single utility function. As a result, the outcome space can be defined flexibly by the practitioner and may include multiple quantities of interest. In a drug discovery setting, for example, the outcome vector may encode efficacy, solubility, partitioning behavior, synthesizability, or stability, while in a materials setting it may encode multiple uptake or transport properties.
Our main contributions are summarized as follows:
• We introduce a new novelty search algorithm, referred to as BEACON (Bayesian Exploration Algorithm for outCOme Novelty), for noisy, expensive black-box systems. BEACON is designed to improve behavior-space reachability using a limited number of evaluations.
• We propose a Thompson sampling-based acquisition function for NS that scores candidate inputs by the distance between their sampled outcomes and a denoised archive of previously observed outcomes. This gives a continuous NS policy that accounts for predictive uncertainty and observation noise without requiring direct optimization over a fixed behavior partition.
• We extend BEACON to high-dimensional settings through both sparsity-inducing fully Bayesian priors and chemistry-aware surrogate modeling strategies.
• We demonstrate BEACON on a broad set of benchmark and real-world discovery problems, including metal–organic framework discovery for clean energy applications and molecular property discovery with input representations as large as 2133 dimensions.
Taken together, these results show that BEACON can substantially improve the efficiency of novelty-driven exploration in settings that are directly relevant to modern discovery workflows.
The rest of the paper is organized as follows. In Section 2, we define the black-box NS problem considered in this work. In Section 3, we present the required background on BO and multi-output GP regression and introduce key mathematical notation. Section 4 introduces the proposed BEACON algorithm, including the acquisition function, implementation details, and extensions for high-dimensional problems and user-guided behavior constraints. Section 5 presents benchmark comparisons and real-world case studies that illustrate the practical value of BEACON for sample-efficient discovery. Lastly, we provide some concluding remarks in Section 6
| f: X → O, f = (f(1),…,f(n)), |
to an n-dimensional outcome space
. In this work, the outcomes are the behaviors we wish to explore. A behavior may be a scalar property, a vector of measured properties, a terminal state, a generated image, or another application-specific representation chosen by the practitioner. The key assumption is that this representation supports a meaningful notion of behavioral similarity, either through a distance metric, a finite partition, or both.
Although the examples in this paper mostly define behaviors using measured or predicted outcomes, the behavior representation can also include known input-derived quantities when they are scientifically relevant. For example, if two chemistries produce similar measured outcomes but differ in cost, synthesis route, reaction conditions, or material availability, these quantities can be included in the behavior representation (or in the distance metric used by the novelty score defined in Section 4). Known quantities do not require GP modeling; they can be appended directly to the sampled outcome representation when evaluating novelty.
We use the term “discovery campaign” to refer to the full sequence of evaluations made under the budget T. The campaign-level goal is to choose this sequence so that the observed outcomes cover as much of the attainable behavior space as possible, rather than optimizing a single scalar objective. To measure this coverage, we introduce a finite behavior space B = {1, …, M} together with behavior regions {Cb}b∈B that partition the relevant outcome space. The behavior assignment map ϕ: O → B converts each continuous outcome to its behavior region index, so that ϕ(y) = b if y ∈ Cb.
At iteration t = 1, …, T, we query an input xt ∈ X and observe
| It = {ϕ(f(xi)):i = 1,…,t} ⊆ B |
![]() | (1) |
The corresponding sequential design problem is to choose a sampling policy that discovers as many distinct behavior regions as possible over the evaluation budget. One way to express this goal is through the following problem
![]() | (2) |
This distinction motivates the design of BEACON. A direct one-step reachability acquisition can be derived by maximizing the posterior probability that f(x) lands in a behavior region not yet reached. Section 1 of the SI derives this acquisition and provides a one-dimensional illustration comparing it with the BEACON novelty acquisition. This direct acquisition is natural when the behavior map ϕ is known in advance and the behavior space is small enough to enumerate. In many early-stage discovery settings, however, the practitioner may know how to represent and compare outcomes before knowing the right partition of outcome space into behaviors. Direct reachability acquisitions also require probability mass calculations over unreached behavior regions, which can become expensive or numerically delicate as the number of regions grows.
BEACON therefore uses NS in the archive-distance sense. Candidate outcomes are preferred when they are far from outcomes already observed. The acquisition operates directly in the continuous outcome space through a user-defined distance metric, while the finite behavior space B is used to evaluate coverage after samples have been collected. This evaluation space can be constructed using a uniform grid, an ε-cover, clustering, or user-defined behavior regions. The choice sets the resolution of the reachability metric. A smaller ε (or a finer partition more generally) distinguishes more outcomes and makes reachability stricter, while a coarser partition groups nearby outcomes together. Under the default BEACON acquisition, this choice affects the reported reachability and interpretation of the final discovered set, but not the sampling decisions themselves. We study the sensitivity of reported reachability to this behavior-space resolution in the SI.
Because f is expensive to evaluate, exhaustive exploration is generally impractical. We thus use a probabilistic surrogate for the input-to-outcome map. Specifically, BEACON places a multi-output Gaussian process (MOGP) prior on f. Given observations
| Dt = {(xi, yi)}ti=1, |
The BEACON framework does not require a particular MOGP construction. In most of the experiments reported here, we use independent-output GP models, meaning that each outcome coordinate is modeled with its own scalar GP. This choice keeps the surrogate modeling component simple and makes the empirical results focus on the proposed NS policy rather than on advances in multi-output modeling. Correlated-output models could be substituted when cross-output structure is important, but this is not the focus of the present study.
For conciseness, it is useful to write the vector-valued surrogate using an output-indexed representation. Define
| h(x, j) = f(j)(x), j ∈ J: = {1,…,n}, |
| κ((x, j), (x′, j′)) = δjj′κj(x, x′) | (3) |
Suppose we observe the indexed dataset A = {(xi, ji, yi)}i=1N, where yi is a noisy measurement of h(xi, ji). Under Gaussian observation noise, the posterior remains a GP,
, with mean and covariance functions16,18
| µA(x, j) = µ(x, j) + kA(x, j)⊤(KA + σ2IN)−1(y − µobsA), | (4a) |
| κA((x, j), (x′, j′)) = κ((x, j), (x′, j′)) − kA(x, j)⊤(KA + σ2IN)−1kA(x′, j′). | (4b) |
Here, y = [y1,…,yN]⊤, µobsA = [µ(x1, j1),…,µ(xN, jN)]⊤,
is the vector of covariance values between the test pair (x, j) and the observed indexed inputs in A, and
is the covariance matrix over the observed indexed inputs.
With the independent-output kernel in eqn (3), these posterior equations decouple across outputs. The posterior mean vector and covariance matrix for f(x) are then assembled from the scalar GP posteriors for each coordinate. This is the implementation used throughout the numerical studies (except for the MNIST case). In the chemistry and molecular examples, task-specific kernels are used for the relevant scalar GP components; these choices modify κj but do not change the BEACON acquisition principle.
It is worth noting that, if output correlations are important, the independent-output GP can be replaced by a correlated multi-output kernel or by a structured high-dimensional output model. For example, in the MNIST case study we use a high-dimensional, multi-task GP model19 to model the full set of 14 × 14 pixels in the decoded image.
Common acquisition functions include upper confidence bound (UCB),20 expected improvement (EI),21 and Thompson sampling (TS).22 TS is a randomized decision rule that draws a sample function g from the surrogate posterior and then selects the optimizer of that sample. In the minimization setting, this corresponds to choosing arg minx∈X g(x). TS often performs well in practice and is especially attractive in our setting because it naturally balances exploration and exploitation while remaining easy to adapt beyond classical objective-based optimization.23
BEACON borrows this posterior sampling idea from BO but applies it to NS rather than scalar objective optimization. Instead of drawing a posterior sample to identify a likely optimizer of a scalar objective, BEACON draws a posterior sample of the input-to-outcome map and scores candidate inputs by how novel their sampled outcomes are relative to the denoised archive of previously observed outcomes. This adaptation is formally introduced in the next section.
![]() | (5) |
are the k nearest previously observed outcomes to f(x) under the user-defined distance metric dist(·) over O.
Although intuitive, directly using eqn (5) is problematic in our setting for two reasons. First, f is unknown and expensive to query, so ρ(x|D) cannot be evaluated without performing the experiment or simulation at x. Second, when observations are noisy, novelty scores based directly on raw outcomes can be distorted by measurement noise. In particular, it may occur that
| ϕ(f(x) + η) ≠ ϕ(f(x)), |
To address these issues, we replace the true black-box map with its MOGP posterior surrogate f|D ∼ MOGP(µD, κD). The surrogate serves two purposes. First, it denoises previously observed outcomes by replacing raw observations with posterior mean predictions. Second, it allows us to account for uncertainty in unobserved regions of the input space through posterior sampling. Using Thompson sampling, we draw a sample function g(x) ∼ f|D and define the BEACON acquisition function as
![]() | (6) |
are the inputs corresponding to the k nearest previously sampled outcomes to g(x), measured relative to the denoised set {µD(x1),…,µD(xN)}. This acquisition promotes candidates whose plausible outcomes are far from what has already been observed, while naturally accounting for posterior uncertainty.
The acquisition in eqn (6) does not directly optimize the reachability metric in eqn (1). Instead, it is a continuous archive-distance novelty score used to choose inputs whose plausible outcomes are far from what has already been observed. This design is intentional. A direct reachability acquisition would require a fixed behavior map and would sum posterior probability mass over behavior regions not yet reached, as derived in the SI (Section 1). Such an acquisition is useful when the behavior partition is known and computationally manageable, but it can be less convenient when the practitioner can define meaningful outcome distances before knowing the right behavior partition. BEACON therefore uses the distance-based novelty score as a general-purpose policy for improving reachability while avoiding direct dependence on the evaluation discretization.
To generate g efficiently, we use the decoupled posterior sampling approach of Wilson et al.,24 which yields accurate and differentiable approximate GP samples. This is particularly useful because it allows the resulting acquisition function to be optimized with gradient-based methods. In other words, for each BEACON decision step, all random quantities defining the TS are drawn once and then held fixed while the acquisition function is optimized. Thus, the optimizer evaluates a deterministic sampled acquisition surface within that decision step.
We use Thompson sampling because it provides a simple pathwise mechanism for incorporating posterior uncertainty into the novelty score. Once a posterior sample of the input-to-outcome map is drawn, the same archive-distance novelty calculation can be applied regardless of the chosen outcome representation or distance metric. Alternative acquisitions, such as a UCB-style novelty score, are possible but would require additional modeling choices, tuning parameters, or numerical approximations. We do believe, however, that deriving and studying additional acquisition functions for NS is an interesting future work.
For clarity, Algorithm 1 is written in a standard sequential setting. However, because the method is built around posterior sampling, parallel and asynchronous variants can be developed using ideas from prior work on parallel (distributed) BO.23
Although the novelty score depends on a k-nearest-neighbor operation, it can be written in a form amenable to differentiation:
![]() | (7) |
The cost of evaluating eqn (7) scales linearly with N, the number of archived outcomes used in the nearest-neighbor calculation at the current decision step. In the experiments considered here, N remains modest because the evaluation budgets are small and each evaluation is assumed to be expensive. For larger campaigns, the archive can be compressed by retaining representative outcomes rather than every previously sampled outcome, or by using standard nearest-neighbor data structures. For example, if several observations fall in the same behavior region or are very close in outcome space, one can retain a representative point for that local region of the archive. The effectiveness of this reduction is problem dependent. With fine behavior partitions or higher-dimensional outcome spaces, new evaluations may continue to occupy distinct regions, in which case the representative archive can grow nearly as fast as the total number of samples. Thus, archive management becomes more important as evaluation budgets increase, while remaining a secondary cost in the expensive-evaluation settings emphasized in this work. We do show that BEACON is able to effectively scale to several thousand data points in Section 5 of the SI, especially when taking advantage of efficient large-scale GP implementations.
D case study presented later.Let
denote a user-defined partition of the outcome space into non-overlapping regions. This partition can be tailored to the application and need not match the discretization used to compute reachability. Let
denote the subset of regions that are either already explored or intentionally excluded by the user. UG-BEACON defines the feasible set
This modification allows BEACON to avoid redundant sampling and direct exploration toward user-relevant parts of the outcome space, while leaving the underlying surrogate model and posterior-sampling machinery unchanged. UG-BEACON can be viewed as replacing full-range coverage with a targeted notion of coverage over behavior regions that satisfy user-defined preferences. Practical examples and implementation details for constrained behavior discovery are provided in Section 3 of the SI.
Unless otherwise stated, all methods begin with 20 points sampled uniformly at random from X to initialize the models. The baselines include evolutionary novelty search (NS-EA),10 novelty search based on distance to explored area (NS-DEA),29 input-space novelty search (NS-FS), maximum posterior variance sampling (MaxVar), Sobol sequence sampling, and random search (RS), depending on which methods are applicable to the problem setting. Full implementation details are provided in Section 2 of the SI. Each method then selects an additional 80–300 evaluations, depending on the problem. All experiments are repeated 20 times unless otherwise noted. Performance is measured using reachability, the fraction of predefined behavior regions discovered after t evaluations. In all plots, we report the mean reachability together with ± one standard deviation across independent replicates. For problems defined over discrete candidate sets, the evolutionary NS baselines are not directly applicable, so comparisons are restricted to methods that can operate in that setting.
Additional problem-specific details and ablation studies on the impact of the behavior space resolution, neighborhood size k, and observation noise can be found in the SI (Section 4). To demonstrate the flexibility of the user-guided extension, we also consider two constrained discovery tasks using UG-BEACON: an MNIST digit discovery problem and an oil sorbent material design problem.
Fig. 2 summarizes the reachability results across these nine problems. BEACON consistently attains the highest reachability and often approaches complete coverage of the discretized behavior space within the allowed budget. The performance gap relative to the baselines generally widens as the input dimension increases, which is consistent with the expected difficulty of exploring high-dimensional spaces using purely input-space sampling or evolutionary heuristics.
To further test the multi-output setting, we also construct a synthetic Multi-Output Plus problem with a long-tailed two-dimensional outcome distribution. This function has six inputs and two outputs (full definition is in Section 3.1 of the SI). Fig. 3 (top) shows the distribution of outcomes induced by 10
000 uniformly sampled inputs from X = [−5, 5]6. Reachability is computed using a 10 × 10 grid over the two-dimensional outcome space. This example provides a controlled vector-valued coverage problem in which the outcome distribution can be visualized directly. It is useful because the attainable outcomes are highly non-uniform, i.e., a large fraction of the probability mass is concentrated near the center, while many behavior regions lie in low-density tails. This structure makes the problem challenging for NS because search procedures can easily oversample dense regions while missing the tails. As shown in Fig. 3 (bottom), BEACON substantially improves coverage of the long-tail outcome regions, reaching nearly 80% reachability, whereas competing methods remain below roughly 40%.
In each task, a candidate MOF is represented by a fixed-dimensional descriptor vector
, and the property or properties of interest define the outcome y = f(x) ∈ O. Descriptor definitions and outcome distributions are provided in Section 3 of the SI.
000 unique MOFs. Each MOF is represented by 7 real-valued descriptors, and the outcome is a scalar hydrogen uptake value.Fig. 4 compares BEACON with the applicable baselines on these three tasks. Because the candidate MOFs come from discrete datasets, the evolutionary NS baselines are not included here. Across all three problems, BEACON achieves the strongest reachability and, within the available sampling budget, attains full coverage of the discretized behavior space. These results highlight the value of sample-efficient NS for property-space exploration in chemically large but evaluation-limited settings.
encodes molecular structure using descriptor-based or high-dimensional molecular representations, while the outcome is a scalar molecular property. These examples thus test BEACON in high-dimensional molecular input spaces with scalar behavior-space coverage metrics.
D. Finally, we consider a log
D dataset containing 2070 molecules from Win et al.,36 each represented by 125 features. To address the higher dimensionality, we pair BEACON with the sparse axis-aligned subspace (SAAS) prior.27In these experiments, reachability is computed over the full observed property range to provide a neutral comparison of broad property-space coverage. For molecular campaigns where the scientific goal is to identify many candidates within a high-performing region, the behavior space can instead be restricted to a preferred property window, weighted toward high-value regions, or discretized more finely in those regions. The user-guided formulations in Section 5.5 illustrate how BEACON can be adapted to this targeted-coverage setting.
The results in Fig. 5 show that BEACON consistently achieves the highest reachability across all three tasks. The gains are especially pronounced in ESOL and log
D, where the use of task-appropriate surrogate structure is important for effective search. Overall, these results support the view that BEACON can be naturally combined with chemistry-aware GP models to improve sample-efficient exploration of molecular property space.
For BEACON and the NS baselines, the behavior outcome is the two-dimensional terminal position of the agent after executing the policy, and novelty is computed in this terminal-state space. Thus, these methods do not directly optimize reward. Instead, they search for policies that lead to new terminal states, which can help overcome the deceptive reward structure by encouraging broader exploration of the maze. In contrast, expected improvement (EI)37 is included as a reward-driven scalar-objective comparator; its GP surrogate models normalized reward as a function of the controller parameters, and its acquisition seeks policies with high expected reward improvement. We report normalized reward separately as a task-success metric because it measures progress toward the goal and enables comparison with this reward-driven baseline. A schematic of the maze and additional details are given in Section 3.4 of the SI.
As shown in Fig. 6, BEACON substantially outperforms both the novelty-based baselines and the reward-driven methods, including EI. The top panel reports the average best observed reward over 20 replicates, while the bottom panel shows the distribution of final best rewards. BEACON is the only method that solves the task across all replicates, demonstrating that sample-efficient NS can be particularly valuable when objective-based search is vulnerable to deception.
Taken together, these examples show that UG-BEACON can incorporate user guidance in a simple and effective way, making the framework useful not only for unconstrained NS, but also for targeted discovery tasks in which some outcome regions are more important than others.
In the most difficult case, the response may appear nearly flat over most of the input space but contain abrupt, weakly structured jumps. A standard stationary GP can easily over-smooth such behavior, and posterior samples from a misspecified model may not assign enough probability to the rare region to guide sampling there. More flexible surrogates can help when the sharp structure is learnable from the available data. Examples include non-stationary kernels or input warping for spatially varying length scales,40 adaptive hyperparameter treatments that reduce the risk of over-smoothing,41 deep kernel learning models that learn representations jointly with the GP,42 and methods designed to account for GP model misspecification.43 These approaches improve the modeling side of the problem, but they cannot overcome the absence of learnable structure. If a rare behavior is effectively uncorrelated with nearby sampled points and no available side information indicates where it may occur, then no GP-guided acquisition can reliably target it from a small number of evaluations; discovery is possible only through sufficiently broad exploration or chance sampling.
For prospective applications where missing such regions would be costly, BEACON can be combined with more deliberate exploration safeguards. These include broader space-filling initialization, maintaining an input-space diversity component, or adding scheduled exploration when reachability begins to plateau. Such modifications are directly compatible with the framework because they change the surrogate model or the candidate-generation strategy while preserving the core idea of scoring novelty in outcome space. Another potential strategy is to iteratively shrink the search region around the most novel previously discovered outcomes, thereby progressively increasing the resolution of the surrogate model in the sharp regions likely to contain highly novel outcomes.44 Developing these hybrid strategies is an important direction for future work, particularly for discovery problems where broad coverage must be balanced against reliable identification of rare high-performing regions.
Across synthetic benchmarks and real-world case studies in materials and molecular discovery, BEACON consistently achieved stronger reachability than competing methods under limited evaluation budgets. These results suggest that Bayesian surrogate modeling can make novelty-driven exploration practical in domains where traditional evolutionary NS approaches are too sample-inefficient to be useful.
There are several important directions for future work. From a methods standpoint, reducing computational overhead and developing stronger theoretical guarantees remain worthwhile challenges. From an applications standpoint, a particularly promising next step is to integrate BEACON-like strategies into closed-loop autonomous experimentation platforms, where the goal is not only to optimize a known target but also to uncover previously unseen phenomena or property regimes.12 We view this as an especially compelling opportunity for truly autonomous discovery in chemistry and materials science.
Supplementary information (SI): additional methodological and computational details supporting the main text. Section 1 presents a derivation of a direct reachability acquisition function and compares that to BEACON. Section 2 contains all the core implementation details for BEACON and the baseline methods. Section 3 provides additional details on the suite of benchmark and application problems. Section 4 discusses various ablation studies on the BEACON strategy. Finally, Section 5 discusses how to scale BEACON to problems with large evaluation budgets using sparse GP approximations. See DOI: https://doi.org/10.1039/d6dd00123h.
| This journal is © The Royal Society of Chemistry 2026 |