Open Access Article
Maytham Aqeeli, Thatchathon Leelawat and David Shorthouse*
UCL School of Pharmacy, 29-32 Brunswick Square, London, WC1N 1AX, UK. E-mail: d.shorthouse@ucl.ac.uk
First published on 26th May 2026
Efficient optimisation of complex experimental systems is a central challenge in modern discovery science, particularly in settings characterised by high-dimensional design spaces, expensive evaluations, and multiple competing objectives. Multi-objective Bayesian optimisation (MOBO) has emerged as a leading approach for such problems due to its sample efficiency, but can suffer from limited exploration and reduced diversity, especially in many-objective, multimodal, and constrained settings. Evolutionary algorithms, by contrast, excel at maintaining diversity across the Pareto front but typically require large evaluation budgets. Here, we systematically investigate hybrid evolutionary-Bayesian optimisation strategies that combine the strengths of both approaches. Building on the Evolutionary Guided Bayesian Optimisation (EGBO) framework, we benchmark multiple evolutionary generators within a unified acquisition-driven pipeline across ten synthetic test problems spanning multimodal, many-objective, and constrained regimes. We further introduce a novelty-aware batch selection strategy that explicitly promotes diversity within candidate batches while retaining model-guided prioritisation. Across benchmarks, hybrid methods consistently outperform acquisition-only MOBO in challenging optimisation regimes, achieving improved hypervolume, lower inverted generational distance, and more reliable convergence. Gains are most pronounced in many-objective and multimodal problems, as well as in feasibility-limited search spaces. However, performance advantages diminish in very high-dimensional feature spaces, where evolutionary exploration reduces sample efficiency. The proposed novelty-aware selection further improves performance by reducing redundancy within batches and mitigating optimisation stagnation. Importantly, these trends translate to real-world experimental datasets spanning reaction optimisation, pharmaceutical formulation, materials design, and drug screening. 
Together, these results demonstrate that hybrid evolutionary-Bayesian optimisation provides a robust and practical strategy for improving optimisation performance in autonomous and data-driven discovery workflows.
Bayesian optimisation (BO) has emerged as a dominant paradigm for this setting, owing to its ability to guide data-efficient exploration using probabilistic surrogate models.8–10 In particular, multi-objective Bayesian optimisation (MOBO) methods based on hypervolume improvement, such as expected hypervolume improvement (EHVI) and its batch variants (e.g., qLogNEHVI), are widely used in SDL workflows.11 These approaches typically employ Gaussian process models (or model ensembles) to estimate uncertainty and select candidate experiments that maximise expected improvement of the Pareto front, representing optimal trade-offs between competing objectives. As a result, MOBO has been successfully applied across a wide range of discovery problems, including reaction optimisation,12,13 materials design,14,15 and pharmaceutical formulation.16
However, despite these successes, MOBO methods rely fundamentally on acquisition function maximisation, which prioritises regions of high expected improvement at each iteration. While effective in low-dimensional or well-behaved settings, this strategy introduces an inherent bias toward greedy, model-driven exploitation, rather than exploration of the experimental state space. In more complex scenarios such as many-objective optimisation, multimodal landscapes, or problems with complex feasibility constraints, this can lead to insufficient exploration of the Pareto front and reduced diversity in proposed solutions. In particular, hypervolume-based acquisition functions may struggle to adequately represent trade-offs across many objectives, resulting in premature convergence to limited regions of the objective space.
This limitation is especially critical in real-world discovery workflows, where identifying a diverse set of viable candidates is often more valuable than locating a single optimum. Motivated by this limitation, recent work by Low et al.17 explored incorporating evolutionary search into acquisition-driven optimisation within autonomous laboratory workflows to improve diversity. These hybrid approaches have since been adopted in emerging experimental optimisation settings.18,19 More broadly, evolutionary multi-objective optimisation (EMO) methods, such as AGE-MOEA-II,20 U-NSGA-III,21,22 and SMS-EMOA,22 are explicitly designed to maintain diversity across the Pareto front through population-based search and non-dominated sorting. These approaches are well suited to many-objective problems and complex constraint landscapes, as they explore multiple regions of the design space simultaneously.23 However, evolutionary methods typically require large numbers of function evaluations, making them less suitable for settings where experiments are expensive or time-consuming.
This creates a fundamental trade-off: Bayesian optimisation is sample-efficient but diversity-limited, whereas evolutionary algorithms are diversity-rich but evaluation-intensive. Bridging this gap represents a key opportunity for advancing optimisation in autonomous discovery. Low et al. recently introduced Evolutionary Guided Bayesian Optimisation (EGBO),17 a hybrid framework coupling evolutionary candidate generation with acquisition-driven Bayesian optimisation. In this architecture, candidate solutions are generated from two complementary sources: (i) acquisition function optimisation using qLogNEHVI, targeting regions of high expected hypervolume improvement, and (ii) an evolutionary search, promoting diversity across the objective space. These candidates are then jointly evaluated under a unified acquisition-based ranking, and only the most promising subset is selected for evaluation. This competitive candidate generation approach allows exploitation- and exploration-driven proposals to coexist and be assessed on equal footing, enabling the optimisation process to balance local refinement with broader Pareto-front exploration.
However, even within hybrid frameworks such as EGBO, the candidate selection step remains vulnerable to redundancy: when the acquisition function consistently favours dense regions of the current Pareto front approximation, both evolutionary and BO-derived candidates can converge toward similar regions of the decision space, leading to stagnation and inefficient use of the experimental budget – a limitation that has motivated several recent efforts to incorporate diversity and novelty signals into BO frameworks. ROBOT24 introduced rank-ordered trust regions to discover high-performing solutions satisfying a user-specified diversity constraint, demonstrating that explicitly promoting solution diversity can improve robustness to post-hoc feasibility constraints in single-objective settings. BEACON25 proposed a sample-efficient novelty search algorithm built on multi-output Gaussian processes, selecting candidates by maximising a novelty metric derived from posterior samples to systematically uncover diverse system behaviours in expensive black-box systems. SANE26 developed a cost-driven probabilistic acquisition function to navigate multimodal, non-differentiable single-objective landscapes, integrating a domain knowledge gate to distinguish true from spurious optima; this approach has since been extended to autonomous microscopy applications27 illustrating the growing relevance of diversity-aware active learning across physical sciences SDL platforms. Collectively, these works highlight that standard acquisition-driven BO can suffer from insufficient exploration of the broader solution space, and that incorporating diversity or novelty signals meaningfully improves campaign outcomes.
However, these approaches are largely designed for either pure novelty search or single-objective multimodal optimisation, and do not directly address the problem of Pareto front stagnation in many-objective constrained settings – a challenge that becomes particularly acute in closed-loop SDL campaigns where experimental budgets are limited and redundant candidate selection represents a direct cost. How best to integrate novelty-aware selection into a hybrid evolutionary-Bayesian optimisation framework, without sacrificing the acquisition quality that drives convergence toward the Pareto front, remains an open question.
In this work we build upon the EGBO framework in two ways. First, we provide a systematic evaluation of hybrid evolutionary-Bayesian optimisation strategies across a wide range of optimisation regimes, including many-objective, multimodal, and constrained problems. Second, we introduce a simple novelty-aware batch-selection strategy designed to improve exploration efficiency within the EGBO framework by reducing redundancy within selected batches. We find that simply weighting the final sample acquisition by novelty improves optimisation efficiency in the low-sample regimes typical of self-driving labs.
Our results demonstrate that hybrid evolutionary-Bayesian strategies provide substantial benefits in challenging optimisation regimes common to self-driving laboratories, particularly in many-objective, multimodal, and feasibility-limited problems where maintaining diversity is critical, while remaining competitive on simpler tasks. We also identify settings in which these gains diminish, particularly in very high-dimensional feature spaces. Taken together, this work provides a unified perspective on hybrid optimisation in data-driven discovery, clarifying when and why such approaches are effective and offering practical guidance for their use in self-driving laboratories and related experimental workflows.
In a conventional multi-objective Bayesian optimisation campaign, candidate experiments are selected by maximising an acquisition function, such as noisy expected hypervolume improvement (qLogNEHVI), over the design space (Fig. 1A). To introduce diversity in candidate generation, EGBO augments this framework with an evolutionary search that independently proposes additional candidate solutions (Fig. 1B). At each optimisation iteration, candidate points are therefore generated from two sources: (i) acquisition-function optimisation using qLogNEHVI, and (ii) evolutionary search. These candidate sets are then combined into a single pool and scored again using the qLogNEHVI acquisition function, which ranks the candidates according to their expected contribution to Pareto front improvement. The top-ranked candidates are then selected for experimental evaluation. In this architecture, qLogNEHVI acts as a common selection criterion, choosing among candidates proposed by both acquisition-driven and evolutionary search strategies.
Fig. 1 (A) Traditional BO sample selection method. (B) Method of Low et al.,17 which incorporates an evolutionary generator, merges the candidates into a single pool, and then selects from that pool using an acquisition ranking.
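The merge-and-rank step can be illustrated with a minimal sketch. It assumes a per-candidate scoring function standing in for qLogNEHVI (which in practice scores candidates under a surrogate model); the names `select_from_merged_pool` and `toy_score` are hypothetical and not taken from the EGBO code.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[float, ...]


def select_from_merged_pool(
    acq_candidates: Sequence[Candidate],
    evo_candidates: Sequence[Candidate],
    score: Callable[[Candidate], float],
    batch_size: int,
) -> List[Tuple[Candidate, str]]:
    """Merge both candidate sources and keep the top-scoring batch."""
    # Tag each candidate with its source so contributions can be traced later.
    pool = [(x, "acquisition") for x in acq_candidates]
    pool += [(x, "evolutionary") for x in evo_candidates]
    # Rank the whole pool with one shared criterion (higher is better).
    ranked = sorted(pool, key=lambda item: score(item[0]), reverse=True)
    return ranked[:batch_size]


def toy_score(x: Candidate) -> float:
    """Placeholder merit (NOT qLogNEHVI): closeness to the centre of the unit box."""
    return -sum((xi - 0.5) ** 2 for xi in x)


batch = select_from_merged_pool(
    acq_candidates=[(0.5, 0.5), (0.9, 0.9)],
    evo_candidates=[(0.4, 0.5), (0.0, 0.0), (0.6, 0.6)],
    score=toy_score,
    batch_size=3,
)
sources = [source for _, source in batch]
```

Keeping the source tag alongside each candidate is one simple way to support the later analysis of how many selected points each generator contributed.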
We surmised that different evolutionary algorithms may perform differently within this framework on different problem types. We therefore established a computational benchmarking framework to evaluate the performance of these hybrid evolutionary-Bayesian optimisation strategies, comparing hybrids built on different evolutionary algorithms against an acquisition-only baseline. All optimisation algorithms were run on identical multi-objective test problems and initialised with the same seed set of initial samples, allowing comparison of performance across equivalent settings. Optimisation campaigns were then executed using identical batch sizes, numbers of optimisation iterations, and surrogate modelling architectures. This design ensured that each algorithm operated under identical evaluation budgets, enabling fair comparison of optimisation efficiency and reliability. The framework was designed to mimic experimental optimisation campaigns, in which a fixed number of experiments are performed sequentially in batches while the surrogate model is iteratively updated.
To evaluate algorithm performance across a range of optimisation regimes, we selected ten commonly used benchmark problems drawn from the DTLZ,28 ZDT,29 and MW30 test suites used for benchmarking optimisation algorithms. These problems span a range of characteristics representative of real experimental optimisation challenges, including differing dimensionalities, Pareto front geometries, and constraint structures. Specifically, the benchmark set includes standard multi-objective problems (ZDT1, ZDT2, ZDT3, DTLZ1), a many-objective problem (DTLZ2 with five objectives), multimodal problems containing numerous local optima (ZDT4 and DTLZ3), and constrained optimisation problems (MW3, MW5, and MW7).
To assess the impact of incorporating evolutionary candidate generation, we compared traditional batch multi-objective Bayesian optimisation driven solely by the noisy expected hypervolume improvement (qLogNEHVI) acquisition function with hybrid architectures in which evolutionary algorithms were used to generate additional candidate solutions.
U-NSGA-III21 is an extension of the widely-used NSGA-III31,32 framework that incorporates a unified selection mechanism based on structured reference directions distributed across the objective space. At each generation, candidate solutions are assigned to reference directions and survival selection prioritises individuals that provide coverage of under-represented directions, explicitly encouraging a well-spread approximation of the Pareto front. This reference-direction approach makes U-NSGA-III particularly well-suited to many-objective problems, where hypervolume-based diversity metrics become computationally intractable, and it was the evolutionary generator used in the original EGBO study.
SMS-EMOA22 is a steady-state evolutionary algorithm that uses hypervolume contribution as its primary selection criterion. At each generation, the individual contributing least to the hypervolume of the current population is removed, iteratively refining the population toward a front that maximises dominated volume relative to a reference point. This hypervolume-based survival mechanism directly optimises the same criterion used to assess optimisation quality.
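The hypervolume-contribution idea can be made concrete for the two-objective case. The following is a minimal sketch assuming minimisation and a user-chosen reference point; it recomputes the hypervolume by leaving each point out in turn, which is simple rather than efficient and is not the SMS-EMOA implementation itself. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Sequence[Point], ref: Point) -> float:
    """Area dominated by `points` (minimisation) and bounded by `ref`."""
    area = 0.0
    best_f2 = ref[1]
    # Sweep in order of increasing first objective.
    for f1, f2 in sorted(points):
        if f1 >= ref[0] or f2 >= best_f2:
            continue  # outside the reference box, or dominated by an earlier point
        area += (ref[0] - f1) * (best_f2 - f2)
        best_f2 = f2
    return area


def hypervolume_contributions(points: List[Point], ref: Point) -> List[float]:
    """Loss in hypervolume when each point is removed on its own."""
    total = hypervolume_2d(points, ref)
    return [
        total - hypervolume_2d(points[:i] + points[i + 1:], ref)
        for i in range(len(points))
    ]


front = [(1.0, 3.0), (2.0, 2.5), (3.0, 1.0)]
reference = (4.0, 4.0)
contributions = hypervolume_contributions(front, reference)
# Index of the point a hypervolume-based survival step would discard first.
least = min(range(len(front)), key=contributions.__getitem__)
```

In this toy front the middle point adds the least dominated area, so it is the one a hypervolume-based survival rule would remove.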
AGE-MOEA-II20 employs a geometry-aware diversity preservation strategy in which the shape of the Pareto front is estimated adaptively during the search and used to construct a problem-specific set of reference vectors. Survival selection then promotes solutions that provide uniform coverage relative to this estimated geometry, allowing the algorithm to adapt its diversity mechanism to the actual trade-off structure of the problem rather than assuming a fixed front shape.
These algorithms represent complementary diversity-preservation mechanisms: reference-direction-based, hypervolume-based, and geometry-adaptive. This enables systematic evaluation of how different evolutionary strategies interact with acquisition-driven optimisation within the EGBO framework. We compared several metrics for the optimisations, including final hypervolume (HV), Pareto front properties, and, importantly, inverted generational distance (IGD),33 which measures the average distance from each point on the true Pareto front to the closest point on the discovered front. Smaller values indicate a discovered front that is closer to the true one, so IGD provides a measure of diversity as well as convergence.
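As a minimal sketch of the IGD definition given above, assuming Euclidean distance and a finite sample of the true front (the toy fronts below are illustrative, not benchmark data):

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def igd(reference_front: Sequence[Point], found_front: Sequence[Point]) -> float:
    """Mean distance from each reference point to its nearest discovered point."""
    total = 0.0
    for r in reference_front:
        total += min(math.dist(r, p) for p in found_front)
    return total / len(reference_front)


# Sampled "true" front of a toy linear trade-off.
true_front = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
# Both discovered fronts lie exactly on the true front, but one is clustered.
spread_front = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
clustered_front = [(0.0, 1.0)]

igd_spread = igd(true_front, spread_front)
igd_clustered = igd(true_front, clustered_front)
```

The clustered front is fully converged yet scores a worse (higher) IGD because parts of the true front have no nearby discovered point, which is why IGD reflects coverage as well as convergence.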
Across the benchmark suite, hybrid optimisation approaches consistently outperformed qLogNEHVI on the most challenging problems (Fig. 2A and B). The largest gain was observed on the many-objective DTLZ2 task with five objectives, where the qLogNEHVI + U-NSGA-III hybrid achieved more than five-fold higher hypervolume than the acquisition-only baseline. This advantage was also reflected in lower IGD (closer approximation to the reference Pareto front), broader Pareto-front spread (better trade-off coverage), and faster convergence (earlier attainment of high-quality fronts) (Fig. S1). Together, these results highlight a key limitation of acquisition-only optimisation in many-objective settings: the acquisition landscape becomes difficult to optimise, increasing the risk of premature concentration in narrow objective-space regions. Evolutionary candidate generation mitigates this by explicitly promoting exploration and front coverage.
A similar pattern was observed on multimodal problems, particularly ZDT4, where hybrid methods again achieved better HV and IGD, with wider front spread and more reliable convergence. This supports the hypothesis that evolutionary generation improves robustness in rugged landscapes where acquisition optimisation alone can become trapped in local optima. In contrast, differences between strategies were small on simpler two-objective problems with smooth fronts (ZDT1, ZDT2, ZDT3, and DTLZ1). Here, HV and IGD were generally close across methods, indicating limited practical benefit from additional evolutionary diversity under easier geometry.
Statistical testing was consistent with this interpretation. Friedman tests confirmed strong overall algorithm effects for both HV and IGD metrics (HV p = 5.24 × 10⁻¹⁹, IGD p = 2.62 × 10⁻³³). Post-hoc Wilcoxon tests with Holm correction showed that all three hybrid variants significantly outperformed acquisition-only qLogNEHVI on both HV (all Holm-adjusted p < 2.1 × 10⁻⁶) and IGD (all Holm-adjusted p < 2.2 × 10⁻¹³). Within-hybrid differences were metric-dependent: for HV, no pairwise hybrid comparison was significant, whereas for IGD the U-NSGA-III-coupled (EGBO) algorithm was significantly better than SMS-EMOA (Holm p = 6.42 × 10⁻⁵) and AGE-MOEA-II (Holm p = 3.47 × 10⁻⁴).
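The Holm step-down adjustment applied to the pairwise tests can be sketched in a few lines. The raw p-values below are made-up inputs for illustration only, not values from this study; the omnibus and pairwise tests themselves could be obtained from a statistics library such as SciPy (`scipy.stats.friedmanchisquare`, `scipy.stats.wilcoxon`).

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Illustrative raw p-values for three pairwise comparisons (hypothetical).
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
```

With these inputs the adjusted values are approximately 0.03, 0.06, and 0.06: the largest raw p-value inherits the adjusted value of the one ranked before it because of the monotonicity step.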
To understand how each component contributed to batch selection, we quantified the proportion of selected points originating from the acquisition optimiser versus the evolutionary generator. For more complex problems with higher dimensions and tighter trade-off structure, the selected set remains dominated by evolutionary proposals, while acquisition contributes a smaller but consistently non-zero share that likely helps retain local refinement (Fig. 3A). We also calculated the number of Pareto points discovered by each generator and show that selection share alone is not a proxy for impact: the number of Pareto-optimal points contributed by each generator did not always mirror its selection share, indicating that some generators converted selected proposals into front-quality solutions more efficiently than others (Fig. 3B). Together, these results support the view that hybrid performance is driven by complementary roles, with evolutionary generators providing broad frontier coverage and acquisition targeting exploitation, rather than by either component winning in isolation. Overall, however, we find that U-NSGA-III provides better and more consistent coverage of the Pareto front and is competitive with the other evolutionary algorithms in HV expansion.
Among the evolutionary algorithms evaluated, U-NSGA-III consistently produced the most reliable Pareto-front coverage across diverse benchmark problems. This behaviour likely reflects the use of reference directions, which explicitly guide population diversity across the objective space. In many-objective optimisation problems, maintaining diversity becomes increasingly challenging due to the exponential growth of possible trade-offs between objectives. The reference-direction strategy used in NSGA-III helps stabilise search behaviour in such settings, ensuring that candidate solutions remain distributed across the Pareto front. Within the hybrid optimisation framework, this diversity complements acquisition-driven exploitation, allowing evolutionary search to provide broad coverage while the acquisition function refines promising regions.
To assess whether incorporating exploration directly at the acquisition function level achieves comparable benefits to hybrid evolutionary search, we additionally benchmarked qParEGO,34 a well-established multi-objective BO method that diversifies acquisition through random Chebyshev scalarisation, sampling a new weight vector from the unit simplex at each batch (Fig. S2). Across a representative subset of benchmarks spanning increasing problem complexity, qParEGO performed comparably to EGBO on the simplest two-objective unconstrained problem (ZDT1), but degraded substantially as complexity increased, achieving approximately seven-fold lower hypervolume than EGBO on the five-objective DTLZ2 problem and less than half the hypervolume on the constrained MW5 benchmark. These results suggest that exploration embedded at the acquisition level alone is insufficient as problem complexity scales, and that the complementary diversity provided by evolutionary candidate generation is not replicated by scalarisation-based acquisition diversification alone.
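The scalarisation step behind ParEGO-style methods can be sketched as follows. This assumes minimisation, objectives already normalised to [0, 1], and the augmented form of the Chebyshev scalarisation with a small `rho` term; the exact variant used by a given qParEGO implementation may differ, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def sample_simplex_weights(n_objectives: int, rng: random.Random) -> List[float]:
    """Draw a weight vector uniformly from the unit simplex."""
    # Normalised exponential draws are Dirichlet(1, ..., 1) distributed.
    draws = [rng.expovariate(1.0) for _ in range(n_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def augmented_chebyshev(
    objectives: Sequence[float], weights: Sequence[float], rho: float = 0.05
) -> float:
    """Collapse several objectives into one scalar (lower is better)."""
    weighted = [w * f for w, f in zip(weights, objectives)]
    return max(weighted) + rho * sum(weighted)


rng = random.Random(0)
weights = sample_simplex_weights(5, rng)  # a fresh vector would be drawn per batch
scalar = augmented_chebyshev([0.2, 0.4, 0.1, 0.9, 0.5], weights)
```

Each random weight vector emphasises a different direction in objective space, so diversity across batches comes only from the weights rather than from a population of candidates.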
- Feryal: Combining U-NSGA-III and SMS-EMOA.
- Ikhlas: Combining AGE-MOEA-II and U-NSGA-III.
- Karima: Combining U-NSGA-III, SMS-EMOA, and AGE-MOEA-II.
Across benchmark problems, multi-generator variants did not deliver large or practically transformative gains in final hypervolume over U-NSGA-III-coupled optimisation (EGBO) (Fig. 4A and S3). A significant Friedman statistic (p = 5.67 × 10⁻⁶) and post-hoc tests (Holm p for EGBO vs. Ikhlas = 0.00255; EGBO vs. Karima = 0.00614) show that two of the multi-generator variants achieve consistently higher HV than EGBO alone, but these gains were small relative to the added computational complexity. For IGD, we saw no significant difference between any of the variants (Friedman p = 0.235) (Fig. 4B). Overall, improvements were small, suggesting that the additional compute cost of coupling multiple generators is unlikely to be universally beneficial.
Studying generator contributions also demonstrates that multiple generators do contribute to Pareto-optimal results (Fig. 4C), but the HV and IGD results show that they converge to nearly identical solution distributions. U-NSGA-III alone achieves roughly the same Pareto front size and diversity as the variants combining multiple generators, indicating that the additional architectural complexity does not necessarily provide meaningful improvement in solution quality or exploration coverage. It does, however, increase compute times significantly, both by increasing the number of samples generated and by increasing the size of the pool that qLogNEHVI must score to select the final samples.
We extended our analysis, keeping only the qLogNEHVI variant coupled to U-NSGA-III (the original EGBO) for comparison with traditional qLogNEHVI, as it consistently showed a better ability to map the Pareto front (demonstrated by a lower IGD) across problems than the other generators, and as additional generators did not meaningfully impact performance. We first explored how the addition of Gaussian noise affected the ability of the algorithms to optimise. Given that many experimental setups include unavoidably high noise levels, we sought to assess how robust each method is to different levels of variance. We added Gaussian noise to 4 test functions (ZDT2, ZDT3, DTLZ2 with 5 objectives, and MW5) at levels of 0% (baseline), 1%, 5%, 10%, and 20% of the objective range (Fig. 5), keeping other parameters of the optimisation consistent with the previous studies (12 batches of size 8, with 10 repeats on the same random starting points).
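One way to realise noise levels expressed as a percentage of the objective range is sketched below. It assumes that the percentage sets the standard deviation of zero-mean Gaussian noise and that the range is taken from the observed noise-free objective values; the text does not specify either detail, so both are assumptions, as is the function name.

```python
import random
from typing import List, Sequence


def add_range_scaled_noise(
    objectives: Sequence[Sequence[float]], level: float, rng: random.Random
) -> List[List[float]]:
    """Add Gaussian noise whose std is `level` times each objective's range."""
    n_obj = len(objectives[0])
    # Per-objective range over the noise-free values (assumed definition).
    ranges = [
        max(row[j] for row in objectives) - min(row[j] for row in objectives)
        for j in range(n_obj)
    ]
    return [
        [row[j] + rng.gauss(0.0, level * ranges[j]) for j in range(n_obj)]
        for row in objectives
    ]


clean = [[0.0, 10.0], [1.0, 30.0], [0.5, 20.0]]
baseline = add_range_scaled_noise(clean, 0.0, random.Random(1))  # 0% level
noisy = add_range_scaled_noise(clean, 0.20, random.Random(1))    # 20% level
```

Scaling by each objective's own range keeps the relative noise comparable across objectives with very different magnitudes.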
As feature dimensionality increased, we found that both EGBO and acquisition-only optimisation maintained their ability to advance the Pareto front, though this ability generally decreased as noise levels increased. In particular, for the DTLZ2 problem with 5 objectives, qLogNEHVI failed to increase the HV at all, while EGBO's ability to optimise degraded to a similar level as noise increased.
Next, we studied the ability of the models to handle constraints. U-NSGA-III and other evolutionary algorithms inherently contain constraint-aware features, so we compared the ability of EGBO and qLogNEHVI to meet constraints on the MW3, MW5, and MW7 constrained optimisation problems. Across all 3 problems, EGBO generated more feasible points (Fig. 6A), and upon shifting the constraint boundaries of MW5 to make finding feasible points more difficult, EGBO (qLogNEHVI + U-NSGA-III) maintained its ability to find more feasible samples than qLogNEHVI alone (Fig. 6B). We also tested the algorithms' ability to optimise under high numbers of features, as is sometimes the case in chemical optimisations where molecules are described by hundreds of descriptors. EGBO showed a progressively reduced ability to recover Pareto-optimal solutions as feature dimensionality increased, and was ultimately outperformed by qLogNEHVI alone (Fig. 6C), suggesting that for high-dimensional feature systems, using an evolutionary generator can dilute sample efficiency and hinder convergence by spreading evaluations too broadly across a sparse search space.
We hypothesised that the final batch selection stage could therefore represent an opportunity for improvement. In particular, when large candidate pools are generated (e.g. hundreds of evolutionary candidates alongside a small set of qLogNEHVI proposals), multiple high-scoring candidates may occupy very similar regions of the decision space. Selecting several such candidates within a single batch may limit the effective exploration of the design space and reduce the information gained from each optimisation round, particularly in campaigns with small batch sizes, such as the batch size of 4 used in the original EGBO study.
To address this, we introduced a modified merge-selection strategy in which candidates from the combined pool are selected sequentially using a hybrid score incorporating both predicted merit and novelty. Specifically, candidates were first evaluated according to their predicted optimisation score (as in the previously studied evolutionary-coupled methods), and final batch members were then selected sequentially with an additional novelty term that favours candidates that are distant from previously selected points (Fig. 7). This approach encourages diversity within each batch while retaining the model-guided prioritisation of promising regions.
Our novelty term includes a weight parameter that controls the trade-off between acquisition merit and novelty, with a higher weight favouring acquisition-driven sample selection and a lower weight favouring novelty-based selection. We first performed a sensitivity analysis by assessing the influence of different weights (0.3, 0.5, 0.7, and 0.9) on 4 of the test problems (Fig. S4). We found a systematic reduction in IGD as the weight increased across these problems. These results indicate that the balance should favour acquisition-driven selection, with the novelty component playing a supporting role. We note, however, that these problems have smooth objective spaces, and we expect the novelty term to become more influential in fully experimental systems, where noisier measurements and more complex objective landscapes reduce the reliability of the acquisition signal and increase the risk of premature convergence to a narrow region of the front. The decreasing returns at lower weights on synthetic problems likely reflect over-penalisation of high-merit candidates in settings where the acquisition signal is already reliable and the evolutionary candidate pool provides sufficient geometric diversity across the front. Having identified that a large novelty contribution degrades performance on these test functions, we set the weight to 0.7 for the following analyses, keeping some contribution from novelty but allowing most of the selection to be driven by acquisition.
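A minimal sketch of sequential, novelty-weighted batch selection is given below. The text specifies only that a weight trades off acquisition merit against a novelty term favouring candidates distant from previously selected points. The linear combination, the min-max normalisation of merit, the Euclidean nearest-neighbour distance, and the restriction of novelty to points already chosen within the current batch are all assumptions made for illustration, as is the name `novelty_aware_select`.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def novelty_aware_select(
    pool: Sequence[Point], merit: Sequence[float], batch_size: int, weight: float = 0.7
) -> List[int]:
    """Greedily pick pool indices by weight * merit + (1 - weight) * novelty."""
    lo, hi = min(merit), max(merit)
    span = (hi - lo) or 1.0
    norm_merit = [(m - lo) / span for m in merit]  # rescale merit to [0, 1]
    # Largest pairwise distance, used to rescale novelty to [0, 1].
    max_dist = max((math.dist(a, b) for a in pool for b in pool), default=0.0) or 1.0

    selected: List[int] = []
    remaining = list(range(len(pool)))
    while remaining and len(selected) < batch_size:

        def hybrid(i: int) -> float:
            if not selected:
                return norm_merit[i]  # the first pick is purely merit-driven
            nearest = min(math.dist(pool[i], pool[j]) for j in selected)
            return weight * norm_merit[i] + (1.0 - weight) * nearest / max_dist

        best = max(remaining, key=hybrid)
        selected.append(best)
        remaining.remove(best)
    return selected


# Two near-duplicate high-merit candidates plus two distant ones.
pool = [(0.50, 0.50), (0.51, 0.50), (0.10, 0.90), (0.90, 0.10)]
merit = [1.00, 0.99, 0.90, 0.20]
with_novelty = novelty_aware_select(pool, merit, batch_size=2, weight=0.7)
merit_only = novelty_aware_select(pool, merit, batch_size=2, weight=1.0)
```

With a weight of 1.0 the batch contains both near-duplicates, whereas a weight of 0.7 swaps the second one for a slightly lower-merit candidate in a different region, which is the redundancy-reducing behaviour described above.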
We re-ran our initial 10 benchmark problems, comparing U-NSGA-III-coupled evolutionary optimisation without (EGBO) and with (Novelty-aware EGBO) our selection metric, as well as qLogNEHVI. Across the 10 benchmark problems, Novelty-aware EGBO produced modest but consistent gains over standard EGBO and marked gains over acquisition-only qLogNEHVI for both HV (Fig. 8A) and IGD (Fig. 8B). Friedman tests indicated significant overall differences for both metrics (HV p = 6.93 × 10⁻¹²; IGD p = 2.04 × 10⁻¹²), although direct pairwise differences between EGBO and Novelty-aware EGBO were not uniformly significant across all benchmarks. To better characterise selection behaviour, we quantified an exploration score defined as one minus the average percentile rank of selected points within the merged candidate pool, such that values near 1 indicate more exploratory selection and values near 0 indicate more exploitative selection. A paired sign test showed that Novelty-aware EGBO was more exploratory than EGBO in all matched comparisons (p = 1.58 × 10⁻³⁰; 100/100 pairs) (Fig. S5).
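The exploration score and the sign test can be sketched as follows. The percentile rank is assumed here to be the fraction of pool candidates whose acquisition score is less than or equal to that of the selected point, and the sign test is assumed to be the two-sided exact binomial version; neither detail is stated in the text, and the function names are hypothetical.

```python
from math import comb
from typing import Sequence


def exploration_score(pool_scores: Sequence[float], selected: Sequence[int]) -> float:
    """One minus the mean percentile rank of the selected candidates."""
    n = len(pool_scores)

    def percentile_rank(i: int) -> float:
        return sum(s <= pool_scores[i] for s in pool_scores) / n

    mean_rank = sum(percentile_rank(i) for i in selected) / len(selected)
    return 1.0 - mean_rank


def sign_test_p(wins: int, n_pairs: int) -> float:
    """Two-sided exact sign test p-value for `wins` out of `n_pairs`."""
    k = max(wins, n_pairs - wins)
    tail = sum(comb(n_pairs, i) for i in range(k, n_pairs + 1)) / 2 ** n_pairs
    return min(1.0, 2 * tail)


scores = [0.1, 0.2, 0.3, 0.4]
greedy = exploration_score(scores, [2, 3])   # the two top-ranked candidates
diverse = exploration_score(scores, [0, 3])  # one top and one low-ranked candidate
p_all_pairs = sign_test_p(100, 100)
```

Under these assumptions, 100 wins out of 100 pairs gives p = 2⁻⁹⁹ ≈ 1.58 × 10⁻³⁰, which agrees with the value reported above.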
Fig. 8 Novelty-aware EGBO compared to EGBO alone and qLogNEHVI for 10 test problems, showing (A) hypervolume and (B) normalised IGD score.
In addition, we performed a sensitivity sweep of the weight parameter of Novelty-aware EGBO to study the impact of novelty on optimisation efficiency (Fig. S6). We found no clear trend towards one weight value, suggesting that performance is robust to the inclusion of a novelty-aware term, but that further improvements could be made by dynamically adjusting this value for each dataset, or within runs, to maximise efficiency.
This difference in selection behaviour was most consequential on constrained problems (Fig. S7). On the MW benchmark problems (MW3, MW5, and MW7), which feature narrow or disconnected feasible Pareto regions analogous to real experimental systems, Novelty-aware EGBO reduced the average number of stagnant batches by ∼30% (1.67 vs. 2.37) and increased final hypervolume by 12.9% relative to EGBO. This is particularly relevant in the self-driving lab context, where a stagnant optimisation round corresponds to a batch of physical experiments that consumes reagents, instrument time, and researcher effort while yielding no improvement to the Pareto front. The ability of the novelty-augmented selection to escape locally dense regions of candidate space may therefore be of direct practical value in SDL campaigns targeting multi-objective formulation problems, where feasibility constraints partition the design space in ways that are not known a priori and must be discovered through experimentation.
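Counting stagnant batches from a campaign trace can be sketched as below, assuming that stagnation is judged by the hypervolume failing to increase after a batch (the text defines a stagnant round only as one yielding no Pareto-front improvement). The trace values are illustrative, not results from this study.

```python
from typing import Sequence


def count_stagnant_batches(hv_trace: Sequence[float], tol: float = 1e-12) -> int:
    """Number of batches after which the hypervolume did not increase.

    hv_trace[0] is the hypervolume of the initial samples; each later entry
    is the hypervolume after one more batch has been evaluated.
    """
    return sum(
        1 for previous, current in zip(hv_trace, hv_trace[1:])
        if current - previous <= tol
    )


# Illustrative trace: initial hypervolume followed by five batches.
trace = [0.10, 0.10, 0.15, 0.15, 0.15, 0.20]
stagnant = count_stagnant_batches(trace)
```

Here three of the five batches leave the hypervolume unchanged, so each of those rounds would represent experimental effort with no measurable gain.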
We next evaluated the hybrid optimisation strategies on real-world experimental datasets, in order to assess performance across a range of practical discovery settings with differing design-space structures and objective relationships. These datasets represent problems from reaction optimisation, pharmaceutical formulation, industrial materials development, and drug screening, allowing the behaviour of the optimisation algorithms to be assessed across diverse experimental design spaces and objective structures. We chose to test 4 experimental datasets in this analysis – a Suzuki–Miyaura cross-coupling reaction originally reported by Reizman and Jensen;35 a microparticle formulation campaign derived from experimental data reported from an automated lab generating long-acting injectable formulations;16 an industrial coating formulation optimisation dataset from the ADA database, representing a typical multi-objective materials optimisation problem with competing performance criteria;36 and a drug screening dataset from the Genomics of Drug Sensitivity in Cancer (GDSC) project37 where we have extracted drug response for 5 genetically diverse colorectal cancer cell lines, and each cell line is treated as an independent optimisation objective. These datasets collectively represent a range of optimisation regimes encountered in experimental discovery workflows, including varying dimensionalities, objective trade-offs, and noise characteristics.
For each dataset we ran a post-hoc optimisation campaign in which optimisation algorithms sequentially selected candidate experiments from the existing dataset as if they were conducting a real experimental campaign. At each iteration, the selected sample was revealed from the dataset and used to update the surrogate model, allowing optimisation performance to be evaluated under realistic experimental budgets without performing additional laboratory experiments. We compared the performance of qLogNEHVI, EGBO, and Novelty-aware EGBO based optimisations. We ran 10 repeats of campaigns using 12 batches of 4 samples (in line with the original EGBO publication), where starting samples were shared between algorithms to ensure fairness.
Across the four experimental datasets, Novelty-aware EGBO achieved the highest mean HV on three datasets and remained competitive on the fourth, and overall was significantly superior to both original EGBO and qLogNEHVI (Holm p vs. EGBO = 0.00022, Holm p vs. qLogNEHVI = 0.0035) (Fig. 9A). This was also reflected in significantly lower IGD values for Novelty-aware EGBO compared to original EGBO and qLogNEHVI across all datasets (Holm p vs. EGBO = 0.00032, Holm p vs. qLogNEHVI = 0.01) (Fig. 9B).
Fig. 9 Novelty-aware EGBO compared to EGBO alone and qLogNEHVI for 4 post-hoc real-world problems, showing (A) mean hypervolume traces and (B) IGD.
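For readers unfamiliar with the Holm correction applied to the reported p values, a minimal sketch of the standard Holm–Bonferroni step-down adjustment follows. The raw p values in the example are hypothetical, and the underlying statistical test that produced the study's p values is not specified here.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p value by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p values from three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
```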
These results suggest that the novelty-aware allocation strategy improves optimisation performance in realistic discovery settings where experimental noise, heterogeneous objective landscapes, and limited evaluation budgets are common. By incorporating information on the novelty of each potential sample alongside acquisition-driven exploitation and evolutionary exploration, Novelty-aware EGBO appears better able to maintain diversity while still prioritising promising regions of the design space.
Importantly, these observations are consistent with the trends observed in the synthetic benchmark experiments. In both settings, hybrid optimisation approaches provide the greatest benefit in complex optimisation landscapes, where maintaining diversity in candidate generation helps prevent premature convergence and improves exploration of the objective space. The novelty-aware variant further enhances this behaviour by penalising candidates that are near to others in the normalised decision space, ensuring that samples selected concurrently are distributed across distinct regions of the input space, promoting broader exploration of the Pareto front.
Taken together, these results demonstrate that hybrid evolutionary-Bayesian optimisation strategies not only perform well on synthetic benchmarks but also translate effectively to real-world discovery problems, including reaction optimisation, formulation development, materials design, and multi-objective drug screening. This suggests that such approaches may provide a practical framework for improving optimisation performance in autonomous laboratories and data-driven experimental workflows, where limited experimental budgets and complex objective landscapes are common.
Our results further show that these benefits arise from the complementary roles of acquisition-driven and evolutionary search. Evolutionary algorithms provide broad exploration of the objective space, while acquisition optimisation refines promising regions of the design landscape. Among the evolutionary algorithms evaluated, U-NSGA-III provided the most consistent Pareto-front coverage, while combining multiple evolutionary generators produced only marginal improvements relative to the additional computational cost. In contrast, these hybrid benefits diminished in very high-dimensional feature spaces, where broad evolutionary exploration significantly reduced sample efficiency. We also demonstrate that introducing a novelty-aware batch-selection strategy improves optimisation efficiency by promoting diversity within candidate batches, reducing optimisation stagnation and improving Pareto-front approximation. Whilst these improvements are modest, with only slight gains in hypervolume and IGD, they open an avenue for adjusting the sample selection method that could lead to further improvements with more sophisticated methods.
Importantly, these findings extend beyond synthetic benchmarks and translate effectively to real-world experimental optimisation problems, where hybrid optimisation strategies consistently outperformed traditional Bayesian optimisation approaches. These results are particularly relevant in the context of self-driving laboratories, where each optimisation batch corresponds to a set of physical experiments. In such settings, optimisation stagnation represents wasted experimental resources, including reagents, instrument time, and researcher effort. Hybrid optimisation strategies that maintain diversity in candidate generation therefore have practical advantages beyond purely computational metrics, as they reduce the likelihood of repeatedly sampling similar regions of the design space and improve the probability of discovering diverse high-performing solutions.
More broadly, this work suggests that diversity should be treated as a design principle in batched multi-objective optimisation, rather than as a by-product of acquisition maximisation alone. In practical terms, hybrid evolutionary-Bayesian approaches appear most valuable for discovery campaigns with complex trade-offs, constrained feasible regions, or rugged optimisation landscapes, whereas simpler acquisition-only methods may remain preferable in very high-dimensional settings where sample efficiency is paramount.
Future work should focus on extending these frameworks to higher-dimensional experimental design spaces, developing more adaptive selection rules that adjust exploration pressure during the campaign, and validating these methods prospectively in live self-driving laboratory workflows. Together, these results demonstrate that evolutionary-assisted Bayesian optimisation provides a robust and practically useful strategy for navigating complex multi-objective design spaces in autonomous discovery.
(i) All systems shared identical randomly-selected initialisation sets (e.g., the same 10 initial sample sets were used for every algorithm tested on MW5).
(ii) Batch size was held constant across systems and problems.
(iii) Total experimental budget (number of batches) was equal for all systems and problems.
For all synthetic test problem assessments, we used 18 initial samples, 10 repeats of each campaign using consistent initial samples, a batch size of 8, and 12 batches per campaign. For the post-hoc analysis on real-world datasets, in line with experimental protocols in the original EGBO project, we performed 10 repeats with 18 initial samples and 12 batches of batch size 4.
(i) Traditional qLogNEHVI, which generates candidates by acquisition optimisation only (no evolutionary generator).
(ii) Evolutionary generator coupled, which merges a fixed-size qLogNEHVI pool with a U-NSGA-III pool and ranks by raw acquisition.
(iii) Multi-generator coupled, which combines qLogNEHVI with two or more EA engines (e.g., U-NSGA-III and SMS-EMOA) to further broaden coverage.
(iv) Novelty-aware evolutionary generator coupled, which uses the same merged pool as EGBO but selects the final batch using a novelty-aware greedy downselection procedure rather than pure acquisition ranking.
All variants used the same surrogate modelling procedure, batch size, and evaluation budget, so differences in performance arose solely from differences in candidate generation and batch selection.
Final batch construction was then performed greedily. At each selection step, each candidate i in the merged pool was assigned a combined score:
$$\mathrm{Score}_i = w\,\tilde{a}_i + (1 - w)\,\tilde{n}_i$$
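The greedy construction with this combined score can be sketched as follows. The sketch makes several assumptions that are not taken from the method description: the acquisition term is min-max normalised over the merged pool, the novelty term is the min-max normalised Euclidean distance to the nearest already-selected (or previously evaluated) point in the normalised decision space, and the weight defaults to `w = 0.5`. The function and variable names are hypothetical.

```python
import math


def minmax(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def novelty_aware_select(candidates, acq_values, batch_size, w=0.5, evaluated=()):
    """Greedily pick `batch_size` candidate indices by combined score."""
    a_norm = minmax(acq_values)                  # normalised acquisition
    reference = [tuple(x) for x in evaluated]    # points to stay away from
    remaining = list(range(len(candidates)))
    chosen = []
    while remaining and len(chosen) < batch_size:
        if reference:
            # Novelty: distance to the nearest reference point.
            raw_nov = [min(math.dist(candidates[i], r) for r in reference)
                       for i in remaining]
        else:
            raw_nov = [0.0] * len(remaining)     # nothing to compare against yet
        n_norm = minmax(raw_nov)
        scores = [w * a_norm[i] + (1 - w) * n
                  for i, n in zip(remaining, n_norm)]
        best_pos = max(range(len(remaining)), key=scores.__getitem__)
        best = remaining.pop(best_pos)
        chosen.append(best)
        reference.append(tuple(candidates[best]))  # later picks avoid this point
    return chosen


# Hypothetical pool: candidates 0 and 1 are near-duplicates with high acquisition.
pool = [(0.10, 0.10), (0.12, 0.11), (0.90, 0.90), (0.50, 0.50)]
acq = [1.0, 0.95, 0.60, 0.20]
chosen = novelty_aware_select(pool, acq, batch_size=2)
greedy_only = novelty_aware_select(pool, acq, batch_size=2, w=1.0)
```

In this toy pool the novelty term replaces the near-duplicate candidate 1 with the distant candidate 2, so `chosen` is `[0, 2]`, whereas pure acquisition ranking (`w = 1.0`) returns `[0, 1]`.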
| Problem | Features | Objectives | Reason chosen |
|---|---|---|---|
| ZDT1 | 8 | 2 | Smooth, convex Pareto front; baseline test of standard multi-objective convergence |
| ZDT2 | 8 | 2 | Non-convex Pareto front; tests whether methods can recover curved/non-convex trade-offs |
| ZDT3 | 8 | 2 | Disconnected Pareto front; tests front coverage across separated regions |
| ZDT4 | 8 | 2 | Highly multimodal landscape; stresses robustness to local optima and deceptive structure |
| DTLZ1 | 8 | 3 | 3-Objective benchmark with broad trade-off structure; evaluates extension beyond 2-objective settings |
| DTLZ3 | 8 | 3 | Multimodal many-local-front variant; harder convergence challenge in 3-objective optimisation |
| DTLZ2 5obj | 14 | 5 | Many-objective setting; tests scalability of surrogate + selection under higher objective dimensionality |
| MW3 | 8 | 2 | Constrained benchmark with nonlinear feasibility boundaries; tests constrained BO behaviour |
| MW5 | 8 | 2 | Constrained benchmark with multiple nonlinear constraints and reduced feasible area; tests feasibility search under tighter constraints |
| MW7 | 8 | 2 | Constrained benchmark with challenging feasible geometry; tests stability of constrained exploration/exploitation |
To assess robustness beyond nominal settings, we performed three targeted stress tests. First, we evaluated noise robustness by adding zero-mean Gaussian perturbations to observed objective/constraint values at controlled levels (0–20% relative noise), while keeping initial seeds, batch size, and evaluation budget fixed across algorithms. Second, we evaluated constraint handling on constrained MW benchmarks (MW3, MW5, MW7), including modified MW5 variants with adjusted constraint tightness (e.g., tighter and looser feasible regions), to test each method's ability to discover and improve within limited feasible domains. Third, we evaluated high-dimensional scaling by increasing decision-space dimensionality (e.g., from standard 8D settings to higher-dimensional variants such as 50D/100D), while maintaining the same closed-loop protocol, to quantify how performance and stability change as the feature space grows. Together, these analyses isolate sensitivity to measurement noise, feasibility geometry, and dimensionality-driven search complexity.
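The noise stress test can be illustrated with a short sketch. It assumes that "relative noise" means a Gaussian standard deviation equal to the noise level multiplied by the magnitude of each observed value; the study may instead scale noise by the objective range, so this should be read as one plausible interpretation. The function name and example values are hypothetical.

```python
import random


def add_relative_noise(values, level, rng):
    """Perturb each value with zero-mean Gaussian noise, sd = level * |value|."""
    return [v + rng.gauss(0.0, level * abs(v)) for v in values]


rng = random.Random(0)            # fixed seed so repeats are comparable
clean = [1.0, 2.5, -0.8]          # hypothetical objective values
noisy = add_relative_noise(clean, 0.10, rng)     # 10% relative noise
unchanged = add_relative_noise(clean, 0.0, rng)  # 0% noise leaves values as-is
```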
- Suzuki: a Suzuki–Miyaura cross-coupling reaction originally reported by Reizman and Jensen35 and included in the Summit Python package.38
- SDL5: a microparticle formulation campaign derived from experimental data reported from an automated lab generating long-acting injectable formulations.16
- ADA coatings: an industrial coating formulation optimisation dataset from the ADA database.36
- GDSC CRC5: a drug screening dataset from the Genomics of Drug Sensitivity in Cancer (GDSC) project.37
The Suzuki, SDL5, and ADA coatings datasets were used as-is, with every sample in the dataset used as input for modelling. For the GDSC CRC5 dataset, the Genomics of Drug Sensitivity in Cancer dataset was downloaded and subset to cell lines derived from colorectal adenocarcinoma. We then selected the 5 cell lines with the most coverage: SNU-C1, LS-1034, LS-513, LS-123, and NCI-H747. For all cell lines we obtained the IC50 (the concentration required to kill 50% of the cells) of 349 drugs, and the optimisation task was set to uncover the drugs with minimal IC50, treating each of the 5 cell lines as a separate objective. Because of the limited sample size, drugs were represented using one-hot encoded ontology features derived from annotated putative targets and pathway labels rather than higher-dimensional SMILES-derived descriptor sets.
Because these datasets are retrospective, we used a post-hoc closed-loop protocol in which each algorithm sequentially proposes batches from the pool of unqueried experiments; selected samples are then “revealed” from the historical dataset and appended to the training set for the next iteration. This emulates practical batched Bayesian optimisation while preserving full comparability across methods. The same core protocol used for synthetic benchmarks was maintained in post-hoc studies, including matched initial seeds, fixed batch size and iteration budget, and multiple repeated runs, enabling direct, controlled performance comparisons between algorithms across synthetic and real-world settings. Due to these datasets being tabular by nature (rather than a continuous state space), for our post-hoc analysis we optimised in the normalised design space and then mapped each proposed offspring to the discrete experimental set through nearest-neighbour oracles at evaluation, with candidate selection preferring points that map to previously unseen dataset rows.
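The mapping from a proposed point to a dataset row can be sketched as a nearest-neighbour lookup restricted to rows that have not yet been revealed. This is a simplified illustration: it strictly excludes seen rows, whereas the protocol above only states a preference for unseen rows, and the function names and toy data are hypothetical.

```python
import math


def map_to_unseen_row(proposal, dataset_x, seen):
    """Index of the nearest dataset row not yet revealed, or None if exhausted."""
    unseen = [i for i in range(len(dataset_x)) if i not in seen]
    if not unseen:
        return None
    return min(unseen, key=lambda i: math.dist(proposal, dataset_x[i]))


def reveal_batch(proposals, dataset_x, dataset_y, seen):
    """Map each proposal to a row, mark it as seen, and return (index, outcome)."""
    revealed = []
    for p in proposals:
        idx = map_to_unseen_row(p, dataset_x, seen)
        if idx is None:
            break                      # every row has already been queried
        seen.add(idx)                  # later proposals cannot reuse this row
        revealed.append((idx, dataset_y[idx]))
    return revealed


# Toy historical dataset in a normalised 2D design space.
dataset_x = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0), (0.2, 0.9)]
dataset_y = [(1.0, 4.0), (2.0, 3.0), (3.0, 1.0), (2.5, 2.5)]
seen = {1}                             # row 1 was part of the initial samples
revealed = reveal_batch([(0.45, 0.55), (0.9, 1.0)], dataset_x, dataset_y, seen)
```

Here the first proposal lies closest to row 1, but because that row has already been revealed it is mapped to the nearest unseen row instead.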
• Convergence: trajectory of HV (and IGD) across batches, plus summary statistics over the optimisation horizon.
• Spread/diversity: uniformity of solutions along the non-dominated front (e.g., spacing/dispersion from nearest-neighbour distances in objective space), where lower dispersion indicates more even coverage.
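Two of these metrics can be sketched compactly. The code below computes IGD as the mean distance from each reference-front point to its nearest obtained solution, and a dispersion measure as the population standard deviation of nearest-neighbour distances along the obtained front. The latter is one common spacing-style definition and is an assumption here; the exact dispersion formula used in the study is not specified. The example fronts are hypothetical.

```python
import math
import statistics


def igd(reference_front, approx_front):
    """Inverted generational distance: mean distance from each reference
    point to its nearest point in the approximation set (lower is better)."""
    total = sum(min(math.dist(r, a) for a in approx_front)
                for r in reference_front)
    return total / len(reference_front)


def spacing(front):
    """Standard deviation of nearest-neighbour distances in objective space
    (lower means more even coverage). Requires at least two points."""
    nearest = [min(math.dist(p, q) for j, q in enumerate(front) if j != i)
               for i, p in enumerate(front)]
    return statistics.pstdev(nearest)


# Hypothetical two-objective fronts.
reference = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
even = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
uneven = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0)]

igd_even = igd(reference, even)
igd_uneven = igd(reference, uneven)
```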
Additionally, we calculated an exploration score to assess how much algorithms were exploring new samples. This score was based on the acquisition-rank percentile of the candidates chosen for evaluation. For each trial, the percentile ranks of the selected points within the acquisition-ordered candidate pool were averaged to obtain a mean acquisition percentile for the selection (mean_selected_acquisition_percentile), and this was converted to an exploration metric as:
$$\text{Exploration score} = 1 - \text{mean\_selected\_acquisition\_percentile}$$
Thus, higher values indicate that the algorithm more often selected lower-ranked acquisition candidates, consistent with greater exploratory behaviour, whereas lower values indicate stronger exploitation of top-ranked acquisition candidates. Reported values correspond to the mean and standard deviation of this per-trial score across repeated runs.
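A minimal sketch of this calculation for a single trial is given below. It assumes a particular percentile convention that is not stated in the text: the top-ranked acquisition candidate is assigned a percentile of 1 and the lowest-ranked a percentile of 0, with linear spacing in between. The function name and values are hypothetical.

```python
def exploration_score(acq_values, selected):
    """Exploration score = 1 - mean acquisition percentile of selected points.

    Assumes the highest acquisition value has percentile 1.0 and the lowest
    has percentile 0.0. Requires a pool of at least two candidates.
    """
    n = len(acq_values)
    # Rank 0 is the candidate with the highest acquisition value.
    order = sorted(range(n), key=lambda i: acq_values[i], reverse=True)
    rank = {idx: r for r, idx in enumerate(order)}
    percentiles = [1.0 - rank[i] / (n - 1) for i in selected]
    return 1.0 - sum(percentiles) / len(percentiles)


# Hypothetical acquisition values for a pool of five candidates.
acq = [0.9, 0.1, 0.5, 0.7, 0.3]
exploit = exploration_score(acq, [0, 3])   # picks the two top-ranked candidates
explore = exploration_score(acq, [1, 4])   # picks the two lowest-ranked candidates
```

Under this convention, selecting the two top-ranked candidates gives a low score and selecting the two lowest-ranked candidates gives a high score, matching the interpretation above.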
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d6dd00134c.
This journal is © The Royal Society of Chemistry 2026