Trevor Hastings *ab, James Paramore b, Brady Butler c and Raymundo Arróyave bde
aBush Combat Development Complex, Texas A&M University System, 3479 TAMU, College Station, TX 77843-3479, USA. E-mail: trevorhastings@tamu.edu
bDepartment of Materials Science and Engineering, Texas A&M University, 3003 TAMU, College Station, TX 77843-3003, USA
cDEVCOM Army Research Laboratory, 3003 TAMU, College Station, TX 77843-3003, USA
dJ. Mike Walker '66 Department of Mechanical Engineering, Texas A&M University, 3003 TAMU, College Station, TX 77843-3003, USA
eWm Michael Barnes '64 Department of Industrial and Systems Engineering, Texas A&M University, 3003 TAMU, College Station, TX 77843-3003, USA
First published on 20th November 2025
Bayesian optimization (BO) has emerged as an effective strategy to accelerate the discovery of new materials by efficiently exploring complex and high-dimensional design spaces. However, the success of BO methods greatly depends on how well the optimization campaign is initialized—the selection of initial data points from which the optimization starts. In this study, we focus on improving these initial datasets by incorporating materials science expertise into the selection process. We identify common challenges and sources of uncertainty when choosing these starting points and propose practical guidelines for using expert-defined criteria to create more informative initial datasets. By evaluating these methods through simulations and real-world alloy design problems, we demonstrate that using domain-informed criteria leads to initial datasets that are more diverse and representative. This enhanced starting point significantly improves the efficiency and effectiveness of subsequent optimization efforts. We also introduce clear metrics for assessing the quality and diversity of initial datasets, providing a straightforward way to compare different initialization strategies. Our approach offers a robust and widely applicable framework to enhance Bayesian optimization across various materials discovery scenarios.
Among adaptive discovery methods, Bayesian optimization (BO)11 is especially suited to material discovery, where experimental and computational evaluations are expensive and time consuming.11–14 BO utilizes a probabilistic model, typically a Gaussian process (GP),15 to iteratively select promising points by balancing exploration (sampling uncertain regions) and exploitation (sampling regions predicted to have high performance). Although Bayesian inference has a long-established history,16 its systematic integration into practical experimental workflows, particularly concerning initialization strategies, remains relatively underexplored.17,18 BO has demonstrated broad success in applications ranging from single-objective optimization (where a single metric or property is optimized) to complex multi-objective design (optimizing multiple properties simultaneously) due to its ability to treat the relationship between input variables and outcomes as a black-box. Furthermore, batch implementations of BO allow parallel evaluations, aligning closely with practical constraints and timelines in real-world discovery campaigns.17,19
Despite the growing adoption of BO in materials discovery, published efforts frequently overlook critical issues associated with selecting the initial queries (i.e. the initial dataset) used to jump-start the campaign. In the context of genetic algorithm (GA)-based optimization, Maaranen et al.20 examined how different strategies for generating initial populations influence optimization performance, particularly in early stopping scenarios. While their work focused on evolutionary methods, the underlying observation—that initialization can have a dominant effect when only a limited number of iterations are possible—is highly relevant to experimental materials discovery, where datasets are generally sparse and queries are expensive. These conditions are not unique to GA-based methods; they apply equally to BO and other adaptive/iterative optimization strategies commonly used in materials research. Despite this, many experimental studies begin with randomly or heuristically selected initial datasets, often without explicit discussion of their rationale or potential impact on optimization outcomes.5 A related effort has explored the role of initialization alongside human interventions in an autonomous workflow for piezoresponsive materials;21 that work relied on latent space representations, while the present work focuses on domain knowledge and physically interpretable material properties. It is also worth noting the multi-decade gap between these efforts, and that the premier reference works on setting up BO problems, such as ref. 12, do not include sections dedicated to initialization.
Moreover, researchers typically assemble optimization workflows from modular algorithmic components—such as machine learning regressors (e.g., Gaussian processes), kernel functions (e.g., squared exponential), and acquisition functions (e.g., expected hypervolume improvement)—available in widely used machine learning libraries (e.g., PyTorch, scikit-learn).22–24 However, the availability of these modular tools does not guarantee effective integration into experimental workflows, particularly with respect to initial dataset selection. Without assessments explicitly designed to evaluate the quality of initial datasets—metrics independent of inaccessible ground truths—researchers lack essential guidance for selecting initial points that optimize subsequent performance.
In this paper, we directly address these critical challenges by systematically examining the impact of initial dataset selection on optimization outcomes. We propose a structured, domain-informed methodology for selecting initial design points, incorporating materials science principles such as subsystem complexity (complexity defined by the number of constituent elements or phases) and configurational entropy (a measure of disorder or diversity in atomic arrangements). Using these domain-informed criteria, we demonstrate improved representativeness and diversity of the initial datasets, significantly improving subsequent optimization efficiency. We use practical metrics to evaluate the quality of initial datasets without requiring knowledge of optimal solutions. We validate this approach using representative design spaces involving both computational and experimental alloy datasets. This work provides robust and practical methodological guidance for materials researchers, substantially improving the reproducibility and effectiveness of materials discovery efforts that rely on these techniques.
BO is well-suited for optimizing expensive, black-box objective functions, which are common in scientific and engineering applications where each evaluation is costly. The method constructs a surrogate model of the objective function—here, a Gaussian Process Regressor (GPR)—that provides both a mean prediction and an uncertainty estimate. This uncertainty quantification enables BO to balance exploitation (sampling candidates expected to perform well) with exploration (sampling candidates where the model is uncertain), typically via an acquisition function. The GP surrogate provides an uncertainty-aware model over the current design space as a guide for exploration. In the multi-objective setting, we employ Expected Hypervolume Improvement (EHVI), which assigns a scalar utility to each candidate based on its contribution to the expansion of the current Pareto front.
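To make the acquisition step concrete, the following is a minimal sketch of how EHVI can be estimated by Monte Carlo for a two-objective maximization problem. It is illustrative only: the function names (`hypervolume_2d`, `mc_ehvi`), the independent-Gaussian posterior assumption, and the Monte Carlo estimator are our assumptions, not the implementation used in this work, which may rely on analytic EHVI formulae.

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Area dominated by `points` relative to the reference point `ref`,
    assuming both objectives are maximized."""
    pts = np.asarray([p for p in np.atleast_2d(points) if (p > ref).all()])
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]          # sort by first objective, descending
    hv, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 > best_f2:                       # only non-dominated points add area
            hv += (f1 - ref[0]) * (f2 - best_f2)
            best_f2 = f2
    return hv

def mc_ehvi(mu, sigma, pareto_front, ref, n_samples=256, seed=0):
    """Monte Carlo estimate of the expected hypervolume improvement of one
    candidate, given its per-objective posterior mean `mu` and std `sigma`."""
    rng = np.random.default_rng(seed)
    hv_now = hypervolume_2d(pareto_front, ref)
    samples = rng.normal(mu, sigma, size=(n_samples, len(mu)))
    gains = [hypervolume_2d(np.vstack([pareto_front, s]), ref) - hv_now
             for s in samples]
    return float(np.mean(gains))
```

For the maximization problems considered here, the reference point would be chosen below the worst observed value of each objective; candidates with the largest estimated EHVI are the most attractive queries.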
To support parallel experimentation and accelerate convergence, BIRDSHOT implements batch Bayesian optimization. In each iteration, an ensemble of GPR models is constructed with randomized hyperparameters, allowing for diverse candidate proposals. From these proposals, a representative batch is selected using k-medoids clustering, ensuring both coverage and diversity in the queried designs. This ensemble-based selection strategy reduces sensitivity to hyperparameter tuning and guards against premature convergence to local optima—factors that are particularly important when the BO campaign begins with limited data. As a result, all batch data are used to retrain the ensemble of GPs (no static partitioning into validation or test sets is used).
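As a rough illustration of this ensemble-and-cluster batch selection, the sketch below fits several GPRs with randomized (and fixed) RBF length scales, lets each nominate a candidate, and reduces the unique nominations to a batch via k-medoids. The acquisition here is a single-objective upper confidence bound used as a stand-in for the EHVI acquisition actually employed; the function and parameter names are illustrative, and this is not the BIRDSHOT code.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn_extra.cluster import KMedoids

def propose_batch(X_train, y_train, X_pool, n_models=50, batch_size=20, seed=0):
    """Ensemble-of-GPR batch proposal followed by k-medoids down-selection."""
    rng = np.random.default_rng(seed)
    nominations = []
    for _ in range(n_models):
        # Randomized, fixed ARD length scales (optimizer=None keeps them fixed).
        ls = rng.uniform(0.05, 2.0, size=X_train.shape[1])
        gpr = GaussianProcessRegressor(kernel=RBF(length_scale=ls),
                                       optimizer=None, normalize_y=True)
        gpr.fit(X_train, y_train)
        mu, std = gpr.predict(X_pool, return_std=True)
        nominations.append(int(np.argmax(mu + 2.0 * std)))   # UCB stand-in for EHVI
    unique_idx = np.unique(nominations)
    if len(unique_idx) <= batch_size:
        return unique_idx                                     # nothing to cluster
    km = KMedoids(n_clusters=batch_size, random_state=seed).fit(X_pool[unique_idx])
    return unique_idx[km.medoid_indices_]
```

Because each ensemble member sees a different length-scale realization, the nominated candidates spread across the design space rather than collapsing onto a single optimum of one surrogate.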
In this study, we leverage BIRDSHOT's batch optimization capabilities to investigate how different initialization strategies affect the quality and efficiency of BO performance. We analyze the impact of initialization on model accuracy, exploration coverage, and Pareto front convergence, providing guidance for the deployment of BO in resource-constrained discovery workflows. Complete details of the formation of GP surrogate models, acquisition strategy, and batch selection procedure are provided in Section 5: SI.
The first dataset corresponds to a face-centered cubic (FCC) high-entropy alloy system in the composition space Al_{x1}V_{x2}Cr_{x3}Fe_{x4}Co_{x5}Ni_{x6}, where xi ranges from 0 to 0.95 in increments of 0.05. This results in approximately 50 000 unique compositions prior to any down-selection. Property outputs in this dataset include CALPHAD-predicted room-temperature density and room-temperature heat capacity. The second dataset corresponds to a body-centered cubic (BCC) high-entropy alloy system in the space Ti_{x1}V_{x2}Nb_{x3}Mo_{x4}Hf_{x5}Ta_{x6}W_{x7}, where xi ranges from 0 to 0.975 in increments of 0.025 prior to filtering. Property outputs include ROM-calculated Shannon configurational entropy and CALPHAD-calculated room-temperature thermal conductivity. In addition, the BCC dataset contains experimental measurements of specific hardness and specific modulus for 48 selected compositions.
The material datasets used in this work consist of data derived from thermodynamic and kinetic property predictions obtained via CALPHAD-based software. The Al–V–Cr–Fe–Co–Ni system is primarily composed of elements that form face-centered cubic (FCC) alloys, and equilibrium properties including solidification range (K), room-temperature density (g cm−3), heat capacity (J mol−1 K−1), and thermal conductivity (W m−1 K−1) were calculated from Thermo-Calc. The Ti–V–Nb–Mo–Hf–Ta–W system contains body-centered cubic (BCC) forming alloys, with further methodology described in ref. 25.
Both of these datasets correspond to experimental BO works in the field of alloy design, where single-phase solid solutions enable a simple correspondence between simulation and laboratory validation. The CALPHAD-derived property outputs are physically interpretable and provide a consistent basis for benchmarking black-box problems. Unlike simulated environments, where conclusions can be drawn from repeated trials, batch physical experimentation typically has the time and funding for only one attempt, so the net effect of the initialization strategy is directly relevant.
These datasets were down-selected to create initial datasets for this work. The initial data sets were created using two distinct sampling strategies: uniform random sampling and k-medoids clustering. Each strategy was repeated 1000 times to ensure statistical robustness. For computational datasets, batches of 20 samples were used in each initialization, while experimental batches were limited to 8 samples due to availability constraints. These initialization methods and their corresponding performance are further discussed in Section 3: Results and discussion.
To evaluate the diversity and representativeness of initial batches, three geometric features were computed for each selected subset: the bounding box area, the centroid deviation, and the average convex hull size. The bounding box area is defined as the smallest axis-aligned rectangular region enclosing the selected points in the multi-objective output space. The centroid deviation is computed as the average Euclidean distance between each point and the centroid of the selected set in the objective space, serving as a compactness metric. Convex hulls were computed by identifying the local neighborhood of each candidate (as if it were a medoid) and measuring the size of the convex region that encloses those neighbors. The average hull size over all points in the batch serves as a proxy for disparity and structural dispersion. These metrics were computed using standard numerical libraries: bounding box and centroid distances were calculated using NumPy, while convex hulls were computed using the scipy.spatial module. Medoid clustering was implemented using the KMedoids implementation from the scikit-learn-extra package.
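The three diagnostics are straightforward to reproduce; a sketch using NumPy and SciPy is given below. The neighborhood-hull construction reflects our reading of the description above (assign every dataset point to its nearest batch point, then average the hull sizes of the resulting neighborhoods), and the function names are ours.

```python
import numpy as np
from scipy.spatial import ConvexHull

def bounding_box_size(Y):
    """Area (2D) or volume (>2D) of the axis-aligned box enclosing Y."""
    return float(np.prod(Y.max(axis=0) - Y.min(axis=0)))

def mean_centroid_distance(Y):
    """Average Euclidean distance from each point to the centroid of Y."""
    return float(np.linalg.norm(Y - Y.mean(axis=0), axis=1).mean())

def average_neighborhood_hull(Y_all, Y_batch):
    """Assign every point in Y_all to its nearest batch point and average the
    convex-hull size of the neighborhoods. In 2D, ConvexHull.volume is the
    enclosed area (ConvexHull.area would be the perimeter)."""
    d = np.linalg.norm(Y_all[:, None, :] - Y_batch[None, :, :], axis=-1)
    labels = d.argmin(axis=1)
    dim = Y_all.shape[1]
    sizes = [ConvexHull(Y_all[labels == k]).volume
             for k in range(len(Y_batch))
             if np.count_nonzero(labels == k) > dim]  # need > dim points per hull
    return float(np.mean(sizes)) if sizes else 0.0
```

Degenerate (e.g., collinear) neighborhoods would require additional handling in practice; the sketch simply skips clusters that are too small to define a hull.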
To construct the full projection space at a given resolution, we first generate a set of all possible barycentric coordinates by discretizing the composition simplex. Each barycentric vector is then projected into R2 using a matrix of vertex coordinates that defines a regular polygon in two dimensions. Experimental, random, or off-grid points are projected in the same manner, allowing consistent visualization of arbitrary subsets relative to the entire compositional domain.
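A minimal sketch of this projection is given below: compositions (barycentric vectors whose components sum to one) are mapped into the plane as convex combinations of the vertices of a regular polygon. The helper names and the grid-generation routine are illustrative rather than the code used to produce Fig. 1.

```python
import numpy as np

def polygon_vertices(n):
    """Vertices of a regular n-gon on the unit circle, one per element."""
    angles = 2.0 * np.pi * np.arange(n) / n + np.pi / 2.0
    return np.column_stack([np.cos(angles), np.sin(angles)])   # (n, 2)

def project_compositions(X, vertices=None):
    """Project barycentric (compositional) vectors, whose rows sum to one,
    into R^2 as convex combinations of the polygon vertices."""
    X = np.atleast_2d(X)
    if vertices is None:
        vertices = polygon_vertices(X.shape[1])
    return X @ vertices                                        # (n_samples, 2)

def composition_grid(n_elements, step=0.05):
    """All barycentric grid points on the simplex at the given resolution."""
    levels = int(round(1.0 / step))
    def partitions(prefix, remaining, dims_left):
        if dims_left == 1:
            yield prefix + [remaining]
            return
        for k in range(remaining + 1):
            yield from partitions(prefix + [k], remaining - k, dims_left - 1)
    return np.array(list(partitions([], levels, n_elements)), dtype=float) * step
```

Plotting the projection of the full grid together with the projection of a selected subset gives a visualization in the style of Fig. 1.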
Visual annotations, including figure labels and overlays, were created using Inkscape, an open-source vector graphics editor.27 All visuals in this study were rendered at high resolution to maintain geometric precision and readability in both digital and print formats.
The full set of approximately 50 000 FCC compositions was down-selected to 2000 representative points using the k-medoids clustering algorithm.28
This sub-sampling procedure preserves the diversity of the original design space while enabling tractable evaluation during Bayesian optimization. Unlike centroid-based methods such as k-means, k-medoids selects actual data points as cluster centers, making it well suited for generating representative and physically valid subsets in high-dimensional, multi-objective alloy design spaces. A visualization of the resulting optimization domain is provided in Fig. 1.
An optimization policy can be simulated over the reduced design space by initializing with a small subset of data and iteratively selecting new points for evaluation, treating the CALPHAD-modeled properties as if they were experimentally measured. To implement this, we employ the BIRDSHOT method,19 as described in Section 2, which trains an ensemble of GP surrogates at each iteration. From this ensemble, candidate points are proposed based on expected hypervolume improvement (EHVI). Due to overlap in GP predictions, the resulting candidate pool is typically redundant and is reduced to a set of unique suggestions. These are then clustered using the k-medoids algorithm to select a final batch of diverse, high-utility points without imposing an explicit ranking (see Section 2: Methods and ref. 25). This simulation framework enables efficient, uncertainty-aware optimization under realistic data constraints.
As a demonstration, we examine the impact of two distinct initialization strategies on optimization performance. The first strategy employs a uniform random selection of initial points, while the second uses a subset deliberately chosen to span the output space through diversity-aware sampling. In both cases, the same BIRDSHOT policy is subsequently applied for 10 iterations, using a realistic batch size of 20 candidates per iteration. The optimization aims at joint maximization of two objectives, with all other conditions kept constant. The resulting optimization trajectories for both initialization schemes are presented in Fig. 2.
(a) This problem is especially pronounced in sequential (non-batch) optimization, where early decisions are made with sparse and potentially unrepresentative data.
(b) Alternative strategies such as grid-based sampling, input-space clustering, or point-inhibition methods (e.g., nearest-neighbor maximization) attempt to improve upon random initialization by enforcing geometric separation in the input space. However, these approaches do not necessarily ensure adequate coverage of the objective space, particularly when the input–output mapping is non-linear or many-to-one. As a result, they may still produce initial datasets that poorly represent the diversity of achievable outcomes.
These findings collectively underscore the importance of principled initialization in black-box optimization. Without a systematic and transparent methodology for selecting initial data, claims about optimization efficiency, convergence behavior, or exploration capability risk being irreproducible or misleading. Initialization should therefore be treated as a critical component of optimization design, not a procedural afterthought.
In contrast, a best-case initialization would consist of a subset that reflects the global structure of the objective space as faithfully as possible, given the available budget. This can be approximated by performing k-medoids clustering directly in the output (objective) space, thereby identifying a representative set of initial points that span the range of achievable responses. By capturing key topological features of the output distribution from the outset, such an initialization reduces the burden on the optimizer to discover underrepresented regions through search alone.
Fig. 3 illustrates the difference between high-quality and low-quality initializations by comparing two candidate datasets in the objective space. Panel (a1) displays a set of 20 medoids selected directly from the output dimensions using k-medoids clustering, along with the associated data clusters. By construction, this initialization provides broad and representative coverage of the output space, capturing both dense and sparse regions.
Panel (b1), on the other hand, shows a random sample of 20 data points, with each remaining point assigned to its nearest selected point to form a “pseudo-medoids” clustering for visual comparison. Because random sampling is biased toward high-density regions, the selected points in this case do not adequately cover low-density, high-information regions of the space. This issue is especially apparent in the present example, where the majority of the data is concentrated at low heat capacity values, leading the random sample to overlook critical areas in the distribution.
To quantify these differences, panels (a2)–(a4) and (b2)–(b4) present three metrics used to evaluate the geometric quality of the initialization set in the output space: (1) bounding box volume, which measures the overall spread of selected points; (2) mean distance to centroid, which captures dispersion and sensitivity to outliers; and (3) convex hull area (or volume in higher-dimensional spaces), which reflects the extent of topological coverage across the dataset. Initialization sets that are well suited for optimization should provide an information-rich sample of both the input and output spaces. Desirable initializations are those that can incorporate topological outliers and avoid excessive clustering within locally dense regions.
Each of the proposed metrics is designed to reward broader coverage and penalize redundant, tightly grouped selections that fail to span the full range of objective values. The bounding box volume reflects the overall axis-aligned extent of the selected points and serves as a quick diagnostic for whether the initialization spans the full dynamic range of each objective. Although it may overestimate true coverage in non-convex or sparse regions, it remains effective for identifying collapsed or overly concentrated subsets. The mean distance to the centroid measures internal dispersion and is sensitive to structural variability; it increases not only with spread but also with the presence of outliers, offering insight into how uniformly the dataset samples around its central tendency. The convex hull volume provides a more topology-aware measure of coverage by enclosing the minimal convex region that contains all selected points. This makes it especially valuable for distinguishing between broad but disconnected spreads and genuinely representative samples that trace the boundary of the achievable objective space.
It is important to emphasize that the clustering shown in Fig. 1(m) illustrates just one of many valid ways to partition the output space into representative regions. There is no unique solution to this decomposition, even when accounting for local density or distributional features. For example, the placement of medoids can be rotated or perturbed around the center of the space without meaningfully affecting their representativeness. In more complex datasets, small variations in the boundaries of the clusters – particularly those driven by nearest-neighbor assignments – can lead to multiple equally plausible clustering configurations, each reflecting a different but defensible interpretation of the structure of the space. Despite this inherent variability in how the output space may be partitioned, the metrics introduced here exhibit consistent and interpretable behavior across thousands of subsets constructed both randomly and deliberately. As shown later in Fig. 5, these diagnostics remain robust under non-unique sampling conditions, reliably capturing meaningful differences in dataset structure regardless of the specific clustering configuration.
A common strategy for selecting an initial dataset—once purely random sampling has been ruled out—is to construct a grid that spans the input space as uniformly as possible. For example, in the context of a binary phase diagram, one might sample 11 evenly spaced compositions between 0% and 100% of each element in 10% increments. Although this approach is straightforward and effective in low-dimensional symmetric systems, it becomes increasingly impractical in higher dimensions, particularly when physical constraints restrict the set of feasible compositions. In such cases, algorithms such as k-medoids offer a tractable and interpretable alternative by identifying representative subsets that are well distributed across the valid input space.
For example, in the six-element Al–V–Cr–Fe–Co–Ni design space, selecting 20 representative compositions from a candidate pool of 2000 enables efficient initialization of a high-dimensional optimization campaign. Although one of the six compositional variables is linearly dependent due to the constraint that atomic fractions must sum to one, all dimensions are treated equally when computing pairwise distances. Assigning a consistent length scale to each dimension ensures a well-defined geometry in the input space, even when some features are mathematically redundant. As will be shown later, such derived or constrained features can still contribute a meaningful structure to optimization algorithms, particularly when used in kernel-based models or distance-aware sampling strategies.
This strategy is grounded in a widely held assumption in materials science: small changes in input variables generally lead to small, continuous changes in material properties. As a result, sampling the input space at roughly uniform intervals is often viewed as a reasonable way to ensure early-stage information gain. However, this assumption breaks down in many realistic scenarios and uniform input-space sampling becomes problematic for two main reasons:
(1) It assumes that equal spacing in the input space corresponds to comparably informative samples in output space. This is rarely valid in high-dimensional or nonlinear systems, where local gradients in material properties can vary dramatically and unpredictably. In such settings, equidistant sampling may result in the oversampling of trivial regions and the undersampling of highly informative ones.
(2) Empirically, uniform input-space sampling often produces output distributions that are statistically similar to those resulting from random sampling, particularly when the input–output mapping is non-linear or the output space is unevenly populated. This diminishes any practical advantage of regular grids in many real-world design tasks.
To illustrate point (1), consider a binary phase diagram such as the Al–Cu system.29 These diagrams often contain structurally simple regions alongside composition intervals with complex multiphase behavior. Suppose one were tasked with reconstructing the Al–Cu phase diagram from 21 cooling curves. A naive strategy might place samples every 5% across the full composition range, but this would result in wasted effort in simple regions and inadequate resolution where the behavior is most intricate, such as between 50–85% Cu. A more effective strategy would concentrate the samples in the region of highest complexity, as visualized in Section 5: SI. Although such complexity is not known a priori in new systems, domain knowledge, prior experience, and physically grounded heuristics can guide the design of more informative initializations.
Point (2) is demonstrated graphically and may be interpreted as a population-based bias. Input-space distance metrics—such as those used in Euclidean kernels or nearest-neighbor algorithms—implicitly treat all displacements of equal magnitude as equally informative. For example, a 5% increase in one element paired with a 5% decrease in another yields the same distance, regardless of where this perturbation occurs in composition space. However, this symmetry assumption breaks down in constrained or irregular spaces. As shown in Fig. 1, even after down-selection, the 2000 alloy candidates span compositional regions with vastly different densities. Algorithms that rely on geometric distance in input space will tend to oversample high-density regions, implicitly assuming that Pareto-optimal solutions are similarly distributed.
This assumption parallels what is known in the philosophy of science as the Presumptuous Philosopher problem:30–32 reasoning from population priors without direct measurement or justification. Although uniform assumptions may be harmless for provably irrelevant variables (such as the parity of candidate indices), they are epistemically unjustified when applied to output-relevant features. Supposing that optimal solutions mirror the population distribution of candidates introduces a subtle but significant selection bias, which can mislead both experimental design and algorithmic search strategies.
This hypothesis can be empirically tested using the previously described dataset of 2000 alloy compositions, shown in Fig. 4. Fig. 4 (left) illustrates that the most populous elemental subsystem—comprising all six elements—closely mirrors the overall topology of the complete dataset. As a result, restricting sampling to this subsystem does not improve initial diversity relative to random sampling from the full composition space. Fig. 4 (right) presents population distributions for each elemental subsystem. Notably, the Pareto front—defined by optimizing two representative CALPHAD-modeled properties and highlighted in red—shows a marked mismatch between input population density and the location of optimal alloys. Several Pareto-optimal candidates originate from sparsely populated subsystems; in this example, 7 out of 23 Pareto-optimal points fall within low-population groups. This decoupling between subsystem frequency and Pareto quality persists regardless of the specific property pair selected for optimization.
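Identifying the Pareto-optimal subset for a pair of maximized objectives is straightforward; a small sketch is shown below. The `subsystem_id` array in the usage comment is a hypothetical per-alloy label, not a variable from the paper's code.

```python
import numpy as np

def pareto_mask(Y):
    """Boolean mask over the rows of Y marking non-dominated points,
    assuming every objective is to be maximized."""
    n = len(Y)
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Point j dominates point i if it is >= in all objectives and > in at least one.
        dominated = np.all(Y >= Y[i], axis=1) & np.any(Y > Y[i], axis=1)
        if dominated.any():
            mask[i] = False
    return mask

# Hypothetical usage: count Pareto-optimal alloys per elemental subsystem.
# front_counts = np.bincount(subsystem_id[pareto_mask(Y)])
```

Comparing `front_counts` against the overall subsystem populations makes the mismatch between input density and Pareto membership, as described above, directly visible.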
As demonstrated earlier in our toy dataset example, input-space population is a poor predictor of where optimal properties reside. Topologically, such divergence is inevitable: a six-dimensional compositional manifold, even if uniformly sampled, becomes distorted when projected into two- or three-dimensional property space via nonlinear chemistry–property mappings. Consequently, regions densely populated in the input space need not correspond to peaks in property performance. However, when optimization algorithms consistently surface candidates from densely sampled regions, it becomes easy to conflate frequency with merit. This tendency risks introducing a form of selection bias, where genuinely superior candidates from underrepresented subsystems are systematically overlooked.
To mitigate this issue, we propose augmenting the input feature space with chemically informed descriptors, specifically a ‘subsystem complexity’ feature that encodes the unique combination of constituent elements for each alloy. Including this dimension in the initialization process, via k-medoids clustering, ensures that medoids are selected not only based on geometric proximity but also on chemical diversity, independent of population density. This approach effectively breaks the symmetry imposed by conventional distance-based methods, which treat equal compositional shifts as equally meaningful. In reality, small changes in minor constituents can have outsized impacts: for example, increasing an element from 30% to 35% may have marginal effects, while reducing another from 5% to 0% can fundamentally alter the behavior of the material—such as fully eliminating Al in stainless steel, which precludes the formation of protective passivation layers of alumina.33 Since the number of possible elemental combinations far exceeds the number of initialization points, this strategy inherently favors subsystem diversity and helps counteract the population-driven sampling bias. A visual example of this approach, using k-medoids clustering over an augmented feature space and visualized via affine projections (as in Fig. 1), is provided in Section 5: SI.
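One simple way to encode such a descriptor is to map each composition to an integer identifying its set of constituent elements; a sketch follows. The tolerance, the ID convention, and the function name are our assumptions.

```python
import numpy as np
from itertools import combinations

def subsystem_feature(X, tol=1e-8):
    """Map each composition row to an integer identifying the set of elements
    present above `tol` (its elemental subsystem). IDs enumerate all 2- to
    n-element combinations (1..57 for six elements); single-element systems,
    excluded from the dataset, would return -1 here."""
    n_el = X.shape[1]
    subsystems = [frozenset(c) for k in range(2, n_el + 1)
                  for c in combinations(range(n_el), k)]
    lookup = {s: i + 1 for i, s in enumerate(subsystems)}
    ids = np.empty(len(X), dtype=int)
    for row, x in enumerate(X):
        present = frozenset(np.flatnonzero(x > tol).tolist())
        ids[row] = lookup.get(present, -1)
    return ids
```

The resulting integer column can then be appended (with an appropriate scaling) to the compositional features before k-medoids clustering.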
We demonstrate this strategy using the previously described CCA toy dataset. A discrete ‘complexity’ feature was added to the input space, ranging from 1 to 57 and corresponding to all unique elemental combinations of 2 to 6 elements (excluding single-element systems) from a six-element pool, as shown in Fig. 4 (right). Clustering was initialized using the k-medoids++ method over the compositional dimensions, with or without the complexity feature, depending on the strategy. A total of 1000 initialization datasets were generated under four distinct sampling approaches: RAND (random sampling in the input space), COMP (only composition medoids), COMP+X (composition medoids with added complexity dimension) and IDEAL (medoids clustered directly in the output space, serving as a theoretical benchmark for optimal property coverage).
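A condensed sketch of one realization of these four strategies is given below. The complexity IDs are assumed to have been computed as in the previous sketch, and the rescaling of the appended column is an arbitrary illustrative choice, not necessarily that of the study.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

def initializations(X, Y, complexity, n_init=20, seed=0):
    """One realization of each sampling strategy compared in the text.
    RAND: uniform random rows; COMP: k-medoids on composition only;
    COMP+X: k-medoids on composition plus the complexity feature;
    IDEAL: k-medoids directly in the output (objective) space."""
    rng = np.random.default_rng(seed)
    rand_idx = rng.choice(len(X), size=n_init, replace=False)

    def medoid_idx(F):
        km = KMedoids(n_clusters=n_init, init="k-medoids++", random_state=seed)
        return km.fit(F).medoid_indices_

    comp_idx = medoid_idx(X)
    Xx = np.column_stack([X, complexity / complexity.max()])  # crude rescaling
    compx_idx = medoid_idx(Xx)
    ideal_idx = medoid_idx(Y)
    return {"RAND": rand_idx, "COMP": comp_idx,
            "COMP+X": compx_idx, "IDEAL": ideal_idx}
```

Repeating this routine with different seeds and evaluating the diversity metrics on the objective values of each returned subset reproduces the kind of comparison summarized in Fig. 5.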
For each initialization set, we computed the bounding box area and the mean distance to the centroid in the objective space, visualizing the distributions of these metrics using histograms and empirical cumulative distribution functions (CDFs), as shown in Fig. 5. These results clearly indicate that COMP and RAND yield nearly indistinguishable distributions, both of which deviate substantially from the IDEAL benchmark. In contrast, the COMP+X strategy shifts both metric distributions toward the IDEAL regime, demonstrating significantly improved output-space diversity.
Further inspection of Fig. 5 reveals that datasets generated by RAND or COMP generally exhibit smaller bounding boxes and shorter centroid distances, indicating limited exploration of property space. As shown earlier in Fig. 3, these strategies also tend to exclude alloys with higher heat capacities, which are relatively rare in the dataset. By incorporating the complexity feature, COMP+X systematically improves the diversity of selected initial points in the output space, providing a stronger foundation for subsequent optimization. In this specific system—although the pattern is likely generalizable—our results suggest that greater chemical complexity correlates with greater property diversity, supporting the use of chemically informed descriptors in initialization design.
This dataset and methodology are based on the framework established by Paramore et al., who demonstrated the utility of Bayesian optimization for alloy design in high-dimensional refractory systems.25 Despite the fact that the input space forms a compact and fully connected manifold, the output space—as shown in Fig. 6(b)—exhibits significant structural discontinuities. In particular, alloys with high thermal conductivity (κRT > 30 W m−1 K−1) are sparsely represented, forming isolated regions in objective space. This undersampling of extreme property alloys can hinder downstream tasks such as optimization and discovery, as further illustrated in Fig. 2.
After applying thermodynamic and production feasibility constraints, 24 unique elemental subsystems remained in the BCC refractory alloy design space. One subsystem in particular—the Hf-free system—contained a disproportionately large number of candidates satisfying all constraints, while several others had fewer than ten viable compositions. As in the previous case, composition-based sampling alone would likely overlook these sparsely populated regions, assigning undue weight to the more populous subsystem. Augmenting the design space with a subsystem ‘complexity’ dimension once again offers a principled solution to this sampling bias.
To evaluate this approach, we generated 1000 initial datasets using the established sampling strategies: RAND (random input sampling), COMP (composition-only medoids), COMP+X (composition medoids augmented with subsystem complexity), and IDEAL (output-space medoids serving as a theoretical upper bound). As shown in Fig. 6(c), the COMP+X strategy yields initializations with output-space diversity approaching that of the IDEAL case. In contrast, RAND and COMP consistently fail to produce initialization sets with sufficient coverage of the property space. An extended version of this figure similar to Fig. 5 is provided in Section 5: SI.
Although earlier analyses demonstrated the utility of encoding subsystem complexity in computational datasets derived from CALPHAD modeling or ideal-mixing assumptions, it is essential to validate this approach using experimental data. Unlike modeled properties, experimental measurements often exhibit additional complexity and variability, which poses challenges to standard sampling strategies. To investigate this, we analyzed a dataset from Paramore et al., comprising 48 experimentally synthesized alloys in the Ti–V–Nb–Mo–Hf–Ta–W system, each tested for specific hardness and specific modulus as part of a batch optimization study. Given the limited size of the dataset, we generated 1000 subsets of eight alloys each to evaluate the robustness of different sampling methods.
Fig. 7 (left) shows the distribution of specific hardness and specific modulus in the output space, along with one representative subset. The subset is visualized via its medoid and connections to the other selected alloys. Because the two mechanical properties are strongly correlated, the bounding box area provides limited discriminative power for quantifying diversity. Instead, we focus on mean distance to centroid as a more robust indicator of output-space coverage. Fig. 7 (right) shows a histogram of this metric across 1000 subsets generated via three sampling strategies: RAND (random selection), COMP (medoids selected from compositional inputs), and COMP+X (medoids selected from inputs augmented with subsystem complexity). As this is an experimental dataset, no IDEAL sampling strategy is available for benchmarking.
Despite the modest dataset size, RAND frequently yielded subsets with low centroid distances and limited property diversity. COMP provided only marginal improvement over random sampling, whereas COMP+X consistently produced more diverse subsets, better spanning the full extent of the measured property space. These results reinforce the finding that augmenting the input space with domain knowledge—such as subsystem complexity—can lead to more representative and effective initializations, even in data-limited experimental regimes.
To illustrate this concept, we examined a subset of the Al–V–Cr–Fe–Co–Ni alloy composition space, explicitly excluding vanadium to focus on alloys containing the remaining five elements. In this example, we incorporate Shannon's configurational entropy, defined as $\Delta S_{\mathrm{conf}} = -k_{\mathrm{B}} \sum_{i} x_i \ln x_i$, where $x_i$ is the atomic fraction of element $i$, as an additional dimension in the clustering process. Since configurational entropy is a concave, nonlinear function of composition, uniform sampling across the input space yields a non-uniform, typically skewed distribution in entropy space. In the context of alloy discovery, one may wish to investigate systems of differing structural or thermodynamic complexity as quantified by configurational entropy.
To ensure representative sampling across this dimension, we introduced a binary entropy class (low vs. high entropy) as an auxiliary input feature.
The FCC alloy space, shown in Fig. 1, was filtered to its vanadium-free subset, yielding 3842 CALPHAD-feasible compositions. The predicted values for the solidification range and the room-temperature density are used as representative outputs in Fig. 8 (left). We selected the Shannon entropy of an equiatomic quaternary alloy (ln 4 ≈ 1.386 kB) as the threshold to classify alloys as low or high entropy. This binary classification was appended to the input space and used in k-medoids clustering. As shown in Fig. 8 (right), augmenting the input representation with entropy class consistently improved property-space coverage across 1000 initialization datasets, compared to clustering on composition alone. This result highlights how even a coarse but physically meaningful classification—when used as an additional feature—can enhance the diversity of initializing populations and improve the quality of early-stage exploration.
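For completeness, a short sketch of the entropy computation and the binary entropy-class feature follows; the function names are ours, and the threshold defaults to ln 4 as used above.

```python
import numpy as np

def configurational_entropy(X):
    """Ideal (Shannon) configurational entropy per atom, in units of k_B:
    S = -sum_i x_i ln x_i, with zero-fraction terms contributing nothing."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(X > 0, X * np.log(X), 0.0)
    return -terms.sum(axis=1)

def entropy_class(X, threshold=np.log(4)):
    """Binary low/high entropy label relative to the equiatomic quaternary
    value ln 4 ~ 1.386 k_B, suitable for appending as an extra input column."""
    return (configurational_entropy(X) >= threshold).astype(float)
```

Appending `entropy_class(X)` to the compositional features before k-medoids clustering yields the entropy-augmented initialization compared in Fig. 8.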
To evaluate performance, 50 independent optimization campaigns were conducted for each strategy using the FCC dataset introduced in Fig. 1 and discussed in Section 3.1: Variance within optimizations. Each campaign aimed to simultaneously maximize room-temperature density and heat capacity in the Al–V–Cr–Fe–Co–Ni system. Campaigns proceeded over 10 iterations with a batch size of 20, including the initial seed. Each iteration used an ensemble of 1000 Gaussian process regressors (GPRs), each with randomized length scale hyperparameters and acquisition based on expected hypervolume improvement (EHVI), with the final batch selected via k-medoids clustering.35
Prior to simulation, we anticipated that COMP+X would produce higher initial hypervolumes on average, consistent with earlier observations. While hypervolume (HV) was not a direct clustering target, the larger average bounding box area observed in COMP+X initializations inherently implies broader coverage of the property space. The simulations thus evaluate whether this initialization advantage consistently improves the optimization performance across surrogate model realizations.
The results are shown in Fig. 9. For each strategy, the mean hypervolume achieved per iteration in all 50 optimization runs is plotted: COMP+X in cyan and COMP in green. Shaded bands denote the 10th to 90th percentile range. The influence of initialization is evident from the first iteration: 96% of COMP+X seed hypervolumes exceed the mean seed HV of the COMP runs and 60% surpass the COMP 90th percentile. These results indicate that augmenting the input space with subsystem complexity yields significantly more diverse and higher-performing initial datasets. This early advantage persists throughout the optimization process. The COMP+X strategy maintains a consistent lead in hypervolume across iterations, reflecting a durable performance advantage from improved initialization. Although both strategies may converge to similar Pareto fronts in the asymptotic limit, the superior starting conditions provided by COMP+X facilitate more effective early-stage exploration, a central objective of batch BO frameworks operating under resource constraints.

The graph in Fig. 9 follows a standard format commonly used in comparative optimization studies, such as those evaluating different acquisition functions or kernel choices. Error bars are intentionally omitted, as they misrepresent statistical uncertainty in this context. In optimization problems, each iteration is inherently dependent on the previous one, and metrics such as hypervolume are strictly non-decreasing by definition. Moreover, hypervolume is bounded above by the value corresponding to full discovery of the Pareto front. Consequently, standard deviations or standard errors—which assume independent and identically distributed (i.i.d.) samples—are not meaningful when overlaid on such trajectories.

To complement the analysis of average hypervolume trajectories, we further examine the results of individual optimization runs. This provides insight into the variability and robustness of each strategy and goes beyond conventional summaries by highlighting differences in the consistency and efficacy of the policies across repeated simulations.
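A plotting sketch in the format of Fig. 9 (mean trajectory with a shaded 10th to 90th percentile band) might look as follows. It assumes the per-run hypervolume trajectories have already been recorded as a (runs × iterations) array and uses matplotlib; it is not the plotting code used for the figure.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_hv_bands(hv_runs, label, color, ax=None):
    """Plot the mean hypervolume trajectory with a shaded 10th-90th percentile
    band. `hv_runs` has shape (n_runs, n_iterations)."""
    hv_runs = np.asarray(hv_runs)
    ax = ax or plt.gca()
    iterations = np.arange(1, hv_runs.shape[1] + 1)
    lo, hi = np.percentile(hv_runs, [10, 90], axis=0)
    ax.plot(iterations, hv_runs.mean(axis=0), color=color, label=label)
    ax.fill_between(iterations, lo, hi, color=color, alpha=0.2)
    ax.set_xlabel("Iteration")
    ax.set_ylabel("Hypervolume")
    ax.legend()
    return ax
```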
The optimization scheme described in Section 2: Methods contains multiple sources of variation. Length-scale hyperparameters, selected via Latin hypercube sampling, are intentionally randomized in the BBO framework to generate diverse candidate sets across surrogate models without imposing a particular prior. Likewise, as discussed extensively in this work, the initial dataset is selected via a policy that incorporates random seeding. By allowing both the initial dataset and the Gaussian process hyperparameters to vary, we account for the inherent stochasticity of BO campaigns. For a sufficiently large number of runs, this variability ensures statistical robustness when comparing competing initialization strategies.
To systematically compare two policies, we proceed as follows:
(1) For each optimization under policy #1, record the value of the performance metric—hypervolume (HV)—at every iteration.
(2) For each optimization under policy #2, record the same metric at the same iteration indices.
(3) Compare the HV of the individual policy #1 optimization to each of the policy #2 runs at each iteration, and compute the fraction of cases where it dominates.
(4) Repeat this procedure for all policy #1 optimizations, aggregating results into a mean dominance percentage with percentile bands.
(5) Reverse the comparison: evaluate each policy #2 optimization against all policy #1 runs using the same procedure.
If policy #1 consistently yields better performance than policy #2, then steps (1)–(4) should produce high dominance percentages, while step (5) should produce noticeably lower ones.
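A vectorized sketch of this comparison procedure is given below; the array names are hypothetical, and the hypervolume trajectories are assumed to be stored as (runs × iterations) arrays.

```python
import numpy as np

def dominance_percentages(hv_a, hv_b):
    """For hypervolume trajectories hv_a, hv_b of shape (n_runs, n_iterations),
    return an (n_runs_a, n_iterations) array giving, for every policy-A run and
    iteration, the fraction of policy-B runs whose HV it exceeds (steps 1-4);
    swapping the arguments gives step 5."""
    hv_a = np.asarray(hv_a)[:, None, :]    # (n_a, 1, n_iter)
    hv_b = np.asarray(hv_b)[None, :, :]    # (1, n_b, n_iter)
    return (hv_a > hv_b).mean(axis=1)      # mean over policy-B runs

# Hypothetical per-iteration summary, matching the quantities plotted in Fig. 10:
# dom = dominance_percentages(hv_compx, hv_comp)   # hv_* are assumed arrays
# mean_dom = dom.mean(axis=0)
# band_25_75 = np.percentile(dom, [25, 75], axis=0)
```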
Previous analyses already established this dominance at the first iteration. As shown in Fig. 5, the COMP+X initialization strategy reliably yields higher HV in the seed dataset compared to COMP. Although previous results focused on the bounding box area and the mean distance to the centroid, these geometric metrics strongly correlate with initial hypervolume. What remains to be determined is the persistence of this advantage across iterations and its cumulative effect on overall optimization performance. These results are summarized in Fig. 10.
Each policy's mean is shown as scatter points on top of interquartile (25th to 75th percentile) bands. As expected, the first iteration of Fig. 10 matches the prior histograms. In fact, the COMP+X policy is so consistent in this regard that its mean lies outside the interquartile band for the first iteration; across the 50 COMP+X datasets' dominance of the COMP policy, the vast majority of values are at or near 100%. Continuing onward, only occasionally does the variance of an individual simulation cause a COMP+X policy to perform worse than standard initialization practices. Conversely, the COMP policies are often near 0% domination of the COMP+X policies; their non-zero values arise largely from occasional iterations with a large jump in HV (i.e., where a run has gotten “lucky”). Fig. 10 puts into context the likelihood of success given a set of algorithmic changes, which is difficult to parse from “HV vs. iteration” visuals alone (i.e. Fig. 9).
The findings presented here reinforce and extend the conclusions of Maaranen et al.,20 underscoring the critical impact of initial dataset selection on optimization performance. Our results show that incorporating expert-defined descriptors into the sampling process yields markedly improved initialization strategies. These informed approaches mitigate the biases inherent in uniform or geometry-based input sampling, leading to more representative coverage of the design space and more efficient exploration of complex materials landscapes. The principal conclusions are summarized as follows:
• Random sampling is inadequate for initializing optimization campaigns and frequently degrades downstream performance, regardless of the method applied.
• Sampling strategies that define diversity solely in terms of geometric input space (e.g., compositional distance) offer little to no improvement over random sampling in terms of output-space coverage.
• Input spaces can and should be augmented with derived or domain-informed features. This work shows that incorporating classification-like descriptors (e.g., subsystem identity or configurational entropy class) leads to more representative and effective initial datasets.
Importantly, these insights are broadly applicable and largely independent of specific algorithmic choices, including surrogate model type, kernel function, acquisition strategy, or batch selection policy. More generally, embedding domain knowledge into input representations provides a flexible and effective means of improving materials optimization workflows, particularly in systems characterized by complex or sparsely sampled design spaces. Future work may further integrate such descriptors into the optimization loop itself—for example, by informing GP kernel length scales or shaping acquisition behavior during batch selection.
While this work leverages a BO framework, the novelty primarily lies in the initialization strategy and integration of chemically meaningful descriptors (such as subsystem complexity and configurational entropy). Across a variety of simulated and experimental alloy systems, we find that augmenting the input space with these expert-derived dimensions leads to significantly more diverse initial datasets and improved optimization trajectories. Importantly, these gains are observed without changing the underlying BO engine, acquisition strategy, or surrogate model. Future enhancements of these strategies may utilize the properties of select alloys, such as predicted microstructural elements for processing conditions.
Supplementary information (SI): the Bayesian methods, a data clustering representation, additional refractory data, and a diagram of sampling strategies. See DOI: https://doi.org/10.1039/d5dd00361j.