Farshud Sorourifar,a Thomas Banker,ab and Joel A. Paulson*a
aThe Ohio State University, Department of Chemical and Biomolecular Engineering, Columbus, OH 43210, USA. E-mail: paulson.82@osu.edu
bUniversity of California, Berkeley, Department of Chemical and Biomolecular Engineering, Berkeley, CA 94720, USA
First published on 1st September 2025
The discovery of molecules with optimal functional properties is a central challenge across diverse fields such as energy storage, catalysis, and chemical sensing. However, molecular property optimization (MPO) remains difficult due to the combinatorial size of chemical space and the cost of acquiring property labels via simulations or wet-lab experiments. Bayesian optimization (BO) offers a principled framework for sample-efficient discovery in such settings, but its effectiveness depends critically on the quality of the molecular representation used to train the underlying probabilistic surrogate model. Existing approaches based on fingerprints, graphs, SMILES strings, or learned embeddings often struggle in low-data regimes due to high dimensionality or poorly structured latent spaces. Here, we introduce Molecular Descriptors with Actively Identified Subspaces (MolDAIS), a flexible molecular BO framework that adaptively identifies task-relevant subspaces within large descriptor libraries. Leveraging the sparse axis-aligned subspace (SAAS) prior introduced in recent BO literature, MolDAIS constructs parsimonious Gaussian process surrogate models that focus on task-relevant features as new data is acquired. In addition to validating this approach for descriptor-based MPO, we introduce two novel screening variants, which significantly reduce computational cost while preserving predictive accuracy and physical interpretability. We demonstrate that MolDAIS consistently outperforms state-of-the-art MPO methods across a suite of benchmark and real-world tasks, including single- and multi-objective optimization. Our results show that MolDAIS can identify near-optimal candidates from chemical libraries with over 100 000 molecules using fewer than 100 property evaluations, highlighting its promise as a practical tool for data-scarce molecular discovery.
Machine learning (ML) offers a promising strategy for accelerating MPO, especially in low-data regimes where molecular simulations and/or wet-lab experiments are costly.22,23 However, progress is often limited by three key challenges: (i) representing molecules in a form amenable to predictive modeling and optimization, (ii) building uncertainty-aware surrogate models from limited labeled data, and (iii) reasoning over high-dimensional representations where only a small subset of features may influence the target property.
A wide range of molecular representations have been proposed, including SMILES24 and SELFIES25 strings, molecular graphs,26 fingerprints,27 and descriptor-based feature vectors.28,29 While these encodings capture varying levels of structural, electronic, or topological information, they are often high-dimensional and not guaranteed to align with the underlying property function landscape – especially when training data is limited and model overfitting becomes a concern. Recent efforts to address these issues have focused on data-driven embedding methods such as variational autoencoders (VAEs),30,31 normalizing flows,32 and deep kernel learning (DKL).33–35 These models learn continuous molecular embeddings that enable the use of Bayesian optimization (BO)36–38 for molecular design. However, they present practical limitations: training can be brittle and sensitive to hyperparameters; the latent space may not reflect smooth changes in the property of interest; and the learned representation is often fixed and, thus, unable to adapt as new data becomes available.
An alternative to learning an embedding space is to apply BO directly on fixed molecular representations using specialized similarity kernels. For example, recent work has employed Gaussian processes (GPs) with Tanimoto fingerprint kernels or graph kernels for molecular structures.39 While these methods avoid the need for learning a separate (typically highly parametrized) encoder, they assume that molecules deemed similar by the kernel will exhibit similar properties – an assumption that may not hold when the kernel structure is not adapted to the target task.
In this work, we present the Molecular Descriptors and Actively Identified Subspaces (MolDAIS) framework, illustrated in Fig. 1, which enables efficient and interpretable molecular property optimization using descriptor-based representations. MolDAIS builds upon the recently proposed Sparse Axis-Aligned Subspace BO (SAASBO) method,40 adapting it to operate over large, chemically informed descriptor libraries. Rather than learning a new molecular embedding, MolDAIS leverages precomputed descriptors and performs adaptive feature selection using sparsity-inducing techniques. This allows the surrogate model to automatically and adaptively identify low-dimensional, property-relevant subspaces during the course of optimization.
The core variant of MolDAIS uses the SAAS prior within a fully Bayesian GP model to induce axis-aligned sparsity in the input space. Importantly, we also introduce two new (more scalable) screening alternatives based on mutual information (MI) and the maximal information coefficient (MIC), which provide runtime advantages while retaining interpretability and adaptivity. These screening variants represent a novel contribution beyond existing SAASBO implementations and offer a practical alternative for subspace selection when full Bayesian inference (using, e.g., Hamiltonian Monte Carlo) is computationally prohibitive.
We evaluate MolDAIS on a wide range of benchmark and real-world MPO problems, including single- and multi-objective optimization tasks. Across these settings, MolDAIS consistently outperforms state-of-the-art baselines based on molecular graphs, SMILES strings, and latent embeddings, particularly when the number of available evaluations is fewer than 100. In addition to its strong empirical performance, MolDAIS offers practical advantages: it avoids deep learning infrastructure, requires minimal tuning, and can be applied out-of-the-box to any descriptor-featurized molecular dataset. These features make it especially well suited for deployment by domain scientists who may not have significant ML expertise. Taken together, these contributions position MolDAIS as a flexible and extensible framework for data-efficient MPO, opening new opportunities for efficient, targeted, and accessible exploration of chemical space.
Formally, MPO can be stated as

$$m^\star \in \arg\max_{m \in \mathcal{M}} F(m), \qquad (1)$$

where $\mathcal{M}$ is a finite set of candidate molecules that defines the molecular search space and $F : \mathcal{M} \to \mathbb{R}$ is the black-box objective function that maps a molecule to its corresponding property value. The goal is to identify the optimal molecule $m^\star$ such that $F(m^\star) \geq F(m)$ for all $m \in \mathcal{M}$. While this is trivial for small sets of molecules, it quickly becomes intractable when the number of molecules in the set approaches $\sim 10^4$ or more, which is increasingly common in practical settings. Compounding this difficulty is the fact that evaluations of F typically require expensive simulations or wet-lab experiments, often involving noise. Hence, MPO problems require sample-efficient, intelligent search strategies over high-dimensional, structured, and discrete spaces.
BO typically employs a Gaussian process (GP) surrogate, which places a prior over the objective, $F \sim \mathcal{GP}(\mu, k)$, with a mean function $\mu(\cdot)$ and covariance or kernel function $k(\cdot, \cdot)$. Given a dataset $\mathcal{D}_n = \{(m_i, y_i)\}_{i=1}^{n}$ of n observed property values $y_i = F(m_i) + \varepsilon_i$ for some (heteroskedastic) zero-mean Gaussian noise $\varepsilon_i \sim \mathcal{N}(0, \sigma_i^2)$, the posterior predictive distribution at a new point m is Gaussian with the following mean and variance:41

$$\mu_n(m) = \mu(m) + k_n(m)^\top (K_n + \Sigma_n)^{-1} (y_{1:n} - \mu_{1:n}), \qquad (2a)$$

$$\sigma_n^2(m) = k(m, m) - k_n(m)^\top (K_n + \Sigma_n)^{-1} k_n(m), \qquad (2b)$$

where $k_n(m) = [k(m, m_1), \ldots, k(m, m_n)]^\top$, $[K_n]_{ij} = k(m_i, m_j)$, $\Sigma_n = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, $y_{1:n} = [y_1, \ldots, y_n]^\top$, and $\mu_{1:n} = [\mu(m_1), \ldots, \mu(m_n)]^\top$.
An acquisition function $\alpha(m; \mathcal{D}_n)$ guides the next evaluation by quantifying the utility of querying F at m. Common acquisition functions include expected improvement (EI),42 upper confidence bound (UCB),43 and probability of improvement (PI).44 BO proceeds via the iterative strategy:

$$m_{n+1} \in \arg\max_{m \in \mathcal{M}} \alpha(m; \mathcal{D}_n), \qquad (3)$$

where $\mathcal{D}_{n+1} = \mathcal{D}_n \cup \{(m_{n+1}, y_{n+1})\}$ is the updated dataset. Acquisition functions are typically designed to balance exploration (regions with high uncertainty) and exploitation (regions with high objective values), with the aim of making BO sample-efficient. For the aforementioned acquisition functions, we can derive closed-form expressions for $\alpha(m; \mathcal{D}_n)$, making it much easier to optimize than F itself. Thus, in MPO settings, the computational overhead associated with training the surrogate and optimizing α(·) is often negligible compared to running new experiments or simulations.
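To make the loop in eqn (3) concrete, here is a minimal sketch of BO over a finite candidate pool with an EI acquisition. It uses scikit-learn's GP as a simple stand-in surrogate (MolDAIS itself uses the SAAS-prior GP of Section 3); the pool, toy objective, and budget are illustrative assumptions, not part of the paper's setup.

```python
# Minimal BO loop over a finite molecular candidate set (illustrative sketch).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 8))                  # descriptor vectors for all candidates
f_true = lambda X: -np.sum((X - 0.5) ** 2, axis=1)  # hypothetical black-box property

idx = list(rng.choice(len(X_pool), size=10, replace=False))  # initial design
y = f_true(X_pool[idx])

for _ in range(20):                                 # evaluation budget
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_pool[idx], y)
    mu, sd = gp.predict(X_pool, return_std=True)
    best = y.max()
    z = (mu - best) / np.clip(sd, 1e-9, None)
    ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)  # expected improvement
    ei[idx] = -np.inf                               # never re-query tested molecules
    nxt = int(np.argmax(ei))                        # eqn (3) over the discrete pool
    idx.append(nxt)
    y = np.append(y, f_true(X_pool[[nxt]]))

print("best value found:", y.max())
```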
One natural representation encodes a molecule as a graph $G = (V, E)$, where atoms correspond to nodes and bonds to edges. Graph encodings are highly expressive, capturing both topological connectivity and relational information. Atom- and bond-level attributes (e.g., element, formal charge, hybridization, bond order, or aromaticity) are typically stored as node and edge labels or attribute tensors.45,46 Despite their expressivity, graph representations pose two challenges: (i) invariance to node permutations demands permutation-equivariant models (e.g., graph kernels or graph neural networks) and (ii) many graph similarity measures scale super-linearly with molecular size, leading to non-trivial computational cost. Moreover, a single molecule can admit multiple valid graph encodings (e.g., due to resonance or tautomerism), complicating featurization and downstream optimization.
More recent work has proposed expressive, theory-driven descriptor families that aim to overcome some of these limitations. Tensor algebra-based methods use multilinear forms and spatial (dis)similarity matrices to encode higher-order relationships among atoms, capturing geometric and relational structure beyond pairwise interactions.54–56 Graph derivative descriptors augment molecular graphs with experimentally derived node and/or edge attributes (e.g., NMR shifts, bond energies), producing chemically grounded encodings that better reflect molecular structure.57,58 Information-theoretic descriptors treat molecules as symbolic sequences and use entropy-based measures to quantify bonding patterns and structural complexity, often without requiring 3D coordinates.59,60
Each representation ϕ(m) enables the definition of a corresponding kernel k(ϕ(m), ϕ(m′)) that can be directly integrated into the GP modeling paradigm. Graphs require specially designed graph kernels; strings typically use substring-based similarity; fingerprints are often used with a Tanimoto kernel; and descriptors typically rely on standard continuous kernels like those from the Matérn class. The effectiveness of these kernels is tightly coupled to how well the underlying representation captures the key structural or physicochemical relationships relevant to the property of interest. In particular, descriptor-based representations, while chemically rich, can be highly redundant and heterogeneous. As a result, it is often difficult to define a globally smooth/stationary kernel over the full descriptor space. This motivates the need for adaptive strategies that can selectively identify and refine task-relevant subspaces as more information becomes available – a direction we pursue in this work.
MolDAIS, presented formally in Section 3, is agnostic to the specific choice of descriptors and can be readily applied to either classical topological descriptors or newer theory-guided families. Although we focus on traditional descriptors in this study, the modularity of our framework enables future extensions that incorporate more expressive or domain-specific encodings to further enhance model performance and interpretability.
However, generative models present several limitations in practice. First, they are typically trained in an unsupervised fashion using loss functions that prioritize reconstruction accuracy over task relevance. Thus, the resulting embeddings may not exhibit smoothness with respect to the target property F, limiting their utility for optimization. Moreover, the encoder-decoder architecture is usually trained offline and fixed during optimization, preventing the representation from adapting to newly acquired labeled data. Second, latent spaces are often still high-dimensional, posing challenges for BO due to the curse of dimensionality: surrogate models become harder to train, uncertainty estimates degrade, and acquisition functions become harder to optimize. While recent approaches attempt to fine-tune the generative model with property data and restrict optimization to local regions of the latent space via trust-region BO,33 they typically require large numbers of evaluations, limiting their effectiveness in low-data regimes.
A related class of encoder-only models, such as DKL,62 offers an alternative by removing the decoder. In DKL, a neural network ϕθ(m) maps molecules to a continuous embedding space used as input to a GP. This allows joint learning of the representation and surrogate model. While DKL avoids the reconstruction problem, it introduces many trainable parameters and is prone to overfitting in low-data settings. Its performance is also sensitive to architectural choices and hyperparameters, which must be carefully tuned.63
In contrast to these data-driven approaches, our work returns to chemically grounded molecular descriptors, which are specifically designed to encode structural and physicochemical information. Although these descriptors often correlate with important molecular properties, their use in BO has been limited by the high dimensionality of the descriptor space and the difficulty of identifying which features are most relevant. We propose an alternative approach: rather than learning a full embedding from scratch, we adaptively identify low-dimensional, task-relevant subspaces from a precomputed descriptor library. By updating this subspace as new data is acquired, our method supports interpretable, sample-efficient optimization in low-data regimes. In this way, we aim to bridge the gap between data-driven representation learning and domain-informed molecular design.
MolDAIS begins by mapping each molecule $m \in \mathcal{M}$ to a continuous feature space through a descriptor function $\phi : \mathcal{M} \to \mathbb{R}^D$. Assuming that this mapping is injective, i.e., every molecule has a unique descriptor vector, we can equivalently pose the MPO problem as an optimization over the descriptor space:

$$x^\star \in \arg\max_{x \in \phi(\mathcal{M})} f(x), \quad \text{with } f(\phi(m)) = F(m), \qquad (4)$$

where $\phi(\mathcal{M}) = \{\phi(m) : m \in \mathcal{M}\}$. In this work, we primarily use the Mordred descriptor library, which yields feature vectors with D ≈ 2000; this space can be even larger if other descriptor libraries are considered and/or combined with the Mordred library.
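As an illustration of this featurization step, the sketch below builds a Mordred descriptor matrix from SMILES strings; the toy molecules and cleaning steps are our own assumptions, while the `Calculator` and `pandas` calls follow the Mordred package's documented API.

```python
# Build a descriptor matrix X from SMILES using the Mordred library (sketch).
# Requires: pip install rdkit mordred
import pandas as pd
from rdkit import Chem
from mordred import Calculator, descriptors

smiles = ["CCO", "c1ccccc1O", "CC(=O)O"]        # toy molecules
mols = [Chem.MolFromSmiles(s) for s in smiles]

calc = Calculator(descriptors, ignore_3D=True)  # all 2D Mordred descriptors
df = calc.pandas(mols)                          # one row per molecule

df = df.apply(pd.to_numeric, errors="coerce")   # Mordred error objects -> NaN
df = df.dropna(axis=1)                          # drop descriptors that failed
df = df.loc[:, df.std() > 0]                    # drop constant (uninformative) columns
X = df.to_numpy(dtype=float)                    # (n_molecules, D) input for the GP
print(X.shape)
```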
Given this continuous vector representation, we can define a kernel over the descriptor space to use in the standard GP modeling approach summarized in Section 2.1. We adopt the Matérn 5/2 kernel due to its balance of expressiveness and smoothness:
$$k_\psi(x, x') = \sigma_f^2 \left(1 + \sqrt{5}\, r + \frac{5}{3} r^2\right) \exp\!\left(-\sqrt{5}\, r\right), \qquad (5)$$

$$r^2 = \sum_{i=1}^{D} \rho_i \,(x_i - x_i')^2, \qquad (6)$$

where $\sigma_f^2$ is the signal variance, $\rho_i \geq 0$ is the inverse (squared) lengthscale for input dimension i, and $\psi = \{\sigma_f^2, \rho_1, \ldots, \rho_D\}$ collects the kernel hyperparameters. These hyperparameters are typically estimated by maximizing the log marginal likelihood (LML) of the GP given the current data $\mathcal{D}_n$.41 While LML-based training allows the model to adapt as data accumulates, it is also sensitive to noise and overfitting, especially in high-dimensional settings where the number of hyperparameters grows linearly with D. In effect, standard GPs implicitly assume that all input features could be relevant, which yields an overly broad hypothesis space and can degrade performance when limited amounts of data are available.
A natural way to regularize this space is to assume that only a sparse subset of the descriptor dimensions are relevant for the property of interest. These relevant dimensions may differ across properties, suggesting an adaptive approach to subspace identification. Following the SAASBO framework proposed by Eriksson and Jankowiak,40 we encode this preference using the sparse axis-aligned subspace (SAAS) prior, which imposes a strong inductive bias toward sparse solutions by placing a half-Cauchy prior over each inverse lengthscale ρi (concentrating its mass near zero). Dimensions are only allowed to “escape” this prior (i.e., take on larger values) if the data provides sufficient evidence that they are relevant. As more data is collected, more dimensions can be activated, leading to increasingly rich surrogate models. This mechanism provides an elegant Bayesian formulation for discovering low-dimensional subspaces in a fully data-driven way.
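Concretely, the hierarchical prior from the SAASBO construction of Eriksson and Jankowiak40 can be written as

$$\tau \sim \mathcal{HC}(\beta), \qquad \rho_i \mid \tau \sim \mathcal{HC}(\tau), \quad i = 1, \ldots, D,$$

where $\mathcal{HC}$ denotes the half-Cauchy distribution and β is a small global scale (0.1 in the original SAASBO work), so that all inverse lengthscales concentrate near zero unless the data pulls them away.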
As in SAASBO, we infer a posterior over the kernel hyperparameters {ψl}Ll=1 using Hamiltonian Monte Carlo (HMC). These samples define a marginal predictive distribution over the target property and, crucially, capture uncertainty over which descriptor dimensions are relevant by exploring different activated subspaces across samples. This uncertainty can be accounted for in the acquisition function, as described in Section 3.3. Fig. 2 provides a high-level overview of the surrogate modeling workflow underlying MolDAIS, which enables flexible and efficient learning of task-relevant descriptors for the property of interest.
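A minimal sketch of this inference step using BoTorch's fully Bayesian SAAS model and its NUTS fitting routine, with the HMC settings reported in Section 4 (512 warmup steps, thinning of 16, 256 samples); the training data below are random placeholders.

```python
# Fit a GP with a SAAS prior via NUTS (sketch using BoTorch).
import torch
from botorch.models.fully_bayesian import SaasFullyBayesianSingleTaskGP
from botorch.fit import fit_fully_bayesian_model_nuts

train_X = torch.rand(20, 50, dtype=torch.double)   # 20 molecules, 50 descriptors
train_Y = torch.randn(20, 1, dtype=torch.double)   # measured property values

model = SaasFullyBayesianSingleTaskGP(train_X, train_Y)
fit_fully_bayesian_model_nuts(
    model, warmup_steps=512, num_samples=256, thinning=16, disable_progbar=True
)

# Predictions marginalize over the HMC samples of the hyperparameters,
# and hence over the candidate active subspaces they encode.
with torch.no_grad():
    posterior = model.posterior(torch.rand(5, 50, dtype=torch.double))
    print(posterior.mixture_mean.shape)            # torch.Size([5, 1])
```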
MI quantifies the amount of information shared between a single feature and the target variable, offering a model-agnostic way to identify relevant descriptors. In our implementation, we compute MI scores using the mutual_info_regression function from scikit-learn,68 which estimates mutual information via k-nearest neighbors. This estimator is non-parametric and well-suited for continuous variables. MIC, on the other hand, is designed to capture a broader range of functional relationships (including linear and nonlinear associations) by searching over grids in the joint space of two variables and identifying the grid that maximizes a normalized MI estimate. The resulting score ranges from 0 to 1, with higher values indicating stronger associations. MIC is particularly robust in settings where dependencies may be complex, heterogeneous, or obscured by noise. We compute MIC values using the minepy (ref. 69) Python package.
Both MI and MIC can be interpreted as imposing a hard prior over the kernel hyperparameters: only the top-ranked features are retained, while all others are effectively discarded by setting their inverse lengthscales to zero. The resulting GP model is then trained by maximizing the LML over this reduced subspace. While these screening-based methods are less expressive than the fully Bayesian SAAS approach, they offer a favorable balance between computational efficiency and predictive performance. We find that even these simple variants yield substantial improvements over standard GP models trained on the full descriptor space, particularly in low-data regimes where overfitting is a concern. Conceptually, this strategy can be viewed as an approximate counterpart to the adaptive subspace learning step shown in Fig. 2D, with the learned feature relevance distribution replaced by a deterministic feature selection procedure.
In our experiments, we default to selecting the top K = 5 features. This choice was guided by preliminary SAAS runs on two property prediction tasks, which consistently identified five or fewer highly relevant features. These supporting results are included in the SI. We then further tested this setting on a third benchmark (FreeSolv; see Fig. 4) and found that it maintained strong predictive performance. This gave us confidence in using a consistent K across all benchmarks, avoiding the need for additional tuning. Our choice of a small K also aligns with prior quantitative structure–activity relationship (QSAR) modeling literature that emphasizes sparsity and interpretability.70,71 While adaptive schemes that grow K with data or optimize it dynamically could further improve performance, we leave these directions to future work.
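A minimal sketch of the MI screening variant using scikit-learn's k-nearest-neighbor MI estimator mentioned above; the descriptor matrix and toy property are synthetic placeholders.

```python
# MI-based descriptor screening (sketch): keep the top-K = 5 features by
# estimated mutual information with the property, then fit a GP on that subspace.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2000))                  # descriptor matrix (n x D)
y = X[:, 3] ** 2 + 0.5 * X[:, 17] + 0.1 * rng.normal(size=40)  # toy property

K = 5
mi = mutual_info_regression(X, y, n_neighbors=3, random_state=0)
top_k = np.argsort(mi)[-K:]                      # indices of the K most informative features
X_sub = X[:, top_k]                              # reduced input for the GP surrogate
print("selected descriptor indices:", sorted(top_k.tolist()))
```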
Given the HMC samples $\{\psi_l\}_{l=1}^{L}$ of the kernel hyperparameters from Section 3.2, the acquisition function is marginalized over the sampled surrogate models:

$$\alpha_n(x) = \frac{1}{L} \sum_{l=1}^{L} \alpha(x; \mathcal{D}_n, \psi_l). \qquad (7)$$
In this work, we focus on the expected improvement (EI) acquisition function due to its widespread empirical success and simple analytical form,42 which allows for efficient gradient-based optimization. All acquisition optimization routines are implemented using the BoTorch (ref. 72) Python library. An advantage of this is that we can easily consider “batch acquisition functions”, where a collection of B points $X_{n+1} = \{x_{n+1}^{(1)}, \ldots, x_{n+1}^{(B)}\}$ are jointly selected at each iteration. This is highly relevant in experimental molecular discovery settings where it is common to have parallel evaluation capabilities. For example, in high-throughput chemical screening, one might evaluate a full batch of molecules in a single round using microtiter plates or multiplexed assays. Jointly optimizing a batch acquisition function allows the BO algorithm to account for redundancy and diversity in the selected molecules, significantly reducing the time-to-discovery in practical applications.
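As an illustration of batch selection over a finite library, the sketch below pairs BoTorch's q-EI with `optimize_acqf_discrete`; the data and batch size B = 4 are placeholders, and in MolDAIS the surrogate would be the screened or SAAS GP described above.

```python
# Batch (q-)EI over a finite candidate library (sketch with BoTorch).
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import qExpectedImprovement
from botorch.optim import optimize_acqf_discrete
from gpytorch.mlls import ExactMarginalLogLikelihood

train_X = torch.rand(10, 5, dtype=torch.double)
train_Y = torch.randn(10, 1, dtype=torch.double)
choices = torch.rand(1000, 5, dtype=torch.double)   # untested candidate descriptors

model = SingleTaskGP(train_X, train_Y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

acqf = qExpectedImprovement(model, best_f=train_Y.max())
batch, _ = optimize_acqf_discrete(acqf, q=4, choices=choices)  # pick 4 molecules jointly
print(batch.shape)                                  # torch.Size([4, 5])
```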
In the constrained setting, the goal is typically to maximize a desired performance metric (e.g., activity or selectivity) subject to one or more feasibility constraints (e.g., toxicity, synthetic accessibility, or stability). Modern constrained BO algorithms can broadly be categorized into two families: (i) safe BO methods, which aim to ensure feasibility throughout the optimization process73–75 and (ii) asymptotically feasible methods, which permit early constraint violations but require that the final solution satisfy all constraints.76,77 Our surrogate modeling approach is compatible with both formulations. However, due to the strength of the (approximate) SAAS prior, uncertainty estimates may be less reliable in early iterations, making asymptotically feasible methods generally more practical in low-data regimes.
Similarly, many real-world applications involve trade-offs between competing objectives. For instance, in the context of designing new battery materials, one would typically want to maximize energy density and minimize degradation (or similarly maximize cycle life). In such cases, the goal is not to identify a single optimal molecule, but rather to characterize the Pareto frontier, i.e., the set of non-dominated solutions for which no objective can be improved without degrading another. Our framework integrates naturally with multi-objective BO algorithms, including those based on hypervolume improvement, which quantify the expected gain in dominated volume in objective space.
It is worth noting that a key advantage of our proposed approach in both constrained and multi-objective settings is that each property or constraint can be modeled independently in different subspaces. As a result, the subspace learned for each target can adapt to the most relevant molecular features for that specific property. This flexibility is particularly useful when optimizing for multiple, physically distinct phenomena – enabling the surrogate model to allocate modeling capacity where it is most needed.
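To illustrate, the sketch below pairs independent single-task GPs (one per property, each of which could operate in its own selected subspace) with BoTorch's analytic EHVI acquisition; the data, dimensions, and reference point are placeholders.

```python
# Multi-objective BO sketch: independent GP per property, EHVI acquisition.
import torch
from botorch.models import SingleTaskGP, ModelListGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition.multi_objective import ExpectedHypervolumeImprovement
from botorch.utils.multi_objective.box_decompositions.non_dominated import (
    FastNondominatedPartitioning,
)
from gpytorch.mlls import SumMarginalLogLikelihood

train_X = torch.rand(15, 8, dtype=torch.double)
Y = torch.randn(15, 2, dtype=torch.double)          # two properties to maximize

# One surrogate per objective; in MolDAIS each model can live in its own
# property-specific descriptor subspace.
model = ModelListGP(*[SingleTaskGP(train_X, Y[:, i : i + 1]) for i in range(2)])
fit_gpytorch_mll(SumMarginalLogLikelihood(model.likelihood, model))

ref_point = Y.min(dim=0).values - 0.1               # reference point for hypervolume
partitioning = FastNondominatedPartitioning(ref_point=ref_point, Y=Y)
acqf = ExpectedHypervolumeImprovement(
    model=model, ref_point=ref_point.tolist(), partitioning=partitioning
)
print(acqf(torch.rand(1, 1, 8, dtype=torch.double)))  # EHVI of one candidate
```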
We evaluate surrogate model quality on four benchmark datasets: Lipophilicity (4200 molecules; octanol–water distribution coefficient log D at pH 7.4), ESOL (1128 molecules; aqueous solubility), FreeSolv (642 molecules; hydration free energy), and Photoswitch (392 molecules; π → π* transition wavelength). For each dataset, we perform ten random 5/95 train–test splits to simulate the data-scarce conditions typical of early-stage molecular discovery and optimization.
We compare MolDAIS to one representative baseline from each major molecular representation class described in Section 2.2 (descriptors, fingerprints, strings, and graphs), selected based on their performance in the GAUCHE benchmark study. Two ablation variants of MolDAIS are included to assess the effect of adaptive subspace identification and unsupervised dimensionality reduction. All models are trained as GPs using the BoTorch library.72 For all baselines, kernel hyperparameters are optimized via L-BFGS-B.79 MolDAIS is trained via HMC with the No-U-Turn Sampler (NUTS),80 using default BoTorch settings (512 warmup steps, thinning of 16, and 256 samples).
The six modeling approaches evaluated are:
• MolDAIS: a fully Bayesian GP with a Matérn 5/2 kernel and a SAAS prior over inverse lengthscales, allowing for adaptive identification of a task-relevant subspace of Mordred descriptors.
• MD-Mat: standard GP with a Matérn 5/2 kernel over the full Mordred molecular descriptor space.
• MD-Mat-PCA: standard GP with Matérn 5/2 kernel applied to a PCA-reduced embedding of the descriptor space (retaining 99% of variance); represents a fixed, unsupervised dimensionality reduction baseline.
• FP–TM: a fingerprint-based GP using molecular fragprints50 with a Tanimoto kernel. Fragprints are hybrid representations that concatenate ECFPs with fragment count vectors, capturing both local atom-centered neighborhoods and global structural features (e.g., presence of ring systems).
• SMILES-Str: SMILES-based GP using a bag-of-characters string kernel based on Tanimoto similarity.81
• Graph-WL: graph-based GP using the Weisfeiler–Lehman kernel,82 which encodes molecular similarity via subtree pattern counts.
Fig. 3 shows the distribution of test RMSE values across splits for each model–dataset combination. MolDAIS consistently achieves the lowest median RMSE and narrowest spread, demonstrating its robustness in low-data regimes. While FP–TM and SMILES-Str perform well on select tasks, no baseline offers consistent performance across all problems. In contrast, MolDAIS adapts to each task by identifying a property-specific subspace, enabling broad generalization.
In sparse-data settings, accurate uncertainty estimates are as important as predictive accuracy for driving effective acquisition in BO. We evaluate uncertainty quantification (UQ) performance using two metrics introduced by Rasmussen et al.:83 (i) the R2 correlation between predicted root-mean variance (RMV) and observed prediction error (RMSE), and (ii) the miscalibration area, which quantifies deviation between empirical and nominal confidence intervals. Ideal models should achieve R2 near 1 and minimal miscalibration area. As shown in Table 1, MolDAIS consistently achieves high RMSE-RMV correlation and low miscalibration area across all datasets. Descriptor-based baselines (MD-Mat, MD-Mat-PCA) consistently underperform, suggesting that naive use of high-dimensional features (without sparsity or property-based adaptation) leads to poorly calibrated models. FP–TM performs well on the lipophilicity problem, but exhibits degraded calibration on other tasks, indicating fragility across domains. Although Graph-WL yields high R2 on ESOL, its calibration error remains relatively large, and it ranks lowest in RMSE across all four datasets (Fig. 3). Overall, these results suggest that MolDAIS offers more reliable uncertainty estimates, which are critical for sample-efficient optimization.
| Method | RMSE vs. RMV R2 | | | | Miscalibration area | | | |
| | ESOL | FreeSolv | Photoswitch | Lipophilicity | ESOL | FreeSolv | Photoswitch | Lipophilicity |
|---|---|---|---|---|---|---|---|---|
| MolDAIS | 0.61 ± 0.20 | 0.70 ± 0.25 | 0.64 ± 0.10 | 0.79 ± 0.08 | 0.12 ± 0.05 | 0.07 ± 0.04 | 0.09 ± 0.05 | 0.17 ± 0.07 |
| MD-Mat | 0.09 ± 0.09 | 0.02 ± 0.03 | 0.21 ± 0.19 | 0.04 ± 0.04 | 0.29 ± 0.00 | 0.36 ± 0.00 | 0.50 ± 0.00 | 0.18 ± 0.01 |
| MD-Mat-PCA | 0.08 ± 0.07 | 0.03 ± 0.02 | 0.09 ± 0.10 | 0.04 ± 0.04 | 0.29 ± 0.00 | 0.36 ± 0.00 | 0.50 ± 0.00 | 0.18 ± 0.01 |
| SMILES-Str | 0.48 ± 0.16 | 0.49 ± 0.11 | 0.57 ± 0.17 | 0.67 ± 0.10 | 0.19 ± 0.02 | 0.36 ± 0.02 | 0.49 ± 0.01 | 0.14 ± 0.03 |
| FP–TM | 0.37 ± 0.23 | 0.31 ± 0.08 | 0.43 ± 0.15 | 0.85 ± 0.05 | 0.17 ± 0.03 | 0.29 ± 0.02 | 0.49 ± 0.01 | 0.05 ± 0.01 |
| Graph-WL | 0.96 ± 0.01 | 0.05 ± 0.05 | 0.51 ± 0.01 | 0.65 ± 0.02 | 0.24 ± 0.06 | 0.11 ± 0.14 | 0.19 ± 0.10 | 0.07 ± 0.02 |
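For completeness, here is a sketch of how the miscalibration area can be computed from predictive means and standard deviations; it follows the standard calibration-curve construction (empirical vs. nominal coverage of central Gaussian intervals) and is our reading of the metric rather than code from the paper.

```python
# Miscalibration area (sketch): average gap between empirical and nominal
# coverage of central Gaussian prediction intervals on a uniform grid.
import numpy as np
from scipy.stats import norm

def miscalibration_area(y_true, mu, sigma, n_levels=99):
    levels = np.linspace(0.01, 0.99, n_levels)         # nominal coverages
    half_widths = norm.ppf(0.5 + levels / 2.0)         # z-value per level
    errors = np.abs(y_true - mu) / sigma               # standardized residuals
    empirical = (errors[:, None] <= half_widths[None, :]).mean(axis=0)
    return float(np.mean(np.abs(empirical - levels)))  # approximate area

rng = np.random.default_rng(0)
y = rng.normal(size=500)                               # perfectly calibrated toy case
print(round(miscalibration_area(y, np.zeros(500), np.ones(500)), 3))  # close to 0
```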
To further highlight the benefits of adaptive subspace learning, we examine how MolDAIS refines its descriptor space as more training data becomes available. Fig. 4 shows the distribution of median inverse lengthscales across Mordred descriptors at different training sizes (10, 30, 60, 90 samples) on the FreeSolv dataset. These inverse lengthscales serve as feature importance scores: higher values indicate greater relevance. As training size increases, the distribution progressively sharpens, revealing clear separation between relevant and irrelevant descriptors. At 30 samples, a few features begin to emerge as informative; by 60 samples, a distinct sparse subspace is apparent. Interestingly, even between 60 and 90 samples, the model continues to adjust feature relevance, illustrating MolDAIS's flexibility to update its hypotheses as new data is collected. This refinement corresponds to substantial improvements in predictive performance (R2 increases from 0.05 at 10 samples to 0.84 at 90), and is a key indicator of potential strong performance when included in iterative optimization pipelines such as BO.
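Continuing the SAAS sketch from Section 3, the per-descriptor relevance shown in Fig. 4 can be read off a fitted BoTorch model via the median lengthscale across HMC samples; `model` refers to the fitted `SaasFullyBayesianSingleTaskGP` from that earlier sketch.

```python
# Descriptor relevance from a fitted SAAS GP (sketch; assumes `model` is the
# fitted SaasFullyBayesianSingleTaskGP from the earlier sketch).
import torch

median_ls = model.median_lengthscale.flatten()  # (D,) median lengthscale over HMC samples
importance = 1.0 / median_ls                    # inverse lengthscale = relevance score
top = torch.topk(importance, k=5)
print("top-5 descriptor indices:", top.indices.tolist())
print("relevance scores:", [round(v, 3) for v in top.values.tolist()])
```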
• Penalized log P:30,31 a standard benchmark in the MPO literature, consisting of 250 000 molecules randomly selected from the ZINC database. The objective is a penalized octanol–water partition coefficient (log P), adjusted for synthetic accessibility and ring size, simulating a drug-likeness optimization problem.
• Power Conversion Efficiency (PCE):85 contains 29 978 organic photovoltaic molecules with computed PCE values, representing the efficiency of solar-to-electric energy conversion. This dataset serves as a proxy for high-throughput materials discovery in renewable energy applications.
• Antimalarial activity (EC50):86 comprises 9998 organic molecules with measured half-maximal effective concentration (EC50) values against a sulfide-resistant strain of Plasmodium falciparum. The EC50 value reflects the concentration required to inhibit parasite growth by 50%.
• Solvation Free Energy (ΔGsolv) & Redox Potential (E0):87 contains over 103 000 quinone derivatives with the two properties of interest (ΔGsolv and E0) being computed with density functional theory. These properties are relevant to energy storage and redox chemistry, and we use them to explore single- and multi-objective optimization scenarios.
We conduct a series of controlled optimization experiments using these datasets. Note that, since LADDER requires a pre-trained generative model, which in turn demands a very large molecular search space, we only apply it to the penalized log P benchmark, for which an established, accurate generator is available. To ensure a fair comparison, we closely followed the official LADDER implementation and reproduced the protocol described in its original publication. This includes using the same pre-trained JT-VAE model and kernel configuration/hyperparameter tuning strategy. The only modification was to the initial batch of molecules, which we standardized across all baselines in our study for reproducibility. All methods follow the same procedure and are initialized with the same settings. Each optimization run begins with 10 randomly selected molecules from the search space, and is given a budget of 90 additional evaluations, resulting in a total of 100 tested molecules per run. We use the expected improvement (EI) acquisition function for all methods except the random baseline, which selects molecules uniformly at random.
To account for variability in the initial data, each experiment is repeated across 20 independent trials, each with a different random seed. All methods share the same set of initializations to ensure comparability. Results are reported in Fig. 5. For each method, we report two performance metrics. The top row of each panel shows the mean evolution of the best-found objective value over the course of the optimization, averaged across the 20 replicates. Shaded regions indicate 95% confidence intervals. This metric captures how quickly each method identifies high-performing molecules. The bottom row of each panel shows the distribution of final quantile scores achieved by each method. For each replicate, we compute the quantile of the best-found molecule relative to the full search space. A quantile score of q ∈ [0, 1] indicates that the identified molecule outperforms q × 100% of all possible candidates in the search space. We visualize the full distribution of quantile scores across replicates using violin plots, which highlight both median performance and variability across runs.
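For clarity, a small sketch of the quantile-score computation described above; the library values are synthetic placeholders.

```python
# Quantile score (sketch): fraction of the full library that the best-found
# molecule matches or outperforms.
import numpy as np

def quantile_score(best_found, all_values):
    all_values = np.asarray(all_values)
    return float(np.mean(all_values <= best_found))

library = np.random.default_rng(0).normal(size=100_000)  # property values of all candidates
best_found = np.sort(library)[-5]                         # suppose BO found the 5th-best molecule
print(quantile_score(best_found, library))                # 0.99996
```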
Across all case studies, MolDAIS demonstrates consistently strong sample efficiency. The fully Bayesian variant achieves the best overall performance on most tasks, with the approximate versions (MolDAIS-MI and MolDAIS-MIC) performing competitively and often closely tracking the SAAS version. Notably, MolDAIS-MI slightly outperforms MolDAIS-MIC in most cases, suggesting that mutual information may be a slightly more effective metric for subspace selection in this context. In contrast, both MD-Mat and Random underperform consistently, highlighting the difficulty of modeling and optimizing in high-dimensional descriptor spaces without subspace regularization.
The alternative baselines MD-Mat-PCA and FP–TM perform reasonably on select tasks (e.g., MD-Mat-PCA for ΔGsolv in Fig. 5c and FP–TM for EC50 in Fig. 5e), but show inconsistent behavior across datasets and tend to plateau earlier or exhibit high variance. These trends are further illustrated in the quantile scores, where MolDAIS consistently yields narrow, high-scoring values across all replicates, whereas other methods often exhibit greater spread or even failure cases. For instance, FP–TM identifies the global optimum in all EC50 trials, but fails to do so in PCE – occasionally even underperforming compared to naive random search. Such volatility highlights the importance of adapting the representation to the property being optimized. By automatically adapting the “active” descriptor subspace to specific tasks, MolDAIS enables more reliable optimization performance across a diverse range of properties.
A closer look at the penalized log P task (Fig. 5a) further illustrates this point. MolDAIS achieves the highest (best-found) molecular property values and is the only method that successfully identifies the globally optimal molecule in all 20 replicates, doing so after querying less than 0.04% of the candidate search space. While MolDAIS-MI converges more slowly, it consistently reaches top-performing candidates, as evidenced by the high quantile score with minimal spread. In contrast, LADDER is unable to locate the global optimum in any replicate and underperforms all three MolDAIS variants, both in terms of convergence speed and final quantile scores. This further demonstrates the value of MolDAIS's adaptive, descriptor-focused modeling approach in delivering accuracy and robustness across molecular optimization tasks.
We fix the feature selector to the MolDAIS-MI variant (with K = 5) and evaluate each descriptor library using a training set of 40 molecules and a test set of 100 held-out molecules. Fig. 6 shows the correlation matrix among the selected features across the three descriptor libraries. Descriptions of the interpretable descriptors are also provided in Table 2. Substantial alignment is observed between Mordred and PaDEL: three features (GATS1p, AATSC2i, and GATS1i) are shared across both libraries and correspond to autocorrelation measures of polarizability and ionization potential. These features are consistent with the established physics of solvation (that depends on charge distribution and polarizability). The remaining two descriptors in each library also show strong correlations; AETA_beta_s and ETA_BetaP_s are perfectly correlated (R2 = 1.0), and AATS2s and PNSA-1 are moderately correlated (R2 = 0.80), indicating convergence on related physical information. In contrast, the top ChemBERTa features are largely uncorrelated with the interpretable descriptors.
Fig. 6 Pairwise correlations among the five descriptors selected by MolDAIS-MI (K = 5) for ΔGsolv across three descriptor libraries. Mordred and PaDEL yield highly overlapping and strongly correlated features, while ChemBERTa embeddings (denoted X1 to X5) are mostly uncorrelated with those from physics-based descriptors. Definitions for Mordred and PaDEL features are provided in Table 2.
| Descriptor | Familya | Formal description | Physicochemical meaning & link to solvation | Package |
|---|---|---|---|---|
| a GATS = Geary autocorrelation; ATS = averaged topological autocorrelation; ETA = extended topochemical atom; CPSA = charged partial surface area. | | | | |
| GATS1p | GATS | Geary autocorrelation of atomic polarizability at topological lag 1 | Measures polarizability contrast between bonded atoms; higher values indicate uneven electronic softness, which enhances induced dipole–solvent dispersion interactions | Mordred, PaDEL |
| GATS1i | GATS | Geary autocorrelation of atomic ionization potential at topological lag 1 | Captures bond-level polarity due to ionization potential mismatch; reflects spatial heterogeneity in local electronegativity that governs dipole-solvent and hydrogen-bond interactions | Mordred, PaDEL |
| AATSC2i | ATS | Centered Moreau–Broto autocorrelation of atomic ionization potential at lag 2, averaged per atom | Quantifies how ionization potential trends persist over two-bond paths; indicates delocalized electron-withdrawing effects, which shape solvation shell structure and stabilization | Mordred, PaDEL |
| AETA_beta_s | ETA | Averaged extended topochemical atom index βs (sigma-electron contribution) | Encodes density of electronegative atoms (e.g., O, N) within the sigma-bond network; higher values indicate greater potential for electrostatic and hydrogen-bond interactions with polar solvents | Mordred |
| ETA_BetaP_s | ETA | Normalized extended topochemical beta index βs (relative to molecular size) | Reflects concentration of polar atoms per unit size; higher values signal high polarity density, favoring compact, strongly interacting solvation shells | PaDEL |
| AATS2s | ATS | Averaged Moreau–Broto autocorrelation at lag 2 weighted by intrinsic state (electrotopological index) | Measures how electronic influence from polar atoms propagates over 2-bond paths; higher values indicate inductive effects that extend solute polarity, enhancing polarization-based solvation | Mordred |
| PNSA-1 | CPSA | Sum of solvent-accessible surface area over atoms with negative partial charge | Quantifies hydrogen-bond acceptor surface area; larger values support stronger solute–solvent electrostatic stabilization via polar interactions with water | PaDEL |
We find that predictive performance mirrors this pattern; Fig. 7 shows parity plots of the predicted versus true ΔGsolv values. The model using Mordred descriptors achieves the best performance (R2 = 0.518, miscalibration area = 0.069), followed closely by PaDEL (R2 = 0.447, miscalibration = 0.074). ChemBERTa performs significantly worse (R2 = 0.203, miscalibration = 0.086), highlighting the advantage of descriptors that are not only informative but also grounded in physical intuition. Mordred's superior performance likely reflects its inclusion of detailed topological and electronic descriptors that are relevant for capturing solute–solvent interactions, particularly for redox-active molecules. It is worth noting that similar trends were found for the PCE case study (provided in the SI), though the nature of the selected descriptors differs greatly due to the distinct physical mechanisms involved.
Table 2 provides a detailed overview of the selected descriptors, including their formal definitions, physical interpretation, and relevance to solvation behavior. All features shared between Mordred and PaDEL encode polarizability and ionization potential trends, which are two physicochemical quantities known to impact solvation behavior by influencing solute–solvent electrostatic and dispersion interactions.89–91 Together, these results illustrate how the descriptor library choice has an important and meaningful impact on the transparency and scientific value/interpretability of MolDAIS.
However, identifying suitable OEMs remains challenging due to the need to balance multiple competing criteria, such as energy density and long-term cycling stability.19 Here, we focus on two molecular properties that serve as proxies for these targets: redox potential (E0) (which relates to achievable voltage) and solvation free energy (ΔGsolv) (which governs stability and ion transport in aqueous environments). The goal is to identify molecules that jointly maximize E0 while minimizing ΔGsolv, corresponding to maximization of the Pareto frontier in this two-objective space.
We adopt a multi-objective BO framework using the expected hypervolume improvement (EHVI) acquisition function99 to sequentially select molecules. As before, we use the 103 000-molecule quinone dataset with DFT-computed property values.87 Each algorithm is initialized with 10 randomly selected molecules and allowed 90 additional evaluations. We compare three methods: MolDAIS-MIC, FP–TM, and Random, and repeat each trial 20 times with independent random seeds. We report results only for MolDAIS-MIC among our variants, as it performs similarly to the original MolDAIS while having lower computational cost.
Fig. 8 summarizes the results. The top panel shows the average log hypervolume of the discovered Pareto front as a function of the number of iterations. MolDAIS-MIC achieves substantially higher hypervolume than FP–TM and Random, with consistently faster growth across replicates. The bottom panel visualizes the final sample distribution and learned Pareto fronts for the replicate with the median final hypervolume across the 20 trials. MolDAIS-MIC successfully discovers a wide range of diverse trade-off OEM designs that closely approximate the true Pareto frontier, while FP–TM and Random recover only a narrow band of suboptimal candidates. Notably, nearly 20 points on the MolDAIS-MIC front strictly dominate the FP–TM set, highlighting the benefits of learning property-specific subspaces that evolve as more property data is collected. Further inspection of the selected subspaces reveals minimal overlap between those learned for E0 and ΔGsolv, supporting the hypothesis that modeling each property in a tailored latent space is critical to navigating tradeoffs in complex molecular design problems.
Across a diverse suite of benchmark and real-world MPO tasks, including both modeling and optimization, MolDAIS consistently outperforms state-of-the-art methods based on molecular graphs, SMILES strings, and fingerprints. Importantly, it requires minimal hyperparameter tuning and avoids the need to train large-scale, highly parameterized models, making it a practical and accessible tool for experimental researchers without deep expertise in machine learning. While this work focused on single- and multi-objective BO, the modularity of MolDAIS makes it readily extensible to more complex design scenarios. For example, it can support multi-fidelity optimization, where data from simulation and experiment are combined, or novelty-driven discovery, where the goal is to explore underrepresented regions of chemical space or uncover diverse (atypical) molecular candidates.
There remain several interesting opportunities for extending the MolDAIS framework. In this work, we examined three sparsity-inducing strategies that assume axis-aligned relevance of individual descriptors. A natural next step is to account for correlated descriptor interactions – particularly in cases where properties vary along curved or entangled manifolds in the feature space. One potential approach would be to integrate nonlinear dimensionality reduction methods, such as kernel PCA, within the BO loop, while still applying sparsity-inducing priors in the resulting latent space. Alternatively, structured kernel priors that encode pairwise or hierarchical relationships between descriptors could help capture higher-order dependencies. In addition, while we focused on the Mordred descriptor library due to its broad applicability and ease of use, MolDAIS is fully compatible with any descriptor set. Incorporating richer, domain-specific descriptors (such as quantum-derived features for catalysis or electronic reactivity indices for redox chemistry) could further enhance performance and interpretability in specialized applications. In practice, such extensions may involve tailoring descriptor sets to specific tasks or materials domains, enabling domain-informed optimization without sacrificing generality. MolDAIS is also compatible with learned or parameterized descriptor sets, such as those generated by evolutionary approaches like AExOp-DCS,100 which could further enrich the feature space and thus enhance performance on specific MPO problems.
Finally, our results suggest a broader insight: even when the “right” descriptors are not known a priori, it is still possible to perform competitive MPO, provided that a large and diverse library of candidate features is available and appropriate regularization is applied. By learning task-relevant subspaces in a data-driven fashion, MolDAIS demonstrates how sparsity and adaptability can bridge the gap between generic descriptors and tailored predictive performance. A particularly interesting direction for future work is to systematically evaluate how well this framework generalizes across different core chemical scaffolds (an issue commonly studied with scaffold-based train/test splits in benchmarks such as MoleculeNet101), which is important for, e.g., “scaffold hopping” in drug discovery.102 Overall, we believe this work establishes a flexible and probabilistically grounded framework for low-data MPO, with clear pathways for future development and deployment in data-driven chemistry and materials science.
Optimization results for the penalized log P problem are available at https://doi.org/10.6084/m9.figshare.25506292.v1. Datasets for the PCE, log P, EC50, and Lipophilicity problems are available at https://doi.org/10.6084/m9.figshare.25506295.v1. Datasets for the Photoswitch, FreeSolv, and ESOL problems are available at https://doi.org/10.6084/m9.figshare.29983504. Datasets for the quinone redox potential and solvation free energy benchmarks are available at https://doi.org/10.6084/m9.figshare.29602217.v1.
Supplementary information is available that shows the molecular property distributions for the benchmark problems, additional details on the hyperparameter selection for the approximate SAAS variants, and further analysis on the PCE case study. See DOI: https://doi.org/10.1039/d5dd00188a.
This journal is © The Royal Society of Chemistry 2025