Open Access Article
Sumeet Vadhavkar
*a,
Pradeep Bajracharya
a,
Md Shakil Zamana,
Poornima Padmanabhan
*b and
Linwei Wang
a
aCollege of Computing and Information Sciences, Rochester, NY 14623, USA. E-mail: sv6234@rit.edu
bDepartment of Chemical Engineering, Rochester Institute of Technology, Rochester, NY 14623, USA. E-mail: poornima.padmanabhan@rit.edu; Tel: +1 585 475 4877
First published on 20th March 2026
Rapid and accurate prediction of the colloidal phase diagrams is necessary for controlling their structure and properties. Mapping these multi-dimensional phase diagrams with molecular dynamics (MD) is costly, especially when sharp, localized transition boundaries demand dense sampling to resolve the boundaries. We target this challenge by actively learning the boundary where the order parameter – the number of nearest neighbours, Nnn – undergoes an abrupt change. We describe colloids interacting via isotropic pair potentials with tunable range of interaction and aim to identify the boundary between the dilute and condensed phases. We model Nnn = f(x) with a Gaussian process (GP) and derive a novel gradient-aware acquisition strategy that prioritizes locations with large and/or uncertain spatial derivatives of the GP posterior. The gradient-aware strategy is based on the thermodynamic principle of large susceptibility, i.e., large derivative of the order parameter, at phase boundaries. A simple ellipse-based exclusion heuristic helps spread acquisitions along the boundary manifold. Across 1D and 2D planes of the phase diagram, the proposed strategy localizes phase boundaries accurately with significantly fewer MD runs than dense grid search and outperforms random acquisition or standard acquisition strategies unaware of the gradient of the GP. The framework can be readily extended to higher-dimensional space and support multi-fidelity MD simulations, providing a practical route to sample-efficient and physics-aligned discovery of colloidal phase diagrams.
Design, System, ApplicationColloidal particles self-assemble into a wide variety of equilibrium and non-equilibrium structures. Small changes in their interparticle interactions can affect the phase diagram significantly. In this work, we have developed a gradient-based active learning acquisition framework to efficiently identify the phase boundary between dilute and condensed phases. The system under investigation is comprised of colloidal particles at several volume fractions interacting with an isotropic potential of varying strength and interaction range. Even such a simple potential leads to a computationally expensive, three-dimensional phase diagram. We have applied the gradient-based active learning framework to quickly target the boundary in one-dimensional and two-dimensional slices of the phase diagram within a few dozen simulations. Short molecular dynamics simulations yield a number of nearest neighbors metric that changes abruptly at the phase boundary, making it suitable for gradient-based acquisition strategies. This rapid exploration will enable the rapid prediction of multi-dimensional phase diagrams of colloidal particles. |
Computational models, such as Molecular Dynamics (MD)6 simulations, are commonly employed to simulate colloidal particles under varying conditions (e.g., temperature, concentration) in order to construct phase diagrams. In this context, the parameter space comprises all relevant tunable inputs (such as temperature, concentration, and interaction strength) that dictate the system's phase behavior. By systematically probing this parameter space, researchers can identify phase boundaries, predict material properties, and significantly reduce the need for extensive physical experimentation.7 Traditionally, grid search is used, which exhaustively samples the chosen range of each parameter to provide high-resolution information to construct the phase diagram.8
Although often faster than physical experimentation, MD simulations can still be computationally expensive.9,10 This high cost becomes a significant limitation when employing grid search methods to explore phase diagrams. This is particularly true when dealing with high-dimensional parameter spaces, where the number of grid points required increases rapidly with the number of dimensions—a phenomenon known as the ‘curse of dimensionality.’11,12 Specifically, if a parameter space is divided into n segments per dimension, the total number of grid points for a d-dimensional space becomes nd. This rapid growth in the number of required simulations leads to a significant increase in computational cost, making grid search inefficient and even infeasible for large-scale problems. Moreover, phase diagrams are characterized by regions that are almost piecewise smooth with sharp transition boundaries between different phases.13,14 To capture these sharp boundaries of phase transition necessitate a high grid resolution,15 which exacerbates the curse of dimensionality while wasting significant redundant computations outside the region of interest (i.e., boundaries of phase transition). Finally, simple MD simulations are prone to incorrect predictions due to the lack of adequate sampling or convergence of results, which often necessitates additional, more computationally expensive methods such as the Gibbs ensemble and Gibbs–Duhem integration methods to accurately determine phase boundaries.16,17
While the aforementioned methods are well-suited for systems at equilibrium (stable state), they become invalid or insufficient in rapidly evolving, non-equilibrium environments. Hence, they may not be suitable for the increasingly important fields of gels, glasses, or active matter systems that are not at equilibrium.18,19 The dynamic nature of these non-equilibrium systems poses additional challenges that traditional grid-search methods may not adequately address.
Therefore, there is a pressing need for more efficient techniques that can minimize the computational burden of grid-search approaches while identifying approximate phase boundaries in compute-intensive, high-dimensional, and non-equilibrium systems. The approximate phase boundaries can subsequently be used for more detailed investigation through appropriate advanced techniques and/or physical experimentation. Active learning presents a promising solution to this challenge.20,21 When learning to approximate a function over an input parameter space, an active learner will strategically select new points in the input space to query the input–output values of the function of interest, such that the number of points needed to learn the function can be minimized. More specifically, consider a function y = f(θ), f:
n →
, where y is the function value of interest (e.g., a measure of the system's phase) and θ is the set of parameters under which the system is simulated. Instead of approximating f passively from random or uniform samples, active learning uses an acquisition function (AF) to guide the intelligent selection of θi to query the true value of yi = f(θi), considering a balance between two competing goals: exploration and exploitation. Exploration involves sampling points in unexplored regions of the parameter space to gain new information (often driven by the uncertainty in the current approximation of f), while exploitation focuses on refining the model in regions where the function values are of the most interest, typically the optimum (e.g., maximum or minimum of y over the space of θ). In other words, standard AF design in active learning is focused on the magnitude and uncertainty of the approximated output of function f.22,23 This is typically done using a Gaussian Process (GP) to model f which allows the predictive mean and variance of the GP to be analytically calculated and used in the AF. They have been successfully applied to search the parameter space of a complex system to discover the region of interest with a minimum number of simulations or experiments.24,25
In learning to approximate the phase diagram, however, the region of interest is not the phase but the boundary of phase transitions that are known to be sharp and localized. In learning to approximate the phase diagram, one can alternately describe it as identifying the boundary (manifold) where the order-parameter of interest undergoes a sharp transition. In thermodynamics, susceptibility is the derivative of the order parameter with respect to a thermodynamic property of the system that is often one of the axes of the phase diagram. Characteristically, the susceptibility exhibits a sharp peak at a phase transition. For multi-dimensional phase diagrams, one would analogously consider the gradient of the order parameter as the susceptibility vector. Standard AF design, not concerned with the rate of change (i.e., gradient)22,23 of the underlying function of interest, is thus not suited for capturing the narrow boundaries of phase transitions if directly applied. Moreover, standard AFs often focus on selecting individual query points in isolation over the parameter space, without considering a broader region of interest. This can lead to sampling localized clusters of points near a phase-transition boundary without accurately mapping its complete manifold.
A recent effort attempts to address these challenges by modeling the phase-transition boundaries as a classification boundary to be learned by active learning.26 It involves casting the problem as a classification task where, upon querying the MD simulation at a particular point in the parameter space, a label (e.g., −1 or 1) is assigned based on the simulation output to indicate the current phase of the colloid at that point. A GP is used to approximate these binary labels across the parameter space, with the AF designed to guide the selection of new query points towards regions with high uncertainty (i.e., high variance of the GP) or near the classification boundary (i.e. predicted label value close to 0). While demonstrating promising results in directing computational resources toward refining the transition region, this approach relies on a heuristic strategy that approximates the sharp boundary of phase transitions as a classification boundary between two discrete values. This approximation, partly dependent on a threshold to binarize an otherwise continuous phase diagram, introduces errors in the delineation of phase transition boundary. Furthermore, the assumption that the phase transition will occur at points where the value of y approaches 0 is not necessarily physically realistic. Rather, one must account for some physics-informed criterion such as the behavior of the susceptibility, rather than a unique value of the order parameter, at a phase transition.
In this work, we propose a novel active learning framework for rapidly localizing and approximating the sharp boundaries of phase transition in colloidal systems with a small number of queries for MD simulations. Our approach features two main contributions. First, we derive an analytical gradient-based AF that focuses on the first-order derivatives of the GP approximation of the phase diagram. In other words, we focus on the underlying rate of change of the approximated order parameter over the parameter space, considering both its magnitude and uncertainty, to guide the acquisition of query points towards the phase transition boundary. Second, to prevent excessive and localized sampling in high-gradient regions, we introduce a heuristic strategy that prevents new queries within an elliptical neighborhood of existing points in the parameter space. This strategy promotes the identification of the complete phase-transition manifold while further enhancing computational efficiency.
We validated our framework on one-dimensional (1D) and two-dimensional (2D) colloidal systems, in comparison to uniform grid sampling and active learning with commonly used upper-confidence-bound (UCB) AF without considering the gradient of the underlying function. Our results demonstrated rapid and reliable convergence to accurate phase diagrams—even with randomly chosen initial samples. We further investigated the effect of the length scale of the GP and the radii of the ellipse. Looking ahead, our method can be naturally extended to three-dimensional (3D) systems and further integrated into multi-fidelity active learning where lower-fidelity simulations provide fast lower-resolution approximations while higher-fidelity simulations can be dedicated to refine critical transition zones as guided by active learning.
We utilize molecular dynamics (MD) simulations to study the self-assembly of colloidal particles interacting with a tunable Morse potential given by Eqn. S1 in the SI. Our input parameters to the MD simulations represent temperature changes, volume fraction, and the range of interaction. Mathematically, the parameters are scaled attractive energy V* = V0/kBT, volume fraction of the colloids ϕ, and κa that controls the range of interaction relative to the radius of the particle a. In summary, each simulation tracks the evolution of 62
500 polydisperse colloidal particles of average radius a in implicit solvent undergoing Brownian motion, at the specified points in the phase diagram (attractive energy, volume fraction, and range) for 4000 Brownian times. The MD simulations are executed using LAMMPS with computing resources from research computing at RIT.10,29,30 Additional simulation details are available in the SI. At the end of each simulation, we track the local structure around each particle and compute the number of nearest neighbors Nnn from the radial distribution function (Eqn. S2 in the SI). In the literature, similar order parameters have been used to distinguish dilute and condensed phases. For instance, the discrete number of particles in the neighborhood of a single particle, also called contact number, have been used to identify the water to vapor transition31,32 and to classify non-equilibrium colloidal phases (gels and glasses)15,33 The exact value of this contact number, Nnn, at the phase boundary is often taken to be 6, the mid-way point between the maximum allowable range,31,32 but more generally, a rapid change of the order parameter will occur at the phase boundary which would naturally lend itself to a gradient-based acquisition method. Two sample snapshots of the particles are shown in Fig. 1, where low and high Nnn are able to distinguish dilute and condensed phases. The phase diagram we are approximating is thus represented by the following function:
| Nnn = f(V*, ϕ, κa). | (1) |
Our active learning approach leverages a gradient-based AF, which uses both the magnitude and uncertainty of the first-order derivatives of the approximated f(x) to prioritize regions where sharp transitions are likely. In doing so, we efficiently map the boundaries of phase transitions within a phase diagram with fewer simulations, focusing computational resources on parameter combinations most indicative of phase boundaries. As summarized in Algorithm 1, the active learning process involves the following main components. First, the GP is initialized with an initial set of labeled points. It then enters an iterative process of optimizing an AF to choose new parameter values, at which we query the MD simulations and update the GP with newly acquired data. The iteration continues until a convergence criteria is met.
| f ∼ GP(μ(x), k(x, x′)), | (2) |
![]() | (3) |
is the length scale parameter of the kernel. The RBF kernel is chosen for its smoothness and ability to model complex, non-linear functions.
Given training data {(xi, yi)}ni=1, the joint distribution of the observed outputs y ∈
n and the (scalar) function value y* = f(x*) at a single test point x* is
![]() | (4) |
, K(X, X) is the n × n kernel matrix, K(X, x*) ∈
n×1, and K(x*, x*) ∈
. Conditioning yields the posterior predictive mean and variance at x*:| μ*(x*) = K(x*, X)[K(X, X)]−1y,σ*(x*) = K(x*, x*) − K(x*, X)[K(X, X)]−1K(X, x*). | (5) |
The training is initialized with p labeled pairs (x, y) (see section 3.4), and increases by two at each acquisition (see section 3.2). The GP is iteratively updated with each newly acquired simulation result.
![]() | (6) |
![]() | (7) |
Let ∇y* represent the derivative of y with respect to x at the test point x*. The joint distribution of y and ∇y* at the test points is:
![]() | (8) |
Therefore, the predictive mean and standard deviation of the derivative at a test point x are:
![]() | (9) |
Note that the GP derivatives are calculated separately for each dimension in the parameter space, where regions of high derivative magnitude are expected to be narrow and localized, providing essential information for phase-change detection. We thus acquire points based on the spatial derivative per dimension, exploiting the high gradient magnitude along that dimension along with its variance. For dimension j of a n-dimensional x, this is achieved via a derivative-based UCB acquisition function expressed as:
![]() | (10) |
For each coordinate j, we choose
that maximizes αj over a candidate set, producing up to n points per iteration (after ellipse filtering) as described in the following section. The hyperparameter β controls the balance between exploitation of high-gradient regions and exploration of regions the algorithm has yet to visit: in this paper, the value of β is set to 1.
| ϕ vs. V* | κa vs. V* | ϕ vs. κa | |
|---|---|---|---|
| κa | N/A | 1 | 20 |
| ϕ | 0.05 | N/A | 1 |
| V* | 1 | 1 | N/A |
Prior to initialization, we define the GP prior as a zero-mean function with the RBF kernel as described earlier. A white noise kernel is added to this to account for observation noise from the MD simulations. To provide a data-driven starting point, the RBF kernel's initial length scale for each dimension is set heuristically to the median of the pairwise distances of the initial five samples.
![]() | (11) |
To determine if the model has stabilized, the change in the rolling mean is compared to a predefined threshold ε. If the change in the rolling mean over consecutive windows is consistently below ε, it indicates that the model's predictions are stabilizing. Formally, the process can be considered converged if
| |μrolling(t) − μrolling(t − w)| < ε | (12) |
In the current work, we performed all experiments with a predefined number of iterations, while evaluating the validity of the stopping criteria.
Another set of important hyperparameters is the ellipse radii to avoid excessive clustering of acquisition points and promote exploration of the phase-transition boundary. Their values are determined based on domain expertise. Table 1 lists the radii values we use for each 2D parameter space (ϕ vs. V*, κa vs. V*, and ϕ vs. κa). Each row represents individual parameters, while the columns correspond to the planes over which the phase diagrams are being computed. For instance, in the first row, κa is left unused in constructing the phase diagram on the ϕ vs. V* plane, and the value of the ellipse radius is set to 1 and 20 along the κa direction for identifying the phase-transition boundary in the κa vs. V* and ϕ vs. κa planes, respectively.
| α(x) = |μ(x)| + β × σ(x) | (13) |
• Accuracy is measured by the discrepancy between the predicted and ground-truth phase boundaries using the Hausdorff distance (HD). Hausdorff distance measures how far two shapes are from each other. Given two sets of points A and B (e.g., two contours), it looks at the point in A that is farthest from any point in B, and the point in B that is farthest from any point in A; the larger of those two numbers is the Hausdorff distance. We use the usual Euclidean distance between points.
• Efficiency measured by the number of MD simulations needed to achieve the desired accuracy in delineating the phase-transition boundaries. Obtained from a dense grid of 100 MD simulations.
![]() | ||
| Fig. 4 Hausdorff distance (HD) between the predicted and ground-truth phase boundary against the number of sequential MD simulations. | ||
For the phase diagram at ϕ = 0.05, the primary boundary with the largest gradient in the V* direction is rapidly captured by simulation 13, while the secondary boundary in the orthogonal direction in the κa direction near the edge takes a bit longer until simulation 19 to be localized, achieving a HD ≤ 0.1 beyond 17 simulations.
For the phase diagram at κa = 50, the diagonal phase boundary is again quickly detected at simulation 13, with subsequent acquisitions directed towards this region. By simulation 19, the diagonal structure of the phase boundary is well captured with a HD ≤ 0.1 and, by simulation 24, the detailed shape of the boundary near the top left of the diagram is well captured, with a HD well below 0.1 throughout the rest of the simulations.
Overall, as summarized in Fig. 4, the presented gradient-based acquisition is able to detect the primary boundary within 9–11 simulations and, by simulation 17–21, accurately capture the detailed shape of the phase boundary with a HD ≤ 0.1 if the primary boundary is not near the edge of the phase diagram. This means a reduction of over 80% computation compared to the reference (17 vs. 100 simulations). In the more challenging case where the primary phase boundary in near the edge of the phase diagram, the presented method is still able to accurately capture the primary boundary by simulation 24 (70% reduction of computation) and delineate even the secondary boundary in about 40 simulations (50% computation). This ability to localize phase boundary near edges can be important when applying to physical experimentation, where the feasible windows of parameters could be constrained by other practical factors.
![]() | ||
| Fig. 5 (a): Progression of active learning with random acquisition for ϕ = 0.05. (b) Progression of active learning with mean targeted UCB for ϕ = 0.05. | ||
As shown in Fig. 5(a), for the phase diagram at ϕ = 0.05, the random sampling approach struggles to correctly capture the top left of the transition boundary. This is in contrast with the gradient-based acquisition in Fig. 2 that starts to refine this boundary at simulation 13 and correctly captures its detail by simulation 19.
As shown in Fig. 5(b), for the phase diagram at ϕ = 0.05, acquisition explicitly targeting a high mean quickly captures the top left boundary as early as simulation 7, but fails to learn the bottom right boundary which was along the edge and high in gradients in the y direction. As a result, it was not able to correctly capture the shape of the phase-transition boundary even with more than 45 simulations.
These visual observations are confirmed by the quantitative HD metrics in Fig. 6, which shows that the HD metrics by the gradient-based acquisition drops significantly faster and stays lower compared to the two alternative methods. An important advantage of the gradient-based acquisition is that it does not require prior knowledge or assumption about the Nnn value on which to locate the phase boundary; it remains effective even when such knowledge is absent. We compared this method to a baseline where we explicitly used the knowledge that the boundary lies at Nnn = 6 and the gradient method was comparable to this method. Additional details are in Fig. S4 and S5 of the SI.
1. GP rolling mean (leftmost panel) – the moving average of the GP mean over all parameter pairs.
2. Mean-error to ground truth – the absolute difference between the GP mean and the reference mean.
3. GP rolling mean of ∂f/∂x1 or ∂f/∂x2 – moving average of the derivative along each direction.
4. Derivative-error in x1 or x2 – absolute difference between the derivative and the ground truth in each direction.
An example is shown in Fig. 7 here the rolling mean drops below ε = 0.2 which is the threshold for stopping at this point the RSME has achieved a small value close to 0.5 indicating good results. Beyond this point the GP is further refined but at the cost of additional simulations. In each case, a rolling-mean threshold ε = 0.2 would have stopped the run before the maximum number of simulations is reached. Moreover, when the rolling mean is under the threshold ε = 0.2, not only is the mean difference with the ground truth low but also the difference in the means of derivatives in both directions is low. See supplementary information for more comprehensive convergence metrics. For ϕ = 0.05, this happens at simulation 27. This suggests that under the threshold value, we get accurate phase boundary across both dimensions. Fine-tuning ε therefore offers a principled way to trade off efficiency and accuracy depending on the application need.
Fig. 9 summarizes the quantitative results for the fixed κa = 50 plane with and without ellipse-based heuristics. The HD metrics show that the ellipse-heuristics strategy reaches an error <0.1 within 9 simulations and stabilizes within 17, whereas the one without drops quickly to HD <0.1 within 13 simulations but is unable to stabilize and reaches an HD as high as 0.7 within subsequent simulations. If necessary, one can adjust the rate of convergence by adjusting the exclusion ellipse radii.
In practice, the phase diagram of a given colloidal system may also be drawn using different representations of the variables in each dimension. In the present study, the temperature dependence was reported using V* = V0/kBT corresponding to inverse temperature, but it is common practice to use dimensionless temperature (T* = kBT/V0), or the reduced second virial coefficient, which is a non-linear function of V0 and T.15 In such representations, the new range of variables labeling that dimension will inform the heuristic choice of parameters, namely, the length-scale parameter in the kernel (eqn (3)) and the radius of the exclusion ellipse.
In this work, the phase diagram was represented by a one-dimensional scalar value, Nnn, and approximated by a GP, to distinguish two phases (a dense and dilute phase). This could be extended in future work to allow phase diagrams to be represented by multi-dimensional variables, either for a more complete characterization of a single phase boundary of interest, or for characterizing multiple phase boundaries using different metrics. This can be achieved by replacing a single-output GP with multi-output GPs35,36 or a neural network.37,38 If the intent is to use the multi-dimensional output to identify a single phase boundary, the proposed gradient-based acquisition can be extended to consider the vector output with respect to each input dimension. For example, the characterization of glassy phases might require identification of both local structure and dynamics; local structure (Nnn) alone would be insufficient. If the intent is to use each dimension of the output vector to identify different phase boundaries, the proposed gradient-based acquisition can be applied to each output dimension, thereby running several acquisition simulations per iteration while sharing information from the acquired simulations. For example, to incorporate the glassy phase in the phase diagram in addition to the dilute and condensed phases, one would use this approach to simultaneously identify at least two boundaries – the dilute-condensed line and the fluid-glass line.
In the current method, to avoid excessive placement of local clusters of acquisition points near high-gradient regions, we adopted heuristic ellipses to encourage acquisitions to be evenly placed along the phase-transition boundary. Future work could improve this heuristic by designing an acquisition function that searches for the high gradient manifold of the phase diagram, instead of searching for individual points of high gradient.
Finally, the presented method could be used in a multi-fidelity simulation setting, where coarse-scale simulations would provide rapid initial localization of the phase-transition boundary, followed by more complex simulations actively selected near the initial boundary to iteratively refine its delineation. The high-fidelity simulations could take one of two approaches: refinement of the order parameter that describe the molecular behavior, or refinement of the estimation of the thermodynamic properties of the phase boundary itself. In the first approach, physics-informed techniques using principal component analysis (PCA) as described in ref. 38 can be used to construct high-fidelity order parameters near the vicinity of the phase boundary that can elucidate underlying complex mechanisms of the phase transition. In the second approach, expensive high-fidelity simulations such as coexistence simulations or free energy methods to compute the underlying energy landscapes can be performed. More recently, the expense of high-fidelity simulations was shown to be dramatically reduced by the use of generative thermodynamic maps (TMs) to predict free energy distributions in the context of phase transitions,39,40 but more work is needed in this area to improve the accuracy of the distributions.
Across one-dimensional and two-dimensional planes of the phase diagram, the proposed gradient-aware strategy consistently localized phase boundaries with substantially fewer MD simulations than dense grid search. In addition, it outperformed random acquisition and conventional GP acquisition strategies that operate only on the predicted mean and variance of Nnn and are not explicitly sensitive to sharp spatial transitions. Importantly, because the approach uses gradient information rather than assuming a fixed target level of the order parameter, it remains effective even when the precise boundary value is uncertain or varies across conditions – an advantage for exploratory studies where a universal threshold may not be known a priori.
Beyond the immediate results, the framework provides a practical route toward scalable, physics-aligned phase-diagram discovery. The gradient-based acquisition principle naturally extends to higher-dimensional parameter spaces, where grid search becomes prohibitive, and it is compatible with multi-fidelity workflows in which inexpensive, lower-accuracy simulations provide coarse localization while high-fidelity MD is reserved for refining critical boundary segments. More broadly, the methodology is applicable to other soft-matter and self-assembly systems where phase behavior is piecewise smooth with localized transition zones and where an order parameter exhibiting sharp changes can be defined. Taken together, these findings demonstrate that incorporating gradient information into active learning enables sample-efficient, boundary-focused exploration of colloidal phase diagrams and offers a computationally efficient alternative to exhaustive simulation campaigns.
| This journal is © The Royal Society of Chemistry 2026 |