Seyed Saeid Tayebi, Nate Dowdall, Todd Hoare* and Prashant Mhaskar*
Department of Chemical Engineering, McMaster University, 1280 Main St. W., Hamilton, Ontario, Canada L8S 4L7. E-mail: hoaretr@mcmaster.ca; mhaskar@mcmaster.ca
First published on 11th August 2025
The particle size of a nanoparticle plays a crucial role in regulating its biodistribution, cellular uptake, and transport mechanisms and thus its therapeutic efficacy. However, experimental methods for achieving a desired nanoparticle size and size distribution often require numerous iterations that are both time-consuming and costly. In this study, we address the critical challenge of achieving nanoparticle size control by implementing the Prediction Reliability Enhancing Parameter (PREP), a recently developed data-driven modeling-based product design approach that significantly reduces the number of experimental iterations needed to meet specific design goals. We applied PREP to effectively predict and control particle sizes of two distinct nanoparticle types with different target particle size properties: (1) thermoresponsive covalently-crosslinked microgels fabricated via precipitation polymerization with targeted temperature-dependent size properties and (2) physical polyelectrolyte complexes fabricated via charge-driven self-assembly with particle sizes and colloidal stabilities suitable for effective circulation. In both cases, PREP enabled efficient and precise size control, achieving the target outcomes in only two iterations. These results provide motivation to further utilize PREP to streamline experimental workflows across a range of biomaterials optimization challenges.
The success of each of these applications depends strongly on the size of the nanoparticle,9 which regulates both the convective transport of nanoparticles due to blood shear and variations in interstitial pressure as well as the potential for nanoparticles to interact with active and passive transport pathways that enable intracellular transport and/or transport across biological barriers such as the blood–brain barrier.6,10–16 In response, significant effort has been invested in developing strategies to synthesize nanoparticles with precise and uniform sizes across different particle size ranges suitable for different biomedical transport tasks.1,2,10,14,17,18 Such efforts can be broadly classified into two categories: (1) the assembly of pre-fabricated polymers into particles and (2) the direct synthesis of nanoparticles from monomeric building blocks. In the former case, techniques such as self-assembly, triggered precipitation, and template-assisted synthesis are commonly employed due to their ability to produce nanoparticles with well-defined characteristics.19–23 Self-assembly, for instance, relies on the spontaneous organization of polymeric building blocks through secondary intermolecular interactions like hydrophobic interactions, hydrogen bonding, electrostatic forces, and π–π stacking, with particle size control enabled by rational tuning of the composition of the building blocks and the solution conditions used.19,20 However, the inherent dispersity in size and composition among the typical polymeric building blocks for self-assembled nanoparticles can lead to broad particle size distributions, multiple particle populations, and/or the potential for aggregation. In the latter case, emulsion, precipitation, and/or suspension polymerization methods can all be applied to achieve particle size control, with the combination of such templating methods with controlled free radical polymerization strategies (e.g. atom transfer radical polymerization in emulsion polymerization) particularly beneficial to produce nanoparticles with tunable sizes.17,18 However, factors such as the variability of the local shear field, variable particle aggregation/nucleation, variability in surfactant or other surface stabilizer performance under different environmental/solvent conditions, and/or localized temperature gradients can result in poor control over nanoparticle size and polydispersity, particularly for methods that do not rely on more complex polymerization pathways and are thus more amenable to practical translation.
Addressing these size and stability challenges is difficult given the frequent interdependence of the key factors that regulate such properties; for example, adjusting one parameter such as monomer concentration, surfactant type/concentration, or reaction temperature can affect polymerization and/or assembly kinetics, the stability of the nanoparticle/solvent interface, and/or particle nucleation kinetics in sometimes unanticipated ways. This interconnectedness makes relying solely on experimental techniques for nanoparticle size optimization both time-consuming and costly, especially without a strategic framework to guide the process.24–27 In this context, incorporating model-based design techniques that can capture underlying patterns and relationships within the synthesis process offers significant promise to accelerate nanoparticle design. By leveraging model-based computational tools, researchers can plan experimental iterations more efficiently, reducing resource consumption and expediting the development of nanoparticles with desired characteristics.
Modeling approaches for optimizing nanoparticle size can be broadly classified into deterministic and data-driven models. Deterministic models leverage fundamental principles to describe system behavior, offering detailed insights into mechanisms like particle growth and nucleation. Studies have demonstrated the utility of deterministic models in solving reaction–diffusion equations and predicting size distributions under varying conditions.28–35 However, these models require extensive computational resources, detailed mechanistic knowledge (including measurement of several often hard-to-measure or estimate rate or interaction parameters), and costly validation, making them less practical for complex systems. In contrast, data-driven models bypass the need for detailed mechanistic understanding by uncovering patterns directly from experimental data. These models have been widely used to predict nanoparticle properties such as size and morphology by correlating recipe parameters with outcomes24,26,36 and have been particularly leveraged in polymerization-based processes to establish correlations between recipe parameters and final nanoparticle size, facilitating predictive particle size control while accounting for radical polymerization kinetics, diffusion rates, and interaction dynamics.1,27,29,33,36–38
Among various data-driven modeling techniques such as neural networks and advanced nonlinear regression models,24,25,27,33 latent variable models (LVM) such as Principal Component Analysis (PCA) and Partial Least Squares-Projection to Latent Structures (PLS) have garnered significant attention for their ability to identify a reduced set of latent variables (underlying patterns or structures) that explain most of the system's variability.39–42 While effective, these methods also pose drawbacks in the context of nanoparticle size optimization given their typical need for large datasets and prediction uncertainty when applied to new data points. Existing literature has proposed uncertainty metrics including Hotelling's T2 and the Squared Prediction Error (SPE) to address these limitations.43–50 While these metrics assess the alignment of new data points with the calibration dataset, their interpretations can vary depending on the specific metric used. Recently, we introduced the Prediction Reliability Enhancing Parameter (PREP), a unified metric that enhances predictive reliability by combining multiple model alignment metrics, to address this prediction uncertainty challenge. The PREP method was validated on synthetic datasets and shown to outperform existing methods in identifying optimal inputs to achieve target outputs, particularly in cases in which the optimal solution is outside the design space of the original dataset.51 However, to date the method has not been validated on an experimental use case.
Herein, we apply the PREP method to optimize nanoparticle size and nanoparticle size distributions in one polymerization-based nanoparticle synthesis use case (the synthesis of dual temperature/pH responsive microgels based on poly(N-isopropylacrylamide) (PNIPAM) via precipitation polymerization) and in one self-assembly-based nanoparticle synthesis use case (the fabrication of doxorubicin-loaded polyelectrolyte complexes based on sulfated yeast beta glucan and cationic dextran). The first case builds on previous literature from our group and our previous data-driven modeling efforts to optimize the size and colloidal stability of acid-functionalized PNIPAM microgels that have broad utility for drug delivery given their potential for environmentally-responsive reversible swelling responses, their capacity to deform and thus enhance penetration through biological barriers, and their highly hydrated surface properties that can suppress immune system recognition.40,52–54 The specific target was to match the crosslinking density and the acid content (4–8 mol%) to microgels in the existing dataset while achieving smaller particle sizes that remain stable over time. Specifically, while the pre-existing data set did not include a microgel with a size less than 170 nm that met the crosslink density and acid content criteria, a size of 100 nm was targeted to better exploit the biological penetration properties of the compressible microgels for drug delivery applications. The second case targeted a key challenge around the ionic strength tolerance of polyelectrolyte complexes, which are typically fabricated in water or low ionic strength buffers but often lose colloidal stability when then transferred to the physiological ionic strength conditions typically required for practical clinical use. The specific target was to achieve nanoparticles with diameter <200 nm (target = 170 nm) and a polydispersity index (PDI) as low as possible (target = 0.15), properties most suitable for long-term circulation, that remained colloidally stable under physiological ionic strength. We demonstrate that in both cases the PREP method can achieve the target properties with minimal historical data following only two iterations, opening the potential to apply PREP more broadly to address nanoparticle design challenges.
Specifically, LVM can either (1) extract correlations within a single block of data, via Principal Component Analysis (PCA), and project the original correlated data into a latent uncorrelated space (referred to as scores) or (2) define relationships between input variables (X) and output variables (Y) by jointly mapping them onto a latent space. In both cases, the resulting scores are represented as linear combinations of the original variables that are orthogonal to one another. The general structure of LVM is illustrated in Fig. 1; for detailed mathematical formulations and data-blocking configurations, the reader is referred to our prior manuscript.51
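As a minimal illustration of these two LVM flavours (a hedged sketch with synthetic data, not recipes or code from this study), the following Python snippet builds a PCA score model of a single X block and a PLS model that maps X onto Y through a shared latent space.

```python
# Minimal sketch of the two LVM flavours described above (illustrative data
# only): a PCA model of a single X block and a PLS model relating X to Y.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                     # e.g. scaled recipe inputs
Y = X @ np.array([[0.5], [-1.2], [0.8]]) + 0.05 * rng.normal(size=(10, 1))

# (1) PCA: scores are orthogonal linear combinations of the X variables
pca = PCA(n_components=2).fit(X)
t_pca = pca.transform(X)                         # latent scores of X alone

# (2) PLS: scores are chosen to explain X while being predictive of Y
pls = PLSRegression(n_components=2).fit(X, Y)
t_pls = pls.transform(X)                         # X-scores in the joint model
print(t_pca.shape, t_pls.shape)                  # (10, 2) and (10, 2)
```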
1. If A < K, there is no input set X for which Ypredicted = Ydesirable. In this case, model inversion identifies an input X whose Ypredicted is as close as possible to Ydesirable.
2. If A = K, there is a single solution for which Ypredicted = Ydesirable that can be identified by model inversion.
3. If A > K (the most common case in practice), there are an infinite number of input sets X for which Ypredicted = Ydesirable. These solutions form a continuous set known as the Null Space (NS) that represents the various input combinations that leave the output prediction unchanged, as sketched in the example below.
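The null space in the third scenario can be illustrated with a short, hedged sketch: for a PLS model with A latent components and K outputs, the score-space directions that leave the prediction unchanged span an (A − K)-dimensional subspace, obtainable from the null space of the Y-loadings matrix. The data and variable names below are synthetic assumptions used only to illustrate the dimensionality argument.

```python
# Hedged illustration of the A > K scenario: with A latent components and K
# outputs, the score-space directions that leave the prediction unchanged
# form an (A - K)-dimensional null space. Synthetic data only.
import numpy as np
from scipy.linalg import null_space
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 4))
Y = X @ np.array([[1.0], [0.5], [-0.7], [0.2]]) + 0.05 * rng.normal(size=(12, 1))

A, K = 3, 1                                      # latent dimensions vs. outputs
pls = PLSRegression(n_components=A).fit(X, Y)

# In score space the prediction is (up to centring/scaling) t @ y_loadings_.T,
# so moves within the null space of y_loadings_ leave Y_predicted unchanged.
NS = null_space(pls.y_loadings_)                 # orthonormal basis, shape (3, 2)
print("null-space dimension:", NS.shape[1])      # A - K = 2
```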
Solutions derived from latent variable model inversion (LVMI) can either match the predetermined target value (as in the second and third scenarios) or come as close as possible to it (as in the first scenario). While the prediction accuracy for these solutions varies across different samples, the degree of accuracy cannot be confirmed until all the solutions are experimentally tested, which can be a costly and time-consuming process. To address this issue, specific modeling alignment metrics can be computed solely from the input data (X), metrics that are generally classified into three categories:
(a) Hotelling's T2 metrics measure the distance of a new data point's projection to the latent space from the center of the latent space, indicating how far the new data point deviates from the calibration set.
(b) Squared prediction error (SPE) metrics assess how well the new data point can be reconstructed or regenerated by the model.
(c) Score alignment (HPLS & HPCA) metrics evaluate the similarity of the score structure of the new data point to that of the calibration data, indicating how closely the new sample aligns with the model's learned structure.
Fig. 1 also provides a conceptual summary of the Hotelling's T2 and SPE metrics: the SPE corresponds to the distance between Xnew and its model reconstruction (Xnew,regenerated) in the input space (reflecting how well the model can reconstruct the new sample), while the Hotelling's T2 metric reflects the distance between the latent projection of the new sample and the center of the latent space (capturing how far the sample deviates from the distribution of the calibration set). For the Score Alignment metric (H), when a new sample is projected into a less populated region of the latent space, this reflects a lower resemblance to the score structure of the calibration data points, resulting in a higher H score (and vice versa).
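A minimal computational sketch of these three metric families is given below, using the standard chemometric forms of Hotelling's T2 and SPE from a PCA model; the nearest-score distance used here for the score alignment metric H is a simplified stand-in, as the exact HPLS/HPCA definitions follow our prior work.51

```python
# Hedged sketch of the three alignment-metric families for a new candidate
# x_new, based on a PCA model of the calibration inputs (synthetic data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X_cal = rng.normal(size=(15, 4))                 # calibration inputs (scaled)
x_new = rng.normal(size=(1, 4))                  # candidate recipe

pca = PCA(n_components=2).fit(X_cal)
t_cal, t_new = pca.transform(X_cal), pca.transform(x_new)

# (a) Hotelling's T2: distance of the projection from the latent-space centre,
#     normalised by the variance of the calibration scores
T2 = float(np.sum(t_new**2 / t_cal.var(axis=0, ddof=1)))

# (b) SPE: squared residual between x_new and its model reconstruction
SPE = float(np.sum((x_new - pca.inverse_transform(t_new)) ** 2))

# (c) Score alignment (H): how sparsely populated the latent region around
#     t_new is, approximated here by the distance to the nearest calibration score
H = float(np.min(np.linalg.norm(t_cal - t_new, axis=1)))
print(T2, SPE, H)
```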
[Eqn (1): the PREP score, a combination of the model alignment metrics described above (Hotelling's T2, SPE, HPLS, and HPCA), each weighted by a coefficient C and raised to a power P that are optimized against the prediction accuracy of the training data; the full expression is given in ref. 51.]
To implement the PREP method, an initial dataset and a desired target output set are chosen and the k-nearest neighbors (with k being a tuning parameter) to the target output in the output space are identified and used to train both a PLS and a PCA model. The PLS model generates a list of potential design space (PDS) candidates comprised of candidate recipes expected to meet the target output. Model alignment metrics are subsequently calculated for the training data alongside the prediction accuracy, using a jackknife approach in which the PLS model is developed using a subset of the samples and the predicted output is compared to the actual value(s) of the excluded sample(s). The alignment metrics and prediction accuracy of the training dataset are then used to optimize the coefficients and powers of the PREP equation (C and P in eqn (1)), enabling the ranking of PDS samples by assigning a score to each candidate based on its likelihood of accurate prediction. Candidates with the lowest PREP score (indicating high prediction confidence) and the highest PREP score (representing high uncertainty, which can aid model refinement near the target output) are selected for synthesis. If the synthesized samples do not achieve the target, they are added to the dataset, the list of k-nearest neighbors is updated, and the process is repeated iteratively until the desired outcome is obtained. Fig. 2 illustrates the general scheme of the method, with further details available in the original paper.51
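As an illustration of the jackknife step that pairs each neighbor's alignment metrics with a prediction accuracy, the hedged sketch below performs a leave-one-out loop over a toy neighbor set; the accuracy definition (one minus the relative prediction error) is an assumption made for illustration only.

```python
# Hedged sketch of the jackknife step: each sample is left out in turn, a PLS
# model is fitted to the rest, and the excluded sample is predicted.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(7, 3))                      # toy nearest-neighbour recipes
Y = 300 + X @ np.array([[40.0], [-20.0], [10.0]]) + rng.normal(size=(7, 1))

accuracy = []
for i in range(len(X)):
    keep = np.delete(np.arange(len(X)), i)       # leave sample i out
    pls = PLSRegression(n_components=2).fit(X[keep], Y[keep])
    y_hat = pls.predict(X[i:i + 1]).item()
    accuracy.append(1.0 - abs(y_hat - Y[i].item()) / Y[i].item())

print(np.round(accuracy, 3))                     # per-neighbour jackknife accuracy
```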
The PREP method has two key advantages relative to previous methods for assessing prediction accuracy: (1) only a single parameter needs to be evaluated to compare samples, reducing uncertainty and bias in prediction assessment; and (2) the method does not require a large number of data points for practical implementation, with as few as A + 2 data points needed in which A represents the number of independent principal components of the system input. Note that while Bayesian and Gaussian process-based approaches can also be applied effectively to similar optimization challenges, they tend to rely on more sample-intensive strategies (e.g., Monte Carlo sampling) and thus often require significantly more data to achieve convergence relative to the PREP method, particularly in complex or high-dimensional settings.51 Relative to non-linear modeling approaches such as support vector regression, decision trees, and Gaussian process regression that have also performed well for predicting materials properties using relatively smaller sample sizes, PREP offers a key advantage in that it is fundamentally a linear latent variable-based framework, thus reducing the risk of overfitting, making interpretability simpler, and facilitating more robust extrapolation along well-defined latent variable directions (the latter of which is particularly beneficial for inverse design).
Sample ID | NIPAM (g) | MBA (mg) | VAA (mg) | SDS (mg) | APS (mg) | Size^a (nm)
---|---|---|---|---|---|---
1 | 1.6 | 160 | 342 | 57 | 50 | 426
2 | 1.6 | 160 | 114 | 57 | 50 | 283
3 | 1.6 | 160 | 80 | 57 | 50 | 177
4^b | 1.6 | 160 | 46 | 57 | 50 | 176
5 | 1.6 | 205 | 114 | 57 | 50 | 298
6 | 1.6 | 114 | 114 | 57 | 50 | 269
7 | 1.6 | 80 | 114 | 57 | 50 | 299
8 | 1.6 | 46 | 114 | 57 | 50 | 319
9 | 1.6 | 160 | 114 | 34 | 50 | 396
10 | 1.6 | 160 | 114 | 23 | 50 | 444
11 | 1.6 | 160 | 114 | 0 | 50 | 657
12^b | 1.6 | 160 | 342 | 0 | 50 | 954
13 | 1.6 | 173 | 45 | 42 | 50 | 190
14 | 1.6 | 244 | 176 | 24 | 50 | 332
15^b | 1.6 | 160 | 228 | 57 | 50 | 300
^a Sizes correspond to the intensity-averaged effective diameter measured at pH = 7.4 and 37 °C. ^b Represents the best available candidates based on the existing dataset to meet the design criteria of creating a set of microgels with the same crosslinking density/acid content but as different as possible particle sizes.
[Eqn (2): the inversion-by-optimization (IbO) objective, which minimizes the deviation of the model-predicted output from the desired target over the candidate inputs, subject to the recipe constraints and incorporating the model alignment metrics (Hotelling's T2 and SPE).]
The number of PLS components in such cases is typically determined using data-driven approaches such as cross-validation57 or the eigenvalue-less-than-one rule,58 or from experimental knowledge of the dependencies among input variables. In this microgel dataset, the selection was guided by experimental knowledge, as all three input variables (MBA, VAA, and SDS) could be independently manipulated within feasible ranges to synthesize new microgels. Consequently, three PLS components were chosen to sufficiently capture the relationships between the inputs and the output. Using this PLS model, the optimization framework in eqn (2) was applied, resulting in the recipe outlined in Table 2 (IbO 1st itr). The particle size obtained from this recipe (170 nm) was very close to that of the smallest microgel already available in the dataset. This new recipe was subsequently incorporated into the dataset, and the optimization algorithm was executed again for the next iteration. However, the synthesis of the suggested solution in the second iteration (IbO 2nd itr in Table 2) resulted in aggregation. It is worth noting that the direct model inversion solution was not applicable in this case, as it provided a single answer that failed to meet the required conditions around the VAA content (reaching as low as 2.4 mol%). As such, a more conventional approach did not achieve the targeted particle size, motivating the implementation of the PREP method, which was applied next to overcome these constraints.
Sample ID | MBA (mg) | VAA (mg) | SDS (mg) | Size (nm) | Comments |
---|---|---|---|---|---|
Direct Model Inversion | 158 | 33 | 57 | — | MBA and acid content both too low |
IbO 1st itr | 160 | 62 | 65 | 170 | |
IbO 2nd itr | 160 | 108 | 74 | — | Sample showed large-scale aggregation |
PREP 1st itr (L1) | 160 | 92 | 91 | 144 | |
PREP 1st itr (H1) | 160 | 70 | 80 | 151 | |
PREP 2nd itr (L2) | 160 | 84 | 134 | 104 | |
PREP 2nd itr (H2) | 160 | 101 | 133 | 118 |
The PREP method was implemented by first identifying the list of nearest neighbors; with three latent space components and a single output variable, a minimum of A + 2 = 5 nearest neighbors was required. To avoid any perception that PREP was enhanced by the IbO method or by the similarity of the IbO 1st itr sample to a pre-existing datapoint (Sample 4), the IbO 1st itr sample was excluded from the list of neighbors so that PREP started with the same dataset originally provided to the IbO method. Fig. 3 depicts all available datapoints and the five nearest neighbors to the target in both the input (a) and output (b) spaces.
Fig. 3 Visualization of all available datapoints alongside the five nearest neighbors to the target in both the input (a) and output (b) spaces derived from the pre-existing dataset (Table 1).
Subsequently, PLS and PCA models were constructed using the selected neighbors, followed by the creation of the Potential Design Space (PDS). In this case, the number of PLS components exceeded the number of output variables by two, resulting in a two-dimensional null space (i.e. for any given Ydesirable, there exists a two-dimensional surface in the input and latent spaces where all points satisfy Ypredicted = Ydesirable). However, given the imposition of the constraint fixing the MBA content at 160 mg to match the crosslink density of the target microgel with the existing microgels in the series, the number of degrees of freedom was reduced to collapse the null space to a single dimension (i.e. a line within the original two-dimensional space), as shown in Fig. 4(i). Further analysis of the points along the blue line revealed that none of the candidates met the 4–8 mol% acid content requirement, necessitating the creation of the PDS using an optimization-based algorithm. The algorithm generated a list of 50 candidates whose predicted outputs (Ypredicted) were as close as possible to the desired target (Ydesirable) while still satisfying all specified constraints. It is important to emphasize that the list generated through this optimization process fundamentally differs from the results obtained via the IbO approach; while the PREP optimization algorithm produces a list of candidates by considering only the input range requirements, IbO yields a single solution by incorporating modeling alignment metrics such as Hotelling's T2 and Squared Prediction Error (SPE). The new list generated by the implemented optimization algorithm (the PDS) is also shown in Fig. 4(i).
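A hedged sketch of this optimization-based PDS generation is shown below: candidate recipes are obtained by repeatedly minimizing the squared deviation between the PLS-predicted size and the target from random starting points, subject to box constraints. The toy size model and the bounds (MBA fixed; illustrative VAA/SDS windows standing in for the acid-content requirement) are assumptions, not the actual constraints used in this work.

```python
# Hedged sketch of optimization-based Potential Design Space (PDS) generation
# via a multi-start constrained search (synthetic data and bounds).
import numpy as np
from scipy.optimize import minimize
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
X = rng.uniform([150, 40, 20], [170, 350, 100], size=(6, 3))   # MBA, VAA, SDS
Y = 0.5 * X[:, [1]] - 2.0 * X[:, [2]] + 300 + rng.normal(scale=2.0, size=(6, 1))
pls = PLSRegression(n_components=3).fit(X, Y)

y_target = 100.0
bounds = [(160.0, 160.0),        # MBA fixed to match the target crosslink density
          (60.0, 120.0),         # VAA window (stand-in for 4-8 mol% acid)
          (20.0, 150.0)]         # SDS range

pds = []
for _ in range(50):              # multi-start search to populate the PDS
    x0 = np.array([lo + rng.random() * (hi - lo) for lo, hi in bounds])
    res = minimize(lambda x: (pls.predict(x[None, :]).item() - y_target) ** 2,
                   x0, bounds=bounds)
    pds.append(res.x)
print(np.round(pds[:3], 1))      # first few candidate recipes
```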
To identify the most relevant candidates for synthesis within the Potential Design Space (PDS), model alignment metrics were calculated for both the nearest neighbor samples and the PDS members and then used together with the prediction accuracy of the nearest neighbor samples to optimize the PREP equation parameters (C and P in eqn (1)). The resulting optimized PREP equation was then applied to rank all PDS candidates, from which two samples corresponding to the lowest (L-PREP) and highest (H-PREP) PREP scores were selected for experimental synthesis. The results of the PREP optimization and the ranking of PDS samples for iteration 1 are presented in Fig. 4, where panel (ii) illustrates the relationship between the prediction accuracy and the PREP score for the validation data points used in optimizing the PREP equation and panel (iii) shows the PDS candidates ranked by their PREP scores; the two formulations selected for synthesis, corresponding to the lowest (L-PREP) and highest (H-PREP) PREP scores, are also clearly highlighted. As expected, lower prediction accuracy is associated with higher PREP scores, confirming the metric's effectiveness in assessing prediction reliability. The measured particle sizes of the L-PREP and H-PREP recipes, as shown in Table 2, demonstrated that the samples suggested by the PREP method outperformed all existing datapoints in the dataset as well as those proposed by the IbO approach. However, since the particle sizes of these samples still did not meet the ∼100 nm target size, the newly synthesized samples from this first iteration were added to the dataset, the list of nearest neighbors was updated, and the PREP method was reapplied to generate new synthesis recipes. Note that including the two recipes from the first iteration (and thus removing two of the five nearest neighbors used in the first iteration) results in a 40% change in the dataset for the second iteration compared to the first iteration, a key advantage of using a smaller number of samples such that each sample carries disproportionately high weight in reframing the model (i.e. adding or replacing even a few samples can substantially alter the dataset, the model parameters, and thus the second iteration predictions).
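The ranking step can be illustrated with the toy sketch below, in which a simplified two-term PREP-style score (Hotelling's T2 and SPE only) is fitted so that it tracks the jackknife prediction error of the neighbors and is then used to pick the lowest- and highest-scoring PDS candidates; the two-term functional form and the least-squares fitting objective are assumptions, and the published method combines additional metrics.51

```python
# Hedged toy sketch of fitting a PREP-style score and ranking PDS candidates.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
T2_nn, SPE_nn = rng.random(5) + 0.1, rng.random(5) + 0.1      # neighbour metrics
err_nn = 0.4 * T2_nn + 0.6 * SPE_nn + 0.05 * rng.random(5)    # 1 - accuracy

def prep_score(params, T2, SPE):
    c1, c2, p1, p2 = params                                   # coefficients/powers
    return c1 * T2 ** p1 + c2 * SPE ** p2

res = minimize(lambda p: np.sum((prep_score(p, T2_nn, SPE_nn) - err_nn) ** 2),
               x0=[1.0, 1.0, 1.0, 1.0], bounds=[(0, None)] * 4)

T2_pds, SPE_pds = rng.random(50) + 0.1, rng.random(50) + 0.1   # PDS metrics
scores = prep_score(res.x, T2_pds, SPE_pds)
print("L-PREP candidate index:", np.argmin(scores))            # most reliable
print("H-PREP candidate index:", np.argmax(scores))            # most uncertain
```

In this sketch, as in the workflow above, the fitted score is used only to rank candidates; the lowest- and highest-scoring recipes are then carried forward to synthesis.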
The updated latent space based on the revised dataset is shown in Fig. 5(i). Note that enforcing all design constraints (particularly the specified acid content range of 4–8 mol%) did not yield a sufficient number of solutions within the actual null space (NS); consequently, the PDS for the second iteration was expanded using the same optimization-based approach as in the first iteration, ensuring that all constraints were satisfied while generating at least 50 candidate datapoints within the PDS. The PREP equation parameters (C and P) were then re-optimized and the resulting equation was re-applied to rank all PDS candidates, with the resulting H-PREP and L-PREP samples identified in Fig. 5(iii) subsequently synthesized. As shown in Table 2, the L-PREP sample demonstrates exceptional proximity to the target particle size, achieving a size of 104 nm. Correspondingly, as shown in Fig. 5 panel (ii), the PLS model developed for the second iteration demonstrates significantly improved accuracy near the target output of 100 nm. Even the lowest-performing validation sample achieved over 97% accuracy (an improvement from 88% in the first iteration), indicating that the PREP method effectively guided the dataset expansion toward the desired region and enhanced model precision around the target.
Table 2 provides a summary of the particle sizes of the synthesized samples suggested by both the PREP and optimization-based methods. The microgel recipes proposed by the PREP method outperformed not only those generated by the optimization-based approach but also all samples in the initial dataset in terms of closeness to the target. The first-iteration PREP samples achieved 75% and 78% accuracy relative to the target (particle sizes of 151 nm and 144 nm, respectively), while the second-iteration recipes achieved accuracies of 92% and 98% (118 nm and 104 nm), the latter surpassing the predefined acceptable threshold of 95% closeness to the target. The PREP method's capacity to deliver an optimized solution within just two iterations underscores the method's ability to handle dataset expansion rationally, rapidly refine predictions, and adapt to challenging design constraints in a highly non-linear system.
Sample ID | Assembly solvent [× PBS] | Total precursor conc. [mg mL−1] | Pre-assembly GS conc. [mg mL−1] | Pre-assembly Dex-GTAC conc. [mg mL−1] | Pre-assembly DOX conc. [mg mL−1] | GS : DOX mass ratio | Dex-GTAC : DOX mass ratio | Size in assembly solvent [nm] | PDI in assembly solvent | Size in 1× PBS [nm] | PDI in 1× PBS
---|---|---|---|---|---|---|---|---|---|---|---
1 | 0.5 | 0.5 | 0.750 | 0.200 | 0.050 | 15.0 | 4.0 | 156 | 0.11 | 208 | 0.11 |
2 | 0.1 | 0.5 | 0.750 | 0.200 | 0.050 | 15.0 | 4.0 | 109 | 0.13 | 362 | 0.04 |
3 | 0.5 | 0.75 | 1.125 | 0.300 | 0.075 | 15.0 | 4.0 | 147 | 0.14 | 229 | 0.09 |
4 | 0.1 | 0.75 | 1.125 | 0.300 | 0.075 | 15.0 | 4.0 | 110 | 0.14 | 357 | 0.08 |
5 | 0.5 | 1 | 1.500 | 0.400 | 0.100 | 15.0 | 4.0 | 161 | 0.15 | 260 | 0.06 |
6 | 0.1 | 0.25 | 0.375 | 0.100 | 0.025 | 15.0 | 4.0 | 133 | 0.18 | 326 | 0.11 |
7 | 0.5 | 0.5 | 0.750 | 0.188 | 0.063 | 12.0 | 3.0 | 146 | 0.09 | 217 | 0.11 |
8 | 0.1 | 0.5 | 0.750 | 0.188 | 0.063 | 12.0 | 3.0 | 124 | 0.16 | 298 | 0.08 |
9 | 0.1 | 0.75 | 1.125 | 0.281 | 0.094 | 12.0 | 3.0 | 123 | 0.19 | 313 | 0.05 |
10 | 0.5 | 1 | 1.500 | 0.375 | 0.125 | 12.0 | 3.0 | 164 | 0.10 | 243 | 0.05 |
11 | 0.1 | 1 | 1.500 | 0.375 | 0.125 | 12.0 | 3.0 | 124 | 0.20 | 744 | 0.25 |
12 | 0.5 | 0.5 | 0.750 | 0.125 | 0.125 | 6.0 | 1.0 | 141 | 0.10 | 170 | 0.21 |
13 | 0.5 | 0.5 | 0.727 | 0.182 | 0.091 | 8.0 | 2.0 | 153 | 0.03 | 409 | 0.12 |
14 | 0.26 | 0.72 | 1.119 | 0.255 | 0.067 | 16.7 | 3.8 | 113 | 0.08 | 142 | 0.28 |
15 | 0.17 | 0.83 | 1.275 | 0.311 | 0.074 | 17.2 | 4.2 | 112 | 0.07 | 150 | 0.26 |
16 | 0.2 | 0.82 | 1.269 | 0.292 | 0.079 | 16.1 | 3.7 | 113 | 0.11 | 137 | 0.21 |
17 | 0.16 | 0.78 | 1.206 | 0.279 | 0.075 | 16.0 | 3.7 | 116 | 0.08 | 141 | 0.23 |
18 | 0.17 | 0.53 | 0.875 | 0.116 | 0.068 | 12.8 | 1.7 | 117 | 0.22 | 142 | 0.31 |
19 | 0.1 | 0.54 | 0.882 | 0.130 | 0.068 | 12.9 | 1.9 | 144 | 0.23 | 171 | 0.21 |
Fig. 6 Visualization of all available data points along with the five nearest neighbors to the target in the input space (a) and output spaces showing all samples (b) and only the nearest neighbors (c) as derived from the pre-existing dataset summarized in Table 3.
Although four input variables were available for manipulation, an additional constraint was imposed to require that samples have a higher GS concentration relative to Dex-GTAC concentration such that the nanoparticle surface is GS-rich (to promote nanoparticle/macrophage interactions) and the final net charge in the PEC is anionic, key to minimizing interactions with proteins in physiological fluids and representing a common design criterion for PECs.71–73 As a result, the number of truly independent variables was reduced to three, the number of PLS components was accordingly set to three, and the minimum number of nearest neighbors required to initiate the PREP analysis was A (= 3) + 2 = 5. Fig. 6 illustrates all available data points and highlights the five nearest neighbors to the target in both the input space (a) and the output space (b), with panel (c) representing a zoomed-in version of the area around the target in panel (b).
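The neighbor-selection step can be sketched as follows: the measured outputs are standardized so that size and PDI contribute comparably, and the A + 2 = 5 samples closest to the target (170 nm, PDI 0.15) in this scaled output space are retained. The toy Y block below is a stand-in for the Table 3 measurements, not the actual data.

```python
# Hedged sketch of selecting the five nearest neighbours to the target in the
# standardised output space (synthetic stand-in for the Table 3 outputs).
import numpy as np

rng = np.random.default_rng(8)
Y = np.column_stack([rng.uniform(100, 750, 19),      # size in 1x PBS [nm]
                     rng.uniform(0.04, 0.31, 19)])   # PDI in 1x PBS
y_target = np.array([170.0, 0.15])

mu, sd = Y.mean(axis=0), Y.std(axis=0)               # put size and PDI on one scale
dist = np.linalg.norm((Y - mu) / sd - (y_target - mu) / sd, axis=1)
nearest = np.argsort(dist)[:5]
print("nearest-neighbour sample indices:", nearest)
```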
Next, the PREP method was iteratively applied to the dataset following the same structured sequence of steps described in Case Study 1 for each iteration: developing PLS and PCA models, generating the PDS, optimizing the PREP equation, ranking the PDS, selecting the L-PREP and H-PREP candidates, synthesizing the L-PREP and H-PREP recipes, evaluating whether the target was met, and (if necessary) updating the list of nearest neighbors before repeating the process until satisfactory experimental results were achieved. Given the number of measurable variables and the number of PLS components, the dataset had a one-dimensional null space, i.e. there exists a line in the three-dimensional latent space along which variations do not affect the predicted Y. All points on this line, provided they satisfy the constraint GS mass > Dex-GTAC mass, constitute the PDS and were ranked based on their PREP score.
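A hedged sketch of scanning such a one-dimensional null space and filtering it by the composition constraint is given below; the recipe variables, the toy response model, and the use of the PLS loadings (via inverse_transform) to map scores back to recipes are illustrative assumptions rather than the exact procedure used here.

```python
# Hedged sketch of walking a 1-D null space and keeping only candidates that
# satisfy the GS conc. > Dex-GTAC conc. constraint (synthetic data).
import numpy as np
from scipy.linalg import null_space
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(7)
X = rng.uniform(0.1, 1.5, size=(8, 3))          # solvent, GS conc., Dex-GTAC conc.
Y = np.column_stack([200 * X[:, 0] + 50 * X[:, 1],    # toy size in PBS [nm]
                     0.10 + 0.10 * X[:, 2]])          # toy PDI in PBS
Y += 0.01 * rng.normal(size=Y.shape)

pls = PLSRegression(n_components=3).fit(X, Y)   # A = 3, K = 2 -> 1-D null space
t0 = pls.transform(X[:1])                       # a score point near the target
direction = null_space(pls.y_loadings_)         # shape (3, 1)

candidates = []
for step in np.linspace(-2, 2, 41):             # walk along the null-space line
    t = t0 + step * direction.T
    x = pls.inverse_transform(t)                # approximate recipe for this score
    if x[0, 1] > x[0, 2]:                       # keep GS conc. > Dex-GTAC conc.
        candidates.append(x.ravel())
print(len(candidates), "constraint-satisfying candidates")
```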
The outcomes of PREP implementation for the first two iterations are presented in Fig. 7. In each sub-figure, panel (i) illustrates the limited portion of the null space (NS) that is spanned by the Potential Design Space (PDS) within the latent space, panel (ii) displays the results of the PREP equation optimization, highlighting the alignment of the validation data points along the optimized trend line according to the calculated PREP scores, and panel (iii) shows the PDS candidates for each iteration ranked by their PREP scores; the two candidates selected for experimental synthesis, denoted as L-PREP (low PREP score, high reliability) and H-PREP (high PREP score, high uncertainty), are clearly indicated in the graph and consistently labeled as Lx or Hx, where x is the iteration number.

The first iteration of the model exhibited relatively poor predictive performance near the target output (Fig. 7(a)), with two of the validation data points yielding prediction accuracy values as low as 60%. However, in the second iteration (Fig. 7(b)), model accuracy improved substantially, with the lowest-performing validation data point showing a prediction accuracy of 85%. Table 4 confirms that the optimization objectives were successfully achieved within just two iterations, yielding a particle with a size of 171 nm (target <200 nm) and a polydispersity index of 0.19 (target <0.2). Nonetheless, two additional iterations (Fig. 8(a) and (b)) were conducted to explore the possibility of further improving the dispersity, leading to the synthesis of a more narrowly dispersed PEC with a particle size of 182 nm and a PDI of 0.15 (Table 4) that precisely matched the model's targeted dispersity value. Note that by the fourth iteration (Fig. 8(b)) even the least accurate validation sample achieved a prediction accuracy above 93%, demonstrating the ability of the PREP method to improve model predictions in a minimal number of iterations. It is important to note that conducting the PREP algorithm over another two iterations (Table 4) did not yield further improvements over the best sample obtained in iteration 4 (Sample L4), consistent with the high accuracy of the model already achieved at iteration 4 such that additional iterations did not offer significant further benefits in model prediction accuracy (Fig. S1(a) and S1(b)). This behavior is consistent with the probabilistic nature of the PREP algorithm, which, while generally effective in guiding dataset expansion, does not guarantee monotonic performance improvement across iterations. As shown in our prior work, the sample rankings based on PREP scores do not always correspond directly to prediction accuracy, and in some iterations high PREP score candidates may unexpectedly yield better results than low PREP ones (presumably by probing less explored parts of the design space that have higher prediction errors but yield superior performance). This highlights the value of PREP's dual-candidate strategy (L-PREP and H-PREP) while also illustrating the convergence limits of the model once optimal regions of the design space have been sufficiently explored. Collectively, these results illustrate PREP's capacity to efficiently converge on an optimal solution within a constrained design space while requiring minimal experimental effort.
Sample ID | Assembly solvent [× PBS] | Total precursor conc. [mg mL−1] | Pre-assembly GS conc. [mg mL−1] | Pre-assembly Dex-GTAC conc. [mg mL−1] | Pre-assembly DOX conc. [mg mL−1] | GS : DOX mass ratio | Dex-GTAC : DOX mass ratio | Size in assembly solvent [nm] | PDI in assembly solvent | Size in 1× PBS [nm] | PDI in 1× PBS
---|---|---|---|---|---|---|---|---|---|---|---
L1 | 0.18 | 0.40 | 0.625 | 0.102 | 0.073 | 8.6 | 1.4 | 121 | 0.23 | 178 | 0.34
H1 | 0.13 | 0.86 | 1.341 | 0.309 | 0.070 | 19.1 | 4.4 | 105 | 0.14 | 126 | 0.23
L2^a | 0.50 | 0.88 | 1.257 | 0.274 | 0.229 | 5.5 | 1.2 | 97 | 0.21 | 171 | 0.19
H2 | 0.46 | 0.88 | 1.178 | 0.447 | 0.135 | 8.7 | 3.3 | 96 | 0.06 | 131 | 0.24
L3 | 0.30 | 0.83 | 1.273 | 0.306 | 0.081 | 15.8 | 3.8 | 94 | 0.02 | 125 | 0.27
H3 | 0.76 | 0.66 | 0.924 | 0.066 | 0.330 | 2.8 | 0.2 | 108 | 0.25 | 118 | 0.4
L4^a | 0.10 | 0.80 | 1.060 | 0.353 | 0.186 | 5.7 | 1.9 | 111 | 0.02 | 182 | 0.15
H4 | 0.13 | 0.94 | 1.436 | 0.368 | 0.075 | 19.1 | 4.9 | 93 | 0.09 | 126 | 0.23
L5 | 0.10 | 0.65 | 1.000 | 0.250 | 0.050 | 20.0 | 5.0 | 106 | 0.10 | 131 | 0.25
H5 | 0.10 | 0.71 | 1.061 | 0.300 | 0.060 | 17.7 | 5.0 | 126 | 0.11 | 166 | 0.20
L6 | 0.59 | 0.65 | 1.128 | 0.120 | 0.052 | 21.6 | 2.3 | 108 | 0.25 | 105 | 0.39
H6 | 0.33 | 0.51 | 0.862 | 0.128 | 0.030 | 29.0 | 4.3 | 81 | 0.18 | 104 | 0.51
^a Best performing samples.
Fig. 9 illustrates the outcomes of each iteration alongside the initial nearest neighbors from the pre-existing dataset in the output space, highlighting the proximity of each iteration result to the target. Notably, while the L2 (second iteration L-PREP) sample significantly outperformed all other samples in the dataset (i.e. was positioned closer to the target within the output space), the third iteration H-PREP and L-PREP samples both significantly underperformed the initial nearest neighbor samples; however, extending the iterations for one more cycle resulted in the L4 formulation, which improved on the performance of L2. This example shows that the aggressiveness of the PREP method in revising the nearest neighbor (and thus "historical") samples in each iteration can lead to significant iteration-to-iteration variability but ultimately enables faster convergence on a recipe with the target properties. Of note, the optimized L4 recipe resulted in a DOX encapsulation efficiency and loading capacity of 31% and 2.3 wt%, respectively; while this result represents a modest encapsulation efficiency, the loading capacity is significant and the potent nature of DOX (IC50 values in the micromolar/nanomolar range74,75) is relevant for practical chemotherapeutic use. Furthermore, if additional optimization of the DOX content within these PECs is desirable, the PREP method may be applied to the same system while adding DOX loading as an additional target property.
Relative to the first case study, this case presented additional challenges associated with a greater number of output variables, a lower degree of freedom in the null space (1D compared to 2D in the first case study), and the need to optimize properties that were not intrinsic to the initially synthesized particles but instead emerged after their introduction into a higher ionic strength solution. The successful implementation of PREP in this complex scenario further underscores its potential for handling high-dimensional systems with greater complexity.
The iterative feedback structure of PREP is also highly advantageous in that it allows the PREP method to rapidly incorporate new data and revise its predictions, offering an efficient means of dataset expansion with each iteration contributing meaningful directional insight. These results suggest that PREP is particularly well-suited to systems in which the relationships among input variables are complex, the output space is multidimensional, and the design goals are not fully represented in the initial data. More specifically, the second case study presented additional challenges due to a higher number of output variables, reduced flexibility in the null space, and the need to optimize properties that emerged only after the particles were introduced into physiological conditions, all challenges that were successfully navigated by the PREP algorithm.
The success of PREP in these studies highlights its potential as a transformative tool for nanoparticle design and optimization. By leveraging data-driven modeling, PREP offers a systematic approach to refining synthesis protocols, reducing resource-intensive trial-and-error processes, and ensuring precise control over key material properties. Note that while the case studies described herein focus only on particle size optimization for two types of systems (covalently-crosslinked microgels and polyelectrolyte complexes), we expect the underlying PREP framework to be broadly applicable to optimizing the size or other properties of other types of nanoparticle systems in which the experimental design variables (inputs) and measured properties (outputs) can be organized into well-defined multivariate X and Y blocks, respectively. Moving forward, the application of PREP to datasets with an even higher degree of input and output complexity remains an open avenue for exploration, presenting opportunities to further extend its impact across a broader range of nanoparticle engineering challenges.
PREP optimization results for the second case study (PEC), including outcomes from iterations 5 and 6. See DOI: https://doi.org/10.1039/d5nr01664a.