Christina Glaubitz a, Barbara Rothen-Rutishauser a, Marco Lattuada b, Sandor Balog *a and Alke Petri-Fink *ab
a Adolphe Merkle Institute, University of Fribourg, Chemin des Verdiers 4, 1700 Fribourg, Switzerland. E-mail: sandor.balog@unifr.ch; alke.fink@unifr.ch
b Chemistry Department, University of Fribourg, Chemin du Musée 9, 1700 Fribourg, Switzerland
First published on 15th August 2022
Ultrasonication is a widely used and standardized method to redisperse nanopowders in liquids and to homogenize nanoparticle dispersions. One goal of sonication is to disrupt agglomerates without changing the intrinsic physicochemical properties of the primary particles. The outcome of sonication, however, is often uncertain, and quantitative models have been beyond reach. The magnitude of this problem is considerable owing to the fact that the efficiency of sonication depends not only on the parameters of the actual device, but also on the physicochemical properties of the particle dispersion itself. As a consequence, sonication suffers from poor reproducibility. To tackle this problem, we propose to involve machine learning. By focusing on four nanoparticle types in aqueous dispersions, we combine supervised machine learning and dynamic light scattering to analyze the aggregate size after sonication, and demonstrate the potential to considerably improve the design and reproducibility of sonication experiments.
To get around this bottleneck, we approached the problem via machine learning (ML). ML is data-driven and known for its ability to surpass human performance in various situations by capturing non-intuitive, nonlinear and multivariate relationships. Using supervised ML, we built an algorithmic model that is able to capture characteristic features of the complex relationships between (a) the outcome of redispersion and deagglomeration, and (b) the combinations of sonication parameters (such as particle concentration, dispersion volume, sonicator type, duration of sonication, and sonication power) and selected physicochemical properties of the particles (such as size, zeta-potential, isoelectric point, surface coating and material type). We use a gradient boosted decision tree algorithm and focus exclusively on four types of engineered nanoparticles (ENPs), for which we expect a similar response to ultrasonication. We characterize the outcome by the intensity-weighted Z-average hydrodynamic size (diameter) and the polydispersity index (PDI) of the particles, both determined by dynamic light scattering (DLS). According to the fundamental principles of DLS, these two are the most reproducible parameters one can determine from DLS experiments.35
Supervised ML aims at establishing a quantitative relationship between features (inputs) and labels (outputs). The outputs of sonication were the Z-average hydrodynamic diameter and the PDI, and the inputs were the sonication parameters and particle properties. Our data-driven ML approach was applied in two consecutive studies (Fig. 1). In the first part of the study, we addressed well-defined, but not strictly monodisperse, model ENPs, which were synthesized, processed, and characterized in our own laboratory. For this, we synthesized aqueous dispersions of non-crystalline silicon dioxide nanoparticles (SiO2 ENPs), commonly known as colloidal silica. The SiO2 ENPs were synthesized with nominal diameters of roughly 40 nm, 70 nm, and 100 nm, using a co-condensation reaction adapted from Stöber et al.36 Depending on the desired particle size, different relative amounts of ethanol, ammonia and water (MilliQ) were mixed and heated to 40 °C (ESI, Table SI1†). After stirring the mixture for 1 h at constant temperature for equilibration, tetraethyl orthosilicate was added, and the mixture was stirred for another 8 h. After the mixture had cooled down to room temperature, the SiO2 ENPs were purified by centrifugation (Thermo Scientific, 5000g, 5, 10, 15 and 20 min, five cycles in total). To mimic the extensive degree of agglomeration in powders, the ENPs were agglomerated by changing the ionic strength,37 namely by adding magnesium chloride hexahydrate (50 g L−1) to the aqueous dispersion. Agglomerated samples were purified again by dialysis and were resuspended in water before undergoing different sonication processes.
Fig. 1 Workflow of our studies. In the first part of the study, experimental data are generated by characterizing agglomerated dispersions of model SiO2 ENPs, before and after ultrasonication, using different, systematically combined sonication parameters. These data are used for supervising an ML algorithm, where the sonication parameters and particle sizes are fed into the algorithm as features (inputs), and the DLS results (Z-average and PDI) as the labels (outputs) to be predicted. By mapping and approximating the functional relationships between labels and features, the goal of the ML model is to predict the outcome of ultrasonication in terms of DLS analysis. In the second part of the study, meta-analysis is performed on data mined from peer-reviewed publications addressing the ultrasonication of particle systems. We focused on oxides, namely ZnO, SiO2, CeO2 and TiO2, given their evident presence in consumer products.38,39
The parameter combinations were obtained via a generalized fractional factorial experimental design, which aims at extracting as much information as possible from the smallest possible number of experiments while still revealing the relationship between the parameters, their values, and the experimental outcome.40 In this design of experiments we took into account that the primary particle size, particle concentration, dispersion volume, and duration and power of sonication have been found to affect the degree of deagglomeration of particle agglomerates.11,12,41–44 We designed the sonication experiments using the pyDOE2 library45 and used a reduction factor of seven, which reduced the number of experiments from 108 (full factorial design) to 14, with every experiment carried out in triplicate. Each experiment combined one of two or three levels of particle concentration, water volume, sonicator type (probe vs. bath), duration of sonication, and effective energy and energy density of sonication (Table 1).
No. | Sample vol. (mL) | Particle conc. (mg mL−1) | Type | Energy (J) | Duration (min) | Energy density (J mL−1) |
---|---|---|---|---|---|---|
1 | 1 | 1 | Probe | 179 | 1 | 179 |
2 | 1 | 1 | Probe | 17… | 20 | 17… |
3 | 1 | 1 | Bath | 2025 | 45 | 2025 |
4 | 1 | 5 | Probe | 3576 | 20 | 3576 |
5 | 1 | 5 | Probe | 38… | 45 | 38… |
6 | 1 | 10 | Probe | 8046 | 45 | 8046 |
7 | 5 | 1 | Probe | 3576 | 20 | 715 |
8 | 5 | 1 | Probe | 38… | 45 | 7765 |
9 | 5 | 5 | Probe | 8046 | 45 | 1609 |
10 | 5 | 10 | Bath | 194 | 1 | 39 |
11 | 10 | 1 | Probe | 8046 | 45 | 805 |
12 | 10 | 5 | Bath | 194 | 1 | 19 |
13 | 10 | 5 | Bath | 45 | 1 | 5 |
14 | 10 | 10 | Bath | 100 | 20 | 388 |
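The energy density column of Table 1 appears to follow from dividing the delivered energy by the sample volume. The short Python sketch below reproduces that arithmetic for a few rows of the table; the function name `energy_density` is an illustrative assumption and not part of the study's code.

```python
def energy_density(energy_j: float, volume_ml: float) -> float:
    """Energy delivered per unit sample volume, in J/mL."""
    if volume_ml <= 0:
        raise ValueError("sample volume must be positive")
    return energy_j / volume_ml


# (experiment number, sample volume in mL, energy in J), copied from Table 1
rows = [(7, 5, 3576), (9, 5, 8046), (10, 5, 194), (11, 10, 8046), (12, 10, 194)]

densities = {no: energy_density(energy, volume) for no, volume, energy in rows}
for no, value in densities.items():
    print(f"experiment {no}: {value:.1f} J/mL")
```

Rounded to whole numbers, these values match the tabulated energy densities of the corresponding experiments.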
The ultrasonication devices were a bath sonicator (Elmasonic P 60 H, ELMA) and a horn-probe sonicator (Branson SFX550 Sonifier equipped with a standard 13 mm diameter disruptor horn, Branson Ultrasonics Corp.). The sonication power and the corresponding energy release were calibrated by calorimetric measurements described elsewhere,28 and the details of calibration are presented in the ESI (Tables SI3 and 4†). After sonication, the particles were characterized by DLS (Malvern Panalytical, Zetasizer Nano-ZS) at room temperature (25 °C). For DLS analysis, sample volumes of 500, 100, or 50 μL were diluted with 1 mL water (MilliQ), depending on the particle concentration (1, 5, or 10 g L−1). Dilution was necessary to minimize any potential bias owing to multiple light scattering and to collective diffusion affecting Brownian dynamics.46 Each diluted triplicate was then measured three times, and the auto-correlation functions were analyzed by the method of cumulants to determine the so-called Z-average and PDI.35,47 Typical examples of the DLS field auto-correlation functions and representative TEM micrographs of the agglomerates are shown in the ESI (Fig. SI1, 2 and 4†). The Z-average and PDI are both functions of the intensity-weighted particle size distribution, which is affected by the parameters of sonication. Loosely speaking, the smaller the Z-average, the higher the impact of sonication on the agglomerates. The overall goal of sonication is to redisperse (that is, disintegrate) agglomerates without altering the primary particles’ properties.
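To make the method of cumulants concrete, the following Python sketch fits a second-order cumulant expansion, ln g1(τ) = −Γτ + (μ2/2)τ², to a synthetic field auto-correlation function and converts the result into a hydrodynamic diameter via the Stokes–Einstein relation, with PDI = μ2/Γ². It is a simplified illustration, not the instrument's algorithm. The laser wavelength (633 nm), detection angle (173°), and the viscosity and refractive index of water at 25 °C are assumed values, and the correlation data are generated rather than measured.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K


def scattering_vector(wavelength_m, refractive_index, angle_deg):
    """Magnitude of the scattering vector q in 1/m."""
    return (4 * math.pi * refractive_index / wavelength_m
            * math.sin(math.radians(angle_deg) / 2))


def fit_quadratic(x, y):
    """Least-squares fit y = c0 + c1*x + c2*x**2 via the normal equations."""
    s = [sum(xi ** k for xi in x) for k in range(5)]
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [ar - factor * ac for ar, ac in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def cumulants(tau, g1):
    """Return (mean decay rate Gamma in 1/s, PDI) from a second-order fit."""
    scale = max(tau)  # rescale lag times to [0, 1] for numerical conditioning
    x = [t / scale for t in tau]
    y = [math.log(g) for g in g1]
    _, c1, c2 = fit_quadratic(x, y)
    gamma = -c1 / scale
    mu2 = 2 * c2 / scale ** 2
    return gamma, mu2 / gamma ** 2


def hydrodynamic_diameter(gamma, q, temperature_k, viscosity_pa_s):
    """Stokes-Einstein diameter in m from the mean decay rate."""
    diffusion = gamma / q ** 2
    return K_B * temperature_k / (3 * math.pi * viscosity_pa_s * diffusion)


# Assumed measurement conditions (water at 25 degrees C, backscatter detection)
TEMPERATURE_K = 298.15
VISCOSITY_PA_S = 0.89e-3
q = scattering_vector(633e-9, 1.33, 173.0)

# Synthetic correlation function for a 200 nm sample with PDI 0.2 (not measured data)
true_diameter, true_pdi = 200e-9, 0.2
true_gamma = (K_B * TEMPERATURE_K
              / (3 * math.pi * VISCOSITY_PA_S * true_diameter) * q ** 2)
tau = [i * 1e-5 for i in range(1, 101)]
g1 = [math.exp(-true_gamma * t + 0.5 * true_pdi * true_gamma ** 2 * t ** 2)
      for t in tau]

gamma, pdi = cumulants(tau, g1)
diameter = hydrodynamic_diameter(gamma, q, TEMPERATURE_K, VISCOSITY_PA_S)
print(f"Z-average = {diameter * 1e9:.1f} nm, PDI = {pdi:.3f}")
```

Because the synthetic data are noise-free and exactly quadratic in the logarithm, the fit recovers the input size and PDI; real correlation functions require a restricted fit range and carry noise.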
To redisperse agglomerates in a controlled way, it is essential to closely approximate the unknown but existing functional relationship between the inputs (parameters of sonication) and the corresponding outputs (Z-average and PDI). This task is called multivariate regression analysis, and it is used for predicting and forecasting. The functional relationship is nevertheless complex, and reliable and transferable quantitative regression models are not available yet, to the best of our knowledge. This is the point where we invoke supervised machine learning. The model we implemented is based on a gradient boosted decision tree (GBDT) algorithm, using the XGBoost library.48 The benefit of using GBDT is that it offers good efficiency and flexibility while being relatively fast and relatively easy to implement as well as to interpret.48–51 GBDTs are therefore well suited to limited datasets, owing to their robustness in comparison with, e.g., deep learning models.52 Additionally, they enable a straightforward ranking of the feature importance. This offers insights into the decision-making of the model and ranks the importance of the underlying physical processes during ultrasonication. The structure of decision trees is composed of nodes and branches, where branches make ‘one-way’ connections between the nodes. Besides the root node (the very first node), there are two elementary types of nodes. The first type is non-leaf nodes, which are internal crossroad junctions of the decision route in the tree, representing either an attribute (e.g., “ultrasonication via bath sonicator”, “ultrasonication via horn sonicator”) or a question (e.g., “is the particle size under 50 nm?”, “is the released energy density over 15 J mL−1?”). The second type is leaf nodes at the end of the decision-making process, which offer the prediction of the label. To improve the predictive accuracy of decision trees, a boosted regression tree algorithm can be deployed, in which trees are trained in consecutive learning cycles.
In these cycles, the prediction of the trees built so far is compared with the measured data. Where the prediction misses the target, another tree is built on the remaining error, and this is repeated until the combined prediction and the measured values agree closely. With this, the final tree model can map, generalize and compare rules from an ensemble of specific and individual observations. A set of observations forms a dataset, which is described by two main attributes: features (inputs: parameters of sonication and particle properties) and labels (outputs: Z-average and PDI).
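The idea of building each new tree on the error left by the previous ones can be sketched in a few lines of plain Python. The example below is a deliberately simplified stand-in for gradient boosting with squared-error loss and depth-one trees ("stumps"); it is not the XGBoost implementation, which adds regularization and second-order gradient information. The feature tuples (sonicator flag, energy density) and the size labels are invented for illustration and are not measurements.

```python
def fit_stump(xs, residuals):
    """Return (feature index, threshold, left mean, right mean) of the best split."""
    best = None
    for j in range(len(xs[0])):
        values = sorted({x[j] for x in xs})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for x, r in zip(xs, residuals) if x[j] <= thr]
            right = [r for x, r in zip(xs, residuals) if x[j] > thr]
            l_mean = sum(left) / len(left)
            r_mean = sum(right) / len(right)
            sse = (sum((r - l_mean) ** 2 for r in left)
                   + sum((r - r_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, l_mean, r_mean)
    return best[1:]


def stump_predict(stump, x):
    j, thr, l_mean, r_mean = stump
    return l_mean if x[j] <= thr else r_mean


def boost(xs, ys, n_rounds, learning_rate=0.3):
    """Fit stumps one after another, each on the residuals of the current model."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Invented toy data: (is_probe, energy density in J/mL) -> size in nm
xs = [(0, 5.0), (0, 19.0), (0, 39.0), (0, 388.0),
      (1, 179.0), (1, 715.0), (1, 1609.0), (1, 3576.0)]
ys = [620.0, 560.0, 500.0, 410.0, 300.0, 220.0, 170.0, 140.0]

for rounds in (0, 1, 50):
    model = boost(xs, ys, n_rounds=rounds)
    print(f"{rounds:2d} rounds: training MSE = {mse(model, xs, ys):.2f}")
```

The training error shrinks as rounds are added, which is why a separate validation set is needed to decide when to stop.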
Supervising an ML model has three main phases: training, validation, and testing. During training, the ML model is given known features and corresponding labels to find a relationship between these two. In a certain number of training rounds, the predictive accuracy of the model is improved by altering the tree structure and the decision rules. To achieve this improvement, the predictive accuracy of the model has to be tested after every training step, which is called validation. Briefly, a part of the training data is withheld during training and used to test the success of the trained model. If the prediction for the validation data is incorrect, the model starts another learning round with an adjusted tree. In the testing phase, the ability to generalize and approximate these relationships is tested by quantifying the agreement between the ML prediction and the so-far unseen data. To train, validate, and test the model, feature values must be formatted. First, categorical feature values (for example, the sonicator type: horn vs. bath) were transformed into numerical values by one-hot encoding, using the scikit-learn library in Python.53 Encoding creates a ‘feature vector’ for each category of the parameter and fills it with either 1 or 0 to encode the presence or absence of the feature.51,54,55 Second, feature values were power-transformed to approximate a standard normal distribution.51,56,57 To define training and test data, we used random sampling and stratified splitting. Stratification was based on the average values of the labels (Z-average or PDI), and as a result of stratification, the training and test sets were balanced, in the sense that both were representative of the population of the observations we had at hand. Stratification forces the model to learn on the full range of label values, and thus it promotes higher prediction quality. The training set and test set were non-intersecting, that is, they had no common element.
This was important to prevent so-called data leakage, which is the simultaneous occurrence of data with identical feature–label combinations in both the training and the test set.51 Therefore, the experimental triplicates were never split: each triplicate went as a whole either into the training set or into the test set. We used 70% of the data in the training sets and 30% in the test sets. To train our ML model, the algorithm was presented with the features and the corresponding labels of the training set. During training, the model was repeatedly tested on a small set, which is referred to as the validation set. In our case, 20% of the training set was allocated to the validation set. The validation set was withheld from the actual training rounds, but it was presented to test the prediction quality after single training steps. This was necessary to tune the hyperparameters of the model.58 To optimize the values of the hyperparameters (ESI,† machine learning terms), we used the tree-structured Parzen estimator algorithm (maximum 200 trials) implemented in the Optuna library.59 At the end of supervision, the quality of learning, that is, the corresponding prediction accuracy, was evaluated on the test set.51 The quality of prediction was quantified by the R2 score, which is the coefficient of determination and quantifies the agreement between the experimentally measured values and the ML-predicted values. By definition R2 = explained variation/total variation; it takes the value 1 for perfect agreement and values near 0 when there is no agreement. For a linear least-squares relationship between measured and predicted values, R2 equals the square of the Pearson (linear) correlation coefficient. The structure of our ML approach is summarized in Fig. 2. The comparison between experimental values and values predicted by our ML model addressing the colloidal silica particles is shown in Fig. 3.
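Two of the data-preparation steps described above, one-hot encoding of the sonicator type and a split that never separates the replicates of one experiment, can be sketched in plain Python. This is a minimal illustration under stated assumptions and not the authors' pipeline: the study used scikit-learn and stratified splitting, whereas the sketch below omits stratification, and the helper names `one_hot` and `group_split` as well as the dataset layout (14 experiments in triplicate) are hypothetical.

```python
import random

SONICATOR_TYPES = ["bath", "probe"]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector with one slot per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def group_split(group_ids, test_fraction=0.3, seed=0):
    """Assign whole groups (e.g. triplicates) to either the training or the test set."""
    unique = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(test_fraction * len(unique)))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx


# Hypothetical layout: 14 experiments, each measured in triplicate
group_ids = [experiment for experiment in range(1, 15) for _ in range(3)]
train_idx, test_idx = group_split(group_ids)

train_groups = {group_ids[i] for i in train_idx}
test_groups = {group_ids[i] for i in test_idx}

print("probe encoded as", one_hot("probe", SONICATOR_TYPES))
print(len(train_idx), "training rows,", len(test_idx), "test rows")
print("shared experiments:", train_groups & test_groups)
```

Splitting by experiment rather than by row is what keeps near-identical replicates from appearing on both sides of the split.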
Fig. 3 Parity plots of log10 Z-average and log10 PDI values of the SiO2 ENPs synthesized, processed, and characterized in our lab. Data in gray indicate ten independent training and testing rounds, while the data in turquoise/red highlight the most successful training (88 data points) and testing (38 data points). The two models for Z-average and PDI show an R2 of 0.76 and 0.75, respectively. These values correspond to a linear correlation coefficient of about 0.87. The dashed black lines indicate perfect predictions (R2 = 1). A distribution of the R2 score for the test set of 100 newly randomly seeded and trained models can be found in Fig. SI7.†
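For readers who want to reproduce the two agreement measures mentioned in the caption, the Python sketch below computes the coefficient of determination, taken here as 1 − SS_res/SS_tot, and the Pearson correlation coefficient for a parity comparison. The log10 Z-average values are invented for illustration only and are not data from this study.

```python
import math


def r2_score(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def pearson_r(a, b):
    """Pearson (linear) correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Invented log10(Z-average) values, for illustration only
measured = [2.05, 2.15, 2.30, 2.51, 2.52, 2.90]
predicted = [2.09, 2.18, 2.25, 2.53, 2.45, 2.70]

r2 = r2_score(measured, predicted)
r = pearson_r(measured, predicted)
print(f"R2 = {r2:.3f}, r = {r:.3f}, r**2 = {r * r:.3f}")
```

When the predictions lie on the least-squares line through the measured values, R2 and the squared correlation coefficient agree; for biased predictions, R2 computed this way is the smaller of the two.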
To test the ability of our ML model to extrapolate, we predicted labels whose features were not from the interval of the training set. For this, we synthesized and characterized a new batch of particles (approx. 80 nm SiO2 ENP) and constructed a new experimental design (Table 2) with new parameter levels. Apart from one instance, the agreement between experimental and predicted triplicates is very good, with a 6% relative error on average, but the model struggles to accurately predict larger Z-average values. This is due, in part, to two factors. First, agglomerates are very heterogeneous in size, and the degree of heterogeneity scales with size. Therefore, the larger the mean aggregate size, the broader the size distribution, and thus the larger the expected noise.60 Second, the uncertainty of bath sonication is larger than that of probe sonication, and the noise in the corresponding data points is larger. This indicates some challenges in the reproducibility of bath sonication, likely due to variations in the experimental conditions, such as water and room temperature, relative humidity, bath volume and the vertical and horizontal position of the sonicated vessel.11,61
No. | Sample vol. (mL) | Particle conc. (mg mL−1) | Type | Amplitude (%) | Duration (min) | Energy density (J mL−1) | Predicted Z-ave (nm) | Measured Z-ave (nm) | Predicted PDI | Measured PDI |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 2 | Bath | 50 | 5 | 258 | 337 ± 67 | 324 ± 61 | 0.38 ± 0.12 | 0.30 ± 0.23 |
2 | 7.5 | 2 | Bath | 50 | 30 | 298 | 239 ± 22 | 796 ± 26 | 0.21 ± 0.10 | 0.50 ± 0.13 |
3 | 7.5 | 2 | Probe | 20 | 5 | 322 | 124 ± 17 | 141 ± 2 | 0.08 ± 0.04 | 0.10 ± 0.01 |
4 | 7.5 | 7.5 | Bath | 50 | 5 | 69 | 325 ± 40 | 331 ± 33 | 0.44 ± 0.25 | 0.31 ± 0.18 |
5 | 2 | 7.5 | Probe | 20 | 30 | 3878 | 150 ± 39 | 112 ± 5 | 0.12 ± 0.11 | 0.05 ± 0.01 |
After successfully constructing the ML model and predicting the outcome of ultrasonicating the silica particles, in the second part of this study we applied our ML approach to the meta-analysis of published, peer-reviewed work on the sonication and DLS characterization of oxide particles, such as ZnO, CeO2 and TiO2. Following the guidelines of Field and Gillett,62 we compiled a set of 203 data points collected from 12 peer-reviewed articles.12,30,31,43,63–70 Articles relevant to the project were searched online, using combinations of the keywords “ultrasonication”, “nanoparticles”, and “oxide”. Compared to the laboratory study, the number of features (Table 1) could be increased by adding particle properties such as zeta-potential in water, surface hydrophobicity/hydrophilicity, and isoelectric point. Owing to the larger dataset, the ML model performed better (Fig. 4), while the structure of supervising the machine learning algorithm was very similar to that of the lab-based study.
Fig. 4 Parity plots of log10 Z-average and log10 PDI values of oxide ENPs synthesized, processed and characterized elsewhere (meta-analysis). Data in gray color indicate the progress of ten independent training and testing rounds, and data in turquoise/red color show the best models. We had a total of 383 data points for training and 289 data points for testing, and compared to the lab-based ML analysis, we achieved better performance (R2 = 0.82 and 0.84 for the Z-average and PDI, which correspond to a linear correlation coefficient higher than 0.9). The dashed black lines indicate perfect predictions (R2 = 1). A distribution of the R2 score for the test set of 100 newly randomly seeded and trained models can be found in Fig. SI7.†
Nevertheless, similar to the lab-based study, the accuracy also decreases at large Z-average values and for bath-sonicated samples. As the final evaluation of the performance of our ML model, we tested two commercially available ENPs synthesized on a large scale (Aeroxide TiO2 P 25 and Aerosil SiO2 200, both by Evonik Operations GmbH), following the experimental design detailed in Table 2. The values predicted by the ML model and the values measured by DLS are listed in Table 3.
No. | Aeroxide (TiO2): predicted size (nm) | Aeroxide (TiO2): measured size (nm) | Aeroxide (TiO2): predicted PDI | Aeroxide (TiO2): measured PDI | Aerosil (SiO2): predicted size (nm) | Aerosil (SiO2): measured size (nm) | Aerosil (SiO2): predicted PDI | Aerosil (SiO2): measured PDI |
---|---|---|---|---|---|---|---|---|
1 | 289 ± 14 | 302 ± 15 | 0.40 ± 0.09 | 0.42 ± 0.08 | 501 ± 41 | 476 ± 51 | 0.71 ± 0.34 | 0.64 ± 0.24 |
2 | 261 ± 20 | 922 ± 63 | 0.23 ± 0.17 | 0.61 ± 0.17 | 171 ± 53 | 309 ± 39 | 0.24 ± 0.08 | 0.57 ± 0.15 |
3 | 371 ± 29 | 352 ± 13 | 0.34 ± 0.10 | 0.38 ± 0.07 | 181 ± 17 | 173 ± 8 | 0.09 ± 0.08 | 0.13 ± 0.02 |
4 | 275 ± 36 | 285 ± 51 | 0.38 ± 0.09 | 0.40 ± 0.09 | 339 ± 87 | 306 ± 8 | 0.66 ± 0.22 | 0.57 ± 0.19 |
5 | 143 ± 12 | 144 ± 26 | 0.22 ± 0.09 | 0.19 ± 0.07 | 114 ± 17 | 123 ± 2 | 0.29 ± 0.10 | 0.34 ± 0.02 |
Next, we were interested in identifying the relative importance of the individual ultrasonication parameters and particle properties we used as inputs for the outcome of the predictions. With this, we were also able to see which parameters are the most decisive, in the eyes of our model, in the process of particle ultrasonication, giving us insights into the underlying physical process of ultrasonication. To obtain these insights, we performed a so-called feature importance analysis (FIA). Feature importance analysis strongly supports the interpretation of ML predictions.71 FIA, in essence, assigns a score to each feature used in the ML model, based on its relative importance in predicting the label values. The higher the score, the greater the influence of the feature. Hence, if one wants to optimize ultrasonication experiments in the future, it will be time-saving to start by tweaking the most influential parameter and then to follow the importance hierarchy. Our FIA is based on the so-called Shapley values.72 Shapley values originate from cooperative game theory, and loosely speaking, they are intended to establish a basis of merit-based payoff, by quantifying the marginal contributions of the players of a team in a given game and the associated reward.73 In a cooperative game, the success of each player depends not only on what they do, but also on how the players cooperate. Accordingly, the most useful team player gets the highest share of the reward. Calculating Shapley values is computationally very intensive, and thus we computed them with the method introduced by Lundberg and Lee (SHapley Additive exPlanations, SHAP).74 According to the SHAP values, in our ML model trained on the silica ENPs synthesized and processed in our laboratory, the sonicator type has the highest influence on the deagglomeration success and the obtained PDI (Fig. 5a and b).
This most likely reflects the fact that the available range of sonication energy depends on the type of sonication. The comparison of SHAP values between the models for PDI and Z-average shows only a marginal difference. In the meta-analysis, it is interesting to see that in the ML model, the isoelectric point, the zeta-potential, and the surface coating have only a low influence on the ultrasonication process; their cumulative share is less than 15% (Fig. 5c and d). Apart from the energy-related quantities, the material type and particle size are, however, important features of the model. Their influence may be interpreted by using colloidal science: the amplitude of van der Waals forces, which bind the particle agglomerates, depends on particle size and particle material,75,76 and thus the model picks up their important role in the binding energy of the particle agglomerates and their influence on cluster deagglomeration.
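The cooperative-game idea behind Shapley values can be made concrete with a brute-force calculation on a toy model. The Python sketch below enumerates all feature subsets and averages the marginal contribution of each feature; it is not the SHAP library and not the tree-specific algorithm of Lundberg and Lee, and it only scales to a handful of features. The toy model, its coefficients, and the convention of replacing "absent" features by a baseline value are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature for the prediction model(x).

    Features outside a coalition are set to their baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_model(features):
    """Invented size model: sonicator flag, log10 energy density, primary size."""
    is_probe, log_energy, size_nm = features
    return (600.0 - 250.0 * is_probe
            - 40.0 * log_energy * (1 + is_probe) + 0.5 * size_nm)


x = [1, 3.0, 70.0]         # sample to explain (hypothetical)
baseline = [0, 1.0, 40.0]  # reference sample (hypothetical)

phi = shapley_values(toy_model, x, baseline)
for name, contribution in zip(["sonicator type", "log energy", "size"], phi):
    print(f"{name}: {contribution:+.2f}")
print("sum of contributions:", round(sum(phi), 6))
print("prediction difference:", toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the explained prediction and the baseline prediction, which is the property that makes Shapley-based importance rankings additive.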
Last but not least, we acknowledge that our study has limits, which point to potential subjects for future studies. First, while DLS is one of the most frequently used methods in the characterization of particle dispersions, it is an ensemble technique that is very sensitive to outliers, which may lead to bias, and thus requires carefully prepared and reproducible samples. Therefore, while DLS has its own merits, other in situ, and perhaps more robust, characterization techniques (such as particle tracking analysis, Taylor dispersion analysis, small-angle scattering and diffraction methods) may serve the purpose equally well, if not better. Our choice of experimental characterization technique was due to the widespread use and accessibility of DLS, and therefore the largest number of published data points. Second, our ML model was developed on so-called horn and bath ultrasonicators, but cup horn sonication, also a frequently used device type, is not addressed in this study due to the lack of published data. Third, sonication may benefit from the use of dispersing agents, but we do not address their presence and role here. Fourth, in this study, we concentrated on a given class of ENPs with somewhat similar physicochemical properties, but other highly relevant materials, such as iron oxides, aluminum oxides, quantum dots, carbon nanotubes, or even particle mixtures were not addressed. Additionally, the model might show lower predictability for data points outside the range the model was trained on, e.g. micrometer-sized particles or different particle morphologies such as sheets or wires. Fifth, while our analyses are sound, we are able to offer only a hierarchy of importance and degree of association of the features to interpret the prediction of our ML model. Therefore, a detailed mechanistic understanding of the model and a causal inference are missing.
Describing the ML model in quantitative detail in terms of cause and effect (by, for example, closed-form analytic algebraic expressions) is beyond our current capacity. Explainable and fully transparent ML is an active field of debate,77–81 which, however, concerns not only our work, but any ML model for which the available information (data) is limited.
As a final note and outlook, we co-published a web-based application with a graphical user interface (https://sonipredict.herokuapp.com/) where we offer quantitative guidance for designing sonication processes. While the application in its current form is based on the ML approach presented here, by creating a larger data bank we hope to extend the model to cover an increasing number of parameters and experimental scenarios of greater complexity, such as different dispersants and material types and sizes. The ML model can be greatly improved by incorporating new data, as any ML model learns best on data of high quality and high volume, and we therefore call on the community to support us and send their own results on ultrasonicated nanoparticle dispersions. To collect more data, we also co-published a new database (https://tineglaubitz-sonidb-app-8z3bkw.streamlitapp.com/) for researchers to send in their data points. With a collective effort we aim at improving ML analyses and promoting reproducibility in this impactful field.
DLS | Dynamic light scattering |
ENP | Engineered nanoparticle |
GBDT | Gradient boosted decision tree |
ML | Machine learning |
PDI | Polydispersity index |
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2nr03240f
This journal is © The Royal Society of Chemistry 2022