Open Access Article
Takumi Onoa,
Tarojiro Matsumurab,
Kiwamu Suea and
Satoru Takeshita
*a
aResearch Institute for Chemical Process Technology, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba Central 5, 1-1-1 Higashi, 3058565 Tsukuba, Japan. E-mail: s.takeshita@aist.go.jp
bMaterials DX Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba Central 2, 1-1-1 Umezono, 3058568 Tsukuba, Japan
First published on 9th March 2026
Wet-state ball milling of ceramic nanoparticles is analyzed by machine learning and machine-learning-assisted model formulation. A linear model formula is constructed from the high-impact input features revealed in the machine learning. The formula explains the relation between the ball-milling conditions and hydrodynamic size with less precision but better analytical processability compared to the original machine learning.
New conceptsThe current work provides a pioneering study on how to adapt machine learning to wet-state ball milling of ceramic nanoparticles. While machine learning has been becoming commoditized in chemistry fields, its less analytical nature has remained as a major issue, particularly in unsophisticated, industry-oriented processes including ball milling and nanoparticle dispersion–aggregation control. Several attempts have recently been reported regarding the use of machine learning in ball milling, but they have been limited in the range of simple regression or classification without analytical insights. We would like to take these attempts one step forward: utilizing machine learning to assist analytical model formulation. A linear formula is constructed based on the importance of the input ball-milling parameters as revealed by machine learning. The obtained formula explains the relation between the ball-milling conditions and hydrodynamic size with better analytical processability compared to the original machine learning. The current approach will open up a new methodology of connecting data-driven approaches and analytical modeling for the potential production of a wide range of ceramic devices. |
Ball milling is a popular process in solid device fabrication in both industry and academia. Impulsive forces of impact with balls are utilized for many purposes: physical mixing, physical pulverization and downsizing,6 disintegration of aggregates/agglomerates,7 mechanical alloying,8 and various mechanochemical reactions.9 In the production of ceramic devices, wet-state ball milling is commonly used to prepare well-dispersed colloids of ingredient nanoparticles before molding, coating, or sintering treatments.10 Several theoretical models have been established for aggregation–dispersion of colloidal nanoparticles focusing on aqueous systems with electrostatic repulsion, e.g., DLVO theory.11 On the other hand, nonaqueous systems with steric repulsion, which are typical in practical ball-milling processes, still lack general theoretical models. The wide variety of industrial processes also makes general modeling difficult, where we suggest that a data-driven approach has certain room for contribution.
In fact, several reports have applied machine learning and statistical methods to ball milling12 and aggregation–dispersion of particles,13 but they have been limited to simple regression or classification without analytical insights. In this communication, we would like to contribute a pioneering study on how to adapt machine learning to hydrodynamic size data of ball-milled ceramic nanoparticles. To take this work one step forward from a simple case study, we suggest an additional role of machine learning as an assistant in instant model formulation. Our approach consists of three simple steps: First, machine-learning regression is employed on the hydrodynamic size dataset. Second, some data-processing tools visualize the influence of input variables on the machine-learning prediction. Third, a linear formula is constructed based on the important inputs revealed in the data processing. The comparison between machine learning and formulation reveals their characteristics and applicability.
The output targets were the particle sizes of ball-milled dispersions measured by dynamic light scattering (DLS).16 Two types of hydrodynamic size were used as machine-learning targets (see Fig. 1, bottom and Fig. S1 for details): (i) Cumulant diameter, a typical size from DLS under an assumption of monomodal normal size distribution.17 Samples having multimodality tend to show large polydispersity indices. (ii) Median diameter, d50, from integrated size distribution profiles obtained by the histogram method.18 This method reflects multimodality but is more sensitive to noise. The distribution width of the histogram was evaluated using SPAN, (d90 − d10)/d50. Different bases were used for the vertical axis: scattering intensity-, volume-, and number-bases (see Fig. S2 for a typical example of an intensity-based profile).
We used random forest for machine-learning regression of the output hydrodynamic sizes.19 The obtained results targeting the cumulant diameter and polydispersity index (Fig. 2a) show the best predictions (R2 = 0.913 and 0.841, respectively, Table S3 in the SI) compared to those on d50 and SPANs (Fig. 2b). Predictions of volume- and number-based d50 are particularly not successful mainly because of large noise in the <∼1–10 nm region. We therefore focus on the cumulant diameter and intensity-based d50 for further analysis. As a supplementary calculation, we also conducted a different machine-learning method, XGBoost, to compare with the random forest prediction (details in Section S2 and Fig. S9 in the SI).
![]() | ||
| Fig. 2 Machine-learning regressions and their Gini importances using random forest targeting on (a) cumulant diameter and polydispersity index and (b) d50 and SPAN by the histogram method. | ||
Random forest has the advantage of being easy to connect with data processing modules that can interpret feature importances: the influences of input features on prediction precision (details and additional evaluations are in Section S2 in the SI). According to Gini importances in Fig. 2, milling ball size, dispersant mass, milling ball mass, solvent type (represented by its viscosity in the calculation), and total milling time have certain influences on the predicted hydrodynamic sizes and polydispersity/SPANs, whereas the nominal BaTiO3 particle mass does not have a large influence on these items.
For more interpretability, SHapley Additive exPlanations (SHAP)20 was employed on the cumulant diameter prediction (Fig. 3). This method visualizes both importances and their directions of input features. The horizontal axis in SHAP corresponds to the impact of input items on target values (negative-to-positive from left-to-right) and the plot color of red/blue corresponds to large/small input values. In the present case, high-impact items in SHAP are roughly consistent with those in Gini importance. Milling ball size is the most important with red plots on the left, meaning that large balls decrease the predicted cumulant diameter. Dispersant mass has a complex influence. Blue plots on the left show that a small dispersant amount makes the cumulant diameter small, whereas its influence is not fully monotonic as shown in the red/blue mixture on the right. Total milling time is roughly monotonic. A certain time is needed to make the diameter small, but overtreatment makes it large probably by breaking the crystals to form aggregates again (this will be discussed with XRD data later). Milling ball mass is not very straightforward with a weak negative tendency; a large mass makes the diameter small. Solvent viscosity has a monotonic influence; small-viscosity solvents make the diameter large. BaTiO3 mass and sampling-to-measurement time have complex and relatively small influences compared to the above items.
Even though the above importance-oriented data processing can increase the interpretability, the weak point of machine learning remains; regressions are nonlinear, less analytical, and hard to use to make a single approximation formula explaining dependences of input features. We suggest constructing a linear formula based on the feature importances of machine learning. The high-impact features are picked up in the order they appear in the SHAP, i.e., milling ball size, dispersant mass, … and simply multiplied in the following eqn (1) with exponent constants:
![]() | (1) |
are the milling ball size, dispersant mass, total milling time, milling ball mass, solvent viscosity, BaTiO3 mass, and sampling-to-measurement time, respectively. Superscript * means that each item is unitless through dividing by the respective unit, d0 is a constant, a–g are the exponent constants, and ϕ is the predicted cumulant diameter by this formulation. The equation is converted into a linear polynomial for simplification:
![]() | (2) |
, the second one adds
to the first one,…, and displayed in Fig. 4 with prediction R2 values to the actual cumulant diameter. The first approximation cannot fit the actual diameter at all with R2 = 0.196. The fitting gradually improves from the second to third approximations with increasing R2 from 0.529 to 0.592. Then, the fourth and fifth approximations increase the fitting up to R2 = 0.709. After that, the sixth and seventh approximations do not drastically improve the fitting. Thus, the cumulant diameter can be predicted roughly by a linear connection of the 1st–3rd important variables then more precisely by that of the 1st–5th ones. Fig. 4 also shows a tendency of polydispersity index (the plot color of prediction–actual diameter diagrams). The yellowish plots around the upper right in the diagrams indicate that some conditions that give smaller predicted diameters have large polydispersity.
![]() | ||
| Fig. 4 Linear model formulation of cumulant diameter prediction using high-impact features in SHAP analysis. | ||
For further comparison, we also tried screening 35 different combinations of 3 input variables without considering the feature importance order, e.g., 1st + 2nd + 7th terms. As shown in Fig. S10, the R2 values varied from ∼0.6 to ∼0.08 depending on the combinations. While the original combination in Fig. 4, 1st + 2nd + 3rd terms (R2 = 0.592), is not the best fitting, it still ranks among the top three with an identical R2 range of ∼0.6. This result means that the feature-importance-assisted modeling is a fast and convenient way to avoid time-consuming comprehensive trials and errors of input variable selections.
The exponent constants, a–g (the bar graphs in Fig. 4), show their influence directions in the formula; those of the 1st–5th terms are negative, positive, negative, negative, and negative. The SHAP shows the same trend, i.e., increases in milling ball size (1st), total milling time (3rd), milling ball mass (4th), and solvent viscosity (5th) roughly decrease the cumulant diameter, while that in dispersant mass (2nd) roughly increases the diameter. A decisive characteristic of the model formula is its linearity. The input features in the random forest have certain degrees of correlations (Fig. S7), e.g., solvent viscosity and dispersant mass have a correlation coefficient of −0.4. On the other hand, the formula assumes perfectly independent input variables as it is a simple linear polynomial. This nature would decrease the prediction precision of the formula (R2 = 0.711 at the maximum) compared to the original random forest (R2 = 0.913). Also, each term in the formula describes only a monotonic increase or decrease. This item would not be a drastic disadvantage in the current case, since the 1st–5th items in the SHAP show roughly monotonic behaviors.
To discuss the applicability of the linear model formulation not only for the current ball-milling system but also for other potential systems, we note its advantages and limitations. First, a linear formula would be useful for analytical investigation, e.g., grasping a rough quantitative influence of each input variable with the exponent constant. This would be beneficial for on-site modulation/tuning of input conditions in actual lab/industrial processes. In fact, several previous studies, including ball milling of alumina particles,21 have already used linear formulae to describe industrial-oriented phenomena. The current methodology would help construct such linear models without comprehensive trials and errors of input variable selections.
On the other hand, the linear formula has two drawbacks: the absence of physical meaning and less precision due to the lack of nonlinear components. Regarding the former item, the formula reflects the correlation of inputs–outputs in the dataset, but does not provide causation or explanation for physical phenomena behind the data. For example, a theoretical description of ball milling needs more physicochemical information, such as microscopic solvent–dispersant–nanoparticle interactions, specific motion trajectory of the balls, and time dependence of rheological indices. Besides, industrial phenomena often lack physical models. Researchers usually start with a linear model in such cases thanks to the simplicity of linear formulae. Although the model does not contain physical meanings, it could be an initial model to derive an empirical formula for the phenomenon.
Regarding the latter item, the current machine-learning-assisted methodology is not limited to linear formulation. After understanding input-item dependences with a linear formula, we can then expand/modify the formula to make it nonlinear. Actually, we already started a preliminary trial for nonlinear modeling (Fig. S11), where the fitting precision was slightly better than that in the linear formula. Since the possible forms of nonlinear formulae have infinite variations, we suggest that a meaningful nonlinear model should be accompanied by some physical justification based on the above-mentioned physicochemical information.
For the final section, we also point out more practical items regarding the ball-milling process. Impact with balls can disintegrate the aggregates but also introduce some defects into the BaTiO3 crystals. We measured the crystallite size of (110) as an indicator of crystallinity change through ball milling. Visualization of the crystallite size in the machine-learning prediction diagram (Fig. S14 left) unveils that the samples with both smaller and larger cumulant diameters tend to show smaller crystallites. The former means that certain damage occurs in the crystals as a result of the ball milling. The latter indicates that the damaged particles would make aggregates again through overtreatment. In the next practical step, we need an optimization of the particle size and crystallinity for subsequent processes, e.g., forming, coating, and/or sintering, to design the production process of the desired ceramic devices.
As concluding remarks, we suggest two potential future directions and their challenges toward a wide range of industrial issues: (i) machine-learning-assisted empirical formulae, typically starting with a linear formula but not limited to this, for various industrial phenomena. This direction would require microscopic physicochemical analyses to connect the formulae with actual physical phenomena. (ii) Closed-loop frameworks of experiments and computational predictions for faster discoveries of materials/processes. This direction would require automated experimental/characterization systems, which are becoming a hot topic in materials chemistry.22 For both directions, we should continue case studies connecting machine learning and analytical formulation to expand the applicable range of the data-driven approach toward a wide range of industrial issues.
| This journal is © The Royal Society of Chemistry 2026 |