Loïc Chagot‡*a, César Quilodrán-Casas‡*bc, Maria Kalli‡*a, Nina M. Kovalchuk e, Mark J. H. Simmons e, Omar K. Matar d, Rossella Arcucci bc and Panagiota Angeli a
a ThAMeS Multiphase, Department of Chemical Engineering, University College London, UK. E-mail: l.chagot@ucl.ac.uk; maria.kalli.14@ucl.ac.uk
b Data Science Institute, Imperial College London, UK. E-mail: c.quilodran@imperial.ac.uk
c Department of Earth Science and Engineering, Imperial College London, UK
d Department of Chemical Engineering, Imperial College London, UK
e School of Chemical Engineering, University of Birmingham, UK
First published on 6th September 2022
The control of droplet formation and size using microfluidic devices is a critical operation for both laboratory and industrial applications, e.g. in micro-dosage. Surfactants can be added to improve the stability and control the size of the droplets by modifying their interfacial properties. In this study, a large-scale data set of droplet sizes was obtained from high-speed imaging experiments conducted in a flow-focusing microchannel where aqueous surfactant-laden droplets were generated in silicone oil. Three types of surfactants were used (anionic, cationic and non-ionic) at concentrations below and above the critical micelle concentration (CMC). To predict the final droplet size as a function of flow rates, surfactant type and surfactant concentration, two data-driven models were built. Using a Bayesian regularised artificial neural network and XGBoost, these models were initially based on four inputs (flow rates of the two phases, interfacial tension at equilibrium and the normalised surfactant concentration). The mean absolute percentage errors (MAPE) show that the data-driven models are more accurate (MAPE = 3.9%) than the semi-empirical models (MAPE = 11.4%). To overcome experimental difficulties in acquiring accurate interfacial tension values under some conditions, both models were also trained with reduced inputs by removing the interfacial tension. The results again show a very good prediction of the droplet diameter. Finally, over 10000 synthetic data were generated, based on the initial data set, with a Variational Autoencoder (VAE). The high fidelity of the extended synthetic data set highlights that this method can be a quick and low-cost alternative to study microdroplet formation in future lab-on-a-chip applications, where experimental data may not be readily available.
Nowadays, improvements in imaging techniques and in microfluidic devices enable the collection of high-quality data to estimate the droplet parameters for various configurations.10–12 Roumpea et al.13 used a two-colour Particle Image Velocimetry setup to study the effect of surfactants during droplet formation in a flow-focusing microchannel. Recently, Kiratzis et al.14 studied the effect of surfactant addition in the aqueous dispersed phase during droplet generation using Ghost Particle Velocimetry (GPV). Such studies enable a better understanding of surfactant transfer and adsorption. The use of high-speed cameras with improved spatial and time resolution has led to large data collections and to the development of semi-empirical models. These models are based on physical parameters (e.g. capillary number, flow rates, channel size) and provide new data that can be used as droplet predictors (droplet size, formation time).15–17 Furthermore, improvements in algorithms and computational capacity now enable numerical simulations of drop formation inside microchannels in complex configurations. Kahouadji et al.18 presented a very accurate prediction of surfactant-free droplet formation in complex microchannel geometries using a front-tracking scheme. Using Lattice Boltzmann simulations, Riaud et al.19 studied the formation of water droplets inside octane with Span 80 and showed the non-uniformity of the interfacial distribution of surfactant.
This ease of accessing good experimental and/or numerical data coincides with the emergence of data-driven models for microfluidics. Mahdi and Daoud20 used artificial neural networks to predict the droplet size of a water-in-oil micro-emulsion. Khor et al.21 used the same method to predict the emulsion stability inside a microchannel. Deep neural networks were used to predict the flow rate or the concentration of isopropanol for different formation patterns of water–isopropanol droplets in silicone oil.22 Recently, machine learning was used to predict the performance of flow-focusing droplet generators.23
Whilst the aforementioned papers had access to a large data set of experimental data, it is crucial to perform data augmentation for small experimental data sets. The generation of high-quality synthetic data allows the augmentation of small-sample data sets,24,25 such as in healthcare to generate high-fidelity synthetic patient data.26 Though the use of synthetic data needs to be carefully developed and adapted for each case,27 it can be a powerful tool to increase the robustness and adaptability of data-driven models.28,29
This paper presents both physics-based and data-driven approaches to predict droplet size inside a flow-focusing microchannel in the presence of surfactants. A data set of 476 measurements was collected for different phase flow rates and surfactant types (anionic, cationic and non-ionic) at several concentrations below and above the critical micelle concentration (CMC). An adaptation of the recent model of Kalli et al.17 was used as a benchmark, considering that this type of empirical model remains the principal approach for droplet size prediction in microfluidic configurations. For the first time, two predictive models for the surfactant-laden droplet size were built using a Bayesian regularised artificial neural network (BRANN) and XGBoost regression. Additionally, a Variational Autoencoder (VAE) was used to generate synthetic data to enlarge the experimental data set, to access an unexplored part of the parameter space, and to decrease the uncertainty in the droplet size estimation. Finally, to quantify the effect of the data set size on the final uncertainty of the prediction, a convergence analysis was performed. This aims to highlight the sensitivity of this type of prediction model to the data and will help optimise further studies related to machine learning and microfluidics.
63 measurements at different flow rates were obtained without surfactant. To observe the impact of surfactant composition on the droplet diameter, 5 different surfactants were dissolved in the aqueous phase. They can be categorised into three groups: anionic, cationic and non-ionic surfactants (see Table 1). A total of 468 different measurements (392 with surfactant) were obtained at various flow rates (Qc ∈ [0.012, 0.4] mL min−1 and Qd ∈ [0.001, 0.10] mL min−1), at 34 different concentrations below the CMC and 43 above the CMC. As mentioned previously, this study focuses on spherical droplets, and all droplet diameters d are below the channel depth D (d < 195 μm). The full list of experimental parameters used in this paper is available in the Zenodo open data file.
Name | Type | ϕCMC (mM) | ϕ/ϕCMC | Mw (g mol−1) | Qd (mL min−1) | Qc (mL min−1) | γ (mN m−1) | Number of data |
---|---|---|---|---|---|---|---|---|
di-BC9SG | Anionic surfactant | 4.3 | [1.0…50.0] | 486.00 | [0.003…0.04] | [0.012…0.2] | [1.4…4.2] | 16 |
SDS | Anionic surfactant | 11.0 | [0.2…5.0] | 288.38 | [0.003…0.06] | [0.040…0.4] | [10.0…18.0] | 178 |
C12TAB | Cationic surfactant | 20.0 | [0.3…7.5] | 308.34 | [0.001…0.06] | [0.040…0.4] | [10.0…20.0] | 94 |
C16TAB | Cationic surfactant | 2.0 | [0.2…2.5] | 364.45 | [0.001…0.04] | [0.040…0.2] | [7.3…20.0] | 30 |
TX100 | Non-ionic surfactant | 3.5 | [1.0…8.6] | 646.85 | [0.010…0.02] | [0.040…0.4] | [2.8…8.7] | 87 |
No surfactant | — | — | 0 | — | [0.001…0.10] | [0.080…0.4] | 32 | 63 |
ϕ is the surfactant concentration; ϕCMC, the critical micelle concentration; Mw, the molar mass; Qd and Qc, the dispersed and continuous flow rates; γ, the equilibrium interfacial tension. Full names of the surfactants: sodium bis(2,6-dimethyl-4-heptyl)-2-sulfoglutarate (di-BC9SG), sodium dodecylsulfate (SDS), dodecyltrimethylammonium bromide (C12TAB), hexadecyltrimethylammonium bromide (C16TAB) and Triton X-100 (TX100).
All images were taken with a 12-bit high-speed camera (Phantom v1212 with a 1280 × 800 pixel resolution (UCL) and Photron SA5 with a 1024 × 1024 pixel resolution (UoB)), both equipped with a Navitar 12× zoom lens. An LED backlight system ensured a homogeneous illumination of the main channel (see Fig. 1b) and, by minimising heating, did not affect the properties of the fluid. Due to the oval geometry of the channel, it is possible to accurately position the focal plane at the centreline of the channel, where the optical system gives the sharpest image of the channel walls. The droplet size was measured directly on the 2D images using ImageJ and MATLAB (see ESI†). A minimum of 15 droplets was used to calculate the average size for each case, with a droplet size polydispersity of <3%. According to Christopher and Anna,31 this is considered extremely accurate for microfluidic experiments. The spatial error is 3 μm per pixel (2.5% of the smallest drop diameter).
Two predictive models were developed, using different numbers of features to predict the droplet size. The data can be split into the predictive features X (predictors) and the target y (droplet diameter). Two regressors were trained to predict the droplet diameter, ŷ = f(X), where f is a Bayesian regularised neural network or an XGBoost regressor. The predictions are then compared to the holdout test data set of real experimental data.
$F = \beta E_D + \alpha E_W$, | (1) |
(2) |
Finally, another advantage of BRANN is that the model is robust and a separate validation process, as required for standard back-propagation networks, is unnecessary,37 which saves data for the training and test processes.
The simulation of the neural network model was performed on the MATLAB Statistics and Machine Learning Toolbox.
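As an illustration of the regularised objective in eqn (1), the following minimal Python sketch evaluates F for a handful of values. It is not the MATLAB implementation used here: it assumes the usual Bayesian regularisation convention in which E_D is the sum of squared prediction errors and E_W is the sum of squared network weights, and the function name and numbers are hypothetical.

```python
def regularised_objective(y_true, y_pred, weights, alpha, beta):
    """Return F = beta * E_D + alpha * E_W, following eqn (1)."""
    # Assumption: E_D is the sum of squared prediction errors.
    e_d = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Assumption: E_W is the sum of squared network weights.
    e_w = sum(w ** 2 for w in weights)
    return beta * e_d + alpha * e_w


# Illustrative numbers only (not taken from the experiments).
F = regularised_objective(
    y_true=[0.50, 0.62],
    y_pred=[0.48, 0.65],
    weights=[0.3, -1.2, 0.7],
    alpha=0.01,
    beta=1.0,
)
```

A larger α penalises large weights more strongly and therefore favours smoother networks, whereas a larger β puts more emphasis on fitting the data.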
The objective function is the sum of a loss function, evaluated across all predictions, and a regularisation function for all j predictors. The prediction of the jth tree is defined as:
(3) |
For regression problems, like our case, XGBoost uses the mean squared error (MSE) as a performance metric. The XGBoost regressor was implemented in Python using the xgboost package.
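The additive-tree idea behind boosting can be sketched in a few lines of plain Python. This is not XGBoost itself: the sketch omits the regularisation term and the second-order approximation, uses depth-1 trees on a single input, and all names and data are hypothetical. It only shows that, for a squared-error loss, each new tree is fitted to the residuals of the current ensemble and added with a learning rate.

```python
def fit_stump(x, residuals):
    """Fit a depth-1 regression tree (one split) by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = sum((r - mean_l) ** 2 for r in left)
        sse += sum((r - mean_r) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_l, mean_r)
    _, threshold, mean_l, mean_r = best
    return lambda xi: mean_l if xi <= threshold else mean_r


def boost(x, y, n_trees=100, learning_rate=0.3):
    """Additive model: mean of y plus shrunken stump predictions."""
    base = sum(y) / len(y)
    trees = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        # For a squared-error loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(t(xi) for t in trees)


# Toy data (hypothetical): droplet size decreasing with Qc.
q_c = [0.04, 0.08, 0.12, 0.20, 0.30, 0.40]
d_over_D = [0.90, 0.75, 0.62, 0.50, 0.41, 0.36]
model = boost(q_c, d_over_D)
mse = sum((model(x) - y) ** 2 for x, y in zip(q_c, d_over_D)) / len(q_c)
```

The defaults of 100 trees and a learning rate of 0.3 mirror the XGBoost hyperparameters listed in this paper; the toy data do not.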
Let q(z|x) and p(x̂|z) be the encoding and decoding distributions of the encoder and decoder networks, respectively. Here, x is the vector of experimental data, z is the latent vector and x̂ is the decoder output. As suggested by Makhzani et al.,41 a Gaussian posterior can be used, assuming that q(z|x) is a Gaussian distribution whose mean and variance are predicted by the encoder: z ∼ 𝒩(μ(x), σ(x)). This is achieved by adding two dense layers, one for the means μ and one for log σ, to the final layer of the encoder, which returns z as a vector of samples. To ensure that z ∼ q(z) = 𝒩(μ, σ²), the aggregated posterior, the reparameterisation trick described by Kingma and Welling40 was used for backpropagation through the network: z = μ + σ ⊙ ε, where ε is an auxiliary noise variable, ε ∼ 𝒩(0, I).
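The reparameterisation trick can be illustrated with a short Python sketch. The function name and the numerical values of μ and log σ are hypothetical stand-ins for the encoder outputs; only the sampling rule z = μ + σ ⊙ ε comes from the text.

```python
import math
import random


def reparameterise(mu, log_sigma, rng):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(ls) * e for m, ls, e in zip(mu, log_sigma, eps)]


rng = random.Random(42)
mu = [0.5, -1.0, 0.0, 2.0]           # hypothetical encoder means
log_sigma = [-1.0, -0.5, 0.0, -2.0]  # hypothetical encoder log(sigma) outputs
z = reparameterise(mu, log_sigma, rng)
```

Because the randomness is carried entirely by ε, z remains a differentiable function of μ and log σ, which is what allows gradients to flow back through the sampling step.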
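The Kullback–Leibler (KL) divergence loss quantifies how much the probability distribution a(x) differs from the probability distribution b(x):
$\mathcal{L}_{KL}(a \,\|\, b) = \sum_x a(x)\,\log\frac{a(x)}{b(x)}$ | (4) |
$\mathcal{L}_{mse} = \operatorname{argmin}\,\lVert \hat{x} - x \rVert^2$ | (5) |
where x̂ is the reconstruction of x returned by the decoder.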
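To make the two loss terms concrete, the sketch below evaluates them for a diagonal Gaussian posterior. It assumes a standard normal prior, for which the KL term has the well-known closed form −½ Σ(1 + log σ² − μ² − σ²), and it assumes that the two terms are simply added without weighting; neither choice is stated in the text, and the function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(
        1.0 + 2.0 * ls - m ** 2 - math.exp(2.0 * ls)
        for m, ls in zip(mu, log_sigma)
    )


def mse(x, x_hat):
    """Mean squared reconstruction error between x and x_hat."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def vae_loss(x, x_hat, mu, log_sigma):
    # Assumption: unweighted sum of reconstruction and KL terms.
    return mse(x, x_hat) + kl_to_standard_normal(mu, log_sigma)


# Hypothetical normalised sample, reconstruction and encoder outputs.
loss = vae_loss(
    x=[0.2, 0.4, 0.6, 0.8],
    x_hat=[0.25, 0.35, 0.6, 0.8],
    mu=[0.1, -0.1, 0.0, 0.0],
    log_sigma=[0.0, 0.0, 0.0, 0.0],
)
```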
Xu et al.15 studied squeezing and dripping regimes in a T-junction and argued that the equilibrium between the shear forces from the continuous flow and the inertial force plays an important role in the drop formation process. The authors assume that the droplet size should be predicted by the generic equation:
(6) |
(7) |
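Using the same flow-focusing microchannel, eqn (7) was applied to the present data set. Fig. 2 compares the experimental test data with those calculated from the model using eqn (7), showing a mean absolute percentage error (or MAPE) of 11.4%. The MAPE is defined by:
$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$ | (8) |
where y_i is the measured value, ŷ_i the predicted value and n the number of points in the test data set.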
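A minimal Python sketch of the MAPE, using the standard definition of this metric, is given below. The function name and the three sample values are hypothetical and serve only as a worked example.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    n = len(y_true)
    return 100.0 / n * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))


# Worked example with made-up dimensionless diameters d/D.
measured = [0.50, 0.80, 0.40]
predicted = [0.55, 0.76, 0.40]
error = mape(measured, predicted)  # relative errors: 10%, 5% and 0%
```

Since each error is normalised by the measured value, small and large droplets contribute equally to the metric.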
Fig. 2 Predicted dimensionless droplet diameter using the semi-empirical eqn (7) compared to the experimental test data set. |
(9) |
As the role of surfactants is central to the present study, the ratio ϕ/ϕCMC is used for their comparison, where ϕ is the surfactant concentration and ϕCMC is the critical micelle concentration. This is used in the data-driven model to improve the droplet size prediction. However, as described in section 2, all experiments were performed in the same channel with the same phases. As a result, the variation of the Reynolds numbers depends only on the flow rates while that of the capillary numbers on the flow rates and interfacial tension: Rei(ρi, μi, S, D, Qi) ≡ Rei(Qi) and Cai(μi, S, γ, Qi) ≡ Cai(γ, Qi). Finally, the model can be trained with the 4 following inputs: Qd, Qc, γ and ϕ/ϕCMC.
Fig. 3 shows the dimensionless droplet diameter predictions for the test data set with both BRANN and XGBoost trained with these 4 inputs. To get robust predictions, both models were run 50 times and averaged. The standard errors are low (max(errors) < 1.6%), which highlights the excellent repeatability of the models. The MAPEs for the test data set are 3.9% for both data-driven models, which highlights the good selection of the 4 inputs. Moreover, this result shows that both BRANN and XGBoost predict the dimensionless droplet diameter d/D better than the semi-empirical model (with associated MAPE = 11.4%, as shown in Fig. 2).
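The averaging over repeated runs can be sketched as follows. The function name and the five MAPE values are hypothetical; the standard error is taken as the sample standard deviation divided by the square root of the number of runs, which is the usual definition but is not spelled out in the text.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical MAPEs (%) from five repeated training runs.
run_mapes = [3.8, 4.1, 3.9, 4.0, 3.7]
mean_mape, sem_mape = mean_and_standard_error(run_mapes)
```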
As proposed by the Garson equation, the neural network weight matrix can be used to determine the relative importance of inputs20,46,47 using the following equation:
(10) |
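For a network with a single hidden layer and one output, one common form of Garson's algorithm can be sketched as below. The notation is not necessarily that of eqn (10): the weight layout, the function name and the example weights are assumptions made for illustration.

```python
def garson_importance(w_in, w_out):
    """Relative importance (%) of each input.

    w_in[i][j]: weight from input i to hidden node j.
    w_out[j]:   weight from hidden node j to the single output.
    """
    n_in, n_hidden = len(w_in), len(w_out)
    scores = []
    for i in range(n_in):
        score = 0.0
        for j in range(n_hidden):
            # Share of input i in hidden node j, scaled by that node's
            # absolute contribution to the output.
            col_sum = sum(abs(w_in[k][j]) for k in range(n_in))
            score += abs(w_in[i][j]) / col_sum * abs(w_out[j])
        scores.append(score)
    total = sum(scores)
    return [100.0 * s / total for s in scores]


# Hypothetical weights: 4 inputs, 2 hidden nodes, 1 output.
w_in = [[2.0, 1.0], [1.0, 1.0], [0.5, 1.0], [0.5, 1.0]]
w_out = [1.0, 1.0]
importance = garson_importance(w_in, w_out)
```

Only the absolute values of the weights are used, so the result indicates how strongly an input influences the output, not the direction of that influence.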
Fig. 4 shows a diagram of the relative importance of each input variable for both models. For BRANN and XGBoost, the flow rate of the continuous phase Qc has the largest effect on the dimensionless droplet size prediction, at 55.2% and 32.1%, respectively. This result confirms the strong impact of Qc on the droplet formation, already highlighted by the semi-empirical eqn (6), directly through the term Qc and indirectly through Cac(μc, S, γ, Qc). The flow rate of the dispersed phase Qd (BRANN: 18.8%, XGBoost: 17.7%) and the ratio ϕ/ϕCMC (BRANN: 17.3%, XGBoost: 29.2%) have a lower contribution but still a significant impact on the models. Although the relative importance of the interfacial tension γ is still significant for XGBoost (21.0%), it becomes less crucial for the BRANN prediction (8.6%).
Fig. 5 shows the dimensionless droplet diameter predictions on the test data set for both BRANN and XGBoost trained with only 3 of the inputs: Qc, Qd and ϕ/ϕCMC. Although there is a small increase in the MAPE (6.4% for BRANN and 5.2% for XGBoost), these errors are smaller than that of the semi-empirical model, eqn (7). This result highlights the accuracy of the data-driven models, especially when compared with the reference semi-empirical models, even with reduced inputs. In this case, however, XGBoost shows a significantly lower uncertainty than BRANN and demonstrates its usefulness when reduced inputs need to be used (e.g. when some experimental data are inaccessible).
These reduced-input models with a low uncertainty can be key to accurately predicting the droplet size in low-cost or rapid measurements, where only a limited number of parameters is available.
Fig. 6a shows an example of the classic flow pattern map for the dripping regime, often used in droplet generation studies with different microfluidic configurations. The colourmap corresponds to the droplet diameter. As the experimental measurement acquisition is a long process, the same flow rates were often used for the experiments with different surfactants and surfactant concentrations to enable comparison, resulting in an overlap of the experimental points and a large undefined zone in the parameter space. Fig. 6b shows 10000 synthetic data generated in a random grid with all inputs (Qc, Qd, γ and ϕ/ϕCMC). The synthetic flow pattern map follows the shape of the real flow pattern map, while giving access to new information across the entire map and overcoming the experimental overlap. Moreover, the synthetic data give access to a clear distribution of the droplet size in the flow pattern map. The quality of these synthetic data can also be observed in Fig. 7, which shows the kernel density estimator (KDE) for the distributions of experimental against synthetic data for the four inputs and for the droplet diameter. In all cases, the synthetic data distribution is very similar to the experimental one, which highlights the good mimicking capability of machine learning methods. Moreover, for 4 features, the KL divergence is 0.29, 0.62, 0.47, 1.49 and 0.04 for Qc, Qd, γ, ϕ/ϕCMC and d/D, respectively.
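One way to compare an experimental and a synthetic distribution of a single feature is to bin both samples and evaluate the discrete KL divergence. The sketch below is an assumption about how such a comparison could be made, not the procedure used for the values quoted above: the binning, the small constant added to empty bins, the function names and the sample values are all hypothetical.

```python
import math


def histogram(values, lo, hi, n_bins):
    """Normalised histogram with a small floor to avoid log(0)."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = max(0, min(int((v - lo) / width), n_bins - 1))
        counts[idx] += 1
    eps = 1e-9  # assumption: smoothing constant for empty bins
    probs = [c + eps for c in counts]
    total = sum(probs)
    return [p / total for p in probs]


def kl_divergence(a, b):
    """KL(a || b) = sum_x a(x) * log(a(x) / b(x))."""
    return sum(pa * math.log(pa / pb) for pa, pb in zip(a, b))


# Hypothetical flow rates (mL/min) for the two samples.
experimental = [0.05, 0.15, 0.15, 0.25, 0.35]
synthetic = [0.05, 0.05, 0.15, 0.25, 0.35]
p_exp = histogram(experimental, 0.0, 0.4, 4)
p_syn = histogram(synthetic, 0.0, 0.4, 4)
kl_exp_syn = kl_divergence(p_exp, p_syn)
```

The divergence is zero when the two histograms coincide and grows as they differ; it is not symmetric in its two arguments.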
Fig. 6 Flow pattern map of the dripping regime for: a) experimental data, b) 10000 random synthetic data, c) 10000 regular synthetic data. |
Fig. 7 Kernel density estimator of experimental data (shaded blue) against synthetic data (shaded orange) for the generation of 4 features plus droplet diameter size. |
To challenge the synthetic data, they were used to train the BRANN and XGBoost models and predict the droplet size d/D of the test data set. Fig. 8a shows the MAPE of the test data set using different amounts of synthetic data, between 10 and 10000. For robustness, the figure shows the average of the MAPE over 50 different runs per point; the error bars of the standard error are smaller than the markers. When the BRANN model is trained with a small synthetic data set (<100), the MAPE is larger than that of the semi-empirical model of eqn (7) (MAPE = 11.4%). However, the models converge to a MAPE of 7.3% (BRANN) and 6.1% (XGBoost), respectively, after being trained with 250 synthetic data. To highlight the effect of the randomness of the data set on the droplet size prediction, 10000 new synthetic data following a regular grid in Qc and Qd were built to mimic classic experimental investigations (see Fig. 6c). Fig. 8b shows the MAPE of the test data set using different numbers of synthetic data with this new grid. Once again, for both models, the MAPE converges after 250 synthetic data. However, the droplet size prediction is more accurate with a regular grid than with a random grid (MAPE = 6.4% for BRANN and MAPE = 5.5% for XGBoost). These results define the minimum size of the training data set needed and provide a direction for future experimental studies. Coupled with design of experiment methods,49 the synthetic data could be an excellent tool for elaborating strategies to sample complex experimental data sets.
Fig. 8 Evolution of the MAPE of the test data set with the number of synthetic data for both BRANN and XGBoost: a) using a random grid, b) using a regular grid. |
Estimations using the real data, even with only 3 inputs, are closer to the experiments than those of the empirical model. The MAPEs of all models are summarised in Table 2. The mean absolute percentage error (MAPE) of the test data set was calculated for BRANN and XGBoost, and compared with the semi-empirical prediction (MAPE = 11.4%). Using Qd, Qc, γ and ϕ/ϕCMC as inputs, both models give an excellent prediction of the dimensionless droplet diameter d/D (MAPE = 3.9%) and show great potential for linking machine learning with microfluidics to improve current predictive capabilities. Although the MAPEs are higher for the synthetic data than for the real data, the results are still more accurate than those obtained by using the semi-empirical model to predict the dimensionless droplet diameter. Therefore, this approach provides a quick and low-cost alternative to study droplet generation in a specific region of the flow pattern map with an acceptable uncertainty.
Name | Hyperparameters |
---|---|
XGBoost | Number of estimators: 100 |
XGBoost | Maximum depth: 3 |
XGBoost | Learning rate: 0.3 |
XGBoost | Random state: 42 |
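For reference, the XGBoost hyperparameters in the table above map onto keyword arguments of the scikit-learn style regressor of the xgboost package. The dictionary below is a sketch of that mapping; the variable names `xgb_params`, `X_train` and `y_train` are hypothetical, and the commented usage assumes the xgboost package is installed.

```python
# Keyword arguments for xgboost.XGBRegressor, taken from the table above.
xgb_params = {
    "n_estimators": 100,   # Number of estimators
    "max_depth": 3,        # Maximum depth
    "learning_rate": 0.3,  # Learning rate
    "random_state": 42,    # Random state
}

# Usage, assuming the xgboost package is installed and that X_train and
# y_train hold the input features and the measured d/D values:
#   from xgboost import XGBRegressor
#   model = XGBRegressor(**xgb_params).fit(X_train, y_train)
```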
Name | Hyperparameters |
---|---|
BRANN | Number of hidden layers: 1 |
BRANN | Number of hidden nodes: 8 |
BRANN | Optimisation: Bayesian regularisation |
BRANN | Activation hidden layer: sigmoid |
BRANN | Activation output layer: linear |
The mean absolute percentage error (MAPE) of the test data set was calculated for BRANN and XGBoost, and compared with the semi-empirical prediction. Using Qd, Qc, γ and ϕ/ϕCMC as inputs, both models give an excellent prediction of the dimensionless droplet diameter d/D and show great potential for linking machine learning with microfluidics to improve current predictive capabilities. Moreover, as the experimental estimation of the interfacial tension can be subject to discussion48 and is hard to obtain for dynamic and fast processes, the models were also trained with a reduced number of inputs (Qd, Qc and ϕ/ϕCMC). The results show that, even if the MAPE increases slightly, the estimation of d/D is still more accurate with machine learning techniques than with semi-empirical methods. In this case, however, XGBoost gives a better prediction than BRANN. Finally, as experimental data sets can be costly and time-consuming to enlarge, a synthetic data set of 10000 new experiments was built using a VAE with all available inputs. When the BRANN and XGBoost models are trained with this synthetic data set, their MAPEs still outperform that of the semi-empirical model (Table 4).
Name network | Hyperparameters |
---|---|
3 Predictors | Number of features: 4 |
3 Predictors, encoder | Number of hidden nodes (layer 1): 512 |
3 Predictors, encoder | Activation hidden layer 1: LeakyReLU |
3 Predictors, encoder | Number of hidden nodes (layer 2): 512 |
3 Predictors, encoder | Activation hidden layer 2: LeakyReLU |
3 Predictors, encoder | Number of nodes latent layer: 4 |
3 Predictors, decoder | Number of hidden nodes (layer 1): 512 |
3 Predictors, decoder | Activation hidden layer 1: LeakyReLU |
3 Predictors, decoder | Number of hidden nodes (layer 2): 512 |
3 Predictors, decoder | Activation hidden layer 2: LeakyReLU |
3 Predictors, decoder | Number of nodes output layer: 4 |
3 Predictors, decoder | Activation output layer: sigmoid |
4 Predictors | Number of features: 5 |
4 Predictors, encoder | Number of hidden nodes (layer 1): 512 |
4 Predictors, encoder | Activation hidden layer 1: LeakyReLU |
4 Predictors, encoder | Number of hidden nodes (layer 2): 512 |
4 Predictors, encoder | Activation hidden layer 2: LeakyReLU |
4 Predictors, encoder | Number of nodes latent layer: 5 |
4 Predictors, decoder | Number of hidden nodes (layer 1): 512 |
4 Predictors, decoder | Activation hidden layer 1: LeakyReLU |
4 Predictors, decoder | Number of hidden nodes (layer 2): 512 |
4 Predictors, decoder | Activation hidden layer 2: LeakyReLU |
4 Predictors, decoder | Number of nodes output layer: 5 |
4 Predictors, decoder | Activation output layer: sigmoid |
Further hyperparameters | Optimiser: Nadam |
Further hyperparameters | Dropout of 0.5 between layers |
Further hyperparameters | Epochs: 2000 |
Further hyperparameters | Batch size: 512 |
Further hyperparameters | Random state: 42 |
The real interest of synthetic data lies in gaining access, with a low uncertainty, to a part of the parameter space where data are not available due to experimental difficulties. Experimental data often follow a discrete distribution, and the synthetic data can transform this into a continuous distribution. In this way, the previous results can be seen as a tool to help experimentalists design their next experiments. For example, it can be an excellent strategy to improve the filling of flow pattern maps, which are extensively used in microfluidics but extremely time-consuming to acquire. Future work includes the exploration of other generative networks such as generative adversarial networks,50 diffusion models,51 or normalising flows.52 The latter could be of interest as they do not require a compression of the input data size via a bottleneck layer; rather, they work in the same input space, which is advantageous if the number of features from which to generate synthetic data is small.
Finally, while the purpose of this paper is to highlight the potential of data-driven models in predicting the droplet behaviour for a wide range of surfactants and surfactant concentrations, it remains focused on a specific regime and on the same fluid phases. Apart from dripping, however, other regimes of droplet generation (e.g. squeezing, jetting, tip-streaming) have been reported and extensively studied both experimentally and numerically in previous works.5,13,18,48,53 In addition, Kiratzis et al.14 showed the importance of the phase viscosity ratio in the drop formation process. This work aims to unravel the unexplored capabilities of data-driven models for droplet microfluidics. The methodologies developed here can be extended to different regimes, fluid viscosity ratios or channel geometries, which will be the focus of our future work on building generalised models for droplet size prediction in microfluidic channels.
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2lc00416j |
‡ Equal contribution. |
This journal is © The Royal Society of Chemistry 2022 |