Kewei Zhu a, Sibo Cheng *b, Nina Kovalchuk c, Mark Simmons c, Yi-Ke Guo b, Omar K. Matar d and Rossella Arcucci e
aDepartment of Computer Science, University of York, UK
bData Science Institute, Department of Computing, Imperial College London, UK. E-mail: sibo.cheng@imperial.ac.uk
cSchool of Chemical Engineering, University of Birmingham, UK
dDepartment of Chemical Engineering, Imperial College London, UK
eDepartment of Earth Science & Engineering, Imperial College London, UK
First published on 27th April 2023
Predicting drop coalescence based on process parameters is crucial for experimental design in chemical engineering. However, predictive models can suffer from a lack of training data and, more importantly, from label imbalance. In this study, we propose the use of deep learning generative models to tackle this bottleneck by training the predictive models on generated synthetic data. A novel generative model, named double space conditional variational autoencoder (DSCVAE), is developed for labelled tabular data. By introducing label constraints in both the latent and the original space, DSCVAE is capable of generating more consistent and realistic samples than the standard conditional variational autoencoder (CVAE). Two predictive models, namely random forest and gradient boosting classifiers, are enhanced with the synthetic data and their performance is evaluated on real experimental data. Numerical results show that a considerable improvement in prediction accuracy can be achieved by using synthetic data and that the proposed DSCVAE clearly outperforms the standard CVAE. This research provides further insights into handling imbalanced data for classification problems, especially in chemical engineering.
Microfluidics provide a unique opportunity to study drop coalescence under well-controlled hydrodynamic conditions and if necessary to account for the fate of each formed doublet. The use of microfluidic platforms allows for the exploration of thousands of coalescence instances with minimal material usage and energy consumption. Drop coalescence is used in microfluidics for triggering/quenching chemical reactions within a drop or for cell screening. Various designs of microfluidic chambers are used depending on the aim of investigation.1–4
Considering that Machine Learning (ML) techniques have been widely successful in chemistry and chemical engineering,5 with applications spanning from quantum chemistry6 to molecular reaction kinetics,7 we believe that an effective ML model can also predict the probability of drop coalescence in a microfluidic device, enabling a considerable reduction in the time and cost spent on optimizing microfluidic designs, including the cost of energy as well as the expensive and hazardous materials used in photo-lithography.
For microfluidic drop reactions or cell screening relying on coalescence, it is obligatory that the rate of drop coalescence is close to 100%. In such scenarios, the composition of coalescing drops is set as a priority, and there are usually very limited possibilities for adjusting the properties of the continuous phase. Therefore, accurate prediction of drop coalescence based on geometrical and hydrodynamic conditions is pivotal for effective experimental design. In this study, the experimental dataset comprises outcomes of either “coalescence” or “non-coalescence” when two drops interact in a microfluidic coalescence chamber, contingent on process parameters, such as drop size and flow rate.
Here, ML can be used for the optimization of the parameters of the coalescence chamber and, more generally, the geometry of microfluidic devices, as well as the flow characteristics maximizing the drop coalescence probability. By construction, different ML methods are able to deal with various types of information in chemistry, such as images for phenomena, videos for processes, texts for descriptions, and tabular data for numerical records. For example, Lasso regression and Random Forest (RF) have been employed to predict drop coalescence from experimental data.8 Moreover, neural networks have been used to parameterize the collision merging process,9 and surrogate models based on Deep Learning (DL) have been developed to predict the dynamics of drop interactions from recorded videos of experiments.10 Combining effective ML protocols that accurately predict coalescence with microfluidics enables, in this case, a considerable reduction in material and energy consumption during formulation optimization.
Despite the success in a large range of applications, it is widely recognized that imbalanced training data can lead to poor prediction/classification performance in ML.11 Examples can be found in the assessment of chemical-disease relations12 and in the classification of drug-induced injury.13 To tackle the data imbalance problem, multiple advanced tree-based algorithms14–17 have been developed. These approaches mainly consist of building sub-models trained on partial data to alleviate the risk of overfitting caused by data imbalance. However, in the case of extreme data imbalance, the size of the training dataset for each sub-model is limited by the constraint of label balance.18
Generative models have been developed as a method of data augmentation.19 By extension, they are also employed to address data imbalance. The generative adversarial network (GAN)20 is a common approach, in which a generator produces artificial data and a discriminator judges whether the synthetic data are similar to the real data. Although GANs achieve clear improvements on imbalanced-data problems,21 their implicit posterior distribution is difficult to optimize and requires large-scale data to converge.22 Another method to create synthetic data is the variational autoencoder (VAE),23 which is formulated with an explicit posterior distribution; however, a plain VAE cannot generate data with class-level conditions. To enable VAEs to generate class-specific data, the conditional variational autoencoder (CVAE)24 adds an extra classifier in the original space. Thanks to this constraint, CVAE learns conditional representations explicitly in the original space and implicitly in the latent space. Benefiting from the conditional representation, CVAE can generate synthetic data for each class separately, substantially decreasing the impact of data imbalance.25–27 However, the latent space of CVAE still lacks a conditional constraint and therefore cannot explicitly learn a latent conditional representation, which pushes more information into the decoder rather than the latent space.28 The recent work of Chagot et al.29 generates synthetic data to mitigate data imbalance, but the challenge of latent conditional representation remains. To improve the performance of generative models, some studies have begun to focus on latent spaces.30,31 For example, GET-3D32 in 3D modeling does not constrain the outputs only; it adds an extra constraint in the latent space to improve the consistency of the outputs. However, the outputs of GET-3D are not tied to specific labels because of its target task: GET-3D can generate vivid 3D cars, but no label-specific constraint is involved, so it only guides the generator towards similar data without distinguishing their classes.
In this work, we aim to handle an experiment-based classification problem using ML methods. The study is conducted on tabular data with a relatively small sample size, in which the two outcomes (i.e., “coalescence” or “non-coalescence”) correspond to very similar distributions of the input values. By predicting drop coalescence in the microfluidic device, we explore a new approach to the dilemma of imbalanced tabular data. For the prediction task, we use RF and XGBoost for their interpretability and their strengths on small datasets. To address the data imbalance and improve predictive performance, we use generative models based on the variational autoencoder (VAE) to obtain more synthetic data. At this stage, we propose a double space conditional variational autoencoder (DSCVAE), consisting of a standard VAE and two explicit constraints, one on the latent space and one on the original space. The novel latent constraint guides the latent space to learn a conditional representation, making it more informative. After generating additional training data with DSCVAE, the accuracy of the predictive models improves by 7.5% and 8.5%, respectively. A sensitivity analysis based on Shapley additive explanations (SHAP) values shows that the contribution of these synthetic data to the predictive models is consistent with the microfluidics experiments.
The rest of the paper is organized as follows. Section 2 describes the experimental process and the initial data that originated from the experiment. Section 3 introduces the machine learning methods involved in this task, including two tree-based predictive algorithms for predicting coalescence results and our novel DSCVAE model for synthetic data. Section 4 exhibits the results of our numerical experiments. Starting with implementation details, we trace the training process of generative models, discuss the performance of predictors, and analyze the impact of the initial and generative datasets on the predictors via SHAP.
The chamber has two entrances or input channels (top and bottom channel in Fig. 1) and two exits or output channels (left and right channel in Fig. 1) of width 315 ± 5 μm and depth 140 ± 5 μm. The chamber size, L, shown in Fig. 1b was in the range of 901 ± 11 μm. The flow field in the chamber and adjacent channels is shown in Fig. 1a, with the velocity magnitude shown in mm s−1. It was measured using ghost particle velocimetry, a non-invasive technique using the speckle pattern produced by light scattered by particles smaller than the diffraction limit as a tracer.36,37 As Fig. 1a shows, there is a stagnation point in the centre of the chamber due to flow symmetry. Drops arrive into the chamber with various time delays between them caused by fluctuations in the flow rate of continuous and dispersed phases supplied by syringe pumps (Al-4000, World Precision Instruments), inevitable deviations in the channel size, and fluctuations in channel resistance due to drop presence.
In an ideal situation, the drops move along the symmetry axis of the input channels and the first drop arriving in the chamber is trapped at the stagnation point until the arrival of the second drop. In reality, the flow fluctuations and the effect of the second drop on the flow field in the chamber result in drop encounters at various positions around the stagnation point. Following an encounter, the drops form a doublet, i.e., start to move together. If the doublet axis coincides with that of the input channels, which is also one of the chamber symmetry axes, the doublet is trapped in the compression flow provided by the continuous phase. In this case, the coalescence probability is 100% and the compositions of the continuous and dispersed phases affect only the time span between doublet formation and coalescence. Because of ever-present fluctuations as well as the inevitable small asymmetries of real devices, the doublet begins to rotate/translate towards one of the output channels; a doublet that has rotated and moved away from its initial position is shown in Fig. 1b. Doublet rotation transfers it from a region of compression flow to one of extensional flow when its axis coincides with the axis of the output channels. The extensional flow leads to drop detachment, resulting in a non-coalescence event.
In general, the outcome of a drop encounter depends on the relative time scales of drainage of the continuous phase from the film separating the drops and of the transition of the doublet from the region of compression flow to that of extensional flow: if the doublet does not rotate at all, the coalescence probability is 100%; if it rotates very fast, the coalescence probability is zero, i.e., the coalescence probability increases with decreasing flow rate. However, very slow flows are incompatible with the requirement of high throughput, which is a measure of the efficiency of microfluidic devices. Moreover, in a certain range of flow rates, the drainage time decreases with an increase of the continuous phase flow rate due to inertial effects,38 diminishing, or possibly even reversing, the dependence of the coalescence probability on the flow rate.
Therefore, process optimization is necessary to determine the conditions under which the coalescence probability remains large while maintaining a sufficiently high throughput. The optimization parameters include xflow, the total flow rate; xdrop1 and xdrop2, the drop diameters; and xdt, the time delay between drops. The total flow rate is calculated as the sum of the flow rates of the continuous and dispersed phases, with a syringe pump dispensing accuracy of ±1%. The drop sizes are determined from their areas in the plane of observation, S1 and S2 in Fig. 1b, with a maximum error of ±1%. The drop diameters are normalized by the chamber width. The time delay, xdt, between drops has an uncertainty of 2 ms. These four parameters form the input feature vector Xfeatures:
Xfeatures = [xflow, xdrop1, xdrop2, xdt] | (1)
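As an illustration only, the sketch below (Python) shows how such a feature vector could be assembled from the measured quantities. The equivalent-circular-diameter formula and the choice of normalizing length are assumptions for illustration, not the exact preprocessing used in this study.

```python
import numpy as np
import pandas as pd

def build_features(q_cont, q_disp, s1, s2, dt, chamber_width):
    """Assemble Xfeatures = [x_flow, x_drop1, x_drop2, x_dt] from raw measurements.

    Assumptions (illustrative): drop diameters are equivalent circular diameters
    computed from the observed areas S1 and S2 and normalised by the chamber
    width; all inputs are NumPy arrays in consistent units.
    """
    x_flow = q_cont + q_disp                               # total flow rate
    x_drop1 = np.sqrt(4.0 * s1 / np.pi) / chamber_width    # normalised diameter, drop 1
    x_drop2 = np.sqrt(4.0 * s2 / np.pi) / chamber_width    # normalised diameter, drop 2
    return pd.DataFrame({"flow": x_flow, "drop1": x_drop1,
                         "drop2": x_drop2, "dt": dt})
```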
The statistical characteristics of the binary classifications are given in detail in Fig. 2. Fig. 2a shows the imbalance between “coalescence” and “non-coalescence” samples, where “coalescence” accounts for 76% and “non-coalescence” for only 24%. The value distributions of the four features, compared between the two labels, are shown in Fig. 2b. Fig. 2c shows the median, quartiles and whiskers of Xfeatures. The values for the two labels under each feature are not significantly different. Fig. 2d shows that the Xfeatures values do not follow a standard normal distribution, and confirms again that the distributions are similar for different labels. To sum up, Xfeatures is imbalanced in sample volume between the two opposite classes, even though their feature distributions are nearly identical. In other words, there are significantly more “coalescence” samples than “non-coalescence” samples in the training dataset, and it is difficult to separate the two classes simply by looking at their input distributions. The latter makes the prediction task even more challenging.
ML models must be evaluated using data that have not been used for training. Moreover, a validation set is needed for tuning the hyperparameters of the predictive models. Since neither experimental outcome is inherently dominant in practice, the dataset is split into a balanced test set and a balanced validation set (“balanced” here means with the same number of “coalescence” and “non-coalescence” samples), instead of being divided proportionally. The remaining samples are used as the training set. The imbalanced ratio (IR) here is defined as the number of majority results (“coalescence”) divided by the number of minority results (“non-coalescence”). Thus, the whole dataset is split into three mutually independent sets, namely the balanced training dataset, the validation dataset and the test dataset, for subsequent procedures (shown in Table 1). It is worth mentioning that all comparative models in this paper are trained on this balanced training set to avoid overfitting.
| | Coalescence | Non-coalescence | IR | Total |
|---|---|---|---|---|
| Total dataset | 1162 | 369 | 3.15 | 1531 |
| Training dataset | 1012 | 219 | 4.62 | 1231 |
| Balanced training dataset | 219 | 219 | 1 | 438 |
| Validation dataset | 50 | 50 | 1 | 100 |
| Test dataset | 100 | 100 | 1 | 200 |
After dataset splitting, the IR increases from 3.15 for the total experimental set to 4.62 for the training dataset, which makes the classification task more difficult.
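For clarity, a minimal sketch of the splitting procedure is given below, assuming the samples live in a pandas DataFrame with a binary label column. The undersampling step that yields the balanced training subset of Table 1 is shown as one possible implementation.

```python
import numpy as np
import pandas as pd

def balanced_splits(df, label_col="label", n_val=50, n_test=100, seed=0):
    """Draw balanced validation and test sets (equal samples per class), keep the
    remaining samples for training, and undersample the majority class to obtain
    a balanced training set, as in Table 1."""
    rng = np.random.default_rng(seed)
    val_idx, test_idx = [], []
    for cls in df[label_col].unique():
        cls_idx = rng.permutation(df.index[df[label_col] == cls].to_numpy())
        val_idx.extend(cls_idx[:n_val])                 # balanced validation set
        test_idx.extend(cls_idx[n_val:n_val + n_test])  # balanced test set
    train = df.drop(index=val_idx + test_idx)           # remaining (imbalanced) training set
    n_min = train[label_col].value_counts().min()
    balanced_train = (train.groupby(label_col, group_keys=False)
                           .apply(lambda g: g.sample(n=n_min, random_state=seed)))
    return balanced_train, df.loc[val_idx], df.loc[test_idx]
```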
To measure model performance for binary classification, we use the following metrics: accuracy, precision (eqn (2)), evaluated over the predicted positives, recall (eqn (3)), evaluated over the actual positives, the F1 score (eqn (4)), and the confusion matrix (Table 2).
| Real \ Prediction | Negative | Positive |
|---|---|---|
| Negative | TN | FP |
| Positive | FN | TP |
• The confusion matrix comprises four categories: true positives (TP) and true negatives (TN) are correctly classified samples, false positives (FP) are negative samples incorrectly predicted as positive, and false negatives (FN) are positive samples incorrectly predicted as negative.
• Precision or positive predictive value:
Precision = TP/(TP + FP) | (2)
• Recall or true positive rate:
Recall = TP/(TP + FN) | (3)
• F1 score, the harmonic mean of precision and recall:
F1 = 2 × Precision × Recall/(Precision + Recall) | (4)
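A short sketch of how these metrics can be computed with scikit-learn (assumed installed) is given below; it uses the plain binary definitions of eqn (2)–(4), whereas the percentages reported in the tables may additionally be averaged over both classes.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def classification_report(y_true, y_pred):
    """Return accuracy, precision (eqn (2)), recall (eqn (3)), F1 (eqn (4)) and
    the confusion-matrix entries of Table 2 for binary labels (1 = coalescence)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
        "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
        "f1": f1_score(y_true, y_pred),
        "confusion": {"TN": tn, "FP": fp, "FN": fn, "TP": tp},
    }
```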
Generative models, namely variational autoencoders, are employed in this study to address data imbalance by generating data with different labels separately. These data are then used as inputs for predictive models to enhance performance.
z = E(x) and x̂ = D(z) | (5)
MSE(x, x̂) = (1/N)∑i(xi − x̂i)2 | (6)
VAE stems from AE, and starts to focus on the distribution of z in the latent space (Fig. 4b),23 which equips VAE with generation capability. Unlike AE, VAE constructs the distribution of z in the latent space and resamples z′ over the distribution. Thus, the whole process of VAE can be expressed as
z = E(x) and x̂ = D(z′) | (7)
The latent distribution q(z|x) is trained to follow a posterior distribution p(z|x), usually assumed to be Gaussian. The discrepancy between q(z|x) and p(z|x) is measured by the Kullback–Leibler divergence (KLD) loss as
KLD(q(z|x)‖p(z|x)) = ∫q(z|x)log[q(z|x)/p(z|x)]dz | (8)
The KLD loss regularizes the distribution of the hidden variables q(z|x) in the latent space so that it converges to the Gaussian distribution p(z|x), which also provides the VAE with interpretability.
The resampled latent variables z′ follow a normal distribution with mean μ and variance σ2, denoted 𝒩(μ, σ2), and are obtained with ε sampled from a standard Gaussian distribution as
z′ = μ + σ ⊙ ε, ε ∼ 𝒩(0, I) | (9)
ℒVAE = MSE(x, x̂) + KLD(q(z|x)‖p(z|x)) | (10)
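A minimal PyTorch sketch of the VAE described by eqn (7)–(10) is given below; the layer sizes and latent dimension are illustrative assumptions, not the architecture used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE for a 4-dimensional tabular input (eqn (7)-(10))."""
    def __init__(self, x_dim=4, z_dim=2, h_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                       # eqn (9): reparameterisation
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    mse = F.mse_loss(x_hat, x)                                        # reconstruction term
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # eqn (8), Gaussian prior
    return mse + kld                                                  # eqn (10)
```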
In addition, in our study the labels y are not fed into the encoder together with Xfeatures, in order to prevent label leakage during training. In other words, the label y is not given to the encoder when the latent variables are constructed. As shown in Fig. 4c, Xfeatures are first used as the only inputs of the encoder. Their corresponding labels y are then used as the targets of an artificial neural network (ANN) classifier with outputs ŷ, and this ANN classifier is jointly trained with the entire VAE network. The difference between the true labels y ∈ {0,1} and the predicted probabilities ŷ is measured via the cross entropy (CE) loss according to
ℒoriginalCE = −(y log ŷ + (1 − y)log(1 − ŷ)) | (11)
ℒCVAE = MSE(x, x̂) + KLD(q(z|x,y)‖p(z|x,y)) + ℒoriginalCE | (12)
In DSCVAE, we adopt two classifiers, one in each space, to classify the latent representation z′ and the original-space reconstruction x̂ according to their common label. The double-space condition is implemented by adding another ANN classifier for the resampled variables z′ (shown in Fig. 5), whose classification error is again measured by the CE loss as
ℒlatentCE = −(y log ŷ′ + (1 − y)log(1 − ŷ′)) | (13)
These two classifiers help distinguish different labels, rather than merely judging the similarity of the outputs as in conventional generative models. Furthermore, the novel latent classifier encourages DSCVAE to learn a more informative latent space than generative models with only original-space conditions. Thus, DSCVAE is better suited to conditional inputs (“coalescence” or “non-coalescence” here) with considerable noise and similar feature distributions across labels.
The total loss function can be written as
ℒDSCVAE = MSE(x, x̂) + KLD(q(z|x,y)‖p(z|x,y)) + ℒoriginalCE + ℒlatentCE | (14)
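Continuing the VAE sketch above (and reusing its VAE class), one possible implementation of the DSCVAE objective of eqn (14) is shown below: an original-space classifier applied to the reconstruction x̂ and a latent-space classifier applied to the resampled z′. The classifier architectures and the unit weighting of the four loss terms are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSCVAE(VAE):  # reuses the VAE class from the previous sketch
    """VAE plus two ANN classifiers: one on the reconstruction (original space,
    eqn (11)) and one on the resampled latent variables (eqn (13))."""
    def __init__(self, x_dim=4, z_dim=2, h_dim=32):
        super().__init__(x_dim, z_dim, h_dim)
        self.clf_x = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                   nn.Linear(h_dim, 1))   # original-space classifier
        self.clf_z = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                   nn.Linear(h_dim, 1))   # latent-space classifier

    def loss(self, x, y):
        """x: features, y: float tensor of 0/1 labels."""
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)      # eqn (9)
        x_hat = self.dec(z)
        mse = F.mse_loss(x_hat, x)
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        ce_x = F.binary_cross_entropy_with_logits(self.clf_x(x_hat).squeeze(-1), y)  # eqn (11)
        ce_z = F.binary_cross_entropy_with_logits(self.clf_z(z).squeeze(-1), y)      # eqn (13)
        return mse + kld + ce_x + ce_z                                               # eqn (14)
```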
We then use the predictive models to evaluate the generated synthetic data. The training datasets are listed in Table 3; the validation and test sets are those given in Table 1.
| | Initial | Synthetic | Total |
|---|---|---|---|
| Initial dataset | 438 | — | 438 |
| CVAE mixed dataset | 438 | 438 × 15 | 7008 |
| DSCVAE mixed dataset | 438 | 438 × 15 | 7008 |
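The paper does not spell out how label-specific samples are drawn from the trained generator, so the sketch below shows one plausible scheme under that caveat: latent vectors are sampled from the Gaussian prior, decoded, and kept only if the latent-space classifier of the DSCVAE sketched above assigns them to the requested label (rejection-style sampling).

```python
import torch

@torch.no_grad()
def generate_synthetic(model, target_label, n_samples, z_dim=2, batch=512):
    """Hypothetical label-conditional sampling with a trained DSCVAE."""
    model.eval()
    kept, n_kept = [], 0
    while n_kept < n_samples:
        z = torch.randn(batch, z_dim)                                  # sample the Gaussian prior
        label = (torch.sigmoid(model.clf_z(z)).squeeze(-1) > 0.5).float()
        x_hat = model.dec(z)                                           # decode to the original space
        sel = x_hat[label == target_label]                             # keep only the requested class
        kept.append(sel)
        n_kept += sel.shape[0]
    return torch.cat(kept)[:n_samples]

# e.g. 15 synthetic samples per initial sample and class, as in Table 3:
# synth_coalescence = generate_synthetic(dscvae, target_label=1.0, n_samples=219 * 15)
```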
The assessment algorithms are Random Forest (Section 3.1.1) and XGBoost (Section 3.1.2). The key hyperparameters, nestimators and dmax, are chosen by grid search. All numerical experiments are implemented on the Google Colab platform, and the DL generative models are trained on a single Tesla K80 GPU.
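A brief sketch of such a grid search is shown below; the candidate grids and the use of validation accuracy as the selection criterion are illustrative assumptions (scikit-learn and xgboost are assumed to be installed).

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

def tune(make_model, X_train, y_train, X_val, y_val,
         n_grid=(50, 60, 80, 105, 125), d_grid=(5, 6, 8, 9, 12)):
    """Pick (n_estimators, max_depth) maximising validation accuracy."""
    scores = {}
    for n, d in product(n_grid, d_grid):
        model = make_model(n_estimators=n, max_depth=d).fit(X_train, y_train)
        scores[(n, d)] = accuracy_score(y_val, model.predict(X_val))
    return max(scores, key=scores.get), scores

# Usage (with a mixed training set and the validation set of Table 1):
# best_rf, _ = tune(RandomForestClassifier, X_train, y_train, X_val, y_val)
# best_xgb, _ = tune(XGBClassifier, X_train, y_train, X_val, y_val)
```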
The classifier of the latent space in DSCVAE has a direct impact on the construction of the latent distribution and indirectly affects the classification and reconstruction in the original space. Because the latent-space classifier affects the latent distribution, it is unsurprising that the KLD loss of DSCVAE is less stable than that of CVAE, as shown in Fig. 6b. In addition, the double-space classifiers make the reconstruction training converge faster: Fig. 6a shows that the MSE loss of DSCVAE decreases faster than that of CVAE. This may be because the classifier of the latent space also guides the data reconstruction.
We fix the values of nestimators and dmax that perform best on the validation dataset (according to Fig. 7), since the test dataset must not be used for tuning hyperparameters. These hyperparameters are then used for predicting the test data; their exact values are shown in Table 4.
| Evaluative method | Generative method | Accuracy (%) | Precision (%) | Recall (%) | F1 score (%) | nestimators | dmax |
|---|---|---|---|---|---|---|---|
| RF | None | 67.00 | 67.34 | 67.00 | 66.84 | 80 | 12 |
| | CVAE | 66.00 | 66.10 | 66.00 | 65.95 | 105 | 8 |
| | DSCVAE | 71.00 | 71.01 | 71.00 | 71.00 | 125 | 9 |
| XGB | None | 64.00 | 64.09 | 64.00 | 63.94 | 60 | 5 |
| | CVAE | 68.00 | 68.12 | 68.00 | 67.95 | 60 | 6 |
| | DSCVAE | 66.00 | 66.03 | 66.00 | 65.99 | 50 | 5 |
| Evaluative method | Generative method | Accuracy (%) | Precision (%) | Recall (%) | F1 score (%) |
|---|---|---|---|---|---|
| RF | None | 58.50 | 58.57 | 58.50 | 58.42 |
| | CVAE | 63.00 | 63.13 | 63.00 | 62.91 |
| | CVAE (L) | 60.50 | 60.59 | 60.50 | 60.42 |
| | DSCVAE | 66.00 | 66.42 | 66.00 | 65.78 |
| XGB | None | 58.00 | 58.01 | 58.00 | 57.98 |
| | CVAE | 63.00 | 63.01 | 63.00 | 63.00 |
| | CVAE (L) | 61.50 | 61.59 | 61.50 | 61.42 |
| | DSCVAE | 66.50 | 66.58 | 66.50 | 66.46 |
As displayed in Table 5, the predictors trained on the DSCVAE mixed (synthetic plus initial) dataset give the best test results in terms of all metrics investigated. In fact, all generated datasets improve the accuracy of the predictive models. Compared to the predictive models trained only on the initial dataset, DSCVAE substantially enhances the prediction accuracy, by 7.5% and 8.5% for RF and XGBoost, respectively. The predictive models trained on the DSCVAE mixed dataset also perform best in precision, recall and F1 score. In addition, DSCVAE helps to reduce the predictors' overfitting, measured as the gap between validation and test results. As shown in Table 5, the DSCVAE synthetic data narrow the gap between validation accuracy and test accuracy from 8.5% to 5% (RF) and from 6% to −0.5% (XGBoost) compared to the initial dataset.
We consider “coalescence” as positive samples and “non-coalescence” as negative samples. Shown in Table 6 is the number of correct and incorrect predictions. The accurate predictions for both types increase compared to the predictive models trained on the initial dataset. The true positive rates increase by 7.4% and 12.5% respectively, and the true negative rates increase by about 17% for both RF and XGBoost. It is worth mentioning that the prediction accuracy of 66% for DSCVAE might still present some limitations for the practical use of the predictive model. However, increasing the accuracy from 58% to 66% for a binary classification problem is still significant and it clearly demonstrates the strength of the proposed generative model. With this model, we predict the outcome, “coalescence” or “non-coalescence”, of each forming doublet. Due to the dependence on the local process parameters, the probability of drop coalescence varies between 0 and 1, leaving some cases indeterminable. As a result, the predictability of the model cannot be very high. On the other hand, the reliable generation of a considerable amount of synthetic data will enable the prediction of coalescence probability for the prescribed set of data. Furthermore, a more accurate predictive model results in a more meaningful interpretability analysis as detailed in the following section.
| Evaluative method | Generative method | Val. TP | Val. TN | Val. FP | Val. FN | Test TP | Test TN | Test FP | Test FN |
|---|---|---|---|---|---|---|---|---|---|
| RF | None | 30 | 37 | 20 | 13 | 54 | 63 | 46 | 37 |
| | CVAE | 31 | 35 | 19 | 15 | 58 | 68 | 42 | 32 |
| | CVAE (L) | 30 | 38 | 20 | 12 | 56 | 65 | 44 | 35 |
| | DSCVAE | 35 | 36 | 15 | 13 | 58 | 74 | 42 | 26 |
| XGB | None | 30 | 34 | 20 | 16 | 56 | 60 | 44 | 40 |
| | CVAE | 32 | 36 | 18 | 14 | 62 | 64 | 38 | 36 |
| | CVAE (L) | 28 | 38 | 22 | 12 | 57 | 66 | 43 | 34 |
| | DSCVAE | 32 | 34 | 18 | 16 | 63 | 70 | 37 | 30 |
In addition, the scatter plots (Fig. 8(g)–(l)) combine feature importances and feature effects. Each point on a scatter plot is the SHAP value of a feature for one training sample. The position on the x-axis is determined by the SHAP value, which represents the feature contribution. The colours of the scatter points represent the feature values (red/blue for positive/negative contribution to “coalescence”). Most of the points in Fig. 8i and l are clustered, indicating that their SHAP values are close to each other. This means that the individual samples of each feature contribute similarly to the model, which improves the model's robustness. It can also be seen from Fig. 8(i) and (l) that the contributions of “drop1” and “drop2” are opposite. That is, a larger “drop1” contributes negatively to the probability of “coalescence”, while a larger “drop2” contributes positively to it. In fact, it can be noticed from Fig. 2 that, in general, the average size of “drop1” is larger than that of “drop2”. In addition, when the sizes of “drop1” and “drop2” are close, the drops have a higher probability of coalescence. To further confirm this trend, we plot in Fig. 9 the impact of the drop size difference (i.e., |xdrop1 − xdrop2|) on the model predictions of RF and XGBoost. It can be concluded that smaller differences in drop size lead to a significantly higher probability of drop coalescence. In other words, to obtain higher drop coalescence in experiments, one should minimize the drop size difference. This explains the opposite contributions of the “drop1” and “drop2” sizes in Fig. 8 and also confirms the consistency of DSCVAE, since both the RF and the XGBoost models here are trained using the synthetic data. For the predictive models trained on the initial and DSCVAE datasets, the impact of xdt on the model output is limited. It is worth mentioning that we consider here only the case when two drops form a doublet: for large xdt, drops can proceed as singlets without forming a doublet and are therefore not included in the analysis.
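As a sketch of how the SHAP analysis of Fig. 8 could be reproduced for a fitted tree ensemble, the snippet below uses the shap package (assumed installed) and the feature names used above; the exact plotting choices in the paper may differ.

```python
import shap

def shap_summary(model, X, feature_names=("flow", "drop1", "drop2", "dt")):
    """Compute SHAP values for a fitted RF/XGBoost classifier and draw the
    beeswarm summary plot (cf. Fig. 8(g)-(l))."""
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    # Depending on the shap version, random-forest classifiers may return one
    # array per class or a (samples, features, classes) array; keep the
    # "coalescence" class so positive SHAP values push towards coalescence.
    values = shap_values
    if isinstance(shap_values, list):
        values = shap_values[1]
    elif getattr(shap_values, "ndim", 2) == 3:
        values = shap_values[..., 1]
    shap.summary_plot(values, X, feature_names=list(feature_names))
```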
In summary, the DSCVAE mixed data outperform the initial dataset and the CVAE mixed dataset for training predictors in classification tasks. They not only reduce the overfitting gap between validation accuracy and test accuracy, but also increase the overall predictive performance. By analyzing the SHAP values of each predictive model, we find that the DSCVAE mixed dataset contributes to feature homogeneity, allowing the features to be treated more equally.
This study offers a paradigm for addressing classification problems with limited and imbalanced tabular data. It also highlights the potential of data-driven methods for predicting microfluidic drop coalescence. The proposed methodology can be applied to other materials and devices. Future work may include further improving the interpretability and robustness of DSCVAE, for instance by imposing physical or chemical constraints, and extending the model parametric space to include the viscosities of the continuous and dispersed phases, interfacial tension, dynamic surfactant effects, and the presence of surfactants in different phases.
Experimental data are available upon reasonable request to Dr Nina Kovalchuk (n.kovalchuk@bham.ac.uk).
AE | Autoencoder |
ANN | Artificial neural network |
CE | Cross entropy |
CVAE | Conditional variational autoencoder |
DL | Deep learning |
DSCVAE | Double space conditional variational autoencoder |
DT | Decision tree |
FN | False negative |
FP | False positive |
GAN | Generative adversarial network |
GBDT | Gradient boosting decision tree |
IR | Imbalanced ratio |
KLD | Kullback–Leibler divergence |
ML | Machine learning |
MSE | Mean square error |
RF | Random forest |
SHAP | Shapley additive explanations |
TN | True negative |
TP | True positive |
VAE | Variational autoencoder |
XGB | XGBoost |