Khayrul Islam,a Ryan F. Forelli,b Jianzhong Han,c Deven Bhadane,d Jian Huang,cef Joshua C. Agar,g Nhan Tran,h Seda Ogrencib and Yaling Liu*ijk
aDepartment of Mechanical Engineering, Lehigh University, Bethlehem, PA 18015, USA
bDepartment of Electrical and Computer Engineering, Northwestern University, Evanston, IL 60208, USA
cCoriell Institute for Medical Research, Camden, NJ, USA
dDepartment of Computer Science, Lehigh University, Bethlehem, PA 18015, USA
eCooper Medical School of Rowan University, Camden, NJ 08103, USA
fCenter for Metabolic Disease Research, Temple University Lewis Katz School of Medicine, Philadelphia, PA 19122, USA
gDepartment of Mechanical Engineering and Mechanics, Drexel University, Philadelphia, PA 19104, USA
hReal-time Processing Systems Division, Fermi National Accelerator Laboratory, Batavia, IL 60510, USA
iPrecision Medicine Translational Research Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China. E-mail: yaling.liu@gmail.com
jCenter for High Altitude Medicine, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
kDepartment of Bioengineering, Lehigh University, Bethlehem, PA 18015, USA
First published on 24th November 2025
Precise cell classification is essential in biomedical diagnostics and therapeutic monitoring, particularly for identifying diverse cell types involved in various diseases. Traditional cell classification methods, such as flow cytometry, depend on molecular labeling, which is often costly, time-intensive, and can alter cell integrity. Real-time microfluidic sorters also impose a sub-ms decision window that existing machine-learning pipelines cannot meet. To overcome these limitations, we present a label-free machine learning framework for cell classification, designed for real-time sorting applications using bright-field microscopy images. This approach leverages a teacher–student model architecture enhanced by knowledge distillation, achieving high efficiency and scalability across different cell types. Demonstrated through a use case of classifying lymphocyte subsets, our framework accurately classifies T4, T8, and B cell types with a dataset of 80 000 pre-processed images, released publicly as the LymphoMNIST package for reproducible benchmarking. Our teacher model attained 98% accuracy in differentiating T4 cells from B cells and 93% accuracy in zero-shot classification between T8 and B cells. Remarkably, our student model operates with only 5682 parameters (∼0.02% of the teacher, a 5000-fold reduction), enabling field-programmable gate array (FPGA) deployment. Implemented directly on the frame-grabber FPGA as the first demonstration of in situ deep learning in this setting, the student model achieves an ultra-low inference latency of just 14.5 µs and a complete cell detection-to-sorting trigger time of 24.7 µs, delivering 12× and 40× improvements over the previous state of the art in inference and total latency, respectively, while preserving accuracy comparable to the teacher model. This framework establishes the first sub-25 µs ML benchmark for label-free cytometry and provides an open, cost-effective blueprint for upgrading existing imaging sorters.
Recent advancements in ML have revolutionized cell classification by offering innovative solutions to circumvent the limitations of traditional methods. For instance, deep CNNs have been successfully applied to bright-field images for label-free identification of cell types, with feature fusion approaches integrating morphological patterns across multiple convolutional modules to achieve high accuracy.7 While such specialized approaches show promise, many general ML models perform suboptimally on datasets other than those they were specifically trained on, revealing inadequate generalization and transfer-learning capabilities. Furthermore, training protocols optimized for general image datasets often fail to translate effectively to biological datasets.8,9 Progress is also slowed by the scarcity of large, publicly available bright-field datasets with ground-truth phenotypes, making reproducible benchmarking difficult.
Addressing these challenges requires robust training methodologies tailored specifically for diverse biological image datasets. In this study, we focus on optimized training protocols that achieve high specificity and sensitivity in cell classification. Using lymphocyte classification as a use case, we demonstrate the adaptability and effectiveness of these training recipes, highlighting their potential to extend seamlessly to various cell types and enabling versatile applications across different biological contexts. Specialized expertise in lymphocyte classification remains limited even in well-resourced communities, leading to variability in diagnostic accuracy. This issue is exacerbated in underserved areas, where the lack of access to expert pathology services results in prolonged or erroneous diagnostic outcomes that critically impair patient management. Our ML framework leverages bright-field images to detect cellular morphological features for the cell classification process. By eliminating reliance on molecular labels, this approach reduces human subjectivity, ensures reproducibility, and offers consistent results across different settings. To facilitate community adoption, we release both our training code and the 80 000-image LymphoMNIST dataset as pip-installable packages.
Moreover, to meet the demands of real-time inference, we have implemented a field-programmable gate array (FPGA) version of our optimized student model, achieving ultra-low latency and high throughput. Previous studies have demonstrated cell classification ML inference performance on the order of milliseconds, primarily on GPU and CPU hardware.10–12 The previous state-of-the-art (SOTA) in terms of inference latency implements a deep neural network (DNN) for standing surface acoustic wave cell sorting and achieves an inference latency of approximately 183 µs2 and a full cell detection-to-sorting trigger latency of <1 ms. Leveraging high-level-synthesis tools (hls4ml) and a knowledge-distilled student network with only 5682 parameters (about 0.02% of the 28 M-parameter teacher, a 5000-fold reduction), we achieve the first frame-grabber-resident deep-learning implementation that fits within this strict latency envelope.
By leveraging our ML framework in a use case involving the classification of T4, T8, and B cells, we have achieved remarkable accuracy improvements. Our teacher model demonstrates approximately 98% accuracy in classifying T4 cells from B cells and achieves about 93% accuracy in zero-shot classification of T8 vs. B cells. Employing knowledge-distillation (KD) techniques, our student 2 model attains accuracy close to that of the teacher model with just about 0.02% of its parameters. The FPGA implementation of the student model further enhances processing speed, reducing inference latency to just 14.5 µs. This improvement in processing speed facilitates the real-time analysis and accurate sorting of T and B cells, significantly advancing their rapid classification in clinical settings. With these insights and results in place, the core achievements and contributions of our study are summarized in the following research highlights:
(1) Dataset: we present a dataset of 80 000 images, which supports the training and validation of our models. The data are freely available via the pip-installable LymphoMNIST package for immediate benchmarking.
(2) Models: we publish detailed recipes for a high-capacity teacher and a KD-trained student with an in-house, lightweight architecture tuned for bright-field cell images, achieving 5000-fold parameter compression (5682 params, 0.02% of the teacher) while retaining F1 > 0.97.
(3) Transfer learning: we demonstrate the transfer-learning capability for T8 versus B cell classifications, indicating that the model can perform zero-shot inference and can be further tuned to detect other lymphocyte cell types.
(4) In situ FPGA implementation: we deploy our student model directly on the frame-grabber FPGA, eliminating PCIe transfer overhead and reducing inference latency from the 183 µs previous SOTA and 325 µs on GPU to just 14.5 µs, a 12× and 22× improvement, respectively. Thus, we institute a new SOTA real-time deep-learning benchmark and implementation for real-time cell sorting and rapid classification.
The LymphoMNIST dataset comprises 80 000 high-resolution lymphocyte images, each with a resolution of 64 × 64 pixels (Fig. 1(a)). These images are categorized into three primary classes: B cells, T4 cells, and T8 cells (Fig. 1(b and c)). To support the development and evaluation of machine learning models, the dataset is partitioned into training, validation, and testing sets in an 80-10-10 split, resulting in 64 000 images for training and 8000 images each for testing and validation (Fig. 1(e)). To enhance accessibility and usability, we have developed a pip-installable package that allows researchers to seamlessly download the dataset and incorporate it into their experimental workflows.13 The images in the dataset were captured under diverse environmental conditions, including variations in lighting and camera settings, to introduce a realistic level of complexity for algorithm development. These conditions are designed to simulate the variability encountered in real-world scenarios, challenging models to generalize effectively. Furthermore, the dataset includes images from both young (65%) and aged (35%) mice to account for age-specific cellular variability, a factor that enhances the model's ability to generalize across different biological conditions. The collection spanned 18 months across four seasons, ensuring that environmental fluctuations such as controlled humidity (±5%) and temperature (±2 °C) were captured, further contributing to dataset diversity. Performance benchmarks of various classical models, such as Decision Tree (DT), Gradient Boosting Classifier (GBC), Linear Support Vector Classifier (LSVC), Logistic Regression (LR), Random Forest (RF), and K-Nearest Neighbors Classifier (KNC), applied to the dataset are detailed in the SI. Accuracy metrics for these models are presented in Fig. 1(d), providing insights into the dataset's applicability for machine learning tasks.
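As a concrete illustration of the intended workflow, the sketch below shows how the dataset could be pulled into a PyTorch training pipeline. The `LymphoMNIST` class name, module path, and constructor arguments are assumptions made for illustration; the pip package's actual API may differ.

```python
# Illustrative sketch: loading the LymphoMNIST dataset into a PyTorch pipeline.
# The `LymphoMNIST` class name, module path, and constructor arguments are
# assumptions for illustration; consult the pip package for the documented API.
from torch.utils.data import DataLoader
from torchvision import transforms

from LymphoMNIST.LymphoMNIST import LymphoMNIST  # assumed import path

transform = transforms.Compose([
    transforms.ToTensor(),  # 64 x 64 bright-field frames as tensors
])

# Assumed constructor: downloads the data on first use and selects the split
train_set = LymphoMNIST(root="./data", split="train", transform=transform, download=True)
val_set   = LymphoMNIST(root="./data", split="val",   transform=transform, download=True)
test_set  = LymphoMNIST(root="./data", split="test",  transform=transform, download=True)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)
```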
We observed that increasing the image size from the original 64 × 64 pixels in the LymphoMNIST dataset to 120 × 120 pixels improved both training and validation accuracy. This larger size allowed the teacher network (TN) to capture more spatial information, enhancing feature extraction. The choice of image size is closely tied to the depth of the architecture, as deeper models like ResNet50 can leverage larger feature maps for improved performance, as noted in previous research.15,16 However, further increasing the size led to overfitting due to the model's increased complexity. Thus, we standardized all images to 120 × 120 pixels to achieve an optimal balance between feature learning and generalization (Table 1).
To improve generalization and reduce overfitting, we employed a range of data augmentation techniques, including random flips, rotations, scaling, translations, shearing, contrast adjustments, hue and saturation adjustments, and Gaussian blur. The choice and intensity of these augmentations must be carefully balanced depending on both the complexity of the model and the amount of available data. For complex models like ResNet50, stronger augmentations can introduce sufficient variability, preventing the model from overfitting by helping it generalize better across the dataset.17 However, when the dataset is limited, applying overly strong augmentations can introduce excessive noise, which may degrade performance by causing the model to fit irrelevant or spurious patterns, particularly in tasks with high-dimensional latent spaces (Fig. 2(a)).18 In such cases, it can be more effective to use a less complex model that is better suited to the smaller dataset, as it reduces the risk of overfitting to noise and irrelevant patterns in the training data.19 The dataset exhibited a class imbalance between B cells and T4 cells. To address this, we employed a weighted random sampler during training to ensure that the underrepresented classes were adequately sampled. This approach allowed the model to learn distinguishing features for both classes effectively, preventing bias towards the majority class (Fig. 2(c)).
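A minimal sketch of an augmentation pipeline and class-balanced sampler along these lines is shown below. The specific augmentation magnitudes are illustrative choices rather than the exact values used in this work, and `train_set` is assumed to be a PyTorch dataset returning (image, label) pairs.

```python
# Hedged sketch of the augmentation pipeline and class-balanced sampling described
# above; the magnitudes (degrees, jitter, blur sigma, etc.) are illustrative.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((120, 120)),                 # upscale 64x64 frames to 120x120
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1),
                            scale=(0.9, 1.1), shear=10),
    transforms.ColorJitter(contrast=0.2, saturation=0.2, hue=0.05),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
    transforms.ToTensor(),
])

# Weighted random sampler: draw under-represented classes more often so each
# mini-batch is approximately class balanced.
labels = torch.tensor([label for _, label in train_set])   # assumed (image, label) dataset
class_counts = torch.bincount(labels)
sample_weights = (1.0 / class_counts.float())[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights),
                                replacement=True)

train_loader = DataLoader(train_set, batch_size=128, sampler=sampler, num_workers=4)
```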
The TN model achieved a training accuracy of approximately 97% and a validation accuracy of approximately 98% after 70 epochs (Fig. 2(a)). The close alignment between the training and validation accuracies indicates strong generalization without significant overfitting. Notably, the validation accuracy occasionally surpassed the training accuracy, likely due to the extensive augmentations applied to the training data, which were not applied to the validation set. Fig. 2(b) shows the Receiver Operating Characteristic (ROC) curve, which highlights the model's strong discriminatory capability between B cells and T4 cells, with a high Area Under the Curve (AUC) for both the training and validation datasets. The confusion matrix in Fig. 2(c) demonstrates high true positive rates and low false positive rates for both classes. Finally, the t-distributed Stochastic Neighbor Embedding (t-SNE) visualization (Fig. 2(d)) provides a visual representation of the separation between B cells and T4 cells in the latent feature space. The minimal overlap between clusters further confirms the model's ability to effectively capture distinguishing features between the two cell types, making it a reliable tool for cell classification in biomedical applications.
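The metrics shown in Fig. 2 can be reproduced with standard library routines; the sketch below outlines one way to compute the AUC, confusion matrix, and a 2-D t-SNE embedding of the latent space with scikit-learn. Here `model`, `val_loader`, and the `extract_features` hook are assumptions standing in for the actual training code.

```python
# Sketch of the evaluation metrics in Fig. 2 (ROC/AUC, confusion matrix, t-SNE of
# the latent space). `model`, `val_loader`, and `extract_features` are assumed.
import torch
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.manifold import TSNE

model.eval()
probs, preds, labels, features = [], [], [], []
with torch.no_grad():
    for x, y in val_loader:
        logits = model(x)
        p = torch.softmax(logits, dim=1)
        probs.append(p[:, 1].cpu())             # probability of the positive class
        preds.append(p.argmax(dim=1).cpu())
        labels.append(y)
        features.append(model.extract_features(x).cpu())  # assumed penultimate-layer hook

probs, preds = torch.cat(probs).numpy(), torch.cat(preds).numpy()
labels, features = torch.cat(labels).numpy(), torch.cat(features).numpy()

print("AUC:", roc_auc_score(labels, probs))
print("Confusion matrix:\n", confusion_matrix(labels, preds))
embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)  # 2-D latent map
```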
Our experiments reconfirmed that data-mixing augmentation techniques, such as CutMix and MixUp, substantially enhance KD performance. Conversely, other image-based augmentations, including random flipping and shearing, degraded the accuracy of the distilled student model when applied inconsistently between teacher and student,23 as demonstrated by Beyer et al.22 Maintaining identical image crops and augmentation strategies for both teacher and student networks during training was crucial to ensure consistent learning and effective knowledge transfer without misalignment in data representation.22
We observed that the student 2 model attained significantly higher accuracy when trained using KD compared to training from scratch. This outcome aligns with prior research indicating that KD enables smaller models to focus on relevant information by utilizing outputs from a larger teacher model, including softened labels, as guidance.24 Such guidance allows the student model to capture complex patterns by receiving nuanced data representations, which may be challenging to learn independently, especially in resource-constrained scenarios.25 Furthermore, studies have demonstrated that KD improves the ability of student models to capture high-level abstractions that are difficult to learn without teacher supervision.26 For instance, Hinton et al.27 showed that soft targets enhance student model performance by conveying richer information about class relationships.
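As a concrete reference for the soft-target formulation discussed above, the snippet below gives a minimal PyTorch distillation loss in the style of Hinton et al.; the temperature T and weighting alpha are illustrative hyperparameters rather than the values used in this study.

```python
# Minimal knowledge-distillation loss in the style of Hinton et al.: a weighted sum
# of cross-entropy on hard labels and KL divergence between temperature-softened
# teacher and student distributions. T and alpha are illustrative hyperparameters.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7):
    # Hard-label term: standard cross-entropy against ground truth
    ce = F.cross_entropy(student_logits, targets)
    # Soft-label term: KL divergence between softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * kl + (1.0 - alpha) * ce

# Per batch: the teacher runs in eval/no-grad mode, and both networks must see the
# *same* augmented crop (consistent teaching), as emphasised above.
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = kd_loss(student(images), teacher_logits, labels)
```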
The performance evaluation of the student networks, shown in Fig. 3, reveals their accuracy on the training and validation datasets. Confusion matrices in Fig. 3(a) and (b) indicate that student 1 slightly outperforms student 2, although student 2 demonstrates strong generalization capabilities in more challenging classes, suggesting that KD effectively maintains robustness in smaller models.28 Fig. 3(c) presents a t-SNE visualization for student 1, showing distinct clusters that signify successful feature extraction and class differentiation. ROC curves (Fig. 3(d)) for both models illustrate high discriminative performance, with AUC values of 98% for student 1 and 96% for student 2, respectively. Comparative analysis of model parameters and latency in Fig. 3(e) and (f) reveals that student 2 operates with only 0.02% of the teacher model's parameters, achieving a latency of ∼0.325 ± 0.004 ms. This is significantly lower than that of student 1 (∼2.11 ± 0.03 ms) and the teacher model (∼5.05 ± 0.06 ms), with the FPGA implementation further reducing latency to ∼0.0145 ± 0.001 ms.
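For context on how GPU latencies such as these are typically measured, the sketch below times batch-size-1 inference with warm-up and device synchronization; the input shape and run counts are illustrative assumptions, and absolute numbers depend entirely on the hardware.

```python
# Sketch of per-image GPU inference latency measurement at batch size 1 with
# warm-up and synchronization; input shape and run counts are illustrative.
import time
import torch

def measure_latency_ms(model, input_shape=(1, 1, 120, 120), n_warmup=50, n_runs=500):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(n_warmup):              # warm-up: cuDNN autotuning, caches
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs * 1e3   # milliseconds per inference
```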
Fig. 4 Model performance evaluation. (a)–(d) present the comparative assessment of accuracy, precision, recall, and F1 score.
To further assess the generalizability of the transfer learning approach beyond the specific T8 vs. B cell classification task, we evaluated our model on an external dataset,29 which includes additional hematological cell types. Our results demonstrated a ∼1% accuracy boost for T vs. Leukemia cell classification when using our pretrained teacher model as the starting point, compared to an ImageNet-pretrained ResNet50. This indicates that leveraging prior domain-specific knowledge enhances model adaptability across different cell types and pathological conditions, reinforcing the robustness of our transfer learning strategy.
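The sketch below illustrates the kind of initialization this comparison implies: starting a ResNet50 from a domain-pretrained teacher checkpoint instead of ImageNet weights and fine-tuning on the new task. The checkpoint filename, head size, and freezing policy are illustrative assumptions.

```python
# Sketch of the transfer-learning setup: initialise a ResNet50 from a
# domain-pretrained teacher checkpoint rather than ImageNet, swap the head for the
# new task, and fine-tune. Checkpoint name and freezing policy are illustrative.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)           # two classes for the new task

state = torch.load("teacher_t4_vs_b.pt", map_location="cpu")        # hypothetical checkpoint
backbone_state = {k: v for k, v in state.items() if not k.startswith("fc.")}
model.load_state_dict(backbone_state, strict=False)     # load backbone weights only

for name, p in model.named_parameters():                # optionally freeze early blocks
    if not (name.startswith("layer4") or name.startswith("fc")):
        p.requires_grad = False

optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
```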
Fig. 4 illustrates the model's performance through comparative assessments of accuracy, precision, recall, and F1 score across panels (a) to (d). The adaptability of the model to the new classification task, with minimal risk of overfitting and improved generalization capabilities, highlights the practical application of transfer learning in biomedical image analysis. Future research directions include extending these methodologies to other cell types or imaging modalities and combining them with continuous learning strategies or domain adaptation to enhance model performance under diverse imaging conditions.
To achieve the latencies required for real-time control in cell sorting, an alternative platform is required. FPGAs are devices characterized by their flexibility and parallelism, and they provide a suitable balance between throughput and latency for real-time applications. They primarily consist of an array of reconfigurable hardware blocks, such as multipliers, logic blocks, and memories, that can be used to implement an algorithm directly as a circuit, thereby forgoing the stack of software and drivers required for a GPU or CPU implementation. Additionally, the emergence of high-level synthesis (HLS) technologies, which enable the synthesis of standard C++ code into register-transfer-level hardware descriptions, means that deploying algorithms to custom hardware is easier than ever.
Furthermore, tools like hls4ml facilitate the process of deploying neural networks to FPGA hardware and have been shown capable of achieving nanosecond-level latencies for machine learning inference.32 hls4ml translates most neural network architectures written in a high-level deep learning framework such as PyTorch or Keras/TensorFlow into an HLS representation using dictionary configuration files and prewritten layer templates for all common HLS synthesis tools, including Xilinx, Intel, and Siemens.33–35
hls4ml provides multiple avenues of optimization that empower us to meet this project's latency constraints. First and foremost, previous work has demonstrated that neural network parameters can be quantized to a lower bit width with minimal impact on overall accuracy.36 This finding is critical for enabling the deployment of neural networks on resource-constrained devices. In this implementation, we use hls4ml to quantize the student 2 network with layer-level granularity while still achieving 86% accuracy. We also leverage hls4ml's "reuse factor" hyperparameter to fine-tune the level of parallelization applied to each layer of the network. The value of this parameter indicates the maximum number of operations that can share a given physical instance of a resource. This feature allows us to achieve the ultra-low latencies required for this application while remaining within the resource constraints of the FPGA device. The effects of varying this hyperparameter can be illustrated as a Pareto frontier, where a high reuse factor results in low resource usage but high latency, and a low reuse factor results in high resource usage but low latency.37 In general, we find that implementing the dense layers with a higher reuse factor of 25, and the two convolutional layers with lower reuse factors of 1 and 2, respectively, yields an optimal balance between latency and resource usage.
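A hedged sketch of this configuration step is shown below: per-layer precision and reuse factors are set through hls4ml's dictionary config before conversion. The layer names, fixed-point widths, FPGA part number, and backend are illustrative assumptions rather than the exact project settings, and `student2` stands in for the trained Keras model of the student network.

```python
# Sketch of layer-level quantization and reuse-factor tuning with hls4ml.
# Layer names, precisions, part number, and backend are illustrative assumptions;
# `student2` is the trained Keras model of the student network.
import hls4ml

config = hls4ml.utils.config_from_keras_model(student2, granularity="name")

# Example per-layer fixed-point precision (layer-level quantization)
config["LayerName"]["conv2d"]["Precision"] = "ap_fixed<8,3>"
config["LayerName"]["dense"]["Precision"] = "ap_fixed<10,4>"

# Reuse factor: higher values share multipliers (less area, higher latency);
# convolutions get low reuse for speed, the dense layer a higher reuse of 25.
config["LayerName"]["conv2d"]["ReuseFactor"] = 1
config["LayerName"]["conv2d_1"]["ReuseFactor"] = 2
config["LayerName"]["dense"]["ReuseFactor"] = 25

hls_model = hls4ml.converters.convert_from_keras_model(
    student2,
    hls_config=config,
    io_type="io_stream",          # streaming IO suits a frame-grabber data path
    backend="Vivado",             # assumed HLS backend
    part="xcku035-fbva676-2-e",   # hypothetical FPGA part
    output_dir="student2_hls",
)
hls_model.compile()               # C simulation; hls_model.build() runs full synthesis
```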
Apart from latency, another challenge to enabling real-time control presents itself in the substantial input/output (IO) overhead that we would incur when utilizing a CPU or any external PCIe GPU or FPGA accelerator. Therefore, we endeavored to place our student 2 model computation as close to the edge as possible in our experiment to minimize this overhead. Our experimental setup consists of a Phantom S710 high-speed streaming camera aimed at the microfluidic channel through the microscope camera port, paired with the Euresys frame grabber PCIe card. This frame grabber card is responsible for reading out and processing the raw camera sensor data before transmitting frames back to the host computer. Frame grabbers typically implement this processing on an onboard FPGA chip. Conveniently, Euresys offers a tool, CustomLogic, that enables users to deploy custom image processing to their frame grabber FPGA.38 A separate framework, Machine Learning for Frame Grabbers (ml4fg), has also been developed specifically to bridge the gap between CustomLogic and hls4ml and enables seamless deployment of neural network models to Euresys frame grabbers.39 Thus, we leverage all three of these existing tools to deploy student 2 directly in situ in the data readout path of the frame grabber, thereby circumventing the need for off-chip compute and completely eliminating all associated IO overhead while achieving ultra-low latency inference. Our full workflow from Python model to bitstream deployment is illustrated in Fig. 5.
We then empirically benchmark the latency of the FPGA implementation of student 2 by monitoring the internal communication protocol used by the neural network intellectual property (IP) block. We utilize the frame grabber's TTL IO to output a square wave whose high time denotes inference latency, which we measure with an oscilloscope. Fig. 6(a) exhibits the results of this latency test, showing a model inference latency of just 14.5 µs. Additionally, we observe that inference begins approximately 10.0 µs after the trigger edge. Given a 2 µs exposure time, our model completes inference approximately 22.5 µs after image exposure is finished. The model output writeout procedure takes an additional 0.2 µs. The writeout consists of the model's two-bit output indicating the cell output class, and can be expanded or adapted for any cell classification task or communication protocol. Aggregating these constituent components yields a full cell detection-to-sorting trigger time of 24.7 µs. By reducing inference latency to under 25 µs, our pipeline shifts the limiting factor from computation to fluidics. This margin not only fits comfortably within the ∼1 ms actuation window of current acoustofluidic sorters,3 but also opens the door to applications previously inaccessible to image-based ML, such as sorting extracellular vesicles or bacteria, where transit times are an order of magnitude shorter than for mammalian cells. As shown in Fig. 6(b), we pipeline neural network inference with the exposure and readout processes to accelerate the algorithm to a throughput of 81 kfps in our implementation. This benchmark far exceeds our GPU's best performance at a batch size of 1. Note that in Fig. 6(a) we capture at 50 kfps so that consecutive inference traces do not overlap, for readability purposes.
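The timing budget above can be cross-checked with simple arithmetic; the short sketch below recomputes the detection-to-trigger total and the pipelined frame interval from the quoted figures, assuming (as the text implies) that the trigger edge marks the start of exposure.

```python
# Recomputing the latency budget quoted above from its components (values in µs).
exposure      = 2.0    # camera exposure time
trigger_to_nn = 10.0   # time from trigger edge until inference begins
inference     = 14.5   # FPGA inference latency of student 2
writeout      = 0.2    # two-bit class output writeout

detection_to_trigger = trigger_to_nn + inference + writeout        # 24.7 µs total
after_exposure       = trigger_to_nn - exposure + inference        # 22.5 µs after exposure ends

# With exposure/readout pipelined against inference, throughput is set by the
# slowest pipeline stage rather than the end-to-end latency; 81 kfps corresponds
# to a stage interval of roughly 12.3 µs.
stage_interval_us = 1e6 / 81_000
print(detection_to_trigger, after_exposure, round(stage_interval_us, 1))
```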
As shown in Fig. 6(c), our implementation of student 2 consumes the majority of the FPGA resources. DSPs, the resource primarily used to implement neural network multiply accumulate operations, are most heavily utilized because we parallelized the network to the limit of the chip's resource capacity with hls4ml’s reuse factor hyperparameter. The high-speed camera's communication protocol IP imposes an additional resource tax, totaling about 95% DSP usage for the full design. A more granular breakdown of the neural network resource consumption is shown in Fig. 6(d). Most notably, the second convolutional layer consumes far more resources than any other layer due to the higher number of input channels, which results in more multiply-accumulates. Both convolutional layers consume the most lookup tables as they require more complex control logic to manage the sliding kernel window and to direct data between buffers.
By optimizing our student 2 model and leveraging existing tools like hls4ml for deployment in situ on low-cost off-the-shelf frame grabber FPGAs, we are able to bypass data transfer overhead and accelerate our deep learning algorithm to achieve a new SOTA 14.5 µs inference latency and 24.7 µs full cell detection-to-sorting trigger time for cell classification in real-time sorting applications (see Table 2).
The publicly available LymphoMNIST dataset comprises 80 000 images split into training, testing, and validation sets as described in Results.
Supplementary information is available. See DOI: https://doi.org/10.1039/d5dd00345h.