Open Access Article
Xingyu Jin a, Jing Yang b, Xiaorui Jiang c, Zhenqing Li *a, Jinrong Shen d, Zhiheng Yu a, Cunliang Yang a, Fengli Huang e, Dunlu Peng a, Yoshinori Yamaguchi f and Jijun Feng *a
aShanghai Key Laboratory of Modern Optical System, Engineering Research Center of Optical Instrument and System (Ministry of Education), School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China. E-mail: fjijun@usst.edu.cn; zhenqingli@163.com; Tel: +86 13917961036
bFaculty of Engineering, Anhui Sanlian University, Hefei, 230000, China
cNational Key Laboratory of Electromagnetic Space Security, Tianjin 300308, China
dState Key Laboratory of ASIC and System, Fudan University, Shanghai 200433, China
eThe Key Laboratory of Medical Electronics and Digital Health of Zhejiang Province, Jiaxing University, Jiaxing 314001, China
fComprehensive Research Organization, Waseda University, Tokyo 162-0041, Japan
First published on 2nd April 2025
The large-field rapid nucleic acid concentration measurement system is capable of achieving one-time gene chip imaging with high resolution. However, it encounters challenges in the precise detection of positive microchambers, caused by factors such as reagent residue, uneven lighting, and environmental noise. Herein, we propose an improved, lightweight algorithm based on You Only Look Once (YOLOv5) for detecting positive microchambers. We determined appropriate detection scales based on the target size distribution and utilized the bidirectional feature pyramid network (BiFPN) for efficient multi-scale feature fusion. To reduce model size without sacrificing performance, GhostConv, C3Ghost, and a simple, parameter-free attention module (SimAM) were integrated into the network, followed by network pruning. The improved YOLOv5 model was trained on a self-built dataset and, within self-developed software, employed a partitioned fusion prediction strategy to detect large-field ddPCR images. In contrast to single-stage lightweight object detection algorithms, our model occupies a mere 1.5 MB while achieving 99.5% precision, 99.5% recall, and 78.1% mAP(0.5:0.95), significantly reducing the system's demand for computing resources without compromising efficiency and accuracy.
000 cylindrical microchambers, a white laser to improve dye excitation efficiency, and a large-field optical system for rapid imaging of the chip within 15 seconds, eliminating the need for image stitching technology. The system effectively resolved the contradiction between target template resolution and field-of-view range, achieving one-time gene chip imaging with high resolution. However, when processing images using ImageJ software,7 accurately counting low-quality positive microchambers remains a challenge due to factors such as residual reagents, uneven lighting, and environmental noise. Current methods depend on pixel thresholds, which approximate the distribution of fluorescence based on grayscale values.8 Although multiple heuristic optimization techniques have been applied to refine threshold determination,9 minor threshold adjustments can lead to significant variations in the count of “positive” microchambers, especially when dealing with large numbers of microchambers. Therefore, improving the algorithm is crucial for an accurate and efficient large-field nucleic acid concentration measurement system.
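For reference, a minimal sketch of the conventional threshold-based counting described above is shown below (using OpenCV); the threshold and minimum-area values are illustrative assumptions, not settings from this work, and they illustrate why small threshold shifts can change the count.

```python
import cv2
import numpy as np

def count_positive_by_threshold(gray_img: np.ndarray, thr: int = 120, min_area: int = 40) -> int:
    """Baseline sketch: binarize the fluorescence image at a fixed grayscale
    threshold and count connected components above a minimum area.
    `thr` and `min_area` are illustrative values, not taken from the paper."""
    _, binary = cv2.threshold(gray_img, thr, 255, cv2.THRESH_BINARY)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary.astype(np.uint8))
    # Label 0 is the background; discard tiny noise blobs below min_area.
    return sum(1 for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_area)
```

Because every microchamber whose brightness lies near the threshold flips between positive and negative when `thr` is nudged, the total count can vary considerably across tens of thousands of chambers.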
In recent years, advancements in deep learning have spurred continuous innovation in ddPCR image processing algorithms. Mask R-CNN has been employed to perform instance segmentation and boundary fitting of droplets, resulting in highly precise determination of microfluidic droplet size distribution.10 Improved methods combining the Hough transform with deep learning techniques, such as the Attention DeepLabV3+ model11 and convolutional neural networks,12 have been utilized to accurately segment droplets of various sizes, even in images with low fluorescence intensity and blur. Additionally, Wei et al. first applied the Segment Anything Model for zero-shot microchamber segmentation in nucleic acid detection.13 These methods achieved pixel-by-pixel differentiation between positive microchambers and background through image segmentation.14 Conversely, object detection prioritizes the detection and localization of positive microchambers.15 Recent research has utilized the YOLOv5s model to achieve binary classification and accurate identification of negative and positive droplets against complex backgrounds.16 Another study optimized the YOLOv5m model using a region proposal network, enabling real-time automatic detection and classification of fluorescent images with good generalization performance.17 These algorithms demonstrated the significant advantages of deep learning in processing ddPCR images. However, when dealing with higher-resolution images over a larger field of view, existing models become inefficient and complex. Although cloud servers or local high-performance workstations can provide powerful computing resources, they fail to meet the requirements for portability and independence in detection. Edge devices, on the other hand, are limited by computational, storage, and power constraints, making it challenging to support the efficient processing of complex models.
To address these challenges, we proposed an improved lightweight algorithm based on YOLOv5. By adjusting the feature fusion method, introducing the GhostConv, C3Ghost, and SimAM modules, and employing network pruning together with a partitioned fusion prediction strategy, this method significantly reduced reliance on computing resources, thereby enabling lightweight detection of high-resolution ddPCR images on edge devices. The deep learning model was trained using just two large-field images, with the X-Anylabeling software and the Albumentations library used for dataset labeling and augmentation. To assess the efficacy of each enhancement, ablation experiments were conducted, and the improved algorithm was comprehensively compared with five mainstream single-stage object detection algorithms, demonstrating its outstanding performance.
000 microchambers in the microfluidic chip (28 mm × 18 mm), which was imaged by a CMOS camera (MARS-3140-3GM-P-03). Each chamber holds a 0.81 nL droplet that occupies about 270 pixels in the image. Positive droplets typically appear as clear fluorescence signals, with a radius of approximately 9 pixels, fluctuating by about 3 pixels. To aid in model training, the effective area of each image was cropped to 3900 × 2400 pixels and segmented into 140 images (320 × 320).
To efficiently annotate images, X-Anylabeling software was employed for semi-supervised labeling.18 After manually annotating a small subset of images to fine-tune the base model, we utilized the software's auto-labeling function to complete the remaining annotations. Finally, detailed manual proofreading was performed to ensure the accuracy and consistency of the labels. YOLO format files with category and coordinates were directly generated, and a corresponding VOC format dataset was compiled using a conversion program. Moreover, data diversity was enhanced with the Albumentations library's localized augmentation, which incorporated flipping, transposing, foreign object occlusion, camera noise, random brightness and contrast, and motion blur, expanding the dataset to 280 images. Although the dataset size remains smaller than typical deep learning benchmarks, these techniques mimic realistic experimental variations such as reagent distribution irregularities, chip impurities, lighting inconsistencies, and camera defocus and jitter, thereby enhancing the model's ability to generalize to unknown conditions. Although our experimental results on the current dataset are promising, continuous data collection, periodic model retraining, and rigorous cross-validation will further help ensure the model's robustness and reliability across various real-world scenarios. For model training and evaluation, we randomly split the dataset into training, validation, and test sets in a 7:2:1 ratio. Fig. 1 shows six examples randomly selected from the self-built dataset.
Fig. 1 Sample images from the self-built dataset. (A)–(F) represent six examples randomly selected from the self-built dataset.
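A minimal sketch of an augmentation pipeline of the kind described above is given below; it uses the Albumentations API with YOLO-format bounding boxes, and all probabilities and limits are illustrative assumptions rather than the exact settings used for this dataset.

```python
import albumentations as A

# Each transform mirrors one of the augmentation categories listed above;
# the probabilities and limits are illustrative, not the paper's settings.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),              # flipping
        A.VerticalFlip(p=0.5),
        A.Transpose(p=0.5),                   # transposing
        A.CoarseDropout(p=0.3),               # foreign-object occlusion
        A.GaussNoise(p=0.3),                  # camera noise
        A.RandomBrightnessContrast(p=0.5),    # uneven illumination
        A.MotionBlur(blur_limit=5, p=0.3),    # defocus / jitter
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: augmented = transform(image=img, bboxes=yolo_boxes, class_labels=labels)
# augmented["image"] and augmented["bboxes"] then form one additional training sample.
```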
We assessed the network's performance using metrics such as precision (P = TP/(TP + FP)), recall (R = TP/(TP + FN)), and mean average precision at intersection over union (IoU) thresholds from 0.5 to 0.95 (mAP(0.5:0.95) = (1/C)∑APi). Precision (P) measures the accurate detection rate of positive microchambers, recall (R) captures the proportion of actual positives detected, whereas mAP(0.5:0.95) reflects the model's overall performance across a range of IoU thresholds, where true positives (TP) represent the count of correct detections, false positives (FP) denote the number of incorrect detections, false negatives (FN) are the undetected actual objects, APi is the average precision at each IoU threshold and C is the number of IoU thresholds. In addition, the number of parameters and model size directly affect the potential applications of the model within limited storage space. Floating-point operations (FLOPs) influence the inference speed and energy efficiency. These comprehensive evaluation metrics have provided valuable references for subsequent optimizations.
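As a concrete illustration of these definitions, the short sketch below computes P, R, and mAP(0.5:0.95) from raw counts; the numbers in the usage example are made up for demonstration and are not results from this work.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """P = TP/(TP + FP), R = TP/(TP + FN), guarding against empty denominators."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def map_50_95(ap_per_threshold: list[float]) -> float:
    """mAP(0.5:0.95): mean of the AP values evaluated at the C = 10 IoU
    thresholds 0.50, 0.55, ..., 0.95."""
    return float(np.mean(ap_per_threshold))

# Illustrative numbers only (not measured values from the paper):
p, r = precision_recall(tp=995, fp=5, fn=5)
ap = [0.99, 0.97, 0.95, 0.92, 0.88, 0.82, 0.74, 0.63, 0.48, 0.28]
print(f"P={p:.3f}  R={r:.3f}  mAP(0.5:0.95)={map_50_95(ap):.3f}")
```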
In large-field ddPCR images, a single microchamber can occupy up to 270 pixels, and the distribution of annotation box sizes is shown in Fig. S2.† To better match the target sizes, the model's backbone was optimized by removing the Conv module at layer 7 and the C3 module at layer 8, which excluded the P5 feature map computation. Subsequently, the neck network's output channels were reduced by half, and feature maps at layers (2,14), (4,10), (8,20), and (12,17) were strategically interconnected. Object detection was conducted by the head network on resized feature maps of 160 × 160, 80 × 80, and 40 × 40, matching the P2, P3, and P4 scales. The bidirectional feature pyramid network (BiFPN)22 was also incorporated for superior feature fusion (Fig. 2).
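The core operation that BiFPN adds over plain concatenation is a learnable, fast-normalized weighting of the fused inputs. The PyTorch sketch below shows that fusion step in isolation; the tensor shapes and channel counts in the usage example are assumptions for illustration, not values from the model configuration.

```python
import torch
import torch.nn as nn

class BiFPNFusion(nn.Module):
    """Fast normalized fusion of same-shape feature maps, as used in BiFPN:
    each input receives a learnable non-negative weight, and the weights are
    normalized so the fused map is a convex combination of the inputs."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):
        w = torch.relu(self.w)             # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)       # fast normalized fusion
        return sum(wi * f for wi, f in zip(w, feats))

# Example: fuse a top-down feature map with a lateral one at the P3 scale
# (the 128-channel, 80 x 80 shapes here are illustrative assumptions).
fuse = BiFPNFusion(num_inputs=2)
out = fuse([torch.randn(1, 128, 80, 80), torch.randn(1, 128, 80, 80)])
```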
The traditional method for feature extraction uses multiple convolutional kernels to convolve across all input feature map channels, demanding extensive parameters and computational power, particularly with large images or numerous kernels.23 To improve efficiency, we incorporated the GhostConv and C3Ghost modules from GhostNet,24 a lightweight deep learning framework, followed by network pruning.25 The computational efficiency of GhostConv surpasses traditional convolution by approximately X times, where X represents the number of linear operations involved in the ghost convolution.26 The GhostBottleneck, a lightweight bottleneck structure built from GhostConv, replaces the bottleneck of the cross stage partial module, leading to the novel C3Ghost module (Fig. 2B). This change maintains the number of output feature maps while notably reducing model size. These modules, which are part of the widely used GhostNet architecture, are becoming increasingly popular in resource-constrained environments. The significant reduction in the number of parameters also reduces the potential risk of overfitting. However, the reduction in parameters often causes a decline in performance. To maintain accuracy, a simple, parameter-free attention module (SimAM)27 was introduced (Fig. 2D), because it performs better than the most representative SE and CBAM attention modules in most cases.28 SimAM assesses the importance of neurons by optimizing an energy function and generates attention weights based on local self-similarity, thereby enhancing critical features and suppressing irrelevant ones. By incorporating the SimAM attention mechanism within the GhostConv framework (Fig. 2C), the resulting SimGhostConv module inherits the lightweight nature of GhostConv while gaining enhanced sensitivity and abstraction capabilities.
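To make the two building blocks concrete, the PyTorch sketch below implements a generic GhostConv, the parameter-free SimAM weighting, and their combination into a SimGhostConv-style module. The hyperparameters (1 × 1 primary convolution, 5 × 5 cheap depthwise operation, SiLU activation) follow common GhostNet and SimAM implementations and are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost convolution: a regular convolution produces half the output
    channels (intrinsic features); a cheap depthwise operation generates the
    remaining ghost features, roughly halving parameters and FLOPs."""
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class SimAM(nn.Module):
    """Parameter-free attention: per-neuron weights from a closed-form energy
    function that measures how much each activation deviates from the channel
    mean; distinctive neurons receive larger weights."""
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x):
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)

class SimGhostConv(nn.Module):
    """GhostConv followed by SimAM weighting, in the spirit of Fig. 2C."""
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        self.ghost = GhostConv(c_in, c_out, k, s)
        self.attn = SimAM()

    def forward(self, x):
        return self.attn(self.ghost(x))

# Example (shapes assumed): y = SimGhostConv(64, 128)(torch.randn(1, 64, 80, 80))
```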
When predicting a gene chip image of 3900 × 2400 resolution, sub-images are initially sized at 640 × 640, with a sliding cropping stride of 600, resulting in 28 sub-images. Each sub-image undergoes individual prediction to obtain detection results. Subsequently, the non-maximum suppression (NMS) algorithm30 merges these results to eliminate duplicate counts in overlapping areas. The merged result represents object detection for the entire high-resolution image. Finally, the number of retained detection boxes gives the counting result, followed by quantitative analysis of target molecule concentration using the Poisson distribution. This strategy enables the model to flexibly utilize existing computational resources during training and inference, significantly reducing the demands on system performance while maintaining operational efficiency and prediction accuracy.
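A simplified sketch of this partitioned fusion prediction and the subsequent Poisson quantification is given below. The detector interface `model(crop) -> (boxes, scores)`, the tiling helper, and the copies-per-nanolitre output are assumptions made for illustration, and the merging step uses torchvision's NMS rather than the paper's exact code.

```python
import numpy as np
import torch
from torchvision.ops import nms

def tile_starts(size: int, tile: int, stride: int) -> list[int]:
    """Sliding-window start positions; the last tile is clamped to the image
    edge so the whole field of view is covered (7 x 4 = 28 tiles for a
    3900 x 2400 image with tile = 640 and stride = 600)."""
    starts = list(range(0, max(size - tile, 0) + 1, stride))
    if starts[-1] + tile < size:
        starts.append(size - tile)
    return starts

def predict_large_image(img, model, tile=640, stride=600, iou_thr=0.45):
    """Partitioned fusion prediction: detect on each crop, shift the boxes back
    to global coordinates, then merge duplicates in the overlaps with NMS.
    `model(crop)` is assumed to return (boxes [N, 4] in xyxy format, scores [N])."""
    h, w = img.shape[:2]
    boxes, scores = [], []
    for y0 in tile_starts(h, tile, stride):
        for x0 in tile_starts(w, tile, stride):
            b, s = model(img[y0:y0 + tile, x0:x0 + tile])
            if len(b):
                boxes.append(b + torch.tensor([x0, y0, x0, y0], dtype=b.dtype))
                scores.append(s)
    if not boxes:
        return torch.empty(0, 4), torch.empty(0)
    boxes, scores = torch.cat(boxes), torch.cat(scores)
    keep = nms(boxes, scores, iou_thr)   # remove duplicate counts in overlaps
    return boxes[keep], scores[keep]

def poisson_concentration(n_positive: int, n_total: int, chamber_volume_nl: float = 0.81) -> float:
    """Poisson estimate of the target concentration (copies per nL):
    lambda = -ln(1 - p), where p is the fraction of positive microchambers."""
    p = n_positive / n_total
    return -np.log(1.0 - p) / chamber_volume_nl
```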
:0.95), while reducing parameters to 1,701,526. Notably, the P5 detection scale exhibited the weakest correlation with positive microchambers, while P2 displayed the strongest correlation, consistent with the distribution of target sizes in the self-built dataset. Table S3† presents the performance comparison among FPN, PANet, and BiFPN. While the introduction of BiFPN increased model parameters by 0.95%, it enriched semantic and detailed information across scales. As a result, the mAP(0.5:0.95) reached 78.5%, surpassing the other methods.
:0.95). When GhostConv and C3Ghost modules were simultaneously introduced in both the backbone and neck networks, the model achieved 99.5% precision, 99.3% recall, and 78.2% mAP(0.5:0.95). At this point, the model size was reduced to 2.9 MB, the optimal value among the previous three sets of experiments. When combined with the SimAM attention mechanism, the model achieved 78.6% mAP(0.5:0.95) while its parameters, size, and FLOPs remained unchanged, yielding the best overall performance. The results indicate that the inclusion of SimAM enables the model to extract more focused features, significantly enhancing its ability to detect small target objects. After network pruning, the model maintains a high mAP(0.5:0.95) of 78.1%, with precision, recall, parameters, size, and FLOPs at 99.5%, 99.5%, 262,047, 1.5 MB, and 2.6 G, respectively. The results of the ablation experiments, as shown in Fig. 4, highlight the consistent performance improvements across the different model variants. Detailed data can be found in Table S5.† Additionally, the confusion matrices for the original and the final improved models are shown in Fig. S3.†
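For readers unfamiliar with the pruning step, the sketch below shows the general idea using PyTorch's built-in pruning utilities. This is an assumption-level illustration of magnitude-based pruning, not the specific procedure or sparsity ratio used to reach the 262,047-parameter model; note that unstructured pruning only zeroes weights, whereas actually shrinking the parameter count requires structured (channel-level) pruning.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_conv_weights(model: nn.Module, amount: float = 0.3) -> nn.Module:
    """Illustrative magnitude-based pruning: zero out the smallest-magnitude
    weights in every Conv2d layer, then bake the masks into the weights.
    The 30% amount is an arbitrary example value."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")   # make the pruning permanent
    return model
```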
To comprehensively assess the improved model, we conducted comparative experiments on several representative single-stage detection algorithms: SSD,31 YOLOv3-tiny, YOLOv5s-ShuffleNetv2,32 YOLOv5s-MobileNetv3,33 and YOLOv5n. As shown in Fig. 6 and Table S6,† our model consistently demonstrates superior performance across various metrics when compared to these algorithms. Additionally, compared to YOLOv5-based methods such as YOLOv5s16 and YOLOv5m,17 our approach achieves higher detection accuracy with a smaller model size and is applicable to high-resolution ddPCR images. Notably, although our experiments were based on a large-field ddPCR image dataset, the proposed method has broad applicability in other biomedical imaging domains. For example, the method has the potential to assist in rare cell identification in low-resolution fluorescence microscopy images by leveraging multi-scale detection and lightweight design. It may also contribute to cancer region detection in high-resolution whole-slide histopathology images through a partitioned fusion strategy. Additionally, it could be applied to microorganism counting in large-field water samples, potentially achieving high precision with low computational cost. Our core innovations are versatile and can be customized for various imaging conditions.
:0.95). Compared with existing models, it achieves both higher detection accuracy and lower complexity, exhibiting superior performance and efficiency. This is significant for the efficient conduct of nucleic acid concentration measurement and related research in fields such as medicine and bioscience. Future research will focus on further improving real-time detection performance and exploring the possibility of applying the method to the counting of flowing fluorescent targets.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5dd00006h