 Open Access Article
Tingwang Tao a, Haining Ji *ab and Bin Liu ab
aSchool of Physics and Optoelectronics, Xiangtan University, Xiangtan, 411105, China
bHunan Engineering Laboratory for Microelectronics, Optoelectronics and System on a Chip, Xiangtan University, Xiangtan, 411105, China. E-mail: sdytjhn@126.com
First published on 13th June 2025
Accurate characterization of nanoparticle size distribution is vital for performance modulation and practical applications. Nanoparticle size measurement in SEM images often requires manual operations, resulting in limited efficiency. Although existing semantic segmentation models enable automated measurement, challenges persist regarding small particle recognition, segmentation accuracy in low-contrast regions, and the need for manual scale bar calibration. Therefore, we propose an improved U-Net model based on attention mechanisms and residual networks, combined with an automatic scale bar recognition algorithm, to enable accurate pixel-to-physical size conversion. The model employs ResNet50 as the backbone network and incorporates a convolutional block attention module (CBAM) to enhance feature extraction for nanoparticles, especially small or low-contrast particles. The results show that the model achieved IoU and F1-score values of 87.79% and 93.50%, respectively, on the test set. The Spearman coefficient between the measured particle sizes and manual annotations was 0.91, with a mean relative error of 4.25%, confirming the accuracy and robustness of the method. This study presents a highly reliable automated method for nanoparticle size measurement, providing an effective tool for nanoparticle analysis and engineering applications.
In recent years, researchers have developed a range of techniques for measuring nanoparticle size, including UV-visible spectrophotometry,6 X-ray diffraction (XRD) analysis,7 and laser diffraction.8 Although these techniques enable indirect measurement, they are frequently accompanied by systematic errors. In contrast, scanning electron microscopy (SEM) has become the preferred method due to its reliability and direct visual characterization. However, manual measurement of nanoparticle sizes from SEM images is a time-consuming and labor-intensive process.
To enable automated and accurate measurement of nanoparticle sizes in SEM images, precise identification and segmentation of particles must first be achieved. Traditional image segmentation methods, such as watershed transform (WST),9 clustering analysis,10 and thresholding analysis,11 are well-suited for images of high quality. However, when dealing with SEM images of poor quality (e.g., low contrast between particles and background, or very small particles), these methods often cause over-segmentation or image erosion, resulting in the loss of critical information. Additionally, these methods require manual parameter tuning when applied to different samples, which not only compromises measurement accuracy but also significantly increases labor costs.
With advancements in deep learning algorithms and machine vision, deep learning-based image segmentation techniques have been widely applied across various scientific fields. To improve the accuracy of particle identification and segmentation, numerous efficient deep learning algorithms have been proposed.12–20 For instance, Wang et al.12 proposed a transformer-enhanced segmentation network (TESN) that integrates a hybrid CNN-transformer architecture, reducing the relative error of nanoparticle size measurement to within 3.52%. Kim et al.13 developed a method that uses machine vision and machine learning technologies to quantitatively extract particle size, distribution, and morphology from SEM images; it achieves high-throughput, automated measurement even for overlapping or rod-shaped nanoparticles. Zhang et al.14 introduced HRU2-Net+ based on U2-Net+, which achieved a mean intersection over union (MIoU) of 87.31% and an accuracy above 97.31% on their dataset, significantly improving segmentation performance and accuracy. Frei et al.20 proposed DeepParticleNet based on Mask R-CNN and introduced a method for generating synthetic SEM images. By training the network on both synthetic and real SEM images, the model maintained adaptability while achieving high-precision particle segmentation.
Despite these advancements, two critical challenges remain: the accurate segmentation of nanoparticles, especially small or low-contrast particles; and the automatic recognition of scale bars in SEM images to ensure accurate nanoparticle size measurement.
To address these issues, we propose an improved semantic segmentation model based on the U-Net architecture, which employs ResNet50 as the backbone and integrates CBAM in the decoder to enhance feature extraction and segmentation accuracy. Furthermore, a scale recognition algorithm is introduced that enables accurate measurement of nanoparticle sizes by extracting and interpreting scale bar information.
The dataset was divided into training and validation sets at an 8 : 2 ratio, resulting in 298 training images and 75 validation images. Additionally, to comprehensively evaluate the model's ability to identify small or low-contrast particles, we selected 42 additional images (containing approximately 1211 particles) from the public datasets mentioned previously, most of which feature small or low-contrast particles, for testing the segmentation performance. The detailed dataset partitioning is shown in Table 1.
| Dataset | Number of images | Number of particlesa |
|---|---|---|
| Training set | 298 | Ca. 5494 |
| Validation set | 75 | — |
| Test set | 42 | Ca. 1211 |

a “Ca.” stands for “containing approximately.”
Through synchronized transformations of images and their corresponding masks, annotation consistency was preserved while data diversity was effectively enhanced, thereby improving the model's generalization and ensuring efficient training of the segmentation model.
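A minimal sketch of such synchronized image/mask augmentation is shown below, assuming the albumentations library; the paper does not specify its augmentation tooling, so the transforms and parameters are illustrative only.

```python
# Hypothetical sketch of synchronized image/mask augmentation (library choice and
# transform list are assumptions, not the authors' exact pipeline).
import cv2
import albumentations as A

# Geometric transforms are applied identically to the SEM image and its mask so the
# particle annotations stay aligned; photometric perturbations touch the image only.
augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.RandomBrightnessContrast(p=0.3),   # image-only perturbation
])

image = cv2.imread("sem_image.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file names
mask = cv2.imread("sem_mask.png", cv2.IMREAD_GRAYSCALE)

out = augment(image=image, mask=mask)    # the mask receives the same geometric ops
aug_image, aug_mask = out["image"], out["mask"]
```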
The channel attention mechanism (CAM) first performs global average-pooling and global max-pooling on the input feature map separately. Then, the results are processed through shared fully connected layers (MLP). The processed feature vectors are added together and passed through the sigmoid activation function, generating channel weight coefficients ranging from 0 to 1. Finally, the generated channel weights are multiplied with the original input feature map on a per-channel basis to complete the attention weighting along the channel dimension. In eqn (1), σ denotes the sigmoid function:
| Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) | (1) | 
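As a concrete illustration of eqn (1), a minimal PyTorch sketch of the channel attention branch is given below; the reduction ratio and layer layout are assumptions rather than the authors' exact configuration.

```python
# Minimal PyTorch sketch of the channel attention branch of eqn (1).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # global max pooling
        self.mlp = nn.Sequential(                 # shared fully connected layers
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # eqn (1): Mc(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        weights = self.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        return x * weights                         # per-channel re-weighting
```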
The spatial attention mechanism (SAM) first computes the maximum and average values across the channel dimension for each spatial location of the input feature map. Then, the results are concatenated along the channel dimension and passed through a 7 × 7 convolution to reduce the number of channels to one. The sigmoid activation function is applied to generate spatial attention weights ranging from 0 to 1. Finally, the spatial attention weights are multiplied with the original input feature map at each spatial position to perform spatial attention weighting:
| Ms(F′) = σ(f7×7([MaxPool(F′); AvgPool(F′)])) | (2) | 
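A companion sketch of the spatial attention branch in eqn (2); only the 7 × 7 kernel size is taken from the text, the rest is an illustrative default.

```python
# Minimal PyTorch sketch of the spatial attention branch of eqn (2).
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Channel-wise max and mean maps, concatenated and fused by a 7x7 convolution.
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        avg_map = torch.mean(x, dim=1, keepdim=True)
        weights = self.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return x * weights                         # per-position re-weighting
```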
CBAM sequentially multiplies the outputs of the channel attention module and the spatial attention module to obtain the final attention-enhanced features:
| F′ = Mc(F) ⊗ F,  F″ = Ms(F′) ⊗ F′ | (3) | 
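Combining the two branches as described by eqn (3), reusing the classes sketched above:

```python
# Sketch of the sequential channel-then-spatial weighting of eqn (3),
# reusing the ChannelAttention and SpatialAttention classes defined above.
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.channel_attention = ChannelAttention(channels)
        self.spatial_attention = SpatialAttention()

    def forward(self, x):
        x = self.channel_attention(x)   # F' = Mc(F) ⊗ F
        x = self.spatial_attention(x)   # F'' = Ms(F') ⊗ F'
        return x
```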
By weighting features in the channel and spatial dimensions, CBAM strengthens key feature representation while reducing redundant information, thus improving particle segmentation accuracy in complex scenarios while maintaining computational efficiency. At the same time, it expands the receptive field in feature extraction and captures multi-scale information of particles, enabling accurate recognition even in low-contrast regions, thereby enhancing the model's ability to detect small or low-contrast particles and improving its generalization capability.
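For orientation only, the sketch below shows one plausible way a CBAM block could sit inside a U-Net decoder stage fed by ResNet50 skip connections; the authors' exact channel counts and layer ordering are not specified here, so this is an assumption-laden illustration rather than the published architecture.

```python
# Illustrative decoder stage: upsample, fuse the encoder skip connection, then apply CBAM.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, in_channels: int, skip_channels: int, out_channels: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_channels + skip_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.cbam = CBAM(out_channels)   # attention on the fused decoder features

    def forward(self, x, skip):
        x = self.up(x)                               # upsample decoder features
        x = self.conv(torch.cat([x, skip], dim=1))   # fuse with the ResNet50 skip connection
        return self.cbam(x)                          # re-weight channels and positions
```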
Firstly, considering that the scale bar in SEM images is located within a white stripe region, contour detection is used to locate the white stripe. Secondly, since the scale bar typically resides at the left edge of the white stripe, with the numeric length value and its unit symbol arranged adjacently, PaddleOCR is employed to recognize text starting from the leftmost part of the region, and regular expressions are used to precisely extract the numerical value and unit, thereby obtaining the physical length of the scale bar Lr. Subsequently, to improve the accuracy of scale detection, the detection area is refined to the region located beneath the identified numerical label, and Canny edge detection30 combined with the probabilistic Hough transform31 is used to automatically detect the tick marks, yielding the pixel length Lp of the scale bar. Finally, the actual length per pixel L is computed using formula (4).
| L = Lr/Lp | (4) | 
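A rough end-to-end sketch of this scale-bar reading pipeline is given below, assuming OpenCV and PaddleOCR; the thresholds, region heuristics, and OCR result parsing are simplifications for illustration, not the authors' implementation.

```python
# Rough sketch of the scale-bar pipeline: locate the white stripe, OCR the label,
# detect the tick line, and apply eqn (4). All parameters are illustrative.
import re
import cv2
import numpy as np
from paddleocr import PaddleOCR

UNIT_TO_NM = {"nm": 1.0, "um": 1000.0, "µm": 1000.0, "μm": 1000.0}

def nm_per_pixel(sem_image_path: str) -> float:
    gray = cv2.imread(sem_image_path, cv2.IMREAD_GRAYSCALE)

    # 1) Locate the white stripe hosting the scale bar via contour detection.
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    stripe = gray[y:y + h, x:x + w]

    # 2) OCR the stripe and pull out "<value> <unit>" with a regular expression.
    ocr = PaddleOCR(lang="en")
    text = " ".join(line[1][0] for line in ocr.ocr(stripe)[0])  # result layout may vary by version
    value, unit = re.search(r"([\d.]+)\s*(nm|um|µm|μm)", text).groups()
    real_length_nm = float(value) * UNIT_TO_NM[unit]            # Lr, in nanometres

    # 3) Detect the tick line with Canny edges + probabilistic Hough transform.
    edges = cv2.Canny(stripe, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=30,
                            minLineLength=20, maxLineGap=3)
    pixel_length = max(abs(x2 - x1) for x1, y1, x2, y2 in lines[:, 0])  # Lp, in pixels

    # 4) eqn (4): physical length represented by one pixel.
    return real_length_nm / pixel_length
```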
As can be visually observed in Fig. 5, the proposed model exhibits significant advantages over models such as U-Net and SegFormer. In low-contrast images (samples 1 and 2), it accurately segments particle contours. When segmenting dense particles in sample 3, the model shows high precision and preserves detailed particle features. For sample 4, which contains small particles, the proposed model demonstrates superior accuracy in identifying these features compared to the other models. These qualitative results demonstrate that the proposed model can provide more reliable outputs for downstream tasks such as particle counting and morphological analysis, demonstrating its superior segmentation performance and practical value in complex scenarios.
To quantitatively evaluate the proposed model's segmentation performance for nanoparticles, we further adopt semantic segmentation evaluation metrics, including intersection over union (IoU), precision, recall, and F1-score, for quantitative analysis. The definitions of these metrics are provided in eqn (5)–(8).
| IoU = TP/(TP + FP + FN) | (5) | 
| Precision = TP/(TP + FP) | (6) | 
| Recall = TP/(TP + FN) | (7) | 
| F1-score = 2 × Precision × Recall/(Precision + Recall) | (8) | 
TP (true positive) indicates the number of particle pixels that are correctly identified by the model; FP (false positive) denotes the number of background pixels that are mistakenly classified as particle pixels; FN (false negative) refers to the number of pixels that actually belong to particle regions but are misclassified as background; TN (true negative) represents the number of background pixels that are correctly identified as background. IoU quantifies the overlap between the predicted particle segmentation and the ground truth annotated regions. Recall measures the proportion of true particle pixels correctly identified by the model, while precision evaluates the proportion of correctly predicted particle pixels among all pixels predicted as particles. The F1-score combines recall and precision using the harmonic mean, reflecting the balance between the two metrics.
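A small NumPy helper illustrating how eqn (5)–(8) follow from these pixel counts (an illustrative utility, not the paper's evaluation code):

```python
# Compute IoU, precision, recall and F1 from binary prediction and ground-truth masks.
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # particle pixels correctly predicted
    fp = np.logical_and(pred, ~gt).sum()     # background predicted as particle
    fn = np.logical_and(~pred, gt).sum()     # particle predicted as background

    iou = tp / (tp + fp + fn)                            # eqn (5)
    precision = tp / (tp + fp)                           # eqn (6)
    recall = tp / (tp + fn)                              # eqn (7)
    f1 = 2 * precision * recall / (precision + recall)   # eqn (8)
    return {"IoU": iou, "Precision": precision, "Recall": recall, "F1": f1}
```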
| Dataset | IoU/% | Recall/% | Precision/% | F1-score/% | 
|---|---|---|---|---|
| Val | 95.92 | 98.77 | 97.08 | 97.9 | 
| Test | 87.79 | 91.79 | 95.27 | 93.50 | 
Experimental results show that the performance on the validation set and the independent test set is consistently high and closely aligned, indicating that the model generalizes well without overfitting. Then we compare the proposed model with U-Net variants employing different backbone networks on the test set. The detailed results are presented in Table 3.
| Method | Backbone | IoU/% | Recall/% | Precision/% | F1-score/% | 
|---|---|---|---|---|---|
| U-Net | — | 77.85 | 84.26 | 91.10 | 87.54 | 
| U-Net | VGG16 | 84.39 | 87.68 | 95.74 | 91.53 | 
| U-Net | ResNet50 | 86.37 | 89.68 | 95.90 | 92.69 | 
| Ours | ResNet50 | 87.79 | 91.79 | 95.27 | 93.50 | 
Experimental results show that the U-Net model with a ResNet50 backbone outperforms architectures such as VGG16 in terms of IoU, F1-score, and other metrics. This quantitatively confirms that the residual connection structure enables effective retention of nanoparticle edge features through cross-layer feature reuse. It significantly reduces the particle miss detection rate and validates the architectural advantage of ResNet50 over other backbones. The proposed model achieves a 3.40% improvement in IoU compared to U-Net with a VGG16 backbone, with notable increases in other metrics as well. These results quantitatively validate the proposed model's superior segmentation performance. We also compare our model with other mainstream methods, as shown in Table 4.
| Method | IoU/% | Recall/% | Precision/% | F1-score/% | 
|---|---|---|---|---|
| PSPNet | 34.30 | 34.61 | 97.43 | 54.12 | 
| DeepLabv3+ | 62.30 | 63.71 | 96.57 | 76.69 | 
| HRNetV2 | 75.60 | 78.99 | 94.63 | 86.09 | 
| SegFormer | 78.02 | 81.45 | 94.87 | 87.65 | 
| Ours | 87.79 | 91.79 | 95.27 | 93.50 | 
Experimental results show that the proposed model achieves an IoU of 87.79%, representing an improvement of more than 9.77% over mainstream models such as SegFormer. The F1-score improves by 5.85% and the recall by 10.34% compared to SegFormer, indicating that the model achieves a better balance in suppressing both false positives and false negatives. In addition, the precision increases by 0.40%, reaching 95.27%, which outperforms SegFormer (94.87%) and HRNetV2 (94.63%). Combined with the results in Fig. 5, these findings further confirm that the CRCRA module effectively suppresses background noise and improves segmentation precision for nanoparticles.
As shown in Fig. 4, the qualitative results have demonstrated that the absence of the CRCRA module leads to suppressed key feature regions, thereby reducing segmentation accuracy for particles. The quantitative results in Table 5 further validate this conclusion. The inclusion of the ResNet50 residual network alone improves the IoU by 8.52% and the F1-score by 5.51% compared to the baseline U-Net, with other evaluation metrics also showing significant gains. These results verify the strong feature extraction capability of the ResNet50 residual network for complex patterns. The introduction of the CRCRA module results in a modest IoU improvement of 1.04% compared to the baseline model. By enhancing key region features through channel and spatial attention mechanisms, it boosts precision by 4.98%, quantitatively demonstrating its effectiveness in suppressing background noise and improving focus on critical regions. In addition, the combination of both modules improves the IoU by 9.94% and the F1-score by 5.96% compared to the baseline, while the recall increases to 91.79%, outperforming all individual enhancement schemes. This demonstrates the synergistic effect of integrating the ResNet50 residual network with the CRCRA module, substantially enhancing the model's segmentation performance on complex SEM images.
| U-Net | ResNet50 | CRCRA | IoU/% | Recall/% | Precision/% | F1-score/% | 
|---|---|---|---|---|---|---|
| √ | — | — | 77.85 | 84.26 | 91.10 | 87.54 | 
| √ | √ | — | 86.37 | 89.68 | 95.90 | 92.69 | 
| √ | — | √ | 78.89 | 81.41 | 96.08 | 88.11 | 
| √ | √ | √ | 87.79 | 91.79 | 95.27 | 93.50 | 
log R-Acc (binary logistic regression-based accuracy), the Spearman coefficient, and the mean relative error of the particle size measurement were used to evaluate the models. The formula for log R-Acc is given by (9):

|  | (9) | 

Additionally, three images containing only a single particle were excluded from the calculation of log R-Acc due to the absence of a standard deviation. Thus, the overall failure rate on the full test corpus is 1/42 and the log R-Acc is 37/38.

For log R-Acc, it can be seen from the formula that its value largely depends on the critical value of the particle size measurement results. Therefore, this evaluation metric alone is insufficient to accurately assess the performance of the proposed method; when evaluating the effectiveness of the method, more attention should be paid to the Spearman coefficient and the mean relative error of the particle size measurement.
The Spearman coefficient is a non-parametric statistical measure used to assess the rank correlation between predicted particle sizes and manually measured values. It evaluates the method's ability to capture the particle size distribution trend by comparing the consistency of the rankings of two data sets (rather than their absolute numerical differences). The coefficient ranges from −1 to 1, with values closer to 1 indicating that the predicted relative particle size relationships (such as particle A being larger than particle B) are more consistent with the manual measurements. The statistical analysis reveals that the Spearman coefficient is 0.91, indicating a strong correlation between the predicted and manually measured particle sizes. The mean relative error of the mean particle size is 4.25%, indicating the good generalizability of the proposed method for particle size measurement on SEM images containing particles of various sizes.
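A brief sketch of how these two agreement metrics can be computed from per-image mean particle sizes, assuming SciPy (variable names are illustrative):

```python
# Spearman rank correlation and mean relative error between predicted and manual sizes.
import numpy as np
from scipy.stats import spearmanr

def size_agreement(pred_mean_sizes: np.ndarray, manual_mean_sizes: np.ndarray):
    rho, _ = spearmanr(pred_mean_sizes, manual_mean_sizes)   # rank correlation, -1 .. 1
    mre = np.mean(np.abs(pred_mean_sizes - manual_mean_sizes) / manual_mean_sizes)
    return rho, mre   # e.g. (0.91, 0.0425) would match the reported values
```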
To more intuitively demonstrate the accuracy of the model, six images were randomly selected from the test set to compare the particle sizes measured by the model with those measured manually. As shown in Fig. 6, the average particle sizes measured by the proposed method for particles of various sizes are very close to the manually obtained results, intuitively confirming the effectiveness of the model proposed in this study.
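For completeness, the sketch below shows one simple way per-particle sizes in nanometres could be extracted from a predicted binary mask, using connected components and the pixel-to-length factor from eqn (4); splitting of densely overlapping particles is not handled in this simplification, and the function and thresholds are illustrative assumptions.

```python
# Convert a predicted binary mask into equivalent circular diameters in nanometres.
import cv2
import numpy as np

def particle_diameters_nm(pred_mask: np.ndarray, nm_per_px: float,
                          min_area_px: int = 10) -> np.ndarray:
    mask = (pred_mask > 0).astype(np.uint8)
    n_labels, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    areas = stats[1:, cv2.CC_STAT_AREA]            # skip the background label 0
    areas = areas[areas >= min_area_px]            # drop tiny spurious regions
    diameters_px = 2.0 * np.sqrt(areas / np.pi)    # equivalent circular diameter in pixels
    return diameters_px * nm_per_px                # physical diameters via eqn (4)
```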
Although this study has achieved certain results, several areas still require improvement, including the accurate segmentation of densely overlapping particle boundaries in SEM images; the introduction of unsupervised learning methods to reduce the labor cost of dataset annotation; and the development of lightweight architectures to reduce computational cost and improve hardware resource utilization while maintaining segmentation effectiveness. At present, we are committed to integrating the latest research findings to iteratively improve the proposed method, focusing on overcoming the above technical bottlenecks and providing new research perspectives for related fields.
| This journal is © The Royal Society of Chemistry 2025 |