Ni Lia, Ryan Jacobsa, Matthew Lynchb, Vidit Agrawalc, Kevin Fieldb and Dane Morgan*a
aDepartment of Materials Science and Engineering, University of Wisconsin-Madison, Madison, Wisconsin 53706, USA. E-mail: ddmorgan@wisc.edu
bNuclear Engineering and Radiological Sciences, University of Michigan-Ann Arbor, Ann Arbor, Michigan 48105, USA
cDepartment of Computer Science, University of Wisconsin-Madison, Madison, Wisconsin 53706, USA
First published on 4th March 2025
Quantifying prediction uncertainty when applying object detection models to new, unlabeled datasets is critical in applied machine learning. This study introduces an approach to estimate the performance of deep learning-based object detection models for quantifying defects in transmission electron microscopy (TEM) images, focusing on detecting irradiation-induced cavities in TEM images of metal alloys. We developed a random forest regression model that predicts the object detection F1 score, a statistical metric used to evaluate the ability to accurately locate and classify objects of interest. The random forest model uses features extracted from the predictions of the object detection model whose uncertainty is being quantified, enabling fast prediction on new, unlabeled images. The mean absolute error (MAE) for predicting F1 of the trained model on test data is 0.09, and the R2 score is 0.77, indicating a significant correlation between the defect detection F1 scores predicted by the random forest regression model and the true values. The approach is shown to be robust across three distinct TEM image datasets with varying imaging and material domains. Our approach enables users to estimate the reliability of a defect detection and segmentation model's predictions and assess the applicability of the model to their specific datasets, providing valuable information about possible domain shifts and whether the model needs to be fine-tuned or trained on additional data to be maximally effective for the desired use case.
In recent years, deep learning (DL) has significantly advanced the fields of computer vision and image processing. In particular, convolutional neural networks (CNNs), due to their ability to efficiently and accurately identify relevant features in images, have been transformative and are now widely applied to locate objects within images with high accuracy. Advanced CNNs like ResNet50, VGG16 and U-net4 have become foundational in object detection frameworks, such as the Faster Regional Convolutional Neural Network (Faster R-CNN),5 Mask R-CNN6 and YOLO (you only look once).7 These and related object detection frameworks have recently gained significant traction in materials research, and have been employed to detect features such as void defects, dislocation loops and nanoparticles.8–19 Although not directly related to this work, models based on fully convolutional networks (FCNs) have also been employed to locate individual atoms in electron microscopy (EM) images.1,20,21
Overall, object detection models have achieved human domain-expert level performance (with dramatically faster prediction times) for characterizing the numbers, shapes and sizes of various defect types in EM images for numerous types of materials.3 However, it has been pointed out that the performance of object detection models varies with the overall quality of EM images, the size and visual quality of individual objects to be identified, and the selection of training and testing data used to train the object detection model. For example, Jacobs et al.9 found that the performance of a Mask R-CNN model for detecting defects in TEM images was affected by the similarity between images comprising the training and testing dataset: testing images from a different data source, material type or imaging condition than was included in the training data resulted in significantly degraded model performance. Wei et al.22 demonstrated the significant impact of STEM image quality (such as resolution and contrast) and the similarity to the training data on the performance of FCN-based models. It has also been observed that the robustness of neural networks varies with EM images taken with different experimental parameters, such as magnification and electron dosage.23 Finally, Jacobs et al.12 found that a Mask R-CNN model used to characterize cavities in TEM images of irradiated metal alloys had difficulty detecting small cavities (i.e., those less than a few percent of the image dimension), and Bruno et al. found that human labelers, even domain-expert ones, will introduce biases into their ground-truth labeling when attempting to label objects that are small or visually ambiguous.24 The examples provided above leveraged significant scientific-domain expertise to identify when certain data were likely to fall inside or outside the applicability domain of the trained object detection model. Such information is not always readily available or practical to obtain, so some uncertainty quantification of object detection model predictions would be highly beneficial for the application of object detection models to EM image characterization.
The success of deep neural networks in the field of computer vision rests on the presumption that the data used for training and testing are drawn from the same distribution.25,26 The decline in performance when a model is applied to data that deviates from the distribution seen during training is commonly referred to as the out-of-distribution (OOD) problem.27,28 In computer vision, OOD detection has traditionally been framed as a classification task to distinguish between OOD and in-distribution samples.28,29 Commonly used image benchmarks, like CIFAR and ImageNet, consist predominantly of visually distinct common objects (e.g., pictures of individual animals, furniture, food, people, etc.). In EM imaging, however, object variations and distinctions are typically much less obvious, and even different domain-expert labelers will show marked differences in apparent ground truth labeling.24,30 Therefore, treating OOD detection in EM images as a binary classification problem is not feasible, owing to subtle but distinct differences between imaging domains, nuanced labeling, and complex evaluation criteria. In this work, our focus is on developing an approach that estimates the likely accuracy of a DL defect detection model for a given image so that the user can decide how they wish to use the predictions from that image.
There are two main approaches to address EM image-based DL model uncertainty, depending on its origin and the objective. In automated experimentation, the data distribution may drift out of distribution as new data are acquired, leading to decreased model performance.31 The goal in such scenarios is to enhance model performance, with current methods focusing on the iterative training of ML models to enable adaptive learning as the underlying data used in training is updated.32 This is an exciting approach, but it involves significant effort in obtaining consistently labeled data and retraining models to address issues. Another approach concerns the treatment of outlier EM images, such as those that are empty or exhibit a low signal-to-noise ratio, and therefore lack valuable information and should be discarded. Here, the objective is simply to flag and reject these outlier images, not use them for retraining, and thereby ensure the integrity of the data used for analysis.33 However, determining outliers can be challenging since model performance depends on many factors. We take an approach similar to this second outlier approach, although we provide a continuous prediction of quality (i.e., predicted F1 score) rather than just a classification of in-distribution or OOD.
In this work we develop and validate a performance estimation framework capable of predicting how well a trained Mask R-CNN model is expected to locate and classify objects when applied to new TEM images. Although we focus on one model type and just cavity defects in irradiated metal alloys, we expect the overall approach to be useful for quantifying the performance of many object detection models trained on many different types of objects and images. Crucially, our trained random forest model can be applied to images for which no labeled ground truth data is available, providing insight into the expected performance of the object detection model on new, unseen data. Fig. 1 illustrates the workflow of the performance evaluation procedure without ground truth labels. Rather than simplifying the problem to a binary classification of data to in-distribution or OOD, we have developed a methodology that predicts the defect detection F1 score as a metric for a quantitative evaluation of model performance. We have trained a random forest regression model to learn the relationship between selected features derived from the Mask R-CNN model output (the bounding boxes and associated confidence scores) and the object detection F1 score. By processing new images through a pre-trained Mask R-CNN model, one can subsequently employ our random forest regression model to estimate the defect detection F1 score. This predictive capability allows users of our Mask R-CNN model to estimate the reliability of their results and determine the suitability of the model for their specific datasets. Our framework is particularly useful in applying trained defect detection models on new images where image quality and characteristics may be different from the training dataset, e.g., due to domain shift and/or just poor image quality. This work also opens new avenues for the robust application of machine learning models in materials science, where understanding and quantifying uncertainty is crucial for advancing experimental and analytical techniques.
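As a concrete illustration of this workflow, the following minimal sketch shows how a new, unlabeled image might be passed through a trained Mask R-CNN model and its detections fed to the random forest regressor to obtain a predicted F1 score. The file names, the torchvision-style output format, and the `extract_features` helper (sketched later) are illustrative assumptions, not the exact released implementation.

```python
import joblib
import torch

# Hypothetical file names; the models released on Figshare may be packaged differently.
mask_rcnn = torch.load("cavity_mask_rcnn.pt", map_location="cpu").eval()
rf_model = joblib.load("f1_random_forest.joblib")

def estimate_f1(image_tensor):
    """Estimate the defect find F1 score for one unlabeled TEM image."""
    with torch.no_grad():
        pred = mask_rcnn([image_tensor])[0]      # torchvision-style output dict
    boxes = pred["boxes"].cpu().numpy()          # (N, 4) bounding boxes
    scores = pred["scores"].cpu().numpy()        # (N,) confidence scores
    # Build the per-image feature vector (sketched later); any feature scaling
    # used when training the random forest would also be applied here.
    features = extract_features(boxes, scores, image_tensor.shape[-2:])
    return float(rf_model.predict(features.reshape(1, -1))[0])
```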
The data generation and utilization workflow shown in Fig. 2 begins with a comprehensive collection of the three sets of TEM images described above, from which a subset is used for the training of the Mask R-CNN model, and a distinct subset of the TEM images is deployed to test the performance of the trained Mask R-CNN. As shown in Table 1, the data splits used to train and test the Mask R-CNN model include two types of splits: one where the training and testing datasets are sourced from the same subset (these are random splits so the test data is likely to be in the same distribution as the training data), and another where the testing data are sourced from a different subset than the training data (these are splits based on distinct subsets with known significant differences so the test data is likely outside the distribution of training data). We will refer to these carefully designed distinct subsets as “grouped” subsets to reflect the distinct nature of their grouping. The method of determining what is in each grouped subset is based on either (i) data coming from different origins, e.g., Set A vs. Set B, and thus representing different materials, irradiation conditions, and TEM instruments, or (ii) data coming from different imaging modes, where here the main difference in imaging mode is overfocus vs. underfocus conditions. For each case, a Mask R-CNN model was trained on the training dataset and then applied to detect cavities in the test images. The resulting bounding boxes and their confidence scores on the test images were used as a basis for creating features to train the random forest model to predict the object detection F1 score, discussed more in Section 2.3.
F1 = 2 × (precision × recall)/(precision + recall) | (1)
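The bounding boxes and confidence scores output by the Mask R-CNN model can be converted into per-image features along the following lines. This is a minimal sketch: the binning scheme and the exact definitions of the size, area-ratio, and image-confidence features shown here are assumptions for illustration, and the full candidate feature set used in this work is described in the following sections.

```python
import numpy as np

def extract_features(boxes, scores, image_shape):
    """Assemble a per-image feature vector from Mask R-CNN detections.

    boxes:  (N, 4) array of [x1, y1, x2, y2] bounding boxes
    scores: (N,) array of detection confidence scores
    image_shape: (height, width) of the image
    """
    h, w = image_shape
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)

    # Counts of detections per confidence bin [0.1, 0.2), ..., [0.9, 1.0]
    counts = np.histogram(scores, bins=np.arange(0.1, 1.01, 0.1))[0]

    # Confidence statistics over all detections
    avg_conf = scores.mean() if scores.size else 0.0
    std_conf = scores.std() if scores.size else 0.0

    # Fractional defect size: box side length relative to the image dimension
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    frac_size = np.sqrt(widths * heights) / max(h, w) if boxes.size else np.array([0.0])
    avg_size, std_size = frac_size.mean(), frac_size.std()

    # Area ratio: total detected box area relative to the image area (assumed definition)
    area_ratio = float((widths * heights).sum()) / (h * w)

    # Image confidence: area-weighted mean confidence (assumed definition)
    image_conf = (float((scores * widths * heights).sum() / (widths * heights).sum())
                  if boxes.size else 0.0)

    return np.concatenate([counts,
                           [avg_conf, std_conf, avg_size, std_size, area_ratio, image_conf]])
```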
As shown in Fig. 2, the performance of the random forest regression model was assessed using either random five-fold cross-validation (random splits) or leave-out-group cross-validation (grouped splits). The final model used for deployment was fit on all of the data together. The performance of the trained model on each test dataset was evaluated using five well-established evaluation metrics, averaged over the five test folds to reflect the overall performance of the model. In addition to the three widely used metrics of the coefficient of determination (R2), the root mean square error (RMSE), and the mean absolute error (MAE), the normalized RMSE (NRMSE) and normalized MAE (NMAE) are also employed. NRMSE normalizes the RMSE by the standard deviation of the ground truth F1 scores in the test set being considered, while NMAE normalizes the MAE relative to the mean of the ground truth F1 scores in the test set being considered.
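This evaluation protocol can be sketched with scikit-learn as follows. The random forest hyperparameters and the use of `LeaveOneGroupOut` to realize the leave-out-group splits are illustrative assumptions, while the NRMSE and NMAE definitions follow the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, LeaveOneGroupOut

def evaluate(y_true, y_pred):
    """RMSE, MAE, R2, plus the normalized variants described in the text."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    return {"RMSE": rmse,
            "MAE": mae,
            "R2": r2_score(y_true, y_pred),
            "NRMSE": rmse / np.std(y_true),   # normalized by std of true F1 scores
            "NMAE": mae / np.mean(y_true)}    # normalized by mean of true F1 scores

def cross_validate(X, y, groups=None, n_splits=5, seed=0):
    """Random k-fold CV, or leave-out-group CV when `groups` is provided."""
    splitter = (LeaveOneGroupOut() if groups is not None
                else KFold(n_splits=n_splits, shuffle=True, random_state=seed))
    metrics = []
    for train_idx, test_idx in splitter.split(X, y, groups):
        rf = RandomForestRegressor(n_estimators=200, random_state=seed)
        rf.fit(X[train_idx], y[train_idx])
        metrics.append(evaluate(y[test_idx], rf.predict(X[test_idx])))
    return metrics  # averaged over folds to report overall performance
```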
We normalized all features to the same scale using the StandardScaler tool from the scikit-learn package to prevent any single feature from dominating the model due to its value range. To identify the most important features for our model, we conducted SHAP (SHapley Additive exPlanations) analysis.37 SHAP values provide a unified measure of feature importance by quantifying the contribution of each feature to the model's predictions. Fig. 3 presents the SHAP value summary plot for all feature candidates considered in the model, illustrating the impact of each feature on the model's output. Each dot represents the SHAP value for a particular data point in the dataset, with colors indicating the feature value from low (blue) to high (red). The color gradient reveals how different values of the features affect the predictions. For instance, high values of the number of high-confidence defects (counts_0.9) (red) tend to increase the SHAP value, positively influencing the model's output (i.e., high predicted F1 score), while low values (blue) have the opposite effect (i.e., low predicted F1 score).
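A sketch of this analysis using the shap package is shown below. The variables `X`, `y`, and `feature_names` are assumed to hold the feature matrix, the true defect find F1 scores, and the candidate feature names, and the random forest hyperparameters are illustrative.

```python
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# X, y, feature_names are assumed to be defined as described above.
X_scaled = StandardScaler().fit_transform(X)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_scaled, y)

# TreeExplainer computes exact SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_scaled)

# Beeswarm-style summary plot analogous to Fig. 3: one dot per image per feature,
# colored by feature value and positioned by its SHAP value.
shap.summary_plot(shap_values, X_scaled, feature_names=feature_names)
```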
The features are listed on the y-axis, where features (1)–(9) are denoted by "counts_" followed by a number. For instance, counts_0.1 represents the number of defects with confidence scores between 0.1 and 0.2. Feature (13) is denoted by "average size", and feature (14) by "std size". Based on the ranking from the SHAP analysis, we trained the random forest model using between 5 and 19 features. The resulting RMSE, R2, and MAE are plotted as a function of the number of features in Fig. 4. The model achieved the best performance, with the lowest RMSE and highest R2 score, when using the top eight features that had the most significant impact on the predictions. These eight features were therefore selected for the final model. Notably, the number of defects with confidence scores higher than 0.9 has a greater impact on the model's performance than the number of detected defects with lower confidence scores. This observation is reasonable because the number of high-confidence defects strongly influences both false positives and false negatives, and therefore correlates strongly with the F1 score. Additionally, the average and standard deviation of the confidence score are crucial since they reflect the model's ability to identify high-confidence detections reliably. Moreover, the average and standard deviation of fractional defect size are important factors; detecting small defects accurately poses a challenge for the model, influencing its overall performance. The area ratio and image confidence were also found to have a significant impact on the model's output and were therefore adopted to train the random forest regression model.
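Continuing the sketches above, the feature-count sweep might look as follows: features are ranked by mean absolute SHAP value and the random forest is retrained on the top k features for k = 5 to 19. The selection criterion shown (mean RMSE over folds) is a simplification of the RMSE/R2 comparison described in the text.

```python
import numpy as np

# Rank features by mean absolute SHAP value, most important first.
ranking = np.argsort(np.abs(shap_values).mean(axis=0))[::-1]

rmse_by_k = {}
for k in range(5, 20):                                    # top 5 ... top 19 features
    folds = cross_validate(X_scaled[:, ranking[:k]], y)   # helper sketched earlier
    rmse_by_k[k] = np.mean([m["RMSE"] for m in folds])

best_k = min(rmse_by_k, key=rmse_by_k.get)                # eight features in this work
```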
Fig. 4 RMSE, R2, and MAE of the trained random forest model on test data as a function of the number of features used in the model.
The parity plot in Fig. 6(a) visualizes the performance of our random forest model used to predict the defect find F1 score from random five-fold cross-validation. The dispersion of points along the line of parity (where the predicted score equals the true score) suggests a moderately strong correlation, supported by an MAE of 0.094, an RMSE of 0.127, and an R2 score of 0.774. These metrics indicate a good level of accuracy in the model predictions across all the data. However, it is also observed that lower F1 scores tend to be overestimated, while higher F1 scores tend to be underestimated, which is a common behavior of regression models as they seek to minimize overall error and balance predictions around the mean.
Fig. 6 (a) Parity plot comparing the predicted defect find F1 scores from the random forest model to the true scores across five test datasets from random five-fold cross-validation. Each symbol represents a different split within the datasets, and the details of the splits can be found in Table 1. (b) Plot of mean predicted F1 scores for each split against mean true scores, with vertical and horizontal error bars denoting standard deviation in predicted and true F1 scores, respectively. Dashed lines indicate the line of perfect prediction where predicted scores match true scores exactly.
We also observed that data points with true defect find F1 scores below 0.5 tend to deviate further from the parity line. Given that grouped splits generally have lower true F1 scores, we plot the average predicted defect find F1 score for each split against the average true F1 score in Fig. 6(b) to illustrate the overall performance across different splits. These averages show a strong alignment with the true scores, as evidenced by an MAE of 0.047, an RMSE of 0.062, and an R2 of 0.831, which surpass the collective metrics across all data. The predictions on random splits align more closely with the true F1 scores than those on grouped splits, where the average MAE for random splits is 0.082 whereas the average MAE from the grouped splits is 0.121.
The F1 scores obtained from the test dataset and the corresponding predictions from the random forest model were categorized into intervals to construct a confusion matrix which is shown in Fig. 7. This confusion matrix helps in evaluating the accuracy of our model predictions across different score ranges. The matrix shows darker shades along the diagonal from the top left to the bottom right, indicating a higher concentration of instances where the predicted F1 scores align closely with the true F1 scores. Lighter shades off the diagonal reveal fewer occurrences, suggesting that most predictions fall within the correct range.
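A confusion matrix of this kind can be constructed along the following lines; the 0.1-wide intervals are an assumed bin width, and `y_true` and `y_pred` denote the true and predicted defect find F1 scores.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Assign each true and predicted F1 score to a 0.1-wide interval (assumed bin width).
bins = np.arange(0.0, 1.0, 0.1)
true_bin = np.digitize(y_true, bins)   # y_true, y_pred: arrays of F1 scores in [0, 1]
pred_bin = np.digitize(y_pred, bins)

# Rows: true F1 interval; columns: predicted F1 interval (as visualized in Fig. 7).
cm = confusion_matrix(true_bin, pred_bin, labels=np.arange(1, len(bins) + 1))
```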
Table 2 summarizes the model performance metrics obtained from both the grouped cross-validation and ten iterations of random five-fold cross-validation processes. The first row of the table summarizes metrics for the entire dataset, showing an RMSE of 0.127, an MAE of 0.093, and an R2 score of 0.774 based on 833 test images. The next five rows provide metrics for grouped splits, ordered by the number of data points within each split. The average metrics over the five grouped splits are shown in the next row shaded in light blue. Similarly, metrics for random splits are shown in the following five rows, with the average over random splits displayed in the last row shaded in light green. RMSE and MAE vary the least across different data splits. In contrast, the R2 score, NRMSE, and NMAE are influenced by the F1 score range within a split, often indicating higher errors for splits with narrower F1 score ranges. The average RMSE and MAE of the grouped splits are slightly higher than those for all data, while the average RMSE and MAE of the random splits are slightly lower than those for all data, suggesting higher prediction accuracy on randomly split data. An exception is observed in the split A: over_over, which shows an RMSE of 0.141 and an MAE of 0.11, likely due to the limited number of just 11 data points and the low average F1 score in this split.
In the application context of the trained random forest model, one goal is to guide users in assessing whether the results of defect detection on certain EM images using a trained Mask R-CNN model are reliable or not. This scenario can be framed as a binary classification task: the F1 score predictions can be transformed into binary classifications by applying a threshold to the defect find F1 score. The precision–recall curve shown in Fig. 8 illustrates the performance of the trained random forest model in classifying data points with a threshold of 0.5 on the defect find F1 score. We note that the choice of threshold is subjective; for our present use case, an F1 threshold of 0.5 broadly separates reasonably well-performing from poor-performing images while still allowing our random forest model to classify the two groups robustly. The solid blue line represents the precision of the random forest model at various levels of recall. The curve starts with a precision close to 1.0 and gradually declines as recall increases, indicating that the model maintains high precision across a wide range of recall levels before it begins to fall off. The dashed line represents the no-skill baseline, which indicates the performance of a model that randomly guesses the class. The performance of the random forest model is notably above this baseline, indicating its capability to discriminate effectively between in- and out-of-domain images (based on the defect find F1 threshold of 0.5).
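A sketch of this thresholding and the resulting precision–recall curve, using scikit-learn, is shown below; `y_true` and `y_pred` again denote the true and predicted defect find F1 scores and are assumed to be available.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

threshold = 0.5
# Positive class: images on which the Mask R-CNN performed well (true F1 >= 0.5).
labels = (np.asarray(y_true) >= threshold).astype(int)

# Use the predicted F1 score itself as the classification score.
precision, recall, _ = precision_recall_curve(labels, y_pred)

# No-skill baseline: precision obtained by always predicting the positive class.
no_skill = labels.mean()
```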
Fig. 9 presents two plots comparing the performance of domain classification as a function of different defect find F1 score thresholds. The left plot illustrates the domain classification F1 score, and the right plot shows the domain classification accuracy (Acc), both as a function of various defect find F1 thresholds. In both plots, the solid colored dots represent the performance of the random forest model, while the lighter dots denote a baseline for comparison. Overall, the classification performance is significantly better than the baseline model, with a classification F1 score higher than 0.7 and classification accuracy exceeding 0.8 when the threshold on defect find F1 score is smaller than 0.8. As the threshold increases from 0.1 to 0.7, we also observe a general trend of decreasing domain classification F1 scores and accuracy.
In addition to evaluating the overall F1 score, we also trained random forest models to predict defect find precision and recall to gain a more nuanced understanding of our model's performance. While the F1 score provides a balanced measure of both precision and recall, predicting these metrics independently allows us to assess specific aspects of the model's capability. Precision indicates how many of the detected defects are true positives, highlighting the model's accuracy in defect identification. Recall, on the other hand, measures how many actual defects were detected, reflecting the model's ability to identify all relevant defects.
Our model demonstrated strong performance in predicting precision, achieving an MAE of 0.094, an RMSE of 0.132, and an R2 score of 0.81. In contrast, predicting recall proved to be more challenging: the model for recall showed an MAE of 0.14, an RMSE of 0.192, and an R2 score of 0.57. The evaluation metrics for predicting defect find precision, recall and F1 scores are summarized in Table 3. Detailed analyses are provided in the ESI.† The model's performance in predicting precision surpasses that of predicting the F1 score, as precision directly correlates with detected defects. However, predicting recall is more difficult because it involves estimating defects that the model failed to detect, which is inherently more challenging for machine learning models.
Table 3 Evaluation metrics for the random forest models predicting defect find precision, recall, and F1 score

Target | RMSE | MAE | R2 | NRMSE | NMAE
---|---|---|---|---|---
Precision | 0.132 | 0.094 | 0.81 | 0.435 | 0.163
Recall | 0.192 | 0.140 | 0.57 | 0.656 | 0.198
F1 score | 0.127 | 0.093 | 0.774 | 0.475 | 0.167
We also attempted to train a random forest model to predict the swelling error of the Mask R-CNN model. However, this model showed poor performance, with an R2 score of 0.131. This outcome is expected, as predicting swelling error requires knowledge of the sizes of defects missed by the Mask R-CNN model; without information about these undetected defects, estimating their sizes becomes significantly more challenging. Additional details can be found in the ESI.†
The Mask R-CNN model and the random forest model trained on all of our data are available on Figshare (https://doi.org/10.6084/m9.figshare.27281400.v1). The trained Mask R-CNN model is designed specifically for detecting and segmenting cavity defects in TEM images, and thus it is not intended for use with images outside this domain. To evaluate the usefulness and reliability of the random forest model, we tested it on COCO-128 images,38 which differ significantly from EM images. We observed that Mask R-CNN often over-confidently detected cavities in these images, despite the absence of any actual cavities, resulting in an expected F1 score of 0. The random forest model, however, produced predicted defect F1 scores below 0.7, with more than 75% of them falling below 0.5. Examples of Mask R-CNN output images and the histogram of predicted F1 scores from the random forest model are provided in the ESI.† Although these predictions are not close to 0, they are still substantially lower than those for EM images in random splits. This contrast between the Mask R-CNN's overconfidence and the moderate F1 scores predicted by the random forest suggests that the random forest model successfully captures features indicative of the imaging domain, showing potential for identifying out-of-domain images.
By enabling users to predict model performance on new, unlabeled data, we bridge a significant gap in automated defect detection workflows. In particular, the approach taken here could be used to provide automatic guardrails for users of defect detection models, warning them when prediction quality is a concern. Moreover, the success of this methodology paves the way for future research to extend such performance estimation to other deep learning models in materials science and beyond.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00351a