A visual language model enabling intelligent nanomaterial scanning electron micrograph annotation
Abstract
Artificial intelligence (AI) has significantly advanced materials research and development through data-driven approaches. However, the large labeled datasets that AI methods require remain difficult to obtain because manual labeling is time-consuming and laborious. The morphology of nanomaterials is crucial to the study of their properties, and scanning electron microscopy (SEM) is among the key techniques for morphology characterization. The structural complexity of nanomaterials makes annotating their SEM images extremely challenging, so very few labeled images are available. There is therefore an urgent need for automatic pattern recognition on SEM images of nanomaterials that does not rely on labeled data. In this paper, we develop the Scanning Electron Microscopy Vision-Language Model (SEM-VLM), a domain-specific adaptation of the vision-language model (VLM) paradigm for nanomaterials science. The model is trained via contrastive learning on SEM image–text pairs extracted from the literature. SEM-VLM demonstrates superior cross-modal retrieval performance over the general-domain Contrastive Language-Image Pretraining (CLIP) model and random baselines on the Recall@10 and Recall@50 metrics, and keyword searches confirm its robust ability to retrieve relevant images. SEM-VLM also achieves high accuracy in zero-shot classification through ensemble vision-language alignment, outperforming CLIP. In few-shot settings, SEM-VLM trained with only 2.1% of the labels outperforms the fully supervised model EMCNet (Graph-Nets for Electron Micrograph Classification). Activation mapping analysis reveals precise localization of critical nanoscale features (particles, holes, and probe tips), providing more interpretable results than conventional approaches while maintaining operational reliability.
This multimodal framework reduces the dependence on labeled data by orders of magnitude and enables automated, high-precision classification.
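The zero-shot classification described above can be illustrated with a short sketch. The snippet below is not SEM-VLM's actual code; it shows, under assumed pre-computed embeddings, the generic CLIP-style procedure: L2-normalize image and text embeddings, average several prompt embeddings per class (a simple prompt ensemble, standing in for the "ensemble vision-language alignment" named in the abstract), and pick the class whose text embedding is most cosine-similar to the image.

```python
import numpy as np

def _normalize(x, axis=-1):
    # L2-normalize so dot products become cosine similarities,
    # as in CLIP-style contrastive models.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, class_prompt_embs):
    """Zero-shot classification by vision-language alignment (illustrative).

    image_emb:         (d,) embedding of one SEM image.
    class_prompt_embs: list of (n_prompts, d) arrays, one per class;
                       each class's prompt embeddings are averaged to
                       form a single class embedding (prompt ensemble).
    Returns the predicted class index and the per-class similarities.
    """
    img = _normalize(image_emb)
    class_embs = np.stack(
        [_normalize(p).mean(axis=0) for p in class_prompt_embs]
    )
    sims = _normalize(class_embs) @ img  # cosine similarity per class
    return int(np.argmax(sims)), sims

# Toy example with random vectors standing in for real embeddings;
# the first class's prompts are deliberately close to the image.
rng = np.random.default_rng(0)
d = 8
image = rng.normal(size=d)
prompts_per_class = [
    image + 0.1 * rng.normal(size=(3, d)),  # class 0: near the image
    rng.normal(size=(3, d)),                # class 1: unrelated
    rng.normal(size=(3, d)),                # class 2: unrelated
]
pred, sims = zero_shot_classify(image, prompts_per_class)
```

In practice the embeddings would come from SEM-VLM's image and text encoders, with one text prompt per candidate nanomaterial category; no gradient updates or labeled training images are needed at inference time.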