Open Access Article
Madeleine A. Gaidimas†a, Abhijoy Mandal†b, Pan Chenb, Shi Xuan Leongcd, Gyu-Hee Kima, Akshay Talekare, Kent O. Kirlikovalia, Kourosh Darvishbf, Omar K. Farha*ag, Varinia Bernales*cef and Alán Aspuru-Guzik*bcfhijkl
aDepartment of Chemistry and International Institute for Nanotechnology, Northwestern University, Evanston, IL 60208, USA. E-mail: o-farha@northwestern.edu
bDepartment of Computer Science, University of Toronto, Toronto, ON M5S 2E4, Canada. E-mail: alan@aspuru.com
cDepartment of Chemistry, University of Toronto, Toronto, ON M5S 2E4, Canada. E-mail: varinia@bernales.org
dSchool of Chemistry, Chemical Engineering and Biotechnology, Nanyang Technological University, Singapore 637371, Singapore
eMaterials Discovery Research Institute, UL Research Institutes, Skokie, IL 60077, USA
fAcceleration Consortium, Toronto, ON M5S 3H6, Canada
gDepartment of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA
hVector Institute for Artificial Intelligence, Toronto, ON M5G 1M1, Canada
iDepartment of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON M5S 3E5, Canada
jDepartment of Materials Science and Engineering, University of Toronto, Toronto, ON M5S 3E4, Canada
kSenior Fellow, Canadian Institute for Advanced Research (CIFAR), Toronto, ON M5G 1M1, Canada
lNVIDIA, Toronto, ON M5V 1K4, Canada
First published on 23rd December 2025
Advances in high-throughput instrumentation and laboratory automation are revolutionizing materials synthesis by enabling the rapid generation of large libraries of novel materials. However, efficient characterization of these synthetic libraries remains a significant bottleneck in the discovery of new materials. Traditional characterization methods are often limited to sequential analysis, making them time-intensive and cost-prohibitive when applied to large sample sets. Just as chemists interpret visual indicators to identify promising samples, computer vision (CV) offers an efficient approach to accelerating materials characterization across varying scales whenever visual cues are present. CV is particularly useful in high-throughput synthesis and characterization workflows, as these techniques can be rapid, scalable, and cost-effective. Although a growing number of examples exist in the literature, there are few resources that offer newcomers a practical way to get started in the field. Here, we aim to fill that gap by presenting a structured tutorial for experimentalists to integrate computer vision into high-throughput materials research, providing a detailed roadmap from data collection to model validation. Specifically, we describe the hardware and software stack required for deploying CV in materials characterization, including image acquisition, annotation strategies, model training, and performance evaluation. As a case study, we demonstrate the implementation of a CV workflow within a high-throughput materials synthesis and characterization platform to investigate the crystallization of metal–organic frameworks (MOFs). By outlining key challenges and best practices, this tutorial aims to equip chemists and materials scientists with the tools necessary to harness CV for accelerating materials discovery.
AI tools such as computer vision (CV), used to rapidly analyze digital images and videos, have significant potential to enhance automated materials discovery workflows.10 CV analysis can enable automated image classification,11–13 segmentation,14,15 and object detection.16,17 These capabilities have been implemented in a wide range of disciplines, including medical imaging,18,19 self-driving technologies,20 industrial automation,21 and agriculture.22,23 Within research laboratories, CV can be leveraged to monitor visual cues such as color changes,24 morphological changes,25 phase transitions,26 and crystal formation.27 Researchers often rely on such visual cues to make decisions about their synthetic protocol: for instance, determining if a compound is fully dissolved or waiting for a color change before proceeding to the next step. Within materials chemistry SDLs, where human researchers are not present to monitor reactions visually, CV techniques are particularly useful for assessing reaction progress and making decisions on next steps. Utilizing CV analysis and classification reduces researcher time spent on tedious, repetitive work and standardizes outputs to minimize the subjectivity associated with analyses performed by different human researchers. The improved consistency, speed, and scalability of CV analysis methods make them valuable tools within materials synthesis workflows.28–30
Despite these advantages, widespread implementation of CV analysis in the domain of materials discovery is hindered by a lack of publicly available, high-quality data and information on the requirements of integrating CV into specific experimental setups. Materials chemistry encompasses a broad range of subfields, with vastly different synthetic conditions, sample vessel requirements, and characterization procedures.8 Experimental researchers seeking to incorporate CV analysis into their workflows may lack expertise in machine learning and be unfamiliar with constructing a CV pipeline tailored to their unique experimental needs. To enable easier CV tool development and facilitate the broader use of these techniques within the materials discovery domain, standardized practices and instructions for non-experts on how to set up their own CV analysis pipelines are necessary.
As a case study, intended to help the community learn the “ins and outs” of the field, we apply CV techniques to analyze high-throughput crystallization experiments on nanomaterials, specifically metal–organic frameworks (MOFs). MOFs are self-assembled materials composed of metal ions or cluster “nodes” and organic “linker” molecules.31–33 The crystallinity, porosity, and tunability of MOFs have enabled their use in applications ranging from carbon dioxide capture34 and hydrogen storage35 to catalysis36 and drug delivery.37 Due to the vast range of nodes and linkers, as well as the diverse arrangements of these components, the synthetic parameter space of MOFs is extremely large. While HT methods have the potential to expedite the synthesis of novel MOFs, HT crystalline materials characterization remains a significant bottleneck and typically requires expensive, specialized instrumental setups. Efficient allocation of such characterization resources requires identifying promising candidate MOFs following a HT synthesis protocol. For instance, an unsuccessful reaction that does not produce solid MOF material should be excluded from further characterization. The development of rapid, cost-effective methods to screen promising candidate MOFs is essential for advancing novel MOF discovery in automated workflows.
A collaboration amongst our research groups has recently integrated CV into an automated, HT synthesis platform for MOF crystallization. From images of sample vials containing MOF precursors, such as metal salts and linkers in solution, we utilize CV analysis to rapidly classify material phases, including solids, liquids, and residues. The implementation of CV as a screening tool facilitates the identification of promising MOF candidates during exploratory synthesis programs that can involve screening hundreds or thousands of reactions per campaign. Considering their scalability, speed, and low implementation cost, CV techniques represent a valuable complement to more advanced characterization methods. While our case study focuses on MOF crystallization, we emphasize that the same procedure for constructing a CV pipeline can be applied to other material synthesis domains that rely on visual cues, such as color or phase changes. In this tutorial, we provide a comprehensive guide for experimental chemists and materials scientists to incorporate CV into their own synthesis workflows. We describe the design and optimization of a CV pipeline for detecting sample vials and classifying their contents based on images acquired during synthesis. In addition, we detail the challenges associated with defining phase labels in our MOF platform and provide recommendations for other researchers to adapt such decisions to their own chemistry tasks. Finally, we evaluate the performance of our classification model in terms of accuracy and speed, benchmarking it against human performance by surveying a cohort of researchers with varying technical familiarity with experimental chemistry and artificial intelligence.
Fig. 1 Schematic depicting MOF self-assembly. Initially, sample vials contain organic linker molecules and metal ions dissolved in solution. Following synthesis, solid product MOFs typically appear as a layer of powder or crystals within the vial. See Fig. 2 for computer vision (CV) images corresponding to this schematic.
Some aspects of the MOF synthesis process are more easily translated into automated HT setups. For instance, liquid handlers can transfer stock solutions of reagents, automated capping tools can seal vials, and integrated shaker plates and heating blocks can agitate samples and control reaction temperature.43–45 MOF researchers have employed these techniques in large screening protocols to determine optimal synthetic parameters for specific target MOFs,46–48 as well as to discover new framework materials.49 Despite advancements in HT MOF synthesis, automated characterization of the solid products remains challenging. Efforts to increase the throughput of powder X-ray diffraction (PXRD) characterization have included the use of motorized multi-sample holders,50 robotic sample changers,51 and articulated robotic arms to interact with existing diffraction instrumentation.52 However, these methods remain expensive and time-consuming, further complicated by the challenge of transferring MOF powders from their synthesis vessels into suitable sample holders for characterization.9
Besides HT characterization, another difficult aspect of automating MOF synthesis is acquiring feedback on the progress and results of the crystallization reaction. In a common manual MOF synthesis, human researchers rely on visual feedback to make decisions about the synthetic procedure and necessary characterization tasks. For instance, a researcher may check samples to confirm the starting materials are fully dissolved before heating, and alternatively, continue agitating or sonicating the sample if reagent dissolution is incomplete. Following a synthesis procedure, a researcher will inspect samples to determine if MOF powder or crystals have been formed. Adapting MOF synthesis to automated high-throughput platforms makes these visual feedback steps more challenging to implement, as human researchers are not present to monitor the status of each reaction. While a well-established synthetic protocol may not require visual assessment of reaction progress, synthesis campaigns aimed at discovering novel materials will likely involve screening a range of synthetic conditions with unknown outcomes. Many trial conditions may be unsuccessful: for instance, forming no solid material at all instead of MOF powder or crystals. Visually assessing the results of a synthesis campaign provides the researcher with valuable information about potential products and aids in selecting promising samples for further characterization.
To restore real-time visual feedback—unavailable in the absence of human researchers during automated syntheses—we incorporated computer vision (CV) analysis into our automated synthesis platform to monitor reaction progress and complement traditional materials characterization techniques. We sought to use CV as a filter to identify which sample vials contained solid MOF products and required further characterization by PXRD. For MOF synthesis, CV offers a scalable tool to streamline the discovery process, particularly when synthesis campaigns involve screening hundreds or thousands of reactions with unknown outcomes. Although PXRD analysis is ultimately needed to confirm crystallinity, an initial pass with CV rapidly identifies which conditions yield solid precipitate and merit further analysis, enabling more efficient allocation of time-intensive characterization resources. While CV is not a replacement for advanced characterization techniques, it is a cost-effective and scalable tool that can rapidly expedite materials characterization in a HT setting.
Common CV tasks that are relevant to automated materials synthesis and characterization include image classification, object detection, semantic segmentation, and instance segmentation. For example, in MOF image analysis, image classification determines whether specific material phases (e.g., solid, liquid, residue) are present in an image of a sample vial. Object detection goes further to identify the spatial positions of these phases and enclose them within bounding boxes. Semantic segmentation provides a more granular differentiation by assigning each pixel to a specific phase, while instance segmentation distinguishes individual instances within the same phase, such as separate crystals within a vial. The choice of CV task typically depends on the specific objectives of the HT image analysis.
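To make these task categories concrete, the minimal sketch below shows how each maps onto a different pre-trained model variant in the ultralytics Python package (the YOLO implementation used later in this tutorial); the weight-file and image names are illustrative placeholders, not files from this work.

```python
# The CV tasks above map onto different pre-trained model variants in the
# ultralytics package (pip install ultralytics); file names are illustrative.
from ultralytics import YOLO

classifier = YOLO("yolov8n-cls.pt")  # image classification: one label per image
detector = YOLO("yolov8n.pt")        # object detection: bounding boxes per phase
segmenter = YOLO("yolov8n-seg.pt")   # segmentation: per-pixel masks per phase

# All variants share the same inference call; "vial.jpg" is a placeholder.
results = detector("vial.jpg")
print(results[0].boxes)              # detected boxes, classes, and confidences
```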
A critical component in the CV pipeline is the training dataset, which comprises images labeled with objects of interest. This training dataset must be of sufficient quality and quantity to ensure adequate performance of the CV model. While some published datasets of chemistry-related images are available,15,53,54 many experimental use cases require custom, task-specific datasets. Collecting and annotating in-house data can be a resource-intensive process. Data augmentation techniques can be used to expand limited datasets by generating modified versions of existing images through methods such as color modulation, image rotation or scaling, mosaicking multiple images together, and adding pixel noise.55 Increasing data variability through these modifications also reduces the model's sensitivity to minor input variations, such as noise. Beyond dataset size, image quality is critical to ensure the CV model does not learn from unwanted noise, such as glare. Examples of noise reduction techniques include preprocessing approaches such as histogram equalization for glare reduction, and environmental adjustments such as controlling lighting conditions.56–59
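As an illustration of the augmentation step described above, the sketch below enables several such transformations through training arguments of the ultralytics YOLOv8 API (used later in this tutorial); the specific values are illustrative, not the settings used in this work.

```python
# On-the-fly data augmentation via YOLOv8 training arguments; augmented
# images are generated during training, so no extra storage is required.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="data.yaml",  # dataset config; see the dataset discussion below
    epochs=25,
    hsv_h=0.015,       # color modulation: random hue shift
    hsv_s=0.7,         # random saturation shift
    degrees=10.0,      # random rotation (degrees)
    scale=0.5,         # random scaling
    mosaic=1.0,        # mosaic several images together
)
```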
Another key component is the model architecture. For many CV models, convolutional neural networks (CNNs) are the benchmark method,60 with seminal CNN architectures such as LeNet61 and AlexNet.62 Deeper, more complex networks, including ResNet,63 You Only Look Once (YOLO),64 and Swin Transformer,65 among others, have achieved exemplary performance in CV tasks such as image classification and object detection. Importantly, CNN-based CV models have shown promise in materials chemistry applications, such as the classification, segmentation, and subsequent quantity estimation of chemically relevant artifacts, including liquids, solids, and residues on vessel walls.66,67 However, these initial models adopted an end-to-end approach that is effective only in highly controlled environments, where external factors such as lighting and environmental noise are minimized.64 When applied to larger or more complex systems where environmental factors are harder to control, model performance can degrade.64 Expanding these models to real-world settings thus requires significantly larger and more diverse labeled training datasets, making scalability technically demanding.
On the other hand, hierarchical models address the data scarcity challenge by breaking down the CV task into multiple stages, allowing different models and processing techniques to handle specific aspects of the task. In the context of MOF image analysis, a hierarchical model would first detect the region of interest (i.e., glass sample vials) in an image before classifying material phases (i.e., solid, liquid) within the detected region. Rather than attempting classification directly from raw images, this approach decouples environmental variability from the task of identifying chemically relevant artifacts. Given the high cost and time-consuming nature of image labelling,60 this decoupling provides a practical way to leverage large external datasets for the first model, which needs only general object detection, while allowing a more task-specific downstream model to be fine-tuned on a smaller, in-house dataset. Hierarchical models have previously been deployed for HT screening,67–69 demonstrating their potential to enhance scalability and adaptability for diverse materials chemistry tasks. Beyond model architecture selection, further performance gains can be achieved through hyperparameter tuning (e.g., learning rate or batch size), which we did not emphasize in our implementation but which may be explored for enhanced model performance.
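A minimal sketch of this hierarchical idea, assuming two already-trained YOLO models (the weight-file and image names are placeholders), might look like the following:

```python
# Two-stage hierarchical pipeline: stage 1 finds vials, stage 2 classifies
# material phases within each cropped vial region.
import cv2
from ultralytics import YOLO

vessel_model = YOLO("vessel_detector.pt")  # trained on general vessel images
phase_model = YOLO("phase_detector.pt")    # fine-tuned on in-house vial crops

image = cv2.imread("platform_frame.jpg")
for box in vessel_model(image)[0].boxes.xyxy:   # pixel-space box corners
    x1, y1, x2, y2 = (int(v) for v in box.tolist())
    crop = image[y1:y2, x1:x2]                  # decouple vial from background
    phases = phase_model(crop)[0]               # detect phases in the crop only
    print(phases.boxes.cls, phases.boxes.conf)  # class IDs and confidences
```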
In our case, the defined CV task involves detecting five material phases within MOF sample vials. These phases, commonly observed in MOF synthesis, are distributed across three layers from top to bottom: headspace, liquid, and solid (shown from left to right in Fig. 2). The headspace layer, located at the top of the vial, is classified as either empty or residue, where the latter indicates visible material deposits on the vial walls, as shown in the left box of Fig. 2 (additional examples are provided in Fig. S1 in the SI). The liquid phase, situated below the headspace (central box in Fig. 2), is classified as either homogeneous, meaning it consists of clear and uniform liquid, or heterogeneous, appearing cloudy or containing suspended particles (additional examples are shown in Fig. S2 and S3, respectively). Finally, the solid phase corresponds to powder or crystallites settled at the bottom of the vial (Fig. S4). To successfully implement a CV pipeline, it is essential to clearly define each class, particularly when handling edge cases that may be subjective. More detailed descriptions of each class can be found in Section S2.1 of the SI.
For our MOF crystallization example workflow, we utilized a HT synthesis platform equipped with automated liquid and solid dispensing, screw capping, and a robotic gripper tool to manipulate vials. Our MOF syntheses were performed in glass sample vials housed in a heating block. We installed a USB webcam inside the enclosure, positioning it to capture images of vials suspended above the heating block (Fig. 3). Images were captured when the gripper picked up each vial from the block and briefly held it in place. This process was automated for all vials, enabling the capture of images at specified points throughout the synthesis. To optimize image quality for CV analysis, we made minor modifications to our automated synthesis platform, including creating a consistent background for the vials by adding a black fabric backdrop opposite the camera. To minimize unwanted reflections, flat black paint was used to conceal metallic surfaces in the camera path. We also repositioned the enclosure's LED lights to illuminate the vials from above, reducing glare on the vial surfaces and ensuring uniform lighting regardless of the time of day. This approach was more effective than software-based glare reduction techniques,70 which we found removed critical artifacts of interest, such as falsely detected residue on the vial walls.
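For readers replicating this kind of setup, a minimal image-capture sketch with OpenCV might look like the following; the camera index, resolution, file paths, and trigger logic are assumptions that depend on the specific platform.

```python
# Capture a frame from the enclosure's USB webcam each time the gripper
# presents a vial to the camera; paths and camera index are placeholders.
import cv2

cap = cv2.VideoCapture(0)  # first enumerated USB camera (assumed index 0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)

def capture_vial_image(vial_id: str, timepoint: str) -> str:
    """Save one frame while the robotic gripper holds a vial before the camera."""
    ok, frame = cap.read()
    if not ok:
        raise RuntimeError("Camera read failed")
    path = f"images/{vial_id}_{timepoint}.jpg"
    cv2.imwrite(path, frame)
    return path
```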
In our case, we collected images with varying amounts of liquid and solid at different time points to capture a range of turbidity levels and solid quantities. Further, we captured images of vials lifted from various positions in the well plate. We created a dataset of 168 images captured from 56 unique sample vials (representing 3 replicate images of each vial taken at different timepoints throughout the synthesis reaction). The images were annotated with 5 classes: empty, residue, homogeneous liquid, heterogeneous liquid, and solid, as defined in Section 4.1. We used RoboFlow to annotate our datasets and export them in YOLO format. The solid materials in our images are exclusively white in color, though similar datasets can be constructed with a wider chromatic range.
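For reference, a YOLO-format dataset pairs each image with a plain-text label file in which every line reads `class x_center y_center width height` (all values normalized to 0–1), and a small YAML file declares the splits and class names. A sketch of such a config for our five classes is shown below; the directory paths are placeholders.

```yaml
# data.yaml — YOLO dataset configuration (paths are placeholders).
# Class indices must match those used in the exported label files.
path: mof_dataset
train: images/train
val: images/val
names:
  0: empty
  1: residue
  2: homogeneous liquid
  3: heterogeneous liquid
  4: solid
```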
Regardless of the model design, data augmentation can improve model performance, especially in low-data regimes. Most existing CV libraries offer on-demand augmentation, generating augmented images and annotations during training without increasing data storage burden. Once the model is trained, it should also be qualitatively evaluated for undesirable or incorrect behavior by visually inspecting its outputs. While quantitative metrics, such as mean Average Precision (mAP), F1 score, and precision, are good measures for monitoring training progress, they can be difficult to interpret accurately; hence, qualitative analysis becomes important. Finally, the visual cue detection model must be trained using inputs that match its deployment conditions. In a hierarchical pipeline, this means training on cropped container images produced by the first-stage detector. In contrast, a single-stage system can be trained directly on raw images from the camera with some optional pre-processing.
In our example workflow, we employ a hierarchical model architecture, consisting of two sequential YOLO models. We chose YOLO models for their balance of accuracy, speed, and ease of training. These models are particularly well-suited for non-specialists, thanks to the availability of pre-trained weights, built-in support for data augmentation, evaluation metrics, and hyperparameter tuning. Additionally, YOLO models are compatible with both high-performance GPUs and less powerful machines, making them broadly accessible. The first model in our hierarchy is designed to detect vessels of interest, specifically glass sample vials. Since the vial position and distance from the camera may vary, the model needs to be robust to changes in viewpoint and placement. To train the vessel detection model, we used the Vector-LabPics dataset, which includes a collection of 7900 images of laboratory equipment, such as beakers and flasks.15 We further augmented this dataset with 168 images of vials collected from our automated platform. The model was trained using default hyperparameters defined in YOLOv5 for 60 epochs, with an 80/20 train/validation split.
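As a concrete starting point, a comparable training run can be launched in a few lines. The sketch below uses the ultralytics Python API rather than the original YOLOv5 command-line scripts used here, and the weight and dataset file names are placeholders.

```python
# Fine-tune a pre-trained detector on the combined LabPics + in-house
# vessel dataset; "vessels.yaml" is a placeholder dataset config.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # pre-trained weights (this work used YOLOv5)
model.train(data="vessels.yaml", epochs=60, imgsz=640)  # default hyperparameters
metrics = model.val()       # mAP, precision, and recall on the validation split
print(metrics.box.map)      # mAP50-95
```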
We assessed model performance using mAP, a standard metric in object detection that measures the overlap between predicted bounding boxes and ground-truth objects. This overlap is quantified using the Intersection over Union (IoU) metric, which ranges from 0 to 1. We assess both mAP50, which considers the detection correct if the IoU is at least 0.5 (50%), and mAP50-95, which provides a stricter measure by averaging across multiple IoU thresholds from 0.50 to 0.95. Our vial detection model achieved a mAP50-95 of 0.826 (Table S1), demonstrating reliable detection performance within our automated synthesis platform. Additionally, we measured the model's precision and recall, which represent the proportion of predicted detections that are correct and the proportion of actual objects that are successfully detected, respectively (see Section S3.1 in the SI for the equations). The vial detection model achieved perfect scores of 1.00 for both metrics (Table S1). The vial detection model outputs bounding box coordinates for each detected vial in an image, which are then used to crop the image and isolate the vial from the background. These cropped vial regions serve as inputs for the second phase detection model (Fig. 4).
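For clarity, IoU is the area of overlap between a predicted box and a ground-truth box divided by the area of their union; a short reference implementation is sketched below, with illustrative box coordinates.

```python
# Intersection over Union (IoU) for two axis-aligned boxes given as
# (x1, y1, x2, y2) pixel corners.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

# Example: these boxes overlap with IoU ~0.39, so the detection would be
# counted as a miss at the 0.5 threshold used by mAP50.
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))
```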
The downstream phase detection model is a YOLOv8 model trained to detect the five material classes (empty, residue, homogeneous liquid, heterogeneous liquid, and solid) described above. The training dataset consisted of 168 images captured within our HT synthesis platform from 56 unique MOF sample vials. The vials encompassed a representative range of chemical artifacts and instances of each material phase (Table S2). Each phase was manually annotated using RoboFlow71 to create the labelled dataset. The phase detection model was similarly trained using an 80/20 train/validation split, with the default YOLOv8 hyperparameters for 25 epochs. Fewer epochs were required here than for the vial detection model, as the latter's LabPics-based training set contained greater image diversity. To enhance dataset diversity, we utilized YOLOv8's built-in data augmentation techniques in the model training function.73 The phase detection model achieved an overall mAP50-95 of 0.851 across all five phases (Table 1). Table 1 also reports precision and recall, calculated according to eqn (S1) and (S2), respectively. To further assess model performance, we analyzed the normalized confusion matrix shown in Fig. 5. Among all classes, ‘empty’ was the most challenging to predict, with the lowest proportion of correct predictions (0.73). For all classes, most errors stemmed from missed detections, where the model failed to identify any object. Other notable misclassifications include ‘residue’ frequently predicted as ‘homogeneous liquid’ and ‘empty’ often mistaken for ‘heterogeneous liquid’ (with values of 0.12 and 0.27 in the confusion matrix, respectively, as shown in Fig. 5). The phase detection model outputs bounding box coordinates for each detected region in the image and classifies it as one of the defined phases. These predictions and coordinates are then utilized in downstream post-processing methods, including visualization and annotation, as well as decision-making in subsequent characterization tasks.
| Phase (N) | mAP50 | mAP50-95 | Precision | Recall |
|---|---|---|---|---|
| Empty (11) | 0.964 | 0.851 | 0.877 | 0.948 |
| Residue (11) | 0.951 | 0.927 | 0.818 | 0.909 |
| Homogeneous liquid (16) | 0.995 | 0.940 | 0.881 | 1.000 |
| Heterogeneous liquid (8) | 0.995 | 0.822 | 0.898 | 1.000 |
| Solid (12) | 0.934 | 0.684 | 0.907 | 0.833 |
| Overall average performance | 0.964 | 0.851 | 0.877 | 0.948 |
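A row-normalized confusion matrix like the one in Fig. 5 can be produced with a few lines once each prediction has been matched to a ground-truth annotation (or to a “missed” placeholder for undetected objects). The sketch below uses illustrative label sequences, not the study's data.

```python
# Build a row-normalized confusion matrix from matched labels; a "missed"
# class captures ground-truth objects with no corresponding detection.
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["empty", "residue", "homogeneous liquid",
           "heterogeneous liquid", "solid", "missed"]
y_true = ["empty", "residue", "solid", "empty", "homogeneous liquid"]
y_pred = ["empty", "homogeneous liquid", "solid", "missed", "homogeneous liquid"]

cm = confusion_matrix(y_true, y_pred, labels=classes)
row_totals = cm.sum(axis=1, keepdims=True)
cm_norm = np.divide(cm, row_totals,
                    out=np.zeros_like(cm, dtype=float), where=row_totals > 0)
print(np.round(cm_norm, 2))  # each row: fractions of a true class per prediction
```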
If the trained model does not meet the desired performance levels, this may indicate a need for further refinement. We recommend interpreting the evaluation metrics in the context of your specific objectives to identify appropriate next steps. For example, a low recall score suggests the model is missing true objects, and collecting more representative labelled data may help. Readers are encouraged to consult additional resources for more detailed guidance on interpreting evaluation metrics.74 However, we would like to emphasize that what is considered ‘satisfactory’ model performance highly depends on the specific use case. For example, a less precise model may be sufficient for prioritizing promising samples in early-stage exploratory screening, while higher-stakes scenarios (such as discarding expensive materials) may require stricter thresholds and more iterative model refinements. In some cases, it may also be appropriate to prioritize performance on specific classes of interest.
1. Dataset.py – This script prepares the annotated dataset (in our case, exported from Roboflow in YOLO format; Section S2.4) for subsequent model training. While there are no strict rules, we recommend a minimum of 100 instances per class. For more robust models, particularly when working with images featuring diverse backgrounds or subtle differences between classes, we recommend increasing the dataset size to over 500 instances per class. As described in Section 3, data augmentation techniques can be strategically employed to enhance model performance.
2. ProcessLabPics.py – This script creates a vessel detection dataset from the Vector-LabPics dataset by grouping labels corresponding to vessels such as flasks, beakers, and vials into one class called “vessel”.15 The resulting dataset is then randomly split into 80% training and 20% testing sets. Models trained on this dataset alone can detect a variety of vessels in general lab settings; however, they can be less successful in more specialized setups or setups not represented in the LabPics dataset. To address this, images of the new setup, with vessels of interest labelled, can be added to the LabPics-derived dataset to improve detection accuracy.
3. Train.py – This script splits the dataset into training and validation sets, then initiates training using the YOLOv8 framework. Key training parameters include: (i) batch size, which determines the number of images processed in a single iteration; (ii) image size, which defines the input image resolution; and (iii) number of epochs, which refers to the number of complete training cycles performed over the entire dataset. In most cases, the default YOLO hyperparameters perform well; however, advanced users may choose to explore hyperparameter tuning (e.g., adjusting the learning rate) to further optimize model performance. Of these training parameters, the number of epochs is particularly important: too few epochs may lead to model underperformance, while too many can cause the model to “memorize” the training data, resulting in poorer performance on unseen images. Other hyperparameters, such as batch size, can be adjusted to optimize memory usage during training. Once training is complete, the script returns the model weights, saved in a format compatible with YOLO inference (best.pt for YOLOv8), along with evaluation metrics to assess model training performance.
4. Test.py – This script performs inference on new images using the trained YOLO model. For each detected object, the model outputs bounding box coordinates, class predictions, and confidence scores. By default, YOLO returns normalized bounding box coordinates in the format [x_center, y_center, width, height], where each value is scaled between 0 and 1 relative to the image dimensions. In our case, we convert these normalized coordinates back to absolute pixel values using the original image's width and height. These coordinates are then used to overlay bounding boxes on the vial images for visualization, as illustrated in the sketch following this list. The model outputs can also be utilized in downstream tasks such as region-based quantification or cropping.
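A sketch corresponding to steps 3 and 4 is shown below: it loads trained weights, runs inference on a new image, converts the normalized box coordinates back to pixels, and overlays the results. The weight and image paths are placeholders.

```python
# Run inference with trained weights, convert YOLO's normalized
# [x_center, y_center, width, height] boxes to pixel corners, and draw them.
import cv2
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # placeholder weights path
image = cv2.imread("new_vial.jpg")
h, w = image.shape[:2]

results = model(image)[0]
for xywhn, cls, conf in zip(results.boxes.xywhn, results.boxes.cls,
                            results.boxes.conf):
    xc, yc, bw, bh = xywhn.tolist()
    xc, bw = xc * w, bw * w            # rescale x-values by image width
    yc, bh = yc * h, bh * h            # rescale y-values by image height
    x1, y1 = int(xc - bw / 2), int(yc - bh / 2)
    x2, y2 = int(xc + bw / 2), int(yc + bh / 2)
    label = f"{model.names[int(cls)]} {float(conf):.2f}"
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(image, label, (x1, y1 - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

cv2.imwrite("annotated.jpg", image)  # overlay for visual inspection
```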
Both the accuracy and speed tasks utilized a dataset comprising 378 images of 42 unique MOF sample vials, captured using our automated setup, as described in the previous sections (a detailed phase distribution among these images is presented in Table S4). This dataset is unique to the user study; none of the images were used in model training or validation. Ground truths were established through independent annotations by two domain experts, with any disagreements resolved by a third expert (for more details, see Section S4.2, Table S5 and Fig. S6).
For the accuracy task, participants were shown definitions and examples of five phase categories: empty, residue, homogeneous liquid, heterogeneous liquid, and solid (Fig. S7). They were then asked to label five randomly selected images by drawing bounding boxes around each vial and entering the corresponding phase name (Fig. S8). A Flask-based server was used to randomly select images. Accuracy was measured by comparing participant labels to ground truth labels, and F1 scores were computed for each phase to assess per-class performance (eqn (S3)). We observe that the model outperformed human participants in overall F1 scores across all phases, with notable improvements in detecting empty, residue, and solid phases (Fig. 6). In addition, the model achieved higher accuracy than human participants, regardless of the number of phases present in the image (Fig. S12). For additional discussion, see Section S4.3 in the SI.
For the speed task, participants were shown five randomly selected vial images and asked a yes-or-no question about the presence of a specific phase (e.g., “Is the solid phase present?”), as shown in Section S4.4 of the SI. The time each participant spent observing the image and answering the question was recorded. On average, the model was over 50 times faster at identifying phases than the human participants, processing each image in 0.062 seconds when run on two NVLINK Nvidia RTX A6000 GPUs (48 GB VRAM). In comparison, participants took an average of about 3.65 seconds per image (Fig. S14). Importantly, this speed advantage did not compromise phase identification performance. During the speed task, the model outperformed human participants in terms of accuracy across all phases, by 31%, 11%, 17%, and 10% for residue, homogeneous liquid, heterogeneous liquid and solid phases, respectively (Fig. S15).
To evaluate user perspectives, participants were asked to respond to a series of general statements about conducting parallel experiments (Fig. S16) and using CV analysis for phase labeling (Fig. S17). Responses were recorded on three- and five-point Likert scales to gauge perceptions of the model's utility, with responses for the latter ranging from “strongly disagree” to “strongly agree”. Overall, responses reflected a strong recognition of the model's practical benefits: 84% agreed that manual labeling is tedious (combining 31% “agree” and 53% “strongly agree”), and 82% agreed that the model could accelerate the identification of samples requiring further characterization (questions a and b, Fig. S17). Additionally, 80% agreed that the model would facilitate conducting parallel experiments (question g, Fig. S17).
However, perceptions of trust and accuracy were more divided. Only 41% indicated they would trust the model for their experiments (question f, Fig. S17), and just 17% preferred it over human annotators (question e, Fig. S17). Moreover, only 34% of participants believed the model labelled phases more accurately than humans, while 28% disagreed (question j, Fig. S17). These responses suggest that while users appreciate the efficiency gains, many remain cautious about entirely deferring to the model's judgment. While 69% of participants reported understanding the model's logic, the same proportion indicated they had never known or used similar AI tools before (questions d and h, Fig. S17). This lack of previous interaction and user experience with such AI models may hinder trust and broader adoption. These results highlight a valuable opportunity: although the model is seen as helpful in reducing workload and enabling scalability, its broader acceptance will depend on increasing transparency, interpretability, and user training. Notably, more than 61% of the 33 participants who regularly conduct parallel experiments reported that doing so compromises the time they can devote to each experiment (Fig. S16), further underscoring the potential value of automated tools like the one introduced here.
Finally, we included two open-ended questions to survey participants on how this AI model could benefit their own experimental workflows, as well as to identify any potential concerns (Table S12). A word cloud generated from responses to the first question highlighted ‘time’ and ‘help’ as prominent keywords (Fig. S18). Qualitative analyses of the 49 responses to the first question revealed that 22 participants explicitly mentioned timesaving as a key advantage, along with related terms such as “efficiency” and “faster” (see Table S12 for all responses). These responses underscore that participants recognize the model's potential and value in streamlining experimental processes and improving overall efficiency.
In addition, we analyzed the results from the experimental chemists separately (Fig. S19–S21), and we observed that phase identification is a difficult task generally, rather than a task for which a specific experimental background significantly boosts performance.
Overall, the user study suggests that employing CV models for phase detection and progress tracking in chemistry tasks can make these processes more quantifiable and objective. The higher error rates observed in human subject annotations indicate that CV models could contribute to establishing more standardized interpretations of visual cues relevant to chemical processes. Moreover, while such models have the potential to reduce cognitive load on scientists and enhance their performance, responses to the user trust questionnaire highlight the importance of maintaining human oversight in automated systems. Human oversight is not only relevant for the accuracy of the computer vision system but also for related safety considerations.75,76
An observation shared by multiple user study participants related to the challenge of identifying phases from a single-viewpoint image compared to handling the actual sample vial. One respondent noted, “In a real lab setting, a human would jiggle the flask and change their viewing angle to ensure they identify the vial contents correctly; this dramatically increases the accuracy of a human chemist's identifications.” Accurate classification from a single image is more challenging, particularly in cases involving more ambiguous phase labeling. For instance, a sample containing colorless crystals in solution is difficult to distinguish from a sample containing a similar clear liquid without close examination of the real vial. Capturing multiple images from different angles could increase the robustness of a CV model in such cases. For our MOF crystallization workflow, we aimed to use CV to complement additional characterization techniques rather than as a standalone analysis method. HT materials characterization necessarily involves a tradeoff between throughput and data quality. We believe that the speed and scalability of CV techniques make them a valuable counterpart to traditional characterization methods. CV analysis is particularly useful when manual inspection of every sample in a high-throughput campaign is not feasible. Unlike human researchers, CV models can consistently handle large image datasets without fatigue.
CV analysis is clearly not applicable to situations where successful and failed syntheses cannot be visually distinguished, such as homogeneous chemical reactions that require characterization by spectroscopic methods. One possible exception is when imaging beyond the visible range (infrared, ultraviolet, etc.) is available via hyperspectral cameras. However, many tasks relevant to materials synthesis can benefit from incorporating CV tools. Beyond MOF crystallization, user survey participants offered a range of chemical tasks where CV may be useful, including solubility testing, extractions, liquid level detection, and precipitation reactions. The hierarchical model architecture employed in this tutorial enables generalization of our approach to additional chemistry tasks through fine-tuning of each individual model. For instance, the vessel detection model can be retrained to accommodate different sample vessel sizes, positions, or lighting conditions. Similarly, the phase detection model can be fine-tuned to distinguish other relevant artifacts, such as solution color. By creating custom, fine-tuned models, chemists and materials scientists can integrate their experimental expertise directly into the model itself, without relying on previous datasets that may not apply to their specific use case.
In our example MOF crystallization workflow, we utilized CV to distinguish promising samples following synthesis, based on the appearance of solid material from initially clear solutions. Beyond this simple case of product identification, we anticipate that this CV workflow will provide additional insights into MOF crystallization kinetics by comparing phases identified from photos captured at multiple timepoints. A CV approach provides rapid and inexpensive access to kinetic information, in contrast to traditional time-resolved materials characterization techniques such as in situ PXRD, which are not feasible in a HT setting. Many further opportunities exist to apply CV techniques within materials synthesis workflows, including real-time sample analysis, integration into automated platforms to guide decision-making, and handling potential errors from automated protocols without human intervention. We emphasize that the tunability of a hierarchical model approach enables the implementation of customized CV models tailored to specific chemistry goals while minimizing the amount of training data required. Considering the range of chemistry and materials science tasks where visual analysis is relevant, we hope this approach empowers experimentalists to incorporate CV tools within their own synthesis workflows. A revolution is underway in science with the advent of agentic systems.77–80 We believe that integrating these computer vision workflows into agentic self-driving lab experiments81–83 will further expand the toolset of agentic science by providing additional “eyes” to automated AI science agents.
• ImageNet classification with deep convolutional neural networks.62
• No-code computer vision with RoboFlow.84
• OpenCV Bootcamp.85
• IBM: introduction to computer vision and image processing.86
• The State University of New York: computer vision basics.87
Supplementary information (SI): hardware information, model training and performance details, and user study methodology and results. See DOI: https://doi.org/10.1039/d5dd00384a.
Footnote
† Equal contribution.
This journal is © The Royal Society of Chemistry 2026