Ryosuke Sasaki,a Mikito Fujinami,b and Hiromi Nakai*ab
aDepartment of Chemistry and Biochemistry, School of Advanced Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan. E-mail: nakai@waseda.jp
bWaseda Research Institute for Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
First published on 16th October 2024
Developments in deep learning-based computer vision have significantly improved the performance of applied research. Applying image recognition methods to manually conducted chemical experiments is a promising route to digitizing traditional practices for experimental recording, hazard management, and educational applications. This study investigated the feasibility of automatically recognizing manual chemical experiments using recent image recognition technology. Two tasks were evaluated: object detection, which identifies the locations and types of objects in images, and action recognition, which infers human actions in videos. The image and video datasets of chemical experiments were originally constructed by capturing scenes in actual organic chemistry laboratories. The assessment of inference accuracy indicates that image recognition methods can effectively detect chemical apparatuses and classify experimental manipulations.
Chemical experiments that traditionally rely on manual procedures would benefit from the application of AI technologies. For example, recent studies have created datasets of chemical experiments for object detection5,6 and segmentation, and evaluated their recognition accuracy7 using deep learning-based methods. Another study explored augmenting image data of chemical apparatuses by artificially combining diverse images to enhance object detection.8 However, these efforts have predominantly concentrated on identifying chemical objects. Understanding manual chemical experiments goes beyond object identification and requires recognizing the manipulative actions of the experimenter. To the best of our knowledge, there have been no reports on applying action recognition techniques to chemical experiments. This gap highlights an opportunity for further research on applying AI to the comprehensive analysis and automation of chemical experimentation.
This study aims to automatically recognize chemical experiments using image recognition technology. Combining the information obtained from object detection and action recognition is expected to be a promising approach. Previously, we constructed an image dataset of chemical apparatuses for object detection.9,10 This study presents the object detection performance achieved with that image dataset. In addition, a video dataset was constructed and applied to an action recognition method. The assessment provides a proof of concept for the feasibility of applying action recognition to manipulations in chemical experiments.
The structure of this article is as follows: Section 2 provides details on the adopted chemical datasets, with a specific focus on the video dataset. Section 3 describes the applied image recognition techniques and computational details. Section 4 presents the performance results of applying the image recognition methods to the datasets. Finally, concluding remarks are presented in Section 5.
A chemical experiment video dataset was constructed for action recognition. The videos were recorded in organic chemistry laboratories using a fixed camera. Some of the videos had also been provided as e-learning materials for a laboratory class in the faculty. Clips of approximately 10 s were selected from the videos, and action labels were assigned manually. The video dataset format aligns with that of UCF-101,13 a standard dataset for action recognition.
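As an illustration of the UCF-101-style organization, labelled clips can be arranged into one directory per action class together with plain-text split lists. The sketch below uses hypothetical file and directory names and is not the authors' tooling.

```python
# Minimal sketch (not the authors' code): organizing labelled clips in a
# UCF-101-style layout, i.e. one directory per action class plus plain-text
# split lists. File and directory names here are hypothetical.
from pathlib import Path
import shutil

CLASSES = ["Adding", "Stirring", "Transferring"]

def organize_clips(labelled_clips, root="chem_actions"):
    """labelled_clips: list of (path_to_clip, class_name) pairs."""
    root = Path(root)
    for clip_path, class_name in labelled_clips:
        class_dir = root / class_name
        class_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(clip_path, class_dir / Path(clip_path).name)

def write_split_list(labelled_clips, list_file):
    """Write a UCF-101-style list: '<Class>/<clip> <class_index>' per line."""
    with open(list_file, "w") as f:
        for clip_path, class_name in labelled_clips:
            idx = CLASSES.index(class_name) + 1  # UCF-101 class indices start at 1
            f.write(f"{class_name}/{Path(clip_path).name} {idx}\n")
```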
Three actions were selected to complement object detection and to capture chemical manipulations: “adding,” “stirring,” and “transferring.” “Adding” involves manipulations such as adding reagents or solutions between apparatuses, using pipettes or dispensing spoons. “Stirring” includes stirring with a glass rod and shaking or inverting the apparatus. “Transferring” refers to the manual movement of apparatuses and reagent bottles. Fig. 1 illustrates representative examples of these actions with four frames clipped from the videos. Fig. 1(a) shows an “adding” sample, in which a solution is transferred from a conical beaker to an Erlenmeyer flask. Fig. 1(b) represents “stirring,” in which the top of an eggplant-shaped flask is held and its contents are stirred. Fig. 1(c) is a “transferring” sample, in which the top of a conical beaker is grabbed and the beaker is moved from one end of the screen to the other. Three video samples for each action, including the representative examples shown in Fig. 1, are provided in the ESI.†
The video dataset was divided into training, validation, and test subsets. Table 1 lists the number of videos for each action across the subsets. A total of 478 videos were created and divided into 321 for training, 109 for validation, and 48 for testing. For both the image and video datasets, the training, validation, and test subsets were extracted from independent, non-overlapping source videos to prevent data leakage between the subsets (a minimal sketch of such a group-wise split is given after Table 1). Although the subsets contain the same types of apparatuses and laboratory views, the dataset includes diverse backgrounds and situations surrounding the objects.
Class | Training | Validation | Test |
---|---|---|---|
Adding | 158 | 53 | 18 |
Stirring | 91 | 31 | 18 |
Transferring | 72 | 25 | 12 |
Total | 321 | 109 | 48 |
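A minimal sketch of the group-wise split mentioned above, which keeps all clips cut from the same source recording in a single subset to avoid leakage between subsets. It assumes scikit-learn and hypothetical clip and recording identifiers; it is not the authors' preprocessing code.

```python
# Minimal sketch (assumed workflow, not the authors' code): split clips into
# training/validation/test subsets so that all clips from the same source
# recording stay in one subset, preventing leakage between subsets.
from sklearn.model_selection import GroupShuffleSplit

def group_split(clips, source_ids, test_size=0.1, val_size=0.2, seed=0):
    """clips: list of clip paths; source_ids: recording ID for each clip."""
    outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    trainval_idx, test_idx = next(outer.split(clips, groups=source_ids))

    trainval = [clips[i] for i in trainval_idx]
    trainval_groups = [source_ids[i] for i in trainval_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=val_size, random_state=seed)
    train_idx, val_idx = next(inner.split(trainval, groups=trainval_groups))

    return ([trainval[i] for i in train_idx],
            [trainval[i] for i in val_idx],
            [clips[i] for i in test_idx])
```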
Model | mAP50 | Hand | Conical beaker | Erlenmeyer flask | Reagent bottle | Pipette | Eggplant shaped flask | Separatory funnel |
---|---|---|---|---|---|---|---|---|
YOLOv8n | 0.855 | 0.732 | 0.951 | 0.978 | 0.880 | 0.532 | 0.933 | 0.981 |
YOLOv8x | 0.890 | 0.749 | 0.972 | 0.984 | 0.943 | 0.625 | 0.962 | 0.995 |
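Detections such as those summarized above can be generated with the ultralytics YOLOv8 API. The sketch below uses hypothetical weight-file and image names and is not the authors' inference script.

```python
# Minimal inference sketch using the ultralytics YOLOv8 API (the weight file
# and image path are hypothetical placeholders, not the authors' artefacts).
from ultralytics import YOLO

model = YOLO("yolov8x_chem_best.pt")          # fine-tuned detector weights
results = model("test_scene.jpg", conf=0.25)  # run detection on one image

for box in results[0].boxes:                  # iterate over detected objects
    cls_name = model.names[int(box.cls)]      # e.g. a class such as "Hand"
    x1, y1, x2, y2 = box.xyxy[0].tolist()     # bounding-box corners (pixels)
    print(f"{cls_name}: conf={float(box.conf):.2f}, "
          f"box=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")
```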
Fig. 2 shows examples of object detection on the test data obtained using YOLOv8x. The conical beaker was correctly recognized in Fig. 2(a). In Fig. 2(b), two Erlenmeyer flasks containing solutions of different colors were accurately detected. In Fig. 2(c), both the hand and the pipette were correctly recognized. In Fig. 2(d), the separatory funnel, four reagent bottles, and an eggplant-shaped flask were identified correctly. When the entire target object was captured at a relatively large size, a tendency toward accurate detection was observed. In addition, multiple objects in an image were detected correctly even when they partially overlapped.
Fig. 2 Examples of object detection for test data using YOLOv8x. The detected objects are enclosed by rectangular frames with object labels.
Fig. 3 shows examples of misrecognition and non-detection in object detection for the test data using YOLOv8x. In Fig. 3(a), the hand covered with a white rubber glove was not detected, whereas the exposed, skin-colored arm was misrecognized as a hand. In Fig. 3(b), the hand was correctly recognized, but the Erlenmeyer flasks held by hand were not detected. In Fig. 3(c), the Erlenmeyer flask was correctly recognized, whereas the pipette inserted into the flask was not detected. In Fig. 3(d), the Erlenmeyer flask on the left side was identified correctly, but the conical beaker on the right side was misrecognized as an Erlenmeyer flask. For hand detection, the misrecognized cases indicate that object color is a critical factor in prediction. Detection also becomes more difficult when objects overlap. Changes in the bounding-box area with the angle of the object further affect the recognition accuracy, particularly for pipette detection: a pipette captured horizontally or vertically is enclosed in an elongated rectangle, whereas one captured diagonally is enclosed in a large square. This large diversity in object color and bounding-box area contributed to the lower prediction accuracy observed for the hand and pipette in the statistical evaluation. Increasing the variation in the training data would help mitigate such color- and angle-based biases (an illustrative augmentation sketch is given after Fig. 3).
Fig. 3 Examples of misrecognition in object detection using YOLOv8x. The examples include cases where an object that should be recognized is not detected or is incorrectly labeled.
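The augmentation sketch referred to above is given here. It uses torchvision transforms to increase color and orientation variation in training images; it is illustrative only, not the augmentation pipeline actually used for the YOLOv8 training, and for object detection the bounding boxes would need to be transformed consistently with the images.

```python
# Illustrative sketch only: image augmentations that could increase colour and
# orientation variation in the training data (torchvision transforms).
import torchvision.transforms as T

augment = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3,
                  saturation=0.3, hue=0.05),   # vary glove/skin and solution colours
    T.RandomRotation(degrees=15),              # vary apparatus angles, e.g. tilted pipettes
    T.RandomHorizontalFlip(p=0.5),
    T.RandomResizedCrop(640, scale=(0.6, 1.0)),
])
# augmented = augment(pil_image)  # apply to a PIL image during training
```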
Average | Adding | Stirring | Transferring |
---|---|---|---|
0.86 | 0.94 | 0.89 | 0.74 |
Fig. 4 illustrates examples of action recognition for the test data. In Fig. 4(a), the “adding” of a solution from one conical beaker to another was correctly recognized. In Fig. 4(b), the “stirring” of the blue solution by a hand holding the top of the Erlenmeyer flask was correctly classified. In Fig. 4(c), the “transferring” of the conical beaker was correctly recognized. Fig. 5 shows examples of misrecognition. In Fig. 5(a), the “stirring” of the blue solution in the beaker by hand was misidentified as “transferring”; the confidence scores, which indicate the probability assigned to each action class, were 0.61 for “transferring” and 0.35 for “stirring.” In Fig. 5(b), the “transferring” of the Erlenmeyer flask by hand was misclassified as “stirring,” with confidence scores of 0.70 for “stirring” and 0.30 for “transferring.” In both misrecognized cases, “adding” exhibited a significantly low confidence score, indicating confusion between “transferring” and “stirring.” Misrecognition was observed in cases where the hand and apparatus orientations were similar across different actions and where one action switched to another at the end of the video. These misrecognitions suggest that action recognition depends not only on hand and object movements but also on the type and angle of the object, and that a mixture of actions within a video degrades recognition. These findings emphasize the importance of data variation and meticulous data curation when constructing video datasets for action recognition. Notably, the assessment demonstrated a prediction accuracy of 86% on the test data based on learning from 321 chemical experiment videos.
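The confidence scores quoted for Fig. 5 correspond to class probabilities derived from the classifier output. A minimal sketch of this step, assuming a PyTorch 3D ResNet that returns one logit per action (model and tensor names are placeholders, not the authors' code):

```python
# Minimal sketch of how per-class confidence scores such as those quoted for
# Fig. 5 can be obtained: a softmax over the classifier logits of a clip.
import torch
import torch.nn.functional as F

CLASSES = ["adding", "stirring", "transferring"]

@torch.no_grad()
def classify_clip(model, clip_tensor):
    """clip_tensor: (1, 3, T, H, W) video clip, already normalized."""
    logits = model(clip_tensor)               # shape (1, 3), one logit per action
    probs = F.softmax(logits, dim=1)[0]       # class probabilities summing to 1
    scores = {c: float(p) for c, p in zip(CLASSES, probs)}
    return max(scores, key=scores.get), scores
```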
The dataset constructed in this study is limited in both size and label variety, particularly for the video data. Although the manually filmed and curated datasets are highly reliable, they lack diversity in laboratory settings, personnel, and equipment. To evaluate the recognition accuracy on an entirely external dataset, the model trained in this study was applied to object detection on the LabPics dataset.7 The detection accuracy was lower than that obtained for the dataset used in this study; detailed results of this verification are provided in the ESI.† Developing a universally applicable model that covers diverse experimental situations will require a much more extensive and diverse dataset and is anticipated to be a considerable challenge. As an alternative, building datasets and ML models that are specifically optimized for individual laboratories could be effective. In either case, developing a platform that partially automates data collection and model training is a promising direction for future research.
Despite recent innovative developments, such as the optimization of experimental conditions through high-throughput or flow reactors18–21 and autonomous experiments performed by chemical robots,22–24 most laboratories still rely on manual procedures because the experimental protocols applicable to specific automated equipment are limited. The present image recognition of chemical experiment videos is therefore expected to benefit manual experiments, including automatic experiment recording, hazard warnings, and evaluation support for novice chemists, with minimal introduction costs, requiring only the installation of network-connected video cameras.
ResNet-34, a 3D ResNet model with 34 layers, was used for action recognition. Stochastic gradient descent with momentum was applied, using weight decay and momentum values of 0.0005 and 0.9, respectively. The learning rate was set to 0.1 for the first 50 epochs, 0.01 from epochs 51 to 100, and 0.001 from epochs 101 to 5000. Data augmentation techniques, including four-corner/center cropping, horizontal flipping, and scaling of video clips, were employed as provided in the 3D ResNet implementation. A model pre-trained for 200 epochs on Kinetics-700 (ref. 25) and Moments in Time26 was adopted as the initial model for learning.
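The optimization settings above can be written in PyTorch roughly as follows. This is a sketch only: the torchvision r3d_18 backbone serves as a runnable stand-in, whereas the study used a 34-layer 3D ResNet initialized from Kinetics-700/Moments in Time pre-training.

```python
# Sketch of the stated optimization settings in PyTorch. The r3d_18 backbone
# is only a runnable stand-in for this sketch; the study used a 34-layer
# 3D ResNet initialized from Kinetics-700 / Moments in Time pre-training.
import torch
from torchvision.models.video import r3d_18

model = r3d_18(num_classes=3)   # three action classes

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                     # initial learning rate
    momentum=0.9,
    weight_decay=0.0005,
)
# Learning rate schedule: 0.1 (epochs 1-50), 0.01 (51-100), 0.001 (101-5000).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[50, 100],
                                                 gamma=0.1)

# for epoch in range(5000):
#     ...train one epoch over augmented video clips...
#     scheduler.step()
```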
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \qquad (1)$$
Symbols A and B denote the regions where the predicted and ground-truth objects exist, respectively. Predictions whose IoU exceeded the threshold were used for classification. Precision and recall based on the classification results were used to compute the AP, which is defined as the area under the precision–recall curve; it ranges from zero to one, with higher values indicating better prediction accuracy. Two metrics are commonly used: AP50, for which the IoU threshold is set to 0.5, and AP50–95, the average AP obtained by varying the IoU threshold from 0.5 to 0.95 in steps of 0.05.27 In general, AP50–95 provides a more stringent evaluation than AP50. The other metric, mAP, is the average AP over all classes.
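For illustration, a minimal implementation of the IoU in eqn (1) for axis-aligned bounding boxes (a sketch, not the evaluation code of the detectors used here) is:

```python
# Minimal sketch of eqn (1): intersection over union (IoU) of two axis-aligned
# bounding boxes given as (x1, y1, x2, y2) pixel coordinates.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlapping region
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction counts as correct for AP50 if iou(pred, truth) >= 0.5.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```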
Fig. 6 Learning curve of object detection. The epoch and corresponding mAP for the validation dataset during the training of YOLOv8n and YOLOv8x are shown.
Model | mAP50 | Hand | Conical beaker | Erlenmeyer flask | Reagent bottle | Pipette | Eggplant shaped flask | Separatory funnel |
---|---|---|---|---|---|---|---|---|
YOLOv8n | 0.825 | 0.882 | 0.866 | 0.806 | 0.881 | 0.722 | 0.785 | 0.832 |
YOLOv8x | 0.854 | 0.873 | 0.913 | 0.848 | 0.871 | 0.780 | 0.821 | 0.871 |
Fig. 7 illustrates the learning curve for action recognition using the 3D ResNet. The horizontal and vertical axes represent the epoch and prediction accuracy, respectively, and the orange line indicates the accuracy on the validation data. The accuracy increased up to approximately 3000 epochs and then tended to converge with oscillations, suggesting that the training progressed appropriately. The model at epoch 3218 displayed the highest prediction accuracy on the validation data. Table 5 lists the prediction accuracy for action recognition on the validation data, giving the accuracy for the three actions and their average. For the validation data, the classification accuracies for “adding,” “stirring,” and “transferring” were 0.96, 0.84, and 0.60, respectively, with an average accuracy of 0.80. This model was selected as the optimal model and applied to the numerical verification (a minimal sketch of the per-class accuracy computation is given after Table 5).
Fig. 7 Learning curve of action recognition using 3D ResNet. The epochs and corresponding accuracy for the validation dataset are shown.
Average | Adding | Stirring | Transferring |
---|---|---|---|
0.80 | 0.96 | 0.84 | 0.60 |
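The per-class and average accuracies reported in the tables can be computed as sketched below; here the “Average” column is interpreted as the mean of the three per-class accuracies, which is consistent with the reported values. This is an illustrative sketch, not the authors' evaluation script.

```python
# Minimal sketch (not the authors' code) of per-class accuracy and its
# mean over the three action classes, as reported in the accuracy tables.
from collections import defaultdict

def per_class_accuracy(true_labels, predicted_labels):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        correct[t] += int(t == p)
    per_class = {c: correct[c] / total[c] for c in total}
    average = sum(per_class.values()) / len(per_class)  # mean over action classes
    return average, per_class
```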
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00015c |