Open Access Article
Shigeaki Goto* and Tatsuki Hasebe
Toyota Central R&D Labs., Inc., 41-1, Yokomichi, Nagakute, Aichi, Japan. E-mail: sg-goto@mosk.tytlabs.co.jp
First published on 12th May 2026
“Creative tool use” refers to the flexible application of tools beyond their intended purpose. In scientific experiments, this behavior is described as a “lab hack,” and its automatic documentation is valuable for accumulating experimental knowledge. Recently, vision-language models (VLMs) have shown promise for generating procedural descriptions from experimental videos. However, VLMs typically rely more on object-based knowledge than on understanding the manipulations. This issue is often overlooked in existing laboratory video datasets, as tools are typically used in standard, prescribed ways. Thus, the extent to which these models can interpret and describe actions that extend beyond object-based knowledge, such as creative tool use, remains uncertain. Moreover, laboratory environments often contain numerous items unrelated to the operation (i.e., decoy objects), which can divert the model's attention and further complicate the accurate identification of creative manipulations. To address this limitation, we developed an evaluation dataset called “CREOLab” (CREative tool use in Object-rich Laboratories), consisting of 65 videos from 13 experimental scenarios featuring creative tool use, each recorded across five levels of decoy object density. Using a state-of-the-art, cloud-based VLM captioning system, we evaluated model performance. As the number of decoy objects increased, the model tended to insert redundant procedural steps or omit essential ones. As a result, it failed to document scenarios involving creative tool use accurately. These findings suggest that enhancing the reliability of automatic experimental recording with VLMs requires mechanisms for automated verification of generated outputs, as well as recording protocols that reduce the influence of decoy objects.
To enhance the usefulness of ELNs, a part of the metadata entry process should be automated to reduce the workload.1,6 Recent studies have explored the automatic labeling or captioning of experimental videos using recognition techniques. Previous research on automated experiment recording has focused mainly on predefined operational categories, such as reagent addition or stirring, rather than on diverse, undefined manipulations.11–13 These efforts can be framed as temporal action segmentation problems, which are typically addressed through classification-based neural networks.14,15
However, laboratory operations are far more diverse and often display considerable creativity in practice. For instance, weighing paper may not only serve to measure chemicals but also be used as a temporary tool or even as a notepad. Such flexible, purpose-extending practices are commonplace and reflect the ingenuity of researchers who adapt tools to achieve experimental goals efficiently in the absence of specialized equipment. In a laboratory context, these behaviors are often referred to as lab hacks,16 whereas in automation studies, they are described as creative tool use.17,18 Both embody valuable procedural knowledge that merits systematic documentation.
Recently, video captioning technologies based on vision LMs (VLMs) have been actively investigated to convert arbitrary video content into descriptive text.19–22 Among these, cloud-based VLMs, such as GPT-based architectures, outperform standalone models on video captioning benchmarks.23,24 However, VLMs generally rely more on prior object knowledge than on the temporal dynamics of visual sequences, often resulting in overly simplistic interpretations of video content.25–27 In particular, existing laboratory video datasets primarily capture standard experimental procedures in which tools are used for their conventional purposes.11,13,28,29 Consequently, even when a model appears to recognize an action such as “pipetting,” it is unclear whether the model demonstrates a true understanding of the operation or simply relies on shortcut reasoning triggered by the presence of the pipette. Therefore, these object-knowledge biases often remain undetected under conventional evaluation settings. This limitation raises a critical question: is it possible to automatically record laboratory actions that embody researchers' creativity and ingenuity, such as lab hacks or creative tool use, that transcend object-level recognition?
This study aims to identify the challenges associated with automatically recording experimental procedures, including atypical operations, from laboratory videos within a video captioning framework using competitive and representative VLMs. To achieve this goal, we propose an evaluation dataset named CREOLab (CREative tool use in Object-rich Laboratory), which comprises 13 scientific experimental scenarios involving creative tool use (Fig. 1). The dataset is specifically designed to progressively introduce multiple irrelevant laboratory items (decoy objects) into each scene. This design enables rigorous evaluation of whether VLMs can genuinely interpret and document manipulative actions rather than relying excessively on correlations between visible objects and their typical uses.
The main contributions of this study are summarized as follows:
• Development of the CREOLab video captioning evaluation dataset based on 13 scientific experimental scenarios focusing on creative tool use.
• Introduction of multiple decoy objects in each scenario to assess the robustness of caption generation, systematically increasing the number of decoys to conduct an in-depth analysis of object-knowledge biases.
• Development of a fully automated evaluation protocol for VLMs incorporating a pipeline that integrates caption generation with cloud-based VLMs and checklist-based evaluation.
• Execution of 1000 caption generations and evaluations for each of multiple cloud-based VLMs using a procedural documentation system, revealing the limitations of current VLMs in procedural recording.
The remainder of this paper is organized as follows. Section 2 reviews research related to procedural recording and automation. Section 3 provides detailed information about the dataset. Section 4 describes the evaluation methodology. Section 5 presents and discusses the results, including future challenges in the automatic recording of scientific procedures. Finally, Section 6 presents the conclusion.
With advances in deep learning, particularly in LMs, recent research has shifted toward associating procedural step descriptions with corresponding video scenes based on semantic understanding. Nishimura et al.29 introduced BioVL, a dataset of 16 bioscience experiment videos, to evaluate the ability of video-language embedding models to align procedural texts with visual events. Cui et al.30 developed the ProBio dataset for molecular biology, defining a recognition task to identify which procedural step in a written protocol corresponds to an observed video action. They also introduced three difficulty levels based on the ambiguity of the procedural descriptions. Nishimoto et al.31 developed BioVL-QR, a dataset distinguished by the use of QR codes attached to experimental items. By detecting and matching these QR codes within the video, they leveraged object-level information to predict the correspondence between procedural steps in known protocols and specific video segments.
In recent years, the rapid advancement of VLMs has attracted significant attention in generating procedural texts directly from video data. Nishimoto et al.31 highlighted that their BioVL-QR dataset could serve as a valuable resource for automatic generation of experimental protocols through video understanding. Similarly, Chen et al.32 employed curated online videos of scientific experiments and used VLMs to produce natural language descriptions, including procedures, underlying principles, and safety guidelines, by referencing Wikipedia.
One limitation of existing datasets is that they primarily emphasize experimental procedures in which tools are used conventionally, thereby limiting insight into whether VLMs genuinely interpret scientific procedures or instead rely predominantly on prior knowledge of standard tool usage. To address this, our CREOLab dataset emphasizes creative tool-usage scenarios, featuring atypical operations that cannot be comprehended through shortcuts based solely on object knowledge.
Large LMs (LLMs) possess extensive prior knowledge of tools and environments. Research has shown that leveraging this knowledge enhances performance on captioning benchmarks. For example, Chou et al.37 demonstrated that pretrained models, such as GPT, can generate commonsense textual knowledge regarding the functions and purposes of tools within a scene, and incorporating this auxiliary information significantly improves the accuracy of action captioning. Furthermore, Niu et al.38 proposed a method that inputs the initial and final frames of a video into an LLM to infer intermediate procedural steps using the model's chain-of-thought reasoning. More recently, cloud-based models, such as GPT, have been extended to process visual inputs, enabling their direct application as VLMs for video captioning and producing high-quality, contextually coherent results.23,24
However, studies have shown that current VLMs still struggle to capture motion and action details in videos accurately. Through ActionBench experiments involving video reversal and antonymic verb substitution, Wang et al.25 revealed that many VLMs rely excessively on prior object knowledge while underemphasizing dynamic motion. Similarly, Shvetsova et al.26 identified limitations in existing benchmarks that permit correct predictions based primarily on object or background recognition. They proposed improved benchmarking methods to eliminate such biases. Ma et al.27 also introduced “hard negative” captions that alter actions while retaining the same objects, thereby increasing the task's difficulty.
Based on the aforementioned prior studies, irrelevant scientific instruments are deliberately placed in the scenes of our CREOLab dataset as decoys. This serves as a highly challenging benchmark that closely reflects a realistic laboratory environment and is likely to induce captioning errors owing to object-knowledge bias. Furthermore, by introducing decoys incrementally within the same scenario, our dataset enables a more quantitative and in-depth analysis of object-knowledge bias than previous datasets.
We conducted interviews with seven in-house researchers actively engaged in experimental work. Their areas of expertise covered diverse disciplines, including electrochemistry, inorganic chemistry, polymer chemistry, biology, and thermal engineering. Through discussions with these participants, we collected 23 instances of creative tool use. From this set, we excluded three scenarios for safety reasons, such as those involving heating operations, four scenarios because the required tools were not readily available, and three scenarios that did not align with the dataset's concept, specifically those in which the proposed creative tool use corresponded to the standard use of the tool at the verb level. For each scenario described in the interviews, we abstracted research-specific details and reconstructed the scenario with a different performer. The reenactments were recorded on video, and the original researchers subsequently reviewed them. Whenever inconsistencies were identified, the procedures or tools were modified accordingly. Ultimately, we constructed 13 creative tool use scenarios (Table 1).
| ID | Tool | Creative use | Additional decoy 1 | Additional decoy 2 | Additional decoy 3 | Additional decoy 4 |
|---|---|---|---|---|---|---|
| D1 | Micropipette | Sowing tiny seeds | Sucrose solution | Tweezers | Seed | Distilled water |
| D2 | Sponge | Fixing a working electrode | Electrode polishing agent | Neutral detergent | Ultrasonic cleaner | Ultrapure water |
| D3 | Weighing paper | Preparing a label for an NMR tube | NMR test compound | Laboratory balance | Microspatula | Micropipette |
| T1 | Screwdriver | Winding a platinum wire into a coil | Pliers | Cross-recessed screw | Heat sink | Tweezers |
| T2 | Ultrasonic cleaner | Dispersing catalyst ink in a vial | Distilled water | Air blow gun | Neutral detergent | Sponge |
| T3 | Filter paper | Covering the crystallizing dish | Ethanol | Powder reagent | Beaker | Laboratory balance |
| T4 | Alcohol | Promoting the drying of a heat sink | Air blow gun | Laboratory wipe | Hex key | Spray bottle |
| T5 | Weighing paper | Partitioning metal plate samples | Chemical paste | Plastic spatula | Powder reagent | Microspatula |
| T6 | Toothpick | Transferring microbial cells from an agar plate | Micropipette | Pipette tip box | Pipette tip waste container | Test tube |
| T7 | Paint brush | Resuspending contents in a microtube | Microcentrifuge | Filter paper | Microtube rack | Powder reagent |
| T8 | Pipette tip | Dispensing liquid (without using a micropipette) | Micropipette | Vial cap | Pipette tip waste container | Volumetric pipette |
| T9 | Weighing paper | Pouring pellets into a vial; using it as a conical funnel | Laboratory spatula | Powder reagent | Laboratory balance | Ultrapure water |
| T10 | Filter paper | Serving as a substrate for reagent application | Laboratory spatula | Filter funnel | Beaker | Pestle |
Each scenario was assigned an ID with the prefix “D” or “T.” The prefix “D” indicates the development split used for video captioning and prompt design. The prefix “T” denotes the test split used for evaluating previously developed captioning systems.
In every scenario, the primary tool is intentionally used in a creative manner distinct from its conventional application. For example, in scenario D1, tiny seeds about 0.2 mm in diameter, which cannot be grasped with tweezers, are soaked in water and individually sown by drawing them up one by one with a micropipette, an instrument initially designed for dispensing liquids. In scenario T9, because a funnel with a flow path wide enough for pellets cannot fit into a short vial bottle, a simple funnel is improvised by rolling a sheet of weighing paper, typically used for packaging reagents.
Thus, none of the scenarios represent misuse of tools; instead, they depict realistic and rational instances of creative problem-solving. Note, however, that these scenarios are abstracted examples and have not been formally vetted for experimental safety.
Although some scenarios were designed to simulate work in a draft chamber (fume hood), all videos in this dataset were recorded under standard, air-conditioned conditions. For safety reasons, no actual chemicals were used; instead, visually similar household materials served as substitutes. Specifically, water was used in place of alcohol, and seasonings represented various chemicals. For nonchemical instruments, such as micropipettes and disk working electrodes, all items, including decoy objects, were genuine laboratory tools.
Each recording was captured using a single RGB camera mounted on a tripod, with the viewing angle adjusted for each scenario. Within the same scenario, the task content was standardized, and the same operation was performed and recorded across five decoy levels. The demonstrations were restricted to scenes involving creative tool use, which is the focus of our investigation, while preceding and subsequent tasks were excluded. Consequently, the dataset comprises 65 videos, with durations ranging from 10.5 to 60.0 s and an average length of 22.9 s. No audio was recorded.
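The dataset layout described above (13 scenarios, each recorded at five decoy levels) can be enumerated with a short sketch. The scenario IDs come from Table 1; the dictionary keys `scenario`, `split`, and `n_decoys` and the function name `enumerate_videos` are hypothetical and are not part of the published dataset format.

```python
from itertools import product

# Scenario IDs as listed in Table 1: D1-D3 (development), T1-T10 (test).
DEV_IDS = [f"D{i}" for i in range(1, 4)]
TEST_IDS = [f"T{i}" for i in range(1, 11)]
# Five decoy levels: zero to four additional decoy objects.
DECOY_LEVELS = range(5)


def enumerate_videos():
    """List one record per (scenario, decoy level) pair.

    The record layout is an illustrative assumption, not the dataset's schema.
    """
    records = []
    for scenario, n_decoys in product(DEV_IDS + TEST_IDS, DECOY_LEVELS):
        split = "dev" if scenario.startswith("D") else "test"
        records.append({"scenario": scenario, "split": split, "n_decoys": n_decoys})
    return records


videos = enumerate_videos()
n_test = sum(1 for v in videos if v["split"] == "test")
```

Under these assumptions, `videos` has one entry per recording, and `n_test` counts the recordings belonging to the test split.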
Fig. 2 An example of GT data. For each video, the names of objects (including decoys), their relative positions in the image, and a caption common to the scenario are provided.
Since such VLMs cannot process raw video files, discrete frames extracted from each video were used as sequential image inputs. A shorter frame interval minimizes the likelihood of missing scene transitions but results in a larger number of frames. Although the application programming interface supports numerous image inputs, not all may be considered during inference.39 To address this trade-off, following previous studies,40–42 the input videos were divided into short segments. Each video was segmented into 6 s units with a 1 s overlap between consecutive segments. From each segment, seven evenly sampled frames (four in the case of LLaMA 4 owing to input limitations) were used as prompts for the VLM to generate captions. The model's segment-level captions were integrated and summarized into a concise procedural description of approximately three steps.
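The segmentation step can be illustrated with a minimal sketch: 6 s windows with a 1 s overlap, and a fixed number of evenly spaced frame timestamps per window. The function names are hypothetical, and two details are assumptions not specified in the text: the final window is truncated at the end of the video, and the sampled timestamps include both window boundaries.

```python
def segment_windows(duration_s, window_s=6.0, overlap_s=1.0):
    """Return (start, end) times, in seconds, of overlapping windows.

    Assumption: the last window is shortened to end at the video's end.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap_s must be in [0, window_s)")
    stride = window_s - overlap_s
    windows = []
    start = 0.0
    while True:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += stride
    return windows


def sample_times(start, end, n_frames=7):
    """Return n_frames evenly spaced timestamps in [start, end].

    Assumption: both endpoints are sampled when n_frames > 1.
    """
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    if n_frames == 1:
        return [(start + end) / 2]
    step = (end - start) / (n_frames - 1)
    return [start + i * step for i in range(n_frames)]


# Shortest and longest clip lengths reported for the dataset.
short_clip = segment_windows(10.5)
long_clip = segment_windows(60.0)
```

With these settings, consecutive windows start 5 s apart, so the number of VLM calls per video grows roughly linearly with its duration. Passing `n_frames=4` mirrors the reduced frame count used for LLaMA 4.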
During captioning, an object list containing the names and spatial positions of all objects was incorporated into the prompt alongside the initial frame image. This addition helped reduce inconsistencies in terminology across segments. To differentiate the model's captioning capability from its object detection performance, two operational modes were implemented:
• Manual detection mode: this mode uses the “objects” attribute from the GT data, allowing captioning to proceed without relying on automatic object detection.
• Automatic detection mode: this mode does not use any GT data. Instead, information corresponding to the “objects” attribute is automatically generated by the VLM by analyzing context frames sparsely sampled from the entire video. In this case, the captioning outcomes directly reflect the VLM's inherent ability to recognize objects.
In this study, a precise absolute evaluation was unnecessary because our primary goal was to clarify the limitations of current VLMs. However, to efficiently visualize and explore a large number of experimental results, it was desirable to assign scores that enabled relative comparisons. Therefore, we employed a checklist-based point-deduction method and constructed an automated evaluation pipeline using the same VLM as that used in the captioning module as the evaluator. The deduction criteria were as follows:
• Critical step omissions: −15 points per instance.
• Incorrect step sequence: −12 points per instance.
• Unnecessary additional steps: −8 points per instance.
• Incomplete step descriptions: −5 points per instance.
• Incorrect terminology: −10 points per instance.
• Ambiguous terminology: −5 points per instance.
We determined the deduction values according to their impact on task reproducibility. Each evaluation began with a perfect score of 100 points, from which deductions were applied for corresponding errors to compute the final score. Analyzing each deduction factor enables a statistical discussion of failure patterns.
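The point-deduction rule can be written down as a short sketch. The deduction values are those listed above; the category keys, the function name `checklist_score`, and the choice to floor the score at zero are assumptions for illustration, since the text does not state how scores below zero are handled. In the study, the error counts themselves are judged by a VLM evaluator, which this sketch does not reproduce.

```python
# Points deducted per instance, as listed in the deduction criteria.
DEDUCTIONS = {
    "critical_step_omission": 15,
    "incorrect_step_sequence": 12,
    "unnecessary_additional_step": 8,
    "incomplete_step_description": 5,
    "incorrect_terminology": 10,
    "ambiguous_terminology": 5,
}


def checklist_score(error_counts, floor_at_zero=True):
    """Compute a score from per-category error counts.

    Returns (score, breakdown), where breakdown maps each category that
    occurred to the total points it cost. Flooring at zero is an assumption.
    """
    unknown = set(error_counts) - set(DEDUCTIONS)
    if unknown:
        raise KeyError(f"unknown error categories: {sorted(unknown)}")
    breakdown = {
        category: DEDUCTIONS[category] * count
        for category, count in error_counts.items()
        if count
    }
    score = 100 - sum(breakdown.values())
    if floor_at_zero:
        score = max(score, 0)
    return score, breakdown


# One omitted critical step and one unnecessary extra step.
score, breakdown = checklist_score(
    {"critical_step_omission": 1, "unnecessary_additional_step": 1}
)
```

Returning the breakdown alongside the score keeps the per-factor deductions available for the kind of failure-pattern analysis described above.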
Deduction judgments were made by comparing the generated captions with the annotated GT procedures. However, automatically generated captions do not always use identical terminology. For instance, if a scene depicts “holding distilled water,” the automatic recognition system might detect the object as a bottle and describe it as “holding a bottle”. When the correspondence between “distilled water” and “bottle” is predefined, both expressions are treated as equivalent. Otherwise, “holding distilled water” is considered missing and penalized under critical step omissions, while “holding a bottle” is treated as an unnecessary action under unnecessary additional steps. To mitigate this issue, we created a name-mapping scheme that aligned the lists, thereby accommodating terminological variations during the evaluation process.
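One simple way to apply such a name mapping is to rewrite detected object names into their GT equivalents before comparison. The sketch below is an assumption about how this could be done with plain string substitution; the actual scheme used in the study may differ, and the function name `normalize_names` is hypothetical. The example mapping reuses object names that appear in scenario T5.

```python
import re


def normalize_names(caption, name_map):
    """Replace detected object names in a caption with their GT names.

    Matching is case-insensitive and whole-word; longer names are tried
    first, and all replacements happen in a single pass so that a
    replacement is never itself rewritten.
    """
    if not name_map:
        return caption
    keys = sorted(name_map, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in name_map.items()}
    return pattern.sub(lambda m: lookup[m.group(1).lower()], caption)


# Hypothetical mapping from automatically detected names to GT names.
t5_map = {
    "copper sheet": "metal sample",
    "plastic film": "weighing paper",
    "plastic tray": "sample storage case",
}
normalized = normalize_names("Place the copper sheet on the plastic film.", t5_map)
```

After normalization, a caption that describes the correct action with a different object name is no longer penalized twice, once as an omission and once as an unnecessary step.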
Table 2 presents the mean scores stratified by the number of decoy objects. The Overall row shows the result obtained by combining all trials, which is equivalent to the result shown in Fig. 5. T1–T10 indicate the mean scores further stratified by scenario. For each row, we report the regression slope β of the mean score with respect to the number of decoy objects, the coefficient of determination R², and the one-sided p-value testing whether the slope is negative. In addition, Cohen's d relative to the no-decoy condition (d = 0) is shown below the mean score for each decoy level.
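The slope, R², and Cohen's d can be computed as in the following sketch, which uses only the standard library. The function names are hypothetical, and the pooled-standard-deviation form of Cohen's d is an assumption, as the text does not specify which variant was used. The one-sided p-value requires a t-distribution and is not included here.

```python
from statistics import mean, variance


def slope_and_r2(xs, ys):
    """Ordinary least-squares slope and coefficient of determination."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    beta = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return beta, r2


def cohens_d(condition, baseline):
    """Cohen's d of `condition` relative to `baseline` (pooled SD; assumed variant)."""
    n1, n2 = len(condition), len(baseline)
    pooled_var = (
        (n1 - 1) * variance(condition) + (n2 - 1) * variance(baseline)
    ) / (n1 + n2 - 2)
    return (mean(condition) - mean(baseline)) / pooled_var ** 0.5


# Overall mean scores per decoy level, manual detection mode (OUR row, Table 3).
decoys = [0, 1, 2, 3, 4]
manual_means = [81.8, 80.2, 76.1, 74.1, 67.4]
beta_manual, r2_of_means = slope_and_r2(decoys, manual_means)
```

Fitting the five condition means reproduces the reported manual-mode slope of about −3.49. The R² obtained from the means is much higher than the reported trial-level value, because averaging removes the within-condition variance; the reported R² therefore has to be computed on the individual trial scores.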
In the overall analysis, β was negative in the manual detection mode (β = −3.49) and the automatic detection mode (β = −2.09), with p < 0.01, indicating a significant negative effect of decoys on the score. However, the magnitude of this effect varied considerably across scenarios. Specifically, the standard deviation (SD) of β across scenarios was 3.40 in the manual mode and 2.78 in the automatic mode. Relative to the magnitude of β, the coefficient of variation (CV = SD/|β|) was 0.97 and 1.33, respectively, indicating substantial scenario-dependent variability. Furthermore, the relatively modest R² values suggest that the observed effects were influenced by random-seed variation and potential nonlinearity in the decoy effect.
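As a quick check of the arithmetic, the coefficient of variation can be recomputed from the reported values. The function name is hypothetical; the inputs are the SD and β values quoted above.

```python
def coefficient_of_variation(sd, beta):
    """CV = SD / |beta|, as defined in the text."""
    return sd / abs(beta)


cv_manual = coefficient_of_variation(3.40, -3.49)
cv_auto = coefficient_of_variation(2.78, -2.09)
```

A CV near or above 1 means that the spread of per-scenario slopes is as large as the overall slope itself, which is why conclusions drawn from a single scenario can be misleading.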
Therefore, evaluating VLM captioning only under specific scenarios or decoy conditions may lead to misleading interpretations. A robust conclusion regarding model-wise trends can be drawn only by evaluating across the full set of 10 scenarios and the five decoy levels we defined. Indeed, in some scenarios, marked score reductions were observed only at specific numbers of decoys, and these cases are examined in detail below.
Fig. 6 Generated captions under the gray-shaded conditions in Table 2(a). Top: zero decoys; middle: three decoys; bottom: four decoys. (GT caption: 1. Remove the crystal using a laboratory spatula and place it in the crystallizing dish. 2. Cover the crystallizing dish with filter paper.)
Next, attention is turned to the gray-shaded conditions in Table 2(b) (automatic detection mode, scenario T5, one decoy). This scenario involves separating two samples (automatically detected as a copper sheet) using weighing paper and storing them in a sample tray (automatically detected as a plastic tray). Because this mode relies on automatic object detection, the results varied depending on the random seed of the VLM. For the weighing paper, five of the ten trials classified it as “plastic film,” whereas the other five classified it as “adhesive film”. Examples corresponding to the median score within each classification group are shown in Fig. 7. Since an “adhesive film” would be unsuitable as a separator because of its stickiness, this represents a functional misrecognition. Consequently, key steps related to creative tool use, such as “placing the film,” were omitted and replaced with unrelated actions, such as “applying material from a bowl” or “pressing down”. These misclassifications ultimately degraded caption quality. These findings demonstrate that when the system relies on automatic object detection, recognition errors can misguide the captioning process, leading to failures in documenting creative tool use scenarios.
Fig. 7 Differences in generated captions based on automatic recognition results of the weighing paper under the gray-shaded condition in Table 2(b). Top: when recognized as plastic film; bottom: when recognized as adhesive film. (GT caption: 1. Place the metal sample on the weighing paper inside the sample storage case using tweezers. 2. Cover the metal sample with another sheet of weighing paper. 3. Place an additional metal sample on top of the weighing paper.)
Table 3 summarizes the results obtained by applying each metric to the Overall scores presented in Table 2. The proposed method (OUR) captured the degradation in video captioning performance associated with increasing numbers of decoy objects more effectively than the conventional metrics. In particular, it produced a smaller one-sided p-value than the conventional metrics for the test of a negative slope in the linear regression of score on decoy count. It also exhibited the largest effect size, indicating that it more clearly distinguished differences among conditions. Among the conventional metrics, BERTScore performed relatively well and detected a negative trend in the manual detection mode (β < 0, p < 0.001).
| Metric | 0 decoys | 1 decoy | 2 decoys | 3 decoys | 4 decoys | β | R² | p(β < 0) |
|---|---|---|---|---|---|---|---|---|
| (a) Manual detection mode | ||||||||
| OUR | 81.8 | 80.2 (−0.11) | 76.1 (−0.34) | 74.1 (−0.42) | 67.4 (−0.70) | −3.49 | 0.062 | <0.001 |
| BLEU-4 × 100 | 14.4 | 15.7 (+0.13) | 15.5 (+0.11) | 13.0 (−0.14) | 13.1 (−0.13) | −0.53 | 0.006 | <0.05 |
| METEOR × 100 | 32.0 | 31.9 (−0.02) | 31.2 (−0.17) | 30.9 (−0.23) | 30.4 (−0.34) | −0.42 | 0.015 | <0.01 |
| CIDEr-D | 0.31 | 0.35 (+0.06) | 0.38 (+0.12) | 0.26 (−0.10) | 0.21 (−0.20) | −0.030 | 0.006 | <0.05 |
| BERTScore × 100 | 59.5 | 59.0 (−0.06) | 58.1 (−0.16) | 57.1 (−0.29) | 55.4 (−0.47) | −1.01 | 0.026 | <0.001 |
| (b) Auto detection mode | ||||||||
| OUR | 73.7 | 66.8 (−0.28) | 69.8 (−0.17) | 68.8 (−0.21) | 62.1 (−0.45) | −2.09 | 0.014 | <0.01 |
| BLEU-4 × 100 | 4.4 | 2.1 (−0.37) | 4.5 (+0.02) | 3.7 (−0.11) | 2.6 (−0.28) | −0.19 | 0.002 | 0.163 |
| METEOR × 100 | 18.9 | 18.3 (−0.11) | 20.6 (+0.35) | 19.6 (+0.15) | 18.7 (−0.05) | +0.08 | 0.001 | 0.702 |
| CIDEr-D | 0.09 | 0.08 (−0.08) | 0.10 (+0.02) | 0.11 (+0.07) | 0.04 (−0.25) | −0.01 | 0.002 | 0.178 |
| BERTScore × 100 | 40.0 | 39.1 (−0.10) | 42.6 (+0.27) | 41.4 (+0.15) | 39.1 (−0.10) | +0.05 | 0.000 | 0.571 |
However, in automatic detection mode, none of the conventional metrics, including BERTScore, reliably detected a negative trend. One possible explanation is that the object names assigned by the VLM during automatic detection do not necessarily match the expressions used in the reference captions, rendering conventional metrics that emphasize lexical overlap inherently unstable. These results indicate that the proposed checklist-based method enables visualization of the breakdown of penalty factors and outperforms conventional metrics in detecting degradation in video captioning performance.
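The sensitivity of overlap-based metrics to naming mismatches can be illustrated with a deliberately simplified stand-in. The sketch below computes a unigram F1 score, which is not BLEU, METEOR, CIDEr-D, or BERTScore, but shares their dependence on the wording of the reference. The function name and the example captions are hypothetical; the object names are borrowed from scenario T5.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Token-level F1 between two captions (simplified overlap measure)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "cover the metal sample with another sheet of weighing paper"
# Same action described with the reference's object names.
same_names = "cover the metal sample with a sheet of weighing paper"
# Same action described with automatically detected object names.
renamed = "cover the copper sheet with a piece of plastic film"

f1_same = unigram_f1(same_names, reference)
f1_renamed = unigram_f1(renamed, reference)
```

Both candidate captions describe the same manipulation, yet the one using detected object names scores markedly lower. An overlap-based score in automatic detection mode therefore mixes naming noise with genuine procedural errors, which is consistent with the instability observed above.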
The results are summarized in Table 4. For each model, we report the values corresponding to the “Overall” metric in Table 2. The linear regression coefficient β is negative across all models, indicating, as with the GPT-5 results, that decoys generally have an adverse effect on captioning. Notably, the slope is statistically significant for LLaMA-4 and Claude-4. By contrast, for Gemini-3, the absolute value of β and the associated effect size are relatively small, and the p-value is not sufficiently low, suggesting that the impact of decoy quantity is limited within the 10 test scenarios and may not be fully captured by this dataset.
| VLM | 0 decoys | 1 decoy | 2 decoys | 3 decoys | 4 decoys | β | R² | p(β < 0) |
|---|---|---|---|---|---|---|---|---|
| (a) Manual detection mode | ||||||||
| LLaMA-4 | 59.5 | 56.6 (−0.15) | 56.0 (−0.17) | 53.0 (−0.34) | 53.4 (−0.33) | −1.58 | 0.013 | <0.01 |
| Gemini-3 | 84.7 | 82.2 (−0.17) | 81.9 (−0.19) | 82.6 (−0.14) | 80.4 (−0.26) | −0.82 | 0.005 | 0.054 |
| Claude-4 | 58.2 | 57.3 (−0.05) | 54.3 (−0.21) | 52.4 (−0.30) | 48.7 (−0.48) | −2.41 | 0.027 | <0.001 |
| (b) Auto detection mode | ||||||||
| LLaMA-4 | 52.0 | 48.1 (−0.23) | 45.9 (−0.35) | 41.7 (−0.65) | 43.0 (−0.57) | −2.44 | 0.037 | <0.001 |
| Gemini-3 | 80.5 | 79.3 (−0.07) | 82.4 (+0.12) | 77.0 (−0.20) | 77.8 (−0.15) | −0.77 | 0.004 | 0.082 |
| Claude-4 | 46.3 | 40.8 (−0.28) | 38.8 (−0.43) | 37.6 (−0.51) | 35.6 (−0.58) | −2.46 | 0.038 | <0.001 |
Overall, the proposed experimental pipeline demonstrates consistent applicability across multiple VLMs and effectively reveals challenges posed by decoys. At the same time, although the CREOLab dataset exposes such challenges across models, the results indicate that improving evaluation robustness remains an open challenge, potentially achievable through dataset expansion with additional high-difficulty scenarios.
(1) Although this study evaluated VLMs in a relatively simple video captioning pipeline design, there remains a marked potential for refinement through architectural exploration. One promising strategy is the sequential segment captioning approach,51 in which the latent semantics of one segment serve as contextual conditioning for subsequent segments. Furthermore, integrating a reflection mechanism52,53 that can detect and correct missing procedural steps could improve accuracy. In particular, existing video-protocol alignment frameworks29–31 could serve as verification agents that prompt caption regeneration when inconsistencies with the video are detected.
(2) Improving environmental conditions for video capture can further reduce captioning errors. For instance, attaching QR codes to objects31 may help mitigate misrecognition-related inaccuracies. Developing an extended version of the CREOLab dataset that integrates such measures could also be explored in the future.
(3) Although the proposed dataset focuses only on video information, incorporating nonvisual modalities such as tactile54 and auditory cues55,56 could help mitigate object-knowledge bias in automated captioning. More broadly, such an extension would also make the ELN database valuable beyond serving merely as a repository of human operations. For instance, automatically generated captions could function as annotations for training robot foundation models.57,58 In addition, human operation records enriched with nonvisual modality information could support the training of vision-tactile-language-action models.59 This type of multimodal experimental record could provide a foundation for future robot-driven automated experimentation and SDLs.
(4) The object-knowledge bias revealed by the proposed dataset may be intrinsic to machine-learning-based VLM approaches. By contrast, when experiments are performed using captioning systems grounded in nonmachine-learning methods, such as Bayesian approaches exemplified by Bayesian networks,60 the presence or extent of such bias may substantially differ.
Nevertheless, the dataset does not yet encompass the full diversity of experimental scenarios. For example, it does not include complex, specialized procedures, such as preprocessing in materials analysis or operations involving multiple experimental instruments. Consequently, even if a model performs well across the 13 scenarios in CREOLab, this should not be interpreted as evidence of comprehensive video captioning capability for all scientific experiments. In future work, we plan to expand CREOLab to include a broader array of scenarios. Furthermore, user-specific scientific experiment scenarios could be evaluated if datasets are generated following the format established in this study. While careful attention is essential to confidentiality, we hope that such scenarios can be shared within the scientific community whenever feasible. By sharing these challenging scenarios and continuously expanding the dataset, more robust evaluations will become possible, facilitating improvements in captioning technologies. Moreover, these advancements can lay the groundwork for future data-driven laboratories by integrating reliable captioning technologies into ELNs and providing foundational laboratory records for autonomous robotic experimentation and SDLs.
This journal is © The Royal Society of Chemistry 2026