CREOLab: A Procedure Captioning Dataset for Understanding Creative Tool Use in Object-Rich Laboratory Videos
Abstract
“Creative tool use” refers to the flexible application of tools beyond their intended purpose. In scientific experiments, this behavior is often described as a “lab hack,” and documenting it automatically is valuable for accumulating experimental knowledge. Recently, vision-language models (VLMs) have shown promise for generating procedural descriptions from experimental videos. However, VLMs typically rely more on object-based knowledge than on an understanding of the manipulations themselves. This issue is often overlooked in existing laboratory video datasets, as the tools they depict are typically used in standard, prescribed ways. Thus, the extent to which these models can interpret and describe actions that extend beyond object-based knowledge, such as creative tool use, remains uncertain. Moreover, laboratory environments often contain numerous items unrelated to the operation (i.e., decoy objects), which can divert the model’s attention and further complicate the accurate identification of creative manipulations. To address this gap, we developed an evaluation dataset called “CREOLab” (CREative tool use in Object-rich Laboratories), consisting of 65 videos from 13 experimental scenarios featuring creative tool use, each recorded across five levels of decoy object density. We evaluated model performance using a state-of-the-art, cloud-based VLM captioning system. As the number of decoy objects increased, the model tended to insert redundant procedural steps or omit essential ones, and consequently failed to accurately document scenarios involving creative tool use. These findings suggest that improving the reliability of automatic experimental recording with VLMs requires mechanisms for automated verification of generated outputs, as well as recording protocols that reduce the influence of decoy objects.