Open Access Article
Di Zhang,*a Xue Jia,a Hung Ba Tran,a Seong Hoon Jang,a Linda Zhang,ab Ryuhei Sato,c Yusuke Hashimoto,b Toyoto Sato,e Kiyoe Konno,d Shin-ichi Orimo*ae and Hao Li*a
a Advanced Institute for Materials Research (WPI-AIMR), Tohoku University, Sendai 980-8577, Japan. E-mail: di.zhang.a8@tohoku.ac.jp; shin-ichi.orimo.a6@tohoku.ac.jp; li.hao.b8@tohoku.ac.jp
b Frontier Research Institute for Interdisciplinary Sciences (FRIS), Tohoku University, Sendai 980-8577, Japan
c Department of Materials Engineering, The University of Tokyo, Tokyo 113-8656, Japan
d Institute of Fluid Science, Tohoku University, Sendai 980-8577, Japan
e Institute for Materials Research, Tohoku University, Sendai 980-8577, Japan
First published on 3rd February 2026
Despite the surge of AI in energy materials research, fully autonomous workflows that connect high-precision experimental knowledge to the discovery of credible new energy-related materials remain at an early stage. Here, we develop the Descriptive Interpretation of Visual Expression (DIVE) multi-agent workflow, which systematically reads and organizes experimental data from graphical elements in scientific literature. Applied to solid-state hydrogen storage materials, a class of materials central to future clean-energy technologies, DIVE markedly improves the accuracy and coverage of data extraction compared to the direct extraction method, with gains of 10–15% over commercial models and over 30% relative to open-source models. Building on a curated database of over 30 000 entries from >4000 publications, we establish a rapid inverse-design AI workflow capable of proposing new materials within minutes. This transferable, end-to-end paradigm illustrates how multimodal AI agents can convert literature-embedded scientific knowledge into actionable innovation, offering a scalable pathway for accelerated discovery across chemistry and materials science.
The recent surge in LLM applications has greatly enhanced the prospects for automated data mining and reasoning in materials science. Leveraging advanced LLMs, several studies have explored automated extraction of materials data from scientific literature using prompt engineering and conversational interfaces.11–13 Despite these advances, existing strategies still suffer from limitations in completeness, depth, and precision—especially when extracting key quantitative information from graphical elements, which often encode critical materials properties. Current state-of-the-art multimodal models, while powerful, often require multiple rounds of prompt-based querying and validation, resulting in significant computational cost and inefficient use of token resources. There remains a lack of systematic workflows for one-shot, high-throughput extraction and for rigorous, quantitative benchmarking against human-curated data. Moreover, there is no widely adopted workflow for rapidly constructing collaborative, multi-agent materials design systems based on newly mined datasets.
Recent work has increasingly adopted multi-step, multimodal pipelines to mine scientific literature beyond plain text by explicitly incorporating figures, tables, and cross-modality constraints. For example, OpenChemIE provides an information-extraction toolkit for chemistry literature that integrates multimodal extraction components.14 In electrosynthesis, MERMES demonstrates an end-to-end multimodal workflow that leverages multimodal LLMs to parse reaction diagrams and resolve cross-modality dependencies from publications.15 In materials/reticular chemistry, Zheng et al. show that GPT-4V can be used to categorize and mine diverse graphical sources (e.g., isotherms and diffraction patterns) at scale.16 More broadly, MERMaid proposes sequential modules for figure/table segmentation and multimodal analysis to convert PDF-embedded chemical information into machine-actionable representations.17 Building on these advances, we develop the Descriptive Interpretation of Visual Expression (DIVE) workflow, a multi-agent workflow designed for high-throughput extraction of quantitative, figure-centric materials data (e.g., PCT/TPD/discharge curves), coupled with an embedding-based evaluation protocol and a downstream inverse-design agent for hydrogen storage materials discovery. Although conceptually simple, the DIVE pipeline achieves significant gains over current open-source and commercial models, as confirmed by rigorous manual validation and scoring. Afterward, we apply DIVE to the domain of solid-state hydrogen storage materials (HSMs)—a field critical for the future of sustainable, carbon-neutral energy.18 Hydrogen's high gravimetric energy density and environmentally benign combustion make it an ideal candidate for large-scale energy storage;19 yet practical deployment hinges on the development of compact, safe, and cost-effective storage technologies. 
Solid-state HSMs, including interstitial hydrides, complex borohydrides, ionic compounds, porous frameworks, and emergent high-entropy and superhydride phases, offer a promising path forward. Despite decades of research, however, no comprehensive, structured experimental database for hydrogen storage materials currently exists.
In this work, we systematically mine over 4000 primary publications on solid-state HSMs, spanning the period from 1972 to 2025, using the DIVE workflow and optimized prompt engineering. Relative to leading commercial multimodal models and open-source models, DIVE achieves improvements of 10–15% and 30%, respectively, in accuracy and data completeness. The resulting database comprises more than 30 000 entries, which we leverage to construct a materials design agent (DigHyd) using GPTs. This agent supports natural language interaction with the HSM database and, more importantly, incorporates a machine-learning-based verifier trained on the extracted data. By integrating LLM-driven reasoning and iterative validation, we realize a streamlined materials design workflow capable of proposing novel hydrogen storage candidates that meet user-defined criteria within minutes (SI Video 1–3). Overall, this work delivers an efficient, scalable framework for AI-driven materials research and offers a transferable methodology for rapid database construction and inverse design in diverse materials domains.
Equally important to the multi-agent workflow is the development of effective evaluation methods. To the best of our knowledge, there is currently no well-established method for evaluating the accuracy and completeness of data extraction from articles using LLMs. To save tokens, it is common to extract multiple entries in one call, making the JSON dictionary list format particularly suitable for outputs. However, how to efficiently and reasonably compare human-extracted and AI-extracted JSONs and assign meaningful scores remains underexplored. This is particularly challenging in materials property extraction, where extraction quality cannot be judged simply as true or false, because the magnitude of numerical differences should also be considered. To address this, as shown in Fig. 1c, we propose using an embedding model to match entries between the human and AI-extracted JSONs. After matching, the units of numerical values are standardized, and the relative errors are calculated using mathematical functions to provide nuanced scoring. We divide the final score into accuracy and completeness (each normalized to 50 points, for a total of 100). In this way, hallucinated entries are not explicitly filtered at the completeness stage; instead, hallucinations are penalized during accuracy evaluation. Specifically, each AI-extracted entry is forcibly matched to the most similar ground-truth entry using an embedding-based alignment. As a result, hallucinated or severely inaccurate entries receive low per-item scores and are systematically penalized in the final accuracy metric. We use a 10% relative-error tolerance as a pragmatic choice for figure-derived values. This level is sufficiently strict to penalize clear misreads, while remaining compatible with the current limitations of multimodal models in precise visual digitization (e.g., axis tick resolution, curve overlap, and image quality). 
This method allows for a more scientific and rapid evaluation of LLM data extraction performance and can also serve as a reward function for reinforcement learning to further fine-tune or train LLMs. The detailed evaluation functions, as well as the code for the DIVE workflow, are available in the GitHub repository in the Data and code availability section provided with this article. A representative example comparing ground-truth annotations and AI-extracted structured data is shown in SI, Table S2, further illustrating completeness and accuracy scores. To ensure the high reliability and scientific value of HSM data, the DigHyd Data Checking System (curvechecking.dighyd.org; refer to the SI for details) has been developed as an efficient online platform for manual review and correction of AI-extracted data. The diversity of the test set can be found in Fig. S16.
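The matching-and-scoring idea described above can be sketched roughly as follows. This is a minimal illustration and not the evaluation code from the repository: the `embed` function is a toy character-trigram stand-in for a real embedding model, the entry fields (`label`, `value`, `unit`), the unit table, and the exact shape of the per-item score (full credit within the 10% tolerance, then a linear decay to zero at 100% relative error) are all assumptions made for the example. The example entries are made up.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: character-trigram counts (assumption)."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical unit table: unit -> (dimension, converter to a standard unit).
UNITS = {
    "wt%": ("capacity", lambda v: v),
    "K": ("temperature", lambda v: v),
    "C": ("temperature", lambda v: v + 273.15),
}

def item_score(pred, truth, tol=0.10):
    """Score one matched pair in [0, 1] from the relative error after unit standardization."""
    p_dim, p_conv = UNITS[pred["unit"]]
    t_dim, t_conv = UNITS[truth["unit"]]
    if p_dim != t_dim:
        return 0.0  # incomparable quantities get no credit
    p, t = p_conv(pred["value"]), t_conv(truth["value"])
    err = abs(p - t) / abs(t) if t else float(p != t)
    if err <= tol:
        return 1.0
    return max(0.0, 1.0 - (err - tol) / (1.0 - tol))

def score(ai_entries, gt_entries, tol=0.10):
    """Return (accuracy, completeness), each on a 0-50 scale."""
    gt_vecs = [embed(g["label"]) for g in gt_entries]
    matched, item_scores = set(), []
    for a in ai_entries:
        sims = [cosine(embed(a["label"]), v) for v in gt_vecs]
        best = max(range(len(gt_entries)), key=sims.__getitem__)  # forced match
        matched.add(best)
        item_scores.append(item_score(a, gt_entries[best], tol))
    accuracy = 50.0 * sum(item_scores) / len(item_scores) if item_scores else 0.0
    completeness = 50.0 * len(matched) / len(gt_entries) if gt_entries else 0.0
    return accuracy, completeness

# Made-up entries for illustration only.
gt_entries = [
    {"label": "MgH2 dehydrogenation temperature", "value": 573.15, "unit": "K"},
    {"label": "LaNi5 gravimetric hydrogen density", "value": 1.4, "unit": "wt%"},
]
ai_entries = [
    {"label": "MgH2 dehydrogenation temperature", "value": 300.0, "unit": "C"},
    {"label": "LaNi5 gravimetric hydrogen density", "value": 2.8, "unit": "wt%"},
]
accuracy, completeness = score(ai_entries, gt_entries)
print(accuracy, completeness)
```

In this toy case both ground-truth entries are covered, so completeness is full, while the second value is off by a factor of two and drags accuracy down. A hallucinated entry would behave the same way: it is still matched to its nearest ground-truth neighbour and simply earns a low per-item score.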
Based on our developed DIVE workflow and the associated scoring algorithm for materials literature data extraction, we systematically evaluated several state-of-the-art commercial and open-source large language models. The score distributions for data extracted by different combinations of multimodal models and LLMs in the DIVE workflow are benchmarked against a dataset consisting of results manually curated from 100 published articles on experimental HSM reports. Fig. 2a presents the data extraction scores for the conventional direct extraction approach and under the DIVE workflow (Fig. 2b and c). Gemini-2.5-Flash,20 currently Google's best model in terms of price-performance, achieved a total score of 77.89 when used for direct extraction. However, when combined in a multi-stage, multi-agent DIVE workflow (Gemini-2.5-Flash20 + DeepSeek R1 (ref. 21)), the total score increased to 87.21 (Fig. 2b), representing an improvement of nearly 12%. To further demonstrate the effectiveness of the DIVE workflow on models with even better token efficiency, we also tested DeepSeek-Qwen3-8B.21 Despite having only 8B parameters, the model also showed about a 10% improvement compared to Gemini-2.5-Flash in the direct extraction scenario. In addition, we systematically assessed the data extraction accuracy across different combinations of mainstream commercial and open-source multimodal and text extraction models (all detailed results can be found in the SI). As shown in Fig. 2d, for the direct extraction workflow, most commercial models achieved a total score of around 75, whereas open-source models scored noticeably lower. When the multi-stage, multi-agent DIVE workflow is applied—particularly with DeepSeek R1 as the post-descriptive embedding LLM—commercial models saw typical improvements of 10–15%, and open-source models improved by 15–30%. The highest score was achieved with the combination of Gemini 2.5 Flash and DeepSeek R1. 
However, DeepSeek R1 is a large inference model with 685B parameters, making it relatively costly and slow. The same memory budget that supports one DeepSeek-R1-class deployment can typically support dozens of concurrent DeepSeek-Qwen3-8B instances, enabling substantially higher throughput for large-scale processing. Therefore, we further tested DeepSeek V3 and DeepSeek-Qwen3-8B as post-embedding LLMs. Surprisingly, despite its much smaller size (8B parameters), DeepSeek-Qwen3-8B achieved a total score of 84.6, second only to the Gemini 2.5 Flash + DeepSeek R1 combination, but with much faster inference speed and significantly lower computational cost.
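As a quick check of the percentages quoted above, the relative gains can be recomputed from the reported total scores. The snippet below is only a worked-arithmetic sketch; `relative_gain` is a helper name introduced here, and the parameter-count ratio is a crude proxy for the memory comparison, not a measured figure.

```python
def relative_gain(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

direct = 77.89       # Gemini-2.5-Flash, direct extraction
dive_r1 = 87.21      # DIVE: Gemini-2.5-Flash + DeepSeek R1
dive_qwen8b = 84.6   # DIVE: Gemini-2.5-Flash + DeepSeek-Qwen3-8B

print(f"{relative_gain(direct, dive_r1):.1f}%")      # → 12.0%
print(f"{relative_gain(direct, dive_qwen8b):.1f}%")  # → 8.6%

# Rough parameter-count ratio between a 685B and an 8B model.
print(round(685 / 8))  # → 86
```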
Fig. 2 Performance improvement of the DIVE data extraction workflow. (a) Conventional extraction workflow using Gemini 2.5 Flash.20 (b) DIVE workflow integrating Gemini 2.5 Flash with DeepSeek R1. (c) DIVE workflow integrating Gemini 2.5 Flash with DeepSeek Qwen3 8B. Dotted vertical lines indicate the mean score of each corresponding score distribution. (d) Benchmark comparison across seven multimodal models, including four proprietary models (Gemini 2.5 Flash,20 Claude 4 Sonnet, OpenAI o4 mini, and Gemini 2.0 Flash) and three open-source models (LLaMA-4-Scout, LLaMA-4-Maverick, and Qwen2.5-VL-72B-Instruct22). Ideally, the proposed DIVE workflow achieves a 10–15% improvement in extraction performance compared to state-of-the-art commercial models, and an over 30% improvement over leading open-source models.
Based on the above benchmarking, we ultimately selected the combination of Gemini 2.5 Flash and DeepSeek-Qwen3-8B for data extraction across 4053 publications. The screening strategy for selecting article DOIs is described in the SI. The processed data have been made publicly available in our Digital Hydrogen Platform (DigHyd: https://www.dighyd.org/). Fig. 3 provides an overview of data mining results from over 4000 hydrogen storage materials publications. As shown in Fig. 3a, aside from the years before 2010, the number of experimental publications on hydrogen storage materials has steadily increased, with 150–200 papers published annually since 2011 (except for 2021 and 2022, likely due to the global COVID-19 pandemic).
Fig. 3b shows the distribution of gravimetric hydrogen densities for different types of hydrogen storage materials. Porous carbon materials generally exhibit very low hydrogen storage capacities at room temperature. At low temperatures (e.g., 77 K) and moderate pressures (e.g., below 100 bar), their hydrogen uptake is typically in the range of 0–1 wt%. One of the main advantages of these materials lies in their extremely fast adsorption and desorption kinetics. Therefore, in the hydrogen storage range of 0–1 wt%, porous materials are the primary candidates.23 The region with the highest concentration is between 1 and 2 wt%, which mainly corresponds to interstitial hydrides, the most widely studied class of hydrogen storage materials. In contrast, ionic, complex, and multi-component hydrides primarily fall in the 4–8 wt% range. By analyzing the extracted formula fields in the DIVE-generated data dictionaries, we can examine the elemental distribution in hydrogen storage materials across different gravimetric density ranges. The most frequent elements in the 0–4 wt%, 4–8 wt%, and 8–12 wt% intervals are Ni, Mg, and Li, respectively, reflecting a general shift in hydrogen storage materials from interstitial hydrides (represented by LaNi5,24,25 Ti–Mn alloys,26 or high-entropy alloys27) to ionic hydrides (MgH2) and complex hydrides (LiBH4 (ref. 28) or Mg(BH4)2 (ref. 29)). Fig. 3c and d show the proportion of different types of materials in the DigHyd platform. Interstitial hydrides account for the largest share, but we also include a small number of superhydrides. Although superhydrides are mainly reported for superconducting applications,30 they are emerging as a new research hotspot for hydrogen storage under ultra-high pressure conditions. Fig. 3d further illustrates the subtypes of interstitial hydrides.
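The element-frequency analysis over the extracted formula fields could look something like the sketch below. It is an illustration under stated assumptions, not the authors' analysis script: the field names `formula` and `wt_percent` are hypothetical, hydrogen is excluded from the counts by choice, and the toy entries are placeholders rather than database records.

```python
import re
from collections import Counter, defaultdict

BINS = [(0, 4), (4, 8), (8, 12)]  # wt% intervals discussed in the text

def elements(formula):
    """Return the set of element symbols appearing in a formula string."""
    return set(re.findall(r"[A-Z][a-z]?", formula))

def bin_label(wt):
    """Map a gravimetric density to its interval label, or None if out of range."""
    for lo, hi in BINS:
        if lo <= wt < hi:
            return f"{lo}-{hi} wt%"
    return None

def element_counts(entries):
    """Count, per wt% interval, how many entries contain each element (H excluded)."""
    counts = defaultdict(Counter)
    for entry in entries:
        label = bin_label(entry["wt_percent"])
        if label is not None:
            counts[label].update(elements(entry["formula"]) - {"H"})
    return counts

# Toy entries with placeholder values, for illustration only.
toy_entries = [
    {"formula": "LaNi5", "wt_percent": 1.4},
    {"formula": "LaNi4.5Al0.5", "wt_percent": 1.3},
    {"formula": "Mg2Ni", "wt_percent": 3.6},
    {"formula": "MgH2", "wt_percent": 7.0},
    {"formula": "Mg2FeH6", "wt_percent": 5.0},
    {"formula": "LiBH4", "wt_percent": 11.0},
    {"formula": "LiNH2", "wt_percent": 8.5},
]

counts = element_counts(toy_entries)
for label, counter in counts.items():
    print(label, counter.most_common(1)[0][0])
```

Counting each element once per entry, rather than weighting by stoichiometry, keeps the statistic about how often an element is used and not about how much of it is present.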
After constructing the DigHyd database, direct data mining enables the extraction of valuable insights for materials design. Fig. 4 illustrates the top five most frequently added elements to typical hydrogen storage materials—LaNi5, MgH2, and LiBH4—and the distribution of key performance metrics for materials modified with these elements. For LaNi5, magnesium is the most commonly used dopant. After Mg is added, the gravimetric hydrogen density of LaNi5-based materials can reach 4–6 wt% (Fig. 4b). However, the introduction of Mg also affects the hydrogen absorption and desorption pressures. In the case of MgH2, nickel is the most frequent additive.31 While doping MgH2 with Ni tends to improve its hydrogen storage density (Fig. 4e), the dehydrogenation temperature of Mg–Ni systems can reach around 600 K. For LiBH4-based systems, the gravimetric hydrogen density spans the widest range (0–14 wt%). Notably, introducing carbon or nitrogen can boost the hydrogen density of LiBH4 materials to ∼14 wt%, likely due to the catalytic effects of graphene or N-doped graphene on LiBH4 (ref. 32 and 33) dehydrogenation. However, despite this high hydrogen density potential, the dehydrogenation temperature of LiBH4 systems also tends to be relatively high, often requiring 700–800 K for complete hydrogen release. All the visualizations shown in Fig. 3 and 4 can be directly accessed and interacted with via our AI agent using natural language (SI Video 4).
Despite decades of research, most HSMs still fall short of the U.S. Department of Energy (DOE) 2030 technical targets for onboard hydrogen storage systems: >5.5 wt% system-level hydrogen capacity, >40 g H2 per L volumetric density, operational capability from −40 to 85 °C, and cycling durability exceeding 1500 charge–discharge cycles.34 Current benchmark materials exemplify these limitations. MgH2, for instance, boasts a high theoretical gravimetric capacity (7.6 wt%) but requires temperatures above 300 °C for hydrogen release due to slow desorption kinetics.35 Complex hydrides such as LiBH4 and NaAlH4 can achieve moderate hydrogen densities but often require high temperatures or catalytic activation, or suffer from poor reversibility.36 Porous frameworks (MOFs/COFs), while tunable and lightweight, rely primarily on weak physisorption and struggle to meet practical storage densities.37 High-entropy alloys38 and superhydrides, though scientifically intriguing, demand extreme synthesis or operating conditions (high pressures or cryogenic temperatures),38,39 hindering their deployment in commercial systems.
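The theoretical gravimetric capacity quoted for MgH2 follows directly from the formula: it is the mass of hydrogen divided by the total formula mass. The short sketch below reproduces that arithmetic for the hydrides named above. The helper names are introduced here for illustration, the parser handles only flat formulas without parentheses, and the atomic masses are standard rounded values. These are material-level theoretical numbers and are not directly comparable to the system-level DOE target.

```python
import re

# Standard atomic masses (g/mol), rounded.
ATOMIC_MASS = {"H": 1.008, "Li": 6.94, "B": 10.81, "Na": 22.990, "Mg": 24.305, "Al": 26.982}

def composition(formula):
    """Parse a flat formula such as 'MgH2' into {element: count} (no parentheses)."""
    comp = {}
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        comp[symbol] = comp.get(symbol, 0.0) + (float(count) if count else 1.0)
    return comp

def hydrogen_wt_percent(formula):
    """Theoretical gravimetric hydrogen content: 100 * m(H) / m(formula unit)."""
    comp = composition(formula)
    total = sum(ATOMIC_MASS[el] * n for el, n in comp.items())
    return 100.0 * ATOMIC_MASS["H"] * comp.get("H", 0.0) / total

for formula in ("MgH2", "LiBH4", "NaAlH4"):
    print(formula, f"{hydrogen_wt_percent(formula):.2f} wt%")
# MgH2 7.66 wt%
# LiBH4 18.51 wt%
# NaAlH4 7.47 wt%
```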
The chemical diversity and complexity of hydrogen storage materials, ranging from AB2, AB3, and AB5 interstitial hydrides40 to Mg-, Ti-, and V-based alloys, complex hydrides, and rare-earth-enriched compounds, make the search for optimal candidates challenging. Existing efforts to accelerate hydrogen storage material discovery are fragmented. Conventional computational databases primarily focus on crystalline structures and predicted thermodynamic properties, lacking integration with experimentally validated performance data. The absence of a comprehensive, machine-readable platform41 that integrates both experimental and theoretical information has hindered the rational design and rapid screening of HSMs.
In this work, integrating the database, machine learning models trained on it, and LLMs makes it straightforward to construct materials-focused AI agents using simple instructions and schema interface functions (for more related details, refer to SI, Fig. S9–11). To initially assess the reliability of the AI agent's predictions, we did not require DigHyd to design entirely new materials. Instead, we focused on cases where comparable materials already exist in the database, allowing for direct validation (Fig. S12 and SI Video 1). Under these conditions, the DigHyd agent proposed compositions such as Mg2Ni0.8Co0.2, Mg2Fe0.8Co0.2, and La0.8Mg0.2Ni5. Among these, Mg2Fe0.8Co0.2 was predicted to exhibit a hydrogen storage capacity of 4.06 wt%. Importantly, analogous alloys already reported in the database, such as Mg2FeH6 and Mg2Fe1−xCoxH6, display capacities in the range of 4.5–5.5 wt%,42,43 thereby supporting the consistency of the predictions.
Next, to verify that DigHyd can indeed design entirely new materials (Fig. 5 and SI Video 2), we applied the same prompting strategy but with explicit instructions to generate compositions never previously reported. Under these conditions, DigHyd demonstrated an iterative design–prediction–optimization capability, as illustrated in Fig. 5. In this workflow, researchers can guide the AI agent to propose novel materials by specifying the material class, potential elements, and target properties such as gravimetric hydrogen density, pressure, and temperature (Fig. 5a).
In the first round, leveraging the local knowledge base as well as the analytical, reasoning, and predictive capabilities of large language models, the DigHyd agent proposed CaMgFe2 (Fig. 5b). This candidate was then evaluated using our machine learning model (see Methods: machine learning methods for model details, hyperparameters, and code), which predicts hydrogen density directly from the material formula. With an R2 value of 0.87, the model provides a reliable first-pass screening for LLM-proposed candidates (Fig. 5c). CaMgFe2 was predicted to store 2.64 wt% hydrogen (Fig. 5d). The agent subsequently suggested increasing the Mg content, resulting in Mg2Fe with a predicted capacity of 4.13 wt%. However, literature reports indicated that this compound exhibits hydrogenation/dehydrogenation only at elevated temperatures (300–400 °C), failing to meet the design targets. In response, DigHyd refined the composition to Mg2Fe0.75Co0.25, and further to Mg2Fe0.6Co0.2Mn0.2. The latter was predicted to achieve 4.19 wt% hydrogen storage capacity, with Mn (or alternatively Al) contributing to hydride stabilization and plateau pressure optimization. Importantly, this final composition has never been reported in the current database. Taken together, these results in Fig. 5d highlight the ability of the DigHyd agent to rapidly design, predict, and iteratively refine candidate materials in line with researcher-defined goals within minutes. If such AI-driven agents are directly integrated with high-throughput experimental platforms, the efficiency of materials discovery and development could be advanced to an unprecedented level.
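The iterative design–prediction–optimization cycle described above can be sketched as a simple loop. This is a schematic illustration only, not the DigHyd implementation: in the real agent the proposal step is an LLM call and the prediction step is the trained regression model, whereas here both are scripted stand-ins that replay the candidates and predicted capacities quoted in the text. The function names, the feasibility check, and the 4 wt% target are assumptions introduced for the example.

```python
def design_loop(propose, predict, is_feasible, target_wt, max_rounds=5):
    """Propose -> predict -> check loop; returns the accepted formula and the trace."""
    history, feedback = [], None
    for _ in range(max_rounds):
        candidate = propose(feedback)
        if candidate is None:  # proposer has nothing more to offer
            break
        predicted = predict(candidate)
        ok = predicted >= target_wt and is_feasible(candidate)
        history.append((candidate, predicted, ok))
        if ok:
            return candidate, history
        feedback = (candidate, predicted)  # passed back to guide the next proposal
    return None, history

# Scripted stand-ins replaying candidates and predictions quoted in the text.
SCRIPT = iter(["CaMgFe2", "Mg2Fe", "Mg2Fe0.6Co0.2Mn0.2"])
PREDICTED_WT = {"CaMgFe2": 2.64, "Mg2Fe": 4.13, "Mg2Fe0.6Co0.2Mn0.2": 4.19}
TOO_HOT = {"Mg2Fe"}  # reported to cycle only at 300-400 °C

best, trace = design_loop(
    propose=lambda feedback: next(SCRIPT, None),
    predict=PREDICTED_WT.__getitem__,
    is_feasible=lambda formula: formula not in TOO_HOT,
    target_wt=4.0,  # assumed target for this illustration
)
print(best)
for step in trace:
    print(step)
```

Keeping the proposer, the predictor, and the feasibility check as separate callables mirrors the division of labour in the text: the LLM suggests, the regression model screens, and literature knowledge vetoes candidates that miss the operating-condition targets.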
To further increase the design difficulty, in the third case study (Fig. S13 and SI Video 3), we constrained the element space for material design (A = Mg or Ca, B = Ni). Leveraging the local knowledge base together with the analytical, reasoning, and predictive capabilities of LLMs, the DigHyd agent proposed 8 candidate materials. Among these, one candidate exceeded the initial target of 4 wt% hydrogen capacity, while three achieved predicted performances above 3 wt%. The remaining candidates showed comparatively lower hydrogen densities. Based on these initial predictions, the DigHyd agent further optimized the proposed compositions by suggesting minor La and Y doping to enhance hydride phase stability and to reduce the hydrogenation/dehydrogenation temperature and pressure. The final designs, Mg2Ni2.9La0.1 and Mg2NiY0.1, are derived from the Mg2Ni system, a well-established intermetallic compound for hydrogen storage.44 The introduction of a small amount of La or Y by partially substituting Ni is a common strategy to optimize hydrogen storage properties. The substitution ratio (3.3% for La45 or Y46) is appropriate because it is sufficient to significantly influence the microstructure and hydrogen storage behavior without destroying the main phase structure. The addition of La or Y can promote grain refinement and introduce defects, which facilitate hydrogen diffusion, improve absorption/desorption kinetics, and may lower the hydrogenation/dehydrogenation temperature. Moreover, the larger atomic radii of La and Y compared to Ni lead to lattice expansion, thus reducing the activation energy for hydrogen diffusion.46 Therefore, the proposed compositions are also rational for hydrogen storage materials, as supported by both theoretical understanding and experimental data from the literature. In fact, our database did not include this very recent paper (ref. 46), which investigates the Mg–Y–Ni system, at the time of writing. The findings presented in that work further demonstrate the reliability of the predictions made by our developed agent.
435 entries. Across seven multimodal models, DIVE consistently outperforms direct extraction, with typical gains of 10–15% over commercial models and 15–30% over open-source models under the same benchmark. Building on this resource, we implemented the DigHyd agent, integrating natural-language querying with a machine-learning verifier to rapidly propose and refine candidate materials under user-defined constraints. Current limitations still include hallucinated fields, visual reading noise, and multi-plateau interpretation errors. Future work will focus on improving robustness to these failure modes and extending coverage to more complex figure types and long-range context, enabling more reliable literature-to-design pipelines for accelerated materials discovery.
435 unique entries, each corresponding to a distinct material or experimental condition. Users can interactively filter data, visualize results, and explore specific material properties or test conditions. We have also deployed the AI agent developed based on DIVE on the website. In addition, the DigHyd database is updated daily with newly published literature related to HSMs. The platform also provides direct access to the DigHyd agent and integrated machine learning regression models for data analysis and materials prediction.
:
20%. Model training was performed using an XGBoost regressor. Hyperparameter optimization was conducted via GridSearchCV (with 3-fold cross-validation, scoring by negative mean squared error and parallel computation) to select the best model configuration. Model performance was evaluated using standard regression metrics. All code and scripts are available in our GitHub repository (https://github.com/gtex-hydrogen-storage/DIVE).
This journal is © The Royal Society of Chemistry 2026