Open Access Article
Di Zhang, Xue Jia, Yuhang Wang, Heng Liu, Qian Wang, Seong-Hoon Jang, Daksh Shah, Songbo Ye, Hung Ba Tran and Hao Li*
Advanced Institute for Materials Research (WPI-AIMR), Tohoku University, Sendai 980-8577, Japan. E-mail: li.hao.b8@tohoku.ac.jp
First published on 23rd February 2026
The concept of a digital materials ecosystem represents a new paradigm in materials research, where data, theory, and automation are integrated into a unified and iterative framework. By combining reliable databases, physical frameworks, and intelligent data analysis, materials discovery is evolving from empirical exploration toward a systematic and predictive science. The rapid growth of data and artificial intelligence (AI) has enabled the identification of complex structure–property relationships, while advances in automated synthesis and high-throughput characterization are closing the loop between prediction and validation. Looking forward, the field must focus on building trustworthy and benchmarked datasets, developing interpretable and high-precision models, and designing AI tools that embody human scientific reasoning. Equally important is ensuring standardization and consistency between digital inputs and experimental responses. Together, these efforts will transform materials discovery from data accumulation into genuine knowledge generation, paving the way for an autonomous and self-improving research ecosystem that accelerates both fundamental understanding and technological innovation.
At the heart of this ecosystem are materials databases, which serve as the backbone for aggregating experimental and theoretical data.2 These large-scale databases allow for the efficient retrieval, analysis, and reuse of information across diverse materials systems, providing the foundation for subsequent data-driven research. As materials databases grow in size and complexity, they enable deeper insights into structure–property relationships, fostering a more systematic and predictive approach to material design.
In parallel, machine learning (ML) and AI-driven modeling are advancing the capabilities of materials science.3,4 These technologies enhance predictive accuracy by learning from historical data, identifying complex patterns that may be difficult for traditional methods to uncover. AI models can now predict material properties, suggest new material candidates, and even guide experimental design, all of which are pivotal in reducing the time and cost involved in material development. Moreover, the integration of automated synthesis and high-throughput characterization techniques has led to the development of a closed feedback loop in materials research.5,6 In this loop, predictions made by AI models are validated through experiments, and new data generated from experiments are fed back into the system, continuously refining the models. This self-evolving cycle fosters a more efficient and dynamic approach to materials discovery, where the pace of innovation is accelerated, and previously unattainable breakthroughs are within reach.
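The closed feedback loop described above can be illustrated with a deliberately small sketch. Everything in it is hypothetical: `hidden_property` stands in for a real synthesis and characterization step, and the "surrogate" is a crude nearest-neighbour guess with a distance bonus rather than a trained AI model. It only shows the control flow of predict, validate, and feed back.

```python
def run_closed_loop(candidates, measure, seed_points, n_rounds):
    """Alternate between surrogate-guided selection and (simulated) experiments."""
    # Initial "experiments" on the seed points.
    data = {x: measure(x) for x in seed_points}
    for _ in range(n_rounds):
        untested = [x for x in candidates if x not in data]
        if not untested:
            break

        def acquisition(x):
            # Crude surrogate: reuse the value of the nearest measured point and
            # add a distance bonus that rewards unexplored regions.
            nearest = min(data, key=lambda m: abs(m - x))
            return data[nearest] + abs(nearest - x)

        chosen = max(untested, key=acquisition)  # prediction step
        data[chosen] = measure(chosen)           # validation step, fed back into data
    return data


def hidden_property(x):
    # Hypothetical stand-in for a real measurement; its optimum lies at x = 0.7.
    return -(x - 0.7) ** 2


candidates = [i / 10 for i in range(11)]
history = run_closed_loop(candidates, hidden_property, seed_points=[0.0, 1.0], n_rounds=5)
best_x = max(history, key=history.get)
print(f"measured {len(history)} candidates; best so far: x = {best_x}")
```

In a real workflow the acquisition function would come from a calibrated model with uncertainty estimates, and `measure` would dispatch to automated synthesis and characterization.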
The digital materials ecosystem is not limited to specific material classes. It spans a wide range of domains, from solid-state batteries and catalysis to hydrogen storage and beyond. Each of these material systems contributes uniquely to the broader ecosystem, where AI, automation, and data-driven methodologies are applied to solve complex problems. By drawing on examples from various material classes, this perspective highlights the versatility and wide applicability of the digital materials ecosystem, emphasizing its potential to revolutionize materials research across diverse application areas.
000 MOFs, such as specific surface area, pore volume, pore size and limiting pore diameter, pore size distribution, connectivity dimensions, and density. The goal is to provide reproducible geometric characterization methods, improve consistency and comparability across algorithms, and enable structure–property relationship analysis and high-throughput screening using visualization and principal component analysis tools, thereby accelerating the discovery of new materials.
The strength of small-domain databases lies in their high specificity and focus, enabling the extraction of unique patterns within a given field and offering precise support for targeted research. However, their limitations include data fragmentation and a lack of compatibility and generalizability, making them less suitable for cross-disciplinary studies and limiting their broader applicability.
000 entries of inorganic crystal structures.13 The database includes chemical formulae, unit cell parameters, space groups, atomic coordinates, site occupancies, thermal parameters, and bibliographic information. It also provides a Windows graphical interface with search and visualization tools, offering reliable structural data for materials research, phase identification, and structure–property relationship analysis. In 2020, Huang and Cole et al. created a battery materials database by extracting experimental data from 229 000 research papers using ChemDataExtractor.25 The database contains 17 354 chemical compounds and 292 313 entries, including capacity, voltage, conductivity, coulombic efficiency, energy, and their respective units and conditions. It provides a graphical user interface along with tools for data cleaning, standardization, and augmentation, enabling large-scale machine-readable data to support battery material design and prediction. In 2022, Ward et al. proposed the “Battery Data Genome (BDG)” database and data hub system, based on experimental multi-source data.12 This framework spans data from materials and electrodes to single cells, modules/packs, and real-world systems, accompanied by complete metadata and standardized protocols. It aims to enable cross-stage data sharing and machine learning (ML) applications through unified standards and interoperable open-source software, accelerating battery material discovery, manufacturing optimization, and lifetime prediction, thereby facilitating efficient translation from research to deployment.
The strength of experimental data-driven databases lies in their high reliability, as they are derived from real experiments, providing critical parameters such as catalytic performance and battery properties. However, their limitations include slow data updates, limited coverage, and high experimental costs, which hinder the rapid generation of large-scale datasets and restrict their applicability in high-throughput screening and large-scale materials design.
000 calculations), along with tools for error correction, deduplication, and complex salt coordination generation. It facilitates large-scale screening and data-driven electrolyte molecular design.
The same year, Kirklin et al. developed the Open Quantum Materials Database (OQMD), conducting ∼300 000 DFT calculations on ICSD structures and common prototypes.20 The database includes crystal structures, total energies, formation energies, and chemical potential corrections, validated through large-scale comparisons with experiments (MAE ≈ 0.096 eV per atom). It enables thermodynamic stability assessments and predicts ∼3200 potential new compounds, advancing materials discovery and design. In 2019, Winther et al. launched the Catalysis-Hub database, which aggregates over 100 000 adsorption/reaction energies and activation energies from DFT calculations.23 It includes atomic structures, computational parameters, and APIs/web-based search tools. The platform supports reproducible, machine-readable data sharing and efficient screening, aiding the discovery and modeling of catalyst materials for sustainable energy applications. In 2022, Hu et al. introduced MaterialsAtlas.org, a materials informatics platform and database integrating tools for composition/structure validation (e.g., electroneutrality, Pauling rules, and dynamic stability), property predictions (e.g., bandgap, elasticity, hardness, and thermal conductivity), and hypothetical material generation (e.g., generated compositions and cubic structures).22 The platform enables high-throughput exploration, screening, and visualization of inorganic crystals, significantly improving the efficiency of materials discovery and design. The Novel Materials Discovery (NOMAD) Laboratory provides one of the largest open repositories of computed materials data worldwide,26 built around strict FAIR principles (findable, accessible, interoperable, and reusable). By aggregating and homogenizing raw first-principles calculations from multiple codes and users, NOMAD converts heterogeneous simulation outputs into a consistent, queryable data infrastructure that can be directly used for large-scale screening and data-driven model development. On top of this repository, the NOMAD Artificial Intelligence Toolkit offers workflows for feature extraction, dimensionality reduction, clustering, and supervised ML, enabling researchers to discover hidden patterns in high-dimensional materials spaces and to derive interpretable structure–property relationships. Other high-throughput frameworks such as AFLOW27 further exemplify this transition from individual calculations to curated digital infrastructures.
AFLOW provides an automated pipeline for generating, standardizing, and storing large numbers of first-principles calculations, together with symmetry analysis and a rich catalogue of derived materials properties accessible through the AFLOWLIB repository and programmatic APIs. In 2025, Huang et al. developed a comprehensive public single-atom catalyst (SAC) database and combined DFT-derived descriptors with ML models to rapidly screen 4d single-atom catalysts, identifying Rh1B4 and Rh1C2S2 as highly active candidates for NO and Hg0 oxidation.28
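Two quantities recur in the computational databases above: formation energies derived from total energies, and the mean absolute error (MAE) used to compare them with experiment. The sketch below shows the textbook definitions only. It assumes plain elemental reference energies and does not reproduce the fitted chemical potential corrections that OQMD applies; the element labels and all numbers are hypothetical.

```python
def formation_energy_per_atom(total_energy, composition, reference_energies):
    """E_f = (E_total - sum_i n_i * mu_i) / N, with all energies in eV."""
    n_atoms = sum(composition.values())
    reference = sum(n * reference_energies[el] for el, n in composition.items())
    return (total_energy - reference) / n_atoms


def mean_absolute_error(predicted, measured):
    """Average absolute deviation between two equally long sequences."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


# Hypothetical elemental reference energies (eV per atom) and a hypothetical AB compound.
mu = {"A": -3.0, "B": -5.0}
e_f = formation_energy_per_atom(-10.0, {"A": 1, "B": 1}, mu)

# Hypothetical computed vs. measured formation energies (eV per atom).
mae = mean_absolute_error([e_f, -0.5], [-0.9, -0.6])
print(f"E_f = {e_f:.3f} eV/atom, MAE = {mae:.3f} eV/atom")
```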
The advantages of computational databases lie in their ability to rapidly and efficiently generate theoretical predictions, making them suitable for large-scale material screening. They can also predict material properties that are difficult to measure experimentally, providing valuable insights. However, their limitations include potential inaccuracies in model predictions, which may not always align with rigorous experimental data. Consequently, computational data must be strictly experimentally validated before practical applications, limiting the direct applicability of such databases.
Encouragingly, there has been striking progress in the field in recent years. In 2024, Li and co-workers released the Digital Catalysis Platform (DigCat: https://www.digcat.org),29 a catalysis database primarily based on experimental data combined with computational structures (Fig. 2a). DigCat encompasses over 400 000 experimental performance records and over 400 000 structural entries, enabling data visualization, literature tracking, AI-powered Q&A, cloud-based microkinetic simulations, and ML force field training, thereby accelerating catalysis research. In the same period, they launched the dynamic database of solid-state electrolytes (DDSE) and the Digital Battery Platform (DigBat: https://www.digbat.org) for solid-state electrolytes (SSEs) in solid-state batteries (Fig. 2b).30 As of February 2026, DDSE contains data on over 3000 inorganic SSE materials, including ionic conductivities and activation energies measured across a wide temperature range (132.4–1261.6 K), covering diverse cationic and anionic systems. This database supports structure–property exploration and ML-based predictions. In 2026, the same team introduced the Digital Hydrogen Platform (DigHyd: https://www.dighyd.org, Fig. 2c),31 a hydrogen storage materials database that integrates data from over 4000 publications (1972–2025) and more than 30 000 experimental data entries, including pressure–composition–temperature (PCT), temperature-programmed desorption (TPD), and discharge curves. This innovation drives data-driven discovery and significantly advances research in hydrogen storage materials. Furthermore, to accelerate the advancement of the digital materials ecosystem, the team has established a cutting-edge AI-powered digital platform for advanced materials discovery and development, termed the Digital Materials Platform (DigMat, https://www.digmat.org; Fig. 2e). This ecosystem encompasses a series of specialized sub-platforms, including the Digital Spin Materials Platform (DigSpin, Fig. 2d) for quantum spin and correlated materials, the Digital Thermoelectric Platform (DigTEM) for thermoelectric systems, the Digital Superconductivity Platform (DigSuperC) for superconductors, and the Digital Corrosion Platform (DigCorrosion) for materials corrosion analysis and prevention. Other related initiatives include the Digital Ionic Liquids Platform (DigILS), the Digital Polymer Platform (DigPol), the Digital Sensor Platform (DigSen), the Digital CO2 Capture Platform (DigCC) and the Digital MOF Platform (DigMOF). Together, these dynamically updated platforms integrate millions of experimentally measured data points and terabytes of literature-derived information, providing a robust foundation for the future expansion of the digital materials paradigm. Recently, the OCx24 (ref. 32) (Fig. 2f) study also provided high-throughput, AI-oriented experimental–computational datasets for electrocatalysis, connecting adsorption-energy descriptors with industrially relevant hydrogen evolution reaction (HER) and CO2 reduction reaction (CO2RR) performance.
By revealing a data-driven Sabatier volcano and demonstrating transferable predictive capability across diverse material classes, OCx24 highlights how standardized, ML-ready experimental workflows can substantially narrow the gap between computation and practical catalyst discovery.
Fig. 2 Ecosystem of the Digital Materials Platform. (a) Digital Catalysis Platform (DigCat: https://www.digcat.org).29 (b) Digital Battery Platform (DigBat: https://www.digbat.org).30 (c) Digital Hydrogen Platform (DigHyd: https://www.dighyd.org).31 (d) Digital Spin Materials Platform (DigSpin: https://www.digspin.org). (e) Digital Materials Platform (DigMat: https://www.digmat.org). (f) Open Catalyst Project.32 Adapted with permission from: (f) ref. 32 © Copyright under a CC-BY 4.0 License.
To address these challenges, ongoing efforts are being made to develop more robust and comprehensive databases that integrate both experimental and computational data, ensuring consistency and standardization across materials systems. Advances in AI-driven agents are playing a pivotal role in overcoming data fragmentation and inconsistency by automating data extraction, validation, and curation from diverse sources. These agents can standardize experimental protocols, fill gaps in incomplete metadata, and align data from different studies, thus improving the reliability and usability of databases. Additionally, the closed-loop feedback systems ensure continuous refinement of both data and models. These integrated efforts collectively enhance the reproducibility and reliability of AI-driven predictions, advancing the efficiency of materials discovery in energy and catalysis research.
The first example is the surface Pourbaix diagram, initially proposed by Hansen et al.35 in 2008, as a DFT-based phase diagram framework to describe the stability of surface states (i.e., the surface coverage under electrochemical operating conditions) as a function of applied potential and pH. This pioneering idea has since been extended by Liu et al.36 to systematically survey transition metal oxides, carbides, nitrides, and hydroxides (Fig. 3a), demonstrating that the electrochemical operando surface states are often drastically different from the pristine stoichiometric structure, thereby highlighting the necessity of preliminary electrochemical surface state verification. More recently, Liu et al.37 advanced this model into a reversible hydrogen electrode (RHE)-dependent formulation that incorporates electric field corrections and potentials of zero-charge, enabling the accurate prediction of pH-dependent surface coverage (Fig. 3b) and providing a closer bridge between theoretical predictions and experimental observations.
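A minimal sketch of the classical, SHE-scale construction may help make the idea concrete. It assumes the standard computational hydrogen electrode bookkeeping, in which a surface state formed by releasing n proton–electron pairs is stabilized by n·eU and by n·kBT·ln(10)·pH. The reference free energies below are hypothetical placeholders rather than DFT results, and the electric-field and potential-of-zero-charge corrections of the RHE-dependent formulation are deliberately left out.

```python
import math

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def surface_free_energy(dg0, n, u_she, ph, temperature=298.15):
    """Free energy (eV) of a surface state relative to the pristine surface.

    dg0: formation free energy at U = 0 V vs SHE and pH = 0 (eV)
    n:   number of (H+ + e-) pairs released when the state forms
    """
    return dg0 - n * u_she - n * K_B * temperature * math.log(10) * ph


def most_stable_state(states, u_she, ph):
    """Return the name of the state with the lowest free energy at (U, pH)."""
    return min(states, key=lambda name: surface_free_energy(*states[name], u_she, ph))


# Hypothetical surface states: name -> (dg0 in eV, n).
states = {"clean": (0.0, 0), "HO*": (0.8, 1), "O*": (2.0, 2)}

for u in (0.0, 0.9, 1.5):
    print(f"U = {u:.1f} V vs SHE, pH 0 -> {most_stable_state(states, u, 0.0)}")
```

Scanning `u_she` and `ph` over a grid and recording the winner at each point yields the phase boundaries of the diagram.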
Fig. 3 Representative physical models for catalytic materials. (a) 1D surface Pourbaix diagram.36 (b) Classical surface Pourbaix diagram at the standard hydrogen electrode (SHE) scale (left) and the advanced pH-dependent surface Pourbaix diagram at the reversible hydrogen electrode (RHE) scale (right).37 (c) An example of the application of the energy diagram in the oxygen evolution reaction (OER) to describe reaction thermodynamics.38 (d) Transition energy barrier calculated from CI-NEB.46 (e) Scaling relationships for catalytic activity modelling45 and (f) pH-dependent microkinetic modeling to derive a pH-dependent volcano model for ORR,45 which was extended to (g) NO3RR47 and (h) CO2RR.48 Adapted with permission from: (a) ref. 36 © 2023 AIP Publishing, (b) ref. 37 © 2024 The Authors, (c) ref. 38 © 2018 Springer Nature, (d) ref. 46 © 2025 The Authors, (e and f) ref. 45 © 2024 The Authors, (g) ref. 47 © 2025 The Authors and (h) ref. 48 © 2025 The Authors.
Another widely used example is the electrochemical free energy diagram (Fig. 3c),38 developed following Nørskov's seminal computational hydrogen electrode (CHE) model.39 Its simplicity and intuitive mapping of free-energy changes along elementary steps have made it the most applied framework in electrocatalysis. In practice, it is often coupled with kinetic tools such as the nudged elastic band (NEB) method (Fig. 3d), developed by Henkelman et al.,40,41 to capture the activation barriers from the complex potential energy surfaces of atomistic systems. Additionally, scaling relationships (Fig. 3e) are frequently used in the construction of catalytic activity volcano plots to describe the relationships between the adsorption energies of different intermediate species. However, this framework remains fundamentally limited: it is thermodynamics-centered, it becomes computationally expensive when extended to fully capture kinetics-dominated electrocatalytic behavior, and it lacks an explicit treatment of pH effects under realistic RHE conditions.
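As a concrete illustration of how such a diagram is read, the sketch below takes the reaction free energies of the four proton–electron transfer steps of the OER at U = 0 V, shifts each by −eU as the CHE model prescribes, and reports the potential-determining step and the thermodynamic overpotential. The step energies are hypothetical values chosen to sum to 4.92 eV (4 × 1.23 eV); they are not taken from any cited study.

```python
EQUILIBRIUM_POTENTIAL = 1.23  # V vs RHE for the O2/H2O couple


def free_energy_profile(step_energies, potential):
    """Cumulative free energies (eV) along the pathway at an applied potential (V vs RHE)."""
    levels = [0.0]
    for dg in step_energies:
        # Each step transfers one (H+ + e-) pair, so it shifts by -e*U.
        levels.append(levels[-1] + dg - potential)
    return levels


def limiting_step(step_energies):
    """Return (index of potential-determining step, limiting potential, overpotential)."""
    idx = max(range(len(step_energies)), key=step_energies.__getitem__)
    limiting_potential = step_energies[idx]  # numerically equal in V, since e = 1
    return idx, limiting_potential, limiting_potential - EQUILIBRIUM_POTENTIAL


# Hypothetical step free energies at U = 0 V (eV): * -> HO* -> O* -> HOO* -> O2
steps = [1.60, 1.50, 1.30, 0.52]
idx, u_lim, eta = limiting_step(steps)
print(f"potential-determining step: {idx + 1}, overpotential: {eta:.2f} V")
```

The same profile function can be evaluated at the limiting potential to confirm that every step has become downhill or thermoneutral.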
To address these challenges, the third example is pH-dependent microkinetic modeling, which explicitly integrates thermodynamics, kinetics, and the electrochemical environment. This approach was pioneered by Kelly and co-workers,42 who incorporated electric field effects and potential of zero-charge simulations into the CHE framework to rationalize the pH dependence of the oxygen reduction reaction (ORR) on Pt and Au electrodes (Fig. 3f). Li, Nørskov, and colleagues subsequently demonstrated how this methodology explains the intrinsic limitations of transition metal oxides for the ORR in hydrogen fuel cell applications43 and the pH dependence of SACs for electrocatalysis,44–46 and further extended it to the highly complex nitrate reduction reaction (NO3RR,47 Fig. 3g) and CO2RR48 (Fig. 3h), showcasing its general applicability. By coupling potential, pH, and coverage effects into kinetic simulations, this modeling method provides a more realistic description of catalytic activity and selectivity under operando conditions.
Many modeling approaches in catalysis have the potential to make significant contributions to the digital materials ecosystem. Taken together, the three examples discussed above illustrate how representative physical models have evolved, from static surface thermodynamics to dynamic, pH-dependent kinetic frameworks, laying the foundation for next-generation modeling strategies that aim to predict catalytic behavior with both chemical accuracy and broad applicability.
Campos dos Santos et al.50 presented a study that integrates a genetic algorithm (GA)-based global optimization method with ab initio metadynamics (MetaD) simulations to explore the structure–performance relationships of divalent CTCHs (Fig. 4a). This integrated computational strategy allows for the prediction of stable crystal structures and cation diffusion activation energies (Ea) without relying on experimental data (Fig. 4b). By combining GA and MetaD, the study successfully predicted structural information and activation energies that were in excellent agreement with experimental observations (Fig. 4c). This approach not only unveiled the impact of neutral molecules, such as water, on the ionic conductivity of CTCHs but also identified key factors that promote cation diffusion, ultimately providing insights into the design of more efficient SSEs for battery applications.
Fig. 4 Representative physical models for SSEs. (a) Migration pathway (direction: A → B) of the [Mg(H2O)x]2+ hydrocomplex to the next vacant site.50 (b) Potential energy surfaces for the A → B migration in MgB12H12·12H2O.50 (c) Experimental conductivity as a function of temperature.50 (d) Typical cations, anions, and neutral molecules in hydride SSEs.51 (e) The potential energy surface of Mg(BH4)2·2NH3 as captured by MetaD simulations.51 (f) Comparison of experimental activation energy (Ea) with simulated Ea from MetaD simulations for structures with (filled icons) and without (half-filled icons) neutral molecules.51 Adapted with permission from: (a–c) ref. 50 © 2023 American Chemical Society and (d–f) ref. 51 © 2025 John Wiley and Sons.
A notable study by Wang et al.51 introduced an innovative data-driven AI framework that integrates LLMs, ab initio MetaD simulations, and multiple linear regression to explore and predict the migration mechanisms of hydride SSEs (Fig. 4d). The research highlights a novel “two-step” migration model that involves an initial “coordination-unlock” stage, followed by a “paddle-wheel” mechanism. This process was observed in the migration of Mg2+ ions in Mg(BH4)2·2NH3 and Li+ ions in LiBH4·NH3 (Fig. 4d). These findings are significant as they demonstrate how neutral molecules, such as NH3, facilitate ionic migration by disrupting the strong electrostatic interactions that typically hinder divalent ion movement. The presence of these neutral molecules within the SSE lattice results in a substantial decrease in activation energy (Ea), as demonstrated by the close agreement between the experimental Ea and the MetaD-simulated Ea values (Fig. 4f).
The application of advanced physical models integrated with GA and MetaD simulations has proven to be an invaluable approach in the development and optimization of SSEs. These models offer deep insights into the structure–performance relationships of complex materials, particularly those involving divalent ions, by accurately predicting cation diffusion mechanisms and activation energies. The continued evolution of the digital materials ecosystem, powered by key simulation methods, will be critical in the development of advanced solid-state batteries.
S, Se, Te) compounds were identified and validated by theoretical calculations. Moreover, classification models can also be utilized to categorize TE materials into binary classes, such as high versus low S or σ, by appropriately adjusting and confirming threshold values,70 resulting in a list of possible TE material candidates.
Fig. 5 Examples of machine learning (ML) models for materials property predictions, including supervised models, unsupervised models, and semi-supervised models. (a) Schematic illustration of the composition-based 10-fold cross-validation strategy, and the comparison of ML model performance across the training set, testing set, and an independent dataset published in 2023.67 (b) Iterative integration of unsupervised ML with labeled and reported half-Heusler TE materials, and the ScNiSb-based TE material was experimentally investigated;71 (c) PU learning framework employing bootstrap aggregating (bagging) techniques, and the identified potential TE materials were validated through theoretical calculations.73 Adapted with permission from: (a) ref. 67 © 2024 Springer Nature, (b) ref. 71 © 2022 The Authors and (c) ref. 73 © 2023 AIP Publishing.
Unsupervised learning does not require well-labeled training data, whereas semi-supervised learning relies on partially labeled datasets. Both approaches have garnered attention in situations where supervised learning is challenging due to the scarcity of sufficient labeled data. For example, unsupervised clustering methods,71 including K-means and Gaussian Mixture Models, were employed to group half-Heusler compounds into distinct clusters based on generated features (Fig. 5b). ScNiSb was identified as a promising candidate, and subsequent experiments achieved peak zT values of ∼0.5 at 925 K in p-type Sc0.7Y0.3NiSb0.97Sn0.03 and ∼0.3 at 778 K in n-type Sc0.65Y0.3Ti0.05NiSb. Unsupervised word embeddings72 trained on materials literature can capture latent knowledge and be applied to explore potential TE materials. Predictions derived from historical literature have been validated by recently reported TE materials, demonstrating that such insights can effectively guide the discovery of new candidates. Positive and unlabeled (PU) learning,73 a semi-supervised learning method, was proposed to train a classifier to distinguish reported TE materials (P) from unreported materials (U) (Fig. 5c). Using this approach, the probabilities of unlabeled materials belonging to the TE class were predicted. Finally, forty candidate TE materials were identified. Eight p-type and twelve n-type materials exhibited excellent theoretical zT values greater than 1. In addition, a semi-supervised generative learning framework was developed for the inverse design of TE materials,74 combining limited labeled data with augmented unlabeled data to generate and validate high-performance candidates. The designed compound Mg3.1Sb0.5Bi1.497Te0.003 exhibited a zT of 0.75 at 300 K, surpassing most known inorganic materials at room temperature. 
Therefore, ML has become an indispensable tool for TE materials research, enabling efficient property prediction, data-driven screening, and inverse design.75–77
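The positive and unlabeled (PU) learning strategy with bagging mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the cited workflow: the base classifier is swapped for a simple nearest-centroid vote, the two-dimensional feature vectors are hypothetical, and the function name `pu_bagging_scores` is invented here. In each round a bootstrap sample of unlabeled points is treated as provisional negatives, and every unlabeled point left out of that sample receives a vote; the averaged votes approximate the probability of belonging to the positive class.

```python
import random


def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=200, seed=0):
    """Average out-of-bag positive votes for each unlabeled point."""
    rng = random.Random(seed)
    k = len(positives)  # size of each bootstrap "negative" sample
    pos_c = centroid(positives)
    votes = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        drawn = [rng.randrange(len(unlabeled)) for _ in range(k)]
        neg_c = centroid([unlabeled[i] for i in drawn])
        in_bag = set(drawn)
        for i, x in enumerate(unlabeled):
            if i in in_bag:
                continue  # score only points not used as provisional negatives
            votes[i] += 1.0 if sq_dist(x, pos_c) < sq_dist(x, neg_c) else 0.0
            counts[i] += 1
    return [v / c if c else float("nan") for v, c in zip(votes, counts)]


# Hypothetical descriptors: three reported materials (P) and eight unreported ones (U).
positives = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
unlabeled = [(1.1, 1.0), (0.95, 1.05), (5.0, 5.0), (5.2, 4.8),
             (4.9, 5.1), (5.1, 5.0), (4.8, 4.9), (5.0, 5.2)]
scores = pu_bagging_scores(positives, unlabeled)
ranked = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
print("unlabeled points ranked by score:", ranked)
```

A production version would replace the centroid vote with a proper classifier and use physically meaningful features; the bagging and out-of-bag scoring logic would stay the same.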
Fig. 6 Examples of machine learning (ML) for catalyst performance prediction, including models trained on experimental and theoretical data, and those employing ML potentials (MLPs). (a) Diagonal scatter plot comparing experimental and predicted values by XGBoost on the training and test sets at 0.8 and 0.63 VRHE, along with a contour map illustrating the model-predicted current densities across different multicomponent metal oxides.81 (b) Development of the 2D–3D ensemble model for C–C coupling big data set prediction.84 (c) Left: schematic illustration of MLP generated by active learning on-the-fly during hybrid molecular dynamics and time-stamped force-biased Monte Carlo (MD/tfMC) simulations.89 Right: theoretical analyses combining an MLP with replica exchange molecular dynamics and Monte Carlo based atom swaps (REMD/MC) for understanding catalytic behavior.90 Adapted with permission from: (a) ref. 81 © 2024 The Authors, (b) ref. 84 © 2024 American Chemical Society, and (c) ref. 89 © 2025 The Authors and ref. 90 © 2025 American Chemical Society.
From a theoretical perspective, the interaction between reaction intermediates and catalyst surfaces plays a crucial role in determining catalytic performance. The adsorption energies of key intermediates are typically calculated to quantify their binding strengths with the surface. These energetic parameters serve as fundamental descriptors for constructing microkinetic volcano models that predict theoretical catalytic activities. Therefore, obtaining accurate adsorption energies is essential for assessing electrocatalytic performance. ML methods have been employed to predict adsorption energies for key intermediates across a vast number of catalytic sites. For instance, ML models have been used to evaluate the adsorption energies of OH* on millions of reactive sites of different crystal facets in high-entropy alloys (HEAs) for ORR,82 to predict CO adsorption energies on layered alloy surfaces relevant to CO2-to-methanol conversion,83 and to estimate the adsorption energies of six C1 precursors (CO, COH, CHO, CH, CH2, and CH3) and twenty-one C2 combinations (six symmetric and fifteen asymmetric couplings) involved in the C–C coupling processes (Fig. 6b).84,85
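The count of twenty-one C2 combinations follows directly from pairing six C1 precursors with repetition allowed, and the adsorption energy itself is a simple energy difference. The short sketch below checks the combinatorics and states the usual definition; the three energies passed to `adsorption_energy` are hypothetical placeholders, not DFT or ML outputs.

```python
from itertools import combinations_with_replacement

C1_PRECURSORS = ["CO", "COH", "CHO", "CH", "CH2", "CH3"]


def coupling_pairs(precursors):
    """Split all unordered C1 + C1 pairings into symmetric and asymmetric couplings."""
    pairs = list(combinations_with_replacement(precursors, 2))
    symmetric = [p for p in pairs if p[0] == p[1]]
    asymmetric = [p for p in pairs if p[0] != p[1]]
    return symmetric, asymmetric


def adsorption_energy(e_slab_with_adsorbate, e_clean_slab, e_reference):
    """E_ads = E(slab + adsorbate) - E(slab) - E(adsorbate reference), all in eV."""
    return e_slab_with_adsorbate - e_clean_slab - e_reference


symmetric, asymmetric = coupling_pairs(C1_PRECURSORS)
print(len(symmetric), "symmetric +", len(asymmetric), "asymmetric couplings")

# Hypothetical total energies (eV) for one adsorbate on one site.
e_ads = adsorption_energy(-105.2, -100.0, -4.5)
```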
In addition, MLPs are typically trained on datasets generated from DFT calculations, where total energies and atomic forces serve as the training targets.86,87 By learning the relationships between local atomic environments and these physical quantities, MLPs can accurately construct potential energy surfaces, thereby accelerating adsorption energy calculations and providing valuable insights into catalytic mechanisms. For example, the AdsorbML framework88 and an active-learning Gaussian Approximation Potential (GAP) model (Fig. 6c, left)89 have been proposed to accelerate adsorption energy evaluations and efficiently identify global minima with formation energies. Furthermore, an MLP coupled with replica-exchange molecular dynamics was employed to describe the effect of Ru composition variation on phase formation and stability in Rux(Ir, Fe, Co, Ni)1−x multicomponent alloys under acidic OER conditions (Fig. 6c, right).90 Collectively, these efforts have greatly advanced the data-driven design and discovery of promising electrocatalysts.91
000 experimentally curated pressure–composition isotherms, Jang et al. demonstrated how physically interpretable ML can be harnessed to predict key thermodynamic and gravimetric metrics, namely the gravimetric hydrogen density (w) and the equilibrium pressure at room temperature (Peq,RT), with accuracy comparable to that of state-of-the-art black-box models. Fig. 7 presents an integrated view of their simulation-based framework that connects large-scale data curation, symbolic-regression modeling, and descriptor-guided materials mapping. Fig. 7a summarizes the DigHyd database, which aggregates thousands of experimentally measured pressure–composition isotherms for metal hydrides. The scatter of equilibrium pressure Peq,RT versus gravimetric capacity w demonstrates an inherent performance trade-off: light-element hydrides achieve high w but exhibit excessively low pressures, whereas transition-metal hydrides release hydrogen readily but with low capacity. None of the known compositions reach the US-DOE target window, emphasizing the need for predictive models that can transcend existing empirical limits. Fig. 7b outlines the symbolic-regression approach used to derive physically interpretable equations for w and Peq,RT. Starting from four chemically intuitive descriptors (atomic mass M, electronegativity χ, molar density ρmol, and ionic filling factor ηf),94 they performed a high-throughput search over more than a million candidate analytical forms generated by combining scalar and link transformations. This exhaustive exploration yielded compact closed-form models that match the accuracy of black-box ML regressors while preserving transparent physical meaning. The outcome demonstrates that even complex hydrogen-storage behavior can be captured by simple, human-readable equations when descriptor selection and model search are systematically orchestrated. Fig. 7c–f schematically illustrates how each descriptor affects storage performance.
For high capacity, low atomic mass and strong bond polarity (large χ) are beneficial because they reduce lattice weight and strengthen metal atom–hydrogen interactions. In contrast, high equilibrium pressure, desirable for room-temperature hydrogen release, is favored by densely packed lattices with low ionic filling factors and weaker bond polarity (small χ). The fact that χ exerts opposite influences on w and Peq,RT explains the persistent trade-off observed in panel (a) and frames the chemical origin of the capacity–pressure dilemma. Finally, Fig. 7g–i shows descriptor-based design maps derived from the regression equations. The Mg-anchored pathway typifies saline-type hydrides, offering high w but low Peq,RT; the Ni-anchored pathway represents interstitial hydrides with the reverse trend. Remarkably, the Be-anchored map reveals a distinct trajectory that approaches the US-DOE target region,95 identifying Be and its alloys (particularly Be–Na) as unique compositions capable of balancing both metrics. This “bird's-eye” view underscores the predictive and explanatory power of the symbolic-regression framework, which transforms large experimental datasets into quantitative, physically interpretable guidance for the rational design of next-generation hydrogen-storage materials.
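The exhaustive search over candidate analytical forms can be illustrated with a toy brute-force symbolic regression. The sketch makes several assumptions that are not taken from the original study: the transformation set, the restriction to forms a·f(x_i) + b and a·f(x_i)·g(x_j) + b, and the synthetic data are all hypothetical, and the search space here has 35 candidates rather than more than a million. It only shows the principle of enumerating closed-form expressions, fitting each by least squares, and keeping the most accurate one.

```python
import math
from itertools import combinations

# Hypothetical transformation set; the original study's transformations may differ.
TRANSFORMS = {
    "{}": lambda v: v,
    "1/{}": lambda v: 1.0 / v,
    "log({})": math.log,
    "sqrt({})": math.sqrt,
    "{}^2": lambda v: v * v,
}


def fit_line(z, y):
    """Ordinary least squares for y = a*z + b; returns (a, b, rmse)."""
    n = len(z)
    z_mean, y_mean = sum(z) / n, sum(y) / n
    s_zz = sum((zi - z_mean) ** 2 for zi in z)
    a = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y)) / s_zz
    b = y_mean - a * z_mean
    rmse = math.sqrt(sum((a * zi + b - yi) ** 2 for zi, yi in zip(z, y)) / n)
    return a, b, rmse


def symbolic_search(descriptors, target):
    """Enumerate candidate forms and return (label, a, b, rmse) of the best fit."""
    features = [
        (pattern.format(name), name, [fn(v) for v in values])
        for name, values in descriptors.items()
        for pattern, fn in TRANSFORMS.items()
    ]
    candidates = [(label, vals) for label, _, vals in features]
    for (l1, d1, v1), (l2, d2, v2) in combinations(features, 2):
        if d1 != d2:  # only combine transformations of different descriptors
            candidates.append((f"{l1} * {l2}", [p * q for p, q in zip(v1, v2)]))
    results = [(label, *fit_line(vals, target)) for label, vals in candidates]
    return min(results, key=lambda r: r[3])


# Synthetic data generated from y = 2/M + 0.5 (hypothetical, positive descriptors).
M = [1.0, 2.0, 4.0, 5.0, 8.0, 10.0]
chi = [2.2, 1.3, 3.0, 0.9, 1.6, 2.5]
y = [2.0 / m + 0.5 for m in M]
label, a, b, rmse = symbolic_search({"M": M, "chi": chi}, y)
print(f"best form: y = {a:.3f} * {label} + {b:.3f}")
```

Because every candidate is an explicit formula, the selected model can be inspected term by term, which is the interpretability advantage the text attributes to this class of methods.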
Fig. 7 Integrated simulation perspective for physically interpretable ML modeling of hydrogen storage properties. (a) Overview of the DigHyd database showing the broad distribution of equilibrium pressure Peq,RT and gravimetric capacity w across reported metal hydrides, illustrating the unavoidable trade-off between the two properties and the gap from US-DOE targets. (b) Framework of symbolic-regression modeling, where combinations of chemically meaningful descriptors and nonlinear transformations were systematically searched to construct millions of candidate equations. (c–f) Schematic interpretation of key descriptors governing w and Peq,RT. (g–i) Descriptor-based design maps generated from the regression models for compositions anchored on Mg, Ni, and Be, respectively. The maps visualize compositional pathways linking saline- and interstitial-type hydrides and highlight that Be-containing systems, especially Be–Na alloys, uniquely approach the US-DOE target zone (red = ultimate, green = internal combustion engine, and blue = fuel cell). Reproduced from ref. 94, under the terms of the Creative Commons CC BY-NC license.
Fig. 8 Multimodal data extraction pipeline: the descriptive interpretation of visual expression (DIVE).31 (a) Conventional extraction pipeline based on a single multimodal LLM. (b) DIVE extraction pipeline, where descriptive prompts embed key data points and generate image replacements for structured data extraction. (c) Annual publication trends categorized by different types of hydrogen storage materials. Reproduced from ref. 31, under the terms of the Creative Commons CC BY-NC license.
The DDSE (now renamed as DigBat: https://www.digbat.org), developed by Li and co-workers, compiles a comprehensive and large-scale dataset of SSEs.100 As of February 2026, it contains approximately 3000 experimental materials, 25 996 ionic conductivity measurements and 863 computational entries. Moreover, the DDSE functions as a dynamic and self-improving research infrastructure that continuously incorporates new data from both experiments and simulations. Automated data extraction and model standardization enable rapid exploration of ionic conductivity trends across wide chemical and structural spaces. By coupling large-scale statistical analysis with LLMs, DDSE identifies transport-related parameters such as activation energy, temperature dependence, and carrier type, thereby linking physical descriptors with measurable macroscopic properties. Importantly, DDSE supports iterative model refinement through feedback between simulation and experiment, transforming it from a static database into a dynamic predictive engine (Fig. 9a). This workflow enhances the interpretability of data-driven models and enables physics-informed correlation mapping, bridging the gap between first-principles accuracy and experimental observability.
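As an illustration of how an activation energy can be obtained from conductivity-versus-temperature records of the kind such a database stores, the sketch below fits the simple Arrhenius form σ = σ0 exp(−Ea/kBT) by linear regression of ln σ against 1/T. The functional form (some studies fit σT instead of σ), the function name `fit_arrhenius`, and the synthetic data points are assumptions; this is not a description of the DDSE's internal procedure.

```python
import numpy as np

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def fit_arrhenius(temperature_k, conductivity):
    """Fit ln(sigma) = ln(sigma0) - Ea / (k_B * T) by linear least squares.

    Returns (Ea in eV, sigma0 in the same units as `conductivity`).
    """
    inv_t = 1.0 / np.asarray(temperature_k, dtype=float)
    ln_sigma = np.log(np.asarray(conductivity, dtype=float))
    slope, intercept = np.polyfit(inv_t, ln_sigma, 1)
    return float(-slope * K_B_EV), float(np.exp(intercept))

# Synthetic, noise-free data generated from known parameters (hypothetical values).
temps = np.array([298.0, 323.0, 348.0, 373.0])
true_ea, true_sigma0 = 0.35, 50.0
sigma = true_sigma0 * np.exp(-true_ea / (K_B_EV * temps))

ea, sigma0 = fit_arrhenius(temps, sigma)
print(f"Ea = {ea:.3f} eV, sigma0 = {sigma0:.1f}")
```

On real measurements the fit would normally be restricted to a temperature window without phase transitions, since a change of slope in the Arrhenius plot signals a change of transport mechanism.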
Fig. 9 Data-driven, AI-accelerated discovery of solid-state electrolytes (SSEs). (a) Ion conductivity vs. inverse temperature for ∼3000 experimental materials and activation energies for ∼700 computational materials.51 (b) Conductivity vs. inverse temperature for monovalent and divalent SSEs, with/without neutral molecules.51 (c) Comparison of computational and experimental activation energies for different methods.51 (d) Ionic conductivity at 298 K vs. structural descriptors for sulfide-based SSEs.104 (e) Predicted ionic conductivity of polymer electrolytes based on structural components.105 Adapted with permission from: (a–c) ref. 51 © 2025 John Wiley and Sons, (d) ref. 104 © 2024 John Wiley and Sons and (e) ref. 105 © 2023 The Authors.
Metal hydride-based SSEs provide a representative demonstration of DDSE's predictive capabilities.101 Metal hydrides possess light-element frameworks, flexible lattice structures, and tunable cation–anion interactions, making them ideal systems for model-driven discovery. Using the DDSE data, a large-scale analysis of divalent hydrides containing neutral molecules was conducted by combining big-data analytics with LLM-assisted feature extraction to reveal how lattice coordination and molecular incorporation affect cation migration.51,101 The results revealed two universal insights. First, the inclusion of neutral molecules such as NH3 promotes divalent ion migration by reducing electrostatic confinement and increasing the dynamic reorientation of BH4− clusters (Fig. 9b). Second, a consistent gap was observed between experimental and simulated activation energies, indicating that traditional static simulations often neglect configurational entropy effects (Fig. 9c).
Beyond the DDSE framework, numerous researchers have independently applied physics-informed and data-driven models to other classes of SSEs, revealing diverse yet convergent mechanisms of ionic transport. For antiperovskites (X3BA, X = Li, Na), descriptor-based learning identified the ratio between the tolerance factor and atomic packing factor as a negative predictor of ionic conductivity. Using this compact descriptor, nitro-halide double antiperovskites such as Li6NClBr2 and Li6NBrI2 were predicted to reach room-temperature conductivities above 1 × 10−4 S cm−1 in AIMD simulations.102 In the well-known garnet-type oxides (Li7La3Zr2O12, LLZO), data mining combined with molecular dynamics revealed that Ga3+ occupation of octahedral sites enhances Li+ migration, while Sc3+ co-doping promotes redistribution between octahedral and tetrahedral sites.103 However, the non-monotonic conductivity trend in Ga/Sc co-doped LLZO illustrates the complex balance between carrier concentration and mobility, underscoring the importance of mechanistic frameworks beyond empirical fitting. For sulfide-type argyrodites, both experimental measurements and ML modeling demonstrated that ionic conductivity scales linearly with the product of halogen substitution ratios at octahedral and corridor cage centers (RX,Oh × RX,Corridor)104 (Fig. 9d). Controlled halogen substitution enhances Li+ mobility by weakening sulfur localization, while excessive doping produces insulating by-products such as LiCl that reduce overall performance. In polymer electrolytes, the ChemArr model integrated the Arrhenius relationship into a predictive neural network, achieving near-experimental accuracy across more than 200 studies and screening over 20 000 polymers.105 The model identified siloxane- and phosphazene-derived polymers as promising high-conductivity materials with low glass-transition temperatures (Fig. 9e).
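One way to build the Arrhenius relationship into a predictive model is to let the network output Arrhenius parameters and compute the conductivity from them, so that every prediction has a physically consistent temperature dependence. The sketch below shows only that final step, under the assumption that the two parameters are a log10 prefactor and an activation energy. The parameter values are hypothetical stand-ins for a trained network's output, and the exact parameterization used by ChemArr is not reproduced here.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def arrhenius_log10_sigma(log10_prefactor, activation_energy_ev, temperature_k):
    """Return log10(sigma) = log10(A) - Ea / (k_B * T * ln 10)."""
    return log10_prefactor - activation_energy_ev / (
        K_B_EV * temperature_k * math.log(10.0)
    )

# Hypothetical parameters standing in for the output of a trained network.
predicted = {"log10_prefactor": 2.0, "activation_energy_ev": 0.5}

# Evaluate the same parameter pair at several temperatures.
curve = {
    t: arrhenius_log10_sigma(temperature_k=t, **predicted)
    for t in (298.0, 333.0, 373.0)
}
for t, value in curve.items():
    print(f"T = {t:.0f} K: log10(sigma) = {value:.3f}")
```

Because the temperature enters only through the closed-form expression, the predicted conductivity rises monotonically with temperature for any positive activation energy, which an unconstrained regressor does not guarantee.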
Beyond bulk transport behavior, interfacial processes remain a key bottleneck in the development of solid-state batteries. At Li-metal anodes, instability often results from dendrite formation and side reactions. ML models based on support vector machine (SVM) and kernel ridge regression (KRR) identified Sc3+ and Ca2+ as effective dopants, capable of forming stable SSE interphases that mitigate interfacial degradation.106 On the cathode side, interfacial resistance is largely governed by the complex microstructure of composite electrodes. Hwang et al. introduced advanced microstructural characterization based on semantic segmentation of electron micrographs, enabling automated quantification of porosity, particle distribution, and phase connectivity.107
The use of LLMs to analyze data and gain insights into materials is a key approach in advancing digital materials ecosystems. By combining data-driven analysis with physics-informed ML models, this method enhances the design and prediction of materials with desired properties. In SSEs, LLMs help identify important relationships between atomic structure and material performance, revealing new trends that guide the development of high-conductivity materials. This process not only refines ML models but also fosters a dynamic, evolving research infrastructure, where continuous feedback between simulations, experiments, and model refinements accelerates materials discovery and innovation.
Fig. 10 Workflow of AI agent-driven discovery of new hydrogen storage materials. (a) The user specifies key requirements, including material type, constituent elements, and performance targets. (b) The DigHyd agent proposes initial candidate compositions based on data mined from over 4000 historical publications. (c) The candidate compositions are evaluated using a pretrained ML model to predict their gravimetric hydrogen density. (d) The DigHyd agent rapidly designs, predicts, and iteratively refines candidate materials in line with researcher-defined goals within minutes. Finally, the DigHyd agent outputs the final material design, together with the relevant reaction conditions and an assessment of synthetic feasibility. Reproduced from ref. 31, under the terms of the Creative Commons CC BY-NC license.
In the first round, drawing on both its local knowledge base and the analytical, reasoning, and predictive abilities of LLMs, the DigHyd Agent proposed CaMgFe2 (Fig. 10b). This candidate was then evaluated using the ML regression model, which predicts hydrogen density directly from the material's composition. With an R2 value of 0.87, this model provides a reliable first-pass screening for LLM-generated candidates (Fig. 10c). CaMgFe2 was predicted to store 2.64 wt% hydrogen (Fig. 10d).
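For readers unfamiliar with the metric quoted above, the coefficient of determination compares the squared prediction error with the variance of the measured values. The short sketch below computes it for a handful of made-up numbers. The data are purely illustrative and unrelated to the DigHyd model.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Hypothetical measured and predicted capacities (wt%), for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(measured, predicted)
print(f"R^2 = {score:.2f}")
```

A value of 1 means perfect agreement, while 0 means the model does no better than predicting the mean of the measurements.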
The AI agent next suggested increasing the Mg content, yielding Mg2Fe, with a predicted capacity of 4.13 wt%. However, literature reports indicate that Mg2Fe undergoes hydrogenation and dehydrogenation only at elevated temperatures (300–400 °C), thus failing to meet the design criteria. In response, DigHyd refined the composition to Mg2Fe0.75Co0.25, and later to Mg2Fe0.6Co0.2Mn0.2. The latter was predicted to achieve 4.19 wt% hydrogen capacity, with Mn (or alternatively Al) contributing to hydride stabilization and plateau-pressure optimization. Importantly, this final composition has not been reported in any existing database. Together, these results (Fig. 10d) highlight the ability of the DigHyd Agent to rapidly design, predict, and iteratively refine material candidates according to user-defined goals—within minutes. If such AI-driven agents are integrated with high-throughput experimental platforms, the efficiency of materials discovery and development could reach an unprecedented level.
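The propose, predict, and refine cycle described above can be summarized as a simple loop. The sketch below is a schematic stand-in rather than the DigHyd agent itself. The LLM proposer is replaced by a fixed sequence of the compositions named in the text, and the ML regressor is replaced by a lookup of the capacities quoted there, so the intermediate composition Mg2Fe0.75Co0.25, for which no value is given, is omitted. The design-criteria check is reduced to a single flag covering the one case the text discusses. All function and variable names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    formula: str
    predicted_wt_pct: float
    meets_criteria: bool

# Stand-in for the ML regressor: predicted capacities (wt%) quoted in the text.
REPORTED_PREDICTIONS = {
    "CaMgFe2": 2.64,
    "Mg2Fe": 4.13,
    "Mg2Fe0.6Co0.2Mn0.2": 4.19,
}
# Stand-in for the literature check: Mg2Fe cycles only at 300-400 °C per the text.
FAILS_TEMPERATURE_CRITERION = {"Mg2Fe"}

def propose(round_index):
    """Stand-in for the LLM proposer: replay the sequence from the text."""
    order = list(REPORTED_PREDICTIONS)
    return order[round_index] if round_index < len(order) else None

def run_design_loop(max_rounds=10):
    """Propose, predict, check criteria, and keep the best accepted candidate."""
    history, best = [], None
    for i in range(max_rounds):
        formula = propose(i)
        if formula is None:
            break
        candidate = Candidate(
            formula,
            REPORTED_PREDICTIONS[formula],
            formula not in FAILS_TEMPERATURE_CRITERION,
        )
        history.append(candidate)
        if candidate.meets_criteria and (
            best is None or candidate.predicted_wt_pct > best.predicted_wt_pct
        ):
            best = candidate
    return best, history

best, history = run_design_loop()
print(best.formula, best.predicted_wt_pct)
```

In a real agent, `propose` would condition on the full history, including the reasons for rejection, which is what allows the composition to be refined rather than merely re-sampled.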
In summary, the integration of AI agents into materials research is reshaping the conventional paradigm of discovery.108 By bridging data extraction, knowledge reasoning, and material design, such agents not only accelerate the pace of research but also enable a deeper understanding of structure–property relationships that were previously difficult to capture. The examples of DIVE and DigHyd demonstrate how multimodal and generative AI can work “hand in hand”—transforming unstructured literature into structured knowledge and transforming that knowledge into actionable design hypotheses. Looking ahead, the close coupling of AI agents with autonomous experimental platforms will pave the way for a truly self-driving laboratory,109,110 where materials discovery evolves from a manual and time-consuming process into an intelligent, iterative, and self-improving cycle.
Fig. 11 AI-driven automated design workflow for accelerating scientific discovery in a closed-loop framework.117 The integrated design framework connects high-throughput experiments, AI-driven automated workflows, and scientific insights to accelerate catalyst development. By coupling robotic experimentation and advanced characterization with ML-guided screening, descriptor analysis, and mechanistic understanding, the DigMat platform enables continuous feedback between data, models, and experiments. Adapted with permission from ref. 117 © 2024 The Authors.
To illustrate the closed-loop framework originating from DigCat, the design of a novel, stable, and low-cost bifunctional metal oxide (MO) electrocatalyst for water splitting was investigated.117 From 1430 thermodynamically stable MOs identified by stability analysis in DigCat, RbSbWO6 was chosen as a case study. RbSbWO6 outperformed several widely studied, heavily engineered MOs for the HER and OER in acidic media. Adding the RbSbWO6 experimental dataset to DigCat's database in turn improves subsequent rounds of catalyst discovery, helping to address the scarcity of high-quality experimental datasets. This study demonstrates the closed-loop framework and the potential of the AI-driven automated workflows of digital materials platforms to integrate human curiosity, theory-based knowledge, and machine precision in order to accelerate real-world impact.
In a parallel development, high-entropy alloys (HEAs) have emerged as one of the most challenging material systems due to their vast compositional complexity and multi-principal element design space. The conventional trial-and-error synthesis method is inefficient for such a large design space. High-throughput and data-driven ML strategies have been systematically reviewed as key enablers for accelerated HEA discovery, covering the preparation, characterization, computation, and structure–property mapping necessary for efficient exploration of HEA systems.118 Emerging autonomous experimental platforms are advancing the integration of AI with robotics for materials synthesis. Notably, the concept of “self-driving laboratories” that integrate AI, automation, and high-throughput characterization is gaining traction in alloy development and broader materials science, showing that automated closed-loop workflows can unify predictive models with robotic experimentation to efficiently explore new compositions.119 In HEA development specifically, ML models are being combined with high-throughput synthesis methods to generate large libraries of alloy samples rapidly, and automated characterization data are fed back into the ML models for iterative design optimization.120 These advances highlight that autonomous and AI-guided experimental systems are increasingly important for HEA design, helping to overcome data scarcity and accelerate the translation of predictions into verified materials.
In summary, the future of modern digital materials will depend on a dual evolution: the scientific rigor of data and models, and the cognitive sophistication of AI agents. By merging verified data, interpretable models, human-inspired reasoning, and standardized automation, the community can move from knowledge accumulation to autonomous scientific discovery—a transition that may redefine not only materials research but the very process of scientific innovation itself.
This journal is © The Royal Society of Chemistry 2026