Open Access Article
Boshko Koloski,a Senja Pollak,a Sašo Džeroski*a and Aleksandar Kondinski*b
aDepartment of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, Slovenia. E-mail: saso.dzeroski@ijs.si
bInstitute of Physical and Theoretical Chemistry, Graz University of Technology, Stremayrgasse 9/I, 8010 Graz, Austria. E-mail: kondinski@tugraz.at
First published on 17th March 2026
Over the past few years, large language models have become technologically ubiquitous and now offer a powerful route to accelerate discoveries in chemistry. In this article, we highlight current impactful applications of large language models in inorganic chemistry, from smart text mining of the inorganic literature through the proposal and discovery of new materials to real-time experimentation. We also discuss ongoing developments and their potential future impact on the field.
Building on this foundation, LLMs support text categorisation, keyword extraction, and the automatic conversion of free prose into structured data relevant to chemistry.6,7 Internally, LLMs treat text as a sequence of tokens, usually whole words or smaller sub-word pieces, which pass through stacked layers of self-attention (where each token attends to all other tokens in the sequence).3 Because their tokenisers are trained on generic web corpora, chemical information such as oxidation states (e.g. FeIII), unusual ligand labels, and even Unicode arrows can be fragmented into several subtokens, a mismatch that can erode numerical fidelity and alter the chemical meaning. On average, chemical terms are segmented into tokens only 4–6 characters long, producing fragmented inputs and eroding structured chemical semantics at the embedding layer.8,9 Likewise, Tarasova notes that chemical named-entity recognition is strongly affected by tokenisation.10 Training typically proceeds in two stages: first, pretraining, where the network reads an internet-scale corpus and learns to predict unseen tokens from context, and second, fine-tuning or instruction tuning, where the same parameters are aligned with task-specific prompts used by chemists.
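The fragmentation problem can be illustrated with a toy greedy longest-match tokeniser. The vocabulary below is an invented, web-style inventory that lacks chemistry-specific tokens, so an oxidation-state label such as FeIII splits into several pieces; real tokenisers (BPE, WordPiece) are more sophisticated, but the failure mode is the same.

```python
# Toy greedy longest-match subword tokeniser (a sketch, not a real BPE).
def tokenize(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest vocabulary entry matching at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical generic vocabulary with no chemistry-specific entries.
vocab = {"Fe", "I", "II", "the", "oxide", "ion", "state"}

print(tokenize("FeIII", vocab))  # the oxidation-state label splits into ["Fe", "II", "I"]
```

The split pieces no longer carry the meaning "iron in oxidation state +3" as a unit, which is exactly the loss of structured chemical semantics described above.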
Decoder-style models learn autoregressively to predict the next token, whereas encoder-style models recover a subset of intentionally masked tokens, as demonstrated by ChemBERTa and MolBERT.11,12 From an architectural standpoint, the community distinguishes encoder-only systems that yield compact semantic embeddings useful for similarity search and property prediction, decoder-only systems that excel at continuous text generation and dialogue, and newer encoder–decoder hybrids that aim to combine both strengths.13 These fixed-length vector embeddings cluster chemically related sentences, reactions, and coordination environments in latent space, enabling k-nearest-neighbour screening for catalysts, ligands, or structure–property relationships without any natural-language output. While these architectures unlock powerful representations, their open-ended training also introduces new sources of error. As a standalone transformer lacks an authenticated knowledge base, it can generate confident yet unfounded statements, a phenomenon known as “hallucination”. Reinforcement learning from human feedback (RLHF) reduces the most obvious errors by steering the model toward expert judgment, but the same procedure can miscalibrate token probabilities, so a persuasive answer may still be wrong.14 Retrieval augmented generation (RAG) adds a further safeguard by letting the model fetch primary literature or database records at run time and ground each claim in a cited passage, which lowers hallucination frequency, though it cannot remove it entirely.5 A third approach appears in ReAct agents (Reason and Act agent framework), which explicitly alternate a natural-language “Reason” step with an “Act” step that calls an external tool such as a crystallographic search, a quantum-chemical calculation, or a laboratory robot.
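The k-nearest-neighbour screening mentioned above can be sketched in a few lines. The three-dimensional vectors here are made-up placeholders standing in for fixed-length embeddings from an encoder-only model; in practice they would come from a chemistry-tuned encoder such as ChemBERTa and have hundreds of dimensions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn(query, library, k=2):
    # Rank library entries by similarity to the query embedding.
    ranked = sorted(library.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Invented embeddings: two coordination environments and one unrelated entry.
library = {
    "Cu(II) square-planar complex": [0.9, 0.1, 0.0],
    "Fe(III) octahedral oxide":     [0.8, 0.3, 0.1],
    "organic dye synthesis":        [0.0, 0.2, 0.9],
}
query = [0.85, 0.2, 0.05]   # embedding of a new coordination environment

print(knn(query, library))  # the two coordination entries rank first
```

No natural-language output is generated at any point; the screen operates purely in the latent space, which is what makes encoder-only models attractive for similarity search.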
By publishing every intermediate thought and the evidence it retrieves, systems like ChemCrow allow chemists to inspect the whole chain of logic before accepting a recommendation.15 In general, LLMs are already reshaping research.16 This article therefore aims to survey and highlight the latest LLM-driven advances in inorganic chemistry, outlining emerging directions and their forthcoming implications for the field.
538 inorganic synthesis paragraphs into 19 488 balanced solid-state reactions through a combination of topic modelling, random forests and sequence tagging. Their approach reached 93% chemistry accuracy in predicting correct precursors, targets, and reactions; the authors attribute the remaining errors to paragraph parsing and the aforementioned tokenisation problem, which can cause failures in parsing chemical compositions. However, a significant challenge in training on extracted data is synthesis reproducibility: multiple attempts at the same literature procedure can yield different outcomes owing to subtle, unreported variations, making it difficult for models to identify critical procedural details.19 A random-forest approach was also taken by Zaki et al., who applied the idea to extracting annealing temperatures and indentation loads from a small set of glass-mechanics papers. The authors combined the textual data with compositional records to create a 102-point set, which improved Vickers-hardness prediction to a test R2 = 0.89 after processing the data.20 Recent advances have expanded this to multimodal curation. For example, Chan and coworkers developed EXSCLAIM!, an automated pipeline that extracts and labels microscopy and spectroscopy data from the primary literature to bridge the gap between text and visual evidence.21
Gupta et al. present MatSciBERT, a BERT model with continued pretraining on about 150 000 materials papers (about 285 million words). It sets a new state of the art on the Matscholar NER benchmark with a test Macro F1 of 0.8638; improves relation classification for synthesis procedures; and classifies glass versus non-glass abstracts with 96% accuracy. The glass abstracts cover inorganic glass topics including bioactive and rare-earth-doped systems, while metallic glasses appear in the pretraining corpus. Checkpoints are public and run on standard GPUs.22 Trewartha et al.23 introduce MatBERT, trained on about two million materials papers. Across solid-state, doping, and gold-nanoparticle-morphology tasks it exceeds general baselines by about one to four F1 points, and a compact BiLSTM with Mat2Vec still beats untuned BERT in these settings. The nanoparticle corpus concerns gold nanoparticles such as rods and spheres. Code and weights are openly available.
Transformer-based text miners now dominate inorganic chemistry curation, and two workflows have emerged. The first keeps the language model frozen and steers it with prompts over a focused corpus. Zheng et al.24 guided the GPT (Generative Pretrained Transformer) framework through 228 metal–organic framework (MOF) papers. GPT-3.5 and GPT-4 first isolated the synthesis sections and then extracted 26 257 reaction variables for roughly 800 frameworks, giving F1 scores between 0.90 and 0.99, where F1 is the harmonic mean of precision and recall. The second workflow fine-tunes a lighter model on a few hundred labelled examples and then applies it more broadly. Dagdelen et al. fine-tuned GPT-3 and the open-access Llama-2 on a few hundred annotated passages to jointly extract entities and relations.25 The models emit JSON records that link inorganic host compounds (for example, doped oxides and chalcogenides in solid-state semiconductors), their dopants, crystal-structure or phase labels, and guest species in metal–organic frameworks. On a solid-state doping task they reach an F1 of about 0.82 for host-to-dopant links; on general and MOF extractions, exact-match relation F1 typically lies in the 0.3 to 0.6 range (with higher manual scores reflecting normalisation and error correction). A human-in-the-loop setup reduced annotation time by approximately 57%.25
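For readers less familiar with the F1 metric used throughout these benchmarks, it is simply the harmonic mean of precision and recall. The counts below are invented for illustration only.

```python
def f1_score(true_positives, false_positives, false_negatives):
    # Precision: fraction of extracted items that are correct.
    precision = true_positives / (true_positives + false_positives)
    # Recall: fraction of true items that were actually extracted.
    recall = true_positives / (true_positives + false_negatives)
    # F1: harmonic mean of the two.
    return 2 * precision * recall / (precision + recall)

# e.g. 90 correct extractions, 10 spurious, 10 missed:
print(round(f1_score(90, 10, 10), 2))  # 0.9
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high F1 by inflating either precision or recall alone, which is why it is preferred over plain accuracy for extraction tasks.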
Beyond information extraction and structuring, LLMs can act as powerful approximators and predictors of chemical properties. Inorganic synthesis spans a vast composition space, so current AI models are expected only to unveil promising starting conditions. Before transformer models, Kim et al. mined synthesis text with recurrent nets and trained a conditional variational autoencoder, a generative model that samples outputs given the target, to propose action sequences and precursors.27 From about 51 000 action sequences and 116 000 precursors, the model suggested plausible precursor sets for InWO3 and PbMoO3 with training only up to 2005, and it screened 83 predicted ABO3 perovskites down to 19 with at least one route using commercially available precursors. More recently, Okabe et al. fine-tuned a distilled GPT on about nineteen thousand balanced reactions from the previously published corpus.18,28 One variant maps reactants to products, another maps products to reactants, and a third writes a full equation from only a target formula. Using a Tanimoto similarity on element counts, these models keep roughly 90% chemical fidelity even when prompts include extra verbs such as heat, mix or quench. Demonstrations cover BaTiO3, SrTiO3, the high-Tc cuprate YBa2Cu3O7, BiFeO3, LiMn2O4, Ni0.6Zn0.4Fe2O4 and Co3Sn2S2.28 However, these models face several limitations (Fig. 1). The Tanimoto similarity metric based on element counts is relatively coarse and cannot distinguish between different oxidation states or structural motifs. The models are constrained to compositions well represented in their training data and cannot reliably extrapolate to novel chemistries. Critically, they predict thermodynamically plausible reactions but provide no information about kinetic barriers, reaction conditions, or practical feasibility. A suggested precursor set may be chemically valid yet experimentally impractical owing to cost, availability, or incompatible processing requirements.
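A Tanimoto similarity over element counts, of the kind used to score the fidelity of these predicted equations, can be sketched as follows. The formula parser below is a simplification (no brackets or hydrates), and the scoring is our own minimal reading of the metric, not the authors' exact implementation.

```python
import re
from collections import Counter

def element_counts(formula):
    # Parse a simple formula like "BaTiO3" into {"Ba": 1, "Ti": 1, "O": 3}.
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[element] += float(number) if number else 1.0
    return counts

def tanimoto(formula_a, formula_b):
    # Tanimoto coefficient on the two element-count vectors.
    a, b = element_counts(formula_a), element_counts(formula_b)
    dot = sum(a[el] * b[el] for el in set(a) | set(b))
    return dot / (sum(v * v for v in a.values()) + sum(v * v for v in b.values()) - dot)

print(tanimoto("BaTiO3", "BaTiO3"))  # identical compositions score 1.0
print(tanimoto("BaTiO3", "SrTiO3"))  # substituting Ba by Sr lowers the score
```

The sketch also makes the coarseness of the metric concrete: FeO and Fe2O3 differ only in their count vectors, so the score says nothing about oxidation states, polymorphs, or structural motifs.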
Kim and coworkers extended this idea with SynthGPT, a GPT-3.5 model fine-tuned on positive-unlabelled compositions from the Materials Project.29 Given only a formula, the model classifies each composition as likely to be synthesised or as unlabelled, that is, unknown or unlikely. They recalibrated the decision threshold using the estimated proportion of true positives in the unlabelled pool, which balanced recall and precision and matched or exceeded a stoichiometric graph fingerprint.29 When prompted with a full reaction such as LiFePO4 ← Li2CO3 + FeC2O4 + (NH4)2HPO4, the same model selected precursors with top-1 accuracy on par with the Elementwise template formulation method and top-5 accuracy that slightly exceeded that method's notional limit, because it can choose valid reagents outside that template, for example phosphate or ammonium salts.29 The Elementwise template assumes one precursor per metal element in the target, which constrains its outputs, whereas SynthGPT is not bound by that rule.
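The threshold-recalibration step can be illustrated with a minimal sketch: given classifier scores for the unlabelled pool and an estimated fraction of true positives hidden in it, pick the cutoff so that exactly that fraction is classified as synthesisable. The scores and prior below are invented, and this is one simple reading of the idea, not the paper's exact procedure.

```python
def recalibrated_threshold(unlabelled_scores, pi):
    # pi: estimated fraction of true positives in the unlabelled pool.
    ranked = sorted(unlabelled_scores, reverse=True)
    cut = max(1, round(pi * len(ranked)))  # how many compositions should pass
    return ranked[cut - 1]                 # lowest score that still passes

# Hypothetical synthesisability scores for ten unlabelled compositions.
scores = [0.95, 0.80, 0.75, 0.40, 0.30, 0.20, 0.15, 0.10, 0.05, 0.02]

print(recalibrated_threshold(scores, pi=0.3))  # 0.75: the top 30% pass
```

Compared with a naive 0.5 cutoff, tying the threshold to the estimated positive fraction prevents the classifier from systematically over- or under-predicting synthesisability when positives are rare in the unlabelled pool.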
In a recent study, Kim et al. developed StructGPT, a GPT-4o-mini model trained to predict the synthesisability of inorganic materials from textual descriptions of their crystal structures.30 The performance of the model was shown to be comparable to established crystal-graph convolutional neural networks (CGCNNs), while a related approach using the model's text embeddings in a positive-unlabelled classifier set a new benchmark for the F1 score at a lower computational cost. The model demonstrates high structural sensitivity, correctly lowering its synthesisability score in response to small, symmetry-breaking coordinate perturbations. Crucially, StructGPT identified twelve hypothetical, near-hull compounds (within 0.01 eV per atom) as non-synthesisable, a prediction that aligns with failed experimental attempts and suggests the model captures kinetic barriers missed by simple energy screens. A key feature is the ability of the model to explain its reasoning, generating rules based on factors such as size mismatch and coordination strain, with the authors noting distinct explanatory themes for major inorganic material classes such as perovskites, Heuslers, and spinels. However, the model's text-based representation may miss geometric details and cannot reliably handle defects or disorder, and its explanations may represent post-hoc rationalisations rather than true mechanistic understanding. The chemical origin of these predictions remains unclear. SynthGPT is trained on Materials Project compositions, so its predictions likely reflect both computed thermodynamic stability and patterns of experimental accessibility. StructGPT shows sensitivity to symmetry-breaking coordinate perturbations and identifies some near-hull compounds as non-synthesisable, which suggests that it may also capture aspects of kinetic inaccessibility beyond simple energy-based screening.30 At the same time, both signals are affected by data bias.
Compositions that dominate the training set may appear more synthesizable regardless of their true thermodynamic or kinetic status. Addressing this limitation will require datasets that include failed as well as successful synthesis attempts.
Gruver et al. fine-tune a large language model to write inorganic crystal structures as strings and to propose stable candidates.31 Training uses the Materials Project MP-20 subset of stable inorganic crystals with at most twenty atoms per unit cell and an extended set of about 127 000 Materials Project entries for text conditioning and infilling. The strongest model reaches 99.6% structural validity and 95.4% compositional validity, and 49.8% of samples are predicted metastable with Ehull < 0.1 eV per atom versus 28.8% for a diffusion baseline. The domain covers diverse inorganic families such as oxides, nitrides, carbides, borides, halides, and intermetallics, including perovskites and spinels.
Despite these advances, LLM-based prediction models face several common limitations. Synthesis prediction models rely on coarse metrics that cannot distinguish oxidation states or structural motifs, and while they suggest thermodynamically plausible reactions, they provide no information about kinetic barriers, reaction conditions, or practical feasibility. Synthesizability classifiers may misclassify thermodynamically stable but kinetically inaccessible compositions or miss realizable metastable phases. Text-based structure representations may lose geometric nuances and struggle with defects, disorder, or non-stoichiometry, while model explanations may represent post-hoc rationalizations rather than mechanistic understanding. Crystal structure generators, despite high structural validity, cannot guarantee experimental synthesizability beyond energy criteria, are typically restricted to small unit cells, and provide no synthesis guidance. Fundamentally, all models are constrained by training data and struggle to extrapolate to novel chemistries or underexplored composition spaces.
A comparison between LLM-based and established materials-informatics approaches is given in Table 1. Recent benchmarking efforts comparing LLM-based property predictors against crystal graph convolutional neural networks (CGCNN) show that neither approach universally dominates.30,32 For property prediction tasks, CGCNN outperforms LLM-based models on five out of ten benchmark datasets, particularly those with complex structural features requiring detailed geometric understanding.32 Conversely, LLMs excel on datasets with shorter textual descriptions or composition-focused tasks, where linguistic context provides advantages over graph-based representations. For synthesizability prediction, fine-tuned LLMs such as StructGPT achieve F1 scores comparable to PU-CGCNN methods, with the best performance obtained by combining LLM-derived embeddings with traditional positive-unlabelled learning classifiers.30 These benchmarks highlight that LLMs currently complement rather than replace physics-based and graph-based methods. The primary advantages of LLMs lie in their natural language interfaces, their ability to integrate unstructured literature knowledge, and their explainability through text generation, rather than in superior predictive accuracy on well-defined numerical tasks. Future progress will likely require hybrid architectures that combine the geometric precision of graph neural networks with the contextual reasoning of language models.
| Method | Input repr. | Task | Key metric | Advantages | Limitations |
|---|---|---|---|---|---|
| LLM-based approaches | | | | | |
| MatSciBERT22 | Text (150 k papers) | NER, classification | Macro F1 = 0.86 on Matscholar NER | Open weights and standard GPU use | Encoder-only model with no generation |
| SynthGPT29 | Composition string | Synthesisability prediction; precursor selection | Top-1 accuracy similar to the Elementwise template | Not restricted by the one-precursor-per-metal rule | Coarse metric and no kinetic or reaction-condition information |
| StructGPT30 | Textual crystal structure | Synthesisability prediction | F1 comparable to PU-CGCNN at lower computational cost30 | Sensitive to structural perturbations and provides text-based explanations | May miss defects, disorder, and non-stoichiometry, and explanations may be post hoc30 |
| Gruver et al.31 | CIF-as-string | Crystal structure generation | 99.6% structural validity, with 49.8% of samples predicted to be metastable at Ehull < 0.1 eV per atom | High validity and better performance than the diffusion baseline | Limited to small unit cells and provides no synthesis route |
| Established approaches | | | | | |
| CGCNN32 | Crystal graph (3D structure) | Property prediction | Better than LLMs on 5 of 10 benchmark datasets32 | Geometrically precise with physics-informed message passing | Requires known 3D structures and has no language interface |
| Descriptor-based ML (e.g. Oliynyk property list26) | Compositional descriptors (98 elemental properties) | Structure classification; property prediction | Effective on small datasets (50–1000 points) and experimentally validated26 | Interpretable, works with limited data, and does not require a 3D structure | Requires manual feature engineering and contains no explicit geometric information |
The reliability of LLM predictions increases substantially when they are coupled with first-principles validation. Several frameworks now implement autonomous “propose-and-test” cycles in which LLMs generate hypotheses that are validated through DFT calculations. For instance, the AtomAgents framework employs physics-aware multimodal agents to orchestrate the entire pipeline: generating alloy structures, executing simulations, and integrating deep learning potentials with physics-based validation.33 Similarly, the MatPC framework demonstrates this synergy in practice, using LLMs to semantically screen photovoltaic candidates before rigorous DFT confirmation.34 In the future, going beyond structure generation, LLMs can potentially act as technical assistants for computational workflows, suggesting optimal DFT functionals (e.g., identifying when Hubbard U corrections are necessary for specific transition metals) or selecting appropriate basis sets for heavy f-block elements.
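The propose-and-test pattern reduces to a short loop. Both functions below are placeholder stubs with invented names and values: in a real pipeline the proposer would be an LLM and the validator a DFT workflow (for example, structure relaxation followed by an energy-above-hull check), not the toy lookup shown here.

```python
def propose_candidates(chemical_system):
    # Stub LLM proposer: returns (formula, predicted E_hull in eV/atom) pairs.
    # Values are invented for illustration only.
    return [("BaTiO3", 0.02), ("BaTi2O5", 0.15), ("Ba2TiO4", 0.08)]

def validate(formula, e_hull, cutoff=0.1):
    # Stub DFT filter: accept candidates within `cutoff` eV/atom of the hull,
    # mirroring the metastability criterion used in the generation work above.
    return e_hull < cutoff

accepted = [formula for formula, e_hull in propose_candidates("Ba-Ti-O")
            if validate(formula, e_hull)]
print(accepted)  # ['BaTiO3', 'Ba2TiO4']
```

The value of the coupling is that the cheap, fallible proposer is never trusted on its own: every candidate that reaches the laboratory has passed an independent physics-based check.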
Beyond retrieval, LLM-based agents are also actively used to guide live experiments. While systems like molSimplify have long automated the generation of transition metal complexes (Kulik et al.),37 LLM agents now offer higher-level orchestration. Zheng et al. built a “ChatGPT Research Group” of seven role-specific assistants that communicate with a single chemist, process the literature, write Python code, operate a robotic platform and a programmable microwave system, and guide a Bayesian optimiser.38 The loop searched a space of about six million possible microwave reaction conditions and converged in about 120 experiments to yield highly crystalline metal–organic frameworks such as MOF-321 and MOF-322, with surface areas and pore volumes close to the theoretical values and with high water uptake.38 Recent developments, such as the interconnection of LLMs with an RDF framework, report the simultaneous involvement of up to six GPT-4 agents that process papers, queue high-throughput batches, read spectra, and design scale-ups, all inside one chat window.39 A GPT-4 model in combination with eighteen chemistry tools has likewise been shown to work smoothly in organic chemistry, planning reaction routes and designing new chromophoric materials.15 This combination of tools and chain-of-thought reasoning within the ReAct framework makes these implementations superior to a plain GPT model and likely to be applied in inorganic domains soon.15 Similarly, Jihan Kim and coworkers introduced ChatMOF, an autonomous multi-agent system that leverages specialised tools to predict and generate metal–organic frameworks with high fidelity.40
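The ReAct alternation of a “Reason” step and a tool-calling “Act” step can be mocked in a few lines. The agent logic and the lookup table below are entirely invented stand-ins; in the systems above, the tool would be a crystallographic search, a spectrometer, or a robot rather than a dictionary.

```python
def crystal_lookup(formula):
    # Toy "Act" tool: a pretend database of room-temperature space groups.
    return {"BaTiO3": "P4mm", "SrTiO3": "Pm-3m"}.get(formula, "unknown")

def react_agent(question, max_steps=3):
    # Alternate a natural-language Reason step with a tool-calling Act step,
    # publishing every intermediate thought so a chemist can audit the trace.
    trace = []
    for formula in ["BaTiO3", "SrTiO3"][:max_steps]:
        trace.append(f"Reason: check whether {formula} answers '{question}'")
        observation = crystal_lookup(formula)  # the Act step
        trace.append(f"Act: crystal_lookup({formula}) -> {observation}")
        if observation == "Pm-3m":             # cubic space group found
            trace.append(f"Answer: {formula}")
            return formula, trace
    return None, trace

answer, trace = react_agent("which titanate is cubic at room temperature?")
print(answer)  # SrTiO3
```

The returned trace is the point: because every Reason and Act line is recorded, a chemist can inspect the whole chain of logic before accepting the recommendation, exactly the auditability that distinguishes these agents from a plain chat model.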
This journal is © the Partner Organisations 2026