Open Access Article
Aritra Roy,*ab Enrico Grisan,c John Buckeridge*ab and Chiara Gattinoni*d
aEnergy, Materials and Environment Research Centre, London South Bank University, London SE1 0AA, UK. E-mail: pgr.aritra.roy@lsbu.ac.uk; j.buckeridge@lsbu.ac.uk
bSchool of Engineering and Design, London South Bank University, London SE1 0AA, UK
cBioscience and Bioengineering Research Centre, London South Bank University, London SE1 0AA, UK
dDepartment of Physics, King's College London, London WC2R 2LS, UK. E-mail: chiara.gattinoni@kcl.ac.uk
First published on 25th March 2026
Modern materials discovery using data-driven techniques relies heavily on large, structured databases of material compositions and properties; however, most information on experimentally synthesised materials lies buried within millions of scientific articles. Large language models and agents have made it possible to extract structured knowledge from scientific text, but, despite several approaches designed for this aim, none achieves high accuracy on composition and property extraction (the bare minimum for data-driven methods) in a way that creates machine learning-ready databases without human assistance. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification and visualisation of machine-readable chemical compositions and properties for comprehensive database creation. ComProScanner is a publisher-to-database framework that incorporates publisher APIs, bypassing the need to manually upload papers into the framework, and it is capable of scanning thousands of papers without human intervention. We evaluated our framework on 100 journal articles with 10 different LLMs, including both open-source and proprietary models, extracting highly complex compositions of ceramic piezoelectric materials and their corresponding piezoelectric strain coefficients (d33), motivated by the lack of a large dataset for such materials. DeepSeek-V3-0324 outperformed all other models, with an overall accuracy of 0.82. Even with this small journal sample, the vast majority of the piezoelectric materials we extracted are not included in commonly available databases, and we identified one system with a notably high piezoelectric coefficient. This framework provides a simple, user-friendly, readily usable package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.
Natural language processing (NLP) algorithms have demonstrated remarkable advances in materials science applications, notably in building toolkits and techniques for automated extraction of chemical information from the scientific literature, such as ChemDataExtractor,8,9 ChemicalTagger,10 BatteryBERT,11 and others. These tools and techniques have been implemented to systematically structure the vast corpus of textual knowledge in the field11–20 leveraging various techniques, including regular expressions,21 BiLSTM recurrent neural networks,22 and smaller transformer-based language models like BERT.23 These approaches have successfully facilitated the extraction of entity information from diverse sources, including battery materials literature11,14 and chemical synthesis parameters documented in methodology sections of scientific papers.13 Entity extraction, and in particular named entity recognition (NER), has dominated these research efforts. Researchers have applied domain-specific labels such as “material” or “property” to specific textual elements, but these approaches require an additional post-processing step to construct the relations between the entities, relations that prove essential for training effective machine learning or deep learning models. To exemplify, discrete entities such as “Cu2O” or “2.1 eV” were targeted rather than the relational connections between them (for example, that “2.1 eV” is the measured band gap of “Cu2O”); i.e., these approaches do not implement relation extraction (RE) techniques.
In the early 2020s, several end-to-end methods were developed that use a single machine learning model integrating both named entity recognition and relation extraction (NERRE).24–26 These methodologies demonstrate efficacy in relation extraction tasks; however, they remain fundamentally limited to n-ary relation extraction frameworks that are architecturally complex and struggle to extract all information when the entities are highly interconnected. Following the widespread adoption of various large language models (LLMs), researchers have employed them successfully to extract information from journal articles, replacing traditional sequence-to-sequence approaches with more sophisticated NERRE methods. Approaches ranging from pre-training27 and fine-tuning LLMs28–33 to prompt-engineering,29,32–37 zero-shot29,30,32,33,38 and few-shot prompting,29,33,37,38 as well as Retrieval-Augmented Generation (RAG) methods39,40 have enhanced NERRE-level text extraction from materials science literature. Concurrently, LLM-powered agents have been utilised for various chemistry and materials science tasks, including extracting relevant information from journal articles,41–45 predicting new molecules or materials or their properties,33,41 automating data handling,33,43,45–47 enhancing reasoning and computational capabilities of LLMs,41,43,44,46–48 proposing novel hypotheses,33 and even semi-automating experiments47 by integrating expert tools. Several notable implementations have emerged in this domain, such as Eunomia by Ansari et al.,42 an AI agent chemist for developing materials datasets by accessing computational databases and research papers, and, very recently, the multi-agent system nanoMINER,49 which combines LLMs and multimodal analysis to extract information, though it is specifically limited to nanomaterials.
However, both Eunomia and nanoMINER lack the capability to integrate Text and Data Mining (TDM) API keys† through the package, requiring users to provide the articles in PDF format by manually downloading them, which represents a labour-intensive and time-consuming process when dealing with large-scale datasets. Additionally, enumerating all explicit chemical formulas from variable compositions (e.g., Pb1−xKxNb2O6 where x = 0.1, 0.2 etc.) into distinct compounds remains beyond the scope of these agentic systems. Recently, Wilhelmi et al. published a comprehensive tutorial on using LLMs to extract chemical data as structured output via various methods, including prompting, RAG and agentic systems.50 Nevertheless, an easily configurable automated workflow that enables end users to build, evaluate and visualise datasets through information extraction from journal articles has been lacking.
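To make concrete what enumerating a variable composition involves, the following minimal sketch substitutes each value of x into a formula template. It is not the implementation used by ComProScanner or material-parsers: the function name `expand_composition` and the `{...}` placeholder syntax are assumptions made purely for illustration, and only the operators +, - and * are supported.

```python
import ast
import operator
import re
from decimal import Decimal

# Arithmetic operators allowed inside a placeholder (illustrative subset).
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def _evaluate(node: ast.AST, x: Decimal) -> Decimal:
    """Safely evaluate a small arithmetic expression in the variable x."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, x)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return Decimal(str(node.value))
    if isinstance(node, ast.Name) and node.id == "x":
        return x
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, x)
        right = _evaluate(node.right, x)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression in placeholder")


def expand_composition(template: str, x_values: list[str]) -> list[str]:
    """Return one explicit formula per x value.

    Placeholders such as {1-x} or {2*x} are hypothetical template syntax.
    Decimal arithmetic avoids float artefacts such as 0.30000000000000004.
    """
    formulas = []
    for raw in x_values:
        x = Decimal(raw)

        def substitute(match: re.Match) -> str:
            tree = ast.parse(match.group(1), mode="eval")
            value = _evaluate(tree, x)
            # normalize() drops trailing zeros; "f" avoids exponent notation.
            return format(value.normalize(), "f")

        formulas.append(re.sub(r"\{([^{}]+)\}", substitute, template))
    return formulas


print(expand_composition("Pb{1-x}K{x}Nb2O6", ["0.1", "0.2"]))
```

The hard part in practice is not this substitution step but recognising the template and the x values in free text, which is where the agentic and parsing tools discussed here come in.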
In this work, we present an autonomous multi-agent agile framework, ComProScanner, for end users to extract, evaluate, categorise and visualise machine-readable structured chemical compositions and properties, combined with synthesis information from journal articles to create extensive databases. When a research article contains chemical composition along with the enquired property value either in full article text or tables, the framework extracts structured JSON data51 containing both agent-extracted relevant information and journal article metadata obtained via APIs. The agent-extracted relevant information comprises the chemical composition of the material and the property value as key–value pairs, property unit, material family, synthesis method, precursors used, brief synthesis steps highlighting the key synthesis conditions and steps used and characterisation techniques employed. Our system combines LLM agents with powerful tools, including RAG and a custom deep learning model for extracting chemical compositions and properties only when property values are available in articles. The workflow supports Elsevier, Springer Nature, IOP Publishing and Wiley articles via publishers' TDM APIs or PDFs from local folders. ComProScanner enhances text-mining accuracy by providing flexible contextual parameters to agents while maintaining cost-effectiveness through preliminary article filtering via keyword matching. The system supports multiple configurable LLMs for both extraction agents and RAG implementations. ComProScanner can be implemented with fewer than 20 lines of Python code to extract pre-defined structured data, provided that users have access to the TDM APIs of the publishers. We evaluated the extraction performance of ten LLMs using 100 articles containing piezoelectric coefficient d33 values, achieving overall accuracy exceeding 80% across various models. Detailed evaluation methods and metrics are presented in the Results and discussion sections.
ComProScanner's workflow architecture comprises four distinct operational phases: (a) metadata retrieval, (b) article collection, (c) information extraction and (d) evaluation, post-processing and dataset creation (see Fig. 1). We describe each phase in turn below.
![]() | (1) |
From precision and recall, another metric, F1, can be calculated using eqn (2),
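$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{2}$$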
Beyond standard classification metrics calculated using the aggregate number of items across all evaluation articles, we have developed normalised classification metrics that consider each article as a single evaluative unit, wherein each extracted item within an article contributes a fractional importance to that article's overall evaluation score. These normalised evaluation metrics were specifically designed to ensure an equitable comparison between articles with significant disparities in the quantity of extractable information. The normalised metrics for all papers are calculated using the modified Precision, Recall and F1-score,
![]() | (3) |
![]() | (4) |
![]() | (5) |
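The difference between pooling items across all articles and treating each article as one unit can be sketched as follows. This is only one plausible reading of the description above, not a reproduction of eqn (3)–(5): the per-article scores are averaged with equal weight, the normalised F1 is taken as the harmonic mean of the normalised precision and recall, and the names `ArticleCounts`, `pooled_metrics` and `per_article_metrics` are all assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class ArticleCounts:
    """Correct (tp), spurious (fp) and missed (fn) items for one article."""
    tp: int
    fp: int
    fn: int


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when there is nothing to score.
    return numerator / denominator if denominator else 0.0


def _f1(precision: float, recall: float) -> float:
    return _ratio(2 * precision * recall, precision + recall)


def pooled_metrics(articles: list[ArticleCounts]) -> tuple[float, float, float]:
    """Standard metrics: sum the counts over all articles, then score once."""
    tp = sum(a.tp for a in articles)
    fp = sum(a.fp for a in articles)
    fn = sum(a.fn for a in articles)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return precision, recall, _f1(precision, recall)


def per_article_metrics(articles: list[ArticleCounts]) -> tuple[float, float, float]:
    """Assumed normalisation: score each article, then average the scores."""
    n = len(articles)
    precision = _ratio(sum(_ratio(a.tp, a.tp + a.fp) for a in articles), n)
    recall = _ratio(sum(_ratio(a.tp, a.tp + a.fn) for a in articles), n)
    return precision, recall, _f1(precision, recall)


# Illustrative counts only: one data-rich article and one sparse article.
articles = [ArticleCounts(tp=9, fp=1, fn=0), ArticleCounts(tp=1, fp=1, fn=1)]
print(pooled_metrics(articles))
print(per_article_metrics(articles))
```

With these made-up counts, the pooled scores are dominated by the data-rich article, whereas the per-article average gives the sparse article equal weight, which is the behaviour the normalised metrics are intended to provide.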
Weight-based accuracy metrics, classification metrics and normalised classification metrics all provide the flexibility to use both semantic and agentic approaches for evaluation. In the semantic approach, a semantic similarity method matches the ground truth against the ComProScanner-extracted information, whereas in the agentic approach LLM agents are instructed to perform this matching. Although the evaluation accuracy is expected to be higher for the agentic approach, given that LLM agents can compare two sentences better than an embedding-based similarity measure can, the agentic evaluation can take more time and consume a significantly larger number of tokens if reasoning models are used for better performance.
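The semantic approach can be pictured as a thresholded cosine similarity between two embedding vectors. The sketch below uses small hand-written vectors in place of real model embeddings, and both the function names and the 0.8 threshold are assumptions for illustration rather than values taken from the evaluation code.

```python
import math
from collections.abc import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def is_semantic_match(
    truth_vec: Sequence[float],
    extracted_vec: Sequence[float],
    threshold: float = 0.8,  # assumed cut-off, chosen only for this example
) -> bool:
    """Treat two items as equivalent when their embeddings are close enough."""
    return cosine_similarity(truth_vec, extracted_vec) >= threshold


# Toy vectors standing in for the embeddings of two text snippets.
print(is_semantic_match([1.0, 1.0], [1.0, 0.9]))  # nearly parallel vectors
print(is_semantic_match([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

A fixed threshold is cheap and deterministic, which is the trade-off against the agentic approach: it cannot reason about paraphrases that the embedding model places far apart.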
ComProScanner provides extensive visualisation capabilities of the evaluation through a diverse array of graphical representations, including bar charts, radar plots, heat maps, histograms and violin charts, all readily accessible within the framework. Additionally, the system offers pie charts and histogram plotting functionalities to facilitate the analysis of data distribution across composition families, precursors and characterisation techniques.
Metadata of articles related to piezoelectric materials were collected using piezoelectric, piezoelectricity, pyroelectric, pyroelectricity, ferroelectric and ferroelectricity as the base keywords. After collecting metadata with the base queries alone, combinations of the base queries and 18 additional keywords (such as advancements, applications, ceramics, characterization, composites and crystals) were used to collect a larger set of metadata that could contain potential piezoelectric materials along with their corresponding d33 coefficient values. The complete list of the additional keywords can be found in the SI60 (see the test_example.py script). Although metadata were collected for all articles published between 1 January 2019 and 17 March 2025, only Elsevier papers were considered for the evaluation process; of these, only 3916 papers mentioned d33, accounting for potential differences in formatting.
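The query construction described above amounts to taking the base keywords on their own and then pairing each of them with every additional keyword. The sketch below illustrates this with the six additional keywords named in the text (the full list of 18 is in the SI); the function name `build_queries` and the space-separated query format are assumptions, not the framework's actual query syntax.

```python
from itertools import product

BASE_KEYWORDS = [
    "piezoelectric", "piezoelectricity",
    "pyroelectric", "pyroelectricity",
    "ferroelectric", "ferroelectricity",
]

# Only the six additional keywords named in the text are listed here.
ADDITIONAL_KEYWORDS = [
    "advancements", "applications", "ceramics",
    "characterization", "composites", "crystals",
]


def build_queries(base: list[str], additional: list[str]) -> list[str]:
    """Base-only queries first, then every base/additional combination."""
    queries = list(base)
    queries.extend(f"{b} {a}" for b, a in product(base, additional))
    return queries


queries = build_queries(BASE_KEYWORDS, ADDITIONAL_KEYWORDS)
print(len(queries))   # 6 base queries + 6 x 6 combinations
print(queries[:8])
```

With all 18 additional keywords the same construction would give 6 + 6 × 18 = 114 queries.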
Subsequently, 100 test DOIs were selected, based on the presence of the composition-property data using the RAG agent, whilst randomising the metadata order. For RAG and other NLP tasks, text embedding plays a crucial part in ensuring the efficiency of the models. The PhysBERT61 model has demonstrated superior accuracy compared to various sentence transformer and BERT models in identifying various physics and materials science specific vocabulary. However, to ensure that PhysBERT would perform better than the leading sentence transformer model, all-mpnet-base-v2,62 in our specific domain, the thellert/physbert_cased model from Hugging Face was evaluated against sentence-transformer's all-mpnet-base-v2 model using 12 domain-specific synonyms based on abbreviated forms or chemical formulae and their corresponding full names or trivial names, which are summarised in Table S1 in the SI. The PhysBERT model outperformed all-mpnet-base-v2 in all cases, with remarkable performance differences ranging from highly significant improvements for terms such as DOS (density of states) with a cosine similarity§ difference of 0.8338, to modest improvements for common terms such as PVC (polyvinyl chloride) with a difference of 0.0566. This satisfactory performance of PhysBERT encouraged us to adopt this model as the default embedding model for storing article text data in the ChromaDB vector database for use in the RAG tool illustrated in Fig. 2. For fair evaluation across various models, the RAG environment was maintained consistently, as described in detail in Section S3 of the SI.
Finally, extraction agents were used to extract information for the piezoelectric materials only for the test DOIs. We selected LLMs to enable a comparison between open-source models (Google's Gemma-3-27B-Instruct,63 DeepSeek's DeepSeek-V3-0324,64 Meta's Llama-3.3-70B-Instruct, Llama-4-Maverick-17B-Instruct,65,66 and Alibaba's Qwen3-235B-A22B,67 Qwen-2.5-72B-Instruct68) and proprietary models (Google's Gemini-2.0-Flash,69 Gemini-2.5-Flash-Preview,70 and OpenAI's GPT-4o-mini,71 GPT-4.1-nano72) at similar price points, after analysing the cost-versus-accuracy ratio from the Chatbot Arena LLM Leaderboard,73 where models had Arena scores exceeding 1250 and output costs below $1 per 1 M tokens (for more details, see Section S4 and Fig. S2 in the SI). Additional instructions were passed to the agents for better extraction performance specific to piezoelectric materials and d33 coefficients (see the test_example.py script in the SI60). Temperature and maximum output tokens were set to 0.1 and 2048, respectively, which are the default values for ComProScanner's data extraction function.
The normalised classification metrics (precision, recall and F1-score) of different models, as described in the evaluation, post-processing and dataset creation sub-section above, for both semantic and agentic approaches, are represented as grouped bar charts for the model Llama-3.3-70B-Instruct (the best-performing model when considering only normalised metrics) in Fig. 3. Semantic and agentic comparisons based on normalised metrics for all other models can be found in Section S5 of the SI, along with the associated Fig. S3. We used the PhysBERT model for semantic evaluation, while the Gemini-2.5-Pro reasoning model70 was employed for agentic evaluation. Although normalised classification metrics for both semantic and agentic evaluation show similar trends, the agentic evaluation demonstrates higher accuracy than semantic evaluation, which is understandable given that reasoning models such as Gemini-2.5-Pro possess greater capability to compare sentence structures with equivalent meanings. Given the superior accuracy demonstrated by agentic evaluation, we focus on these results to identify the best-performing models for practical implementation. As mentioned earlier, Llama-3.3-70B-Instruct outperforms all other models in normalised classification metrics with a precision of 0.80, a recall of 0.81 and an F1-score of 0.80 (Fig. 3).
The confusion matrix (Fig. 4) reveals distinct performance patterns across the evaluated models for piezoelectric materials extraction, taking into account all performance metrics. DeepSeek-V3-0324 emerged as the top-performing model for data extraction, demonstrating consistently high scores across all metrics, with particularly strong performance in composition accuracy (0.90), precision (0.84), recall (0.83) and F1-score (0.84). This model showed balanced performance with an overall accuracy of 0.82 and robust synthesis accuracy of 0.75. The Qwen model family demonstrated competitive performance, with both Qwen3-235B-A22B and Qwen-2.5-72B-Instruct achieving comparable results. Notably, both models excelled in composition accuracy (0.89–0.90) and maintained consistent performance across precision, recall and F1-score metrics (0.79–0.85). Llama-3.3-70B-Instruct showed strong overall performance with an accuracy of 0.76 and high composition accuracy (0.87). Google's Gemini models presented mixed results. While Gemini-2.0-Flash achieved moderate performance with balanced metrics, Gemini-2.5-Flash-Preview unexpectedly underperformed compared to its predecessor, showing lower scores across most metrics (0.61–0.71). Llama-4-Maverick-17B-Instruct demonstrated notable strengths in specific areas despite its overall lower performance, achieving commendable composition accuracy (0.83) and precision (0.78). However, the model struggled significantly with synthesis accuracy (0.55) and normalised precision (0.63). The most concerning performance was observed with GPT-4.1-nano, which consistently scored lowest across all metrics, particularly struggling with normalised precision (0.46). Similarly, Gemma-3-27B-Instruct showed suboptimal performance, with notable weaknesses in synthesis accuracy and normalised precision. We have also taken into consideration the incorrect and hallucinated information extracted by the best-performing model, i.e., DeepSeek-V3-0324.
Although incorrect extractions occasionally occur, the total number of hallucinated extractions is very low (see Table S2 in the SI). Interestingly, although the agents were instructed to extract only compositions specific to the work reported in each article, compositions mentioned in that article as part of its literature survey were occasionally also extracted. However, such cases can be addressed through robust prompt engineering for each specific task.
Furthermore, to compare ComProScanner's variable parsing ability with the original material-parsers tool developed by Foppiano et al.,20 we tested several examples from the test dataset by processing them directly through material-parsers, with results summarised in Table 1. Whilst ComProScanner outperformed material-parsers in most cases (first three examples), both tools successfully resolved the chemical formulae for relatively straightforward compositions (fourth example) and both occasionally failed, as demonstrated in the fifth example. Throughout the entire test set, ComProScanner demonstrated superior performance in most instances and equivalent performance in others when compared to material-parsers, thereby validating ComProScanner's capabilities. In addition, we compared our framework with similar existing frameworks, Eunomia42 and the extraction agent by CMEG-IITR.48 ComProScanner outperformed both frameworks in all metrics by substantial margins. More details about the comparison can be found in Section S6 of the SI.
| DOI | Item | Details |
|---|---|---|
| 10.1016/j.jallcom.2024.176609 | Text | The 0.12Pb(Ni1/3Ta2/3)O3−xPbZrO3−(0.88−x)PbTiO3 piezoelectric ceramics with 2 mol% MnO2 (abbreviated as PNT-xPZ-PT-Mn, x = 0.41, 0.42, 0.43, 0.44) were fabricated by the conventional solid-state reaction method |
| | Material-parsers | 1. 0.12Pb |
| | | 2. Ni1/3Ta2/3)O2.59PbZrO3-(0.87.59)PbTiO3 |
| | | 3. Ni1/3Ta2/3)O2.58PbZrO3-(0.87.58)PbTiO3 |
| | | 4. Ni1/3Ta2/3)O2.57PbZrO3-(0.87.57)PbTiO3 |
| | | 5. Ni1/3Ta2/3)O2.56PbZrO3-(0.87.56)PbTiO3 |
| | ComProScanner | 1. 0.12Pb(Ni1/3Ta2/3)O3-0.41PbZrO3-0.47PbTiO3 + 2% MnO2 |
| | | 2. 0.12Pb(Ni1/3Ta2/3)O3-0.42PbZrO3-0.46PbTiO3 + 2% MnO2 |
| | | 3. 0.12Pb(Ni1/3Ta2/3)O3-0.43PbZrO3-0.45PbTiO3 + 2% MnO2 |
| | | 4. 0.12Pb(Ni1/3Ta2/3)O3-0.44PbZrO3-0.44PbTiO3 + 2% MnO2 |
| 10.1016/j.jeurceramsoc.2025.117193 | Text | In this study, dense Pb(1−x)K2x[Nb0.96Ta0.04]2O6 (PKxNT, x = 0.05, 0.10, 0.15, 0.20) ceramics were prepared via the solid-state reaction method |
| | Material-parsers | 1. In |
| | | 2. Pb(0.95)K20.05[Nb0.96Ta0.04]2O6 |
| | | 3. Pb(0.9)K20.10[Nb0.96Ta0.04]2O6 |
| | | 4. Pb(0.85)K20.15[Nb0.96Ta0.04]2O6 |
| | | 5. Pb(0.8)K20.20[Nb0.96Ta0.04]2O6 |
| | ComProScanner | 1. Pb0.95K0.1[Nb0.96Ta0.04]2O6 |
| | | 2. Pb0.9K0.2[Nb0.96Ta0.04]2O6 |
| | | 3. Pb0.85K0.3[Nb0.96Ta0.04]2O6 |
| | | 4. Pb0.8K0.4[Nb0.96Ta0.04]2O6 |
| 10.1016/j.ceramint.2024.09.282 | Text | BaCO3 (99.8%, Aladdin), TiO2 (99.0%, McLean, Shanghai, China), SnO2 (99.9%, Aladdin), CaCO3 (99.0%, Sinopharm), Bi2O3 (99.9%, McLean), Fe2O3 (99.0%, Sinopharm) are used as raw materials, which were accurately weighed according to a composition of (1 − x) (Ba0.95Ca0.05) (Ti0.89Sn0.11)O3−xBiFeO3 (BCTSO-xBFO, x = 0, 0.1, 0.5, 0.9 mol%) and milled with ethanol for 16 h |
| | Material-parsers | 1. BaCO3 |
| | | 2. TiO2 |
| | | 3. SnO2 (99.9%, Aladdin |
| | | 4. CaCO3 (99.0%, Sinopharm |
| | | 5. Bi2O3 (99.9%, McLean), Fe2O3 |
| | | 6. (1.0) (Ba0.95Ca0.05) (Ti0.89Sn0.11)O3.0BiFeO3 |
| | | 7. (0.9) (Ba0.95Ca0.05) (Ti0.89Sn0.11)O2.9BiFeO3 |
| | | 8. (0.5) (Ba0.95Ca0.05) (Ti0.89Sn0.11)O2.5BiFeO3 |
| | | 9. (1.0) (Ba0.95Ca0.05) (Ti0.89Sn0.11)O3.0BiFeO3 |
| | | 10. (0.9) (Ba0.95Ca0.05) (Ti0.89Sn0.11)O2.9BiFeO3 |
| | | 11. (0.5) (Ba0.95Ca0.05) (Ti0.89Sn0.11)O2.5BiFeO3 |
| | | 12. (0.1) (Ba0.95Ca0.05) (Ti0.89Sn0.11)O2.1BiFeO3 |
| | | 13. (−15.0) (Ba0.95Ca0.05) (Ti0.89Sn0.11)O-13.0BiFeO3 |
| | ComProScanner | 1. (Ba0.95Ca0.05)(Ti0.89Sn0.11)O3 |
| | | 2. (Ba0.95Ca0.05)(Ti0.89Sn0.11)O3-(0.1)BiFeO3 |
| | | 3. (Ba0.95Ca0.05)(Ti0.89Sn0.11)O3-(0.5)BiFeO3 |
| | | 4. (Ba0.95Ca0.05)(Ti0.89Sn0.11)O3-(0.9)BiFeO3 |
| 10.1016/j.ceramint.2024.10.314 | Text | Lead-free piezoelectric ceramics with the formula Ba1−xSrxTi0.92Zr0.08O3 [x = 0, 0.04, 0.08, 0.12, 0.16, 0.20 (mol)] were prepared using the solid-state reaction technique |
| | Material-parsers & ComProScanner | 1. Ba1.0Sr0Ti0.92Zr0.08O3 |
| | | 2. Ba0.96Sr0.04Ti0.92Zr0.08O3 |
| | | 3. Ba0.92Sr0.08Ti0.92Zr0.08O3 |
| | | 4. Ba0.88Sr0.12Ti0.92Zr0.08O3 |
| | | 5. Ba0.84Sr0.16Ti0.92Zr0.08O3 |
| | | 6. Ba0.80Sr0.20Ti0.92Zr0.08O3 |
| 10.1016/j.jeurceramsoc.2024.117065 | Text | Pure CaBi2Nb2O9 and rare-earth thulium-substituted CaBi2Nb2O9 powders with nominal compositions of Ca1−xTmxBi2Nb2O9 (CBN-100xTm) were prepared through a solid-phase reaction method. To characterize the phase transition in detail, a composition range of x = 0.01–0.05 was selected |
| | Material-parsers | 1. CaBi2Nb2O9 |
| | | 2. CaBi2Nb2O9 |
| | | 3. Ca1−xTmxBi2Nb2O9 |
| | ComProScanner | 1. CaBi2Nb2O9-1Tm |
| | | 2. CaBi2Nb2O9-2Tm |
| | | 3. CaBi2Nb2O9-3Tm |
| | | 4. CaBi2Nb2O9-4Tm |
| | | 5. CaBi2Nb2O9-5Tm |
| | Actual resolved compositions | 1. Ca0.99Tm0.01Bi2Nb2O9 |
| | | 2. Ca0.98Tm0.02Bi2Nb2O9 |
| | | 3. Ca0.97Tm0.03Bi2Nb2O9 |
| | | 4. Ca0.96Tm0.04Bi2Nb2O9 |
| | | 5. Ca0.95Tm0.05Bi2Nb2O9 |
ComProScanner also offers built-in data distribution visualisation functions to represent various material families, synthesis precursors and characterisation techniques as either histograms or pie charts through a semantic clustering mechanism. Fig. 5 shows these data distributions, where a similarity threshold of 0.8 (the default in ComProScanner) was applied for material families and precursors, while 0.78 was found to work best for characterisation techniques during semantic clustering. The resulting distributions reveal the prevalence of different components in piezoelectric materials research across the 100 evaluated articles. In terms of material families, BaTiO3 dominates at 39.0%, followed by KNN (16.0%) and PZT (14.0%), with various other compositions including CaBi2Nb2O9 (9.0%) and BNT-based materials (3.0%) comprising the remainder. For synthesis precursors, Bi2O3 is most frequently used (18.9%), followed by Na2CO3 (13.3%) and TiO2 (10.2%), with a diverse range of other precursors including various carbonates, oxides and acids distributed across smaller percentages. The characterisation techniques show XRD as the predominant method (33.1%), which is expected for crystalline phase analysis, followed by impedance analysers (21.9%) for electrical property measurements and ferroelectric test systems (8.3%) for specific piezoelectric characterisation, with various other analytical techniques contributing to comprehensive materials evaluation.
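Threshold-based clustering of this kind can be sketched as a greedy pass in which each label joins the first cluster whose representative it resembles closely enough. The example below is not ComProScanner's clustering code: the function names are hypothetical, and a case-insensitive string similarity from Python's `difflib` stands in for the embedding-based semantic similarity, so it would not group an abbreviation with its full name the way an embedding model can.

```python
from collections.abc import Callable
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real pipeline would compare embeddings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def threshold_cluster(
    items: list[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.8,
) -> list[list[str]]:
    """Greedy clustering: compare each item with the first member of each cluster."""
    clusters: list[list[str]] = []
    for item in items:
        for cluster in clusters:
            if similarity(cluster[0], item) >= threshold:
                cluster.append(item)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([item])
    return clusters


def cluster_shares(clusters: list[list[str]]) -> dict[str, float]:
    """Percentage of all items per cluster, keyed by the cluster's first member."""
    total = sum(len(c) for c in clusters)
    return {c[0]: round(100 * len(c) / total, 1) for c in clusters}


labels = ["XRD", "xrd", "SEM"]  # illustrative labels, not extracted data
clusters = threshold_cluster(labels, string_similarity)
print(clusters)
print(cluster_shares(clusters))
```

Lowering the threshold merges more labels into each cluster, which is why a value slightly below the default can suit label sets with more varied wording.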
To visualise the relationships between the distribution of all data types mentioned in the Methods section for the evaluated dataset, a custom schema-based relationship graph has been constructed and visualised using the neo4j74 library within the ComProScanner package (Fig. 6). The schema defines hierarchical relationships between extracted compositional data, material properties, synthesis parameters and metadata. The neo4j relationship graph produced from the 100 test articles contains a total of 1825 graph nodes, which are summarised in Table S2 of the SI. Cypher75 queries can be utilised to retrieve relational information for specific nodes; for example, the inset of Fig. 6 shows 10 random items among the 79 compositions associated with the BaTiO3 family across the 100 test articles. Detailed information about all nodes associated with the test data can be found in Table S3 in the SI.
For the piezoelectric materials considered, balanced performance in each metric with an overall accuracy of 0.82 indicates that DeepSeek-V3-0324 possesses the most reliable extraction capabilities for complex piezoelectric material data. The consistency in various metrics for both Qwen models suggests these models are also well-suited for systematic materials data extraction tasks. Llama-3.3-70B-Instruct's results make it particularly valuable for applications requiring high precision in materials identification. However, its synthesis accuracy (0.65) is relatively lower, indicating potential challenges in extracting complex synthesis information. The counter-intuitive results from the two Gemini models suggest that model updates do not always guarantee improved performance for domain-specific tasks. The results for Llama-4-Maverick-17B-Instruct suggest it may be more suitable for composition-focused extraction tasks than for comprehensive materials informatics applications. The poor performance of GPT-4.1-nano and Gemma-3-27B-Instruct highlights the importance of model selection for materials informatics applications, where domain-specific performance can vary significantly from general language tasks. The comparison between the original material-parsers tool and ComProScanner demonstrates that our package performs significantly more effectively than material-parsers when resolving complex chemical compositions containing variables.
Though LLM agents attempt to ensure consistent results across runs, the underlying LLMs are nondeterministic by nature, which forms the core limitation of any type of LLM-based approach; consequently, results may vary slightly between runs. For the Materials Data identifier agent, the RAG question, chunk size, chunk overlap, top k value and RAG chat model may require adjustment and testing according to the specific use case. Although manual evaluation would serve as the optimal evaluation technique compared to semantic and agentic approaches, it is not practical for large dataset evaluation. With this consideration, semantic and agentic approaches are incorporated into the framework and depending on the chosen reasoning model, evaluation results can vary slightly.
ComProScanner establishes the essential foundation for the next generation of AI in materials science, creating a pathway to develop extensive text-mined datasets from journal articles. The framework we have developed enables a seamless, user-friendly automated data extraction pipeline. In the future, OCR technology or VLMs could be integrated with the framework to extract information from graphs or other image formats. Additionally, flexibility to modify the structure of the extracted JSON data could be incorporated into the framework to extract multiple material properties.
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d5dd00521c.
Footnotes
† TDM agreements differ from standard academic subscriptions granted to institutional libraries, as they specifically govern the scraping and downloading of large volumes of content, which could potentially impact the operational performance of publishers' servers.
‡ This new article metadata is collected for each specific article containing agent-extracted information, differing from the previously collected metadata that contained limited information for all related articles associated with the property keyword used for metadata collection. This new comprehensive metadata includes a wide range of information: DOI, article title, journal name, year of publication, open access information, author list with their institutional details, and article keywords. These data are obtained either via Elsevier's ScienceDirect Article Metadata API56 (optional) or the Open Access Button's free metadata API57 developed by OA.Works.58
§ Cosine similarity measures how similar two text embeddings (vectors) are by calculating the cosine of the angle between them. This provides a score between −1 (completely dissimilar) and 1 (identical), which is useful for tasks such as text or document clustering.
This journal is © The Royal Society of Chemistry 2026