Max Dreger,*a Kourosh Malek,ab and Michael Eikerling,abc
a Theory and Computation of Energy Materials (IET-3), Institute of Energy Technologies, Forschungszentrum Jülich GmbH, 52425 Jülich, Germany. E-mail: m.dreger@fz-juelich.de
b Centre for Advanced Simulation and Analytics (CASA), Simulation and Data Lab for Energy Materials, Forschungszentrum Jülich, 52425 Jülich, Germany
c Chair of Theory and Computation of Energy Materials, Faculty of Georesources and Materials Engineering, RWTH Aachen University, 52062 Aachen, Germany
First published on 7th April 2025
Research in materials science increasingly harnesses machine learning (ML) models. These models are trained with experimental or theoretical data, and the quality of their output hinges on the quantity and quality of those data. Improving data quality and accessibility necessitates advanced data management solutions. Today, data are often stored in non-standardized table formats that lack interoperability, accessibility, and reusability. To address this issue, we present a semi-automated data ingestion pipeline that transforms R&D tables into knowledge graphs. Utilizing large language models and rule-based feedback loops, our pipeline transforms tabular data into graph structures. The proposed process consists of entity recognition and relationship extraction. By streamlining data integration from various sources, it facilitates better data interoperability and accessibility. The pipeline is integrated into a platform harboring a graph database as well as semantic search capabilities.
The quality and availability of data are crucial in this realm. Data generation in materials science is frequently tied to short-term projects and is considered time-consuming and costly, leading to the creation of many small and scattered datasets.16,17 Data management in research labs typically relies on relational databases or file systems, predominantly filled with tables. These tables are rich data assets; however, information on how data points across different columns are interconnected is often only implicitly provided. The lack of data management standards therefore leaves valuable data scattered across silos, with very limited accessibility and interoperability among labs or institutions.18–20
Several recent papers have emphasized the need for openly accessible databases; however, data heterogeneity due to varying length scales and structural complexity of materials presents intricate technical challenges.21,22 Consequently, databases are often limited to single length scales, highly domain-specific, and focused on chemical elements and compound properties that do not depend on microstructure.23–30 Thus, the materials science community is experiencing a trend toward information silos separated by domain, design, or exploration space, hindering the full potential of AI methods.31 This separation complicates finding answers to generic research questions. Filtering a corpus of materials for desired properties or identifying processing conditions associated with desired materials properties often requires consulting domain experts or scientific literature.32
Knowledge graphs are promising data structures for addressing these challenges. They consist of nodes and relationships forming a network of connected entities.33 These relationships make information and contextualized data machine-readable, facilitating the integration of tools to analyze, organize, and share information. Furthermore, graphs excel at representing highly heterogeneous data due to their focus on connectivity, which provides a high degree of flexibility.34,35 Materials science increasingly uses knowledge graphs to integrate and organize data from literature, databases, and ontologies.35–37 To elevate them to viable data management solutions at lab scale and beyond, these tools need to be broadly appealing to the materials science community.
Attractive data management solutions provide intuitive mechanisms for data storage and retrieval, streamlining the process for data owners by removing unnecessary complexity. Moreover, they ought to lower the usage barrier, ensuring smooth integration into the user's routine data practices with minimal interruption.17 The capacity to mine existing data, derive insights, and integrate them into one's own data assets is a crucial feature of data management systems. Broad adoption of graph-based data management solutions therefore requires tools that facilitate the migration from existing tabular data assets to knowledge graphs. Knowledge extraction is an emerging field that aims to extract information from structured and unstructured sources38 and is increasingly applied in the materials science domain.39,40 Recent advances in natural language processing through the availability of large language models (LLMs) provide novel, disruptive tools in this field.41,42 LLMs excel at inferring the context and meaning of unseen data without the need for expensive training. This eases the implementation of LLM-enabled knowledge extraction tools, making them attractive for data management solutions.
In a recent publication, we proposed a data model for graph databases43 that follows the logic of the Elementary Multiperspective Material Ontology (EMMO).44 An ontology is a structured framework that defines and categorizes the concepts, entities, and relationships within a specific domain. The proposed data model can represent experimental workflows in materials science with any desired degree of granularity. Entities within the database are labeled via a semantically connected system of nodes that span a wide range of processes, matter, and quantities (see Fig. 1). These labels are based on the EMMO and BattINFO, a domain-specific EMMO extension focused on batteries and their characterization.45 The database aims to help research groups manage their data assets in an intuitive way while making them interoperable with other data vendors.
In this study, we introduce a knowledge graph extraction pipeline to improve the efficiency of populating graph databases with existing table data. The pipeline semi-automatically transforms tables into connected knowledge graphs that follow the data model depicted in Fig. 1. The extraction process utilizes LLMs to infer meaning from headers and extract information from tables. We divided the process into four stages, each of which can be verified by the user through a graphical user interface, ensuring a high-quality knowledge graph. To enhance cost efficiency and scalability, we integrated various caching strategies that streamline the extraction process for known tables.
Alternative solutions, such as Microsoft's GraphRag46 and manual data transformation approaches, often require expertise in database query languages or extensive iterative prompt engineering. While these methods are viable, they tend to be labor-intensive and face scalability challenges due to their technical complexity. Enterprise cloud data management platforms like Databricks,47 Google Cloud,48 and Splunk,49 in contrast, are designed to integrate seamlessly with standard workflows, providing streamlined and robust data operations. However, their architectures are generally optimized for more homogeneous data, making it difficult for them to natively handle the high complexity and heterogeneity inherent in scientific data. Effective data management in the scientific domain should therefore integrate seamlessly into existing data handling routines: it should not impose significant technical overhead or require specific expertise from researchers, while still accommodating the full complexity and interconnected nature of their data.
In the following, we thoroughly discuss the methodologies and metrics of the extraction procedure and its results. This article is relevant to those interested in using our data management system or engaging in knowledge extraction in different scientific domains.
The entities extracted through this process are then used to infer relationships, building up a knowledge graph. The following sections provide an overview of the data types encountered in this study, introduce the input and output, and delineate the specifics of the extraction pipeline.
Drymilltime (h) | Drying T (°C) | Catalyst | Ionomer | Equiv. weight | I/C |
---|---|---|---|---|---|
6 | 55 | F50E-HT | Aquivion | 790 | 0.7 |
24 | 55 | F50E-HT | Aquivion | 790 | 0.7 |
48 | 55 | F50E-HT | Aquivion | 790 | 0.7 |
6 | 55 | F50E-HT | Aquivion | 790 | 0.9 |
24 | 55 | F50E-HT | Aquivion | 790 | 0.9 |
48 | 55 | F50E-HT | Aquivion | 790 | 0.9 |
6 | 55 | F50E-HT | Aquivion | 790 | 1.1 |
24 | 55 | F50E-HT | Aquivion | 790 | 1.1 |
48 | 55 | F50E-HT | Aquivion | 790 | 1.1 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
In materials science, these tables typically describe entities such as materials, components, devices, chemicals, properties, measurements, manufacturing steps, and processing conditions. The goal is to extract these entities and contextualize them by inferring relationships among them.
For example, the excerpt in Table 1 contains a column titled “Ionomer” immediately followed by a column labeled “Equiv. weight” that holds numerical values. To retrieve meaningful information, the pipeline must extract a matter node representing the ionomer “Aquivion” and connect it to a property node labeled “Equivalent Weight”, which carries the numerical value of 790 and the unit g mol⁻¹. Given the significant variations in table structures and header terminologies, our pipeline is designed to be agnostic to both structure and terminology, enabling it to process a wide range of tables without prior knowledge of their layout. The pipeline leverages LLMs to extract data within the domain of materials science, with a particular focus on energy materials.
We use the OpenAI embedding model, as it has shown high accuracy in benchmarks, especially on classification tasks. To improve the accuracy of this step, the table header of each validated classification is transformed into an embedding and added to the pool of node examples. Candidate selection uses cosine similarity to find the most similar vector in the pool of examples.
The process is encapsulated in a classifier Python class that iterates over all table headers, creates an embedding from each table header and its sample cell, identifies the best match from the pool of examples, and returns a dictionary mapping each header to its node type. As the node type assignment of each header is an isolated task, it can easily be parallelized.
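To make this concrete, the following minimal sketch illustrates the classification logic. The class name, the `embed` callable, and the structure of the example pool are illustrative assumptions rather than the actual implementation; the OpenAI embedding endpoint would stand behind `embed`.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class HeaderClassifier:
    """Assigns a node type to each table header via embedding similarity."""

    def __init__(self, example_pool: dict[str, list[np.ndarray]], embed):
        # example_pool maps node types (e.g., "matter", "parameter")
        # to embeddings of validated header examples.
        self.example_pool = example_pool
        self.embed = embed  # callable: str -> np.ndarray

    def classify(self, headers: list[str], sample_row: list[str]) -> dict[str, str]:
        result = {}
        for header, cell in zip(headers, sample_row):
            # Header and sample cell are embedded together to resolve
            # ambiguous headers such as "Method".
            query = self.embed(f"{header}: {cell}")
            best_type, best_sim = None, -1.0
            for node_type, examples in self.example_pool.items():
                for example in examples:
                    sim = cosine(query, example)
                    if sim > best_sim:
                        best_type, best_sim = node_type, sim
            result[header] = best_type
        return result
```

Because each header is classified independently, the loop body can be dispatched to a thread or process pool without further coordination.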
Fig. 2 presents an example of the node type extraction. Assigning the correct node types to the columns that contain matter nodes is straightforward, as their headers are semantically very close to matter and can therefore easily be matched to an example of the node type matter. A challenge can arise with the header “Method”, which has a high semantic overlap with both measurement and manufacturing and could therefore be attributed to either node type. Since manufacturing steps and measurements each have their own methods (e.g., “mixing”, “imaging”), both node types show a high semantic overlap with the word “Method” alone. For that reason, the header and sample cell are transformed into a joint embedding. The word “mixing” is less ambiguous, and the embedding of header and sample cell can therefore successfully be assigned to the node type manufacturing.
In Fig. 2, we illustrate the attribute extraction process. In the example, columns containing identifier attributes contain “ID” in their headers, while those listing material names reference well-established materials in their cells and headers. Columns with property and parameter values display numerical entries that can be directly associated with the value attribute. Because the node type of each column is determined in the first step, the range of possible attributes is constrained. For example, matter, manufacturing, and measurement nodes include an identifier attribute, whereas parameter and property nodes do not. Thus, knowledge of the node types, along with unambiguous headers and cell contents, makes the assignment of each column to the correct attribute straightforward.
The agent is given the following information:
(1) Introduction and general task (system message).
(2) Explanation of the expected output (schema).
(3) Context provided by the user (user input).
(4) Examples (few-shot).
Each agent is initiated with a fixed system message explaining the task and providing general input and guidelines. Additionally, we employ few-shot learning by generating an artificial conversation history in which the agent was given a table and produced the correct nodes. These examples serve as valuable guidelines on how to aggregate nodes correctly and which format to follow. The actual prompt is automatically generated and contains the table headers, a sample row, and a string of additional context provided by the user. Fig. 3 represents the general structure of an LLM-enabled pipeline containing extractor, validator, and corrector classes. Node aggregation and relationship extraction are implemented following this general structure and vary only in inputs and static components.
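The feedback loop formed by these three classes can be sketched as follows; the class and method names are illustrative placeholders for the extractor, validator, and corrector components described above, not the actual implementation.

```python
class ExtractionPipeline:
    """Sketch of the extractor-validator-corrector feedback loop."""

    def __init__(self, extractor, validator, corrector, max_retries: int = 2):
        self.extractor = extractor    # LLM agent: table excerpt -> candidate output
        self.validator = validator    # rule-based sanity checks
        self.corrector = corrector    # LLM agent: repairs reported errors
        self.max_retries = max_retries

    def run(self, headers, sample_row, user_context):
        output = self.extractor.extract(headers, sample_row, user_context)
        for _ in range(self.max_retries):
            errors = self.validator.check(output)  # e.g., a missing name attribute
            if not errors:
                break
            # Feed the rule-based error report back to the corrector agent.
            output = self.corrector.correct(output, errors)
        return output
```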
To clarify the node aggregation procedure, we will follow the aggregation of all matter nodes in the example of Fig. 2. The agent of the MatterNodeAggregator receives a prompt containing the table, user context, and instructions to transform columns 1, 2, 3, and 6 into nodes. The agent should then correctly assume that the table provides information about the fabrication of a catalyst ink, and therefore propose the creation of three matter nodes per row: one for the ionomer, one for the catalyst, and one for the catalyst ink. The catalyst name and identifier must be extracted from columns 1 and 2. The matter node representing the ionomer has a name that is taken from column 3. The node representing the catalyst ink has an identifier in column 6. Since no column provides a name for the catalyst ink and it is a crucial attribute, the agent needs to infer it (for example, “catalyst ink”) from the header of column 6. Each row of the table can thus be transformed into three matter nodes, with attributes varying by row, except for the name of the catalyst ink node, which is inferred from the table headers. If the MatterNodeAggregator does not infer the name “catalyst ink” for the matter node representing the catalyst ink, the MatterNodeValidator would detect this during its sanity check. In that case, the MatterNodeCorrector would be invoked to correct the error by inferring the missing name attribute.
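For a single row of the example table, the aggregated matter nodes could take a form like the following. The identifiers and field names are illustrative; only the catalyst ink's name is inferred rather than read from a cell.

```python
# Hypothetical output of the MatterNodeAggregator for one row:
matter_nodes = [
    {"type": "matter", "name": "F50E-HT", "identifier": "CAT-01"},       # catalyst (columns 1 and 2)
    {"type": "matter", "name": "Aquivion"},                              # ionomer (column 3)
    {"type": "matter", "name": "catalyst ink", "identifier": "INK-07"},  # name inferred; identifier from column 6
]
```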
The relationship extraction in Fig. 2 can be illustrated using the example of the HasPropertyRelationshipExtractor. This class extracts the HAS_PROPERTY relationships that connect matter and property nodes. In this example, the extractor receives a list of all matter nodes (e.g., the ionomer, the catalyst, and the catalyst ink) as well as a single property node (e.g., the I/C ratio). The agent must then decide to which matter node the property node belongs. This can be inferred either by recognizing that “I/C” stands for “ionomer to catalyst ratio”, indicating that it is a property of the catalyst ink, or by analyzing the table structure, since the property node's column is directly adjacent to the catalyst ink node's column.
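The resulting decision could be serialized as a single relationship record, for instance (field names and the value are illustrative):

```python
relationship = {
    "type": "HAS_PROPERTY",
    "source": {"type": "matter", "name": "catalyst ink"},
    "target": {"type": "property", "name": "I/C", "value": 0.7},
}
```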
(1) No match found: if none of the candidates is a suitable label node, or a child or parent class of the unlabeled node, the agent is given all possible labels of the given node type. If the unlabeled node is a matter node, this means all label nodes that are children of the label node named “matter” are forwarded to the agent. Among them, the agent chooses a label node that represents a parent label of the unlabeled node. Then, the agent is asked to suggest a new label and additional child labels—if necessary—to seamlessly extend that branch. The output of the agent is used to create new label nodes within the graph database. They are used to label the unlabeled node and extend the existing taxonomy.
(2) Adequate match found: if one of the candidates represents an adequate label for the node, the task ends and the node gets assigned the label chosen by the agent.
(3) Subclass/parent class found: if one of the candidates is a parent or child label of the unlabeled node, the agent is given all parent or child labels of that candidate. The agent then has to identify the semantically closest label node and generate a new label node that adequately represents the extracted node. To improve the quality of the extension, the agent can suggest additional labels to smooth the branching. The output of the agent is used to create new label nodes and connect them to the existing taxonomy. The unlabeled node is then stored within the graph database and linked to the newly generated label node.
The complete procedure is depicted in Fig. 4. As the correct extension of the labeling system is crucial to make the data retrievable and interoperable, all newly added label nodes are flagged for curation by the database admins.
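The three cases can be condensed into the following sketch; the agent and taxonomy helpers are hypothetical stand-ins for the components described above.

```python
def resolve_label(node, candidates, agent, taxonomy):
    """Assign a label node to an extracted node, extending the taxonomy
    when no adequate label exists (all helper calls are hypothetical)."""
    verdict = agent.judge(node, candidates)  # "match", "related", or "none"
    if verdict == "match":
        # Case (2): an adequate label exists; the task ends here.
        return agent.pick(node, candidates)
    if verdict == "related":
        # Case (3): a candidate is a parent/child label; inspect its relatives.
        anchor = agent.pick(node, candidates)
        new_labels = agent.propose_labels(node, taxonomy.relatives(anchor))
    else:
        # Case (1): no match; offer the whole branch of the node's type.
        new_labels = agent.propose_labels(node, taxonomy.branch(node.node_type))
    taxonomy.add(new_labels, flag_for_curation=True)  # admins curate new labels
    return new_labels[-1]
```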
Two ways of caching have been implemented:
Single-column cache: single-column caches store the headers of individual columns together with the assigned node types and attributes. Columns whose headers are already known can be served from the cache, allowing the first two steps of the transformation pipeline to be skipped. Additionally, table headers with correctly extracted labels and attributes are transformed into embeddings and added to the pool of examples for the labels and attributes they represent. These additional embeddings facilitate the type and attribute extraction of headers with similar wording and therefore help boost the accuracy of the first two steps of the pipeline.
Table cache: as researchers or self-driving labs often generate the same table structures, caching full graph extractions is crucial for the scalability of the pipeline. We implemented look-up tables that store the table headers and the resulting graph of each validated extraction. If an already cached table needs to be transformed into a graph for ingestion into the database, the cache is activated and the correct graph can be requested directly from the look-up table.
All look-up tables reside in an SQL database and are accessible via the django-admin user interface. Each transformation procedure generates one single-column cache entry per table column and one table cache entry for the full table. These new cache entries are first validated by the user through the GUI and can then be checked and approved by the database administrator through the django-admin interface, after which the columns and tables are fully cached.
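Conceptually, the table cache reduces to a validated look-up keyed on the tuple of table headers. A minimal in-memory sketch follows; the actual implementation stores these entries in the SQL database behind django-admin.

```python
class TableCache:
    """In-memory sketch of the validated full-table cache."""

    def __init__(self):
        self._store: dict[tuple[str, ...], dict] = {}

    def lookup(self, headers: list[str]):
        # Only validated entries may short-circuit the pipeline.
        entry = self._store.get(tuple(headers))
        return entry["graph"] if entry and entry["validated"] else None

    def add(self, headers: list[str], graph: dict):
        # New entries start unvalidated and await review.
        self._store[tuple(headers)] = {"graph": graph, "validated": False}

    def validate(self, headers: list[str]):
        self._store[tuple(headers)]["validated"] = True
```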
To set up a robust pipeline, it is necessary to optimize each step. The first two steps are classification tasks that assign a node type and node attribute to each column. Optimizing node type and attribute extraction (see Sections 2.3.1 and 2.3.2) requires optimization of:
• Examples: each column is assigned to a node type and node attribute by a similarity comparison to a pool of examples (e.g., examples for the node type Parameter might be: “Operating Condition”, “Process Parameter”, “Heating Speed”).
• Input: depending on the format and structure of the examples, the input for the classification can be optimized (e.g., Heading:Sample_Cell, Heading, “Key”: Heading, Value: “Sample Row” are different ways to generate input for a similarity comparison and will lead to different results).
• Matching: the logic of how the correct node type/attribute is chosen can be varied (e.g., a naive approach is to select the example with the highest similarity).
The subsequent steps, which extract nodes and relationships, depend mainly on prompt engineering. Accuracy was improved by iteratively tweaking the input to the LLM agent. Because prompt engineering is very expensive, we optimized the prompts on tables of high complexity that contain nodes and relationships of all types and most labels. Increasing accuracy requires optimization of the following components, whose assembly is sketched after the list:
• System message: the system message contains the general information and task the LLM Agent is given (e.g., “You are a world-class node extracting algorithm…”).
• Prompt: the actual prompt the agent is given to extract nodes/relationships from a given input.
• Examples: examples that are given to the agent that show how to correctly extract data from a given input.
• Schema: the desired output format that contains small descriptions of the parts of the output.
• Input data: the input data is part of the prompt. It is important to include it in a way that is easy to process and contains exactly the information and context that is needed to solve the given task.
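Putting these components together, the prompt assembly for an extraction agent can be sketched as follows. Message contents are abridged and illustrative, and the output schema would be supplied separately (e.g., via structured output or function calling).

```python
def build_messages(headers, sample_row, user_context, few_shot_examples):
    """Assemble the chat history sent to an extraction agent."""
    messages = [
        # Fixed system message with the general task and guidelines.
        {"role": "system",
         "content": "You are a world-class node extracting algorithm..."},
    ]
    # Few-shot learning: artificial conversation turns in which the agent
    # already transformed a similar table correctly.
    for example in few_shot_examples:
        messages.append({"role": "user", "content": example["table"]})
        messages.append({"role": "assistant", "content": example["nodes"]})
    # The actual prompt: table headers, one sample row, and user context.
    messages.append({
        "role": "user",
        "content": (f"Headers: {headers}\n"
                    f"Sample row: {sample_row}\n"
                    f"Context: {user_context}"),
    })
    return messages
```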
The final parameters for every step are made available on our GitHub repository.52
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (1)$$
We evaluated the classification tasks on the headings of all tables listed in the data repository. Additionally, we created an artificially generated dataset of 100 heading/sample_cell pairs for each node/attribute type. The performance of the classification tasks is given in Table 2.
 | Table dataset | | | Artificial dataset | | |
---|---|---|---|---|---|---|
Node/attribute type | Precision | Recall | F1 | Precision | Recall | F1 |
Matter | 0.97 | 1.0 | 0.99 | 0.95 | 0.98 | 0.96 |
Property | 0.98 | 0.90 | 0.94 | 0.94 | 0.99 | 0.96 |
Parameter | 0.91 | 0.96 | 0.94 | 0.98 | 0.98 | 0.98 |
Measurement | 0.88 | 1.0 | 0.93 | 0.96 | 0.93 | 0.95 |
Metadata | 0.94 | 0.94 | 0.94 | 1.0 | 0.92 | 0.96 |
Manufacturing | 1.0 | 0.92 | 0.96 | 0.99 | 0.99 | 0.99 |
Identifier | 0.95 | 1.0 | 0.97 | 0.91 | 0.97 | 0.94 |
Value | 1.0 | 0.94 | 0.97 | 0.96 | 0.97 | 0.97 |
Name | 1.0 | 0.97 | 0.98 | 0.97 | 0.95 | 0.96 |
Unit | 1.0 | 1.0 | 1.0 | 1.0 | 0.97 | 0.98 |
Error | 0.89 | 1.0 | 0.94 | 1.0 | 0.96 | 0.98 |
The classification part of the pipeline yields F1 scores from 0.90 to 1.0. The classification of node types was especially challenging, as the parameter and property types have a high semantic overlap that complicates distinguishing them.
(1) Pairwise similarity calculation
We compare the attributes of each pair of nodes $n_i$ and $n_j$ from the lists $L_1$ and $L_2$. Let $A_i$ and $A_j$ be the attribute sets of nodes $n_i$ and $n_j$, respectively, with elements $A_{ik}$ and $A_{jk}$.
If $A_{ik}$ and $A_{jk}$ are strings, their similarity $S_{ijk}$ is computed as the cosine similarity of their embeddings $e(\cdot)$:

$$S_{ijk} = \frac{e(A_{ik}) \cdot e(A_{jk})}{\lVert e(A_{ik}) \rVert \, \lVert e(A_{jk}) \rVert}$$
If $A_{ik}$ and $A_{jk}$ are numerical, their similarity $S_{ijk}$ is an exact-match indicator:

$$S_{ijk} = \begin{cases} 1 & \text{if } A_{ik} = A_{jk} \\ 0 & \text{otherwise} \end{cases}$$
The overall node similarity $S_{ij}$ is the weighted average of the attribute similarities, where unpaired attributes are given a similarity of 0:

$$S_{ij} = \frac{\sum_k w_k S_{ijk}}{\sum_k w_k}$$
(2) Optimal matching
The similarity comparison of each possible combination of nodes from the model output and the ground truth generates an n × m matrix, where n and m equal the numbers of nodes in the model output and the ground truth, respectively, and each entry represents the similarity of a pair. To obtain the overall similarity between the output and its ground truth, we need to map the elements of both lists one-to-one while maximizing the overall similarity of the pairs. This assignment problem can be solved with the Hungarian method.
The Hungarian method, also known as the Kuhn–Munkres algorithm, is an optimization technique used to find the optimal one-to-one matching in a weighted bipartite graph, minimizing the total cost.55 It iteratively improves the matching through augmenting paths until the best possible assignment is achieved.
(3) Similarity score calculation
The total similarity score $S$ is the sum of the similarities of the matched pairs, $S_{\text{total}}$, normalized by the length of the longer list:

$$S = \frac{S_{\text{total}}}{\max(|L_1|, |L_2|)}$$
This evaluation metric ensures a comprehensive comparison of node lists, optimally matching nodes while accounting for missing attributes and different data types.
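In practice, the matching step can be implemented with an off-the-shelf solver. A minimal sketch using SciPy, assuming the pairwise similarity matrix has already been computed:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def list_similarity(sim: np.ndarray) -> float:
    """Overall similarity S between two node lists, given the pairwise
    similarity matrix (rows: model output, columns: ground truth)."""
    # The Hungarian method minimizes total cost, so negate the similarities.
    rows, cols = linear_sum_assignment(-sim)
    s_total = sim[rows, cols].sum()
    # Normalize by the longer list so missing or spurious nodes lower S.
    return s_total / max(sim.shape)
```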
The pipeline was tested on a total of 500 columns from various materials science tables; the results of the evaluation are given in Fig. 5. To test the flexibility of the pipeline, it was additionally applied to tables from scientific publications in other domains, such as chemistry.
As can be seen, accuracies range from 0.95 to 1.0. Inaccuracies occurred when the table was missing the units of physical quantities or when a table contained duplicate table headings. In case of a missing unit, the pipeline tries to infer the unit from the content of the table and makes an educated guess. Duplicate table headings introduce ambiguity to the table and therefore uncertainty to its transformation. In both cases, the LLM agent has to make a guess, which is intrinsically error-prone.
The results are depicted in Fig. 6.
The relationship extraction achieved F1 scores ranging from 0.92 to 1.0. The validation tables contain up to 98 columns and may list the same fabrication technique multiple times, presenting a significant challenge for relationship extraction. Generally, most tables yield high F1 scores, while more complex tables tend to produce F1 score outliers.
Analyzing the results of these classification tasks, we realized that recurring problems could be traced back to the widespread use of abbreviations in tables. These abbreviations are highly challenging for embedding-based classification tasks, as they are often ambiguous and require context to be understood. Examples include the name of a commercial catalyst, “F50E-HT”, the abbreviation for an ionomer, “AQ”, or its equivalent weight, “EW”. Such abbreviations demand domain knowledge as well as an understanding of the general content of the table and can lead to incorrect classifications. Another challenge arises from inherently ambiguous column headings, such as “Column1”, which cannot be assigned to the correct node types or node attributes. A third issue involves column headings containing too much information, such as “RH sensitivity at 85C”, which implies both a sensitivity measurement and a specific operating condition, that is, two separate nodes within the graph.

Node extraction uses the table, along with the previously assigned node and attribute types, to transform the table into a list of nodes. Similar to the classification tasks, the LLM can struggle to interpret ambiguous table headings or cell contents, especially abbreviations. Because this task is handled by GPT-4o and entails providing the full table headings and sample rows to the LLM, its robustness toward these abbreviations is somewhat improved. Nonetheless, abbreviations not well established in the domain, such as “NOC” for normal operating condition in a fuel cell, may still lead to errors. Another example is the ambiguous abbreviation “I/C”, which in the context of fuel cell fabrication often denotes the ionomer-to-catalyst ratio, but can also mean ionic conductivity.

Beyond semantics, the structure of tables presents additional hurdles. Large tables inflate the number of columns and broaden the range of possible nodes. Additionally, they necessitate a larger context window for the LLM, which can invite inaccuracies due to its susceptibility to information overload.
The final step, converting a list of disconnected nodes into a graph, faces the same challenges as the earlier stages. Furthermore, it requires deep domain knowledge, since relationships are usually only implicitly present within tables and must be inferred. In conclusion, the pipeline is a robust tool for extracting information from tables by transforming them into graphs. It is limited by the content and structure of the given table and, like a human, it can make errors. These errors are often caused by the ambiguity of the table data, as tables are frequently created and used internally, tying their interpretation to knowledge about the underlying scientific procedures they represent.
To conclude, the biggest challenge in table transformation is dealing with ambiguities in table structures and terminology. These ambiguities often arise because researchers typically imply important context when creating tables. As a result, achieving a correct transformation may require feedback from the original data generators. Recognizing that extracting graphs from tables inherently involves uncertainty, we implemented the pipeline in a semi-automated way that incorporates feedback at every step.
Fig. 7 shows the costs of the table transformation. Note that the figure omits the first two steps of the pipeline, whose cost and runtime are negligible, and focuses on the last two steps, which dominate overall cost and processing time. Fig. 7(a) illustrates the time required for node extraction and relationship extraction, as well as the combined duration of both steps. Since the extraction procedures for the different node types and relationship types are independent, they are executed in parallel. As a result, the overall time cost is determined by the extraction process that takes the longest, which effectively becomes the bottleneck of the operation. The costs are calculated as the sum of all node and relationship extractions. The error bars show the standard deviation of the results, as each table was transformed three times to account for the nondeterministic nature of LLMs. Large standard deviations arise if the initial extraction is incorrect and the output needs to be corrected; in that case, the token consumption approximately doubles. The number of rows does not affect the transformation, as the LLM agents are given only the table headings and a sample row. The figure shows that the costs of node and relationship extraction increase with increasing table size. A contributing factor is the growing uncertainty caused by the complexity of larger tables. A clear trend is difficult to determine, though, as table size is only one factor in the duration and cost of the transformation; table structure and terminology also contribute to the complexity of the task and therefore influence cost and duration as well.
This pipeline, coupled with semantic search capabilities and integrated within a user-friendly graphical interface, significantly enhances data management for small research groups or within research projects. It simplifies complex data management tasks, making data ingestion and transformation intuitive. By extracting relationships and adding valuable context, it increases the overall value of the data.
LLMs have proven to be valuable tools for data extraction and graph construction, as they do not require intensive training. The rapid advances in the field of LLMs imply that our pipeline will continue to improve in accuracy, speed, and cost-efficiency by incorporating the latest models. Our evaluation shows that the pipeline, currently based on GPT-4, extracts graphs with high accuracy. The nondeterministic nature of the output can be mitigated through validation functions.
In future works, the proposed pipeline will be integrated into a comprehensive data management system. Specific tasks will focus on testing it as a data management solution for research groups, which will involve adding additional interfaces and enhancing user management capabilities.