The Materials Experiment Knowledge Graph

Materials knowledge is inherently hierarchical. While high-level descriptors such as composition and structure are valuable for contextualizing materials data, the data must ultimately be considered in the context of its low-level acquisition details. Graph databases offer an opportunity to represent hierarchical relationships among data, organizing semantic relationships into a knowledge graph. Herein, we establish a knowledge graph of materials experiments whose construction encodes the complete provenance of each material sample and its associated experimental data and metadata. Additional relationships among materials and experiments further encode knowledge and facilitate data exploration. We illustrate the Materials Experiment Knowledge Graph (MEKG) using several use cases, demonstrating the value of modern graph databases for the enterprise of data-driven materials science.
The materials community has envisioned a new paradigm in materials discovery wherein experiment automation and the integration of human and machine intelligence accelerate materials research to enable new technologies that address a range of societal needs. [1][2][3][4][5] This vision is being realized in specific areas of materials research via advancements in high throughput computation, experiment automation, and artificial intelligence. [6][7][8][9] Continued evolution of accelerated discovery efforts will require methods to aggregate data and knowledge from a diverse set of sources. Recent advancements for specific sources and domains of materials data include integration of computational databases via the JARVIS project 10 and aggregation of perovskite solar cell data. 11 12,13 , which enables materials researchers to submit and annotate datasets.
Scientific knowledge and the discoveries that it provides are the result of cyclic learning. Scientific discovery can thus be accelerated by improving the quality and/or the frequency of learning cycles. Bolstered by the availability of machine learning methods to learn from an ever-expanding dataset, the autonomous or closed-loop approach to experiment automation focuses on increasing the frequency of learning cycles. Initial examples of autonomous operation of such learning cycles have been naturally limited to optimization of performance in a low-dimensional parameter space. Building on these successes, the community is poised to broaden the purview of autonomous learning cycles, which places new constraints on both the breadth of knowledge that must be encoded and the speed of data exploration provided by the in-loop data store. The inherent challenges of managing a diverse set of data streams and establishing a performant data store for autonomous research are compounded by the historical dearth of research in establishing materials data infrastructure. 2,14,15 Herein, we describe the use of graph databases to improve the management of data from materials experiments, provide scalability with respect to data diversity and quantity, and enable data exploration at a speed commensurate with autonomous execution of learning cycles.
Computational materials databases can track the origin of data entries via annotations of the code repository used to generate the data along with specific metadata describing the computational methods. The analogue of this metadata for experimental materials science is far more complex due to the broad range of instruments and their settings, reagents and their purities, etc. Perhaps most foundationally, the data resulting from materials experiments is often sensitive to the order of the experimental steps. Consequently, data management schema must encode the experiment provenance to uniquely represent a piece of experimental data. Recording experiment provenance is inherent to automated experiment workflows that track samples and record timestamps of experiments. [16][17][18][19] Other strategies for provenance management have been introduced for spectroscopy experiments 20 augmented with facile metadata management. 21 Our approach to this challenge is to recognize the experimental events as the data source, resulting in the Event Sourced Architecture for Materials Provenance Management (ESAMP). 22 To facilitate ingestion of a variety of data sources and automate some aspects of data validation, we implemented ESAMP with a Structured Query Language (SQL) database. The sequence of experimental steps is most naturally modelled as a directed graph, and in the present work we demonstrate a graph database that encodes experiment provenance along with a variety of other relationships. The graph approach to modelling experiment sequences has been primarily applied in the field of chemical synthesis. [23][24][25][26] The MEKG (pronounced "Mek G") extends this concept to span synthesis, processing, and characterization experiments, while additionally encoding other relationships that facilitate knowledge representation in general, and data exploration in particular.
We recently published the Materials Provenance Store (MPS), 27 a database built with the ESAMP SQL schema. In the present work, we ingested MPS into a Neo4j database (see Code Availability), in which there is a node for each material "Sample", for each experiment "Process", and for "Sample-Process", which is the application of a Process to a Sample. The experiment provenance for a given sample is encoded through directed edges of type "Next" that connect Sample-Process nodes. Additional nodes for collections of samples, details of each process, data files produced by processes, and analysis results are linked with edges derived from foreign keys in the SQL-based MPS database. We then add additional relationships, such as edges between Element nodes and Sample nodes as well as between pH nodes and electrochemical Process nodes. The encoded knowledge can be further expanded via additional relationships to facilitate data exploration, and relationships can extend to organizational knowledge such as project funding, intended research goal, and relevance to a publication.
The MEKG contains a total of 52,263,968 nodes and 111,430,058 edges, and herein we demonstrate its utility for high throughput electrochemistry experimentation and data exploration. We present 4 use cases, commencing with the most general applications, i) graphical exploration of data and ii) data retrieval via queries. We then describe specific implementation of database queries to iii) automate design of experiments and iv) evaluate a hypothesis from crowd-sourced data.
Human researchers possess domain expertise combined with intuition from their aggregated prior knowledge, both of which are unrivaled by machine learning to date. Machine learning thrives in its scalability to large datasets that exceed the memory capabilities of a typical human. The MEKG can assist the human in exploration of such large datasets through intuitive visualizations. Figure 1 shows images of the MEKG at select moments during a graphical data exploration exercise, for which the full video is available in the mekg-migrations repository (see Code Availability). This interactive visualization demo commences with viewing all samples that contain Pd or Al (Figure 1a), focusing on samples that contain both (Figure 1b), and then viewing their experiment provenances (Figure 1c). In this last step, the sub-graph for each sample is expanded to show the analyzed electrochemical current density, for which a color legend is assigned to demonstrate simultaneous visualization of performance and experiment provenance.
Fig. 1 Snapshots from an interactive data exploration spanning visualization of a) element nodes for elements Al and Pd with 10,278 sample nodes containing these elements, b) an expanded view of select samples containing both Al and Pd, and c) graphs for 4 select samples where the relationships to element nodes are no longer shown and each sample node has been expanded to show its processes as well as additional information for a select process. The Element, Sample, and Process node types are labelled. Additional annotation includes the experiment provenance of 1 sample where the 7 process nodes are linked by "Next" relationships. The user-selected chronoamperometry (CA) process of interest, of which there is an analogue in each of the 5 sample provenance graphs, is expanded to show its data file and the "CA current" metric. The metric nodes are colored according to the color bar in the upper-right.
Another mode of exploration, applicable equally to human and machine users, is data exploration via queries. We developed the following set of queries to include a synthesis-based search, a synthesis and measurement-based search, a provenance-based search, and a provenance-based search conditioned on analysis results: 1) Find samples annealed at 350 °C; 2) Find all electrochemistry measurements performed on a sample that contains both Bi and V; 3) Find all provenances wherein a sample was synthesized by inkjet printing and whose first 2 electrochemistry measurements were chronopotentiometry measurements at 0.03 and 0.1 mA, respectively, each with a duration between 7 and 15 s; and 4) Find all provenances that contain a sequence of 5 electrochemistry experiments in NaOH-based electrolyte wherein the first 4 experiments were each chronoamperometry measurements that produced a measured current above 10⁻⁷, 10⁻⁸, 10⁻⁹, and 10⁻¹⁰ A, respectively, and the final electrochemistry experiment was a cyclic voltammogram that produced a maximum measured current above 10⁻⁶ A. The query execution times are summarized in Table 1, demonstrating the excellent performance of the graph-based query across a breadth of query types. For query 1, where the requisite data is indexed in a single SQL table, the SQL-based query is naturally the fastest. For provenance-based queries, the graph-based queries are several times faster than the SQL-based queries.
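To make the provenance-based search concrete, the following is a minimal in-memory sketch of query 3, not the Cypher or SQL used in this work (those are in the Supporting Information). It assumes a simplified model in which each Sample-Process node holds a process type, a dictionary of details, and a single "Next" edge. The class name, the process-type labels ("inkjet_print", "CP", "CA", "CV"), and the detail keys ("current_mA", "duration_s") are hypothetical and do not reflect the actual MEKG schema.

```python
import math
from dataclasses import dataclass, field
from typing import Iterator, Optional

# Hypothetical labels for electrochemistry process types.
ELECTROCHEMISTRY = {"CP", "CA", "CV"}


@dataclass
class SampleProcess:
    """One application of a process to a sample (a Sample-Process node)."""
    process_type: str
    details: dict = field(default_factory=dict)
    next_step: Optional["SampleProcess"] = None  # the directed "Next" edge


def link(steps: list) -> SampleProcess:
    """Chain the steps in order with Next edges and return the first step."""
    for earlier, later in zip(steps, steps[1:]):
        earlier.next_step = later
    return steps[0]


def provenance(head: SampleProcess) -> Iterator[SampleProcess]:
    """Walk the Next edges from the first step to the last."""
    node = head
    while node is not None:
        yield node
        node = node.next_step


def matches_query_3(head: SampleProcess) -> bool:
    """Inkjet-printed sample whose first 2 electrochemistry steps are
    chronopotentiometry at 0.03 and 0.1 mA, each lasting 7 to 15 s."""
    steps = list(provenance(head))
    if not any(s.process_type == "inkjet_print" for s in steps):
        return False
    echem = [s for s in steps if s.process_type in ELECTROCHEMISTRY]
    if len(echem) < 2:
        return False
    for step, target_mA in zip(echem[:2], (0.03, 0.1)):
        if step.process_type != "CP":
            return False
        current = step.details.get("current_mA", float("nan"))
        if not math.isclose(current, target_mA):
            return False
        if not 7 <= step.details.get("duration_s", -1) <= 15:
            return False
    return True


# Two toy provenances (illustrative values only, not data from the MPS).
example_match = link([
    SampleProcess("inkjet_print"),
    SampleProcess("anneal", {"temperature_C": 350}),
    SampleProcess("CP", {"current_mA": 0.03, "duration_s": 10}),
    SampleProcess("CP", {"current_mA": 0.1, "duration_s": 12}),
    SampleProcess("CA"),
])
example_miss = link([
    SampleProcess("inkjet_print"),
    SampleProcess("CA"),
    SampleProcess("CP", {"current_mA": 0.03, "duration_s": 10}),
])
```

In this sketch `matches_query_3(example_match)` holds and `matches_query_3(example_miss)` does not, because the first electrochemistry step of the second provenance is not chronopotentiometry. A graph database expresses the same traversal declaratively as a path pattern over "Next" relationships.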
More drastically, the complexity of query 4 revealed a marked difference in query preparation time. While the graph-based query was written in a matter of minutes, initial attempts at writing the SQL query resulted in query timeout after 10⁴ s. Multiple days of human effort were required to obtain a query time within a factor of 5 of the graph-based query, which is reflected in the relative complexity of the queries (see Supporting Information). Our conclusion from this exercise is not that graph databases universally outperform the other data management methods with respect to query execution, but rather that the graph-based queries are sufficiently fast for real-time data exploration and can be achieved with intuitive query expressions that avoid complex query engineering.
Table 1 Comparison of execution times for representative queries of materials experiment data (MPS) when it is stored in a graph database (MEKG), SQL database (ESAMP), and file system (MEAD). The graph and SQL queries were performed on a t2.xlarge Ubuntu Amazon Web Services (AWS) machine (see Supporting Information). The number of results is shown for each query. The File System database is not applicable (N/A) for query 4 because it does not contain the required information. † Query times were in excess of 10⁴ s prior to extensive query optimization.
As a moderately complex provenance-based query, query 3 was chosen to characterize how query time scales with data size. To achieve representative databases of smaller size, 3 sub-databases were created using the earliest 1/8, 1/4, and 1/2 of the Sample-Processes in the MPS, followed by removal of all orphaned samples, processes, analyses, etc. (see Supporting Information). Running query 3 on these databases informs us of how long the query would have taken if it had been performed at these various points in the lab's sequence of experiments.
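One common way to summarize timing measurements taken at several database sizes is a power-law fit, t = a·N^b, obtained by linear least squares in log-log space. The sketch below illustrates that fitting and extrapolation procedure only. The sizes and times are synthetic placeholders generated from an exact power law, not measurements from this work, and the function name is hypothetical.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of log(t) = log(a) + b*log(N); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    prefactor = math.exp(mean_y - exponent * mean_x)
    return prefactor, exponent


# Synthetic placeholder data: four database sizes in a 1/8, 1/4, 1/2, 1 ratio,
# with times generated from an assumed law t = 1e-3 * N**0.5 (NOT measured).
sizes = [1e5, 2e5, 4e5, 8e5]
times = [1e-3 * n ** 0.5 for n in sizes]

prefactor, exponent = fit_power_law(sizes, times)

# Extrapolate the fitted law to a much larger hypothetical database.
projected_time = prefactor * (1e9) ** exponent
```

Because the placeholder data follow an exact power law, the fit recovers the assumed exponent. With real timings the residuals in log-log space indicate how far an extrapolation over several orders of magnitude can be trusted.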
The results for the graph- and SQL-based versions of query 3 are shown in Figure 2, which illustrates the excellent relative performance of the graph-based query across all data sizes as well as a favorable power-law scaling relationship for the graph-based query. Extrapolating to a database with a billion Sample-Processes, the scaling law provides a projected query execution time of 65 s, illustrating the promise of graph databases for aggregating large swaths of materials chemistry data while maintaining operability for both humans and machines.
Our second use case involves the automated design of experiments, where we choose a learning cycle of intermediate scope. Sequential learning in closed-loop experimentation typically involves the design of a single acquisition from a collection of available experiments, a small-scope experiment design intended to iterate many times per day. Traditional human-executed learning cycles have a broad scope, typically occurring over the course of many days. Here, we consider the automated planning of experiments for a single batch of high throughput experiments that can be executed in a few hours. Electrocatalytic activity of the oxygen evolution reaction (OER) varies substantially with not only the catalyst composition and structure, but also the electrolyte, especially the electrolyte pH. While high throughput experimentation has amassed catalyst screening data, these data cover a small fraction of all possible combinations of catalysts and electrolytes. We thus consider an automated design of experiments for choosing which catalysts available in the lab should be tested in a given electrolyte. While machine learning models could be invoked for this prediction, we simplify the design process to keep focus on the role of the MEKG. We previously demonstrated a correlation of OER activity in pH 3 and pH 7 electrolytes among metal oxide catalysts, 28 which helps define a simple design-of-experiments strategy. We conduct 2 queries, one to establish the catalysts screened in pH 7 but not pH 3 electrolyte, and a second to establish which catalysts have already been synthesized but not yet electrochemically tested. Evaluating the query results provides a set of composition libraries that are candidates for pH 3 OER screening, ranked by the expected activity based on prior pH 7 experiments.
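The way the two query results combine can be sketched as follows. This is one plausible reading of the strategy described above, not the code from the design-of-experiments notebook. It assumes that catalysts are matched by a composition label, that a larger activity value means a more active catalyst, and that the expected pH 3 activity is ranked directly by the prior pH 7 activity. All names and values are hypothetical placeholders.

```python
def rank_candidate_libraries(ph7_activity, ph3_screened, untested_libraries):
    """Rank untested libraries for pH 3 screening by prior pH 7 activity.

    ph7_activity: dict mapping composition label -> activity measured at pH 7
    ph3_screened: set of composition labels already screened at pH 3
    untested_libraries: dict mapping library id -> composition label for
        libraries that are synthesized but not yet electrochemically tested
    """
    # Result of query 1: screened at pH 7 but not at pH 3.
    unexplored = {
        comp: activity
        for comp, activity in ph7_activity.items()
        if comp not in ph3_screened
    }
    # Intersect with the result of query 2: available, untested libraries.
    candidates = [
        (library_id, comp, unexplored[comp])
        for library_id, comp in untested_libraries.items()
        if comp in unexplored
    ]
    # Highest prior pH 7 activity first (assumes larger means more active).
    return sorted(candidates, key=lambda item: item[2], reverse=True)


# Placeholder inputs standing in for the two query results.
ph7_activity = {"comp_A": 1.2, "comp_B": 0.4, "comp_C": 2.5}
ph3_screened = {"comp_C"}
untested_libraries = {
    "lib_1": "comp_B",
    "lib_2": "comp_A",
    "lib_3": "comp_C",  # already screened at pH 3, so excluded
    "lib_4": "comp_D",  # no prior pH 7 data, so excluded
}

ranking = rank_candidate_libraries(ph7_activity, ph3_screened, untested_libraries)
```

Here `ranking` lists `lib_2` ahead of `lib_1`. Keeping the ranking step separate from the queries makes it straightforward to swap in a machine learning model for the expected activity.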
Running on the lab's notebook server (see Supporting Information), the initial query used criteria spanning experiment provenance, process details, and analysis details, identifying the 69K activity measurements of interest from the set of 2.5M electrochemistry measurements (Sample-Processes) with a query execution time of 70 s. In total, the design-of-experiments notebook runs in under 3 min, enabling human-guided, data-driven design of high throughput experiments.
Our final use case involves the evaluation of a new hypothesis based on existing data. Trotochaud and coworkers demonstrated that the activity of electrocatalysts for the oxygen evolution reaction (OER) may be enhanced due to incorporation of trace Fe impurities in standard electrolytes. 29 Meanwhile, high throughput experiments revealed the broad range of compositions that are active OER catalysts in alkaline electrolytes. 28 From these reports we can hypothesize that catalyst conditioning, perhaps through Fe incorporation, improves the activity of OER catalysts regardless of initial catalyst composition. This would imply that even poor catalysts will become competent catalysts upon aging, which has not been evaluated in the literature. Querying the MEKG for experiments of the type reported in Ref. 28 produces a dataset of catalyst activity, where we group measurements by the primary element of the catalyst (concentration at least 70%) and consider the total duration of prior electrochemistry. Figure 3 summarizes the results, revealing that all catalysts experience conditioning over tens of seconds of electrochemical operation, and while transition-metal-rich catalysts exhibit the highest activity, the conditioning results in high activity for rare-earth-rich catalysts that otherwise may not exhibit such activity. A similar analysis in Figure S1 shows that the same conditioning trend is observed in an alternate measurement of catalytic activity (catalyst overpotential at 3 mA/cm²) in pH 13 electrolyte, while an opposite trend is observed in pH 7 electrolyte, indicating that catalyst instabilities outweigh any catalyst conditioning at near-neutral pH and demonstrating that evaluation of the aforementioned hypothesis is pH-dependent.
While the underlying high throughput experiments were not designed based on a catalyst conditioning hypothesis, the management of catalyst activity data in the context of experiment provenance enables rapid evaluation of such hypotheses using the MEKG.
The MEKG extends the rich use of graph and network models in materials science. Networks have been used to model all known inorganic materials 30 and their interrelationships established with structural and electronic features. 31 Materials knowledge graphs have been established for materials properties and their symbolic or data-driven relationships, 32 for representing interrelationships among various sources of materials data, 33 for integrating multiple data streams, 34 and for encoding relationships among factual knowledge, analytical models, and domain experts. 35 Knowledge graphs for specific domains of materials science have been established for common industrial metals, 36 nanocomposites, 37 metal organic frameworks, 38 and battery materials. 39,40 The value proposition for expanding the purview of such knowledge graphs has been made, 41 and the present work builds towards a global materials knowledge graph by establishing best practices for representing experiments and their associated (meta)data in a scalable manner. With the proliferation of graph neural networks, causal modeling, and attention-based networks such as transformer models in machine learning writ large, and the expectation that increased deployment for materials discovery is imminent, we believe the elevation of experimental data management to graph databases will pave the way for a new era of artificial intelligence for materials science.

Code Availability
The code for the query time use cases and MEKG migration from MPS is available at https://github.com/modelyst/mekgmigrations.
The code for the design of experiments and hypothesis evaluation use cases is available at https://data.caltech.edu/records/m4mpa-4mt17 (doi: 10.22002/m4mpa-4mt17).

Conflicts of interest
Modelyst LLC implements custom data management systems in a professional context.

Supporting Information
The extract, transform, load (ETL) process was carried out using a Python library called DBgen (https://github.com/modelyst/dbgen), which was specifically designed to instantiate complicated scientific data pipelines. PostgreSQL (https://www.postgresql.org/) was used to create the SQL database, and the Neo4j community edition (https://neo4j.com/) was used to create the graph database. The process of migrating the data from the SQL database to the graph database was done using a Python library called PG4J (https://github.com/modelyst/pg4j), which is capable of migrating any PostgreSQL database to Neo4j. For query timing, the SQL database and the graph database were run in Docker containers on AWS EC2. Specifically, the EC2 instance was a t2.xlarge, and the Docker images were postgres:14 and neo4j:5.5 for the SQL and graph databases, respectively. Cypher queries and data processing for the design of experiments use cases were executed in Jupyter notebooks running on a local JupyterHub server (Intel i9-11900K, 64 GB RAM). The computational methods are summarized in Table S1.
To investigate the scaling of query times in both the SQL and graph databases, we created three smaller versions of the original database. The first database fragment was created by removing the last half of the rows in the sample-process table, ordered by their process timestamps. We then deleted all rows in other tables that were no longer linked to a sample-process. This process was repeated two more times to create two additional database fragments, with 3/4 and 7/8 of the rows in the sample-process table deleted. Each fragment was migrated to Neo4j using the tools described above, resulting in a series of MPS-style and MEKG-style databases that share the same information and contain 1/8, 1/4, and 1/2 of the number of Sample-Processes in the full MPS and MEKG databases.
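The fragment construction described above can be illustrated with a small in-memory sketch. It is a simplified stand-in for the actual SQL operations, assuming each table is a list of dictionaries and that linked tables reference a sample-process through a shared key. The function names, column names, and toy rows are hypothetical.

```python
def earliest_fraction(sample_processes, fraction):
    """Keep the earliest `fraction` of sample-process rows by timestamp."""
    ordered = sorted(sample_processes, key=lambda row: row["timestamp"])
    return ordered[: int(len(ordered) * fraction)]


def remove_orphans(table, key, kept_sample_processes):
    """Drop rows of `table` no longer linked to any kept sample-process."""
    linked = {row[key] for row in kept_sample_processes}
    return [row for row in table if row[key] in linked]


# Toy tables: 8 sample-process rows spread over 4 samples and 8 processes.
sample_processes = [
    {"sample_id": f"s{i // 2}", "process_id": f"p{i}", "timestamp": i}
    for i in range(8)
]
samples = [{"sample_id": f"s{i}"} for i in range(4)]
processes = [{"process_id": f"p{i}"} for i in range(8)]

fragments = {}
for fraction in (1 / 2, 1 / 4, 1 / 8):
    kept = earliest_fraction(sample_processes, fraction)
    fragments[fraction] = {
        "sample_processes": kept,
        "samples": remove_orphans(samples, "sample_id", kept),
        "processes": remove_orphans(processes, "process_id", kept),
    }
```

Truncating by timestamp rather than sampling at random means each fragment corresponds to the database as it would have existed at an earlier point in the sequence of experiments.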