Developments and applications of the OPTIMADE API for materials discovery, design, and data exchange

The Open Databases Integration for Materials Design (OPTIMADE) application programming interface (API) empowers users with holistic access to a growing federation of databases, enhancing the accessibility and discoverability of materials and chemical data. Since the first release of the OPTIMADE specification (v1.0), the API has undergone significant development, leading to the v1.2 release, and has underpinned multiple scientific studies. In this work, we highlight the latest features of the API format, accompanying software tools, and provide an update on the implementation of OPTIMADE in contributing materials databases. We end by providing several use cases that demonstrate the utility of the OPTIMADE API in materials research that continue to drive its ongoing development.

The Open Databases Integration for Materials Design (OPTIMADE) application programming interface (API) empowers users with holistic access to a growing federation of databases, enhancing the accessibility and discoverability of materials and chemical data.Since the first release of the OPTIMADE specification (v1.0), the API has undergone significant development, leading to the v1.2 release, and has underpinned multiple scientific studies.In this work, we highlight the latest features of the API format, accompanying software tools, and provide an update on the implementation of OPTIMADE in contributing materials databases.We end by providing several use cases that demonstrate the utility of the OPTIMADE API in materials research that continue to drive its ongoing development.

Introduction
Industrial chemicals and materials underpin the global economy: for example, chemicals alone contribute $6.4trillion 24 annually to the global economy.Industrial chemicals and materials companies are under significant pressure to improve the environmental, social, and governance impact of their business 10 , with a particular focus on reducing the carbon footprint.Unfortunately, the discovery of chemicals and drugs is traditionally a time-consuming and expensive process driven by experiment-led trial-and-improvement, delaying the response to the climate crisis.However, in the last few years, high-throughput calculations have led to an explosion in the volume of available materials data 41,57,77 .Machine learning has emerged as a pivotal tool to exploit this data 125 , accelerating the discovery of chemicals and drugs that meet the challenges faced today.
A core requirement for the data and machine-learning revolution is data availability and interoperability.Therefore, the Open Databases Integration for Materials Design (OPTIMADE) universal application programming interface (API) was created to empower users with programmatic access to many leading materials databases.By organising under an open federation, and emphasising the interoperability of search as well as access, OPTIMADE improves the discoverability of materials data, especially from smaller, less known databases.As we move into the era of autonomous laboratories (both computational and experimental), the technical approach taken renders all OPTI-MADE APIs machine actionable, allowing for automated serendipitous discovery of newly added data entries in a given materials space without needing to specify which databases to access.Such an extended data availability requires explicit clarification of data permissions and ownership, which can be different for each database.
Since the first release of the OPTIMADE specification (v1.0) 2 with accompanying article 3 , the OPTIMADE API format has enjoyed significant adoption, with 22 registered providers 130,131 of 25 interoperable databases serving over 22 million crystal structures with associated properties.In the v1.2 release 5 , the specification has undergone significant extensions and enhancements that enable novel use cases whilst making the format accessible to both users and developers.This gives users access to data from both large and well-known sources, and many specialist datasets focused on a family of materials of particular interest.The combination of a general overview of all possible materials and detailed knowledge of particular materials enables novel discovery and deep insights with for example machine learning.
In this paper, we highlight recent developments to and uses of the OPTIMADE API.First, in Section 2, we provide an overview of the OPTIMADE API format, and of the latest features.Next, in Section 3, we highlight the efforts of leading materials databases to provide access through the OPTIMADE API format.In Section 4, we show how the OPTIMADE API has been used in computational screening and for the creation of machine learning datasets.Sec-tion 5 discusses future plans for OPTIMADE, both in terms of new technical frontiers, and of the sustainability of the ecosystem and community.Finally, in Section 6, we summarise and look ahead to the future of materials databases.

Overview of the OPTIMADE API
The OPTIMADE API is well-positioned to set the standard to enable search, retrieval and annotation in a common way for all databases.Crystal structure data has benefited from decades of standardisation work in the form of the Crystallographic Information File (CIF) 13,51 and related initiatives, which heavily inspired the crystal structure representation employed by the OPTIMADE API format.By building a standardised, open format, both proprietary and open data can be aggregated and used on the same footing.
Building on these seminal standardisation efforts, OP-TIMADE goes considerably further than standardising the representation of crystal structure data, by including: the means for filtering entries (via the OPTIMADE filter grammar), a standard for laying out resources on the web (by providing rules and expectations of URL formats), a means for introspectively defining additional properties and entry types per-database, and the creation of a decentralised federation of compatible databases.These additional aspects are what maximises the impact of the OPTIMADE API and enable new scientific applications.The OPTIMADE API is registered with FAIRsharing.orgas a data standard 38 , and releases are archived on Zenodo 2 , with ongoing development occurring openly under the Materials-Consortia banner on GitHub 129 .The initial motivation for OPTIMADE, and a discussion of the previously existing materials API formats and filter mechanisms can be found in Ref. 3 which described the first release.
The process for a user to access OPTIMADE compliant data is as follows.The user starts with the base URL defined by the database provider (available from the federated OPTIMADE provider list 130,131 ), and then appends a common string describing the entry type to query, plus any filter or implementation parameters, which is submitted as an HTTP GET request.For example, to probe the Materials Project 66 for materials containing SiO 2 , we make a GET request to the following URL: base URL https://optimade.materialsproject.orgendpoint /v1/structures ?filter= chemical_formula_reduced="O2Si"

OPTIMADE filter
This delivers the response in Box 1 that contains entries where oxygen and silicon occur in a 2:1 ratio.The power of the OPTIMADE API is that the same generic request can be appended to the base URL of any other database, and its matching entries will be returned.This is a non-trivial step for database providers, who must convert the OPTIMADE filter grammar into the corresponding query for their own database engine.The benefit is that client code can then be written to unify the results from multiple databases, allowing users to receive the most comprehensive results for the query.Such an extended data availability requires explicit clarification of data permissions and ownership, which can be different for each database or even each entry.

OPTIMADE core design principles
A materials database provider that implements the OPTI-MADE API will have a database backend and one or more interfaces available to clients.These interfaces include the OPTIMADE API, but can also provide access by other means, e.g., a database-specific API or web-based graphical user interface.Figure 1 serves as a schematic illustration of this point.
The providers that presently support the OPTIMADE API represent a wide range of underlying backends.The primary backend component is, usually, a database engine, but the backend covers all parts of the system that manages the stored data.Relevant backends for OPTIMADE implementations range from simple flat files in a filesystem, to There exists a multitude of APIs for data access and storage (both for materials databases and more generally) that have been designed for a specific database backend.These APIs are typically designed around specific features the backend provides in terms of, e.g., browse, search, and retrieval, and, crucially, how these features are implemented by the backend.The API then typically becomes a thin wrapper for these features, which exposes the functionality of the backend implementation.There is thus a crucial need for a generic database interface, which should be based on the central principles: • A core feature set that any reasonable materials database backend can implement via cheap, singlepass, on-the-fly translations in the API layer.Furthermore, the translations should be possible without modifying the underlying backend -i.e., participating databases should not be required to reformat or amend the stored data, software features, etc. to provide these core OPTIMADE features.
• Extended features beyond the core features that are shared among multiple (but not all) backends should be standardised as optional features.It may appear that this design works against interoperability, as it may lead to query Q working for databases A and B, but not for database C.However, if the database backend for C cannot support the type of query that Q represents (for example, a database containing molecular dynamics calculations of proteins cannot sensibly be searched on the chemical formula in the simulation cell), it means that there exists no way to make that query interoperable across A, B, and C (without altering the backend of C).Hence, the highest level of interoperability is achieved by standardising Q as an optional feature.
• For standardised optional features, multiple overlapping representations of the same features and/or data should be avoided.The reason is that, for two different ways of providing feature Q as Q 1 and Q 2 , we may end up with database A supporting only Q 1 and database B supporting only Q 2 .Hence, the client either has to tailor the query differently depending on the destination or provide multiple versions of the query in the request, which is an undesirable burden to put on clients and works against an interoperable design.What is strictly standardised is how a database that does not implement a certain feature should respond should that feature be requested.
• Those features and data that only occur in a particular database should be provided in isolated databasespecific namespaces that other databases can recognise and handle appropriately.The OPTIMADE specification describes how references to such fields should be announced by the particular database and also how clients making multi-provider queries should handle in the response in various situations for optimal interoperability.The latest version of OPTIMADE even outlines a mechanism for creating sub-specifications that allow multiple database providers to collaborate on custom communal definitions, as shall be discussed later.

Recent OPTIMADE improvements
Since the initial v1.0 release 2 of the OPTIMADE API in July 2020, many features have been added, driven by user feedback and use cases.The major enhancements to the specification introduced in versions 1.1 4 and 1.2 are discussed below, with the full changelog and specification text available online on GitHub at Materials-Consortia/OPTIMADE 129 .

Property definitions
In previous versions of OPTIMADE, the information served through introspection for a provided property was limited to a single specification of an OPTIMADE type, which is far from sufficient for a client to use the data.For standard fields, clients were referred to the human readable descriptions in the OPTIMADE specification, and for databasespecific fields to side channel information, e.g., at the website of the database.OPTIMADE v1.2 includes a full schema format, based on -and compatible with -JSONSchema 99 , capable of fully describing in a machine-readable way what a property is (including definition of sophisticated data structures with multiple layers of lists and dictionary subfields).All properties are given clear versioned stable identifiers (URIs) that can be used to identify that multiple databases refer to the exact same property.A simple example is provided in Box 2, which displays the definition of the nsites field.Furthermore, a browsable interface to these definitions for the OPTIMADE standard properties is available at https://schemas.optimade.org/.
There is related work outside of the specification on providing an enlarged set of shared definitions that can be incrementally adopted by the subsections of the community.This will allow databases to point to and share the same property definition without requiring the slow consensusbuilding step for the relevant fields to be promoted to the main specification.This is especially useful in cases where this shared information is the defining feature of a given database; for example, the first target namespace is that of stability predictions arising from density-functional theory calculations, the key property required to truly enable materials discovery applications with OPTIMADE (see 4.1.2later for more information).

Streaming of partial data
So far, OPTIMADE has been based on the JSON format, which has the advantage that it is relatively human readable and well supported by most programming languages.Most tools, however, can only process an entire JSON file and thus do not support streaming processing, which makes it difficult to handle large JSON files.In addition, JSON does not support binary data: a response with large amounts of numerical data needs to be encoded (e.g., in base64 or similar encodings), significantly increasing the size of the data to be transferred over the network.Adding support for a format like BSON, which does support binary data, or JSON variants which do support streamed processing (such as JSON Lines) could therefore improve the data "examples": [  42  ], "x-optimade-unit": "dimensionless" } Box 2. An example OPTIMADE property definition for the nsites property.transfer rates for OPTIMADE requests, and allow for immediate visualisation of partial results.
In previous versions of OPTIMADE, the size of property data was limited so that a single database entry could fit in a single HTTP response.In the most recent version of OPTIMADE, the implementation can decide to defer large properties to be communicated over a separate simplified streamable protocol, which in practice can be implemented by serving large static files over HTTP.This addition was driven by the future application of serving trajectory data with OPTIMADE (see 5 for more information) where individual "properties", such as atomic positions in each frame of a trajectory, can individually be prohibitively large.

Other technical and scientific enhancements
There have been several other extensions of the format since version 1.0: Symmetry information for structures The latest OPTI-MADE release includes a set of standardised and comprehensive descriptions of structural symmetry.This has been achieved with 5 fields that can be variously used for filtering by symmetry and for reconstructing atomic site positions, building on existing standards laid out by the IUCr and others: • space_group_symmetry_operations_xyz , • space_group_symbol_hall , • space_group_symbol_hermann_mauguin , • space_group_symbol_hermann_mauguin_extended , • space_group_it_number .
/files endpoint The OPTIMADE specification has been extended to enable the description of files and their precise relationships to other OPTIMADE entries.To implement this, a new /files entry endpoint has been defined.This addition allows, for example, linking /structures entries with their representations in widely used structural data formats such as CIF, POSCAR, and SDF, or linking directly to the raw input or output files associated with a calculation involving a structure.

Per property and per entry metadata
We plan to create a mechanism to provide metadata per property for each individual entry.This metadata could, for example, be confidence intervals, or information about how a property was calculated.

Request delay
In order not to overload a particular OPTI-MADE server, the metadata property request_delay appears in the latest release to allow an implementation to suggest a specific back-off time delay between subsequent requests.It is up to the given server implementation to decide what to do with clients violating the requested delay: refuse to serve, intentionally delay the response, or ignore.

Licensing
The possibility of unsupervised database harvesting raises a need for machine-readable definition of data licenses.Whilst OPTIMADE is an open format, it can be used to serve proprietary or otherwise restricted data, the usage terms of which must be described.The latest OPTIMADE release addresses this issue by introducing the metadata fields license , available_licenses , and available_licenses_for_entries .
The property license is intended for databases to link to humanreadable licensing terms, while available_licenses and available_licenses_for_entries allow specifying machine-readable license identifiers following the Software Package Data Exchange (SPDX) standard, making interpreting requirements easier for automated clients and web crawlers.This standardised way to announce that either the whole database, or the individual entries, are explicitly available under certain licenses enables machine-actionable licensing decisions, e.g. for commercial re-use and republishing of data.

Substring comparisons on list elements
The OPTI-MADE filter grammar defines substring comparison operators to match either the start, end or any part of a property value.However, until version 1.2, such comparisons could not be carried out on list elements.The latest version of OPTIMADE now explicitly supports such substring queries on elements of list properties.

Boolean values
Although there are no Boolean fields in the main OPTIMADE specification, the latest version of the OPTIMADE specification includes support in the filter grammar for defining and filtering on such custom fields defined by data providers.

optimade-python-tools
optimade-python-tools is an open-source software (MIT license) package that provides tooling for serving, validating and consuming OPTIMADE APIs in Python 37 , available on GitHub at Materials-Consortia/optimade-python-tools.Now in version 1.0, it provides a highly extensible reference server implementation of an OPTIMADE API, with support for different database backends.This server is provided as a Docker container for easy deployment and can be configured to use an existing database, or generate one from scratch in the OPTIMADE format.Existing databases wanting to make use of the library need to provide mappings to and from their existing data format and query mechanisms.The package contains isolated modules for various OPTIMADE-related functionalities, for example, a grammar and parser for the OPTIMADE filter language, mappers for querying different database backends, and a fuzzy validator that can dynamically generate requests to an OPTIMADE server to assess its compliance with the standard (which is also used to generate the OPTIMADE provider dashboard: optimade.org/providersdashboard) 130.In addition to server-focused functionalities, the package includes reusable code that can help OPTIMADE consumers and clients, including adapters for converting OPTIMADE entries into common formats used in the community, such as ASE Atoms 72 , pymatgen Structures 90 and AiiDA nodes 60 .
The package also contains an advanced asynchronous HTTP client that can be used within Python code or at the command line to concurrently filter multiple OPTIMADE databases with multiple queries, paginating and validating the results, as well as searching across databases for supported properties.Users can provide callbacks to store the results in local secondary databases for re-use in other projects.
optimade-python-tools is fully documented online at optimade.org/optimade-python-toolswith guides for setting up and validating an API, deploying a server and using the client.In this way, optimade-python-tools significantly lowers the barrier to retrieving data from OPTI-MADE APIs, and to the development of new server implementations.Future developments will focus on extending the use cases of optimade-python-tools to operating on static data, so that it can be embedded within archival infrastructure, such as Materials Cloud 127 , to serve user data without any additional input.

optimade-gateway
optimade-gateway is an open-source software (MIT license) package implementing a RESTful API in Python for querying OPTIMADE databases, available on GitHub at Materials-Consortia/optimade-gateway.This provides a powerful yet straightforward tool to allow users of the Python programming language (that is widespread in machine learning and other branches of data science) to access material database results.optimade-gateway supports both synchronous and asynchronous searches via HTTP GET and POST requests, respectively.It can return search results in the standard OPTIMADE data format, as well as a custom OPTIMADE-inspired data format.The main purpose and goal of the deployed service is to be a client backend.A gateway to version 0.4 is running at mmp-optimade-gateway.materialscloud.ioas well as at optimade-gateway.fly.dev.Both services utilise a MongoDB database for time-limited caching of the query results, increasing response speeds for common queries.

Optimade.Science
Optimade.Science is a minimalist in-browser OPTIMADE aggregator, written in the TypeScript language on top of the Svelte frontend framework 126 .It fetches the official OPTIMADE providers list, looks for the structure endpoints, and allows simultaneous querying against all of them, collecting the results together on a single webpage.Technically, this is just the single file index.htmland is thus highly-portable, can be opened from anywhere, on any environment (e.g., on a smartphone or locally from a USB stick).To increase ease-of-use, a simple pattern-matching library was developed (mpds-io/optimade-mpdsnlp), transforming the free-text user input into a standard OPTIMADE query (e.g., a keyword ternary is transformed into nelements=3).A separate Svelte user interface kit (basf/svelte-spectre) was developed offering a range of the modular GUI components, willingly accepted by the frontend community and already re-used in many other webprojects, including commercial ones.A standalone OPTI-MADE client written in TypeScript was employed, being fully isomorphic (that is, the same code can be used inside the web browser and on the web server).

Contributing databases
The burgeoning community of materials databases are the core that underpins the OPTIMADE consortium.Therefore, below we discuss the key features of the major materials databases that make data available through the OP-TIMADE API.We first briefly introduce each database and its offering, and we then discuss its particular OPTIMADE implementation.Finally, we provide an updated table from our previous work 3 that compares the amount of compliant data available in different databases.

AFLOW
Database: One of the largest open-access databases for inorganic materials, with 4 million compounds and 800 million associated properties 35,92 , which also includes the 2000+ entries of the AFLOW Encyclopedia of Crystallographic Prototypes 55,56,83 .The data has been employed for the discovery of new permanent magnets 107 , superalloys 89,104 , high-entropy high-hardness plasmonic carbides 1 , super-hard disordered carbides 108 , borides and carbo-nitrides 30 , and phase-change memory compositions 71 , and has also been used to study bulk metallic glasses 39,98 , superconductors 62,124 , and thermoelectrics 135 .The data can be retrieved conveniently through the AFLUX search API 105 with a minimal, flexible, and human-readable query language.

OPTIMADE implementation:
The AFLOW OPTIMADE API builds on AFLUX to offer a common query syntax across multiple materials databases, mapping AFLOW property labels to that of OPTIMADE while still offering access to AFLOW-specific properties with the _aflow_ prefix.A full list of keywords available to use with OPTIMADE to query AFLOW are available at the info endpoint /info/structures .

Alexandria
Database: Comprises both hypothetical and existing compounds, which have been relaxed using density functional theory (DFT).Currently, the database contains 5,062,521 entries, spanning nearly the entire periodic table with 89 elements.The database was primarily generated by scanning binary, ternary, and quaternary prototypes to identify stable compounds.This process employed crystal graph attention networks 114,116,117 to predict the stability of all potential compositions for each prototype.Compounds that were found to be close to stability were subsequently confirmed using DFT.Additionally, the database includes compounds obtained from traditional high-throughput searches conducted previously 112,113,137 .
Most entries were calculated using the PBE functional with parameters mostly consistent with those of the Materials Project 66 , and includes 3D (4 489 295 entries), 2D (137 833) and 1D (13 295 entries) compounds.Additionally, a total of 422 098 materials were computed using the PBEsol and SCAN functionals to yield more precise geometries, formation energies, and bandgaps 115 .The PBE version of the Alexandria database, which comprises 115 535 potentially stable materials, represents the most extensive publicly available DFT convex hull of thermodynamic stability in our knowledge.Furthermore, 771 696 materials lie within a distance of less than 50 meV/atom from the convex hull.The database encompasses various properties, including structure (lattice and atomic positions), energy distance to the convex hull, formation energy and direct as well as indirect bandgaps.Continuous expansion of the database is underway through further ongoing high-throughput searches.

OPTIMADE implementation:
Uses the optimade-python-tools 37 reference implementation and provides a list of extra properties with the _alexandria_ prefix.Prior to this development, the Alexandria database was made available solely as a static archive that users had to download in its entirety to explore, but now the OPTIMADE format supports filtering on both composition and the predicted stability of database entries.

BioExcel COVID-19
Database: A platform designed to provide online access to atomistic molecular dynamics trajectories for biological macromolecules related to the COVID-19 disease 12 .The project is part of the open access initiatives promoted by the world-wide scientific community to share information about COVID-19 research and integrate technology developed in previous biology related projects 6,58,59,148 .
OPTIMADE implementation: A web-server interface https://bioexcel-cv19.bsc.espresents the MD trajectories, with a set of quality control analyses and system information.Using an extension of version 1.1.0 of the OPTIMADE specification, a basic OPTIMADE server based on the optimade-python-tools has been set up, which provides the trajectory data at the /trajectories endpoint.This server also provides protein specific properties and metadata under database specific fields with the _bioxl_ prefix.Querying has been partially implemented, but is not yet available for all fields.No atomistic structures are shared, so the /structures endpoint is not available.

Computational Materials Repository (CMR)
Database: CMR is a repository of databases containing calculated atomic structures and basic properties of a broad set of materials.The CMR databases can be browsed online using a simple querying system or downloaded in various formats (https://cmr.fysik.dtu.dk).Currently, the CMR holds more than 30 different databases.
The flagship database in CMR is the Computational Two-Dimensional Materials Database (C2DB) 44,50,88 that contains structural, thermodynamic, elastic, electronic, magnetic, and optical properties of more than 15,000 two-dimensional (2D) monolayer materials computed using the GPAW 34,86 package.The core set of materials in C2DB have been obtained by extracting monolayers from experimentally known layered van der Waals crystals.Subsequently, new monolayers have been generated by systematic atom-substitution applied to the core materials, or using deep generative AI models 78 .Recently, the C2DB has been complemented by the BiDB database containing homobilayers formed by stacking 1,000 of the most stable monolayers from the C2DB in all possible commensurate configurations 94 .
OPTIMADE implementation: CMR implements the OP-TIMADE API through the OASE package, utilising the optimade-python-tools library.At present only a subset of C2DB is available via an OPTIMADE API, but in the future other CMR databases will also be available via OPTIMADE.

Crystallography Open Database (COD)
Database : The largest open access collection of experimental crystal structures 49 .It is widely used by the scientific community to explore different material categories such as superconductors 21 , metal-organic frameworks 11 , high entropy alloys 118 , organic molecules 17 as well as for conformer sampling 84 or custom force field generation 45 .Having a set of experimental structures readily available under the same format as is required for computational materials research is highly beneficial since these structures serve as initial points for material property calculations 127 or for the search of new materials 101 .They also serve as experimental points that theoretical computations can be checked against 138 .

OPTIMADE implementation:
The COD database currently implements version v1.1.0 of the OPTIMADE standard.The new OPTIMADE version will allow this implementation to be enriched with new features that are required for the faithful representation of experimental data, thus making computations from these data and comparisons of theory and experiments more accurate.Being an experimental structure database, the COD requires a slightly different data presentation than computational material databases.Structures from the COD are evaluated using a set of experimental data quality criteria, established by the IUCr and the chemical crystallography community 63 .An OPTIMADE response within the core features does not contain all of the necessary fields to convey these additional data elements; however, the OPTIMADE standard allows introducing database-specific fields in a regular way.As a result, all established crystallographic quality criteria are included into a COD response as COD-specific fields with the _cod_ prefix.This allows OPTIMADE to include experimental position and composition disorder information following the Crystallographic Information Framework 51 .

Joint Automated Repository for Various Integrated Simulations (JARVIS)
Database: A repository designed to automate materials design using classical force-field, density functional theory (DFT), machine learning calculations and experiments.
The JARVIS-DFT originated about 5 years ago and contains millions of properties materials with carefully converged atomic structures as well as tight convergence parameters and various exchange-correlation functionals.The JARVIS-DFT contains metallic, semiconducting, insulator, superconductor, high-strength, topological, solar, thermoelectric, piezoelectric, dielectric, two-dimensional, magnetic, porous, defect and various other classes of materials 22,142 .
OPTIMADE implementation: Based on the Django Rest Framework and the JARVIS-Tools packages to follow OPTIMADE protocols of filtering and curating data.JARVIS-DFT specific fields are included in the results with the _jarvis_ prefix.

Materials Cloud
Database: A platform created to enable sharing and dissemination of resources in computational materials science 127 .A major service offered is the archiving and publishing of research data for the community via the open Materials Cloud Archive service (https://archive.materialscloud.org).Moreover, several databases that are generated within the AiiDA 60,100,133 framework are published in the Materials Cloud Explore section, which enables users to interactively browse the data and its provenance.Curated visualisations of these databases are also provided in the Materials Cloud Discover section.These databases are accessible via the OPTI-MADE RESTful API.The databases are divided into flagship and contributed databases.The current flagship databases are MC3D and MC2D 15,87 hosting over 34,000 3D crystals and 3,000 2D crystals, respectively, providing properties of experimentally-known inorganic compounds obtained via DFT simulations.The contributed databases include 2D topological insulators, pyrene-based metal organic frameworks, high-throughput Wannierisation, SrTiO 3 -CeO 2 interfaces, tail-corrections in the molecular simulations of porous materials, hidden spontaneous polarisation in the chalcohalide photovoltaic Sn 2 SbS 2 I 3 , and the CURATED covalent organic frameworks database.The data has been used to investigate transport properties such as mobility 122,123 , to search for Z 2 topological order 80 , to screen and discover quantum Spin-Hall insulators 47,79 and Weyl semimetals 48 , to obtain tight-binding-like Wannier Hamiltonians in a fully automated fashion 103 , and to develop machine-learning methods for fast identification of lowdimensional materials 134 .
Current developments are focusing on a second "Materials Cloud Archive" provider, allowing submissions to Materials Cloud Archive to specify whether and how data contributed by users should be served via an OP-TIMADE API, to enable advanced federated search over archived data.The first proof-of-concept of this integration is a recently published dataset of novel electride materials 139,140 .The full list of OPTIMADE databases served by the Materials Cloud can be explored at the /main/v1/links endpoint, with a similar list for the Materials Cloud Archive available at /archive/v1/links .A landing page for both OPTIMADE providers is available at https://www.materialscloud.org/optimade.

OPTIMADE implementation:
The data on the backend of Materials Cloud is managed via AiiDA.Along with a custom REST API, AiiDA can serve data in the OPTIMADE format thanks to the AiiDA-OPTIMADE (https://github.com/aiidateam/aiida-optimade)plugin, that is thus also used to serve the main Materials Cloud data.The OPTIMADE implementation of the Materials Cloud Archive provider is instead based directly on the optimade-python-tools 37 package.In addition to the server implementations, Materials Cloud also offers users several web applications that include clients of the OP-TIMADE API, as we describe in more detail in Section 4.2.1.

Materials Platform for Data Science
Database: Materials Platform for Data Science (MPDS) serves the Pauling File dataset 95 .Started in 1993, Pauling File is the oldest privately funded initiative for the curation and standardisation of the published inorganic chemistry data * .Data is drawn from nearly 400 thousand publications and backs up such commercial products as Springer Materials, ICDD PDF 4+, ASM's Alloy Phase Diagram Database and Pearson's Crystal Data, MedeA, and AtomWork Advanced.
OPTIMADE implementation: MPDS presents curated experimental data of three types: crystalline structures ( /structures endpoint), physical properties ( /extensions/properties endpoint), and phase diagrams ( /extensions/phase_diagrams endpoint).These three data types are inter-linked into about 200 thousand distinct phases ( /extensions/phases endpoint).Any distinct phase is uniquely determined by the chemical formula, space group, and Pearson symbol.Furthermore, each distinct phase has the permanent integer identifier phase_id, e.g., see brookite.The MPDS OPTIMADE implementation is specifically designed for the low response time and high retrieval speed, therefore some expensive operators ( ANY , OR ) are currently not supported (cf.Table 1).

Materials Project
Database: This multi-institution, multi-national effort 66 aims at computing the properties of all inorganic materials and providing the data and associated analysis algorithms for every materials researcher free of charge.Currently, over 172k molecules and over 154k inorganic compounds are included in the database.The project was established in 2011 with an emphasis on battery research, but includes property calculations for many areas of clean energy systems such as photovoltaics, thermoelectric materials, and catalysts.

OPTIMADE implementation:
The Materials Project (MP) 66 makes use of the reference server implementation provided by the optimade-python-tools 37 .Since May 2022, Materials Project has been serving formation energy data via its OPTIMADE API.In June 2023, MP started exposing additional thermodynamic stability in the form of energy distance to the convex hull via OPTIMADE for all 154k materials in its core database.The convex hull distance to the Materials Project is one of the most important properties of theoretical structures for experimentalists and simulators alike, as it indicates whether a postulated material is potentially synthesisable.
The MP OPTIMADE integration is complemented by a convenient open-source pymatgen 90 interface in the form of the OptimadeRester class, designed to streamline access to these resources for existing users of pymatgen and the Materials Project.Additionally, efforts are underway to further expose the full set of MP summary data via the OPTIMADE endpoint under the _mp_ namespace, mirroring the complete set of data recorded in the emmet SummaryDoc.

Material-Property-Descriptor Database
Database: Material-Property-Descriptor Database (MPDD) is an extensive database (4M+) of ab initio relaxations of 3D crystal structures, combined with an infrastructure of tools allowing efficient descriptor calculation (featurization), as well as the deployment of ML models. 69he most critical feature of the MPDD is the retention of intermediate modelling data, including structure-informed descriptors, which typically cost orders of magnitude more computational time than any of the other steps performed during ML model deployment 68 .Thus, many ML models can be run at a small fraction of the original cost if the same descriptor (or, more commonly, a subset chosen through feature selection) is used.This benefit applies regardless of whether a model is just another iteration, e.g., fine-tuned to a specific class of materials like perovskites, or an entirely new model for a different property.Furthermore, MPDD's access to stored atomic structures and associated metadata has been shown to be useful, for instance, in the fully data-driven prediction of atomic structures (validated with DFT and experiments), allowing quick identification of unknown structures in Nd-Bi 61 and Al-Fe 119 systems.
OPTIMADE implementation: MPDD has a stable OPTI-MADE API that serves the entire core MPDD dataset, fully implementing v1.1.0 of the OPTIMADE standard through a server based on optimade-python-tools 37 .Making the MPDD available via OPTIMADE was initially challenging, as MPDD stores and exchanges data in a way that prioritises high throughput and low storage requirements, including binary data, making it difficult or slow to make MPDD queryable as an OPTIMADE API on-the-fly.However, issues have been resolved by establishing a self-updating mirror of the dataset where structures are made OPTIMADE-compliant during transfer and with most associated MPDD-specific data available under the _mpdd_ namespace, including dictionaries of metadata, properties, and descriptors.

Materials Properties Open Database
Database: The Material Properties Open Database (MPOD) 40,97 is a web-based, open access repository of experimentally determined quantitative information about the physical properties of crystalline materials.MPOD is oriented at design engineers, scientists, science teachers and students.Properties are generally treated as tensor magnitudes.In MPOD the compact matrix notation is applied.To bring an intuitive view of tensor properties, so-called longitudinal properties surfaces are displayed.3D printing of properties surfaces is implemented via creation of STL files.A dictionary of properties definitions is included.Eventually, comments are added.Syntax and notation in MPOD files are oriented towards matching IUCr standards and so tries to comply with CIF format.

OPTIMADE implementation:
The integration of OPTI-MADE with MPOD encompassed two distinct phases.Initially, the process entailed migrating all data from the MySQL database to MongoDB.This was followed by the mapping of MPOD objects to OPTIMADE, utilizing optimade-python-tools 37 .In this context, the prefix _mpod_ was employed to delineate specific database fields.

Materials Resource Registry
Database: A federated, decentralised registry of resources in the domain of materials science.It exposes these resources to users and machines via XML and OAI-PMH APIs 102 .OPTIMADE implementation: OPTIMADE has been added to the Materials Resource Registry as an API format that other services and datasets can link to, to indicate their own compliance.Materials Resource Registry's rich semantic description of databases, with regards to their scientific content, techniques, and material focus 82 , as curated by the provider, enables users to make expressive queries over OP-TIMADE providers, to narrow down which databases may be of interest to them.This makes it much easier to dis-cover data and direct clients to the resources that are scientifically the most relevant to them.

Matterverse
Database: A database of yet-to-be-synthesised materials predicted using state-of-the-art ML models, currently comprised of 31,664,858 hypothetical materials.The current structures were generated by combinatorial isovalent ionic substitutions on 5,283 binary, ternary, and quaternary structural prototypes from the 2019 version of the ICSD database.A critical enabler for this database is the Materials 3-body Graph Network (M3GNet) universal interatomic potential encompassing 89 elements of the periodic table 18 .Along with the information of lattice parameters, atom coordinates and E hull , matterverse.aialso provides the predicted formation energies, bandgaps (of multiple fidelities, including PBE, HSE and experimental), and bulk and shear moduli.As an ongoing effort, matterverse.ai is growing in two directions, (i) increasing the number of hypothetical materials via various structure generation strategies, and (ii) increasing the number of ML-predicted properties.
OPTIMADE implementation: Support of the OPTIMADE API is under active development, with so far successful mapping of data to the OPTIMADE format using the optimade-python-tools 37 .

NOMAD
Database: An open-source software and free service for managing and publishing FAIR 110 materials science data.NOMAD 31,32 was made publicly available in 2014; it provides over 12 million data entries from over 500 researchers 111 .Originally, NOMAD focused on ab initio codes based on density-functional theory (DFT), automatically extracting data and metadata from input and output files.Meanwhile, NOMAD was significantly expanded in scope by the consortium FAIRmat (fairmat-nfdi.eu).It now supports file types from over 60 simulation codes, it encompasses advanced many-body calculations, including GW, the Bethe-Salpeter equation (BSE), and dynamical meanfield theory (DMFT), and classical molecular dynamics simulations.It can cope with different types of experimental data.For instance, it provides support for electronic lab notebooks and the NeXus format.NOMAD can track data provenance in complex simulation and experiment workflows.
NOMAD enables individual researchers to make their data available to a wide range of possible clients and applications.All data is formally described through rich metadata schema 41 and can be analyzed with build in containerised tools and notebooks 109 .Data in NOMAD is provided through the OPTIMADE API, NOMAD specific APIs, and a rich graphical user interface with faceted search, (meta)data explorer, and visualisations for material properties.API access is particularly important for re-use of the data, e.g. with artificial-intelligence (AI) methods.A collection of AI tools is available in the NOMAD AI Toolkit 109 .Besides supporting the community with the central data infrastructure, NOMAD offers the same software 111 for local installation through NOMAD Oasis, which allows research groups to manage and provide their own research data individually and customise the software accordingly.
OPTIMADE implementation: NOMAD supports a full OP-TIMADE API implementation based on the 37 using the Elasticsearch database engine, and a web-based search interface that allows users to formulate queries based on the standardised OPTIMADE query strings.Furthermore, NOMAD users can search for related resources from all other OPTIMADE database providers in the OPTIMADE provider list.

Open Database of Xtals (odbx)
Database: A small database serving selected phase diagrams studied with ab initio crystal structure prediction techniques 36,52 .Recently, odbx has been used to ingest new materials discovery datasets into the OPTIMADE ecosystem as part of the optimade-misc.odbx.sciencesub-database 46,73,147 , as well as the GNome dataset 85 at optimade-gnome.odbx.science,as will be discussed in 4.1.2.
OPTIMADE implementation: odbx was created using the matador 36 and optimade-python-tools 37 packages.As well as serving the standard OPTIMADE properties, odbx also serves stability data (hull distances, formation energies) and the DFT parameters used to relax the structures under the _odbx_ namespace.odbx serves multiple distinct datasets; the links endpoint of the index base URL below can be used to retrieve them.

Open Materials Database (omdb)
Database: omdb provides materials properties and is maintained by the developers of the High-Throughput Toolkit (httk) 7 .It contains 205,264 structures for access via programmatic interaction using this toolkit.The structures are also accessible via a web interface.Recently, it is being integrated in the broader database effort Anyterial (https://www.anyterial.se/),which also includes the ADAQ database of point defects 27 .
OPTIMADE implementation: omdb uses the built-in implementation of the OPTIMADE API provided in httk.This implementation is written in Python using no dependencies beyond the Python standard library.Work is currently ongoing to extend the implementation to fully support version v1.2.0.

Open Quantum Materials Database
Database: The Open Quantum Materials Database (OQMD) holds over 1 million materials, consisting of both experimental and hypothetical compounds 106,120 .
The overarching interest of OQMD is to understand the competing stability between known and unknown compounds by generating large-scale convex hulls -a stable yet-to-be-synthesised material should fall along or close to this convex hull.In addition, OQMD grows and develops organically with interests in Wolverton group, including calculations targeting thermoelectrics 53 , battery materials 8 , and high-strength alloys 67 .
OPTIMADE implementation: OQMD currently utilises v1.0.0 of the OPTIMADE standard and will adopt newer versions of OPTIMADE to replace OQMD's current qmpy API.OQMD offers database specific properties through the _oqmd_ prefix, including formation energies, bandgap, and stabilities of compounds.

Comparison of data available
An important benefit of a universal API format such as OP-TIMADE is the ability to simultaneously request and unify results from different databases.While the key features of the databases are highlighted under the subsections dedicated to the respective providers in Sec. 3, a summarizing list of the implementations tested and confirmed to support the OPTIMADE API is shown in Table 1.These databases are all openly accessible and provide users with a broad range of materials classes, applications and modalities.
The table shows the return from three requests that explore materials that contain at least one element from Group 14 (N 1 ), and then constrains the search to cover only binary materials (N 2 ), and only ternary materials without toxic lead (N 3 ): /v1/structures?filter=elements HAS ANY "C", "Si","Ge","Sn","Pb" N 2 /v1/structures?filter=elementsHAS ANY "C", "Si","Ge","Sn","Pb" AND nelements=2 /v1/structures?filter=elements HAS ANY "C", "Si","Ge","Sn" AND NOT elements HAS "Pb" AND elements LENGTH 3 These queries directly duplicate the three detailed in the 2021 OPTIMADE paper 3 .The ability to repeat the query attests to how the OPTIMADE API helps with reproducibility in research.In addition, we provide the total number of structures served by each OPTIMADE API in the N tot column.
Comparison of this table to that in Ref. 3 bears witness to the growth and impact of OPTIMADE.We see several additional providers (Alexandria, BioExcel, CMR, JARVIS, MPOD, MPDD and 2DMatpedia) that now support OPTI-MADE, and several new databases hosted by pre-existing providers.Furthermore, among the databases that did support OPTIMADE in 2021, there has been an impressive growth in the volume of returned data, reflecting their continued efforts to assimilate further data.

Application of OPTIMADE to real-life problems
A key goal for the OPTIMADE API is for it to act as an enabling technology for materials discovery, design and other new research avenues.Feedback from users is crucial to motivate the future development of the API.Therefore, in the following sections, we spotlight several use cases of the application of the OPTIMADE API to real-life systems, firstly in Section 4.1 by supplying data for machine learning, and secondly in Section 4.2 by providing data for screening and other studies.We highlight examples which benefit from access to the wealth of data available in large databases (e.g., the hard-coating alloys database discussed immediately below), and examples that benefit from access to specialist data available only in the small and focused databases (e.g., Section 4.1.1).Furthermore, there has been additional use of OPTI-MADE in the literature: firstly how OPTIMADE has been a central tool to access materials data for materials discovery, secondly as a template for materials data curation and access, and thirdly through online web-based interfaces: Discovery The hard-coating alloys database (HADB) 74 exploited the OPTIMADE API to rapidly and easily provide the browser-based graphical web interface as well as a RESTful API.A second application of the OPTIMADE API was to query and retrieve an unprecedented volume of data to train an attentionbased crystal graph convolutional neural network to accurately predict the formation energy, total energy, bandgap, and Fermi energy of a broad range of crystals 136 .Finally, OPTIMADE has found application in materials discovery, where it was used in Ref. 54 to assess the novelty of a predicted structures in a highthroughput study on quaternary mixed metal chalcohalide perovskites using the optimade-python-tools client 37 .As these structures were ingested into an OPTIMADE-compliant database, in this case NO-MAD 111 , any future OPTIMADE queries for novelty in this chemical space will yield the results of this study.
Template Development of the OPTIMADE API has motivated and guided data access in other ongoing projects.For example, firstly OPTIMADE API collaborates with, and is being used in, the development of the FAIRmat metadata, dictionaries, and materials ontology 110 .The inclusion in other community efforts reflects the maturity and uptake of the OPTI-MADE API.A second example is the BIG-MAP Project, where the consortium plans to use the OPTIMADE API to guide the access of the data gathered in the Battery Interface Genome 16 .

Interfaces and integrations
The MarketPlace Project 128 has integrated the OPTIMADE Gateway (Section 2.3.2) into the platform, which will make it possible to perform OPTIMADE queries through its global search functionality.OntoTrans 91 has developed the Open Translation Environment to perform ontology-driven data pipelines to retrieve, parse, map, and transform data.As part of the project, an OPTIMADE plugin has been developed for the system, making it possible to request and digest OPTIMADE resources.The next steps include semantic mappings for the OPTIMADE data models for true semantic data interoperability.

Machine learning
Machine learning is a promising tool that is already having a significant impact in the materials sciences.Machine learning starts from already computed data about a system and trains a model to capture trends.The machine learning model can then make predictions and design materials quicker and more cost effectively than performing additional experiments.Machine learning relies on having a pool of historical data available.This is where the OPTIMADE API offers a significant boost, by opening access to a wide range of materials databases that hold complementary data.We high-light the help offered by the OPTIMADE API to machine learning with two case studies.

High-entropy alloys
High-entropy alloys are comprised of roughly equal parts of five or more elements.This endows the alloy with a high entropy of mixing, which in turn delivers excellent high-temperature properties, such as strength-to-weight ratio, corrosion resistance, and fracture resistance.These favourable properties have driven an acceleration in research into high-entropy alloys over the last decade, but this means that there is still relatively little historical data available for methods such as machine learning.
The complete dataset obtained using OPTIMADE is split into training and testing sets (4:1 ratio) while ensuring that ratio of entries from each provider is the same in both training and testing sets.The training set is used to train the "combined" model M-C.Data points from the providers P1, P2, and P3 that appear in the training set are used to train the models M-P1, M-P2, M-P3, respectively.The predictive power of the models is assessed by calculating the R 2 on the same test set.
We choose Random Forest Regressor (with default parameters, from the scikit-learn 1.2 package in Python 96 ) to construct the machine learning models for our example.Standard structural entries of the OPTIMADE specification, species_at_sites and lattice_vectors , are used to construct vectors codifying the composition of each material and also to calculate the density of each material.The 'composition vectors', described above, are used as input to machine learning models that are trained to predict the densities (output).Models that are trained on data from only one provider (M-P1, M-P2, and M-P3) perform poorly when tested on data from all the providers (R 2 = 0.316, 0.104, −2.79 for M-P1, M-P2, and M-P3 respectively).Meanwhile, the "combined" model that is trained on data from all providers (M-C) performs very well (R 2 = 0.995).A comparison of the R 2 values is shown in the top-left of Fig. 2.
We can get a better insight into the benefits of leveraging data from multiple providers by looking at comparison of actual density values and those predicted by the machine learning models for small random sampling of materials (scatter plot in Fig. 2).For models trained only on data from a single provider (M-P1, M-P2, and M-P3), the prediction is quite accurate when tested on data from the same provider (indicted by 'cross' markers).However, most of the predictive power is lost when tested on data from another provider ('dot' markers).This explains their poor R 2 values.Meanwhile, the model which is trained on data from all providers (M-C) retains its predictive power when tested on data from all the providers.The number of materials returned by each provider is shown in the bar-graph on the bottom right of Fig. 2. A Venn diagram of unique elements that appear in the materials from each provider are shown in the centre right in Fig. 2. Therefore, the OPTIMADE API offers the significant benefit to merge the information from the datasets together.

Materials discovery and accelerated design
Recent advances in AI-driven materials discovery have created an abundance of hypothetical crystal structures that are expected to be stable 18,46,73,85,137,147 .New datasets targeted towards materials discovery have been ingested and made available as OPTIMADE APIs within the odbx provider 36,52 .Typically, these datasets would only be explored by other materials discovery specialists, at least until a more established DFT database ran them through their pipelines.With OPTIMADE, this dissemination process can be automated and greatly accelerated.Anyone can register as a provider and have their novel crystal structures appear in searches by other data-driven applications, such as X-ray diffraction phase identification to be discussed below.As OPTIMADE is not limited to purely theoretical crystal structures, any future experimental confirmation of a structure could also be served through OPTIMADE and used to further improve generative models for materials discovery.
Another possible application is to repeat previous highthroughput materials design campaigns on the wider set of structures now available through OPTIMADE.As each structure is time-stamped, active or ongoing workflows can be implemented to constantly monitor and screen new crystal structures against the search criteria of the campaign, to avoid having to redo such searches from scratch.Structure-based property prediction models, e.g., MOD-Net (for small structure-property datasets) 28,29,76 or graphbased models (where larger structure-property datasets are available) 19,20,143 , can be leveraged to sift through huge swathes of available crystal structures, with the results being used to prioritise future calculations, attempts at synthesis, characterisation experiments or model retraining, by selecting for structures with combinations of particular target properties.

Data provision
The OPTIMADE API has also been used to provide data for a rich variety of other scientific analysis approaches, below we highlight three projects that take advantage of the comprehensive range of data offered by OPTIMADE.

OPTIMADE client: a web-based GUI to find and import structures
The primary input to a first-principles materials calculation is the structure of the system.Experimental or computational crystal databases are commonly used as sources for the input structures of first-principles simulation software.To find the target structure in these databases, a query with filter conditions needs to be prepared, and then the structure needs to be downloaded, inspected, possibly converted into a different format, and finally used in simulations.
To facilitate this goal, Materials Cloud 127 provides the OPTIMADE client, a web application to perform the struc-ture search task via a unified and user-friendly GUI, empowering users to not only generate and execute complex OPTIMADE queries, but also to provide immediate graphical access and visualisation of the resulting structures.The OPTIMADE client can be embedded in other applications or used as a standalone tool, which is hosted on Materials Cloud at optimadeclient.materialscloud.io.The filtering section of the GUI is shown in Fig. 3.A dropdown (at the top) is provided to select any of the known and automatically discovered OPTIMADE database providers.A periodic table widget allows users to select which elements need to be included (green) or excluded (red) in the compounds; additional filtering tools are also provided, such as for the number of elements and of sites, and for the dimensionality.An OPTIMADE query string is then produced (which can be optionally manually modified).After the search button is clicked, the OPTIMADE query is sent to the selected database provider.The results are curated and shown in the results widget.Here, the structures are visualised and can be downloaded.
Materials Cloud also provides various tools that leverage the power of the OPTIMADE client.One example is the Quantum ESPRESSO input generator, available as a tool at materialscloud.org/work/tools/qeinputgenerator.A structure can be sent directly to this tool using a button in the OPTIMADE client (shown at the bottom of Fig. 3).The Quantum ESPRESSO input generator enables any user to obtain a working input file for the Quantum ESPRESSO DFT code 42,43 , including an automated selection of all numerical parameters, by just specifying a crystal structure (either by uploading it, or by selecting it from OP-TIMADE).A second example is the Quantum ESPRESSO app (https://aiidalab-qe.readthedocs.io)developed within the AiiDAlab platform 144 .It allows users to run complex computational workflows from the web browser, using straightforward graphical user interfaces for structure selection (including via OPTIMADE), parameter selection and inspection of the results.

Automatic phase identification from X-ray diffraction
The Xerus (X-ray Estimation and Refinement Using Similarity) 9,64,93,132 software package implements procedures to refine and screen measured X-ray diffraction patterns of inorganic crystals against databases of crystal structures reported in the literature and beyond.By querying for all possible structures in a given chemical space, Xerus excels at multiphase fits and performs competitively against more specialised and compute-intensive models constructed with machine learning.Xerus uses a straightforward OPTIMADE interface to connect to the multiple databases hosted by members of the OPTIMADE consortia.The dynamic OPTIMADE providers list allows new databases to be automatically included in Xerus search results.Additional filtering parameters can be used to refine the searches towards materials stable (or predicted to be stable) at the experimental conditions (e.g., low temperature or high pressure).

Workflows for automated and simultaneous queries of different databases
BIOVIA Pipeline Pilot 25 is a scientific workflow system that allows users to automate calculations and visualise and report research results by graphically composing a protocol from hundreds of different configurable components.As a technology demonstrator, we investigate the intercalation voltage of a series of cathode materials with the workflow in Figure 4(a).It uses OPTIMADE to adopt the available structures and energetics from different providers and performs complementary calculations using the CASTEP DFT code 23 .
Results are shown in Fig. 4(b), where we plot the intercalation voltage of Li-Ni-O materials from the extracted VASP 70 data in Materials Project and NOMAD and compare predictions to those from CASTEP.The comparability of results illustrates the functionality of the workflow.Since databases contain different structures, the OPTIMADE API facilitates the process of materials investigations by aggregating the query of all of them.
MatCloud 81 is a cloud-based integrated highthroughput computational materials infrastructure, which is directly connected to computing clusters and material property databases 145,146 .Users worldwide can visually design structures, create and run simulation jobs through workflows, and retrieve crystal structures from multiple databases using OPTIMADE; all the user needs is a web browser.MatCloud provides a Graphical User Interface (GUI)-based environment for users to intuitively create, enact and monitor a workflow.The MatCloud workflow system includes a front-end workflow designer and a back-end workflow engine, and supports the creation of workflows by a drag-and-drop approach.
Fig. 5a shows a workflow that retrieves a Si 8 crystal structure from the MPDS database through OPTIMADE, and replaces two Si atoms with Ge to produce structures in the Si-Ge chemical space (Fig. 5b).The band structure and density of states (DOS) are then simulated respectively over the chemical space in a high-throughput manner.Figs.5c and 5d show the visualisation results for the band structure and total DOS of the structures.

Future of OPTIMADE
The rapid expansion of materials databases, the use cases presented, and the adoption of machine learning motivate the continued work on the OPTIMADE API.We describe below firstly the ongoing tutorials that introduce new users to OPTIMADE, secondly the workshops that enable the development and promulgation of the OPTIMADE API, and finally the features currently under development coming out of those recent workshops.

Workshops & tutorials
The OPTIMADE consortium originated from the workshop "Open Databases Integration for Materials Design", held at the Lorentz Center in Leiden, Netherlands in October 2016.There were follow-up workshops held at CECAM in Lausanne, Switzerland annually from 2018 to 2023, with events since 2020 also supporting remote attendees.The workshops in 2022 and 2023 were accompanied by on-site tutorials during the first two days.There was also a partner workshop, "Ontologies for machine learning driven materials design" held at Linköping University, Sweden in 2021.Going forward, the OPTIMADE API will undergo continued development through further annual workshops and publicly advertised monthly video calls.There are continual efforts to reach out to new databases to help accelerate their adoption of the format.
Several tutorial exercises have been developed and delivered at a variety of conferences and topical workshops, outside of the OPTIMADE annual meetings.Eight different exercises developed by the community are now hosted on the GitHub repository Materials-Consortia/optimade-tutorialexercises, ranging from the basics of the OPTIMADE filter syntax and URL structure, to machine-learning pipelines operating on database-specific properties, all the way up to hosting an OPTIMADE API for a new database.

Upcoming features for future OPTIMADE releases
The ongoing real-life use cases of the OPTIMADE API have identified a series of opportunities to extend the API and make it more applicable to a wide range of materials systems.The features currently under development include: /trajectories endpoint So far, OPTIMADE can only be used to describe static structures.We are, however, working to expand the OPTIMADE specification, so that OPTIMADE can be used for sharing trajectory data as well.Such data can originate from structural optimisations, or from Monte Carlo and molecular dynamics simulations.These trajectories could be used  /collections endpoint It has been suggested several times whether it would be possible to have a method to create groups of entries.For example, all the structures that were used to generate a certain machinelearning potential, or all the structures that pertain to a certain research project.We therefore plan to introduce a /collections endpoint, which would contain metadata for the collection as a whole, as well as references to all the individual entries that belong to this collection.
Ontologies and semantics Building on the expanded property definitions outlined above, semantic mappings can be developed to existing and indevelopment materials and crystallography domain ontologies.For example, properties of an OPTIMADE structure (such as the specification of periodicity or types of disordered occupation) can be mapped into concepts in the crystallography domain ontology 26 under development within the Elementary Multiperspective Material Ontology (EMMO) ecosystem. 33his ontology is being created as a collaborative effort that includes members of the OPTIMADE consortium.These mappings of OPTIMADE properties into ontologies facilitate the alignment with other semantic data interoperability frameworks.Examples of such use include the ability to reference properties standardized by OPTIMADE in, e.g., future EMMO-aligned domain ontologies and giving access to property data via ontology-based GraphQL server generation. 75ILES property So far, OPTIMADE has mostly been designed around describing inorganic crystal structures.There are, however, plenty of materials that are (at least partially) comprised of organic constituents.To make it easier to find and select these, we plan to implement a SMILES field 141 and a SMARTS filter 121 for the organic parts of a structure.
Biomolecular fields One major use case of the upcoming /trajectories endpoint is to handle biomolecular structures, which can only be described statistically as a trajectory with multiple configurations.The OPTI-MADE specification will therefore be extended to standardise the various dynamical fields and order parameters that are used to describe biomolecular structures.This covers fields that are typically stored in PDB files such as insertion codes, Chain IDs and sequence infor-mation of proteins, DNA and RNA.

Proliferation of domain-specific namespaces
As mentioned in 2.2, we have already added a mechanism for multiple providers to collaborate on subsets of shared property definitions.It is hoped that many such namespaces will arise to improve coverage of OPTIMADE data across various domains of atomistic science, potentially including the cheminformatics and biomolecular-focused fields above.Other immediate targets include stability information from density-functional theory calculations and magnetic properties.
Large language models (LLMs) LLMs have emerged as an exciting frontier for data science and machine learning 14,65 .We are now considering two uses of LLMs within OPTIMADE.First, a large language model can help a non-expert formulate a query for OPTI-MADE, for example the query in section 2 could be found by requesting "tell me the structure of an oxide of silicon"; this could be readily performed by providing the LLM with the specification text or the machine-readable schemas, then constructing the relevant query with in-context learning 14 .A second use is to pass the large language model either textual data or a scan of a page of historical data, which can then be readily parsed to extract out relevant numbers for an OPTIMADE database.The value provided by OP-TIMADE here is to give a machine-actionable scaffold that an LLM can be validated and evaluated against, in such a way that the data produced is automatically compatible with other initiatives.

Conclusion
The OPTIMADE API provides users with easy access to many of the world leading materials databases.Since the initial release, the OPTIMADE API has not only been adopted by scientists as a tool to drive innovation, but furthermore served as a template for data curation.In this paper, we have provided use cases for how the breadth of data made available through OPTIMADE enables discovery in both academia and industry.The development of the OPTIMADE API has continued apace.Major new support for enhanced property definitions and partial data formats have recently been added and will underpin future work on trajectories and biomolecular data.The concept that subconsortia of databases are responsible for the definition of new sets of shared properties will accelerate the extension of the OPTIMADE API to other disciplines and fields.
Through monthly meetings, and with the continuing support of CECAM, the developers are continuing to extend the range of properties accessible via OPTIMADE APIs.Plans to both expand the format to cover challenges arising from dealing with molecular dynamics data, and continued outreach to support the adoption of the API by additional databases, will further expand the range of scientific use cases that OPTIMADE enables.Increasingly, use cases will take advantage of the unique advantages OPTIMADE has to offer; namely the robust and straightforward aggregation and data unification from the multitude of growing and federated data sources.

Fig. 1 .
Fig.1.Schematic for how databases with different backends may provide their own web-based user interfaces and database-specific APIs alongside the OPTIMADE API.A single client can then interact with the databases via the OPTIMADE API without having to be aware of these differences.

Fig. 2 .
Fig.2.Scatter plot showing the comparison between actual density values and those predicted by the models trained on data obtained using OPTIMADE (M-P1, M-P2, M-P3 for three providers, and M-C trained on all data).For a particular provider, the × symbols are those points validated against blind data from the same provider, the • symbols those validated against blind data from a different provider.The top inset shows the R 2 values for each of the models.The bottom inset shows the number of entries returned by each provider.The inset on the centre-right shows the Venn diagram of unique elements in the entries returned by each of the three providers.

Fig. 3 .
Fig. 3.The search interface of the standalone OPTIMADE client.The user can select any OPTIMADE-compliant database and make a query based on filters specified through GUI elements.

Fig. 4 .
Fig. 4. (a) Workflow architecture for automated simulations using BIOVIA Pipeline Pilot.The workflow queries the structure and energies from the main and archive databases, analyses the available data, and performs CASTEP calculations for the missing data.(b) Screenshot of the app that reports the corresponding electrochemical properties, such as Intercalation Voltage and Crystal structures.

Fig. 5 .
Fig. 5. (a) MatCloud workflow of crystal structure retrieval through OPTIMADE, and job setup.(b) Visualisation of one of the Si-Ge structures.(c) Computed band structure for the Si-Ge structure.(d) Total DOS of the Si-Ge structure.
D.B, A.H. and M.O.acknowledge funding by the European Union's Horizon 2020 programme under grant agreements No. 675728 (BioExcel), No. 823830 (BioExcel-2), No. 720270 (Human Brain Project SGA1), No. 785907 (Human Brain Project SGA2), No. 945539 (Human Brain Project SGA3) and by the Horizon Europe Programme under No. 101093290 (BioExcel-3) and No. 101094651 (Molecular Dynamics Data Bank).K.C. acknowledges funding support from the CHIPS Metrology Program, part of CHIPS for America, National Institute of Standards and Technology, U.S. Department of Commerce.X.Y. acknowledges the funding support from National Natural Science Foundation of China (NSFC) under the grant No. 62376258.

Table 1 .
3he table from Ref.3recreated in March 2024, with new providers.The final column indicates the total number of structures served by each OPTIMADE API.Providers that serve multiple databases are indicated with ⋆. Results for Materials Cloud and Materials Cloud Archive have been aggregated under the same title.The corresponding values from the 2021 paper 3 are provided in brackets, where appropriate.