A database of molecular properties integrated in the Materials Project †

Advanced chemical research is increasingly reliant on large computed datasets of molecules and reactions to discover new functional molecules, understand chemical trends, train machine learning models, and more. To be of greatest use to the scientific community, such datasets should follow FAIR principles ( i


Introduction
With advances in scientific workflow automation and high-performance computing, it has become increasingly facile to generate large datasets of molecules, 1 materials, 2 and reactions, 3,4 as well as their computed and predicted properties. 15-22 While the wealth of data available to researchers is a boon, not all data are equally useful. It is increasingly recognized that for data to maximally benefit the scientific community, they should follow FAIR principles: 23 they should be findable (the data can be easily searched using rich metadata and unique identifiers or IDs); accessible (the data are as open to the public as possible and can be reached using standard communication protocols); interoperable (the data can be readily combined with other data or used with a wide range of tools); and reusable (the data contain many useful attributes relevant to the domain of interest, have provenance allowing for verification of their accuracy, and are licensed in such a way as to allow others to employ them in their own work).
Since the advent of the Materials Genome Initiative in the United States, 24,25 a number of databases of materials and their computed properties have been developed. Many of these databases, including the Open Quantum Materials Database (OQMD), 26,27 Novel Materials Discovery (NOMAD) repository, 28 and the Materials Project, 29 aspire to follow FAIR principles. Though they vary in scale, the types of materials contained, and the properties reported, these repositories are all alike in that they have web interfaces where users can easily search and visualize data as well as application programming interfaces (APIs) that allow programmatic access to a wide range of data and metadata, enabling individual users with knowledge of computer programming to more easily navigate large collections of materials properties and allowing these databases to be integrated into other applications.
In contrast, few FAIR databases of calculated molecular properties exist. It remains common for computational chemistry data to be presented as a single unit (for instance, a zipped file that cannot be easily searched), or worse, not be publicly shared at all. The Molecular Sciences Software Institute's QCArchive 30 and the Public Computational Chemistry Database Project (PCCDB) 31 are noteworthy and laudable examples of quantum chemical databases approaching FAIR standards.
QCArchive hosts large collections of internally generated and user-submitted data, including the popular QM9 32 and ANI-1 datasets. 33 The data on QCArchive can be downloaded in HDF5 format from their web site or can be accessed through a representational state transfer (REST) API with a high-level Python client, making it accessible and interoperable. QCArchive data is also reasonably findable and reusable. Molecules in QCArchive are given unique IDs. However, at the time of writing, it is not possible to search for specific molecules in the datasets listed on the web interface. Moreover, data visualization tools are either limited or nonexistent, making it difficult for users to discover or digest the data without downloading and sifting through large collections. In terms of reusability, QCArchive boasts an enormous collection of molecules and datapoints with provenance based on over 10 million calculations, but the available data are often limited in scope and applicability. Many of the datasets included in QCArchive contain relatively few properties (for instance, only structures and electronic energies), meaning that the data can only easily be applied to very specific tasks, e.g. training ML force fields for molecular dynamics.
PCCDB hosts data from PubChemQC, a collection of electronic structure properties for more than 2 million molecules taken from the PubChem database. 34 PCCDB has a web app that allows users to search for molecules with particular properties and then visualize those molecules, their absorption spectra, and their molecular orbitals. Calculation inputs are available through the web interface, providing users with some means to access (meta)data about e.g. calculation parameters. An API is also available, and the standard is specified in the web site's documentation. However, no client for this API has been released, which nontrivially increases the burden for end users to interact with the data and especially to download large collections of data for e.g. high-throughput screening or ML applications. Like QCArchive, data in PCCDB is limited in scope, with a strong emphasis on excited state and optical absorption properties. In our assessment, data in PCCDB is findable and interoperable but is somewhat lacking in accessibility and reusability.
In order to continue the advancement of data-driven chemical research, new platforms are needed that emphasize ease of access and diversity of data and data attributes. Here, in an effort to fill this need and support computational chemistry and chemical ML research, we report an extension of the Materials Project for calculated molecular data which we call the "Materials Project for Molecules", or "MPcules" for short. We have developed a database schema and modular data processing pipeline that allows molecular DFT calculations to be converted into rich molecule and molecular property documents with unique, robust, and chemically meaningful IDs. This data pipeline can be used to add data to MPcules or to develop bespoke datasets. As a means to access the data in MPcules, we have expanded the Materials Project API and associated Python API client. Further, we have developed and released a new application (app) on the Materials Project web site allowing users to visualize the data in MPcules without any programming knowledge. MPcules currently contains more than 170 000 molecules assembled from more than half a million DFT calculations. It is envisioned as a dynamic database that will continue to grow both in terms of the number of molecules as well as the number and types of properties included. In this paper, we describe the methods used to construct MPcules and report on the current status of the database.

Quantum chemical methods
All data currently included in MPcules are directly calculated or derived from DFT calculations. Specifically, all calculations were performed with the Q-Chem electronic structure code, using either version 5 or 6. 35 Calculation automation and initial processing of DFT inputs and outputs relied on the fireworks, 36 custodian, 37,38 and atomate 39 Python libraries.
At present, the calculations that make up MPcules use a small set of DFT methods. Specifically, calculations have been performed using three exchange-correlation functionals: the range-separated hybrid generalized gradient approximation (GGA) functionals ωB97X-D 40 and ωB97X-V 41 and the range-separated hybrid meta-generalized gradient approximation (meta-GGA) functional ωB97M-V. 42 These are paired with three basis sets from the def2 family with polarization and diffuse functions added: def2-SVPD, def2-TZVPPD, and def2-QZVPPD. 43 Many calculations were performed in vacuum, but calculations using the polarizable continuum model (PCM) 44 and the solvent model with density (SMD) 45 implicit solvent methods are also included. We note that while these calculation methods reflect the data currently in MPcules, the database can easily accept calculations applying any functional and basis set included in Q-Chem.

Database construction
The MPcules database is constructed using the emmet Python packages. emmet-core defines "data models" or "documents" (using the pydantic data validation framework) that represent everything from the output of a DFT calculation to a molecule or a specific property; emmet-builders describes how raw calculation outputs can be converted into molecule and molecule property documents (defined in emmet-core) and how these documents should be inserted as entries in a database (MPcules, like most of the Materials Project, uses a MongoDB NoSQL architecture). Lastly, emmet-api defines how MPcules can be queried to obtain the data that has been built. Here, we elaborate on the structure of MPcules and how the database is constructed.

Assigning priority to calculations
As discussed above ("Quantum chemical methods"), MPcules can accommodate calculations that use many different levels of theory, where we define "level of theory" as the combination of a density functional, basis set, and solvent method. Therefore, when a particular property has been calculated using multiple levels of theory, we must rank them in order to retain and report only the "best" property available.
Each component of the level of theory (the functional, basis set, and solvent method) is assigned a score. Because the accuracy and appropriateness of computational methods depend sensitively on the property of interest and the types of molecules being considered, these scores are inherently arbitrary and heuristic in nature and are based on e.g. previous benchmark studies and simple rules. Further details regarding the scoring of the components of level of theory are provided in the ESI.†
While one solvent method may be considered more accurate or reliable than another, the same cannot be said of solvents themselves. That is, a calculation using PCM parameterized with ε = 80 (roughly approximating an aqueous medium) is no more or less accurate than one parameterized with ε = 7 (approximating the dielectric of e.g. tetrahydrofuran). Rather, calculations performed with different solvent media are better or worse suited for particular applications. When tabulating molecular properties, we therefore select the best level of theory available for each solvent medium. We note that calculations conducted in vacuum are ranked below those performed using implicit solvent methods, but vacuum properties are still reported when available, as we treat vacuum as a distinct solvent medium.
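The per-solvent ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the numerical scores, the choice to sum the three component scores, the relative ordering of the functionals, basis sets, and SMD versus PCM, and the names `Calculation`, `lot_score`, and `best_per_solvent` are all hypothetical and are not taken from emmet or from the ESI. The only conventions drawn from the text are that a lower score is better, that vacuum ranks below implicit solvent methods, and that the best calculation is chosen separately for each solvent medium.

```python
from dataclasses import dataclass

# Hypothetical scores (lower is better). The actual values are given in the ESI.
FUNCTIONAL_SCORES = {"wB97M-V": 1, "wB97X-V": 2, "wB97X-D": 3}
BASIS_SCORES = {"def2-QZVPPD": 1, "def2-TZVPPD": 2, "def2-SVPD": 3}
SOLVENT_METHOD_SCORES = {"SMD": 1, "PCM": 2, "VACUUM": 3}


@dataclass(frozen=True)
class Calculation:
    task_id: str
    functional: str
    basis: str
    solvent_method: str
    solvent: str  # solvent medium, e.g. "WATER" or "NONE" for vacuum
    energy: float  # electronic energy, used here only to break ties


def lot_score(calc: Calculation) -> int:
    """Combine the component scores (summing them is an assumption)."""
    return (
        FUNCTIONAL_SCORES[calc.functional]
        + BASIS_SCORES[calc.basis]
        + SOLVENT_METHOD_SCORES[calc.solvent_method]
    )


def best_per_solvent(calcs):
    """Return the best-ranked calculation for each solvent medium."""
    best = {}
    for calc in calcs:
        current = best.get(calc.solvent)
        key = (lot_score(calc), calc.energy)
        if current is None or key < (lot_score(current), current.energy):
            best[calc.solvent] = calc
    return best
```

Because vacuum is treated as its own solvent medium, a vacuum calculation is never displaced by a solvated one in this sketch, even though its solvent-method score is worse.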

Tasks
A single DFT workflow may correspond to one calculation (e.g. a single-point energy calculation or geometry optimization) or may be a collection of related calculations (e.g. a geometry optimization followed by a vibrational frequency analysis to confirm that the optimized structure is a local minimum of the potential energy surface or PES). In either case, the metadata, input parameters, and results of the calculation(s) are parsed by atomate and stored in a MongoDB database in a single "task document" (represented in emmet-core as a TaskDocument object). Tasks/TaskDocuments are the most fundamental collections of data used to construct MPcules, corresponding almost directly to the parameters and raw outputs of DFT calculations.

Molecules
"Molecules" are central to MPcules. Most data in MPcules are conceptually organized and grouped by molecule. How we define the term "molecule" therefore affects how users will access and interact with data. Although chemists and physicists have intuitive understandings of what a molecule is, we must be careful in defining the term and consider how best to represent a molecule in a database.
What is a molecule? Conventionally, a molecule is defined as a group of two or more atoms that are bound together. We expand the term to include single atoms (e.g. H⁰) and monatomic ions (e.g. F⁻¹), as such species can be important for the calculation of molecular and reaction properties. For instance, single metal ions like Li⁺¹ are necessary to compute the binding energies of those ions to coordinating molecules.
A molecule can be minimally described by its chemical composition, charge, and spin multiplicity. This is in line with common written nomenclature for molecules and ions. As a small example, diatomic oxygen in the triplet ground state (³O₂) is differentiated by composition from the oxygen atom (O₁), by charge from a peroxide anion (O₂⁻²), and by spin from the singlet excited state (¹O₂). Notably, additional information may be needed to distinguish between ground and excited states. To specify beyond this starting point, there are two natural definitions: one based on PES, and another based on the idea of chemical bonding (Fig. 1).
In the first definition (Fig. 1a), a molecule is defined as a local minimum on a PES. The PES, in turn, is defined by the chemical composition, total number of electrons, spin multiplicity, and the DFT methods (level of theory and other calculation parameters) employed. In this definition, every unique atomic structure (in terms of interatomic distances and angles) corresponding to a local PES minimum obtained via a geometry optimization calculation is a different molecule. It is worth noting that this PES-based definition is used within the Materials Project's data for crystalline solids to define a unique "material".
In the second definition, it is the connectivity of a molecule (the way that atoms are linked to each other through chemical bonds and other interatomic interactions) that distinguishes molecules. Different local minima on a PES may correspond to structures with different bonds, but they may also simply be different conformational isomers (conformers). This definition is somewhat more complex than the picture based on PES, as it requires additional definitions and decisions. For instance, this definition relies on the idea of a "bond" and associated criteria determining when two or more atoms are or are not bonded. We note that it is extremely challenging to rigorously define chemical bonding, and ultimately, most definitions are arbitrary.
In MPcules, we use both the PES-based and the bonding-based definitions to construct molecules, as described below ("Building Molecules"). However, as most chemical observables of interest (including various spectra, electrochemical properties, and reaction properties like thermodynamics or kinetics) are averaged over different interconverting conformers, 46 we rely in most cases on the definition based on bonding.
Building Molecules. Molecules (MoleculeDocs in emmet-core) are constructed in two stages: association and collection. In the first (association) stage (Fig. 1a), tasks are grouped according to a PES-based definition of a molecule (i.e., each structure corresponding to a unique local minimum of a PES is a unique molecule). When tasks are initially grouped together, charge and spin multiplicity are not considered, because calculations could be performed which use structures optimized at one charge/spin state but compute the electronic structure at a different charge/spin state; for instance, this is necessary to compute the vertical electron affinity or ionization energy of a molecule. 47 The structures associated with each task (represented by pymatgen Molecule objects) are then compared, and tasks with structures that are identical within a tight tolerance (by default, the root-mean-squared deviation or RMSD ≤ 10⁻⁶ Å) are grouped together. If no task in a group corresponds to a geometry optimization and the structure in question is not a single atom (for which geometry optimization is not meaningful), then we cannot confirm if the structure is a local minimum of a PES, and so the group is discarded. For groups that remain, a single representative structure is chosen by ranking the geometry optimization calculations by level of theory (see "Assigning priority" above) and electronic energy, and the charge and spin of the associated MoleculeDoc are determined based on this "best" structure.
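The association stage can be illustrated with a simplified, standard-library-only sketch. It is not the emmet-builders implementation: tasks are represented here as plain dictionaries with hypothetical keys (`species`, `coords`, `type`), the RMSD is computed without any alignment or atom reordering (an assumption that only holds if the structures share a frame and atom order), and the function names `rmsd` and `associate` are illustrative. The tolerance default and the rule for discarding groups follow the text.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered coordinate lists (no alignment)."""
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


def associate(tasks, tol=1e-6):
    """Group tasks whose structures match within `tol` (in angstrom).

    Charge and spin are deliberately ignored at this stage. Groups with no
    geometry optimization are discarded unless the structure is a single atom.
    """
    groups = []
    for task in tasks:
        for group in groups:
            ref = group[0]
            same_species = task["species"] == ref["species"]
            if same_species and rmsd(task["coords"], ref["coords"]) <= tol:
                group.append(task)
                break
        else:
            # No existing group matched, so this task starts a new one.
            groups.append([task])
    return [
        group
        for group in groups
        if len(group[0]["species"]) == 1
        or any(task["type"] == "opt" for task in group)
    ]
```

A single-point calculation on a geometry that was never optimized ends up in a group of its own and is dropped, whereas a single-point calculation on an isolated atom is kept.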
In the second (collection) stage (Fig. 1b), the structures of associated MoleculeDocs are compared on the basis of connectivity. Though we can define bonding using several methods informed by quantum chemistry (see "Molecular Properties" below), for the purpose of collecting MoleculeDocs we need to choose a definition of bonding such that the connectivity of every molecule in MPcules can be determined regardless of what calculations have been performed. We use the bond detection algorithm included in the OpenBabel cheminformatics toolkit 48 and then identify missing coordinate bonds to metals with the metal_edge_extender defined in pymatgen. 37 This method is entirely based on heuristics and does not depend on any electronic structure calculations. Upon detecting bonds, we construct a molecular graph representation using the pymatgen MoleculeGraph functionality. When defining connectivity for a graph representation, we consider bonds to hydrogen atoms, which are always included explicitly in our 3D molecular structures. If there are multiple associated MoleculeDocs with the same formula, charge, spin, and connectivity, then we rank the different documents (again in terms of the level of theory used to optimize the best structure and the associated electronic energy) and choose the best to represent the group. All other associated MoleculeDocs with the same formula, charge, spin, and connectivity are linked to this representative as "similar molecules".
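A minimal sketch of the collection stage follows. It makes a strong simplifying assumption: connectivity is compared as a set of bonded atom-index pairs, which is only valid when all documents share the same atom ordering; comparing molecular graphs in general requires an isomorphism check. The dictionary keys, the precomputed `lot_score` field (lower is better), and the function name `collect` are hypothetical.

```python
from collections import defaultdict


def collect(docs):
    """Collapse documents sharing formula, charge, spin, and connectivity.

    Each input is a dict with hypothetical keys: id, formula, charge,
    spin_multiplicity, bonds (pairs of atom indices), lot_score, energy.
    """
    groups = defaultdict(list)
    for doc in docs:
        # Order-independent representation of the bond list.
        bonds = frozenset(tuple(sorted(bond)) for bond in doc["bonds"])
        key = (doc["formula"], doc["charge"], doc["spin_multiplicity"], bonds)
        groups[key].append(doc)

    collected = []
    for members in groups.values():
        # Rank by level-of-theory score first, then by electronic energy.
        ranked = sorted(members, key=lambda d: (d["lot_score"], d["energy"]))
        best, rest = ranked[0], ranked[1:]
        collected.append({**best, "similar_molecules": [d["id"] for d in rest]})
    return collected
```

The representative document keeps references to the conformers it displaced, so their properties remain reachable from it.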
Molecules are assigned unique identifiers ("MPculeIDs") based on their chemical formulae, charges, spin multiplicities, and connectivity; further details regarding the MPculeID format are provided below (see "Unique Identifiers"). Likewise, tasks are given unique IDs defined by an (optional) prefix and some integer. MoleculeDocs store a list of the IDs of all tasks performed on the same geometry. Collected MoleculeDocs additionally store the IDs of the tasks that produced the "best" structure for each implicit solvent medium (including vacuum) for that molecule and the MPculeIDs of the "similar" associated MoleculeDocs (documents with the same connectivity, but with geometries representing different PES minima). This allows users to collect the properties of various conformers of a given molecule.

Molecular properties
MoleculeDocs and their underlying TaskDocuments contain all of the information about a molecule that is stored in MPcules. To aid in accessibility and reusability, we further process task- and molecule-level data to generate property documents. Typically, property documents are uniquely defined by the combination of MPculeID and solvent. In some cases, a property can be calculated or determined using multiple different methods; for instance, atomic partial spins can be defined using Mulliken population analysis 49 or the natural atomic populations determined by the natural bonding orbital (NBO) program. 50,51 For such properties, a property document is uniquely defined by MPculeID, solvent, and method.
At present, we generate property documents for the following properties: natural atomic and molecular orbitals (based on NBO); atomic partial charges; atomic partial spins; bonding; thermodynamics; vibrational properties; redox and electrochemical properties; and coordination or binding of metals. Basic details for these different properties are provided below, and a schematic of how collections of tasks, molecules, and properties are connected is shown in Fig. 2.
For molecules with multiple optimized structures for a given solvent medium (i.e., for cases where multiple associated MoleculeDocs were collapsed into a single MoleculeDoc during the collection stage), we only calculate properties based on the "best" structure. This ensures that comparable properties for the same molecule always refer to the same structure. We further note that, for atomic properties (e.g. atomic partial charges, atomic partial spins) or properties with atomic components (e.g. normal modes of vibration), we consistently use the same atomic indices as the pymatgen Molecule object for the "best" structure.
Natural atomic and molecular orbitals. NBO processes the optimized multi-electron wavefunction produced during a DFT self-consistent field (SCF) calculation. After first converting the atom-centered orbital basis into sets of natural atomic orbitals, natural hybrid orbitals, natural bond orbitals, and natural localized molecular orbitals, the NBO program can perform detailed atomic population analysis, analysis of lone pairs and bonds (including hyperbonds and 3-center bonds), and second-order perturbation theory analysis of donor-acceptor type orbital interactions. 50 For each atom, we store the number of core, valence, and Rydberg electrons, as well as the total number of electrons assigned to that atom. Lone pair information includes the orbital character of the lone pair (fraction of the lone pair made up of s, p, d, and f natural atomic orbitals), as well as its total occupancy. Similarly, for bonds we include the orbital character of each atom's contribution and the total occupancy, as well as information regarding the bond polarization. We also store information regarding orbital interactions, including the perturbation energy, the energy difference between the donor and acceptor orbitals, and the Fock matrix element for the interaction. For each type of hybrid orbital (e.g. lone pair or bond) or orbital interaction, we retain information regarding the atoms involved in the hybrid orbital(s) as well as the orbital type(s), using the codes from NBO outputs. For instance, lone pair orbitals are labeled "LP", while antibonding orbitals are labeled "BD*".
For molecules with unpaired electrons, NBO separates its analysis for α and β electrons. Accordingly, the orbital data on MPcules is structured differently for open-shell and closed-shell molecules. Closed-shell molecules have singular collections of populations, lone pairs, bonds, and interactions, while open-shell molecules have one collection of each type of property for α electrons and one for β electrons.
NBO version 7 significantly improved the bond-detection algorithm over version 5. 51 As a result, we currently only allow NBO 7 calculations to be included in MPcules. Users wishing to adopt our methodology should note that Q-Chem is packaged with NBO version 5 and uses this version by default, meaning that configuration of an external NBO application is necessary to benefit from the improvements and produce data that can be incorporated into MPcules.
Atomic partial charges. Atomic partial charges are determined from DFT calculations following SCF convergence. They can be obtained by assessing the population of orbitals in an electronic wavefunction, by partitioning the total electron density, by calculating the electrostatic potential, or by other means. 52 In MPcules, we currently calculate atomic partial charges using four methods: Mulliken population analysis, 49 the restrained electrostatic potential (RESP), 53 Bader charges 54 (obtained using the critic2 program), 55 and natural atomic populations via NBO. When other methods are available, we recommend against Mulliken population analysis, as the Mulliken method is known to depend strongly on basis set and in some cases to produce unphysical partial charges. 56 We include Mulliken partial charges because they remain widely used in computational chemistry and because Mulliken population analysis is performed by default in Q-Chem DFT calculations. We provide a comparison of Mulliken and NBO atomic partial charges in the ESI.†
Atomic partial spins. For molecules with unpaired electrons, atomic partial spins can be calculated in a manner analogous to atomic partial charges (for closed-shell molecules without unpaired electrons, the net spin on all atoms is always 0). Atomic partial spins are currently calculated using two methods: Mulliken population analysis and NBO natural atomic populations. We note (ESI Fig. S1-S5 †) that Mulliken atomic partial spins are better behaved than Mulliken atomic partial charges and are qualitatively similar to NBO-based partial spins.
Bonding. Bonds are a key molecular property, as we have already discussed ("Building Molecules"). Bonding documents (MoleculeBondingDocs in emmet-core) include a list of bonds (using indices to represent what atoms are bonded), bond lengths organized by bond type (e.g. "C-O" for bonds between carbon and oxygen), and a graph representation of the molecule, with bonds included as edges (using MoleculeGraph in pymatgen).
In addition to the heuristic method of defining bonds using OpenBabel and pymatgen, we can determine bonding in two ways: (1) with the method of Spotte-Smith, Blau, et al., 57 in which OpenBabel/pymatgen heuristic bonding is augmented with bonds identified by analyzing the critical points of the total electron density (using critic2) and (2) via natural bonding orbitals identified with NBO.
NBO reports bonds that form hybrid orbitals based on the sharing of electrons between atoms, in other words, covalent bonds. Ionic bonds, coordinate bonds, and other electrostatic interatomic interactions are not captured directly as bonds in the NBO output. However, such non-covalent bonds and interactions can be inferred from other NBO-reported quantities. Specifically, to identify metal coordinate bonds, we examine NBO's second-order perturbation theory analysis of the Fock matrix. If there is an interaction between a lone pair ("LP") orbital on a nonmetal (donor) and a lone vacant ("LV") or anti-Rydberg orbital ("RY*") on a metal (acceptor) where the metal is within 3.0 Å of the nonmetal and the perturbation energy is greater than or equal to 3.0 kcal mol⁻¹, then we determine the metal and nonmetal to be electrostatically bonded. These cutoff values were determined by manual inspection of a set of metal-containing complexes and are, like most definitions of bonding, arbitrary.
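The coordinate-bond criterion above amounts to a simple predicate, sketched here. The dictionary layout of an interaction record, the small `METALS` set, and the function name are hypothetical; the orbital labels and the two cutoffs come from the text. Treating "within 3.0 Å" as an inclusive bound is an assumption.

```python
# Illustrative subset only; a real implementation would cover all metals.
METALS = {"Li", "Na", "K", "Mg", "Ca", "Zn", "Al"}


def is_metal_coordinate_bond(interaction, distance,
                             max_distance=3.0, min_energy=3.0):
    """Apply the donor/acceptor, distance, and energy criteria.

    `interaction` is a dict with hypothetical keys describing one entry of the
    second-order perturbation analysis; `distance` is the donor-acceptor atom
    separation in angstrom; energies are in kcal/mol.
    """
    donor_ok = (
        interaction["donor_type"] == "LP"
        and interaction["donor_element"] not in METALS
    )
    acceptor_ok = (
        interaction["acceptor_type"] in {"LV", "RY*"}
        and interaction["acceptor_element"] in METALS
    )
    return (
        donor_ok
        and acceptor_ok
        and distance <= max_distance
        and interaction["perturbation_energy"] >= min_energy
    )
```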
Molecular thermodynamics. Typical DFT calculations produce as output an electronic energy, which can be used to determine the relative stability of different structures or calculate reaction energies. If a vibrational frequency analysis has been performed, the zero-point energy, as well as the total enthalpy, total entropy, and their components (vibrational, rotational, and translational), can be obtained; from this, one can calculate the molecular Gibbs free energy, which is often a more natural thermodynamic potential, particularly for comparison to experiments at constant temperature and pressure.
In order to obtain optimized geometries and calculate free energies at reduced cost, it is common for computational chemists to optimize structures at a relatively inexpensive level of theory (e.g. using a small basis set, or ignoring solvent effects) and then re-calculate the electronic energy with a single-point calculation using a more accurate and expensive level of theory (e.g. using a larger basis set or including an implicit solvent model). There are therefore two natural ways to calculate molecular thermodynamics: one in which all thermodynamic quantities of interest (electronic energy, enthalpy, entropy, etc.) are calculated from a single vibrational frequency analysis calculation, and another in which most properties are obtained from a vibrational frequency analysis but the electronic energy is instead obtained from a single-point energy calculation performed on the same structure at a higher level of theory.
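The two routes can be written compactly. The notation below is ours and rests on an assumption: that the enthalpy term taken from the frequency analysis, written here as H_corr, is a correction to the electronic energy (zero-point plus thermal contributions) rather than a total that already contains it. The superscripts "freq" and "SP" label the calculation that supplies the electronic energy.

```latex
% Route 1: everything from the vibrational frequency calculation
G_{\mathrm{uncorrected}} = E_{\mathrm{elec}}^{\mathrm{freq}} + H_{\mathrm{corr}} - T S

% Route 2: electronic energy replaced by a higher-level single-point value
G_{\mathrm{corrected}} = E_{\mathrm{elec}}^{\mathrm{SP}} + H_{\mathrm{corr}} - T S
```

In both cases the enthalpy correction and the entropy S come from the same frequency analysis, so the two free energies differ only by the difference between the two electronic energies.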
We construct thermodynamic property documents (MoleculeThermoDocs in emmet-core) using both methods. If, for a given solvent, one can produce a MoleculeThermoDoc both with and without a single-point energy correction, then the scores for the "best" uncorrected document (based on the level of theory used and the electronic energy) and "best" corrected document (based on the average of the scores of the levels of theory used for the vibrational frequency analysis and the single-point energy calculation, and the electronic energy) are compared, and the document with the better (lower) score is selected.
Vibrational properties. Vibrational frequency analyses produce a set of frequencies (related to the eigenvalues of the Hessian matrix) and associated normal modes (related to the Hessian eigenvectors). At present, we report these frequencies and normal modes, as well as the calculated infrared (IR) activities and intensities. From these quantities, it is possible to obtain a calculated IR spectrum of a molecule.
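The selection between corrected and uncorrected thermodynamic documents can be sketched as follows. The text does not specify how the level-of-theory score and the electronic energy are combined into a single ranking, so this sketch assumes the score is compared first and the energy is used only as a tie-breaker. The dictionary keys and the function name are hypothetical.

```python
def select_thermo_doc(uncorrected=None, corrected=None):
    """Pick the better (lower-scoring) of two candidate thermo documents.

    Each candidate is a dict with hypothetical keys:
      lot_scores: one score for an uncorrected document, or two scores
                  (frequency calculation, single point) for a corrected one
      electronic_energy: used here only to break ties (an assumption)
    """
    def ranking_key(doc):
        scores = doc["lot_scores"]
        # For a corrected document this averages the two levels of theory.
        return (sum(scores) / len(scores), doc["electronic_energy"])

    candidates = [doc for doc in (uncorrected, corrected) if doc is not None]
    if not candidates:
        return None
    return min(candidates, key=ranking_key)
```

With these conventions, a single-point correction at a much better level of theory wins even though its score is averaged with that of the cheaper frequency calculation.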
Redox and electrochemical properties. We calculate properties related to molecular reduction and oxidation using both the vertical and adiabatic approximations (Fig. 3). 47 In the vertical approximation, one does not allow the molecular atomic structure to relax upon accepting or donating an electron, under the assumption that electron attachment or detachment occurs much more rapidly than atomic rearrangement. Calculating a vertical electron affinity (EA) or ionization energy (IE) therefore requires only a single-point energy calculation on an optimized geometry with the charge shifted by −1 (for EA) or +1 (for IE). Since vertical EA and IE calculations involve only a single molecular structure, they can be calculated using a single MoleculeDoc and its associated tasks.
In the adiabatic approximation, one allows a reduced or oxidized molecule to fully relax. Calculating an adiabatic reduction or oxidation (free) energy therefore requires two optimized structures (and therefore two MoleculeDocs) at different charge states. Because it can be challenging to predict a priori how a molecule may decompose upon charge transfer, we neglect the possibility of dissociative redox reactions. Instead, when calculating adiabatic redox properties for a given molecule, we search for MoleculeDocs that have the same connectivity as that molecule (not including bonds involving metals), but with charge shifted by −1 (for reduction) or +1 (for oxidation). Reaction free energies are calculated using previously-constructed MoleculeThermoDocs (see "Molecular Thermodynamics"). If the oxidized and/or reduced MoleculeDocs can be identified, we also calculate reduction or oxidation potentials, referenced to the standard hydrogen electrode (SHE) using the relative potentials reported by Trasatti. 58
Metal coordination and binding. Metal coordination is important in a range of applications, including chemical separation, 59 organometallic chemistry, 60 and the design of electrolytes for energy storage and other applications. 61 We therefore collect information regarding the binding properties of metals in molecules, including the number, type, and length of coordinating bonds, as well as the thermodynamics of metal binding for the reaction A−M → A + M, where M is a metal and A is a coordinating molecule.
To determine metal binding properties (Fig. 4), we must first ascertain the charge and spin state of each metal in a given molecule. To do this, we round the atomic partial charge and the atomic partial spin of the metal atom in the molecule to the nearest integer. These atomic partial charges and spins are obtained from previously-constructed collections in MPcules (see "Atomic Partial Charges" and "Atomic Partial Spins" above). If the rounded charge and spin are incompatible (for instance, if a Mg atom were assigned a charge of +1 and a spin multiplicity of 1 (spin 0)) then the charge is shifted by +1 or −1 (whichever produces a charge which is closer to the calculated atomic partial charge). We shift the charge, rather than the spin multiplicity, because we have found that the spin state of metals is more often well described by DFT than metal charge states (see "Comparison of Atomic Partial Charges and Spin" in the ESI †).
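The rounding and consistency check can be sketched as below. Several details are assumptions made for illustration: the partial spin is interpreted as a number of unpaired electrons, compatibility is tested by comparing the parity of the electron count with the parity of the unpaired-electron count, and the names `ATOMIC_NUMBERS` and `assign_metal_state` are hypothetical. Note also that Python's built-in `round` rounds exact halves to the nearest even integer.

```python
# Illustrative subset of atomic numbers.
ATOMIC_NUMBERS = {"Li": 3, "Na": 11, "Mg": 12, "Ca": 20, "Zn": 30}


def assign_metal_state(element, partial_charge, partial_spin):
    """Return an integer (charge, spin multiplicity) for a metal atom.

    The partial charge and spin are rounded to integers. If the resulting
    electron count and number of unpaired electrons have different parity,
    the charge (not the spin) is shifted by one toward the partial charge.
    """
    charge = round(partial_charge)
    unpaired = abs(round(partial_spin))
    electrons = ATOMIC_NUMBERS[element] - charge
    if electrons % 2 != unpaired % 2:
        # Choose whichever neighbouring charge is closer to the raw value.
        candidates = (charge - 1, charge + 1)
        charge = min(candidates, key=lambda c: abs(c - partial_charge))
    return charge, unpaired + 1
```

For the example in the text, a Mg atom with a partial charge near +1.4 and a partial spin near 0 would first be rounded to charge +1 with multiplicity 1, which is inconsistent for 11 electrons, and would then be shifted to charge +2.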
Aer the oxidation and spin state of each metal have been determined, the bonding environment around the metal atoms are characterized using previously calculated bonding information (see "Bonding").For each metal, we then construct a MoleculeGraph of the molecule with that metal removed.Using this graph, we search for molecules with the same connectivity and the appropriate charge and spin multiplicity.If appropriate MoleculeDocs can be located for both the metal (M in the previous chemical equation) and the molecule without the metal (A), then we calculate the reaction thermodynamics using previously-constructed MoleculeThermoDocs (see "Molecular Thermodynamics").
Atomic partial charges, atomic partial spins, and bonding can all be determined in multiple ways, which means that there are numerous possible combinations that could determine the metal-binding properties of a molecule. However, with the aim of ensuring that the various methods used are as consistent as possible, we currently only allow two combinations of methods to be used. In the first, atomic partial charges, atomic partial spins, and bonding are all determined using NBO. In the second, the atomic partial charges and atomic partial spins are both calculated using the Mulliken method, and the bonding is determined using the default method combining OpenBabel with pymatgen.
We note that DFT can struggle to accurately predict the thermochemical properties of single atoms and ions, whether in vacuum or in implicit solvent. This may affect the accuracy of computed binding (free) energies.

Summary documents
Aer all property documents have been constructed, we compile a subset of calculated properties into a single document called a MoleculeSummaryDoc.Whereas property documents are uniquely dened by MPculeID, solvent medium, and sometimes method, the summary document is uniquely dened only by the MPculeID.Properties in the summary document that are not method-dependent are represented as key-value pairs, where the keys are the names of solvents used to calculate the property and the values are the properties calculated in those implicit solvent media.For properties that are method-dependent, the values are instead key-value pairs, with the keys being various methods and the values being the properties calculated using specic combinations of solvent and method.

Unique identiers
The principles of ndability and accessibility require that data be given IDs which can be used to search for and reference specic information.In addition to being unique and persistent, it is desirable (though less essential) for these IDs to carry chemical information and to be interpretable by human users.
Tasks.When tasks are inserted into the MPcules databasefor instance, aer a DFT calculation has completed-they are assigned a sequential numerical ID.We prepend these numerical IDs with a string (e.g."mpcule") to form a unique task ID.
Molecules.In the Materials Project database, 29 materials are given MPIDs which are derived from task IDs as described above.For instance, "mp-1518" represents CeRh 3 .While MPIDs are unique and persistent for a given task, they are not necessarily persistent for materials, as older calculations used to generate an MPID could be deprecated over time.Moreover, MPIDs do not carry any chemical information, human-interpretable or otherwise.
The most widely used representations for molecules which could be used as IDs are the simplied molecular-input lineentry system (SMILES) 62 and the International Chemical Iden-tier (InChI). 63SMILES has numerous pitfalls which make it inappropriate for use as a database ID.Most importantly, SMILES strings are not unique, and there can be several valid SMILES for the same structure.Though it is possible to generate unique "canonical" SMILES, 64 this fundamental lack of uniqueness makes searching for molecules by SMILES problematic.SMILES is also designed primarily for organic molecules and struggles to robustly represent metals and coordination complexes.As many of the molecules in MPcules contain coordinated metal atoms or ions, this is a severe limitation.The self-referencing embedded strings (SELFIES) devices by Krenn, Aspuru-Guzik, and colleagues, 65,66 signicantly improve on SMILES-most notably, by ensuring that all possible SELFIES strings represent chemically valid molecules-and can in principle support arbitrary metal bonding.However, at present, SELFIES can only be generated via SMILES, which ultimately means that many of the same pitfalls persist.InChIs are guaranteed to be unique-for a given molecular structure, there can be only one InChI-but the InChI generation algorithm explicitly ignores metal bonding, again meaning that metal-coordinated molecules with different coordination environments cannot be distinguished by InChI.
To overcome the limitations of existing IDs and molecular representations, we have devised a new ID format: the MPculeID (Fig. 5). The basic ID has four parts that are separated by hyphens; these four parts represent the connectivity, composition, charge, and spin multiplicity of the molecule. For connectivity, we generate a graph representation of the molecule (see "Building molecules") and hash it using the Weisfeiler-Lehman graph hashing algorithm 67 originally implemented in networkx. 68 This hash can be augmented with features of the nodes (atoms) or edges (bonds). In the association stage of molecule building, where MoleculeDocs are differentiated by their exact structure, we augment the graph with the Cartesian (XYZ) coordinates of the atoms. In the collection stage, where MoleculeDocs are distinguished by connectivity only (without concern for exact bond lengths, angles, etc.), we instead augment only with the string representation of the element (e.g. "Li" for lithium). To ensure consistency, when representing the composition, we always use the alphabetized chemical formula (e.g. "C1Li2O3" for lithium carbonate or Li2CO3). For molecules with negative charge, we prefix the charge with "m" instead of a minus sign "-" to distinguish from the hyphen separators.
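As a rough illustration of how the four parts fit together, the sketch below assembles an ID string from a precomputed graph hash. It is not the MPcules code: the function names are hypothetical, the hash is passed in as an opaque string (in practice it would come from a Weisfeiler-Lehman graph hash such as the one provided by networkx), and writing non-negative charges as plain integers is an assumption, since the text specifies only the "m" prefix for negative charges.

```python
from typing import Dict


def alphabetical_formula(composition: Dict[str, int]) -> str:
    """Alphabetized formula with explicit counts, e.g. C1Li2O3."""
    return "".join(f"{element}{composition[element]}"
                   for element in sorted(composition))


def format_charge(charge: int) -> str:
    """Negative charges get an 'm' prefix so '-' stays a pure separator.

    Assumption: zero and positive charges are written as plain integers.
    """
    return f"m{abs(charge)}" if charge < 0 else str(charge)


def make_mpcule_id(graph_hash: str, composition: Dict[str, int],
                   charge: int, spin_multiplicity: int) -> str:
    """Join connectivity hash, formula, charge, and spin with hyphens."""
    return "-".join([
        graph_hash,
        alphabetical_formula(composition),
        format_charge(charge),
        str(spin_multiplicity),
    ])


# Placeholder hash strings; real hashes are hexadecimal digests.
carbonate_id = make_mpcule_id("abc123", {"Li": 2, "C": 1, "O": 3}, 0, 1)
anion_id = make_mpcule_id("def456", {"C": 2, "H": 4}, -1, 2)
```

Because the hash itself contains no hyphens and the charge never does, splitting such an ID on "-" recovers the four fields unambiguously.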
The MPculeID comes closer to simultaneously meeting the goals of uniqueness, persistence, and interpretability. Though hash collisions (in which multiple distinct inputs are converted to the same hashed output) are essentially unavoidable with the Weisfeiler-Lehman hash or any other hashing method, it is exceptionally unlikely that any two molecules with different connectivities will nonetheless have the same hash, formula, charge, and spin. In practice, the MPculeID should therefore always be unique. Because the hashing algorithm is deterministic, the same graph input will always receive the same hash, meaning that MPculeIDs will not change over time. The Weisfeiler-Lehman algorithm further guarantees that graphs that are isomorphic produce the same hash, which means that these hashes can be used to compare molecular structures (acknowledging the possibility of hash collisions). Finally, though graph hashes are not human-interpretable, they do carry chemical information, and as the formula, charge, and spin information in the MPculeID are easily understood, users reading an MPculeID should be able to obtain a basic understanding of the underlying data.
Although the MPculeID format meets the basic requirements for a database ID format and overcomes certain key limitations of previous chemical identifiers, MPculeIDs have limitations of their own. For example, similar graphs do not in general produce similar Weisfeiler-Lehman hashes; these hashes therefore cannot be used to search for similar molecules, including molecules containing a particular substructure or functional group. There are also limits to the current implementation of MPculeIDs in the MPcules database that are not limitations of the basic format. As we have explained, when generating graph hashes for use in MPculeIDs, the graphs can be augmented with atom or bond features. Depending on how the graphs are augmented, different hashes will be produced, which can change if and how species are distinguished. As an example, consider chiral molecules. Different enantiomers have the same connectivity but are thought of as distinct because of their optical, structural, and (in some cases) reactive properties.
Because they are by definition non-superimposable, enantiomers can be distinguished by their MPculeIDs in the association stage (where the graphs are augmented with Cartesian coordinates). However, MPculeIDs in the collection stage cannot distinguish between enantiomers because we do not augment the graphs with any information about chirality.
Although existing identifiers like InChI are not sufficient for use as a unique identifier in the MPcules database, they are widely used and supported. As such, to improve interoperability with other databases, we associate InChIs and InChI-key hashes with each molecule and molecule summary document in MPcules. We intend for users to be able to search for molecules based on their InChI strings in the future.
Molecular properties. Though one could search for a property document using defining characteristics such as molecule ID, for convenience, we also define IDs for property documents. These IDs are generated by constructing a string with the identifying information for the document (including MPculeID, solvent, and, where relevant, method, as well as potentially other information used to generate the document); this string is then hashed using the BLAKE2 algorithm, 69 as implemented in the Python standard library. The uniqueness of a hash can in general not be guaranteed, but because there are other ways to access a desired property document using data that are essentially guaranteed to be unique, the relatively remote possibility of hash collisions is acceptable in the case of property documents.
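A property ID of this kind can be sketched with `hashlib.blake2b` from the Python standard library. The way the identifying fields are joined (a "|" separator), the digest size, and the function name are assumptions made for illustration; the text does not specify how MPcules builds the string.

```python
import hashlib
from typing import Optional


def property_id(mpcule_id: str, solvent: str,
                method: Optional[str] = None,
                digest_size: int = 32) -> str:
    """Hash the identifying fields of a property document with BLAKE2b.

    Assumptions: fields are joined with '|' and the digest is 32 bytes.
    """
    parts = [mpcule_id, solvent]
    if method is not None:
        parts.append(method)
    key = "|".join(parts)
    return hashlib.blake2b(key.encode("utf-8"),
                           digest_size=digest_size).hexdigest()


# Placeholder MPculeID; solvent and method labels are hypothetical.
charges_id = property_id("abc123-C2H4-m1-2", "WATER", method="nbo")
thermo_id = property_id("abc123-C2H4-m1-2", "WATER")
```

Since the hash is deterministic, rebuilding the same string always reproduces the same ID, while changing any field (for instance the method) yields a different one.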

Provenance
As we have noted, data provenance is vital to allow users to verify the accuracy of a calculation and to trace processed data back to the raw data that they are based on. Throughout the construction of MPcules, we include provenance information, allowing users to trace back to individual DFT calculations/tasks.
Already, we have mentioned how provenance information is stored during the construction of associated and collected MoleculeDocs (see "Molecules"). In addition, each property document stores the IDs of the other documents used to calculate the relevant properties. For data obtained from a single task (for instance, atomic partial charges), the task ID for that property is stored. For data obtained from other property documents, the property ID is stored. When data for a particular document comes from other documents with different MPculeIDs, then the MPculeIDs of those other documents are also stored. Finally, MoleculeSummaryDocs store the property IDs of all of the documents used to construct them, linked to the relevant solvent and (where relevant) method through key-value pairs.

Accessing molecular data
The data in MPcules can be accessed in three ways: (1) directly via the Materials Project API; (2) using the high-level Python interface to the Materials Project API, mp-api; or (3) via a web app on the Materials Project web site. Here, we briefly describe these means of accessing MPcules data. Further documentation can be found online (see https://api.materialsproject.org/docs and https://docs.materialsproject.org).

The Materials Project API
Upon making an account (https://profile.materialsproject.org/), users of the Materials Project gain access to an API key (https://next-gen.materialsproject.org/api). This allows users to interact with the Materials Project API.
A Materials Project API request begins with a uniform prefix (https://api.materialsproject.org/). All data in MPcules can be accessed via an API endpoint under the /molecules/ root endpoint; for instance, molecule summary documents can be obtained from the endpoint /molecules/summary/. Following these endpoints, query parameters can be provided to limit the results of the search.
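The structure of such a request (prefix, endpoint, then query parameters) can be sketched with the standard library alone. The helper name and the query parameter names `formula` and `charge` are assumptions for illustration; the OpenAPI documentation lists the parameters each endpoint actually accepts. The sketch only builds the URL and does not send a request, which would also require supplying the API key.

```python
from urllib.parse import urlencode, urljoin

API_ROOT = "https://api.materialsproject.org/"


def build_query_url(endpoint: str, **params) -> str:
    """Combine the uniform prefix, an endpoint, and optional query parameters."""
    base = urljoin(API_ROOT, endpoint.lstrip("/"))
    if not params:
        return base
    return f"{base}?{urlencode(params)}"


# Hypothetical query parameters for the molecule summary endpoint.
summary_url = build_query_url("/molecules/summary/", formula="C2H4", charge=1)
```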
Because the Materials Project API provides an OpenAPI-compliant specification, it is facile to incorporate this API into software using a variety of web frameworks and programming languages. However, to avoid having to interface with this specification directly, users can also apply the MPRester class implemented in the mp-api Python package. MPRester includes straightforward Python interfaces to each of the MPcules API endpoints. For example, to search for a molecule summary document with charge +1 and formula C2H4, one can write the following Python code:
In the MPcules database at the time of this writing, there is exactly one molecule with charge +1 and formula C2H4, so query_output will contain a list with one entry. Due to the quantity of data included in the MPcules summary collection, we will not show the entire output, but it is worth illustrating how one can interact with the results of a query:
This yields the following output:
We note that, in addition to obtaining complete task, molecule, property, and summary documents, we have also provided API endpoints that extract more targeted information. For instance, using the /molecules/tasks/trajectory/ endpoint, it is possible to extract information from a task related to a geometry optimization trajectory, including the structures, energies, and forces along that trajectory. This off-equilibrium data could be used, among other purposes, to train ML interatomic potentials. 22,70

The Molecules Explorer
The new Molecules Explorer web app is built using the Crystal Toolkit Python framework for data visualization, 71 as well as suites of custom React JavaScript components (mp-react-components) and Plotly Dash components (dash-mp-components). The root of the Molecules Explorer presents a search interface for discovering new molecules.
The Molecule Details Page visualizes data from the summary document of each molecule. It allows users to explore key computed properties under different solvent media and bonding schemes. At the top (Fig. 6), solvent-invariant properties (e.g. number of atoms, charge, and spin multiplicity) are shown alongside a 3D molecular visualization rendered with Crystal Toolkit. The solvent medium and bonding scheme can be selected from two drop-down boxes that determine the computed properties displayed on the rest of the page. Below this, a set of property sections is shown that closely maps onto the MPcules database schema. Namely, we have created sections for bonding, thermodynamic stability (containing molecular thermodynamics data), partial charges and spins (containing data on atomic partial charges and atomic partial spins), vibrations (containing information on vibrational properties), and redox stability (containing redox and electrochemical properties). We plan to add sections describing orbital information from NBO and metal binding properties.
Each property section consists of a data tab including the processed data from the summary document, a methods tab describing how the data was obtained from DFT and post-processing, and an API tab describing how users can programmatically access the data. For example, the data tab for the Partial Charges and Spins section (Fig. 7) consists of two dropdown menus to select the method for calculating charges and spins, a table of atomic charge and spin values, and a 2D molecular visualization of the molecule. Selecting rows in the table highlights the corresponding atoms in the molecular visualization and shows the total charge and spin of the selected atoms. This provides a map between atoms in the table and their position in the molecular topology. The other property sections each contain unique user interface elements. The Bonding section contains an interactive 2D visualization of bond distances, angles, and dihedrals; the Thermodynamic Stability and Redox Stability sections present tables of properties; and the Vibrations section contains an interactive simulated IR spectrum. At the bottom of the page, after the property sections, the parameters for the selected solvent method are shown.

The current state of MPcules
The main focus of this work is to describe a general computational infrastructure for processing, storing, and disseminating calculated molecular properties. We expect the data stored on MPcules to change and grow over time, and specific additions to the database will be discussed in future works. Nonetheless, we here briefly discuss the scale and scope of the MPcules database as it exists at the time of this writing.
MPcules currently contains data on 172 874 (collected) molecules (248 302 associated molecules) based on 568 004 tasks (Fig. 8). Most properties are present for all relevant molecules. Because atomic partial charges and electronic energy are calculated by default for all DFT calculations in Q-Chem, there is at least one partial charge document and one molecular thermodynamics document per molecule. Likewise, there is at least one bonding document per molecule (because bonding can be assessed from an optimized geometry without any further electronic structure calculations) and at least one atomic partial spins document for each open-shell molecule. While we do not strictly require that optimized structures be validated by performing a vibrational frequency analysis, all molecules currently in MPcules that are not single atoms have been subjected to such analyses. As such, almost all molecules in MPcules have vibrational properties calculated. Other properties (namely natural atomic and molecular orbital properties, redox properties, and metal binding properties) are available only for a subset of molecules, either because these properties require specialized calculations (e.g. NBO analysis must be performed for orbital properties, and single-point calculations at different charges must be performed to calculate vertical redox properties) or because the calculation of certain properties for a given molecule requires other molecules with specific structures and charges to be present (e.g. calculating a metal binding energy requires three molecules: the metal ion, the molecule-metal complex, and the same molecule not bound to the metal).
Molecules in the MPcules database do not come from a single source and are not selected based on any single set of criteria. We note that some of the data in MPcules has been previously released in different collections. Specifically, we previously reported the Lithium-Ion Battery Electrolyte (LIBE) dataset, 57 a collection of the properties of 17 190 molecules relevant to electrolyte decomposition and interphase formation in Li-ion batteries with carbonate electrolytes. More recently, we released the MAgnesium Dataset of Electrolyte and Interphase ReAgents (MADEIRA), 7 containing properties of 11 502 species relevant to electrolyte degradation and gas evolution in Mg-ion electrolytes consisting of magnesium bistriflimide dissolved in diglyme. Properties in LIBE and MADEIRA were calculated at the ωB97X-V/def2-TZVPPD/SMD level of theory. In addition to the molecules in LIBE and MADEIRA, MPcules contains molecules relevant to Mg-ion battery electrolytes with tetrahydrofuran electrolytes, as well as large numbers of small organic molecules, the properties of which have been calculated in vacuum and in many cases in an implicit solvent medium approximating water. As mentioned above, we intend to describe these data in further detail in future works.
Fig. 9 details the current composition of the MPcules database. In contrast to many existing molecular datasets, MPcules contains molecules with diverse charges and spin multiplicities (Fig. 9a and b). Currently, there are more ions in MPcules (98 480) than neutral molecules (74 394) and more radicals (89 715) than closed-shell molecules (83 159). Most ions have charge ±1, with nontrivial numbers of molecules with charge ±2. A very small number of ions with charge −3 (7) and +3 (6) are also included. These are all single atoms, the properties of which were studied in order to calculate redox properties. Currently, MPcules contains a relatively small number of triplets (2942); this presents a natural area for expansion.
In terms of elements (Fig. 9c), MPcules is skewed towards organic molecules containing C, H, N, and O. In large part because of the previous LIBE and MADEIRA datasets, there are many (>10 000) molecules containing F, Li, Mg, and S. While we do believe that MPcules is relatively diverse in terms of elements and chemical formulae, there are obviously many opportunities to expand its coverage through the addition of molecules containing B, P, halogens (e.g. Cl and Br), Si, or transition metals.

Future work
Just as the Materials Project has evolved from its initial release in 2011 to today, increasing in scale, scope, and structure, MPcules will continue to develop over time. We have already mentioned types of molecules that we intend to add to MPcules (e.g. transition-metal complexes and triplets). Here, we outline further plans to expand MPcules. We note that while we aim to internally develop the MPcules code(s) and dataset, we welcome user-submitted contributions of data as well as features (in the form of code contributions to emmet-core, emmet-builders, emmet-api, and the other Materials Project packages).

Calculation methods and sources
Currently, MPcules accepts only DFT calculations from Q-Chem. This means that we are presently excluding calculations using high-quality wavefunction methods based on e.g. coupled-cluster theory and multireference methods, which might be useful for benchmarking electronic structure methods or for Δ-machine learning of electronic energy and other molecular properties. At the same time, we exclude calculations based on semiempirical quantum chemistry methods such as self-consistent extended tight-binding (e.g. GFN2-xTB), 72 which have become increasingly popular for generating large datasets of molecules at low computational cost. Even within the narrower realm of DFT, the restriction to using a single electronic structure code could limit the ease with which users can contribute data to MPcules.
In the future, we intend to create a more flexible interface which can parse and construct molecule and molecule property documents from calculations using a variety of DFT and non-DFT methods with multiple quantum chemistry codes (e.g. xTB, 72 ORCA, 73,74 or NWChem). 75

Molecular properties
MPcules already contains diverse atomic, molecular, and reaction properties. In the future, we aim to expand the properties available, both to aid in chemical analysis and to enable the development of ML methods. In particular, we are interested in expanding properties in two directions: spectroscopy and electronic densities. At present, the only spectra included in MPcules are IR spectra obtained via vibrational analysis. With modest modifications to our existing workflows, we should be able to additionally obtain molecular Raman spectra, including Raman activities and intensities. We also intend to incorporate nuclear magnetic resonance (NMR) spectra, including chemical shifts and J-couplings. In terms of charge densities, NBO provides detailed information regarding atomic and hybrid orbitals but does not describe the spatial extent of such orbitals or the total charge density around atoms in a molecule. Inspired by the recently developed charge density dataset included in the Materials Project, 76 we intend to present total molecular charge densities to MPcules users via the Materials Project API and web site, as well as information regarding the electron densities of individual molecular orbitals.

Conclusions
As chemical research grows increasingly reliant on big data and ML approaches, high-quality and open datasets of molecular properties will become vital cornerstone resources, accelerating the understanding of existing chemical systems and the design of novel functional molecules with optimized properties. MPcules, expanding on the existing Materials Project database, is a database of molecular calculations adhering to FAIR principles. The MPcules database currently contains over 170 000 molecules which can be accessed through an API and featureful web app. MPcules is unique both because it grants users facile access to data and because that data is particularly diverse, containing many charged, open-shell, and metal-coordinated species as well as properties related to molecular bonding, electronic structure, thermodynamics, electrochemistry, vibrations, and reactions. Since MPcules relies on a suite of open source software, it is possible for users to add calculations to MPcules or to develop standalone datasets based on the same underlying schema. We believe that MPcules could serve as a community center for chemical datasets, with collaborative contributions of both code and data from users.

Fig. 1 Conceptual guide to molecule definition and construction in MPcules. (a) A molecule can be defined as a unique minimum of a potential energy surface, defined by some composition and structure (e.g. interatomic distances and angles). This definition is used in the first (association) stage of molecule building. (b) Alternatively, a molecule can be defined by its composition, charge, spin multiplicity, and connectivity. Different conformers (species with distinct structures but the same connectivity) can all be thought of as the same molecule. This definition is used for the second (collection) stage of molecule building, where we use OpenBabel and pymatgen to determine bonding.

Fig. 2 Diagram showing how different collections of tasks, (associated or collected) molecules, and properties are linked together in MPcules. An arrow from a source collection (box) to another destination collection indicates that documents in the source are used to construct documents in the destination.

Fig. 3 Calculation of redox and electrochemical properties in MPcules. (1) For a given molecule, the molecule document and related thermodynamics properties are needed. (2a) Vertical ionization energy (IE) and electron affinity (EA) can be calculated by identifying tasks with the same structure as the molecule of interest, but with charge shifted by +1 (for ionization energy) or −1 (for electron affinity). (2b) Adiabatic reduction and oxidation properties can be calculated by identifying molecules with the same connectivity (ignoring metal coordination) but with charge shifted by −1 (for reduction) or +1 (for oxidation). For clarity, the charge of each molecule and task is shown next to its structure.

Fig. 4 Calculation of metal binding properties in MPcules. (1) For a given molecule, the molecule document, along with atomic partial charge, atomic partial spin (if the molecule is open-shell), bonding, and thermodynamics information, must be available. (2) For each metal in the molecule, the atomic partial charges and spins (if applicable) are used to determine the metal's oxidation state. This specifies what non-bound species will be searched for. (3) Documents for the unbound metal and the molecule without that metal attached (no metal), along with their thermodynamic information, are sought. (4) Metal binding properties can be calculated using the thermodynamics of the base molecule, metal, and no metal.

Fig. 5 Explanation of the sections of an MPculeID, using doublet C2H4− as an example.

Fig. 6 The summary section of the Molecule Details Page.

Fig. 7 The Partial Charges and Spins section of a Molecule Details Page on the Materials Project web site. An interactive molecule visualization allows users to select atoms and see their atomic partial charges and spins; these are also represented in tabular form.

Fig. 8 Scale of the current MPcules database in terms of number of documents, broken down by type.

Fig. 9 Composition of the MPcules database. (a) Number of molecules with different charges; (b) number of molecules with different spin multiplicities; (c) number of molecules including different elements.