From the journal Digital Discovery Peer review history

Round 1

Manuscript submitted on 15 Aug 2023

Editor’s decision letter

09-Sep-2023

Dear Dr Persson:

Manuscript ID: DD-ART-08-2023-000153
TITLE: A database of molecular properties integrated in the Materials Project

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************

Reviewer comments

Reviewer 1

In this study, the authors present a database within the "Materials Project." After exploring the database via the provided web link, I have several comments:

1. Regarding the vibrational properties, could the authors incorporate mode-dependent frequencies? In practical applications, the correlated vibrational mode is of greater significance.

2. In Figure 4, the metal binding, represented as reaction A−M→A+M, typically does not function in solvents. Additionally, other properties such as spin and charge, as depicted in the database, are solely based on gas-phase conditions without considering interactions with solvent molecules. This limits the database's applicability in real-world reactions.

3. Figure 2 appears somewhat cluttered. While numerous lines depict similar functions, some connections are notably absent. For instance, there's no arrow linking the vibrational properties to metal bonding, which seems inaccurate.

Reviewer 2

Overall Comment:
The paper offers an overview of the current status of the Material Project for Molecules, which is still in its nascent stages.
While I recognize the significance of such an endeavor, I have a few comments and suggestions for improvement.

Comments:
1. On what criteria or methodology did the authors base their selection of molecules for this project?
(Almost the same number of molecules as LIBE dataset)

2. The choice to employ "molecule ID" as opposed to more conventional identifiers like InChI or SMILES is intriguing. Could the authors shed light on the rationale behind this decision? Additionally, the paper seems to lack a clear definition or explanation of what constitutes a "molecule ID." This omission could lead to potential confusion for readers unfamiliar with the term.
Also, there is no explanation of how you assign IDs (charge property_id, ... in Figure 4).

3. "However, at the time of writing, it is not possible to search for specific molecules in the datasets listed on the web interface.
Moreover, data visualization tools are either limited or nonexistent, making it challenging for users to explore or understand the data
without downloading and navigating through extensive collections."

Consider using http://pccdb.org/ for molecule searches. This platform also offers visualization tools for molecular orbitals, based
on the PubChemQC project. https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00083

4. The authors' decision to employ a variety of methods, especially the wB97 series, and different basis sets for molecules raises questions.
It is crucial to maintain consistency in the choice of quantum chemical methods and basis sets to ensure comparability and reliability
of results. What were the underlying motivations or considerations that led to this diverse selection?

5. "A molecule can be minimally described by its chemical composition, charge, and spin multiplicity."
It would be prudent to include "atomic coordinates" in this description, especially since the Potential Energy Surface (PES)
is discussed shortly after. Additionally, consider incorporating the term "ground state of the molecule" for clarity.

6. On page 6, I find the paper's definition of a molecule to be unsatisfactory. Please rewrite using more solid concepts.
The distinctions made between "the physical definition" and "the chemical definition" are somewhat ambiguous. It's widely acknowledged
that there isn't a concrete definition of a molecule, and quantum chemists consistently emphasize the importance of the PES. The concept
of chemical bonding can be nebulous, and this is a well-known fact among chemists.

Please reevaluate and refine the definitions provided. For instance, the statement, "Different local minima on a PES may correspond to
structures with different bonds, but they may also simply be different conformational isomers (conformers)" seems inaccurate. Molecules
with distinct bonds should be designated with unique names.

"The PES, in turn, is defined by the chemical composition, total number of electrons, number of unpaired electrons, and the DFT methods
(level of theory and other calculation parameters) employed."

Replace "number of unpaired electrons" with "spin multiplicity." Additionally, it's essential to mention the state of the molecule, whether
it's in its ground or excited state.

7. Please provide some example input and output. There is example code on p. 23; however, more detailed explanations
should be added.

Reviewer 3

Spotte-Smith and co-workers describe an application to enhance the usability of new and existing computational chemistry databases, under the FAIR principles. In the MPcules app many properties are stored and made accessible for each molecule, which will make it useful for many users.

Authors stress that different levels of theory can be used, but for some reason they only mention DFT. What about the more accurate coupled cluster calculations? There are quite a few such datasets, although likely not under the FAIR principles. Most of these are reviewed in a recent paper on datasets for machine learning and force field development. https://doi.org/10.1021/acs.jcim.2c01127

Although partial charges computed in different manners may be useful, the selection presented omits the CM5 charges that are used for instance in the OPLS force field. Since authors performed ESP calculations it would be extremely useful if also the ESP grid and potentials were shared for application with different charges models, and for empirical models that use charge sites that are not located on the atom (e.g. lone pairs) or models using explicit multipoles. In short, access to the ESP grids and potentials allows users to generate more advanced models for electrostatics than what atom-centered partial charges can do.

Authors briefly discuss the calculation of thermodynamic properties using approximate algorithms without reference to the plethora of methods available to do exactly this (e.g. Gaussian-n methods due to Curtiss, Ochterski etc., the complete-basis-set methods and the Weizmann methods). They should comment on the relative accuracy of these methods.

An important omission from the database seems to be off-equilibrium compounds. Structures, energies and forces on atoms are extremely useful in machine learning for both prediction of reactions and development of force fields. It would be good if authors commented on this omission anyway.

In summary, authors present an important new data set, and therefore the work should be published. However, without access to the ESP and with only optimized geometries for the compound, the use to the machine learning community will be limited.

Details:

Authors stress the importance of the FAIR principles, therefore it seems odd that the database they are using is based on commercial software, Q-Chem. Authors should comment on this and suggest ways to use open source quantum chemistry packages instead.

As regards assigning priority to calculations, this is a laudable effort (but difficult and somewhat arbitrary). If one’s purpose is to look up a certain property for a compound this is useful. In other cases, user might wish for the best set of calculations for many compounds, is this provided for as well? That is, can the app suggest the best level of theory for a potentially large range of compounds?

In addition, for transparency it should be easy for users to assess the accuracy of presented “best” results in the computational chemistry field as a whole.

Page 13: The unit of the RMSD should be specified.

Authors should comment on how MPculeIDs differ from other well-known compound identifiers such as InChI and InChIKey. Why do they introduce yet another identifier? If anything, this reduces the reusability of the database. For instance, it is easy to look up compounds by their InChI in PubChem or ChemSpider. Are these important resources expected to support yet another identifier? Alternatively, does the application presented here support those other identifiers? Also, CAS springs to mind in this context.

Page 23, authors should describe where to obtain an api-key.

Author response

We thank the reviewers for the time and thought that went into their reading of and responses to our manuscript. We have addressed the reviewers’ comments, questions, and critiques below.

Referee: 1

Comments to the Author
In this study, the authors present a database within the "Materials Project." After exploring the database via the provided web link, I have several comments:

1. Regarding the vibrational properties, could the authors incorporate mode-dependent frequencies? In practical applications, the correlated vibrational mode is of greater significance.

Response: Currently in the MPcules database, we include vibrational frequencies calculated using a harmonic approximation in which each normal mode is assumed to be independent and the potential energy surface at a given stationary point is assumed to be (locally) quadratic in all dimensions. If we understand the reviewer correctly, they are asking about correlations between normal modes, which break these assumptions. It is possible to calculate anharmonic vibrational modes and potentially their correlations using quantum chemical methods, but this is considerably more expensive and often laborious. Although we agree that such advanced analysis could be helpful for practical purposes, it would not be appropriate for a large database calculated using high-throughput methods.
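For reference, the harmonic approximation described in this response can be written out explicitly. The notation here is assumed for illustration and is not taken from the manuscript: Q_i are mass-weighted normal coordinates about a stationary point with energy V_0, ω_i are the harmonic angular frequencies, and φ_ijk, φ_ijkl are third and fourth derivatives of the potential energy surface. The first line is the harmonic model, in which each mode is an independent oscillator. Correlations between modes enter only through the higher-order terms in the second line, which the harmonic treatment drops.

```latex
V_{\mathrm{harm}}(\mathbf{Q}) = V_0 + \frac{1}{2}\sum_{i} \omega_i^{2} Q_i^{2}

V(\mathbf{Q}) = V_{\mathrm{harm}}(\mathbf{Q})
  + \frac{1}{6}\sum_{ijk} \phi_{ijk}\, Q_i Q_j Q_k
  + \frac{1}{24}\sum_{ijkl} \phi_{ijkl}\, Q_i Q_j Q_k Q_l + \dots
```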

Changes: None

2. In Figure 4, the metal binding, represented as reaction A−M→A+M, typically does not function in solvents. Additionally, other properties such as spin and charge, as depicted in the database, are solely based on gas-phase conditions without considering interactions with solvent molecules. This limits the database's applicability in real-world reactions.

Response: We respectfully disagree with the reviewer on these critiques.

It is true that metal ions tend to be fully solvated when in solution, meaning that they are unlikely to be isolated, as depicted in Figure 4 and implied by the reaction A−M→A+M. However, there are many situations in which metals shed all or part of their solvation shell. For instance, this occurs during metal intercalation and plating in metal-ion batteries (see e.g. Baskin et al., J. Phys. Chem. Lett. 2021, 12(18), 4347–4356; Xu et al., Langmuir 2010, 26(13), 11538–11543). Considering the binding energy of a metal atom or ion to a single solvent molecule, as reflected in A−M→A+M, is useful in analyzing these and related situations. It is also worth mentioning that many properties can be usefully approximated using only a single solvent molecule, rather than a full solvation shell. Hou et al. (Chem. Sci. 2021, 12, 14740) recently found that DFT-calculated reduction potentials for clusters Li-S1 and Li-Sn are essentially the same, where S is a solvent molecule and n > 1.

Regarding atomic partial charges and spins, we believe the reviewer might have had some misconceptions. We have not only calculated these properties in the gas phase. On the contrary, for tens of thousands of molecules included in our database, we have calculated atomic partial charges and atomic partial spins in an implicit solvent medium using either the polarizable continuum model (PCM) or solvent model with density (SMD). It is true that we have for now chosen not to focus on explicit solvation, and we acknowledge that the addition of explicit solvent will likely change properties such as partial charges and spins. However, many research groups studying solution-phase molecules and reactions rely on implicit solvent models because they can dramatically lower cost and computational complexity compared to using full explicit solvation. Aside from cost considerations (which are, as we noted above, particularly relevant for the construction of large databases), we would argue that the properties calculated using implicit solvation methods can be practically useful. Indeed, several of us recently used atomic partial charges calculated using the SMD implicit solvent model to rationalize selectivity in electrolyte decomposition reactions in Li-ion batteries (see Spotte-Smith*, Petrocelli*, et al., ACS Energy Lett. 2023, 8(1), 347–355).

Changes: None

3. Figure 2 appears somewhat cluttered. While numerous lines depict similar functions, some connections are notably absent. For instance, there's no arrow linking the vibrational properties to metal bonding, which seems inaccurate.

Response: We thank the reviewer for their feedback regarding Figure 2. We have attempted to spread the figure out to be less crowded.

There are no missing lines in Figure 2, and likewise, there are no arrows that are superfluous. The reviewer’s concern may have arisen out of a misunderstanding of what the figure is trying to convey. Figure 2 does not attempt to show how various physical or chemical properties are related. Rather, as we state in the figure caption, the point of this figure is to show relationships between different collections in the MPcules database, specifically as they pertain to database construction (i.e. what input collection is used to construct what output collection). While, for instance, vibrational properties could be related to metal binding, in our database vibrational properties are not used to inform metal binding properties. There is therefore no arrow needed between those collections.

Changes: Figure 2 has been modified to be more spread out and less crowded.

Referee: 2

Comments to the Author
Overall Comment:
The paper offers an overview of the current status of the Material Project for Molecules, which is still in its nascent stages.
While I recognize the significance of such an endeavor, I have a few comments and suggestions for improvement.

Comments:
1. On what criteria or methodology did the authors base their selection of molecules for this project?
(Almost the same number of molecules as LIBE dataset)

Response: In the Materials Project database of inorganic crystals, there is not one set of criteria that guides what materials are added. Some materials were obtained from other sources (e.g. the Inorganic Crystal Structure Database), while others were added for particular, project-specific purposes – for instance, discovering novel cathode materials for batteries.

Likewise, the molecules in MPcules do not come from a single source and are not added based on a single set of criteria. As we noted in our manuscript (and as the reviewer mentioned in their comment), some data comes from previously published datasets LIBE and MADEIRA, while other molecules are small organic or organometallic molecules. We chose not to discuss the origin of the non-LIBE/MADEIRA data in this manuscript, as we intend to describe the data in more detail in future manuscripts. To clarify for the reviewers, however, we can disclose that the additional data comes mainly from three sources: 1) additional studies related to battery electrolyte decomposition; 2) a campaign to predict the properties of organic hydrolysis reactions; and 3) an effort to re-calculate the properties of the molecules in the QM9 dataset (Ramakrishnan et al., Sci. Data 2014, 1(1), 1–7) at various charge and spin states (specifically, charge 0 spin multiplicity 1, charge 0 spin multiplicity 3, charge -1 spin multiplicity 2, and charge +1 spin multiplicity 2).

Changes: In the section “The Current State of MPcules”, we more explicitly indicate that specific data in MPcules will be described in future work and that molecules are not included based on a single set of guidelines.

“The main focus of this work is to describe a general computational infrastructure for processing, storing, and disseminating calculated molecular properties. We expect the data stored on MPcules to change and grow over time, and specific additions to the database will be discussed in future works. Nonetheless, we here briefly discuss the scale and scope of the MPcules database as it exists at the time of this writing.”

“Molecules in the MPcules database do not come from a single source and are not selected based on any single set of criteria.”

“In addition to the molecules in LIBE and MADEIRA, MPcules contains molecules relevant to Mg-ion battery electrolytes with tetrahydrofuran electrolytes, as well as large numbers of small organic molecules, the properties of which have been calculated in vacuum and in many cases in an implicit solvent medium approximating water. As mentioned above, we intend to describe these data in further detail in future works.”

2. The choice to employ "molecule ID" as opposed to more conventional identifiers like InChI or SMILES is intriguing. Could the authors shed light on the rationale behind this decision? Additionally, the paper seems to lack a clear definition or explanation of what constitutes a "molecule ID." This omission could lead to potential confusion for readers unfamiliar with the term.
Also, there is no explanation of how you assign IDs (charge property_id, ... in Figure 4).

Response: A discussion of the MPculeID was previously provided in the Supplementary Information of our manuscript. This has now been moved to the main text. We also summarize our discussion here.

When we began work on incorporating molecular data in the Materials Project, we considered many options for unique identifiers. We looked for a format that would be 1) unique (different molecules, properties, etc. would get different identifiers), 2) persistent (the IDs would not change over time), and 3) chemically meaningful.

In the crystalline materials database of the Materials Project, the “MPID” format is used, which consists of a prefix and an integer which is derived from the unique identifier of a material “task document”. For instance, diamond silicon has the MPID “mp-149”. We found the MPID unsatisfactory for several reasons. Most importantly, as we have learned from many years of work on the Materials Project, keeping MPIDs persistent can be challenging. If some of the tasks used to generate a material are deprecated, then the task ID used to generate the MPID could change. In addition, MPIDs do not carry any chemical information.

We also considered InChI, SMILES, and similar molecular string representations which do carry chemical information and which in principle can be made unique (though there can be many valid SMILES for a given molecular structure). Most of these formats are not designed to distinguish between metal coordination environments, which makes them inappropriate for our database, which contains tens of thousands of molecules with coordinated metals. In particular, the InChI standard explicitly disregards bonds to metals, so two molecules with the same covalent structure but with metals in different coordination environments will always be associated with the same InChI string. This makes InChI insufficient for use as a unique identifier in our case; however, we note that we do store InChIs and InChI-keys in MPcules. In the near future, we hope to allow users to search for molecules based on these values.

In the end, we reluctantly decided that the best option was to create our own ID specification, the MPculeID. This format consists of four parts – a hash of the molecular graph structure, the chemical formula, the charge, and the spin multiplicity. Though the graph hash is not human-readable, it is chemically meaningful (two molecules represented by isomorphic graphs are guaranteed to have the same hash). Given our definition of a molecule, these IDs are guaranteed to be unique for a given molecule, and they are persistent, because no part of the ID is tied to a particular calculation.
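The four-part format described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MPcules implementation. The helper `wl_graph_hash` is a simplified, standard-library-only Weisfeiler-Lehman refinement, not the networkx routine. The function names, the ordering of the four parts (taken from the order in which they are listed), and the plain rendering of non-negative charges are all assumptions. The strings it produces will therefore not match real MPculeIDs.

```python
import hashlib
from collections import Counter


def _digest(text, digest_size=16):
    """Hex digest of a short ASCII string (BLAKE2b is an arbitrary choice here)."""
    return hashlib.blake2b(text.encode("ascii"), digest_size=digest_size).hexdigest()


def wl_graph_hash(elements, bonds, iterations=3):
    """Simplified Weisfeiler-Lehman hash of a molecular graph.

    elements: element symbol per atom (used as the initial node label).
    bonds: iterable of (i, j) atom-index pairs (undirected edges).
    NOTE: a stand-in for illustration; it is not networkx's implementation.
    """
    neighbors = {i: [] for i in range(len(elements))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    labels = dict(enumerate(elements))
    histogram = []
    for _ in range(iterations):
        # Relabel each atom from its own label plus its sorted neighbor labels.
        labels = {
            i: _digest(labels[i] + "".join(sorted(labels[n] for n in neighbors[i])))
            for i in neighbors
        }
        # Record the label multiset; sorting makes it independent of atom order.
        histogram.extend(sorted(Counter(labels.values()).items()))
    return _digest(str(tuple(histogram)))


def mpcule_id(elements, bonds, charge, spin_multiplicity):
    """Assemble a hypothetical 'hash-formula-charge-spin' identifier."""
    # Alphabetized formula with explicit counts, e.g. C1Li2O3.
    formula = "".join(f"{el}{n}" for el, n in sorted(Counter(elements).items()))
    # Negative charges use an 'm' prefix so they do not clash with the hyphens.
    # Assumption: non-negative charges are written as plain integers.
    charge_str = f"m{abs(charge)}" if charge < 0 else str(charge)
    return "-".join(
        [wl_graph_hash(elements, bonds), formula, charge_str, str(spin_multiplicity)]
    )


# Lithium carbonate: C bonded to three O, two of which each bind one Li.
atoms = ["C", "O", "O", "O", "Li", "Li"]
bonds = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 5)]
print(mpcule_id(atoms, bonds, charge=0, spin_multiplicity=1))
```

Because the relabeling uses only sorted neighbor labels, reordering the atoms of the same graph gives the same hash, while moving both Li atoms onto one oxygen changes it. This is the property that lets the hash distinguish metal coordination environments that InChI would merge.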

With regard to the property IDs, these were in fact discussed in the main text of our manuscript (see Molecular Properties in the previous version of the manuscript). However, to improve clarity, we have moved all discussion of unique identifiers to a new subsection, “Unique Identifiers”.

Changes: We reorganized our discussion of unique identifiers, creating a “Unique Identifiers” subsection repeated below:

“Unique Identifiers

The principles of findability and accessibility require that data be given IDs which can be used to search for and reference specific information. In addition to being unique and persistent, it is desirable (though less essential) for IDs to carry chemical information and to be interpretable by human users.

Tasks
When tasks are inserted into the MPcules database – for instance, after a DFT calculation has completed – they are assigned a sequential numerical ID. We prepend these numerical IDs with a string (e.g. “mpcule”) to form a unique task ID.

Molecules
In the Materials Project database, materials are given MPIDs which are derived from task IDs as described above. For instance, “mp-1518” represents CeRh3. While MPIDs are unique and persistent for a given task, they are not necessarily persistent for materials, as older calculations used to generate an MPID could be deprecated over time. Moreover, MPIDs do not carry any chemical information, human-interpretable or otherwise.

The most widely used representations for molecules which could be used as IDs are the simplified molecular-input line-entry system (SMILES) and the International Chemical Identifier (InChI). SMILES has numerous pitfalls which make it inappropriate for use as a database ID. Most importantly, SMILES strings are not unique, and there can be several valid SMILES for the same structure. Though it is possible to generate unique “canonical” SMILES, this fundamental lack of uniqueness makes searching for molecules by SMILES problematic. SMILES is also designed primarily for organic molecules and struggles to robustly represent metals and coordination complexes. As many of the molecules in MPcules contain coordinated metal atoms or ions, this is a severe limitation. The self-referencing embedded strings (SELFIES) devised by Krenn, Aspuru-Guzik, and colleagues significantly improve on SMILES - most notably, by ensuring that all possible SELFIES strings represent chemically valid molecules - and can in principle support arbitrary metal bonding. However, at present, SELFIES can only be generated via SMILES, which ultimately means that many of the same pitfalls persist. InChIs are guaranteed to be unique – for a given molecular structure, there can be only one InChI – but the InChI generation algorithm explicitly ignores metal bonding, again meaning that metal-coordinated molecules with different coordination environments cannot be distinguished by InChI.

To overcome the limitations of existing IDs and molecular representations, we have devised a new ID format - the MPculeID. The basic ID has four parts that are separated by hyphens; these four parts represent the connectivity, composition, charge, and spin multiplicity of the molecule. For connectivity, we generate a graph representation of the molecule (see “Building molecules”) and hash it using the Weisfeiler-Lehman graph hashing algorithm originally implemented in networkx. This hash can be augmented with features of the nodes (atoms) or edges (bonds). In the association stage of molecule building, where MoleculeDocs are differentiated by their exact structure, we augment the graph with the Cartesian coordinates (XYZ) of the atoms. In the collection stage, where MoleculeDocs are distinguished by connectivity only (without concern for exact bond lengths, angles, etc.), we instead augment only with the string representation of the element (e.g. “Li” for lithium). To ensure consistency, when representing the composition, we always use the alphabetized chemical formula (e.g. “C1Li2O3” for lithium carbonate or Li2CO3). For molecules with negative charge, we prefix the charge with “m” instead of a minus sign “-” to distinguish from the hyphen separators.

The MPculeID comes closer to simultaneously meeting the goals of uniqueness, persistence, and interpretability. Though hash collisions - in which multiple distinct inputs are converted to the same hashed output - are essentially unavoidable with the Weisfeiler-Lehman hash or any other hashing method, it is exceptionally unlikely that any two molecules with different connectivities will nonetheless have the same hash, formula, charge, and spin. In practice, the MPculeID should therefore always be unique. Because the hashing algorithm is deterministic, the same graph input will always receive the same hash, meaning that MPculeIDs will not change over time. The Weisfeiler-Lehman algorithm further guarantees that graphs that are isomorphic produce the same hash, which means that these hashes can be used to compare molecular structures (acknowledging the possibility of hash collisions). Finally, though graph hashes are not human-interpretable, they do carry chemical information, and as the formula, charge, and spin information in the MPculeID are easily understood, users reading an MPculeID should be able to obtain a basic understanding of the underlying data.

Molecular Properties
Though one could search for a property document using defining characteristics such as molecule ID, for convenience, we also define IDs for property documents. These IDs are generated by constructing a string with the identifying information for the document (including MPculeID, solvent, and - where relevant - method, as well as potentially other information used to generate the document); this string is then hashed using the BLAKE2 algorithm, as implemented in the Python standard library. The uniqueness of a hash can in general not be guaranteed, but because there are other ways to access a desired property document using data that are essentially guaranteed to be unique, the relatively remote possibility of hash collisions is acceptable in the case of property documents.”
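The property-ID scheme in the quoted subsection (build an identifying string, then hash it with BLAKE2 from the Python standard library) might look roughly like the sketch below. The function name, field order, `|` separator, choice of `blake2b` over `blake2s`, digest size, and the example identifier and field values are all hypothetical; the manuscript excerpt does not specify them.

```python
import hashlib


def property_id(mpcule_id, solvent, method=None, extra=()):
    """Hash the identifying fields of a property document into a short ID.

    mpcule_id: identifier of the parent molecule.
    solvent: solvent label for the calculation.
    method: optional analysis method, included only where relevant.
    extra: any further identifying strings (hypothetical catch-all).
    """
    parts = [mpcule_id, solvent]
    if method is not None:
        parts.append(method)
    parts.extend(extra)
    # Assumption: fields are joined with '|' before hashing.
    key = "|".join(parts)
    return hashlib.blake2b(key.encode("utf-8"), digest_size=16).hexdigest()


# Hypothetical inputs, for illustration only.
example_molecule = "0123456789abcdef0123456789abcdef-C1Li2O3-0-1"
print(property_id(example_molecule, "water", method="nbo"))
print(property_id(example_molecule, "vacuum", method="nbo"))
```

Since the hash is deterministic, the same identifying fields always give the same ID, and changing any one field (here the solvent) gives a different one, apart from the remote possibility of a collision that the authors acknowledge.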

3. "However, at the time of writing, it is not possible to search for specific molecules in the datasets listed on the web interface.
Moreover, data visualization tools are either limited or nonexistent, making it challenging for users to explore or understand the data
without downloading and navigating through extensive collections."

Consider using http://pccdb.org/ for molecule searches. This platform also offers visualization tools for molecular orbitals, based
on the PubChemQC project. https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00083

Response: We thank the reviewer for bringing the PCCDB web site to our attention. We have included PubChemQC in our discussion of other databases of calculated molecular properties in the Introduction.

Changes:

“In contrast, few FAIR databases of calculated molecular properties exist. It remains common for computational chemistry data to be presented as a single unit (for instance, a zipped file that cannot be easily searched), or worse, not be publicly shared at all. The Molecular Sciences Software Institute's QCArchive and the Public Computational Chemistry Database Project (PCCDB) are noteworthy and laudable examples of quantum chemical databases approaching FAIR standards.

QCArchive hosts large collections of internally generated and user-submitted data, including the popular QM9 and ANI-1 datasets. The data on QCArchive can be downloaded in HDF5 format from their web site or can be accessed through a representational state transfer (REST) API with a high-level Python client, making it accessible and interoperable. QCArchive data is also reasonably findable and reusable. Molecules in QCArchive are given unique IDs. However, at the time of writing, it is not possible to search for specific molecules in the datasets listed on the web interface. Moreover, data visualization tools are either limited or nonexistent, making it difficult for users to discover or digest the data without downloading and sifting through large collections. In terms of reusability, QCArchive boasts an enormous collection of molecules and datapoints with provenance based on over 10 million calculations, but the available data are often limited in scope and applicability. Many of the datasets included in QCArchive contain relatively few properties (for instance, only structures and electronic energies), meaning that the data can only easily be applied to very specific tasks, e.g. training ML force-fields for molecular dynamics.

PCCDB hosts data from PubChemQC, a collection of electronic structure properties for more than 2 million molecules taken from the PubChem database. PCCDB has a web app that allows users to search for molecules with particular properties and then visualize those molecules, their absorption spectra, and their molecular orbitals. Calculation inputs are available through the web interface, providing users with some means to access (meta)data about e.g. calculation parameters. An API is also available, and the standard is specified in the web site's documentation. However, to our knowledge no client for this API has been released, which nontrivially increases the burden for end users to interact with the data and especially to download large collections of data for e.g. high-throughput screening or ML applications. Like QCArchive, data in PCCDB is limited in scope, with a strong emphasis on excited state and optical absorption properties. In our assessment, data in PCCDB is findable and interoperable but is somewhat lacking in accessibility and reusability.”

4. The authors' decision to employ a variety of methods, especially the wB97 series, and different basis sets for molecules raises questions.
It is crucial to maintain consistency in the choice of quantum chemical methods and basis sets to ensure comparability and reliability
of results. What were the underlying motivations or considerations that led to this diverse selection?

Response: While we agree with the reviewer that internal consistency is a desirable goal when designing a database, we chose in MPcules to make a compromise between consistency and flexibility.

To motivate our desire for flexibility, first consider the task of choosing a consistent level of theory for a database. Again, the reviewer is right that one should, ideally, use a single level of theory, and this level of theory should be the most accurate one that can be reasonably used given computational budgets. However, the most accurate level of theory available can (and does!) change over time. In the crystal structure database of the Materials Project, we have over the last several years undertaken an effort to adopt r2SCAN as our density functional of choice over our previous choice of PBE. r2SCAN has proven to be more accurate for a range of properties, and increasing computational power has made meta-GGA calculations of materials more tractable. But migrating to a new functional is expensive and time-consuming, and we have had to find ways for the PBE and r2SCAN data to coexist and interact (see e.g. Kingsbury et al., npj Comput. Mater. 2022, 8(1), 195). Moreover, different levels of theory (and particularly different functionals) may be more accurate in different situations (for an extensive review of functional accuracy, see Mardirossian & Head-Gordon, Mol. Phys. 2017, 115(19), 2315–2372). This is especially true when considering different solvent models and their associated parameters. As we note in the main text of our manuscript, a calculation performed with PCM in water (ε = 80) is no better in general than one performed in THF (ε = 7), but it is more appropriate for aqueous applications.

Putting aside the issue of accuracy for a moment, practical considerations drove us to allow multiple levels of theory in MPcules. MPcules is not a single-purpose, static dataset. Rather, we envision MPcules as a database that will, over time, grow by including calculations from many sources designed for different applications, like the QCArchive mentioned above. Allowing calculations using many different levels of theory makes it much easier for different researchers to add their data without having to follow rigid calculation parameters that we dictate.

We believe that the diverse levels of theory included in MPcules do not make the database any less useful. It is still possible to extract data from MPcules that are consistent in terms of level of theory, as the functionals, basis sets, implicit solvent models, and (where appropriate) solvent parameters for each property are stored in a molecule’s summary document. At the same time, we note that we are planning in the near future to perform additional calculations on each molecule at a single, consistent level of theory (e.g. ωB97M-V/def2-SVPD/vacuum) to ensure that at least some properties for all molecules are directly comparable.
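As a rough illustration of extracting data at a single level of theory, the sketch below filters a list of summary-like records by functional, basis set, and solvent. The record keys (`functional`, `basis`, `solvent`) and all values are hypothetical placeholders chosen for this example; they are not the actual MPcules schema, and "wB97M-V" is simply an ASCII spelling of ωB97M-V.

```python
# Placeholder records standing in for per-property metadata from summary documents.
# Keys and values are illustrative assumptions, not the real MPcules schema.
records = [
    {"molecule_id": "mol-1", "functional": "wB97M-V", "basis": "def2-SVPD", "solvent": "vacuum"},
    {"molecule_id": "mol-1", "functional": "wB97X-V", "basis": "def2-TZVPPD", "solvent": "PCM(water)"},
    {"molecule_id": "mol-2", "functional": "wB97M-V", "basis": "def2-SVPD", "solvent": "vacuum"},
    {"molecule_id": "mol-3", "functional": "wB97X-D", "basis": "def2-SVPD", "solvent": "vacuum"},
]


def select_level_of_theory(docs, functional, basis, solvent):
    """Keep only the records computed at one (functional, basis, solvent) combination."""
    wanted = (functional, basis, solvent)
    return [d for d in docs if (d["functional"], d["basis"], d["solvent"]) == wanted]


consistent = select_level_of_theory(records, "wB97M-V", "def2-SVPD", "vacuum")
print([d["molecule_id"] for d in consistent])
```

Only records sharing all three settings survive the filter, so any comparison made on the result is internally consistent even though the full collection mixes levels of theory.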

Changes: None

5. "A molecule can be minimally described by its chemical composition, charge, and spin multiplicity."
It would be prudent to include "atomic coordinates" in this description, especially since the Potential Energy Surface (PES)
is discussed shortly after. Additionally, consider incorporating the term "ground state of the molecule" for clarity.

Response: The reviewer’s point about distinguishing between ground and excited states is very valid. However, it is worth noting that MPcules contains molecules which are not in their ground state, so it would not be correct for us to limit our language to consider only ground-state molecules. As but one example, we calculated the properties of diatomic oxygen in the singlet state (1O2), while the ground state is a triplet (3O2). We have nonetheless added a mention of ground vs. excited states in our initial discussion of defining molecules.

Regarding “atomic coordinates”, we respectfully disagree with the reviewer. As our text says, the minimal description of a molecule includes its composition, charge, and spin multiplicity. If we write “1CO32-”, it will be commonly understood that we are referring to the carbonate anion in the ground state (singlet). We agree that information about bonding or about atomic coordinates should be used to describe a molecule more uniquely and completely, but this information is not strictly necessary in all cases. If anything, we concede that our minimal description might be slightly too restrictive, as one can imagine examples where even charge and spin are unnecessary for an understandable description (e.g. “CH3” will be read by many chemists as “the neutral methyl radical”).

Changes:

“A molecule can be minimally described by its chemical composition, charge, and spin multiplicity. This definition is in line with common written nomenclature for molecules and ions. As a small example, diatomic oxygen in the triplet ground state (3O2) is differentiated by composition from the oxygen atom (O1), by charge from a peroxide anion (O2-2), and by spin from the singlet excited state (1O2). Notably, additional information may be needed to distinguish between ground and excited states”

6. On page 6, I find the paper's definition of a molecule to be unsatisfactory. Please rewrite using more solid concepts.
The distinctions made between "the physical definition" and "the chemical definition" are somewhat ambiguous. It's widely acknowledged
that there isn't a concrete definition of a molecule, and quantum chemists consistently emphasize the importance of the PES. The concept
of chemical bonding can be nebulous, and this is a well-known fact among chemists.

Please reevaluate and refine the definitions provided. For instance, the statement, "Different local minima on a PES may correspond to
structures with different bonds, but they may also simply be different conformational isomers (conformers)" seems inaccurate. Molecules
with distinct bonds should be designated with unique names.

"The PES, in turn, is defined by the chemical composition, total number of electrons, number of unpaired electrons, and the DFT methods
(level of theory and other calculation parameters) employed."

Replace "number of unpaired electrons" with "spin multiplicity." Additionally, it's essential to mention the state of the molecule, whether
it's in its ground or excited state.

Response: To summarize our understanding, the reviewer raised the following issues with our definition of a molecule:
1. The terms “physical definition” and “chemical definition” may be misleading, as many chemists and in particular quantum chemists would favor a PES-based definition over a bond-based definition.
2. Chemical bonding is not well defined.
3. The reviewer takes issue with the idea that different local PES minima could correspond to conformers, rather than different unique molecular structures that would be given different names.
4. “Spin multiplicity” is preferred over “number of unpaired electrons”.
5. Again, the reviewer emphasizes the difference between ground and excited states.

To point 1, while we would argue that many chemists think in terms of chemical bonds over PES minima – whether those bonds are rigorously well defined or not – we agree that these names are arbitrary and probably unnecessary. We have removed the use of the terms “physical definition” and “chemical definition”.

To point 2, we agree that defining bonding is difficult, and we acknowledged this in the main text of our manuscript. We have further emphasized the difficulty in using a definition of a molecule based on the idea of bonds.

On point 3, we appreciate the comment by the reviewer, but based on our experience and data, we retain our position. If one performs two DFT geometry optimization calculations at the same level of theory with the same chemical system (i.e. same composition, charge, and spin multiplicity), but with different initial atomic coordinates, there are three options (ignoring the possibility of a calculation failing to converge):
1. Both calculations optimize to the exact same structure, with identical XYZ coordinates, bond lengths, bond angles, dihedral angles, etc.
2. The calculations optimize to structures with the same bonding (given some definition), but with slightly different bond lengths, different bond angles, etc. These optimized structures are conformers.
3. The calculations optimize to structures with different bonding.
In all three cases, the two geometry optimization calculations have converged to PES local minima. In cases 2 and 3, those minima are distinct, but thinking in terms of chemical bonds, it is reasonable to say that the two minima in case 2 are really the same molecule, just in different conformations. As we note in the main text of our manuscript, many molecular properties are based on averages over conformational ensembles, which lends legitimacy to the idea that conformers are different manifestations of the same molecule, even though they lie at different local PES minima. Only in case 3 would the two PES minima be called different molecules in a bonding-based description, and typically, only in case 3 would the two PES minima have different names in e.g. the IUPAC specification. We emphasize that this is not just a thought experiment. The three structures that we show in Figure 1a of the main text are all actual structures of different PES minima optimized at the same level of theory. By our chosen definition of bonding, they all have identical bonds (including both covalent and coordinate bonds), but they are in different conformations.
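The three outcomes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the procedure used in MPcules: both structures are assumed to share atom ordering and reference frame (no alignment is performed), bonds are assigned by a single hypothetical distance cutoff, and the coordinates are invented for a four-atom chain.

```python
import math
from itertools import combinations


def rmsd(a, b):
    """RMSD between two structures with identical atom ordering, without alignment."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def bonds(coords, cutoff=1.7):
    """Toy bond detection: any atom pair closer than `cutoff` counts as bonded."""
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if math.dist(coords[i], coords[j]) < cutoff
    }


def compare_minima(a, b, rmsd_tol=1e-6, cutoff=1.7):
    """Classify two optimized structures into the three cases discussed above."""
    if rmsd(a, b) <= rmsd_tol:
        return "identical"            # case 1: same PES minimum
    if bonds(a, cutoff) == bonds(b, cutoff):
        return "conformers"           # case 2: distinct minima, same connectivity
    return "different bonding"        # case 3: distinct minima, distinct connectivity


# Invented coordinates for a four-atom chain (arbitrary length units).
trans = [(-0.5, 1.4, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, -1.4, 0.0)]
cis = [(-0.5, 1.4, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)]
broken = [(-0.5, 1.4, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (4.0, 0.0, 0.0)]

print(compare_minima(trans, trans))
print(compare_minima(trans, cis))
print(compare_minima(trans, broken))
```

Under a PES-based definition, cases 2 and 3 are both "different molecules"; under the bonding-based definition, only case 3 is.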

We have changed to use the term “spin multiplicity”, in line with the reviewer’s preference.

As we alluded to above, there are cases in our database where we have performed calculations on the same molecule at the same charge state but at multiple different spin multiplicities (e.g. singlet and triplet oxygen). Of course, only one of these can be the ground state. During database construction, we must define molecules before we do any direct comparisons (in e.g. the oxygen example, we must group the tasks associated with the singlet state and the tasks associated with the triplet state before we can say which one is lower in energy). This means that we do not know a priori if a given molecule is in its ground or excited state, and so our definition of a molecule cannot and does not make such a ground-vs.-excited distinction.
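The ordering constraint described here (group first, compare energies afterwards) can be sketched as follows. The task records, field names, and energies are placeholders invented for illustration; they are not MPcules data or its schema.

```python
from collections import defaultdict

# Placeholder task records; energies are arbitrary illustrative numbers.
tasks = [
    {"task_id": "task-1", "formula": "O2", "charge": 0, "spin_multiplicity": 3, "energy": -1.00},
    {"task_id": "task-2", "formula": "O2", "charge": 0, "spin_multiplicity": 1, "energy": -0.96},
    {"task_id": "task-3", "formula": "O2", "charge": 0, "spin_multiplicity": 3, "energy": -1.00},
]

# Step 1: define molecules by grouping tasks, without looking at energies.
molecules = defaultdict(list)
for task in tasks:
    key = (task["formula"], task["charge"], task["spin_multiplicity"])
    molecules[key].append(task)

# Step 2: only once the groups exist can spin states be compared by energy.
lowest = {}
for (formula, charge, spin), group in molecules.items():
    energy = min(t["energy"] for t in group)
    best = lowest.get((formula, charge))
    if best is None or energy < best[1]:
        lowest[(formula, charge)] = (spin, energy)

ground_spin = {key: spin for key, (spin, _) in lowest.items()}
print(ground_spin)
```

Because the grouping key in step 1 cannot depend on the result of step 2, a ground-versus-excited label cannot be part of the molecule definition itself.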

Changes:

“To specify beyond this starting point, there are two natural definitions: one based on PES, and another based on the idea of chemical bonding (Figure 1).”

“The PES, in turn, is defined by the chemical composition, total number of electrons, spin multiplicity, and the DFT methods (level of theory and other calculation parameters) employed.”

“In the first definition, a molecule is defined as a local minimum on a PES. The PES, in turn, is defined by the chemical composition, total number of electrons, spin multiplicity, and the DFT methods (level of theory and other calculation parameters) employed. In this definition, every unique atomic structure (in terms of interatomic distances and angles) corresponding to a local PES minimum obtained via a geometry optimization calculation is a different molecule. It is worth noting that this physical definition is used within the Materials Project's data for crystalline solids to define a unique “material”.

In the second definition, it is the connectivity of a molecule - the way that atoms are linked to each other through chemical bonds and other interatomic interactions - that distinguishes molecules. Different local minima on a PES may correspond to structures with different bonds, but they may also simply be different conformational isomers (conformers). This definition is somewhat more complex than the picture based on PES, as it requires additional definitions and decisions. For instance, this definition relies on the idea of a “bond” and associated criteria determining when two or more atoms are or are not bonded. We note that it is extremely challenging to rigorously define chemical bonding, and ultimately, most definitions are arbitrary.

In MPcules, we use both the PES-based and the bonding-based definitions to construct molecules, as described below (“Building Molecules”). However, as most chemical observables of interest – including various spectra, electrochemical properties, and reaction properties like thermodynamics or kinetics – are averaged over different interconverting conformers, we rely in most cases on the definition based on bonding.”

“In the first (association) stage (Figure a), tasks are grouped according to a PES-based definition of a molecule (i.e., each structure corresponding to a unique local minimum of a PES is a unique molecule).”

7. Please write some example input and output. There is an example code in p.23 however, more detailed explanations should be added.

Response: We thank the reviewer for this suggestion. We have added input and output illustrating how users can interact with data after querying the Materials Project API.

Changes: Several new code blocks were added, providing an example of accessing data from a molecule summary document after querying the Materials Project API.

Referee: 3

Comments to the Author
Spotte-Smith and co-workers describe an application to enhance new and usability of existing computational chemistry databases, under the FAIR principles. In the MPcules app many properties are stored and made accessible for each molecule which will make it useful for many users.

Authors stress that different levels of theory can be used, but for some reason they only mention DFT. What about the more accurate coupled cluster calculations? There are quite a few such dataset, although likely not under the FAIR principles. Most of these are reviewed in a recent paper on datasets for machine learning and force field development. https://doi.org/10.1021/acs.jcim.2c01127

Response: We have primarily focused our discussion on DFT because, at this time, MPcules contains only properties calculated using DFT and can only accept DFT calculations. That is, coupled-cluster calculations cannot be included at present. However, we have mentioned databases that include coupled-cluster calculations. For instance, the QCArchive includes ANI-1ccx, which was calculated at the CCSD(T) level. We have also written (see Future Work) that we hope to include methods other than DFT (including wavefunction methods) in the future.

Changes: None

Although partial charges computed in different manners may be useful, the selection presented omits the CM5 charges that are used for instance in the OPLS force field. Since authors performed ESP calculations it would be extremely useful if also the ESP grid and potentials were shared for application with different charges models, and for empirical models that use charge sites that are not located on the atom (e.g. lone pairs) or models using explicit multipoles. In short, access to the ESP grids and potentials allows users to generate more advanced models for electrostatics than what atom-centered partial charges can do.

Response: We thank the reviewer for this suggestion. We agree that providing ESP data beyond atomic partial charges would be useful, and it also may be worthwhile to add additional methods of calculating atomic partial charges such as CM5. In the short term, it is not possible to add these features, either because we do not have data available (we have not calculated any CM5 atomic partial charges) or because adding these features would be a considerable engineering undertaking. To add ESP data would require, at minimum, changes to our DFT output parsers, database schema, and database construction pipeline, as well as potentially the addition of new API endpoints. We will strongly consider these features, and in particular adding ESP grids and potentials, for future developments.

Changes: None

Authors briefly discuss the calculation of thermodynamic properties using approximate algorithms without reference to the plethora of methods available to do exactly this, (e.g. Gaussian-n methods due to Curtiss, Ochterski etc., the complete-basis-set methods and the Weizmann methods). They should comment the relative accuracy of these methods.

Response: As we understand, the methods that the reviewer mentions, including Gaussian and Weizmann methods, are designed to extrapolate wavefunction (e.g. MP2 or coupled-cluster) thermochemistry to the complete basis set limit. As we just mentioned, MPcules can currently accept only DFT calculations. As such, we believe that detailed discussion of wavefunction methods is outside of the scope of this work.

Changes: None

An important omission from the database seems to be off-equilibrium compounds. Structures, energies and forces on atoms are extremely useful in machine learning for both prediction of reactions and development of force fields. It would be good if authors commented this omission anyway.

Response: Because of the way that we have defined molecules – requiring in most cases that geometry optimization calculations be performed and that molecules reflect PES minima – it is true that our database does not focus on off-equilibrium compounds. However, such data are available to users through the Materials Project API. Specifically, one can extract optimization trajectories – including structures, energies, and forces at off-equilibrium points of the PES – through the MPcules tasks collection. This was mentioned in the main text of our manuscript (see “The Materials Project API” in the previous version), though we did not previously connect this data to use in ML. We seek to clarify this point, as we agree with the reviewer that this data would be highly useful for ML and in particular force-field development.

Changes: “We note that, in addition to obtaining complete task, molecule, property, and summary documents, we have also provided API endpoints that extract more targeted information. For instance, using the /molecules/tasks/trajectory/ endpoint, it is possible to extract information from a task related to a geometry optimization trajectory, including the structures, energies, and forces along that trajectory. This off-equilibrium data could be used, among other purposes, to train ML interatomic potentials.”

In summary, authors present an important new data set, and therefore the work should be published. However, without access to the ESP and with only optimized geometries for the compound, the use to the machine learning community will be limited.

Details:

Authors stress the importance of the FAIR principles, therefore it seems odd that the database they are using is based on a commercial software, Q-Chem. Authors should comment this and suggest ways to use open source quantum chemistry packages instead.

Response: The FAIR principles say that data should be findable, accessible, interoperable, and reusable. They make no guidelines as to how data should be generated, only that once the data is generated, it should be open, easy to interact with, contain necessary metadata, and be useful to the public. As such, we argue that a database of calculations generated using NWChem, Psi4, or PySCF is not inherently more FAIR than one generated using Jaguar, Q-Chem, or Gaussian.

While we defend that the Materials Project (including its VASP-based crystal structure database and its Q-Chem-based molecular database) adheres to FAIR principles, we at the same time agree in principle with the reviewer’s preference for open source codes. As we mentioned in “Calculation Methods and Sources” in our “Future Work” section, we hope to make the MPcules infrastructure more flexible to enable codes other than Q-Chem to be used. Specifically, in this section, we mention xTB (open source), ORCA (closed source but free for academic use), and NWChem (open source), though we do not necessarily intend to limit ourselves to these codes.

Changes: None

As regards assigning priority to calculations, this is a laudable effort (but difficult and somewhat arbitrary). If one’s purpose is to look up a certain property for a compound this is useful. In other cases, user might wish for the best set of calculations for many compounds, is this provided for as well? That is, can the app suggest the best level of theory for a potentially large range of compounds?

In addition, for transparency it should be easy for users to assess the accuracy of presented “best” results in the computational chemistry field as a whole.

Response: As the reviewer states – and as we freely admit in our manuscript – assigning priority to different levels of theory is basically arbitrary. In a response to another reviewer above, we wrote that there is often not one level of theory that is best in all cases and for all applications. Thus, while we may assign a general ranking for different levels of theory, we know that this will always be limited and may in some cases be “incorrect” (i.e. we may suggest a level of theory for a given property that is less accurate in a particular situation than some other level of theory).

To more directly address the reviewer’s points:

The rankings that we have created for different levels of theory are applied uniformly. Thus, it could be said that we have chosen preferred levels of theory for all compounds. However, the goal of the app is not to suggest a particular level of theory (for that, we recommend e.g. Duan et al., Nat. Comput. Sci. 2023, 3, 38–47), but rather to recommend a particular property or datapoint from among the tasks present in our database of DFT calculations. We can only make recommendations based on the calculations that have already been performed and the levels of theory that have already been used in MPcules.

With regard to improving transparency by presenting the overall accuracy of the methods presented, we agree that this would be helpful and should be included. We plan on referencing existing benchmark studies in the next release of the Materials Project documentation.

Changes: None (with addition of benchmark results planned).

Page 13: The unit of the RMSD should be specified.

Response: Thank you! The units are in Angstrom. This is now specified.

Changes: “The structures associated with each task (represented by pymatgen Molecule objects) are then compared, and tasks with structures that are identical within a tight tolerance (by default, the root-mean-squared deviation or RMSD ≤ 10⁻⁶ Å) are grouped together.”
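A minimal sketch of tolerance-based grouping is given below. It is an assumption-laden stand-in for the actual pipeline: structures are plain coordinate lists in Å rather than pymatgen Molecule objects, atom ordering and orientation are assumed identical, and the greedy "first matching group" strategy is chosen here only for simplicity.

```python
import math


def rmsd(a, b):
    """RMSD (in the units of the coordinates) for identically ordered atoms."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def group_by_structure(structures, tol=1e-6):
    """Greedy grouping: a structure joins the first group whose representative
    (its first member) lies within `tol`; otherwise it starts a new group."""
    groups = []  # each group is a list of indices into `structures`
    for index, structure in enumerate(structures):
        for group in groups:
            reference = structures[group[0]]
            if len(reference) == len(structure) and rmsd(reference, structure) <= tol:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups


# Invented diatomic geometries (Å): the second differs by 1e-8 Å, the third by 0.01 Å.
structures = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.2)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.2 + 1e-8)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.21)],
]
print(group_by_structure(structures))
```

With the tight default tolerance, only numerically indistinguishable geometries are merged; a change of 0.01 Å already yields a separate group.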

Authors should comment how MPculeIDs differ from other well-known compound identifiers such as Inchey and IncheyKey. Why do they introduce yet another identifier? If something this reduces the reusability of the database. For instance, it is easy to lookup compounds by their InChey in pubchem or chemspider. Are these important resources expected to support yet another identifier? Alternatively, does the application presented here support those other identifiers? Also, CAS springs to mind in this context.

Response: We appreciate the reviewer’s concern on this matter. As we mentioned in response to another reviewer above, the existing solutions, including InChI, are insufficient specifically when attempting to distinguish between molecules with metals in different coordination environments. As such, we felt that there was no choice but to use a different unique identifier.

We do store the InChI and InChI-key in MPcules, and so eventually we hope that users can search for molecules based on InChI or else use the InChI data in MPcules to search on other databases such as PubChem.

Changes: We now mention that InChI are stored in MPcules:

“Although existing identifiers like InChI are not sufficient for use as a unique identifier in the MPcules database, they are widely used and supported. As such, to improve interoperability with other databases, we associate InChIs and InChI-key hashes with each molecule and molecule summary document in MPcules. We intend for users to be able to search for molecules based on their InChI strings in the future.”

Page 23, authors should describe where to obtain an api-key.

Response: In response to this comment, we have added an explanation of how to gain access to the Materials Project API.

Changes: “Upon making an account (https://profile.materialsproject.org/), users of the Materials Project gain access to an API key (https://next-gen.materialsproject.org/api). This allows users to interact with the Materials Project API.”

With these changes made, we hope and believe that our manuscript is now fitting for publication in Digital Discovery.

Round 2

Revised manuscript submitted on 18 Sep 2023

Editor’s decision letter

30-Sep-2023

Dear Dr Persson:

Manuscript ID: DD-ART-08-2023-000153.R1
TITLE: A database of molecular properties integrated in the Materials Project

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions. (In particular, clarifying the issues raised by Referee 2—to the extent that these limitations exist, they need not be "solved" but should at least be noted in the manuscript; to the extent that they are only perceived, you may wish to clarify the text to avoid this perception by readers.)

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************

Reviewer comments

Reviewer 1

I have no further comments

Reviewer 2

The results of this paper are still in a very early stage. I believe the main outcome is the creation
of a database, a web interface and an integration to the material project.
However, I would like to highlight potential violations of the "Findable" principle of the MPculeID introduced
by the authors:

1. Information Compression: Both InChI and SMILES strings can serve as preliminary guesses to molecular structures.
In contrast, the MPculeID compresses both the bonding information and molecular formula into a hash, rendering such
initial guesses unattainable. Additionally, with MPculeID, it's not feasible to:
* Compare the similarity of two distinct molecules, say, using the Tanimoto coefficient.
* Identify specific functional groups, and so on.

2. Incorporating InChI and SMILES representations into the MPculeID might offer some solution. However, identifying
metal bondings is still challenging, and I surmise it's practically infeasible.

3. Position of Hydrogen Atoms: Let's consider Keto-Enol tautomerization as an example. The InChI molecular
representation purposefully introduces ambiguity. If a user searches for a molecule in its Keto form (pertaining to a
large molecule) and your database only possesses its Enol form, would the MPculeID still be effective?

4. Chirality: How does MPculeID address the chirality of a molecule? A graph-based molecular representation might
overlook molecular chirality. C_{n}H_{2n+2} has very large number of chirality centers.
Should the user specify all the chirality centers or not? How about helicene?

Reviewer 3

Authors have answered my question satisfactorily.

Author response

We once again thank the reviewers for their time and consideration, and we are glad that Reviewers 1 and 3 found our previous responses and modifications agreeable. We have addressed Reviewer 2’s additional comments below.

Referee: 2

Comments to the Author

The results of this paper are still in a very early stage. I believe the main outcome is the creation
of a database, a web interface and an integration to the material project. However, I would like to highlight potential violations of the "Findable" principle of the MPculeID introduced by the authors:

1. Information Compression: Both InChI and SMILES strings can serve as preliminary guesses to molecular structures. In contrast, the MPculeID compresses both the bonding information and molecular formula into a hash, rendering such initial guesses unattainable. Additionally, with MPculeID, it's not feasible to:
* Compare the similarity of two distinct molecules, say, using the Tanimoto coefficient.
* Identify specific functional groups, and so on.

Response: As we describe in our manuscript, a database identifier should always be persistent and unique and should ideally (though not necessarily) be chemically meaningful. We emphasize that the MPculeID format indeed meets all these criteria.

The reviewer is correct that, because we use a hash to represent the molecular structure, the MPculeID is not ideal for substructure searches or similarity comparisons (though we note that subgraph hashes could be used for similarity searches; see Shervashidze et al., J. Mach. Learn. Res. 2011, 12, 2539–2561). We fully acknowledge the utility of such features and hope in the future to allow users of the MPcules database to search for molecules based on structural similarity, functional groups, etc. However, we would emphasize that these features are not necessary for an identifier. Most database identifiers that we are aware of are not designed for molecular similarity comparisons. The InChI-key, which is also based on a hashing algorithm, was designed to aid in database searches, but it cannot be used to compare molecular similarity. Further, many database identifiers do not contain any chemical information whatsoever. For instance, though the Reaxys database (https://www.reaxys.com) allows users to search for molecules that are similar to a user-provided structure, the Reaxys IDs are purely numeric. Likewise, QCArchive and PubChemQC use purely numeric identifiers that contain no chemical information.

While we do not believe that the inability to perform similarity and functional group searches using MPculeIDs in any way limits the utility of the MPculeID format as a database identifier, we have nonetheless clarified the limitations of this format in our main text.

Changes: We have added the following paragraph in our discussion of the MPculeID:

“Although the MPculeID format meets the basic requirements for a database ID format and overcomes certain key limitations of previous chemical identifiers, MPculeIDs have limitations of their own. For example, similar graphs do not in general produce similar Weisfeiler-Lehman hashes; these hashes therefore cannot be used to search for similar molecules, including molecules containing a particular substructure or functional group. There are also limits to the current implementation of MPculeIDs in the MPcules database that are not limitations of the basic format. As we have explained, when generating graph hashes for use in MPculeIDs, the graphs can be augmented with atom or bond features. Depending on how the graphs are augmented, different hashes will be produced, which can change if and how species are distinguished. As an example, consider chiral molecules. Different enantiomers have the same connectivity but are thought of as distinct because of their optical, structural, and (in some cases) reactive properties. Because they are by definition non-superimposable, enantiomers can be distinguished by their MPculeIDs in the association stage (where the graphs are augmented with Cartesian coordinates). However, MPculeIDs in the collection stage cannot distinguish between enantiomers because we do not augment the graphs with any information about chirality.”
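The effect of graph augmentation on the resulting hash can be sketched as follows. This is a simplified Weisfeiler-Lehman-style hash written in plain Python for illustration only; it is not the implementation used in MPcules, and the function name `wl_hash`, the digest size and the example molecules are assumptions.

```python
import hashlib
from collections import Counter


def _digest(text):
    """Short, stable hex digest of a string."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=8).hexdigest()


def wl_hash(adjacency, labels, iterations=3):
    """Weisfeiler-Lehman-style hash of a labelled, undirected graph.

    adjacency: dict mapping node -> list of neighbouring nodes
    labels:    dict mapping node -> initial label (the "augmentation")
    """
    current = {node: str(labels[node]) for node in adjacency}
    history = []
    for _ in range(iterations):
        updated = {}
        for node, neighbours in adjacency.items():
            # New label = own label plus the sorted labels of the neighbours.
            text = current[node] + "|" + ",".join(sorted(current[n] for n in neighbours))
            updated[node] = _digest(text)
        current = updated
        # Store the label multiset so the result does not depend on node order.
        history.append(sorted(Counter(current.values()).items()))
    return _digest(repr(history))


# Water and hydrogen sulfide share the same connectivity: X bonded to two H.
adjacency = {0: [1, 2], 1: [0], 2: [0]}
water = {0: "O", 1: "H", 2: "H"}
h2s = {0: "S", 1: "H", 2: "H"}
blank = {0: "*", 1: "*", 2: "*"}  # no atom features at all

# The same water molecule with its atoms numbered differently.
adjacency_renumbered = {0: [2], 1: [2], 2: [0, 1]}
water_renumbered = {0: "H", 1: "H", 2: "O"}

hash_water = wl_hash(adjacency, water)
hash_h2s = wl_hash(adjacency, h2s)
hash_water_renumbered = wl_hash(adjacency_renumbered, water_renumbered)
hash_blank = wl_hash(adjacency, blank)
print(hash_water, hash_h2s, hash_blank)
```

With element symbols as node labels, water and hydrogen sulfide receive different hashes. With no atom features they are indistinguishable, because only the connectivity enters the hash. Renumbering the atoms leaves the hash unchanged.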

2. Incorporating InChI and SMILES representations into the MPculeID might offer some solution. However, identifying metal bonding is still challenging, and I surmise it's practically infeasible.

Response: As we have explained previously, SMILES and InChI are not usable in MPcules because they either explicitly ignore metal bonding (InChI) or otherwise struggle with metal-containing compounds (SMILES). For this reason, we chose not to incorporate them in the MPculeID, though InChIs are stored in the MPcules database separately from the MPculeID. We agree with the reviewer that modifying either the SMILES or the InChI format to represent metal-containing compounds effectively would be very challenging, though perhaps worth pursuing at some point in the future.

Changes: None

3. Position of Hydrogen Atoms: Let's consider Keto-Enol tautomerization as an example. The InChI molecular representation purposefully introduces ambiguity. If a user searches for a molecule in its Keto form (pertaining to a large molecule) and your database only possesses its Enol form, would the MPculeID still be effective?

Response: The MPculeID is based in part on molecular graph representations, where we consider each atom (including hydrogens). Because hydrogen atoms in keto-enol pairs are bonded differently, keto-enol tautomers have distinct connectivities and would be represented by non-isomorphic graphs. Therefore, by our definitions (see “What Is a Molecule?”), the keto form and the enol form of a particular keto-enol pair are two different molecules and would in all cases have different MPculeIDs. The MPculeID is “effective” in this case, in the sense that it distinguishes between different molecules in line with our stated goals and desired behavior. However, if a user were searching for a molecule in its keto form but the MPcules database contained only the enol form, the search would yield no results.
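As a small illustration of why a keto-enol pair yields non-isomorphic graphs, the sketch below compares acetaldehyde (keto) with vinyl alcohol (enol) in plain Python. The atom numbering, the helper `neighbour_signature` and the choice of this particular pair are assumptions made for the example. The check compares the multiset of (element, neighbouring elements) environments, which is identical for any two isomorphic element-labelled graphs, so a mismatch proves that the graphs differ.

```python
from collections import Counter


def neighbour_signature(elements, bonds):
    """Multiset of (element, sorted neighbour elements) over all atoms.

    Isomorphic element-labelled graphs always share this multiset, so two
    different multisets imply non-isomorphic graphs.
    """
    neighbours = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        neighbours[a].append(elements[b])
        neighbours[b].append(elements[a])
    return Counter((elements[i], tuple(sorted(neighbours[i]))) for i in neighbours)


# Acetaldehyde, CH3-CHO: C0 carries three H, C1 carries one H and the O.
keto_elements = ["C", "C", "O", "H", "H", "H", "H"]
keto_bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6)]

# Vinyl alcohol, CH2=CH-OH: one H has moved from C0 to the oxygen.
enol_elements = ["C", "C", "O", "H", "H", "H", "H"]
enol_bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (1, 5), (2, 6)]

same_formula = Counter(keto_elements) == Counter(enol_elements)
same_environments = (neighbour_signature(keto_elements, keto_bonds)
                     == neighbour_signature(enol_elements, enol_bonds))
print(f"same formula: {same_formula}; same atom environments: {same_environments}")
```

The two tautomers have the same formula, C2H4O, but different atom environments (an O-H bond exists only in the enol), so any identifier derived from connectivity that includes hydrogens will treat them as different molecules.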

Changes: We now specify that we consider hydrogens in defining molecular connectivity.

“Upon detecting bonds, we construct a molecular graph representation using the pymatgen MoleculeGraph functionality. When defining connectivity for a graph representation, we consider bonds to hydrogen atoms, which are always included explicitly in our 3D molecular structures.”

4. Chirality: How does MPculeID address the chirality of a molecule? A graph-based molecular representation might overlook molecular chirality. C_{n}H_{2n+2} has a very large number of chirality centers. Should the user specify all the chirality centers or not? How about helicene?

Response: The reviewer raises an important point. We think it worthwhile to first provide a brief reminder of the MPculeID format. MPculeIDs begin with a hash of a molecular graph representation which can be augmented with some bond and atom features. In MPcules, we augment in two different ways: 1) during the “association” of task documents, we augment the graphs with the Cartesian (XYZ) coordinates of the atoms; 2) during the “collection” of the associated molecules, we augment only with the elemental symbol of the atoms (e.g. “Li” for lithium or “O” for oxygen). MPculeIDs based on graphs augmented with Cartesian coordinates can be used to distinguish between different enantiomers, as by definition such enantiomers are non-superimposable. However, MPculeIDs based on graphs augmented with elemental symbols cannot be used to distinguish between enantiomers. This is not a fundamental limitation of the MPculeID format – we could relatively easily augment the molecule graphs to encode information about chirality – but it is a limitation of the current implementation used in the MPcules database.

The same logic applies to helicene. Helicenes with different chirality will be treated as distinct in the association stage and will receive different MPculeIDs, but will be treated as the same in the collection stage because their graphs have the same connectivity.
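The distinction between the two stages can be illustrated with a toy example in plain Python. The sketch below builds the two mirror images of a hypothetical CHFClBr molecule with idealised coordinates. Their element-labelled connectivity is identical, while a coordinate-dependent quantity differs. The signed volume used here is only a stand-in for "information derived from Cartesian coordinates"; it is not how MPcules augments its graphs, and all names and coordinates are assumptions.

```python
def signed_volume(coords, centre, a, b, c):
    """Triple product (a-centre) . ((b-centre) x (c-centre)).

    The sign of this quantity flips when the structure is reflected.
    """
    def rel(i):
        return tuple(p - q for p, q in zip(coords[i], coords[centre]))

    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = rel(a), rel(b), rel(c)
    return (ax * (by * cz - bz * cy)
            + ay * (bz * cx - bx * cz)
            + az * (bx * cy - by * cx))


# Atom 0 is the stereocentre; atoms 1-4 are its four different substituents.
elements = ("C", "H", "F", "Cl", "Br")
bonds = frozenset({(0, 1), (0, 2), (0, 3), (0, 4)})

# Idealised tetrahedral coordinates (arbitrary units) and their mirror image.
coords_a = [(0, 0, 0), (1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
coords_b = [(-x, y, z) for (x, y, z) in coords_a]

enantiomer_a = {"elements": elements, "bonds": bonds, "coords": coords_a}
enantiomer_b = {"elements": elements, "bonds": bonds, "coords": coords_b}

# Connectivity plus element symbols: identical for both enantiomers.
same_graph = (enantiomer_a["elements"], enantiomer_a["bonds"]) == \
             (enantiomer_b["elements"], enantiomer_b["bonds"])

# A geometric descriptor of the F, Cl, Br arrangement around the carbon.
volume_a = signed_volume(coords_a, 0, 2, 3, 4)
volume_b = signed_volume(coords_b, 0, 2, 3, 4)
print(f"same graph: {same_graph}; signed volumes: {volume_a}, {volume_b}")
```

Any identifier built only from `elements` and `bonds` must coincide for the two structures, whereas one that also uses the coordinates can tell them apart.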

Changes: See above.

Round 3

Revised manuscript submitted on 02 Oct 2023

Editor’s decision letter

12-Oct-2023

Dear Dr Persson:

Manuscript ID: DD-ART-08-2023-000153.R2
TITLE: A database of molecular properties integrated in the Materials Project

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery

Reviewer comments

Reviewer 2

The authors have responded sincerely to my comments, and the deficiencies in the content have been addressed.

Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.


Content on this page is licensed under a Creative Commons Attribution 4.0 International license.