From the journal Digital Discovery Peer review history

ESAMP: event-sourced architecture for materials provenance management and application to accelerated materials discovery

Round 1

Manuscript submitted on 30 Mar 2023
 

18-Apr-2023

Dear Dr Suram:

Manuscript ID: DD-ART-03-2023-000054
TITLE: ESAMP: Event-Sourced Architecture for Materials Provenance Management and Application to Accelerated Materials Discovery

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after minor revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************


 
Reviewer 1

I commend the authors for an outstanding effort in materials data curation infrastructure. Of special note is the introduction of the materials "state" which allows for a very nuanced and flexible approach. The demonstrated application to the MEAD dataset proved its utility. I hope others will work on adapting this approach to their own workflows.

One comment the authors might consider is how this scheme differs from the approach taken by the PRISMS project

Reviewer 2

The paper proposes ESAMP, a database architecture that stores experimental material science data by tracking provenance of both materials and processes as well as the analyses of the raw data.

The material provenance database is easily accessible through the Caltech data repository with a versioning system with different sizes of the database, which is appreciated. The instructions to set up the database are clear but for ease of use it may be useful to have a docker container script that sets up the database for you, although not required.

Unfortunately in the github repo, there are no python environment files indicating the libraries used. At a minimum I'd like to see a requirements.txt or a conda environment yaml file.

In the jupyter notebook it would be helpful to refer back to the relevant parts of the paper in the code comments

Reviewer 3

In this manuscript, the authors present a proposed data structure and reference database implementation for tracking the chronology of a material sample. The manuscript is well written and clear. The rigorous and flexible representation chosen to represent these histories has tremendous potential to substantially improve the Reusability of data that is archived in this format. Many concerns will be expressed below, but these should primarily be understood as future-leaning critiques. The work is worthy of public release and should be published with minor revisions.

There are previous efforts to use the idea of a chronology to store scientific information in a comprehensive way; there seems to be a claim of novelty on this point: "Such approaches aim to streamline and minimize information loss that occurs in an experimental laboratory. We focus on modeling the complete ground truth of materials provenances". A few such examples are:
https://www.animl.org/
Lin et al., https://doi.org/10.1021/acs.jcim.1c00028
Citrine Informatics, https://citrineinformatics.github.io/gemd-docs/
but the authors should feel free to select other sources if they wish.

A major gap in this manuscript is lack of discussion around controlled vocabularies. In practice, this manifests as lack of constraint on strings and JSON keys present in the database. In order for information entered into this database to be Interoperable, it is necessary to reconcile concepts like "anneal" and "temp" in this resource with other resources, not just be internally consistent. As the authors cite in the submission letter a parallel manuscript in preparation "to expand on the queryability, transparency, and the ability to capture hierarchical relationships using graphs built on top of the ESAMP framework." It is not obvious that the second manuscript will address this fundamental gap in this work, but for the sake of this review, it will be assumed that it will. It will be deeply disappointing if this assumption is not borne out.

This work was clearly performed with a Lagrangian perspective - following the history of a particular collection of atoms. Such control mass models do not necessarily map well to all scenarios, such as continuous flow systems or environmental conditions, though they function reasonably robustly for documentation of a laboratory process. There is no need to modify the manuscript or underpinning work for this consideration -- it is simply something the authors may wish to consider in their future work.

The Supplemental Information and Data Availability Statement are not written with the same level of care as the primary article. For example
* it appears that there is a missing link in the MEAD database instructions
* Both sections 1.1 and 1.2 of the Supplemental Information refer to two columns and then describe three.
* No type is listed for two of the columns listed in section 1.2.
While the prose is appreciated, for the sake of clarity, it is suggested that the authors add explicit tables (in the manuscript sense) of all columns in the tables (in the database sense) to the Supplementary Information. It is noted that the description does not include the entire schema, such as the DOI included on the Collection table (which is a valuable thing to have in the database).

The discussion of relevance to "computational samples" is a bit over-reaching. While interoperability between predictively sourced and experimentally sourced data would be truly valuable to the community, the nature of the metadata of those two sources are quite different. It would be much more reasonable for such a discussion to take place in the promised Knowledge Graph manuscript, as that would presumably address reconciliation between disparate families of terminologies / physical concepts in technical detail as well as discussing interoperability of data from a domain perspective.

While the inclusion of a generation number in the ancestors table makes some sense if one is simply considering indexing a multitree, it is a physically confused concept. In particular, as "hidden processes" may be discovered and subsequently included in later records, depth of a graph becomes confused unless one starts editing old records (which has archival / replicability implications). It is suggested that discussion of that query ought be removed.

Finally, in its present state, this work has tremendous aspirations but is functionally unusable for the broader research community. Facility with SQL is not a particularly common skill amongst those generating laboratory data and that is a conscious choice made by many of our very competent colleagues. While this database has potential to serve as a foundational technology for materials informatics and high-throughput experiments, it requires a substantial amount of effort developing an ecosystem of tools for ingestion, visualization and structuring. No mention of the long term need for these is made. Please add such a discussion, including current efforts to support community adoption. If a starting point would be helpful, the MaRDA extractors working group (https://github.com/marda-alliance/metadata_extractors) might meet that need.


 

We thank all the reviewers for the detailed comments. Please see our responses below. We have also attached a file under the category "response to reviewers" with the same content.

REVIEWER REPORT(S):
Referee: 1

Comments to the Author
I commend the authors for an outstanding effort in materials data curation infrastructure. Of special note is the introduction of the materials "state" which allows for a very nuanced and flexible approach. The demonstrated application to the MEAD dataset proved its utility. I hope others will work on adapting this approach to their own workflows.

One comment the authors might consider is how this scheme differs from the approach taken by the PRISMS project

We thank the reviewer for their comments. We have added the following sentences that compare this work with efforts such as PRISMS, GEMD, and PolyDAT: “Prior efforts such as The Materials Commons \cite{puchala2016materials}, GEMD \cite{GEMD}, and PolyDAT \cite{lin2021polydat} have also focused on modeling materials provenances. GEMD uses a construction based on Specs and Runs for Materials, Ingredients, Processes, and Measurements. However, the distinction between material and ingredient does not clearly mimic reality, because materials can become ingredients for further processing or synthesis. Similarly, there isn't an explicit distinction between measurements and processes. Especially, in case of in-operando or in-situ experiments, a single experiment corresponds to both a process and also a measurement. PolyDAT focuses on capturing transformations and characterizations of polymer species. Materials Commons focuses on creation of samples, datafiles, and measurements by processes. We acknowledge the efforts of these earlier works, here we aim to further simplify the data architecture such that it is easily generalizable for various data sources. We also simplify various terminologies such as Materials, Ingredients, Processes, Measurements, Characterizations, Transformations into three main entities - Sample, Process, and Process Data. We also introduce a concept called “state” that enables dynamic sample → Process Data mapping and demonstrate its value for machine learning.”


Referee: 2

Comments to the Author
The paper proposes ESAMP, a database architecture that stores experimental material science data by tracking provenance of both materials and processes as well as the analyses of the raw data.

The material provenance database is easily accessible through the Caltech data repository with a versioning system with different sizes of the database, which is appreciated. The instructions to set up the database are clear but for ease of use it may be useful to have a docker container script that sets up the database for you, although not required.
Our docker container scripts to set up the database are provided here: https://github.com/modelyst/mps-docker and we have added this to our Data availability statement.


Unfortunately in the github repo, there are no python environment files indicating the libraries used. At a minimum I'd like to see a requirements.txt or a conda environment yaml file.
We assume the reviewer is referring to this repo: https://github.com/TRI-AMDD/ESAMP-usecase and we have added a conda environment yaml file to this repository.

In the jupyter notebook it would be helpful to refer back to the relevant parts of the paper in the code comments
We added sub headers to highlight code relevant to data retrieval, ML model building, and corresponding plots.

Referee: 3

Comments to the Author
In this manuscript, the authors present a proposed data structure and reference database implementation for tracking the chronology of a material sample. The manuscript is well written and clear. The rigorous and flexible representation chosen to represent these histories has tremendous potential to substantially improve the Reusability of data that is archived in this format. Many concerns will be expressed below, but these should primarily be understood as future-leaning critiques. The work is worthy of public release and should be published with minor revisions.

There are previous efforts to use the idea of a chronology to store scientific information in a comprehensive way; there seems to be a claim of novelty on this point: "Such approaches aim to streamline and minimize information loss that occurs in an experimental laboratory. We focus on modeling the complete ground truth of materials provenances". A few such examples are:
https://www.animl.org/
Lin et al., https://doi.org/10.1021/acs.jcim.1c00028
Citrine Informatics, https://citrineinformatics.github.io/gemd-docs/
but the authors should feel free to select other sources if they wish.

We have added an additional section that compares this work with efforts such as PRISMS, GEMD, and PolyDAT.
“Prior efforts such as The Materials Commons \cite{puchala2016materials}, GEMD \cite{GEMD}, and PolyDAT \cite{lin2021polydat} have also focused on modeling materials provenances. GEMD uses a construction based on Specs and Runs for Materials, Ingredients, Processes, and Measurements. However, the distinction between material and ingredient does not clearly mimic reality, because materials can become ingredients for further processing or synthesis. Similarly, there isn't an explicit distinction between measurements and processes. Especially, in case of in-operando or in-situ experiments, a single experiment corresponds to both a process and also a measurement. PolyDAT focuses on capturing transformations and characterizations of polymer species. Materials Commons focuses on creation of samples, datafiles, and measurements by processes. We acknowledge the efforts of these earlier works, here we aim to further simplify the data architecture such that it is easily generalizable for various data sources. We also simplify various terminologies such as Materials, Ingredients, Processes, Measurements, Characterizations, Transformations into three main entities - Sample, Process, and Process Data. We also introduce a concept called “state” that enables dynamic sample → Process Data mapping and demonstrate its value for machine learning.”


A major gap in this manuscript is lack of discussion around controlled vocabularies. In practice, this manifests as lack of constraint on strings and JSON keys present in the database. In order for information entered into this database to be Interoperable, it is necessary to reconcile concepts like "anneal" and "temp" in this resource with other resources, not just be internally consistent. As the authors cite in the submission letter a parallel manuscript in preparation "to expand on the queryability, transparency, and the ability to capture hierarchical relationships using graphs built on top of the ESAMP framework." It is not obvious that the second manuscript will address this fundamental gap in this work, but for the sake of this review, it will be assumed that it will. It will be deeply disappointing if this assumption is not borne out.
Thank you for raising this gap. We note that our architecture does not categorize processes (or any tables) based on their types, such as characterization or machining, so inconsistency in the nomenclature does not affect our database architecture. By defining sets of equivalent terms for the terms used in tables such as process_details, we can achieve interoperability amongst varied databases that use the ESAMP architecture. We have added the following paragraph at the end of the Adoption subsection:
“Another key barrier for adoption is inconsistencies in the nomenclature used for variables in the database. For example, various databases might use anneal\_temperature or heating\_temp to describe the same variable. In cases where the type of process (such as characterization, machining etc) determines the database schema, inconsistent nomenclatures could result in inconsistencies in the database architecture increasing the barrier for interoperability. Whereas, in the case of ESAMP, these variables are present in details tables such as process\_details. Therefore, defining sets of equivalent terms for terms used in the details tables can support in achieving interoperability amongst various databases.”
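As an illustration only (this sketch is not taken from the manuscript or its repositories), the idea of equivalent-term sets for a details table could look roughly like the following. The two variable names come from the paragraph above; the dictionary layout, the choice of anneal_temperature as the canonical name, the "atmosphere" key, and the helper names build_lookup and normalize_details are assumptions made for this example.

```python
# Hypothetical equivalence sets: canonical name -> terms treated as equivalent.
# Only the two temperature names appear in the text; the layout is assumed.
EQUIVALENT_TERMS = {
    "anneal_temperature": {"anneal_temperature", "heating_temp"},
}


def build_lookup(equivalent_terms):
    """Invert canonical -> synonyms into a flat synonym -> canonical mapping."""
    lookup = {}
    for canonical, synonyms in equivalent_terms.items():
        for term in synonyms:
            lookup[term] = canonical
    return lookup


def normalize_details(details, lookup):
    """Return a copy of a details record with keys mapped to canonical names.

    Keys that have no known equivalent are kept unchanged.
    """
    return {lookup.get(key, key): value for key, value in details.items()}


LOOKUP = build_lookup(EQUIVALENT_TERMS)

# Two records, as they might come from two different databases.
record_a = normalize_details({"anneal_temperature": 400}, LOOKUP)
record_b = normalize_details({"heating_temp": 400, "atmosphere": "air"}, LOOKUP)
```

After normalization both records expose the temperature under the same key, so a query written against one database's details table can be reused against the other without changing the schema.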

This work was clearly performed with a Lagrangian perspective - following the history of a particular collection of atoms. Such control mass models do not necessarily map well to all scenarios, such as continuous flow systems or environmental conditions, though they function reasonably robustly for documentation of a laboratory process. There is no need to modify the manuscript or underpinning work for this consideration -- it is simply something the authors may wish to consider in their future work.
We agree with this comment. To completely track all the variables one could use a similar type of architecture to measure instrument provenances, wherein the sample tables would be replaced by instrument tables. Environmental provenances could be recorded using their location tag as one type of instrument. For example, lab room XYZ could be an instrument and any environmental recording corresponding to this instrument could be associated with other instruments in that lab using timestamps and also requiring that every instrument has a location tag similar to how instruments in a physical lab are required to have a location tag.
Our article mentions the following to briefly address this topic: “To enable these benefits, we must first track the state of samples and instruments involved in a laboratory to capture the ground truth completely. In this article, we focus mainly on the state of samples and note that the architecture could capture the state of instruments or other research entities.” However, a much more in-depth discussion is out of scope for this article.


The Supplemental Information and Data Availability Statement are not written with the same level of care as the primary article. For example
* it appears that there is a missing link in the MEAD database instructions
We are not sure which link the reviewer is referring to here. Other reviewers were able to access the database. We have double checked that there are no missing links.
* Both sections 1.1 and 1.2 of the Supplemental Information refer to two columns and then describe three.
* No type is listed for two of the columns listed in section 1.2.
We have corrected these offending statements in the revised SI.

While the prose is appreciated, for the sake of clarity, it is suggested that the authors add explicit tables (in the manuscript sense) of all columns in the tables (in the database sense) to the Supplementary Information. It is noted that the description does not include the entire schema, such as the DOI included on the Collection table (which is a valuable thing to have in the database).
We included the complete database schema as the last Figure in the SI.



The discussion of relevance to "computational samples" is a bit over-reaching. While interoperability between predictively sourced and experimentally sourced data would be truly valuable to the community, the nature of the metadata of those two sources are quite different. It would be much more reasonable for such a discussion to take place in the promised Knowledge Graph manuscript, as that would presumably address reconciliation between disparate families of terminologies / physical concepts in technical detail as well as discussing interoperability of data from a domain perspective.

We agree that the nature of the metadata of the experimental and theoretical sources are quite different, and we have softened the language around simulation-experiment integration. We have also provided a few guiding examples to aid in this conversation.

The last three paragraphs of this subsection now read:
“In general, the significant differences in metadata associated with simulation and experimental workflows have resulted in databases that have significantly different architecture, increasing the barrier for integration of experimental and simulation datasets. Since the key entities of ESAMP are independent of the type of samples, processes, and process data, it allows representation of various forms of data including simulation and experiments using similar architectures. This reduces the accessibility and queryability barrier for integrating experimental and simulation datasets.

As long as the experimental and simulation databases have a single common key (for example: composition, polymer ID) the barrier for initial comparison between simulation and experimental data is significantly reduced because of the increased accessibility and queryability enabled by ESAMP. However, complex queries that depend on the metadata that enable more detailed experiment to simulation comparison may not be obvious. We hope that experts who have experience in simulation-experiment integration will publicly share the specific queries used for comparison in addition to publishing simulation and experimental databases that use similar architecture. For example, an initial comparison of band gap derived from simulation vs experiment could be based on a query that depends on common composition. A more detailed comparison could be to compare experimental measurements obtained on materials that have been annealed in air within a certain temperature range with simulated band gaps for compositions wherein the corresponding crystal structure is on the thermodynamic convex hull for specific ranges of oxygen chemical potential. Transparent publication of the queries that share similar language for simulation vs experiment comparison will open the doors for more data-driven integration between theory and experiment. Wherein, simply comparing the findings from theory and experiment can help shed light on where the computational simulations are valid. Additionally, one could train machine learning models to map simulation values to experimental values and use that to make predictions about future experiments. The use of similar architecture for experimental and simulation databases is also likely to aid in development of an interface for simulation assisted autonomous experimentation.

Computational models are often benchmarked against experimentally obtained values. However, this mapping relies upon the common keys used for comparison between simulation and experiment to be valid for the measurement associated with the property. If an intervening process changes the material's state, the mapping between the simulation and experimental dataset would be incorrect. Therefore, it is advantageous to use ESAMP to define state equivalency rules similar to those described earlier, to ensure a more relevant comparison of simulation-experiment data.”
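A minimal sketch of the "initial comparison" on a single common key described in the quoted text is given below. It is illustrative only: the row layout, the placeholder compositions, the band gap values, and the function name compare_on_key are all assumptions, and the more detailed metadata-dependent comparison is deliberately not attempted.

```python
# Hypothetical rows; compositions and band gap values are invented placeholders.
experimental = [
    {"composition": "A2B", "band_gap_eV": 2.1},
    {"composition": "AB3", "band_gap_eV": 1.4},
]
simulated = [
    {"composition": "A2B", "band_gap_eV": 1.8},
    {"composition": "C2D", "band_gap_eV": 3.0},
]


def compare_on_key(exp_rows, sim_rows, key="composition", prop="band_gap_eV"):
    """Pair experimental and simulated values that share the same common key.

    Returns a list of (key value, experimental value, simulated value) tuples.
    Rows without a counterpart in the other dataset are skipped.
    """
    sim_by_key = {row[key]: row[prop] for row in sim_rows}
    pairs = []
    for row in exp_rows:
        if row[key] in sim_by_key:
            pairs.append((row[key], row[prop], sim_by_key[row[key]]))
    return pairs


pairs = compare_on_key(experimental, simulated)
```

Only the composition present in both datasets is paired. A comparison that respects state would additionally have to restrict the experimental rows to measurements taken in an equivalent state, which this sketch does not model.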



While the inclusion of a generation number in the ancestors table makes some sense if one is simply considering indexing a multitree, it is a physically confused concept. In particular, as "hidden processes" may be discovered and subsequently included in later records, depth of a graph becomes confused unless one starts editing old records (which has archival / replicability implications). It is suggested that discussion of that query ought be removed.
We agree with the reviewer that this table is not essential to the original database. However, the generation number in the ancestors table is a very useful derived concept, especially for simplifying queries that depend on child-parent relationships. Further, treating it as a derived table is consistent with the need to update the ancestor ranks when any new processes are inserted into a material's provenance. We added the following to the manuscript to communicate the derived nature of the parent and ancestor tables: “The parent and the ancestor tables are not essential to the database and are tables that can be derived from the materials provenance. However, these derived tables are extremely valuable for simplifying complex queries dependent on sample lineages.”
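To make the "derived table" point concrete, the sketch below recomputes ancestor rows with generation numbers from parent relationships alone. It is an assumption-laden illustration, not the ESAMP implementation: the (child, parent) edge format, the sample names, and the function derive_ancestors are hypothetical.

```python
from collections import deque


def derive_ancestors(parent_edges):
    """Derive (sample, ancestor, generation) rows from (child, parent) edges.

    Generation 1 is a direct parent, 2 a grandparent, and so on. The result
    is rebuilt from scratch each time, so no stored record needs editing.
    """
    parents = {}
    for child, parent in parent_edges:
        parents.setdefault(child, set()).add(parent)

    rows = set()
    for sample in parents:
        # Breadth-first walk up the lineage, so each ancestor is first
        # reached at its smallest generation number.
        queue = deque((p, 1) for p in parents[sample])
        seen = set()
        while queue:
            ancestor, generation = queue.popleft()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            rows.add((sample, ancestor, generation))
            for grandparent in parents.get(ancestor, ()):
                queue.append((grandparent, generation + 1))
    return rows


# Lineage as first recorded: the piece is a direct child of the plate.
before = derive_ancestors([("plate_piece", "plate")])

# Later, an intermediate sample is recorded between the two.
after = derive_ancestors([("plate_piece", "plate_half"), ("plate_half", "plate")])
```

In the second call the plate moves from generation 1 to generation 2 for plate_piece purely as a consequence of re-deriving the table from the updated provenance.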

Finally, in its present state, this work has tremendous aspirations but is functionally unusable for the broader research community. Facility with SQL is not a particularly common skill amongst those generating laboratory data and that is a conscious choice made by many of our very competent colleagues. While this database has potential to serve as a foundational technology for materials informatics and high-throughput experiments, it requires a substantial amount of effort developing an ecosystem of tools for ingestion, visualization and structuring. No mention of the long term need for these is made. Please add such a discussion, including current efforts to support community adoption. If a starting point would be helpful, the MaRDA extractors working group (https://github.com/marda-alliance/metadata_extractors) might meet that need.
We agree that this database framework is one aspect of the larger ecosystem of tools necessary to accelerate efforts for FAIR usage of experimental data. We added the following to emphasize adoption related challenges and opportunities:
“To accelerate adoption of FAIR usage of experimental data, we believe that other aspects of data management such as data ingestion and data parsing need to be streamlined along with the use of generalizable database architectures such as ESAMP. The generalizable framework and language of our database architecture lends itself to development of simple user-interface modules that will assist in the data ingestion step. However, parsing data even after ingestion can be particularly challenging due to the presence of various file types such as files for x-ray diffraction, electrochemistry, x-ray photoemission spectroscopy etc. We believe that community sourcing of these parsers and their association with process types could be greatly beneficial to our ecosystem, and we particularly point to the effort undertaken by the MaRDA extractors working group\cite{MaRDAextract}.
We also point out that many prior efforts focus on static mapping of samples to attributes derived from process data. Our architecture in conjunction with the concept of "state" enables state equivalency rule based mapping of samples to process data attributes, which expands the utility of this database architecture to analysis of materials workflows that include state altering processes.
Another key barrier for adoption is inconsistencies in the nomenclature used for variables in the database. For example, various databases might use anneal\_temperature or heating\_temp to describe the same variable. In cases where the type of process (such as characterization, machining etc) determines the database schema, inconsistent nomenclatures could result in inconsistencies in the database architecture increasing the barrier for interoperability. Whereas, in the case of ESAMP, these variables are present in details tables such as process\_details. Therefore, defining sets of equivalent terms for terms used in the details tables can support in achieving interoperability amongst various databases.”





Round 2

Revised manuscript submitted on 03 Jun 2023
 

14-Jun-2023

Dear Dr Suram:

Manuscript ID: DD-ART-03-2023-000054.R1
TITLE: ESAMP: Event-Sourced Architecture for Materials Provenance Management and Application to Accelerated Materials Discovery

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below. (PLEASE NOTE: you may wish to make a minor modification in response to the comment from referee #3; please consider this, in the interest of clarity, and make relevant modifications at the proof stage. )

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery


 
Reviewer 3

The updated manuscript (along with its supporting information) is strong and generally ready for publication. There seems to be a minor confusion in the newly-added sentence:

> However, the distinction between material and ingredient does not clearly mimic reality, because materials can become ingredients for further processing or synthesis.

as the ingredient concept in the GEMD model is an edge connecting a material to a process, communicating an amount of material. That naming choice in the model is poor and thus the confusion is unsurprising. Removing this sentence is one reasonable remedy, or possibly modifying it to communicate the potential for confusion.

Reviewer 2

All changes seem appropriate and changes to code repositories allow ease of use and reproducibility




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.


Content on this page is licensed under a Creative Commons Attribution 4.0 International license.