From the journal Digital Discovery Peer review history

The laboratory of Babel: highlighting community needs for integrated materials data management

Round 1

Manuscript submitted on 27 Feb 2023
 

15-Mar-2023

Dear Dr Pozzo:

Manuscript ID: DD-PER-02-2023-000022
TITLE: The Laboratory of Babel: Highlighting community needs for integrated materials data management

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. (You may take the reviewer 1 comments with a grain of salt; this is a less-experienced reviewer, and the suggestions, although well-intentioned and worth considering, may not all align with the nature of this perspective article.)

When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************


 
Reviewer 1

All the content is there. I really enjoyed the Experiment, Group, & Community Scale Approach. It provides structure, allows for engagement, and

Reviewer 2

This is an excellent contribution by the authors. I not only enjoyed reading their perspective---even in moments of disagreement---I am excited to share it with my colleagues and collaborators. While it is billed as a perspective, the discussion of existing work is nearly as comprehensive as some reviews. The tone of the work inspires hope and excitement for the research community. I am looking forward to this manuscript's publication.

I have a few minor notes and queries that I believe will improve the manuscript prior to acceptance.

Bottom pg 2: Should the notion of customer development, or scoping user requirements, be introduced in this area? While the language is somewhat absent from the materials research community, it is a potent approach with plenty of resources from the tech and entrepreneurial communities.

Pg 3: It is an interesting decision to avoid quality control at experiment-scale data management. While I disagree in principle, I think this at least deserves justification. Catching up to poorly QC'd data at the Group Scale can often be more cumbersome than planning this from the start.

Pg3: "The manual entry of data and notes by researchers needs to be well supported". It would be helpful to clarify what "well supported" means to the author.

Fig 2: This and the surrounding context seems to be missing discussion on data validators, verification, and schema (e.g. json schema, pydantic models, pandera).

Pg 5: In discussion of hardware the authors state by analogy, "these competing standards demonstrate the need for community consensus in standards development [in] data management." I would note that these hardware standards can be both competitive and collaborative, especially when they are interoperable. There is something to be said for using the right tool for the right job (e.g. a liquid/solid dispenser using SILA talking to a robotic arm using ROS talking to a beamline using EPICS/Bluesky). This can occur in data management as well (side by side ZMQ and Kafka buses, or SQL and NoSQL databases).

Pg 9: I would underscore the "as generally useful as possible", by noting that sometimes a specific database enhances productivity more than a generic one. To draw a biological example, a database of therapeutic antibodies (SAbDab) provides great utility that the PDB or UniProt cannot.

Pg9: In this section, I would draw attention to tools like Tiled (https://blueskyproject.io/tiled/) that are built around data retrieval/access that allow for plugins for accepted standards, and produce common machine formats for work in the python ecosystem.

Pg 11: Perhaps note how the PDB has led to new achievements (AlphaFold)

Pg 11: It would be helpful to provide more detail/examples on "as few as possible but as many as necessary", particularly for the materials disciplines.


 

Dr. Schrier and Digital Discovery editorial team:

Thank you for your consideration of our manuscript. We have made revisions to the manuscript that integrate suggestions from the reviewers. Please find point-by-point responses to reviewer comments below, as well as in the attached document.

Thank you,

Brenden Pelkie and Lilo Pozzo

Referee: 1

All the content is there. I really enjoyed the Experiment, Group, & Community Scale Approach. It provides structure, allows for engagement, and

While the above comments appear incomplete, they are all that we received. We did not make any changes to address comments from reviewer 1.

Referee: 2

This is an excellent contribution by the authors. I not only enjoyed reading their perspective---even in moments of disagreement---I am excited to share it with my colleagues and collaborators. While it is billed as a perspective, the discussion of existing work is nearly as comprehensive as some reviews. The tone of the work inspires hope and excitement for the research community. I am looking forward to this manuscript's publication. I have a few minor notes and queries that I believe will improve the manuscript prior to acceptance.

Bottom pg 2: Should the notion of customer development, or scoping user requirements, be introduced in this area? While the language is somewhat absent from the materials research community, it is a potent approach with plenty of resources from the tech and entrepreneurial communities.

This is an excellent suggestion to include. While this work aims to lay out a set of ideals for what new data management infrastructure could look like, this should not replace true customer scoping for new implementations. We have added the following to the introduction: “While we hope our perspective helps guide future work on research data infrastructure, it should not replace formal customer development or user requirement scoping processes, such as those used in technology and entrepreneurship (e.g. NSF Innovation Corps). Developers of new data management tools should thoroughly evaluate the needs of the scientists who will be using them, so that these tools are a simple and valuable addition to research workflows.”

Pg 3: It is an interesting decision to avoid quality control at experiment-scale data management. While I disagree in principle, I think this at least deserves justification. Catching up to poorly QC'd data at the Group Scale can often be more cumbersome than planning this from the start.

This is a valid critique, and we agree with the reviewer that data ‘quality control’ should take place at all stages whenever possible. We have made modifications throughout the manuscript to reflect this. We originally envisioned experiment-scale data collection to be relatively ‘lightweight’ and focused on recording experimental data at fast rates to keep up with high-throughput frameworks. This could make it more challenging to implement some thorough and exhaustive controls (e.g. identifying outliers or small but systematic errors). Yet implementing these makes sense whenever it is possible. For example, exclusion of evidently faulty data could sometimes be simple and could be used to alert users in real time that there are problems in the workflow that should be addressed immediately. Still, an important aspect of data management infrastructure is to create complete records of experiments, including those resulting in ‘bad data’. These records can be used to identify subtle but systematic problems. Quality control at the group level is likely to be more thorough, as users who are removed from the experiments can help identify problems that an experimenter may not. We now suggest that quality control should take place at all levels, whenever possible.

Pg3: "The manual entry of data and notes by researchers needs to be well supported". It would be helpful to clarify what "well supported" means to the author.

We’ve clarified by specifying that a GUI could perform this role. Given recent advances in natural language processing technology and large language models such as GPT-4, we can also envision future interfaces to research data infrastructure that leverage these technologies, such as an Amazon Alexa-style voice assistant that allows scientists to dictate observations directly into a structured data format. We’ve added a mention of this possibility to this section: “A graphical user interface could provide this support. Recent advances in natural language processing technologies, such as GPT-4, may also enable new ways of recording data, such as a voice-assistant-based lab notebook.”

Fig 2: This and the surrounding context seems to be missing discussion on data validators, verification, and schema (e.g. json schema, pydantic models, pandera).

Data validation is likely to be highly application-specific and technical, so to maintain high-level readability we originally avoided discussing this topic. We have now added discussion of where data validation fits in the data processing pipeline. Discussion of schema definitions for applications was also originally omitted, and we have added this as well. In our opinion, this all belongs in the ‘application of data model’ step of the pipeline. We have not updated the Figure 2 graphic, to maintain visual clarity. The main sections of text added are: “Data validation checks that collected data is in an expected format and within an expected range. For example, a simple validation on the recorded mass of a sample could check first that the entry is numeric and not a text string, then that the value is within the measurable range of the balance used. This approach does not verify that the recorded number is correct but can catch major issues with data. As discussed above, invalid data should still be recorded but also flagged for review.” and “A prerequisite to using such a data model is a schema to describe what data is stored and how it is related. Tools to parse data from its source and transform it into the data model are also needed to implement the data models we describe. Developing these items can be a significant challenge.”
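As an illustrative sketch only (not text from the manuscript), the mass-entry validation described above can be expressed in a few lines of plain Python; libraries such as pydantic, pandera, or JSON Schema validators provide the same checks declaratively. The field name and balance range here are hypothetical:

```python
def validate_mass_entry(raw, min_g=0.001, max_g=220.0):
    """Validate one recorded sample-mass entry.

    Returns (record, flags). Invalid entries are kept but flagged for
    review rather than discarded, so the experimental record stays
    complete. The range limits are hypothetical; a real implementation
    would use the balance's specified measurable range.
    """
    flags = []
    value = raw.get("mass_g")
    # Format check: the entry must be numeric, not a text string.
    if not isinstance(value, (int, float)) or isinstance(value, bool):
        flags.append("mass_g is not numeric")
    # Range check: the value must fall within the balance's measurable range.
    elif not (min_g <= value <= max_g):
        flags.append(f"mass_g={value} outside balance range [{min_g}, {max_g}] g")
    return raw, flags
```

As in the added manuscript text, this catches major format and range problems without claiming the recorded number is correct, and it flags rather than drops suspect data.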

Pg 5: In discussion of hardware the authors state by analogy, "these competing standards demonstrate the need for community consensus in standards development [in] data management." I would note that these hardware standards can be both competitive and collaborative, especially when they are interoperable. There is something to be said for using the right tool for the right job (e.g. a liquid/solid dispenser using SILA talking to a robotic arm using ROS talking to a beamline using EPICS/Bluesky). This can occur in data management as well (side by side ZMQ and Kafka buses, or SQL and NoSQL databases).

The reviewer makes a good point that competing but collaborative standards can lead to specialized tools that work very well for specific tasks. The reviewer’s example of a hypothetical experiment that leverages specialized tools is a clear way to illustrate this point. However, there is a balance to be found between encouraging these new standards and a fractured landscape defined by ‘one more standard’ thinking. We’ve added the following discussion to the manuscript: “Collaborative development of competing lab equipment standards could lead to a set of widely adopted interfaces to equipment that each have specialized support for a particular use case. This would allow experimenters to pick the best tools for particular experimental tasks. For example, an automated flow-through nanoparticle synthesis experiment could communicate with a bank of syringe pumps over SiLA to control experimental conditions and with a beamline over Bluesky to manage sample characterization. Each of these standards fulfills the needs of the application it is used for and alleviates the need for a single monolithic standard to handle every research task imaginable. However, development of many overlapping standards also has the potential to fracture the ecosystem for managing hardware and software, and to preclude straightforward digital data management and communication. Care should be taken in standards development and adoption to avoid this.”



Pg 9: I would underscore the "as generally useful as possible", by noting that sometimes a specific database enhances productivity more than a generic one. To draw a biological example, a database of therapeutic antibodies (SAbDab) provides great utility that the PDB or UniProt cannot.

Managing the tradeoff between the benefits of scale (wide user base, name recognition, standardized community contributions) and specificity (excellent support for a niche use case, easier verification that data in the database is what you need) is one of the key challenges of establishing community databases. While we hint at this throughout this section, we have expanded on the discussion of this tradeoff where we introduce the concept of specificity: “Choosing a level of specificity for a database is an important consideration that impacts how data is likely to be re-used in the future. Specialized databases may make re-use simple for new applications that are similar to the original use of the data. However, being too specific can limit the community contributions and engagement that are needed to sustain a database after initial support runs out, and can make it difficult to find databases that contain the desired information. Conversely, databases that are too broad in scope might not support the level of detail needed for some downstream use cases.”

Pg9: In this section, I would draw attention to tools like Tiled (https://blueskyproject.io/tiled) that are built around data retrieval/access that allow for plugins for accepted standards, and produce common machine formats for work in the python ecosystem.

We agree that tools like Tiled will make domain-specific file support much easier, and have added a comment to this end as well as a Tiled reference.

Pg 11: Perhaps note how the PDB has led to new achievements (AlphaFold)

We’ve added a mention of AlphaFold and RoseTTAFold as examples of groundbreaking advances enabled by effective data sharing: “Access to the PDB has also enabled groundbreaking advancements, such as the accurate prediction of protein folding and de novo structures from sequence with machine learning models.”

Pg 11: It would be helpful to provide more detail/examples on "as few as possible but as many as necessary", particularly for the materials disciplines.

This is the same tradeoff that was discussed above with regard to making databases specific enough to support real use cases. This section specifically discusses the FAIRmat interpretation of database specificity. Their solution of federated databases with centrally stored metadata does have the potential to avoid both the lack of support for detail in broadly scoped databases and the findability issues of databases that are too specific. We have added the sentence: “Their proposal, which is to create a federated network of databases with centrally searchable metadata, has the potential to enable domain-specific databases that are still findable and reusable for applications in other contexts.”




Round 2

Revised manuscript submitted on 24 Mar 2023
 

27-Mar-2023

Dear Dr Pozzo:

Manuscript ID: DD-PER-02-2023-000022.R1
TITLE: The Laboratory of Babel: Highlighting community needs for integrated materials data management

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery


******
******

Please contact the journal at digitaldiscovery@rsc.org

************************************





Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.