From the journal Digital Discovery
Peer review history

Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS

Round 1

Manuscript submitted on 19 March 2022
 

03-May-2022

Dear Dr Schymanski:

Manuscript ID: DD-ART-03-2022-000019
TITLE: Extracting and Comparing PFAS from Literature and Patent Documents using Open Access Chemistry Toolkits

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy from CASRAI, https://casrai.org/credit/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

An application of parsing chemical structures from text corpora is demonstrated for PFAS structures. My comments are:

- False positive and false negative rates for OC|processor software was not demonstrated using a curated and benchmarking dataset of text-corpus annotated with PFAS structures.

- Structures found in the Google patents corpus are available in PubChem ( https://pubchem.ncbi.nlm.nih.gov/source/24262 ). It is not clear why these patents had to be reprocessed by the OC|processor software.

- What was the overlap between PFAS structures obtained from CORE and Patents corpora?

Reviewer 2

Data availability checklist 6b: Though the main software used is either open-source or provided in the paper's github repo, I didn't find the actual scripts which were used to generate the results presented in the paper.

An additional comment: though data in google big query is technically publicly available, and working with it there is quite powerful, it does require a google account and accessing it may have costs associated with it (I'm not sure about this). I think it would be preferable to also provide a version of the raw data tables in some form of repository as well as including the BQ links.

Signed
Greg Landrum

Reviewer 3

Definition B:
(AH)(AH)(F)C-C(AH)F2 group
This is a confusing definition. "AH groups could be hydrogen or any other atom" Why not just say that AH can be any atom? Is this different to "containing the substructure FC-CF2". This is more consistent with definition A ("contains a CF2")

Definition A allows the carbon atom to be sp2 hybridised (double-bonded to another atom). It is not quite clear if definition B includes or excludes per-fluoro-propene and similar structures. Perfluoroethyleneoxide?

Referring to the carbon atoms as 'aliphatic' may be intended to mean they are sp3 not sp2. However, the IUPAC Gold Book says aliphatic compounds can be unsaturated: https://goldbook.iupac.org/terms/view/A00217

Definition C seems to share these ambiguities and might also be better defined as a substructure. R1, R2 and R3 can, presumably, be the same as each other, and AH can be different to each other. Why are R groups numbered and AH atoms not?

The paper requires unambiguous definitions.

Reference 18 (OC|processor) needs to be made clear - is this a program about which there is information available to the reader? It should not be acceptable to base a paper on a program which is inaccessible to the readership.

"standard InChI (version 1.3)."
This should probably say version 1.03. The latest version is 1.06

The abstract does not mention the InChI, which is central to the process. InChI should be mentioned in the abstract.

This is an interesting paper which covers two distinct areas: extracting molecular data from large databases and generating a database of PFAS. Both areas are interesting, but the paper is not quite clear whether it is primarily a study of PFAS, or primarily a study of molecule data extraction with PFAS as an example. The title suggests the former, but the paper seems more weighted to the second.

This is a study which deserves publication, but all of the issues raised above need to be addressed first.


 

Dear Dr. Hung,

Thank you for considering our manuscript submission to Digital Discovery. We thank you and the reviewers for your time, and for your decision to accept this manuscript for publication after revisions. We have revised the manuscript to address the reviewers’ comments, have included a point by point response to the reviewers’ comments ("Comment" and "Response"; we have also provided a formatted PDF of this response for easier reading), and have highlighted the changes we have made in a separate track changes version of the manuscript. We had already included a CRediT-based author contribution statement in the submitted manuscript; this is also present in the revised manuscript. All authors agreed to this statement prior to submission. We have provided ORCIDs for all co-authors that have one.

We note that, as discussed with the editorial team, we have reinstated Ian Wetherbee as an author following final receipt of the (much delayed) approval from Google, a fact that was reflected in an updated preprint version of the manuscript released early April once approval was received (DOI: 10.26434/chemrxiv-2022-nmnnd-v2). This is now updated accordingly in the revised submission (corresponding adjustments are highlighted in the track changes manuscript – Lines 5, 9, 14, 588, 594 and 599-600).


REVIEWER REPORT(S):
Referee: 1
Comments to the Author
Comment: An application of parsing chemical structures from text corpora is demonstrated for PFAS structures. My comments are:
- False positive and false negative rates for OC|processor software was not demonstrated using a curated and benchmarking dataset of text-corpus annotated with PFAS structures.
Response: Thank you for pointing out this omission. The recall and precision of OC|processor have already been described elsewhere (DOI: 10.1093/database/baz001) but were not mentioned in the manuscript (apologies for this). We have amended this and added the information to the manuscript introduction on Lines 119-120 as follows (ref. 20 is new and is the article mentioned above):
The precision and recall of OC|processor has been detailed elsewhere (ref. 20).
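
For readers less familiar with these metrics, the false positive and false negative counts raised in the comment relate to precision and recall as sketched below. This is only an illustration: the function name and the counts are hypothetical and are not taken from the cited article or from OC|processor.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Return (precision, recall) from annotation counts.

    precision = TP / (TP + FP): share of extracted structures that are correct.
    recall    = TP / (TP + FN): share of annotated structures that were found.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(true_pos=80, false_pos=20, false_neg=10)
```

With these made-up counts, precision is 0.8 and recall is 80/90; a benchmark corpus annotated with PFAS structures would be needed to obtain real values.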

Comment: Structures found in the Google patents corpus are available in PubChem ( https://pubchem.ncbi.nlm.nih.gov/source/24262 ). It is not clear why these patents had to be reprocessed by the OC|processor software.
Response: Thank you for this comment, the Google patent corpus is indeed available in PubChem. These were processed with OC|processor / OC|miner by OntoChem and Google and provided to PubChem. This dataset is available (references are provided in the article) and was not reprocessed here. We have reworded the methods slightly to clarify this aspect and it now reads (L161-165):
For the current work, a set of 111,730,728 Google Patent documents semantically annotated with OC|processor in May 2021 using both the text and images found in these patents was used. The resulting annotations are available in a BigQuery table (ref. 32) dated May 13, 2021 …
The data source contact noted at the URL above is also one of the authors of this work (Ian Wetherbee), as this processing is a collaboration between OntoChem and Google. We note that not all structures provided to PubChem are actually “live” - due to the challenge of extraction from patents, PubChem pre-filters the selection that become live. We made a separate deposition of the PFAS data to ensure these extracted PFAS compounds are all live – this was also mentioned in the article (L514-5) and the discussions with PubChem were acknowledged (L603-4).

Comment: What was the overlap between PFAS structures obtained from CORE and Patents corpora?
Response: Thank you for this comment, we have calculated the overlap between the CORE and Patents for each definition (A=12876, B=1806 and C=866) and added this into the article (L469-471):
The overlap of the PFAS in the CORE and Patent datasets for the different definitions were (A) 12,876; (B) 1806; and (C) 866 PFAS entries, showing that the extraction of data from different sources reveals highly complementary results.
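
An overlap count of this kind can be thought of as a set intersection over a canonical structure identifier. The sketch below assumes each structure is keyed by a string such as an InChIKey; the function name and the placeholder keys are hypothetical, and this is not the PubChem-based procedure described elsewhere in this response.

```python
from typing import Iterable


def overlap_count(core_keys: Iterable[str], patent_keys: Iterable[str]) -> int:
    """Count identifiers present in both collections, ignoring duplicates."""
    return len(set(core_keys) & set(patent_keys))


# Placeholder identifiers, not real InChIKeys.
core = ["KEY-A", "KEY-B", "KEY-C", "KEY-B"]
patents = ["KEY-B", "KEY-C", "KEY-D"]
shared = overlap_count(core, patents)
```

Here `shared` is 2, because only KEY-B and KEY-C occur in both lists. The same calculation would be repeated once per definition (A, B and C) on the corresponding subsets.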


Referee: 2
Comments to the Author
Comment: Data availability checklist 6b: Though the main software used is either open-source or provided in the paper's github repo, I didn't find the actual scripts which were used to generate the results presented in the paper.
Response: Many thanks for thoroughly checking the information we provided. Most of the results were generated using the Java software described in the article, rather than with scripts (hence no scripts were available per se to provide with the original article). We have now updated the GitHub repository to include this information for one table as an example. This is contained in a folder "toReproduceManuscriptData", with (1) the input necessary to reproduce the Table 4 values and (2) a README file with stepwise instructions on how to use the repository to reproduce the values in Table 4 of the manuscript.
We note that some of the overlap statistics later in the article were performed using functionality in PubChem that is currently performed semi-automatically and cannot be provided as scripts; a subset of us are working with PubChem closely to automate these procedures but this is beyond the scope of the current article (as mentioned above, discussions with PubChem were acknowledged). We hope that this functionality will be available soon (and will communicate accordingly once ready).

Comment: An additional comment: though data in google big query is technically publicly available, and working with it there is quite powerful, it does require a google account and accessing it may have costs associated with it (I'm not sure about this). I think it would be preferable to also provide a version of the raw data tables in some form of repository as well as including the BQ links.
Response: Thank you for this comment. This was indeed a challenge that we considered carefully, also discussing it with the editorial team in light of the journal requirements. The BigQuery links are indeed much more powerful; they require a Google login, but access is free of charge (unless more powerful interaction using the cloud services is desired). This access allows users to work with the data in the existing structure. However, we did also provide all the data required to reproduce the manuscript material on the FigShare archive, as clarified in the data statement (DOI: 10.6084/m9.figshare.17168960.v1), so that users / readers have all options available: either a login with more powerful options, or full open access for further use with their own methods. We have amended the data availability statement accordingly (L612-6):
Finally, in addition to the deposit on FigShare, the patent annotations and the unique compounds from patents and CORE can be accessed via the embedded URLs (also given in the reference section, references 30-32). A (free) login is required for these URLs, which enables more powerful analysis than was possible via other repositories.

Comment: Signed Greg Landrum
Response: Thank you for your time and comments, Greg!

Referee: 3
Comments to the Author
Definition B:
(AH)(AH)(F)C-C(AH)F2 group
This is a confusing definition. "AH groups could be hydrogen or any other atom" Why not just say that AH can be any atom? Is this different to "containing the substructure FC-CF2". This is more consistent with definition A ("contains a CF2")
Response: We have taken community definitions for PFAS in this article and done our best to represent them as intended using terms generally accepted in the community, firstly by expressing them in simple representations (as shown in Figure 1) and then expressing them in cheminformatics terms (see Experimental, lines 211-215). For the PFAS community, there is a significant difference whether H is included as a neighbouring atom or not, as this can influence the degradation potential of the PFAS, and thus their persistence and other downstream behaviour in the environment. This is the major difference between Definitions B and C. As a result, we would prefer to keep this distinction clear by using “AH” for Definition B and “R” for Definition C.
However, we agree with the reviewer that these definitions are imperfect and we have revised the material accordingly, please see responses further below for more details.

Comment: Definition A allows the carbon atom to be sp2 hybridised (double-bonded to another atom). It is not quite clear if definition B includes or excludes per-fluoro-propene and similar structures. Perfluoroethyleneoxide?
Response: Please see the response above, and the response further below. We have provided all the tables needed to investigate the presence of various compounds in the categories, including interactive files in MetFrag. Based on this dataset, we could see that both examples mentioned by the reviewer fulfil Definitions A and B but not C.

Comment: Referring to the carbon atoms as 'aliphatic' may be intended to mean they are sp3 not sp2. However, the IUPAC Gold Book says aliphatic compounds can be unsaturated: https://goldbook.iupac.org/terms/view/A00217
Response: Please see response above and below. We have used various community definitions and as the reviewer has noted, the aspect of unsaturation is not clarified in a sufficiently detailed manner for our purposes in some of these definitions. Thus, we hope that this manuscript will contribute to the debate surrounding these definitions so that community-agreed definitions can be more concretely defined to allow for correct implementation in cheminformatics methods. By extracting data at a large scale as performed here, more potential PFAS compounds are revealed (beyond the lists of thousands currently commonly used) so that the true breadth of potential PFAS chemistry can be better appreciated, informing this debate and hopefully eventually the refinement/improvement of the definitions.

Comment: Definition C seems to share these ambiguities and might also be better defined as a substructure. R1, R2 and R3 can, presumably, be the same as each other, and AH can be different to each other. Why are R groups numbered and AH atoms not?
Response: As we noted above, Figure 1 was provided as a simple visual representation of the definitions, and the cheminformatics equivalents for each toolkit have been provided in the Experimental, lines 211-5. While we attempted to add a clarifying statement to the figure caption on the basis of this comment, this ended up adding more confusion rather than clarity and we prefer to leave things in the standard, established notation that we have chosen.

Comment: The paper requires unambiguous definitions.
Response: The reviewer is correct, and indeed the entire PFAS community requires unambiguous, careful definitions, a matter that is currently very contentiously debated in the community (even to the point of legal proceedings in the US). With this article, we took three definitions under discussion by the community and investigated their influence on the extraction of PFAS information from documents from a purely cheminformatics perspective. As is clear from the results presented in this article, there is a major impact on the number of compounds considered, and as noted by the reviewer, a lack of clarity in the definitions can result in some “grey zones”. We therefore hope that this article will contribute to the PFAS community efforts to establish a working definition that accurately reflects the chemical space they wish to capture under the concept of “PFAS”. We would like to note that there are several parallel efforts underway investigating the impact of different PFAS definitions that are beyond the scope of the current article (as they are efforts within larger projects and initiatives); the current work contributes to the evidence informing this debate.
On the basis of these last few comments, we have made the following revisions to the manuscript:
L71-74: Since the current definition of PFAS is strongly debated by the community, three different structural definitions of PFAS in use have been considered in this case study, clarified below and shown in Figure 1:

L563-8: These definitions came from the PFAS community, with A being recently proposed by the OECD, and both B and C deriving from definitions used by the US EPA. These definitions did not always contain sufficient cheminformatic detail to clarify certain edge cases, such as unsaturation or hybridization. As such, the results here are intended to contribute to the current debate surrounding the definition of PFAS and help further refine these definitions.

Comment: Reference 18 (OC|processor) needs to be made clear - is this a program about which there is information available to the reader? It should not be acceptable to base a paper on a program which is inaccessible to the readership.
Response: We have provided references to manuscripts describing OC|processor and added some additional references where possible (see below) and have gone to great efforts to archive all data and code associated with the material presented in this article in a variety of formats and locations for further use. We carefully reviewed the journal requirements before submitting this manuscript and had discussions with the managing editor in advance. To the best of our knowledge we fulfilled every point that was requested by the journal (as also indicated by the checklist submitted by Reviewer 2). Should more information be required we are happy to discuss further with the editorial office.

Comment: "standard InChI (version 1.3)."
This should probably say version 1.03. The latest version is 1.06
Response: Thank you for pointing this out. Indeed, version 1.03 (endorsed for unix) was used; this has been updated in the two places where the InChI version was mentioned (highlighted in the track changes version, L143 & 153).

Comment: The abstract does not mention the InChI, which is central to the process. InChI should be mentioned in the abstract.
Response: While the SMILES representation (and the different treatment of this information by the toolkits) is rather key to the process here, InChI is also mentioned many times in the article. Consequently, we have also added this to the abstract as requested (L25-6):
Each toolkit was used for structure parsing, normalization and subsequent substructure searching, using SMILES as structure representations of chemical molecules and International Chemical Identifiers (InChIs) for comparison.

Comment: This is an interesting paper which covers two distinct areas: extracting molecular data from large databases and generating a database of PFAS. Both areas are interesting, but the paper is not quite clear whether it is primarily a study of PFAS, or primarily a study of molecule data extraction with PFAS as an example. The title suggests the former, but the paper seems more weighted to the second.
Response: Thank you for this comment, we have revised the title from the existing “Extracting and Comparing PFAS from Literature and Patent Documents using Open Access Chemistry Toolkits” to “Extraction of Chemical Structures from Literature and Patent Documents using Open Access Chemistry Toolkits: A Case Study with PFAS” (L1-3) – we hope that this better reflects the focus of the article now.

Comment: This is a study which deserves publication, but all of the issues raised above need to be addressed first.
Response: We hope that we have addressed these issues adequately now.


Other changes:
We have made a few other changes to update and improve the clarity/accuracy of the material and to add additional references. All changes are marked in the track changes manuscript and include lines: 22-3, 31, 60, 94, 183, 210, 211-4, 219, 232-6 (including new references 44-6), 255 (new reference 49), 475/6, 479, 515-6 (and the corresponding datasets online), 558-60/568-70. References 20, 44-46 and 49 are new. We have also updated the acknowledgements to acknowledge the efforts of Jane Frommer (Collabra) and the reviewers.

We confirm that all authors have contributed to the manuscript (see contributions statement), received the reviewer comments and have approved the revised article and response for submission. We note that all line numbers here refer to the track changes version.

We hope that we have revised the manuscript appropriately in light of the reviewer comments for further consideration and we look forward to your response.

Signed: Emma Schymanski & Lutz Weber on behalf of all authors.




Round 2

Revised manuscript submitted on 17 May 2022
 

31-May-2022

Dear Dr Schymanski:

Manuscript ID: DD-ART-03-2022-000019.R1
TITLE: Extraction of Chemical Structures from Literature and Patent Documents using Open Access Chemistry Toolkits: A Case Study with PFAS

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


 
Reviewer 1

Great work. Thanks for replying to my comments.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.


Content on this page is licensed under a Creative Commons Attribution 4.0 International license.