From the journal Digital Discovery: Peer review history

Chebifier: automating semantic classification in ChEBI to accelerate data-driven discovery

Round 1

Manuscript submitted on 08 Dec 2023
 

29-Jan-2024

Dear Dr Hastings:

Manuscript ID: DD-ART-12-2023-000238
TITLE: Chebifier: Automating semantic classification in ChEBI with AI to accelerate data-driven discovery

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link :

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

see attachment

Reviewer 2

The manuscript by Glauer et al. provides a tool and methodology to enable the automatic classification of chemicals within ChEBI ontology classes using their chemical structures (represented by SMILES). Based on the user study, the chemical classes the tool predicts are too general, but it seems that the model and the tool can be improved as the ChEBI ontology expands with newer versions. The methodology undertaken within the submitted manuscript is generally valid. However, it seems that the authors are using a mixture of manually curated and partially automated entries in their training data, which could potentially lead to inaccurate or less specific ontology classification predictions. This could have been avoided if they had used the manually curated (3-star) entries in ChEBI. Furthermore, it would be beneficial to explain for the readers (preferably with an example) the types of chemical structures the tool works well on and the chemical structures the tool does not predict well or fails on.

Reviewer 3

The study “Chebifier: Automating semantic classification in ChEBI with AI to accelerate data-driven discovery” by Martin et al. proposes an automated pipeline for the classification of chemicals in the ChEBI ontology.
The effort stems from the amount of work that annotation of new substances in ChEBI requires from a human curator, compared to the number of substances present in public databases like PubChem. To automate the process the authors have used a Transformer architecture starting from SMILES, with the caution of treating complete chemical symbols as “single characters”. The results have finally been compared to a human panel of 12 members, showing a correctness of 75%.
Overall the manuscript is well written, the references and results are well presented, and in my opinion it is worthy of being published in Digital Discovery. Below is a list of minor points that the authors could perhaps clarify to further improve the study; none of them is critical, however:

1. The authors used SMILES to start with. Could SMARTS or another variant be an option, or, once the information is processed, would it not yield any significant advantage?
2. I do not know how much work is available on public foundation models using transformers outside language. Could any other “less hot” network approach be equally performant?
3. Why only F1? This should be discussed a little bit more.
4. The same for the exponent in the wC expression on page 5.
5. About the poll of 12 people: is this not a little limited? Also, did the two participants refraining from answering provide at least some feedback about difficulties or other problems?
6. Computational cost and resources: I could not find a comment about this, but I think it is relevant. Where were the training and testing done? How much time did it take and on which architecture? This, I guess, is the most important point of this list.


 

Dear reviewers,

Thank you for your feedback on our submission “Chebifier: Automating semantic classification in ChEBI to accelerate data-driven discovery”. We found the feedback very valuable, and have amended our manuscript to address each individual comment, which has significantly improved its quality.

Below is a point-by-point list of the provided feedback and our corresponding responses detailing the changes we made. Thank you for your time and consideration of our work.

Sincerely,
Martin Glauer and Janna Hastings, on behalf of all the authors.


---

1.1 Review #1
Remark: 1. During my use of the Chebifier web tool, I encountered issues with its functionality. It returned the same (wrong) results regardless of the inputted SMILES strings. Addressing this technical glitch is crucial for the paper evaluation.

Response: Thank you for pointing out this bug and apologies that you encountered it. The observed behaviour was due to an unfortunate combination of a sub-optimal interface design and a bug in our backend. With the previous interface, the user had to save their input before the input was passed to the server. Unsaved inputs all returned the same, incorrect classification response. We restructured the interface and removed the requirement to explicitly save inputs – these are now automatically saved. We also fixed the bug that caused the system to return incorrect classifications for empty inputs.

Remark: 2. The sentence “ChEBI, or the Chemical Entities of Biological Interest database, offers the largest ontology in the domain of life sciences chemistry.” in the Introduction is not entirely accurate. While ChEBI is significant for the classification of chemical entities, other ontologies like CHEMINF or OntoSpecies have larger vocabularies and broader scope than ChEBI, and PubChem RDF (although not an ontology in the traditional sense) offers a more extensive range of entries than ChEBI.

Response: Thank you for pointing out this potentially misleading formulation. We have amended our introduction accordingly to state only that ChEBI is a large and widely used bio-ontology for chemistry, rather than making the very specific evaluative claim that it is the largest. Indeed, CHEMINF addresses a related domain (chemical informatics); PubChem RDF, while not in itself an ontology as such (rather, it uses ChEBI classes for classification), is a very large knowledge graph of biologically relevant chemistry; and OntoSpecies also assembles content from PubChem and ChEBI.

Remark: 3. The paper does not explicitly state what occurs when a chemical entity belongs to a class not represented in ChEBI. I wonder if it returns a higher-level class or results in an incorrect classification. I was unable to personally assess it due to technical difficulties encountered while using the web tool (refer to point 1).

Response: The expected behaviour in this case is that the tool should return a higher-level class if the chemical entity belongs to a class not represented in ChEBI. As the ChEBI ontology is hierarchical, all valid SMILES strings should be classified at the very least as a ‘chemical entity’, the root class of the ontology. Our system is optimised to return at least one classification in all cases. However, it is important to note that the underlying model is a trained deep neural network; therefore, there is always a possibility of incorrect classifications or unforeseen behaviours.
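As an illustration of this hierarchical fallback, the sketch below (a simplified toy hierarchy with made-up class names, not the production model) shows why any non-empty prediction always includes higher-level classes up to the root:

```python
# Simplified sketch (toy hierarchy, not the production model): because ChEBI
# is hierarchical, any predicted class implies all of its ancestors, so every
# non-empty prediction includes the root class 'chemical entity'.
def close_under_ancestors(predicted, parents):
    closed, stack = set(), list(predicted)
    while stack:
        cls = stack.pop()
        if cls not in closed:
            closed.add(cls)
            stack.extend(parents.get(cls, []))  # walk up towards the root
    return closed

# Toy hierarchy: carboxylic acid -> organic molecule -> chemical entity.
parents = {"carboxylic acid": ["organic molecule"],
           "organic molecule": ["chemical entity"]}
```

With this toy hierarchy, a prediction of only ‘carboxylic acid’ is also reported under ‘organic molecule’ and ‘chemical entity’.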

Remark: 4. In their previous work (cited as Ref. 10) the authors employed a similar architecture and methodology. While the paper acknowledges this previous work, it doesn’t provide a comprehensive discussion that differentiates the current methodology from that developed in the earlier research.

Response: We thank the reviewer for highlighting that we were not sufficiently clear about the novelty introduced in the current manuscript. Indeed, we have developed some of the methodology for the classification in our prior research. However, we have made significant advancements on the prior work to enable the system that is presented here: we have significantly expanded the number of classes we are able to classify into, we have introduced a novel weighting scheme to push classifications lower in the hierarchy, we have built a web interface for user access to the model and predictions, and conducted a user study. We have now updated the Methods and Discussion of the manuscript to better reflect the novelty of the present contribution.


Remark: 5. The paper compares its approach with ClassyFire, a tool for structure-based chemical ontology classification. ClassyFire operates on a set of predefined rules for classifying chemicals, which might more closely mimic expert decisions in specific scenarios. In contrast, Chebifier leverages a neural network approach, eliminating the need for developers to continually update or maintain explicit classification rules alongside the ontology. This AI-driven model allows Chebifier to dynamically adapt as the ontology evolves. However, it’s important to note that introducing new classes in ClassyFire involves adding new rules, whereas in Chebifier, it necessitates retraining of the neural network. This distinction in the methodology for updating and expanding the system’s capabilities has significant implications. The paper should include a discussion on the time and resources required for retraining Chebifier, thereby offering a more comprehensive understanding of the advantages and limitations of each tool in the domain of chemical ontology classification.

Response: We thank the reviewer for highlighting our omission to sufficiently discuss the resources required to re-train our model. Fine-tuning the pre-trained base model with the ChEBI dataset over 100 epochs on two NVIDIA TITAN X GPUs finishes in less than 15 hours. Therefore, our model may easily be kept up to date with the regular monthly release cycle of the ontology. We have now updated the manuscript to include this information.

1.2 Review #2
Remark: 1. [...] it seems that the authors are using a mixture of manually curated and partially automated entries in their training data which could potentially lead to inaccurate or less specific ontology classification predictions. However, this could have been avoided if they had used the manually curated (3-star) entries in ChEBI.

Response: We absolutely agree with the reviewer that it is sub-optimal to be training our model using content that was automatically added to ChEBI rather than only the 3-star fragment of manually assembled content. However, extracting only the 3-star fragment of ChEBI as an ontology is potentially challenging, as the star rating is published for classes but not for the relationships between classes, and a fully connected ontology must be extracted, which requires bypassing, for example, mid-level classes that are not 3-star. Due to the associated complexity, we have used the full version of ChEBI for the current work, but in future work we do plan to develop an algorithm to filter out as much of the automated part of ChEBI as possible while preserving the connected ontology hierarchy. We have updated the discussion of the present manuscript to reflect this point more clearly.
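As a rough sketch of the kind of filtering algorithm we have in mind (illustrative only, with made-up class names and star ratings; not part of the released tool), each kept 3-star class could be reconnected to its nearest kept ancestors, bypassing removed intermediate classes so the filtered hierarchy remains connected:

```python
# Hypothetical sketch (illustrative class names, not the actual algorithm):
# keep only 3-star classes and reconnect each kept class to its nearest
# kept ancestors, bypassing removed intermediate classes so the filtered
# hierarchy remains connected.
def filter_to_curated(parents, stars, keep=3):
    def kept_ancestors(cls):
        found, stack = set(), list(parents.get(cls, []))
        while stack:
            p = stack.pop()
            if stars.get(p, 0) >= keep:
                found.add(p)                      # nearest kept ancestor
            else:
                stack.extend(parents.get(p, []))  # bypass non-curated class
        return found
    return {c: kept_ancestors(c) for c in parents if stars.get(c, 0) >= keep}

# Toy hierarchy: a 3-star leaf whose only parent is a 2-star mid-level class.
parents = {"leaf": ["mid"], "mid": ["chemical entity"], "chemical entity": []}
stars = {"leaf": 3, "mid": 2, "chemical entity": 3}
filtered = filter_to_curated(parents, stars)
# 'leaf' is reconnected directly to 'chemical entity', bypassing 'mid'.
```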

Remark: 2. Furthermore, it would be beneficial to explain for the readers (preferentially with an example) the types of chemical structures the tool works well on and the chemical structures the tool does not predict well or fails on.

Response: We thank the reviewer for this helpful suggestion. We have added some observations about the groups of classes for which we observe poorer performance to the discussion of limitations.


1.3 Review #3
Remark: 1. The authors used SMILES to start with. Could SMARTS or another variant be an option, or, once the information is processed, would it not yield any significant advantage?

Response: Indeed, harnessing SMARTS would allow us to elegantly specify the structural features of chemical classes, which would add an additional verification and matching capability to our prediction system. Unfortunately, we are not aware of any large-scale publicly available SMARTS datasets that we could use for this particular ontology extension task. We are, however, currently looking into ways to generate such a dataset ourselves. We extended our discussion of Future Work accordingly to reflect this.

Remark: 2. I do not know how much work is available on public foundation models using transformers outside language. Could any other “less hot” network approach be equally performant?

Response: We agree with the reviewer that it can be beneficial to consider a broad range of model architectures. In our previous research, we explored a variety of different approaches, starting from traditional machine-learning approaches such as logistic regression and Bayes classifiers, then simpler sequence-based models such as LSTMs, and finally full-fledged transformer models. Our evaluations have shown that the selected Transformer architecture (Electra) does outperform these “less hot” approaches. Yet, the use of a sequence-based representation does not come without downsides. Our analyses have also shown that, due to the linear character of SMILES, sequence-based models are less performant when detecting complex ring or branching structures, such as for peptides or complex sugars. As we write in our Future Work section, we are planning to use less “deep”, graph-based approaches to address this downside, and it is our ultimate outlook that the most performant system for the problem as a whole will involve an ensemble of different model types, in which the different approaches mitigate each other’s weaknesses.

Remark: 3. Why only F1? This should be discussed a little bit more.

Response: In our previous work, we have also investigated other metrics on a comparable base model. See, for example: Memariani, Adel, et al. “Automated and explainable ontology extension based on deep learning: A case study in the chemical domain.” arXiv preprint arXiv:2109.09202 (2021). In the present work, as the evaluation is already long and multifaceted, including user-oriented evaluation, we wanted to select a single standard metric for the model performance. The F1 score is widely used as a balance between precision and recall.
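For reference, the F1 score is the harmonic mean of precision and recall; a minimal illustration (with made-up counts):

```python
# Minimal illustration of the F1 score (harmonic mean of precision and recall).
def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

score = f1(tp=80, fp=20, fn=40)  # precision 0.8, recall ~0.667
```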

Remark: 4. The same for the exponent in the wC expression on page 5.

Response: The method here has been used as proposed by Cui et al. in their referenced work. We amended the relevant section to include a more extensive explanation of the metric used here.

Remark: 5. About the poll of 12 people: is this not a little limited? Also, did the two participants refraining from answering provide at least some feedback about difficulties or other problems?

Response: While a participant number of 12 may seem small, it is within the normal range for a user study of a web tool in a highly specialised field. Moreover, we did observe that the main themes of the study feedback emerged multiple times across participants. The two participants who refrained from answering specific questions did provide feedback on the challenges they had encountered with the user interface; this feedback was instrumental in improving the design and stability of the interface.

Remark: 6. Computational cost and resources: I could not find a comment about this, but I think it is relevant. Where were the training and testing done? How much time did it take and on which architecture? This, I guess, is the most important point of this list.

Response: The fine-tuning finishes in less than 15 hours on two NVIDIA TITAN X GPUs. We have added this information to Section 2.1.




Round 2

Revised manuscript submitted on 05 Mar 2024
 

26-Mar-2024

Dear Dr Hastings:

Manuscript ID: DD-ART-12-2023-000238.R1
TITLE: Chebifier: Automating semantic classification in ChEBI with AI to accelerate data-driven discovery

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our LinkedIn account [https://rsc.li/Digital_showcase] please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


 
Reviewer 2

The authors have given a well-written response to the reviewers’ questions/comments and have integrated some of the feedback into the revised article.

Reviewer 3

In this revised version the authors have answered the remarks made by me and by the other reviewer, and I think it may now be published. Somehow I am not entirely satisfied with the answers to remarks nos. 2 and 3, which, in my opinion, deserved more discussion, at least in the supplementary information, but I do not think this should stop the publication of this study.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.