From the journal Digital Discovery
Peer review history

What can attribution methods show us about chemical language models?

Round 1

Manuscript submitted on 25 Mar 2024
 

10-May-2024

Dear Dr Robinson:

Manuscript ID: DD-ART-03-2024-000084
TITLE: What can Attribution Methods show us about Chemical Language Models?

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************


 
Reviewer 1

I have been assigned as the 'Data Reviewer' for this manuscript. I think the GitHub repository is well-written and contains all the necessary information to reproduce the results given in the manuscript. I would recommend the following minor revisions:
1. Specify the version of the AqueousSolu dataset used (Version v1.2, Jul 1, 2022).
2. List the packages used (and their versions) in the README file. This information is also in the requirements.txt file, but it would be good to have it where it can be easily read.
3. Include an Acknowledgements and References section in the GitHub repository.

Reviewer 2

In this work, the authors focused on exploring the use of Chemical Language Models for explainable property prediction. They compared the performance of different models and investigated their accuracy in predicting aqueous solubility. The novelty of the study is that the authors compared the application of attribution methods to a specific CLM.

- Methods:
-Why did you select MegaMolBART?
-Did you consider including a scaffold splitting strategy as a comparison as well?
-How did you preprocess the compounds for the comparison method?
-Table 1: Standard deviations?

-Results:
-I am wondering about the comparison to SOTA approaches like this:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10638308/
-Did you compare with SHAP and RF/SVM on the same dataset?
-Did you look into datasets where it is known that specific substructures drive
activity? E.g. on target bioactivity datasets?
-The model is unable to accurately model symmetry and frequently attributes very
different relevance to symmetric functional groups, which might be due to the
difficulty of reconstructing the structure of the molecule from a string-based
representation.
--> very interesting observation --> would be even more obvious if bigger
substructures should drive the activity
-ECFP: see comment to SHAP and RF
-This phenomenon is likely to be particularly severe in our case due to the size of
fingerprint chosen (512 bits): Why did you choose 512 bits?
-Evaluating which attribution is correct or most accurate is not possible since no
reference attribution label exists for solubility or other physical properties: Why
did you choose this dataset?

-very interesting comparison and very interesting insights like: Visualizations of the model’s latent space lead us to conclude that the model uses SMILES strings to map molecules into a structural latent space and predicts solubility based on position, rather than regressing based on learned molecular features and functional groups. From a user perspective it is of interest to use the best performing model in combination with the most helpful XAI method

Reviewer 3

In this paper, the authors explore explainability for chemical language models trained on SMILES strings using attribution techniques. I think in general this kind of study can be very interesting and relevant but the authors only study this in the context of predicting solubility and only use one chemical language (SMILES). There are many more possible and useful (and more interesting) tasks to consider. There are also more chemical languages to consider as well. I think to see if their insights hold up the authors should at least consider one more task-- perhaps toxicity? As well they should at least consider one more chemical language-- perhaps SELFIES?

This might allow you to provide a more detailed discussion about how these insights "... may be leveraged for the design of informative chemical spaces
for training more accurate, advanced and explainable models."


 

REVIEWER REPORT(S):
Referee: 1

Comments to the Author
I have been assigned as the 'Data Reviewer' for this manuscript. I think the GitHub repository is well-written and contains all the necessary information to reproduce the results given in the manuscript. I would recommend the following minor revisions:
1. Specify the version of the AqueousSolu dataset used (Version v1.2, Jul 1, 2022).

Response: We have updated the manuscript to report the version of the dataset both in the
repository (commit dea52c3) and in the methods section of the manuscript (page 2).

2. List the packages used (and their versions) in the README file. This information is also in the requirements.txt file, but it would be good to have it where it can be easily read.

Response: We have added a list of packages in the README file of the code repository as
requested (commit dea52c3).

3. Include an Acknowledgements and References section in the GitHub repository.

Response: We have added an Acknowledgements section to the README and a references file (references.bib)
to the GitHub repository (commits dea52c3 and a4bf5fd).

We thank the reviewer for their time and careful examination of the data and
code associated with our manuscript.

Referee: 2

Comments to the Author

In this work, the authors focused on exploring the use of
Chemical Language Models for explainable property prediction.
They compared the performance of different models and
investigated their accuracy in predicting aqueous solubility.
The novelty of the study is that the authors compared the
application of attribution methods to a specific CLM.

Response: We thank the reviewer for their time and constructive comments on our manuscript.

- Methods:
-Why did you select MegaMolBART?

Response: We used MegaMolBART as our pretrained CLM because it offers several important
advantages. First, it is an open-source model whose code and weights are readily
available. Second, it is based on the ChemFormer architecture [Irwin, Dimitriadis, He,
Bjerrum, Mach. Learn.: Sci. Technol. 3 (2022) 015022], a state-of-the-art Transformer
trained with a self-supervised learning approach that takes SMILES strings as inputs.
The SMILES representation is beneficial because it is compact compared with full
molecular graph representations and is prevalent in many databases. Third, MegaMolBART
was trained on roughly 1.45 billion SMILES strings, over an order of magnitude more data
than was used to train ChemFormer (roughly 100 million strings). Large Language Models
have been shown to benefit strongly from large pretraining datasets [arXiv:2203.15556,
arXiv:2001.08361], implying that MegaMolBART should outperform ChemFormer; we therefore
believe it to be one of the most powerful CLMs available at this time. For these reasons,
we expected the large molecular space covered by MegaMolBART, combined with
state-of-the-art pretraining techniques, to yield the strongest Chemical Language Model
available at the time.

We have updated the manuscript to clarify our decision to use MegaMolBART
(page 3, MegaMolBART section).

-Did you consider including a scaffold splitting strategy as a comparison as well?

Response: We have added results from using the scaffold splitting strategy as suggested by
the reviewer (page 2, page 5, Table 1, Table S1 and Figure S1). We find that the
scaffold split test errors were much higher than for the other split strategies
while validation set errors were similar for all splits (Table 1, Table S1,
revised Fig. S1).
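
For reference, the following is a minimal sketch of the kind of Bemis-Murcko scaffold split used here (illustrative only; the exact procedure is in the code repository, and the group-ordering heuristic below is an assumption):

    from collections import defaultdict
    from rdkit.Chem.Scaffolds import MurckoScaffold

    def scaffold_split(smiles_list, test_frac=0.2):
        """Assign whole Bemis-Murcko scaffold groups to train or test so that
        no scaffold is shared between the two splits."""
        groups = defaultdict(list)
        for idx, smi in enumerate(smiles_list):
            groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(idx)
        # place the largest scaffold groups in the training set first;
        # the remaining, rarer scaffolds form the test set
        ordered = sorted(groups.values(), key=len, reverse=True)
        n_train = int((1.0 - test_frac) * len(smiles_list))
        train_idx, test_idx = [], []
        for group in ordered:
            (train_idx if len(train_idx) < n_train else test_idx).extend(group)
        return train_idx, test_idx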

-How did you preprocess the compounds for the comparison method?

Response: We did not preprocess compounds for the comparison methods,
and used the SMILES strings supplied directly by the database. For
ECFP, we converted SMILES to fingerprints using RDKit (Morgan
fingerprints) as detailed in the methods section.
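
For concreteness, a small sketch of this featurisation step (the radius of 2 below is an assumption for illustration; the exact parameters are given in the methods section and repository):

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def smiles_to_ecfp(smiles, n_bits=512, radius=2):
        """Convert a database SMILES string, without further preprocessing,
        into a Morgan/ECFP bit-vector fingerprint of length n_bits."""
        mol = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros(n_bits)
        DataStructs.ConvertToNumpyArray(fp, arr)
        return arr

    X = np.stack([smiles_to_ecfp(s) for s in ["CCO", "c1ccccc1O"]])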

-Table 1: Standard deviations?

Response: We have included standard deviations of the cross-validation errors for all
investigated models in the revised Table S1.

-Results:
-I am wondering about the comparison to SOTA approaches like this:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10638308/

Response: We used the method from the suggested paper to implement support vector
regression (SVR) models and attempted to explain them using SVERAD. However, that paper
focuses on applying explainability methods to classification tasks and does not discuss
the adaptations needed to explain support vector regression models; the SVERAD
implementation in the reference is therefore not suited to explaining regression models.
We have included results for SVR with ECFP fingerprints in Table 1 and Table S1.

-Did you compare with SHAP and RF/SVM on the same dataset?

Response: We have provided the results from further investigations using random forest and
support vector regression with ECFP fingerprints in Table 1 and Table S1. These
results show slightly higher performance than the hierarchical regression head
variant with ECFP fingerprints.
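
For illustration, such baselines can be set up with scikit-learn roughly as follows, assuming X holds the ECFP fingerprint matrix and y the corresponding log-solubility values (the hyperparameters shown are placeholders, not necessarily those used for Table 1):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVR

    # X: ECFP fingerprint matrix (n_molecules x n_bits); y: log-solubility values
    models = {
        "RF": RandomForestRegressor(n_estimators=500, random_state=0),
        "SVR": SVR(kernel="rbf", C=10.0, epsilon=0.1),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5,
                                 scoring="neg_root_mean_squared_error")
        print(f"{name}: RMSE = {-scores.mean():.2f} +/- {scores.std():.2f}")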

-Did you look into datasets where it is known that specific substructures drive
activity? E.g. on target bioactivity datasets?

Response: We have not included such data sets, as they are qualitative
and largely suited to classification problems. They therefore do not
allow us to quantitatively map molecular structure onto a property in
the way that the solubility data does.

-The model is unable to accurately model symmetry and frequently attributes very
different relevance to symmetric functional groups, which might be due to the
difficulty of reconstructing the structure of the molecule from a string-based
representation.
--> very interesting observation --> would be even more obvious if bigger
substructures should drive the activity

Response: This observation is indeed quite pertinent to our
investigation. We consider it a consequence of the SMILES notation the
model uses to encode chemical structure. Although the molecular graph
corresponding to a SMILES string may possess symmetry, the way a SMILES
string is constructed means that this symmetry is not reflected in the
resulting string: graph-equivalent atoms can appear at different
positions and in different contexts. We hypothesize that this is why
the model fails to assign similar weights to symmetric features.
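
To make this concrete, a small check with RDKit's canonical ranking shows how graph-equivalent atoms end up at different positions in the SMILES string (this example is illustrative and not part of our attribution pipeline):

    from rdkit import Chem

    # para-xylene: the two methyl carbons are graph-symmetric, yet they sit at
    # different positions (and in different contexts) in the SMILES string
    smi = "Cc1ccc(C)cc1"
    mol = Chem.MolFromSmiles(smi)

    # breakTies=False gives identical ranks to symmetry-equivalent atoms
    ranks = list(Chem.CanonicalRankAtoms(mol, breakTies=False))
    for atom in mol.GetAtoms():
        print(atom.GetIdx(), atom.GetSymbol(), "symmetry class:", ranks[atom.GetIdx()])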

-ECFP: see comment to SHAP and RF

Response: We have provided the results from further investigations using
random forest and support vector regression with ECFP
fingerprints in Table 1 and Table S1. Both ML models achieve similar
predictive results, and are comparable to the hierarchical
regression head.

-This phenomenon is likely to be particularly severe in our case due to the size of
fingerprint chosen (512 bits): Why did you choose 512 bits?

Response: A 512 bit fingerprint size was chosen as it is of the same dimension as the
vector created for regression in MegaMolBART (which is dictated by the
dimensions of the pretrained model). In this sense, ECFP is an algorithmic
counterpart to the compression of molecular structure created by MegaMolBART. We
have collected additional results for the commonly used fingerprint size of 2048
bits (Table 1 and Table S1). We observe higher predictive performance for the
2048 bit model, which is consistent with the gain in accuracy given by a model
with more degrees of freedom. We have manually compared the explainability
results between the 2048 and 512 bit representations; they broadly give
similar results, but are not identical in every case (Figures S14-S17).

We have updated the text on Page 6 to clarify our choice to focus on the 512 bit
fingerprint size.

-Evaluating which attribution is correct or most accurate is not possible since no
reference attribution label exists for solubility or other physical properties: Why
did you choose this dataset?

Response: We chose solubility as a target for prediction, as it is a key
physical property for performing chemical investigations, and
is of high industrial interest. Though it is common chemical
practice to develop ‘rule-of-thumb’ approaches which map
chemical structure into an expected regime of, for instance,
solubility, there are no instances where these rules produce
precise, accurate, quantitative predictions. Thus, it is of
great importance to investigate whether the connection between
molecular structure and solubility can be distilled using
chemical language models.

-very interesting comparison and very interesting insights like: Visualizations of the model’s latent space lead us to conclude that the model uses SMILES strings to map molecules into a structural latent space and predicts solubility based on position, rather than regressing based on learned molecular features and functional groups. From a user perspective it is of interest to use the best performing model in combination with the most helpful XAI method

Response: We thank the reviewer for their encouraging comments.

Referee: 3

Comments to the Author
In this paper, the authors explore explainability for chemical language models trained on SMILES strings using attribution techniques. I think in general this kind of study can be very interesting and relevant but the authors only study this in the context of predicting solubility and only use one chemical language (SMILES). There are many more possible and useful (and more interesting) tasks to consider. There are also more chemical languages to consider as well. I think to see if their insights hold up the authors should at least consider one more task-- perhaps toxicity? As well they should at least consider one more chemical language-- perhaps SELFIES?

This might allow you to provide a more detailed discussion about how these insights "... may be leveraged for the design of informative chemical spaces for training more accurate, advanced and explainable models."

Response: We thank the reviewer for their time and consideration in reviewing our
manuscript. We agree that there are many tasks to which Chemical Language Models may be
applied. Here, we have focused on solubility, as it is of high importance to the chemical
sciences and a key property of interest in industry. Developing our explainability
approach for this single task already required significant effort, mainly in interpreting
the results, which must be visualised not only 'per sample' but also compared across a
multitude of other structures. It would indeed be possible to transfer the method to
other data sets, but a phenomenon such as toxicity would make the explainability analysis
considerably more complex. Toxicity is a far more complex phenomenon than solubility, as
it requires consideration of how molecular structure interacts with a huge range of
potential biological targets. Moreover, most publicly available toxicity datasets are
qualitative or discrete (i.e. classification), and thus contain much less fundamental
chemical information than, for example, solubility data.

We have not included a comparison with SELFIES. Work on training the ChemBERTa
model has noted that there was no significant difference in model performance
between using SMILES and SELFIES as molecular language input
(arXiv:2010.09885). Similarly, ChemBERTa-2 focuses exclusively on SMILES.
Furthermore, no pretrained Chemical Language Models of comparable quality and
scale are available for SELFIES.
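
For readers interested in exploring the alternative representation, conversion between the two languages is straightforward with the selfies package (a sketch only; SELFIES was not used in this work):

    import selfies as sf

    smiles = "c1ccccc1O"                 # phenol as a SMILES string
    selfies_str = sf.encoder(smiles)     # SMILES -> SELFIES
    roundtrip = sf.decoder(selfies_str)  # SELFIES -> SMILES
    print(selfies_str, roundtrip)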




Round 2

Revised manuscript submitted on 29 May 2024
 

22-Jun-2024

Dear Dr Robinson:

Manuscript ID: DD-ART-03-2024-000084.R1
TITLE: What can Attribution Methods show us about Chemical Language Models?

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************
EDITOR'S COMMENTS:

In the response to the previous reviewer asking about other representations (e.g., SELFIES), you replied:
"We have not included a comparison with SELFIES. Work on training the ChemBERTa model has noted that there was no significant difference in model performance between using SMILES and SELFIES as molecular language input (arXiv:2010.09885). Similarly, ChemBERTa-2 focuses exclusively on SMILES. Furthermore, no pretrained Chemical Language Models of comparable quality and scale are available for SELFIES."

However, no change was made in the manuscript.

I agree with the reviewer that it is likely to be a question that would be asked by many readers, but it was not answered in your revision.

Therefore, I ask that you include a brief statement (perhaps adapted from your response above) as part of the manuscript. This does not necessitate any calculations, but will help clarify the justification of what you did and open areas for future research.


************


 
Reviewer 1

All of my comments have been addressed and I recommend the manuscript be accepted.

Reviewer 3

None


 

Dear Dr. Schrier,

Thank you for your letter and decision. Our responses to the reviewers' comments are given below.

Yours sincerely,

Dr. William E. Robinson

EDITOR'S COMMENTS:

In the response to the previous reviewer asking about other representations (e.g., SELFIES), you replied: "We have not included a comparison with SELFIES. Work on training the ChemBERTa model has noted that there was no significant difference in model performance between using SMILES and SELFIES as molecular language input (arXiv:2010.09885). Similarly, ChemBERTa-2 focuses exclusively on SMILES. Furthermore, no pretrained Chemical Language Models of comparable quality and scale are available for SELFIES."

However, no change was made in the manuscript.

I agree with the reviewer that it is likely to be a question that would be asked by many readers, but it was not answered in your revision.

Therefore, I ask that you include a brief statement (perhaps adapted from your response above) as part of the manuscript. This does not necessitate any calculations, but will help clarify the justification of what you did and open areas for future research.

RESPONSE

We apologise for this omission and agree that it is indeed an important consideration to include in the manuscript. As suggested, we have updated the main text on page 3 to clarify our choice of using SMILES strings over SELFIES as CLM input.

************
REVIEWER REPORT(S):
Referee: 1

Comments to the Author
All of my comments have been addressed and I recommend the manuscript be accepted.

RESPONSE

We thank the reviewer for their time and consideration of our manuscript.

Referee: 3

Comments to the Author
None

RESPONSE

We thank the reviewer for their time and consideration of our manuscript.

************




Round 3

Revised manuscript submitted on 25 Jun 2024
 

27-Jun-2024

Dear Dr Robinson:

Manuscript ID: DD-ART-03-2024-000084.R2
TITLE: What can Attribution Methods show us about Chemical Language Models?

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our LinkedIn account [https://rsc.li/Digital_showcase] please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery


******
******

Please contact the journal at digitaldiscovery@rsc.org

************************************

DISCLAIMER:

This communication is from The Royal Society of Chemistry, a company incorporated in England by Royal Charter (registered number RC000524) and a charity registered in England and Wales (charity number 207890). Registered office: Burlington House, Piccadilly, London W1J 0BA. Telephone: +44 (0) 20 7437 8656.

The content of this communication (including any attachments) is confidential, and may be privileged or contain copyright material. It may not be relied upon or disclosed to any person other than the intended recipient(s) without the consent of The Royal Society of Chemistry. If you are not the intended recipient(s), please (1) notify us immediately by replying to this email, (2) delete all copies from your system, and (3) note that disclosure, distribution, copying or use of this communication is strictly prohibited.

Any advice given by The Royal Society of Chemistry has been carefully formulated but is based on the information available to it. The Royal Society of Chemistry cannot be held responsible for accuracy or completeness of this communication or any attachment. Any views or opinions presented in this email are solely those of the author and do not represent those of The Royal Society of Chemistry. The views expressed in this communication are personal to the sender and unless specifically stated, this e-mail does not constitute any part of an offer or contract. The Royal Society of Chemistry shall not be liable for any resulting damage or loss as a result of the use of this email and/or attachments, or for the consequences of any actions taken on the basis of the information provided. The Royal Society of Chemistry does not warrant that its emails or attachments are Virus-free; The Royal Society of Chemistry has taken reasonable precautions to ensure that no viruses are contained in this email, but does not accept any responsibility once this email has been transmitted. Please rely on your own screening of electronic communication.

More information on The Royal Society of Chemistry can be found on our website: www.rsc.org




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.