Peer review history from the journal Digital Discovery

Assessment of chemistry knowledge in large language models that generate code

Round 1

Manuscript submitted on 17 Aug 2022
 

08-Nov-2022

Dear Dr White:

Manuscript ID: DD-ART-08-2022-000087
TITLE: Do large language models know chemistry?

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

In this work, the authors evaluate the ability of LLMs to generate computer code from prompts for problems in chemistry. They provide a set of benchmark problems, which is made publicly available. While the work is interesting, the authors need to address some concerns.

1) Although the title of the manuscript seems attractive, the contents fail to live up to the title. Specifically, the posed question is very broad. To answer it, multiple standard NLP tasks such as NER, question answering, and question generation, along with further detailed analysis, would be required. This manuscript focuses only on code generation (and briefly on molecular structures), and hence the title needs to be modified to reflect this scope.

2) Further, it is not clear how the prompts have been selected. Is the selection arbitrary, is it based on textbook questions, or was a survey conducted among different populations? A more formal account of how the prompts were developed needs to be outlined, as one of the main contributions of the work is the database itself, which is claimed to be a benchmark dataset for evaluating LLMs.

3) Again, it is not clear why the authors focus on a code generation task to evaluate the LLMs. Specifically, code typically has a well-defined structure: code for a Monte Carlo simulation, whether in chemistry or for a casino, will have the same structure with just different variables. Why, then, do the authors think this task is representative of knowledge of chemistry?

4) The work focuses on GPT-based LLMs, and the tasks selected are also along those lines. When discussing LLMs, there are several other models, such as BERT, SciBERT, ChemBERT, and T5. An exhaustive evaluation should take into account all the different types of language models and their pros and cons, including which tasks can be performed by which model.

Altogether, the work presents an interesting direction that is surely worth exploring, with some preliminary results that are promising. However, the manuscript fails to do justice to the topic and is far from a complete manuscript that provides a detailed analysis of it. As such, it is recommended that the authors do either of the following:
(i) perform a broad comparison across the different tasks and LLMs mentioned earlier in this review, or
(ii) narrow the scope of the paper and provide a much more detailed discussion of the tasks considered, with detailed evaluations of where the models fail, why they fail, what training datasets they were exposed to, whether those datasets contain similar context, whether the models can be fine-tuned for few-shot generalization, etc.

Reviewer 2

This article demonstrates an interesting application of large language models (LLMs) to generate code for chemistry-related tasks. There are minor suggestions to improve the overall clarity of the text with regard to model selection and validation:
1. It would be good to compare the model performance with some simple language models as baseline.
2. Some text describing how the data is split for training and validation should be included in the manuscript. A brief discussion on hyperparameter optimization would also be useful.

Reviewer 3

The development and application of large language models (LLMs) in several scientific disciplines, including chemistry, will almost certainly result in a significant shift in how we do science within the next few years. This paper examines the performance of existing LLMs on a few chemistry programming and molecular structure prediction tasks.

The manuscript is well-written and captivating. The data adequately supports the conclusions, and the overall work is ambitious and credible.

During the editorial phase, I would recommend experimenting with different plot styles for figures 3 and 4, possibly using a spider plot to make the trends more easily readable. This should not be considered a suggestion for minor revisions.

Overall, I recommend that the current manuscript be accepted without further delay.
Teo


 

Reviewer 1
Reviewer 1, Comment 1
In this work, the authors evaluate the ability of LLMs to generate computer code from prompts for problems in chemistry. They provide a set of benchmark problems, which is made publicly available. While the work is interesting, the authors need to address some concerns.
1) Although the title of the manuscript seems attractive, the contents fail to live up to the title. Specifically, the posed question is very broad. To answer it, multiple standard NLP tasks such as NER, question answering, and question generation, along with further detailed analysis, would be required. This manuscript focuses only on code generation (and briefly on molecular structures), and hence the title needs to be modified to reflect this scope.
Author Reply: We have changed the title to more specifically reflect our scope. Our goal was to produce a set of benchmark problems that can be used to evaluate whether LLMs contain chemistry knowledge. Rather than try to make an exhaustive test of all possible ways in which these models could know chemistry, we focused on a particular way of formulating questions (as function definitions) that can be evaluated for current and forthcoming text-generating LLMs. This does not limit us to the field of computational chemistry: most of our examples are taken from the undergraduate chemistry curriculum and simply formulated as coding tasks. To make it clear that the scope of our example prompts is much wider than computational chemistry, we have added an additional paragraph (paragraph 2) in Section II describing the range of categories studied.
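For illustration, a prompt in this style might look like the following sketch. This is a hypothetical example in the spirit of the benchmark, not a verbatim entry from our dataset: the model receives the function signature and docstring and must complete the body.

# Hypothetical benchmark prompt: the model is given the signature and
# docstring, and a correct completion must return the right answer.
def ideal_gas_pressure(n, T, V):
    """Return the pressure (Pa) of an ideal gas given n (mol),
    temperature T (K), and volume V (m^3)."""
    # One correct completion, via the ideal gas law P = nRT/V:
    R = 8.314  # gas constant, J/(mol K)
    return n * R * T / V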
Reviewer 1, Comment 2
Further, it is not clear how the prompts have been selected. Is the selection arbitrary, is it based on textbook questions, or was a survey conducted among different populations? A more formal account of how the prompts were developed needs to be outlined, as one of the main contributions of the work is the database itself, which is claimed to be a benchmark dataset for evaluating LLMs.
Author Reply: Our goal was to create a framework by which LLMs can be evaluated, which includes a set of benchmark problems and the software to evaluate those problems with different LLMs. We also aimed to make this framework expandable through community contributions, easily facilitated by GitHub pull requests. Upon acceptance of a pull request, the entire dataset is automatically evaluated by our software.
We have now described this process in more detail in the first paragraph of Section II in the main text. In brief, to build an initial database, we first made the list of categories seen in Table 1, a wide range of categories that fall under the umbrella of chemistry. Moreover, we felt that we and the members of our research groups, who hold undergraduate, master's, and PhD degrees in chemistry or chemical engineering, collectively have sufficient expertise to generate representative topics from typical undergraduate and graduate classes across these domains. We surveyed members of our research groups for these topics, and then assigned some of them and ourselves (the authors of this paper) to create working examples and add them to our database.
Reviewer 1, Comment 3
Again, it is not clear why the authors focus on a code generation task to evaluate the LLMs. Specifically, code typically has a well-defined structure: code for a Monte Carlo simulation, whether in chemistry or for a casino, will have the same structure with just different variables. Why, then, do the authors think this task is representative of knowledge of chemistry?
Author Reply: As described in the previous response, the tasks brainstormed and generated by experts in our research groups are chemistry questions, not simply code generation problems. They include undergraduate chemistry topics like Ideal Gases, Phase Transitions, Chemical Kinetics, properties of peptides, heat capacity, dipole moments, and more. All of the examples are included in the paper's SI, on a website, and in the data folder:
https://github.com/ur-whitelab/nlcc-data/tree/main/data
We feel that any quantitative question, as well as some non-quantitative problems (such as generating SMILES strings), can be evaluated via generation of a function that returns the answer, which leaves little ambiguity about whether the model generated a correct solution. Of course, some tasks could not be evaluated automatically, such as those related to plotting scientific data, which is why we employed human evaluators (again, the authors of this paper) to judge the difficulty and accuracy of those tasks.
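As a minimal sketch of this kind of automatic evaluation (illustrative only; the names generated_code and passes are hypothetical, not our actual harness), one can execute a model completion in a scratch namespace, call the requested function, and compare its return value to a reference answer within a tolerance:

import math

# A model completion for a hypothetical benchmark prompt.
generated_code = """
def ideal_gas_pressure(n, T, V):
    R = 8.314
    return n * R * T / V
"""

def passes(code, fn_name, args, expected, rel_tol=1e-3):
    namespace = {}
    exec(code, namespace)  # run the completion in a scratch namespace
    result = namespace[fn_name](*args)
    return math.isclose(result, expected, rel_tol=rel_tol)

# 1 mol at 300 K in 0.025 m^3 should give about 99768 Pa.
print(passes(generated_code, "ideal_gas_pressure", (1.0, 300.0, 0.025), 99768.0))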
Reviewer 1, Comment 4
The work focuses on GPT-based LLMs, and the tasks selected are also along those lines. When discussing LLMs, there are several other models, such as BERT, SciBERT, ChemBERT, and T5. An exhaustive evaluation should take into account all the different types of language models and their pros and cons, including which tasks can be performed by which model.
Author Reply: We have revised the section on model architectures (within Methods) to contrast the different types of models. Although some work has explored BERT-type models for code completion, they cannot generally be used for prompt completion and are not competitive with GPT models because they require fixed-length completions. T5 is similar: it can classify or translate code, but it cannot answer open-ended prompts. This is detailed in the introduction.
Reviewer 1, Comment 5
Altogether, the work presents an interesting direction that is surely worth exploring, with some preliminary results that are promising. However, the manuscript fails to do justice to the topic and is far from a complete manuscript that provides a detailed analysis of it. As such, it is recommended that the authors do either of the following: (i) perform a broad comparison across the different tasks and LLMs mentioned earlier in this review, or (ii) narrow the scope of the paper and provide a much more detailed discussion of the tasks considered, with detailed evaluations of where the models fail, why they fail, what training datasets they were exposed to, whether those datasets contain similar context, whether the models can be fine-tuned for few-shot generalization, etc.
Author Reply: We have increased the number of models to 9. Very few models can write correct code, because doing so requires billions of parameters, but we have taken additional time and computing resources to evaluate even more models.
Reviewer 2
Reviewer 2, Comment 1
This article demonstrates an interesting application of large language models (LLMs) to generate code for chemistry-related tasks. There are minor suggestions to improve the overall clarity of the text with regard to model selection and validation:
1. It would be good to compare the model performance with some simple language models as baseline.
Author Reply: We have now compared with some additional smaller models (e.g., CodeGen-350M, CodeGen-1B), and their near-0% accuracy rules out simpler models (results in the SI). We have added to the main text a discussion of our rationale for using models with more than 1B parameters, along with a citation to a paper showing the relatively monotonic relationship between parameter count and benchmark accuracy.
Reviewer 2, Comment 2
2. Some text describing how the data is split for training and validation should be included in the manuscript. A brief discussion on hyperparameter optimization would also be useful.
Author Reply: The LLMs studied were already trained elsewhere, so we do not have any train/test splitting to perform.
Reviewer 3
Reviewer 3, Comment 1
The development and application of large language models (LLMs) in several scientific disciplines, including chemistry, will almost certainly result in a significant shift in how we do science within the next few years. This paper examines the performance of existing LLMs on a few chemistry programming and molecular structure prediction tasks.
The manuscript is well-written and captivating. The data adequately supports the conclusions, and the overall work is ambitious and credible.
During the editorial phase, I would recommend experimenting with different plot styles for figures 3 and 4, possibly using a spider plot to make the trends more easily readable. This should not be considered a suggestion for minor revisions.
Overall, I recommend that the current manuscript be accepted without further delay.
Author Reply: Thank you for your positive comments! We have tried reducing the number of elements in the plots to make them clearer and have put detailed plots into the SI.




Round 2

Revised manuscript submitted on 09 Dec 2022
 

16-Jan-2023

Dear Dr White:

Manuscript ID: DD-ART-08-2022-000087.R1
TITLE: Assessment of chemistry knowledge in large language models that generate code

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 2

I would like to thank the authors for their response to the comments and recommend the manuscript for publication.

Reviewer 3

The paper has been changed in response to most of the other referees' feedback, and it is more focused than the prior version.
I strongly recommend accepting the manuscript in its current form.

Reviewer 1

The authors have satisfactorily addressed the concerns. However, in the meantime, the world has been taken over by the ChatGPT storm. I tried some of the prompts in the paper with ChatGPT. For example: "Can you write a python function that returns the phase transition pressure at temperature T2 given a heat of vaporization HVap, and a reference temperature and pressure T1, P1?"

It gave very nice running code with the following documentation:
"This function uses the Clausius-Clapeyron equation to calculate the phase transition pressure at temperature T2. The Clausius-Clapeyron equation relates the change in vapor pressure of a substance to its heat of vaporization and the change in temperature. The equation is given by:
P2/P1 = exp(-HVap/(RT2) + HVap/(RT1))
where P1 and P2 are the vapor pressures at temperatures T1 and T2, respectively, HVap is the heat of vaporization, R is the ideal gas constant, and T1 and T2 are the temperatures in Kelvin."
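For reference, a Python function along the lines the reviewer describes might look like the sketch below. This is an illustrative reconstruction from the quoted equation, not ChatGPT's verbatim output.

import math

R = 8.314  # ideal gas constant, J/(mol K)

def transition_pressure(HVap, T1, P1, T2):
    """Return the phase transition pressure at temperature T2 (K), given the
    heat of vaporization HVap (J/mol) and a reference point T1 (K), P1.
    Clausius-Clapeyron: ln(P2/P1) = -HVap/R * (1/T2 - 1/T1)."""
    return P1 * math.exp(-HVap / R * (1.0 / T2 - 1.0 / T1))

# Sanity check with water (HVap ~ 40660 J/mol; boils at 373.15 K, 101325 Pa):
# at 353.15 K this gives ~48 kPa, close to the tabulated ~47.4 kPa.
print(transition_pressure(40660.0, 373.15, 101325.0, 353.15))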

I have two suggestions for the authors:
1. They may include the ChatGPT results in this paper. I know this is additional work, but I think it will significantly improve the value of the paper and make it one of the first papers to use ChatGPT in the chemistry domain.
2. Please clarify that the analysis performed in this work is anecdotal and not exhaustive: several specific approaches are tried out, and this does not rigorously reveal the chemistry knowledge of LLMs. The authors may refer to papers in the NLP community where such questions have been rigorously addressed by first defining questions and tasks of varying degrees of difficulty and then evaluating multiple language models on them.

Please note that both of these are suggestions from the reviewer, and the authors may choose whether or not to act on them. Congratulations on a nice paper.


 

We have addressed the comments of Referee 1 by further pointing out, in the discussion, the specific limits of the domains we have tested, and we have added comments about ChatGPT to the introduction and results, as well as an SI figure using ChatGPT.

We feel these changes should address all the remaining comments.




Round 3

Revised manuscript submitted on 17 Jan 2023
 

19-Jan-2023

Dear Dr White:

Manuscript ID: DD-ART-08-2022-000087.R2
TITLE: Assessment of chemistry knowledge in large language models that generate code

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry


******
******

Please contact the journal at digitaldiscovery@rsc.org

************************************

DISCLAIMER:

This communication is from The Royal Society of Chemistry, a company incorporated in England by Royal Charter (registered number RC000524) and a charity registered in England and Wales (charity number 207890). Registered office: Burlington House, Piccadilly, London W1J 0BA. Telephone: +44 (0) 20 7437 8656.

The content of this communication (including any attachments) is confidential, and may be privileged or contain copyright material. It may not be relied upon or disclosed to any person other than the intended recipient(s) without the consent of The Royal Society of Chemistry. If you are not the intended recipient(s), please (1) notify us immediately by replying to this email, (2) delete all copies from your system, and (3) note that disclosure, distribution, copying or use of this communication is strictly prohibited.

Any advice given by The Royal Society of Chemistry has been carefully formulated but is based on the information available to it. The Royal Society of Chemistry cannot be held responsible for accuracy or completeness of this communication or any attachment. Any views or opinions presented in this email are solely those of the author and do not represent those of The Royal Society of Chemistry. The views expressed in this communication are personal to the sender and unless specifically stated, this e-mail does not constitute any part of an offer or contract. The Royal Society of Chemistry shall not be liable for any resulting damage or loss as a result of the use of this email and/or attachments, or for the consequences of any actions taken on the basis of the information provided. The Royal Society of Chemistry does not warrant that its emails or attachments are Virus-free; The Royal Society of Chemistry has taken reasonable precautions to ensure that no viruses are contained in this email, but does not accept any responsibility once this email has been transmitted. Please rely on your own screening of electronic communication.

More information on The Royal Society of Chemistry can be found on our website: www.rsc.org




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.