From the journal Digital Discovery: Peer review history

14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon

Round 1

Manuscript submitted on 12 Jun 2023
 

25-Jun-2023

Dear Dr Jablonka:

Manuscript ID: DD-ART-06-2023-000113
TITLE: 14 Examples of How LLMs Can Transform Materials Science and Chemistry: A Reflection on a Large Language Model Hackathon

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

This manuscript reports a variety of projects undertaken during a 1.5 day LLM hackathon. I found myself slightly unclear as to the primary purpose of this article: is it a perspective? A research article? A meeting report-out with no scientific conclusion? Are the examples meant to be reproducible? This is an unusual type of article, so I would appreciate more explicit framing from the authors. It has the potential to be a very interesting contribution, but at the moment seems to fall short due to a lack of depth and defensible claims.

It is not possible to provide a rigorous scientific review of each of the projects, as they are described in shallow detail and often make no substantial claim or demonstration of performance. It is important that the article try to describe more precisely (a) what was done in each of the case studies, (b) how it makes use of, or replaces, or improves upon existing tools/models, and (c) how this is likely to impact scientific workflows in the future. Moreover, being transparent about the limitations of these demonstrations would not detract from the breadth of what is reported and is important, as the authors acknowledge.

Comments:
1. The first paragraph of the abstract should probably be re-evaluated, as it does not provide any context for this work. To what “tools” are the authors referring? Every computational technique used in chemistry or materials science? Claiming that the effectiveness of computational techniques is limited warrants substantially more specificity.
2. The framing of the work in the introduction is also unclear. I realize there is a large breadth of topics and that the authors are intentionally being general in their comments, but the first paragraph seems to focus on supervised machine learning techniques. In this context, what do they consider a “tool” versus a “machine-learning model”? Is this about the frameworks used and types of models being too varied, or the fact that the field is full of single task models trained on a specific chemical/materials property endpoint? Somewhat ironically, this paper introduces over a dozen new fragmented tools.
3. The text contains several grandiose statements that do not make specific, defensible claims. It comments that “we need to start thinking about how LLMs will impact the future of materials science, chemistry, and beyond” while citing several articles that have already done this. “The teams delivered more results than in most other hackathons we participated in.” It further comments that the diversity of applications “show[s] that LLMs are here to stay and are likely a foundational capability that will be integrated into most aspects of the research process” and that “potential applications are almost unlimited”. This is the kind of statement that could belong in a perspective article as it is purely opinion-based.
4. The overall manuscript would be significantly more useful if there were a practical description of how each of these projects was performed. That is, the actual steps in the process to fine-tune such LLMs for these or similar workflows. Not all linked repositories seem to offer this. Code should also be versioned and deposited via Zenodo or something comparable.

Some specific comments on the examples:
1. Table II should incorporate the baseline performance metrics that are in the supporting information. The error of GPT2-LoRA is 1-2 orders of magnitude larger than SchNet and FCHL. This relates to an overall point above about how these examples are likely to affect scientific workflows in the future – this is a long way off from outperforming these traditional QSPR approaches.
2. For the additional context phrases depicted in Figure 1, did regression performance worsen if these relationships written in natural language were rephrased? Or if the description was known to be false or opposite to the trend? The improvement from 0.67 to 0.72 R^2 should be contextualized in terms of the variance one observes when making these changes to the context.
3. The discussion of word2vec embeddings fails to discuss prior approaches in materials science (e.g., 10.1038/s41586-019-1335-8, 10.1063/5.0021106). The success of the project is implied by the sentence “Visual inspection…” yet the SI mentions that ScholarBERT does better when structural similarity is incorporated, which would be the more traditional way of doing this search; does ScholarBERT+structure work better than structure alone? What is the actual success metric being used here, and what can be said about the actual promise of this approach?
4. Does the use of paraphrased text templates improve performance on regression tasks, or is this a hypothetical benefit?
5. There are insufficient details about the “GA” used to optimize for structural similarity to Vitamin C; what were the precise prompts used for this evaluation, and how would the authors compare its performance to that of a traditional SMILES, SELFIES, or Graph GA?
6. The order of SI sections does not match the main text. The section on MAPI-LLM also appears to be missing.
7. For sMolTalk, the description suggests that sometimes the LLM retrieves the wrong structure and visualizes something from the in-context learning examples instead of the query. What is the authors’ perspective on this failure mode, which seems to be common to LLM workflows (i.e., you can’t guarantee the LLM is doing what you asked it to, unlike other code execution workflows)?
8. In what way is Whinchat like an ELN, and what are the logging/validation capabilities it offers in addition to the LLM agent interface? Are there insights that Datalab provides (e.g., on suitable analytical techniques) that were particularly impressive or unimpressive?
9. What are the direct applications of the knowledge graph the authors use Insight Graph to extract – is it solely to “launch a literature review”? The authors acknowledge that pairwise connections might not be enough to model materials.
10. TableToJson mentions OCR, but this seems irrelevant as the task starts from text, not images. Otherwise, this section is a clear demonstration of strengths and limitations.
11. The example GPT-2-medium output in TitleToAbstract contains grammatically and syntactically incorrect sentences. This seems worth commenting on; is it simply because the authors have used an out-of-date base model?
12. I-Digest mentions that the questions generated for a video might allow students to be guided to relevant timestamps or additional materials. Is this entirely speculative at this stage?

Reviewer 2

While I can see the importance of this hackathon in highlighting the use of LLMs to the material science and broader chemistry community, this does not fall in the category of a typical scientific publication that can be or needs to be peer reviewed. Generally, when a manuscript is sent for peer review:
1. It has clear working hypotheses that it intends to prove or disprove.
2. Provides methodology for reproduction and verification.
3. Can address revisions suggested by reviewers through further experimentation.

In this case, it does none of the three. If I suggest a certain change or ask the authors to verify a claim, I am not sure that they will be able to address it since the hackathon is over and the participants disbanded. In several cases, the effort is the result - in the sense that the participants set out to prove that something can be done using LLMs and have done so. For example, in the case A.c "molecular discovery by context", the authors surmise that "Visual inspection indicates that the selected molecules indeed bear similarities to known hydrogen carrier molecules.". This is a valid result for a two day hackathon but is not a sufficient result for a peer reviewed scientific publication. There is no feedback that I can give here that is actionable because there is no concrete result. This is the case for the majority of the results presented here.

I do agree that this article should be published and Digital Discovery is an apt venue for the same. However, this does not fall into the category of a typical research article but must be published as a perspective, letter to the editor or some other category that clearly mentions that it is not peer reviewed. These articles acquire citations, so it should still be useful and indexable for the community.

Reviewer 3

Summary:
This paper summarizes the 14 projects in the LLMs for chemistry hackathon. These projects cover a wide range of chemistry and materials tasks, such as extracting knowledge, developing new educational applications, etc.

Strengths:
1. Despite the revolutionary development and application of LLMs in various fields, there is limited research on LLMs in chemistry. This paper provides extensive ideas demonstrating the utilization of LLMs in materials science and chemistry, which gives valuable insights to researchers in this field.
2. The hackathon is well organized and researchers from diverse research fields contributed to it, which increases the impact of this paper.

Weakness:
1. The 14 projects in this paper are promising, and it would be better to add more insightful conclusions and prospects in the Conclusion section. Anyway, this paper provides many practical applications of LLMs in chemistry and I recommend accepting it.


 

Dear Dr. Hippalgaonkar:

As discussed, we would like to submit the revised version of our manuscript as a Perspective.
We believe we have addressed most, if not all, of the reviewers' comments in the revised version of the manuscript.

Sincerely,
Ben Blaiszik and Kevin Maik Jablonka


Reviewer Point P 1.1 — This manuscript reports a variety of projects undertaken during a 1.5 day LLM hackathon. I found myself slightly unclear as to the primary purpose of this article: is it a perspective? A research article? A meeting report-out with no scientific conclusion? Are the examples meant to be reproducible? This is an unusual type of article, so I would appreciate more explicit framing from the authors. It has the potential to be a very interesting contribution, but at the moment seems to fall short due to a lack of depth and defensible claims.

Reply: We changed the article type to “Perspective”. In addition, we edited the introduction to clarify the scope:

One of the aims of this work is to provide food for thought on how LLMs such as GPT-4,1–6 can be used to address these challenges.

and

This article showcases some of the projects (Table 1) developed during the hackathon with the goal of providing ideas and examples for the use of LLMs in the molecular and materials sciences.
Reviewer Point P 1.2 — It is not possible to provide a rigorous scientific review of each of the projects, as they are described in shallow detail and often make no substantial claim or demonstration of performance. It is important that the article try to describe more precisely (a) what was done in each of the case studies, (b) how it makes use of, or replaces, or improves upon existing tools/models, and (c) how this is likely to impact scientific workflows in the future. Moreover, being transparent about the limitations of these demonstrations would not detract from the breadth of what is reported and is important, as the authors acknowledge.

Reply: We explicitly added summary statements for points (a)–(c) to the SI sections of the projects. In addition, we also added statements about challenges and potential future work.
Reviewer Point P 1.3 — 1. The first paragraph of the abstract should probably be re-evaluated, as it does not provide any context for this work. To what “tools” are the authors referring? Every computational technique used in chemistry or materials science? Claiming that the effectiveness of computational techniques is limited warrants substantially more specificity.

Reply: Based on the feedback, we also revised the first paragraph of the abstract, which now reads:

Large-language models (LLMs) such as GPT-4 caught the interest of many scientists. Recent studies suggested that these models could be useful in chemistry and materials science. To explore these possibilities, we organized a hackathon.
Reviewer Point P 1.4 — 2. The framing of the work in the introduction is also unclear. I realize there is a large breadth of topics and that the authors are intentionally being general in their comments, but the first paragraph seems to focus on supervised machine learning techniques. In this context, what do they consider a “tool” versus a “machine-learning model”? Is this about the frameworks used and types of models being too varied, or the fact that the field is full of single task models trained on a specific chemical/materials property endpoint? Somewhat ironically, this paper introduces over a dozen new fragmented tools.

Reply: To clarify, we revised the introduction:

Since science rewards doing novel things for the first time, we now face a deluge of cheminformatics and simulation tools as well as machine-learning models for various tasks.

and

All these tools and models commonly require input data in their own rigid, well-defined form (e.g., a table with specific columns or images from a specific microscope with specific dimensions).
Reviewer Point P 1.5 — 3. The text contains several grandiose statements that do not make specific, defensible claims. It comments that “we need to start thinking about how LLMs will impact the future of materials science, chemistry, and beyond” while citing several articles that have already done this. “The teams delivered more results than in most other hackathons we participated in.” It further comments that the diversity of applications “show[s] that LLMs are here to stay and are likely a foundational capability that will be integrated into most aspects of the research process” and that “potential applications are almost unlimited”. This is the kind of statement that could belong in a perspective article as it is purely opinion-based.

Reply: We switched the article type to “Perspective”.
Reviewer Point P 1.6 — 4. The overall manuscript would be significantly more useful if there were a practical description of how each of these projects was performed. That is, the actual steps in the process to fine-tune such LLMs for these or similar workflows. Not all linked repositories seem to offer this. Code should also be versioned and deposited via Zenodo or something comparable.

Reply: The repositories have been revised and archived on Zenodo. Links to the repositories can be found in the revised Table 1.
Some specific comments on the examples:
Reviewer Point P 1.7 — 1. Table II should incorporate the baseline performance metrics that are in the supporting information. The error of GPT2-LoRA is 1-2 orders of magnitude larger than SchNet and FCHL. This relates to an overall point above about how these examples are likely to affect scientific workflows in the future – this is a long way off from outperforming these traditional QSPR approaches.

Reply: We added the baselines to the table’s caption.

While we acknowledge that our LLM-based approach does not presently surpass the performance of traditional QSPR methods, we emphasize the novelty and potential of our approach. Currently, our LLMs are fine-tuned solely on SMILES strings, a 2D representation of molecular structures, rather than the full 3D geometries utilized by most atomistic machine learning models. Despite this, our method has already demonstrated encouraging performance in predicting molecular properties.

Given this, we posit that there remains significant potential for improvement in the performance of our LIFT framework. As the development and accessibility of LLMs advance, particularly in the field of multimodality, which could allow the perception of 3D geometries, we expect to see a notable enhancement in their predictive capabilities.

In the future, with advancements in LLMs and fine-tuning methods like LoRA, we envision the LIFT framework acting as a powerful tool for tasks like inverse design and reinforcement learning, potentially accelerating the discovery of novel molecules. Our study represents an early exploration into the application of LLMs for molecular property prediction, and we appreciate the reviewer’s recognition of its potential in this exciting direction.
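
As a rough illustration of the LIFT-style fine-tuning discussed in this reply, the sketch below attaches LoRA adapters to GPT-2 with the Hugging Face peft library. The hyperparameters and the prompt template are illustrative assumptions, not necessarily the team's exact configuration.

    # Sketch: LoRA fine-tuning of GPT-2 for LIFT-style property prediction.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
    model = AutoModelForCausalLM.from_pretrained("gpt2-medium")

    config = LoraConfig(
        r=8,                        # rank of the low-rank update matrices
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["c_attn"],  # GPT-2 fuses Q/K/V into one projection
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # only the small adapters are trained

    # LIFT casts regression as text completion; a training example pairs a
    # question about a SMILES string with the numeric label (values invented):
    example = "What is the atomization energy of CCO?###-1035.2@@@"

The delimiter convention ("###" to end the prompt, "@@@" as stop sequence) follows common LIFT-style fine-tuning practice; the exact template here is hypothetical.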
Reviewer Point P 1.8 — 2. For the additional context phrases depicted in Figure 1, did regression performance worsen if these relationships written in natural language were rephrased? Or if the description was known to be false or opposite to the trend? The improvement from 0.67 to 0.72 R^2 should be contextualized in terms of the variance one observes when making these changes to the context.

Reply: Indeed, our experimental results show that changing the description of these relationships can affect the model’s accuracy. We added the following to the SI:

When we incorrectly changed the context of the ratio of fly ash to GGBFS, it negatively affected the R-squared value for ICL, causing it to drop to 0.6. This misrepresentation of the rule led to a decrease in the model’s predictive accuracy, demonstrating that the quality of the information included in the “fuzzy” context is critical to the overall performance of LLMs. It should be noted, however, that the impact on the R-squared value may vary depending on the importance of the rule in the overall context. That is, not all changes in context have a similar impact, and the drop to 0.6 might occur only in the case of the ratio of fly ash to GGBFS. Other studies, such as those conducted in the LIFT work, have examined LLM performance under minor changes in wording or in the presence of noise in the features. In these experiments, the robustness of LIFT-based predictions was comparable to classical ML algorithms, making it a promising alternative for using fuzzy domain knowledge in predictive modeling.

From our experiments, we can confirm that the R-squared values for different implementations of this approach did not fall below the 0.7 mark, even when we started over with completely different wording. This range shows the potential variability one might expect when adjusting the natural language descriptions of these relationships. However, we must emphasize that these observations, while promising, are not conclusive. Our results highlight the possibility and potential of using LLMs in this way. However, more comprehensive studies are needed to fully understand the impact of changes to the fuzzy context on regression performance. We appreciate the reviewer’s insightful question, which has led us to expand on the discussion in our manuscript. To address this point, we have added the following sentence to the Text2Concrete part of our manuscript:

While the Text2Concrete example has not exhaustively analyzed how “fuzzy” context alterations affect LLM performance, we recognize this as a key area for future research that could enhance the application of LLMs and our approach to leveraging “fuzzy” domain knowledge within materials science.
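
To make the setup tangible, here is a minimal sketch of how such a "fuzzy" context can be prepended to few-shot examples for in-context learning. The rule wording, formulation fields, and numbers are illustrative assumptions, not the actual Text2Concrete data.

    # Sketch: a fuzzy-context ICL prompt for concrete strength regression.
    rules = (
        "Design rules: a lower water-to-cement ratio increases compressive "
        "strength; a higher ratio of fly ash to GGBFS tends to lower it.\n\n"
    )

    train_examples = [
        ("cement 350 kg, fly ash 60 kg, GGBFS 120 kg, water 160 kg", 42.1),
        ("cement 300 kg, fly ash 90 kg, GGBFS 90 kg, water 175 kg", 35.7),
    ]
    query = "cement 320 kg, fly ash 80 kg, GGBFS 100 kg, water 165 kg"

    prompt = rules
    for formulation, strength in train_examples:
        prompt += f"Formulation: {formulation}\nStrength: {strength} MPa\n\n"
    prompt += f"Formulation: {query}\nStrength:"  # the model completes the number

    print(prompt)

Rephrasing or inverting the rules sentence, as in the experiment described above, changes only the rules string, which makes this kind of robustness test cheap to run.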
Reviewer Point P 1.9 — 3. The discussion of word2vec embeddings fails to discuss prior approaches in materials science (e.g., 10.1038/s41586-019-1335-8, 10.1063/5.0021106). The success of the project is implied by the sentence “Visual inspection…” yet the SI mentions that ScholarBERT does better when structural similarity is incorporated, which would be the more traditional way of doing this search; does ScholarBERT+structure work better than structure alone? What is the actual success metric being used here, and what can be said about the actual promise of this approach?

Reply: We added a reference to the quoted review article and added additional emphasis on the prior work of Tshitoyan:

Much context is available in the full text of scientific articles. This has been exploited by Tshitoyan et al.,7 who used a Word2Vec8 approach to embed words into a vector space. Word2Vec does so by tasking a model to predict, for a word, the probability of all possible next words in a vocabulary. In this way, word embeddings capture syntactic and semantic details of lexical items (i.e., words). When applied to materials science abstracts, the word embeddings of compounds such as Li2CuSb could be used for materials discovery by measuring their distance (cosine similarity) to concepts such as “thermoelectric”.9 However, traditional Word2Vec, as used by Tshitoyan et al.,7 only produces static embeddings, which remain unchanged after training.

To clarify the promise of the presented approach, we added the following to the SI:

Based on our empirical data, computing the energy capacity (wt% H2) and energy penalty (kJ/mol/H2) of adding and removing H2 for a candidate molecule (which are the quantitative “success metrics” for this project) using traditional quantum chemistry takes around 30 seconds per molecule on a 64-core Intel Xeon Phi 7230 processor, whereas the proposed LLM approach can screen around 100 molecules per second on a V100 GPU, achieving a 3000-times speedup.
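
For readers unfamiliar with the static-embedding baseline, the sketch below shows the Word2Vec ranking idea of Tshitoyan et al. on a toy corpus: candidates are ranked by cosine similarity to a concept word. The corpus and vocabulary are stand-ins, and the hackathon project used ScholarBERT contextual embeddings rather than this exact setup.

    # Sketch: rank candidate molecules by embedding similarity to a concept.
    from gensim.models import Word2Vec

    corpus = [
        ["methylcyclohexane", "is", "a", "liquid", "organic", "hydrogen", "carrier"],
        ["toluene", "releases", "hydrogen", "upon", "dehydrogenation"],
        ["nacl", "is", "a", "common", "salt"],
    ]  # a real run would use millions of tokenized abstracts

    model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, seed=0)

    candidates = ["methylcyclohexane", "toluene", "nacl"]
    ranked = sorted(candidates,
                    key=lambda w: model.wv.similarity(w, "hydrogen"),
                    reverse=True)
    print(ranked)  # candidates closest to the "hydrogen" concept come first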
Reviewer Point P 1.10 — 4. Does the use of paraphrased text templates improve performance on regression tasks, or is this a hypothetical benefit?

Reply: To clarify that no tests of potential improvements have been performed, we now write:

We expect this to reduce the risk of overfitting to a specific template. The latter might be particularly important if one still wants to retain the general language abilities of the LLMs after finetuning.
Reviewer Point P 1.11 — 5. There are insufficient details about the “GA” used to optimize for structural similarity to Vitamin C; what were the precise prompts used for this evaluation, and how would the authors compare its performance to that of a traditional SMILES, SELFIES, or Graph GA?

Reply: To clarify that no systematic analysis has been performed, we now write in the main text:

Future work will need to systematically investigate potential improvements compared to conventional GAs.

In addition, we added more details to the SI, such as an example prompt:

The following molecules are given as SMILES strings associated with a tanimoto similarity with an unknown target molecule. Please produce 10 SMILES strings that you think would improve their tanimoto scores using only this context. Do not try to explain or refuse on the grounds of insufficient context; any suggestion is better than no suggestion. Print the smiles in a Python list.
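
As a rough sketch of the evaluation loop implied by this prompt, the code below scores a population against vitamin C with RDKit Tanimoto similarity and hands the scored list back to the model for the next generation. llm_propose is a hypothetical wrapper around an LLM API call using the prompt quoted above, and the starting population is invented.

    # Sketch: the LLM as "genetic operator" in a similarity-optimization loop.
    from rdkit import Chem
    from rdkit.Chem import AllChem, DataStructs

    # Vitamin C (ascorbic acid), stereochemistry omitted for brevity.
    target = Chem.MolFromSmiles("OCC(O)C1OC(=O)C(O)=C1O")
    target_fp = AllChem.GetMorganFingerprintAsBitVect(target, 2)

    def tanimoto(smiles: str) -> float:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return 0.0  # unparseable suggestions score zero
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        return DataStructs.TanimotoSimilarity(target_fp, fp)

    population = ["CCO", "OCC(O)CO", "OC1=CC(=O)OC1"]
    for generation in range(5):
        scored = [(s, round(tanimoto(s), 3)) for s in population]
        population = llm_propose(scored)  # hypothetical: returns 10 new SMILES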
Reviewer Point P 1.12 — 6. The order of SI sections does not match the main text. The section on MAPI-LLM also appears to be missing.

Reply: We double-checked the ordering in the revised version.
Reviewer Point P 1.13 — 7. For sMolTalk, the description suggests that sometimes the LLM retrieves the wrong structure and visualizes something from the in-context learning examples instead of the query. What is the authors’ perspective on this failure mode, which seems to be common to LLM workflows (i.e., you can’t guarantee the LLM is doing what you asked it to, unlike other code execution workflows)?

Reply:

Retrieving the wrong structure. This failure mode relates to the reliability, truthfulness, and factuality of LLMs retrieving correct structure IDs (PDB IDs for proteins, or a compound’s CID on PubChem). We cannot rely on the LLM to remember all such IDs internally, and thus one requires the use of other tools (actual API queries to web servers, e.g., PubChem or PDB), i.e., making the LLM work in the agent setting, with reliable APIs as tools (augmented retrieval). This should improve the reliability; in other words, the reliability is then dissected into two different steps that might be easier to control: (a) ensuring the agent/LLM chooses the right API search tool (i.e., it knows hemoglobin is a protein), and (b) the API search tools being reliable enough to find the desired molecule/protein (which relies on commonly established practices for information retrieval, no LLM needed).

To clarify this, we added the following to the SI:

We are currently developing an agent based on the ReAct approach10 tooled with these APIs so that correct structures are always retrieved (i.e., to avoid the LLM needing to remember all such IDs internally).

In-context learning prompt leakage. It is quite unclear how to solve this, and folks within the AI community are trying to do just that. Two approaches might work: (a) extend the prompt with specific, descriptive, and actionable instructions that steer the LLM away from repeating what it sees in the prompt; (b) perform some discrete optimization over the specific examples that should be included within the prompt, to minimize prompt leakage. Moreover, the reliability of LLM workflows can also be improved by performing multiple iterative LLM calls—meaning that a model like GPT-4 could look at the prompt and output and tell us whether this is a valid, generalized 3dmol.js command, or whether the original LLM that generated the code is only repeating what it sees in the prompt.

To clarify this, we added the following to the main text:

For instance, fragments from the prompt tend to leak into the output and must be handled with more involved mechanisms, such as retries in which one gives the LLMs access to the error messages, or prompt engineering.
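
The sketch below illustrates the first kind of tool mentioned in this reply: resolving a compound name to a PubChem CID through the PUG REST service, so the agent never has to recall identifiers from its weights. The endpoint is PubChem's documented API; the error-handling choice is illustrative.

    # Sketch: an identifier-resolution tool for an LLM agent.
    from urllib.parse import quote
    import requests

    def pubchem_cid(name: str):
        url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/"
               f"compound/name/{quote(name)}/cids/JSON")
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            return None  # unknown name: fail loudly instead of letting the LLM guess
        return response.json()["IdentifierList"]["CID"][0]

    print(pubchem_cid("aspirin"))  # 2244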
Reviewer Point P 1.14 — 8. In what way is Whinchat like an ELN, and what are the logging/validation capabilities it offers in addition to the LLM agent interface? Are there insights that Datalab provides (e.g., on suitable analytical techniques) that were particularly impressive or unimpressive?

Reply: Datalab is the research data management software that offers the ELN functionality. To clarify, we now write in the main text:

The whinchat team (Joshua D. Bocarsly, Matthew L. Evans, and Ben E. Smith) embedded an LLM chat interface within datalab, an open source materials chemistry data management system, where the virtual LLM-powered assistant can be “attached” to a given sample.

In addition, we now refer to the examples in the SI, which showcase some of the insights:

This is shown in the examples given in SI section 2C, where whinchat was able to provide hints about which NMR-active nuclei can be probed in the given sample.
Reviewer Point P 1.15 — 9. What are the direct applications of the knowledge graph the authors use Insight Graph to extract – is it solely to “launch a literature review”? The authors acknowledge that pairwise connections might not be enough to model materials.

Reply: The revised section now reads:

A further optimized version of this tool might offer a concise and visual means to quickly understand and compare material types and uses across sets of articles and could be used to launch a literature review. An advanced potential application is the creation of structured, materials-specific datasets for fact-based question-answering and downstream machine-learning tasks.
Reviewer Point P 1.16 — 10. TableToJson mentions OCR, but this seems irrelevant as the task starts from text, not images. Otherwise, this section is a clear demonstration of strengths and limitations.

Reply: We edited the section, which now reads:

Although some techniques could help in the process of extracting this information (performing OCR or parsing XML), converting this information into structured data following, for example, a specific JSON schema with models remains a challenge.
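
A minimal sketch of the schema-constrained extraction pattern that TableToJson relies on is given below: the model is prompted with the target JSON schema, and its output is validated before use. call_llm is a hypothetical wrapper around an LLM API, and the schema and table row are illustrative assumptions.

    # Sketch: prompt with a JSON schema, then validate the model's output.
    import json
    from jsonschema import validate, ValidationError

    schema = {
        "type": "object",
        "properties": {
            "material": {"type": "string"},
            "band_gap_eV": {"type": "number"},
        },
        "required": ["material", "band_gap_eV"],
    }

    prompt = (
        "Convert the table row into JSON matching this schema:\n"
        f"{json.dumps(schema)}\n\nRow: TiO2 (anatase) | band gap: 3.2 eV"
    )
    raw = call_llm(prompt)  # hypothetical LLM call

    try:
        record = json.loads(raw)
        validate(instance=record, schema=schema)
    except (json.JSONDecodeError, ValidationError):
        record = None  # malformed output: retry or flag for human review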
Reviewer Point P 1.17 — 11. The example GPT-2-medium output in TitleToAbstract contains grammatically and syntactically incorrect sentences. This seems worth commenting on; is it simply because the authors have used an out-of-date base model?

Reply: As a clarification, we added the following to the SI:

Interestingly, the generated abstract contains grammatically and syntactically incorrect sentences. We suspect that this is due to our use of a small, outdated base model. However, a more systematic analysis will need to be performed in future work.
Reviewer Point P 1.18 — 12. I-Digest mentions that the questions generated for a video might allow students to be guided to relevant timestamps or additional materials. Is this entirely speculative at this stage?

Reply: We revised the section to now read:

The I-Digest (Information-Digestor) hackathon team (Beatriz Mouriño, Elias Moubarak, Joren Van Herck, Sauradeep Majumdar, Xiaoqi Zhang) created a path toward such a new educational opportunity by providing students with a digital tutor based on course material such as lecture recordings. In the future, these questions might be shown to students before a video starts, allowing them to skip parts they already know, or after the video, guiding students to the relevant timestamps or additional material in case of an incorrect answer. Importantly, and in contrast to conventional educational materials, this approach can generate a practically infinite number of questions and could, in the future, be continuously improved by student feedback.
Reviewer 2
Reviewer Point P 2.1 — While I can see the importance of this hackathon in highlighting the use of LLMs to the material science and broader chemistry community, this does not fall in the category of a typical scientific publication that can be or needs to be peer reviewed. Generally, when a manuscript is sent for peer review: 1. It has clear working hypotheses that it intends to prove or disprove. 2. Provides methodology for reproduction and verification. 3. Can address revisions suggested by reviewers through further experimentation.

In this case, it does none of the three. If I suggest a certain change or ask the authors to verify a claim, I am not sure that they will be able to address it since the hackathon is over and the participants disbanded. In several cases, the effort is the result - in the sense that the participants set out to prove that something can be done using LLMs and have done so. For example, in the case A.c "molecular discovery by context", the authors surmise that "Visual inspection indicates that the selected molecules indeed bear similarities to known hydrogen carrier molecules.". This is a valid result for a two day hackathon but is not a sufficient result for a peer reviewed scientific publication. There is no feedback that I can give here that is actionable because there is no concrete result. This is the case for the majority of the results presented here.

I do agree that this article should be published and Digital Discovery is an apt venue for the same. However, this does not fall into the category of a typical research article but must be published as a perspective, letter to the editor or some other category that clearly mentions that it is not peer reviewed. These articles acquire citations, so it should still be useful and indexable for the community.

Reply: We changed the article type to “Perspective”.
Reviewer 3
Reviewer Point P 3.1 — This paper summarizes the 14 projects in the LLMs for chemistry hackathon. These projects cover a wide range of chemistry and materials tasks, such as extracting knowledge, developing new educational applications, etc.

Strengths: 1. Despite the revolutionary development and application of LLMs in various fields, there is limited research on LLMs in chemistry. This paper provides extensive ideas demonstrating the utilization of LLMs in materials science and chemistry, which gives valuable insights to researchers in this field. 2. The hackathon is well organized and researchers from diverse research fields contributed to it, which increases the impact of this paper.
Reviewer Point P 3.2 — Weakness: 1. The 14 projects in this paper are promising, and it would be better to add more insightful conclusions and prospects in the Conclusion section.

Reply: We revised the conclusion section. For example, we now write:

Overall, a common use case has been to use LLMs to deal with “fuzziness” in programming and tool development. We can already see tools like Copilot and ChatGPT being used to convert “fuzzy abstractions” or hard-to-define tasks into code. These advancements may soon allow everyone to write small apps or customize them to their needs (end-user programming). Additionally, we can observe an interesting trend in tool development: most of the logic in the showcased tools is written in English, not in Python or another programming language. The resulting code is shorter, easier to understand, and has fewer dependencies because language models are adept at handling fuzziness that is difficult to address with conventional code. This suggests that we may not need more formats or standards for interoperability; instead, we can simply describe existing solutions in natural language to make them interoperable. Exploring this avenue further is exciting, but it is equally important to recognize the limitations of language models, as they currently have limited interpretability and lack robustness.

We also rewrote the subsequent paragraphs:

It is interesting to note that none of the projects relied on the knowledge or understanding of chemistry by LLMs. Instead, they relied on general reasoning abilities and provided chemistry information through the context or fine-tuning. However, this also brings new and unique challenges. All projects used the models provided by OpenAI’s API. While these models are powerful, we cannot examine how they were built or have any guarantee of continued reliable access to them.

Although there are open-source language models and techniques available, they are generally more difficult to use compared to simply using OpenAI’s API. Furthermore, the performance of language models can be fragile, especially for zero- or few-shot applications.
Reviewer Point P 3.3 — Anyway, this paper provides many practical applications of LLMs in chemistry and I recommend accepting it.




Round 2

Revised manuscript submitted on 17 Jul 2023
 

04-Aug-2023

Dear Dr Jablonka:

Manuscript ID: DD-ART-06-2023-000113.R1
TITLE: 14 Examples of How LLMs Can Transform Materials Science and Chemistry: A Reflection on a Large Language Model Hackathon

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, and discussion with the editorial office, I will be pleased to accept your manuscript for publication after revisions.

To address the reviewers' concerns as to whether this is appropriate as a Perspective, I ask that you include a disclaimer regarding the assessment of the examples, similar to the following:

"While different chemistry challenges were attempted during this hackathon, the results were preliminary. Digital Discovery did not peer review the soundness of each study, rather the peer review for this Perspective was in order to demonstrate the feasibility of Large Language Models in framing and attempting similar grand challenges."

If you would like to discuss the phrasing of this, or you have any concerns, please let us know.

Please revise your manuscript to fully address this and the remaining reviewers’ comments and add the above. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link :

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Alexander Whiteside MRSC
Assistant Editor, Digital Discovery
Royal Society of Chemistry
T: +44 (0) 1223 432117 | www.rsc.org

Sign up to journal issue alerts and news here - rsc.li/alerts

************


 
Reviewer 3

This paper summarizes 14 materials science and chemistry examples from the LLMs-for-chemistry hackathon. Based on all previous reviews and the authors’ revision, the review and suggestions are as follows:

Strengths:
1. The motivation of this work is clear and the novelty of this work is perfect because the emergence of LLMs may revolutionize the materials science and chemistry field.
2. The categories of projects grouped by authors (Table I) are clear and show a broader vision in the combination of LLMs and chemistry.

Weakness and Suggestions:
1. In the Accurate Molecular Energy Predictions and Molecule Discovery by Context Genetic tasks of the Predictive modeling part, the authors are encouraged to provide more details about the prompting methods (how the prompts are designed, how the ICL examples are found). Citations to some previous work [1][2] are lacking, and it would be better to discuss the difference between this work and previous work.
[1] Taicheng Guo, et al. What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks. https://arxiv.org/abs/2305.18365
[2] Jiatong Li, et al. Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective. https://arxiv.org/abs/2306.06615

2. In the Extracting Structured Data from Free-form Organic Synthesis Text task of the Knowledge extraction part, citations to some previous work [3] are lacking, and the authors are also encouraged to discuss this.
[3] Jiang Guo, et al. Automated Chemical Reaction Extraction from Scientific Literature. https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.1c00284

3. It would be better to clarify why each task/example is important for practical use in the materials science or chemistry research field, e.g., the Text-template paraphrasing task and the InsightGraph task.

4. The topic and some methods of this paper are timely. As LLM technologies develop and change rapidly, the authors are encouraged to discuss the potential long-term impact of this work from a future perspective.

To summarize, this work describes a well-organized LLM-for-chemistry hackathon and expands the research field at the intersection of LLMs and chemistry. I recommend that this article be published as a Perspective in Digital Discovery.

Reviewer 2

I appreciate the authors taking the time to review the manuscript to address the comments of other authors. I find no significant structural or scientific changes in the manuscript to recommend acceptance. This is not a peer reviewable scientific manuscript, and I suggest that either the editor publish this as an invited perspective or else it be published elsewhere.

Reviewer 1

I continue to have concerns about the significance of this work. The authors have made minor changes to the text but have left out some of the more nuanced responses in the letter to reviewers. Almost every result is described as “promising”, and benefits are generally “hypothetical”, “in the future”, or limited to the convenience of natural language interfaces.

I shared the concerns of Reviewer 2 that this work is not really amenable to peer review. While the article type has been changed to a Perspective, the examples in the work are still presented as research results. If accepted as a Perspective, I would strongly encourage the journal to add a disclaimer that the scientific results have not been fully peer-reviewed because these should not be considered to meet the quality of Digital Discovery.

Comments:
1. As the now-stated goal of the work is to provide ideas and examples for the use of LLMs, is the set of examples described a comprehensive overview of goals that the authors feel are worthwhile? Are there other tasks and opportunities that were not pursued in the hackathon but are worth consideration by readers? This is something I would expect a Perspective article to discuss.
2. The inclusion of “one sentence summaries” does help clarify what the authors see as the contributions of each example. In some ways, these are the most substantial part of the Perspective, but they are buried in the SI. They make sense for the most part, but many are uninformative and highlight that these projects were conducted without real evaluation of impact.

Some specific comments on the examples:
1. The LLM-based agent using ReAct is an example of an excellent summary that defines the problem in chemistry/materials science terms, provides sufficient standalone context for the approach, and clearly outlines the need for more rigorous investigation into a specific aspect of performance. The 3dmol.js description and BO description are also great.
2. The ICL project summarizes the results as indicating that predictive models can be built without any training, but the method write-up discussed randomly sampling training sets of 10 formulations. Even if the way training data is used is ICL and not SGD, it is somewhat misleading to say that a predictive model is built without training.
3. The hydrogen carrier project’s summarized Results and Impact only states that the approach can recommend molecules with a success rate better than random, but this does not actually summarize the evaluation nor the impact. Much like the first example that had substantially worse performance than existing approaches, perhaps results/impact here should be clear that there was no comparison to existing generative modeling approaches, so the utility of the model is not clear. There is a typo: “showing aggregating”.
4. The text-templates problem/task definition does not define a chemistry/materials science task. The problem seems to be that there are many string-based representations of molecules that we would like LLMs to recognize as equivalent. Similarly, “Increasing the efficiency of GAs” is not a chemistry/materials task in and of itself – the task is iterative molecular optimization.
5. Unless the examples are meant to be peer-reviewed in earnest, I will forgo additional comments.


 



Reviewer Point P 1.1 — This paper summarizes 14 materials science and chemistry examples from the LLMs-for-chemistry hackathon. Based on all previous reviews and the authors’ revision, the review and suggestions are as follows:

Strengths: 1. The motivation of this work is clear and the novelty of this work is perfect because the emergence of LLMs may revolutionize the materials science and chemistry field. 2. The categories of projects grouped by authors (Table I) are clear and show a broader vision in the combination of LLMs and chemistry.

Weakness and Suggestions:
Reviewer Point P 1.2 — 1. In the Accurate Molecular Energy Predictions and Molecule Discovery by Context Genetic tasks of the Predictive modeling part, the authors are encouraged to provide more details about the prompting methods (how the prompts are designed, how the ICL examples are found). Citations to some previous work [1][2] are lacking, and it would be better to discuss the difference between this work and previous work. [1] Taicheng Guo, et al. What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks. https://arxiv.org/abs/2305.18365 [2] Jiatong Li, et al. Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective. https://arxiv.org/abs/2306.06615

Reply: We added citations to those works and added brief discussions:

Note that in this case, molecules are not generated de novo (as, for example, in Li et al.1) but retrieved from existing databases.

These few-shot learning abilities have also been benchmarked by Guo et al.2
Reviewer Point P 1.3 — 2. In the Extracting Structured Data from Free-form Organic Synthesis Text task of the Knowledge extraction part, citations to some previous work [3] are lacking, and the authors are also encouraged to discuss this. [3] Jiang Guo, et al. Automated Chemical Reaction Extraction from Scientific Literature. https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.1c00284

Reply: We added this reference and a brief discussion:

In contrast to previous approaches, such as the one of Guo et al.,3 the use of LLMs does not require a specialized modeling setup but can be carried out with relatively little expertise.
Reviewer Point P 1.4 — 3. It would be better to clarify why each task/example is important for practical use in the materials science or chemistry research field, e.g., the Text-template paraphrasing task and the InsightGraph task.

Reply: In addition to the one-sentence summaries, we added further context:

The latter might be particularly important if one still wants to retain the general language abilities of the LLM after finetuning on chemistry or materials science data.

A further optimized version of this tool might offer a concise and visual means to quickly understand and compare material types and uses across sets of articles—a task that currently is very laborious.
Reviewer Point P 1.5 — 4. The topic and some methods of this paper are timely. As LLM technologies develop and change rapidly, the authors are encouraged to discuss the potential long-term impact of this work from a future perspective.

Reply: We edited the conclusion section and added:

This work showcased some potential applications of LLMs that will benefit from further investigation.
Reviewer Point P 1.6 — To summarize, this work describes a well-organized LLM-for-chemistry hackathon and expands the research field at the intersection of LLMs and chemistry. I recommend that this article be published as a Perspective in Digital Discovery.

Comments to the Author: I appreciate the authors taking the time to review the manuscript to address the comments of other authors. I find no significant structural or scientific changes in the manuscript to recommend acceptance. This is not a peer reviewable scientific manuscript, and I suggest that either the editor publish this as an invited perspective or else it be published elsewhere.

Reviewer Point P 3.1 — I continue to have concerns about the significance of this work. The authors have made minor changes to the text but have left out some of the more nuanced responses in the letter to reviewers. Almost every result is described as “promising”, and benefits are generally “hypothetical”, “in the future”, or limited to the convenience of natural language interfaces. I shared the concerns of Reviewer 2 that this work is not really amenable to peer review. While the article type has been changed to a Perspective, the examples in the work are still presented as research results. If accepted as a Perspective, I would strongly encourage the journal to add a disclaimer that the scientific results have not been fully peer-reviewed because these should not be considered to meet the quality of Digital Discovery.

Reply: We added the disclaimer:

While different challenges were explored during this hackathon, the results were preliminary. Digital Discovery did not peer review the soundness of each study. Instead, the peer review for this Perspective was to scope the potential of Large Language Models in chemistry and materials science.
Reviewer Point P 3.2 — 1. As the now-stated goal of the work is to provide ideas and examples for the use of LLMs, is the set of examples described a comprehensive overview of goals that the authors feel are worthwhile? Are there other tasks and opportunities that were not pursued in the hackathon but are worth consideration by readers? This is something I would expect a Perspective article to discuss.

Reply: As stated, the overview is not comprehensive, and we cannot know all possible applications.

The diversity of the prototypes presented in this work shows that the potential applications are almost unlimited, and we can probably only see the tip of the iceberg—for instance, we didn’t even touch modalities other than text thus far.

However, we added one more potentially interesting avenue:

In addition, we also want to note that the projects in the workshop mostly explored the use of LLMs as tools or oracles but not as muses.4 From techniques such as rubber duck debugging (describing the problem to a rubber duck),5 we know that even simple—non-intelligent—articulation or feedback mechanisms can help overcome roadblocks and create creative breakthroughs. Instead of explaining a problem to an inanimate rubber duck, we could instead have a conversation with an LLM, which could probe our thinking with questions or aid in brainstorming by generating diverse new ideas. Therefore, one should expect an LLM to be as good as a rubber duck—if not drastically more effective.
Reviewer Point P 3.3 — The inclusion of “one sentence summaries” does help clarify what the authors see as the contributions of each example. In some ways, these are the most substantial part of the Perspective, but they are buried in the SI. They make sense for the most part, but many are uninformative and highlight that these projects were conducted without real evaluation of impact.

Reply: Indeed, some projects were carried out without evaluation of impact. To highlight the exploratory nature of our work, we write:

The projects were typically carried out in an exploratory way and without any evaluation of impact.
Some specific comments on the examples:
Reviewer Point P 3.4 — 1. The LLM-based agent using ReAct is an example of an excellent summary that defines the problem in chemistry/materials science terms, provides sufficient standalone context for the approach, and clearly outlines the need for more rigorous investigation into a specific aspect of performance. The 3dmol.js description and BO description are also great.
Reviewer Point P 3.5 — 2. The ICL project summarizes the results as indicating that predictive models can be built without any training, but the method write-up discussed randomly sampling training sets of 10 formulations. Even if the way training data is used is ICL and not SGD, it is somewhat misleading to say that a predictive model is built without training.

Reply: If one uses ICL, there is no effective training, i.e., no weight updates. To clarify this, we write in the SI:

Predictive models can be built without any training (i.e., update of weights)
Reviewer Point P 3.6 — The hydrogen carrier project’s summarized Results and Impact only states that the approach can recommend molecules with a success rate better than random, but this does not actually summarize the evaluation nor the impact. Much like the first example that had substantially worse performance than existing approaches, perhaps results/impact here should be clear that there was no comparison to existing generative modeling approaches, so the utility of the model is not clear. There is a typo: “showing aggregating”.

Reply: We fixed the typo and extended the caveat statement, which now reads:

Since no direct comparisons to other approaches have been performed, benchmarks against conventional generative modeling are needed.
Reviewer Point P 3.7 — 4. The text-templates problem/task definition does not define a chemistry/materials science task. The problem seems to be that there are many string-based representations of molecules that we would like LLMs to recognize as equivalent. Similarly, “Increasing the efficiency of GAs” is not a chemistry/materials task in and of itself – the task is iterative molecular optimization.

Reply: To clarify the relevance of the text-template paraphrasing, we now write:

This approach will allow us to automatically create new paraphrased, high-quality prompts for LIFT-based training very efficiently—to augment the dataset and reduce the risk of overfitting to a specific template. The latter might be particularly important if one still wants to retain the general language abilities of the LLM after finetuning on chemistry or materials science data.

We additionally now use the reviewer’s phrasing for the one-sentence summary of the “GA without genes” project:

increasing the efficiency of iterative molecular optimization
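
To illustrate why template paraphrasing matters for LIFT-style training data, the sketch below turns one labeled molecule into several differently worded prompts. The template wordings and the "###"/"@@@" delimiters are illustrative assumptions; in the project, the paraphrases themselves were generated by an LLM.

    # Sketch: one labeled example becomes several paraphrased LIFT prompts.
    TEMPLATES = [
        "What is the {prop} of {smiles}?###{value}@@@",
        "Tell me the {prop} of the molecule with SMILES {smiles}.###{value}@@@",
        "For the molecule {smiles}, the {prop} is###{value}@@@",
    ]

    def lift_examples(smiles: str, prop: str, value: float) -> list:
        # Every template yields one training string for the same label.
        return [t.format(prop=prop, smiles=smiles, value=value) for t in TEMPLATES]

    for line in lift_examples("CCO", "aqueous solubility (logS)", -0.77):
        print(line)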
Reviewer Point P 3.8 — 5. Unless the examples are meant to be peer-reviewed in earnest, I will forgo additional comments.




Round 3

Revised manuscript submitted on 06 Aug 2023
 

08-Aug-2023

Dear Dr Jablonka:

Manuscript ID: DD-ART-06-2023-000113.R2
TITLE: 14 Examples of How LLMs Can Transform Materials Science and Chemistry: A Reflection on a Large Language Model Hackathon

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry


******
******

Please contact the journal at digitaldiscovery@rsc.org

************************************





Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.