From the journal Digital Discovery Peer review history

Harnessing GPT-3.5 for text parsing in solid-state synthesis – case study of ternary chalcogenides

Round 1

Manuscript submitted on 08 Oct 2023
 

24-Oct-2023

Dear Dr Hippalgaonkar:

Manuscript ID: DD-ART-10-2023-000202
TITLE: Harnessing GPT-3.5 for Text Parsing in Solid-State Synthesis – case study of ternary chalcogenides

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************


 
Reviewer 1

The manuscript offers an intriguing method for leveraging GPT-3.5 in text parsing for solid-state synthesis, especially concerning ternary chalcogenides. The goal is to introduce a more efficient, automated technique for extracting synthesis recipes from the extant scientific literature. While I found the manuscript engaging and innovative in addressing a significant challenge in the materials science field, several areas necessitate comprehensive revision for the work to reach its full potential.

Major Concerns:

1. The GitHub page requires better organization to ensure reproducibility for readers. The README file is overly brief, lacking descriptions for each file, the environment, and so on. The LLM_data_extraction_template.ipynb file exhibits a timeout error that needs resolution. The visualization.ipynb file should have more detailed comments and perhaps be sectioned for clarity; without this, it's difficult to grasp its content. Overall, the author must scrutinize each file thoroughly to ensure both reproducibility and readability.

2. Many recent works employ GPT models for text mining. These have been published in both arXiv and other peer-reviewed journals, particularly this year. The author cites a few of them (ref 16-19), which is commendable. However, some are somewhat preliminary, serving as proofs of concept. I believe more works, even beyond the realm of solid-state, should be cited and discussed. Highlighting GPT-3.5's prowess in text mining is crucial.

3. What is the exact number of papers investigated and the data points mined? The manner in which these numbers are reported is convoluted (e.g., "publication set of 21 papers," "from over 100 PDF documents," "173 research articles"). This caused confusion during my reading, and others might similarly struggle. Additionally, upon checking the dataset on GitHub, I didn't find 173 lines in the posted datasets. I suggest integrating this data into a figure for clarity, akin to https://arxiv.org/abs/2109.08098 by Heather J. Kulik.

4. The statement, "we refined a set of prompts for GPT-3.5 to extract the same information," necessitates more detailed elaboration. Both the specifics and the strategy behind this refinement should be explored. Merely listing observations and questions doesn't suffice to ascertain if efficiency genuinely improved. Numerous recent papers (again, including arXiv papers and those in peer-reviewed journals) delve into prompt engineering strategy, with many pertaining to chemistry and materials science. A more comprehensive review is encouraged and more relevant work should be cited. In addition, an ablation study or an in-depth discussion on prompt refinement would bolster the paper's rigor.

5. I couldn't find any mention of the ground truth. If it exists, how was it obtained? How is accuracy determined? What about precision, recall, and the F1 score? The study also misses a comprehensive error analysis. Recognizing the model's errors can yield insights into its advantages and constraints.

6. The remark, "GPT-3.5 responds with a 'hallucinated' response with three sequential melt times ranging from low to high temperatures," requires a thorough exploration of "hallucination" in this context. Its definition, significance, and the importance of its minimization in text mining are crucial for readers.

Minor Concerns:

7. It should be "GPT-3.5" instead of "GPT3.5."

8. The statement, "We first undertook a comprehensive manual download of all papers published between 2000 and 2023 that discussed solid-state synthesis recipes of CuInTe2" left me puzzled. Why focus on this specific compound's synthesis recipes?

9. The terms "gold standard" and "silver standard" are ambiguous. The author should elucidate their naming and process either in the main text or figures. In essence, the gold and silver standards, as well as the Extended Chemical Space (ExChSp) datasets, are briefly touched upon but need detailed descriptions.

10. The text in Figure 3 is too small to discern.

11. There are sporadic grammatical errors and awkward phrasings that obstruct readability. Some examples include "By leveraging on this workflow," "... visualization from text-based literatures," "Our data analysis also suggest," and "We observe the inherent anthropogenic bias endemic in published papers – only positive results are mostly reported."

12. How long does manual paper selection and download take? Are there plans for efficiency improvements?

13. What are the costs associated with the OpenAI API in this study?

14. Out of curiosity, why not GPT-4? Its API is also accessible. How do the two versions compare?

15. The claim, "...AgSbTe2 (chemically similar, but not seen by GPT-3.5)" raises questions. What evidence backs this? How can one ascertain if GPT-3.5 has encountered specific content?

Reviewer 2

Thway et al. report an analysis of the use of GPT-3.5 for data extraction of synthesis conditions for ternary chalcogenides from papers. For this, they manually curate a small dataset and develop a prompting strategy for GPT-3.5. The paper reports some metrics, but technical details are often missing.

# Technical comments

- *How are the accuracy scores computed?* The accuracy is perhaps one of the most important quantitative results of this work. However, it is not clear to me what those numbers mean. Neither the text nor the code clarifies how those numbers are computed:

- The plot in Fig. 2 contains scores for "literature unspecified" values. Those scores tend to be higher. However, if the literature is the ground truth, how do you compute an accuracy if this value is not specified in the literature?

- Average accuracy depends on how you average: Do you first average per "category" and then over samples, or vice versa? Or do you do something else?

- Why not report some other, perhaps more interpretable metrics, such as Hamming loss or the number of entries that are completely correctly extracted?

- *Where do those errors come from?* In a similar vein, it might be interesting for readers to have a better overview of where the errors come from: Why not create a table in which you summarize how often the model fails to find the relevant section, how often it is a syntactical error and how often semantical?

- *What is the precise workflow?* While one can extract some details from the code repository, this is not what the reader needs to do to follow the article. A few questions remained unanswered to me:

- How have the chunks been chosen?

- How have the prompts been optimized (while avoiding data leakage)?

- What temperature has been chosen for sampling (GitHub suggests 0, but is this really the case?)

- *What is the impact of the "Is the given data following this format? If not re-format." prompt:* This is a very interesting prompt, and I would be very interested in an ablation study. I suppose this prompt was not there in the first iterations; what kind of problem did its introduction solve?

- Figure 2c) Since the dataset is not immense, it might be insightful to simply show the data (as swarmplots).

- How are the hyperparameters chosen for the models built for the feature-importance analysis?

- How is the feature importance exactly computed?

- If you do leave-one-out cross-validation, you end up with $n$ models, where $n$ is the length of the dataset. Does Figure 3 show the analysis for one model (which) or an average (how averaged?) model?

- If you perform SHAP-based feature importance analysis you need to compute the SHAP values for a set of datapoints. For which datapoints did you compute the values? For the training set? For the test set? For the entire dataset?

- Why did you train a different kind of model for SHAP analysis? ("As an alternative means of analysis, an XGBoost classifier was implemented to derive SHAP values of the features")

- “Therefore, we ask the question – can LLMs be used to parse the literature, but also produce a machine learning readable dataset?” —— Technically, this broad question has been addressed (in work coming out of Ceder’s group as well as with some examples at the LLM Hackathon — both are cited in the paper; https://arxiv.org/pdf/2212.05238.pdf might deserve a citation, too). Perhaps it might make sense to rephrase the research question to be more specific.

# Reproducibility

- I encourage the authors to

- add license information to their repository

- add the code/scripts for the actual experiments

- make the code citable by archiving it, for example, on Zenodo

- the link in the paper should point to the repository and not to the organization

Reviewer 3

Summary
The authors have gleaned synthesis recipes for chalcogenide-based thermoelectric materials by GPT-3.5, fine-tuned through optimized GPT-based prompts.
They have generated a database, achieving an overall accuracy of approximately 73%. This database is subsequently utilized to infer synthesis conditions for ternary chalcogenides. A classifier model is constructed based on this database, achieving an accuracy of approximately 60% in predicting phase purity. This study offers an approach for information extraction by integrating Large Language Models (LLMs) into the realm of materials science research.

Points of view
Methodologically, this work is limited in its scope to chalcogenide-based thermoelectric materials and to GPT-3.5 rather than the more capable GPT-4 for higher accuracy. It does not present any prompting strategy for the fine-tuning process and only shows the final prompts. But it demonstrates the efficiency of automated data extraction and underscores the broader applicability of LLMs in materials science research, enabling users to conduct text mining and corpus curation without developing specific NLP algorithms. However, the overall accuracy of approximately 73% is not high enough, resulting in the classification model not performing very well (60% accuracy). The extraction and machine learning accuracy should be improved. In the reviewer's opinion, this work needs improvement and is not sufficient for publication in its current state.

Issues with methodology
1) What do LITERATURE SPECIFIED and LITERATURE UNSPECIFIED in Figure 2(a) mean? Do they correspond to the GPT-prompted datasets of CuInTe/Se and ExChSp?
2) What is the meaning of TEMPERATURE on the vertical axis in Figure 2(b)? Is it the maximum temperature of the heating curve?
3) Why are there fewer points in Figure 2(c) for PURE than for NOT PURE?
4) The authors attribute the difference between the significance of the features obtained using the gold standard dataset and the ExChSp dataset to the different formats of the scientific papers, but it could also be due to the lower extraction precision. This can be further demonstrated by switching to a more balanced corpus for extraction.


 

Dear Editor and reviewers, thank you for the wonderful suggestions. We've incorporated changes and provide a point-by-point response attached as a separate file.

Thanks,
Kedar

This text has been copied from the Microsoft Word response to reviewers and does not include any figures, images or special characters:

REVIEWER REPORT(S):
Referee: 1

Comments to the Author
The manuscript offers an intriguing method for leveraging GPT-3.5 in text parsing for solid-state synthesis, especially concerning ternary chalcogenides. The goal is to introduce a more efficient, automated technique for extracting synthesis recipes from the extant scientific literature. While I found the manuscript engaging and innovative in addressing a significant challenge in the materials science field, several areas necessitate comprehensive revision for the work to reach its full potential.

We thank the reviewer for finding our work innovative and for their constructive suggestions. GPT enables researchers to efficiently parse solid-state synthesis recipes without the need for manual extraction or significant LLM expertise. We further address the reviewer's concerns in a point-by-point response below.

Major Concerns:

1. The GitHub page requires better organization to ensure reproducibility for readers. The README file is overly brief, lacking descriptions for each file, the environment, and so on. The LLM_data_extraction_template.ipynb file exhibits a timeout error that needs resolution. The visualization.ipynb file should have more detailed comments and perhaps be sectioned for clarity; without this, it's difficult to grasp its content. Overall, the author must scrutinize each file thoroughly to ensure both reproducibility and readability.

We thank the reviewer for their helpful advice. We have improved the GitHub page (https://github.com/Kedar-Materials-by-Design-Lab/Harnessing-GPT-3.5-for-Text-Parsing-in-Solid-State-Synthesis-case-study-of-ternary-chalchogenides) by including a separate directory for the ExChSp results and explaining the timeout error. The visualization.ipynb file has also been updated with annotations. If there is any further modification the reviewer deems necessary, we are happy to accommodate it.

We also thank the reviewer for the detailed probing of the timeout error in the original notebook. The timeout error is returned by the OpenAI package when a request is not answered in time and is a known problem and inconvenience for prompting. Other than necessitating a rerun of the code, it does not have any adverse effect on the parsing. We chose to keep the error messages to illustrate an existing issue that a user might come across while attempting to reproduce our methods using published OpenAI packages.

2. Many recent works employ GPT models for text mining. These have been published in both arXiv and other peer-reviewed journals, particularly this year. The author cites a few of them (ref 16-19), which is commendable. However, some are somewhat preliminary, serving as proofs of concept. I believe more works, even beyond the realm of solid-state, should be cited and discussed. Highlighting GPT-3.5's prowess in text mining is crucial.

We thank the reviewer for their commendation. We are not certain of the particular pieces of work that the reviewer has in mind but we have added the following citations on the use of GPT-3.5 for scientific text mining beyond solid-state synthesis:
Changes made in introduction:
Large Language Models (LLMs) have recently emerged as an alternative tool to extract knowledge from scientific literature, enabling contextualization and summarization of information efficiently and robustly. This has been demonstrated across different materials science fields: chemistry [12,16–21], polymers [22], general materials [23–27], optical materials [28], crystal structures [29], and even other fields such as medicine [30–32]. These models can be used to identify and categorize key information such as research conclusions, methods, and trends, making it easier for researchers to access relevant insights rapidly. By contextualizing and summarizing information, these models provide an alternative route to potentially facilitate the efficient extraction of knowledge from existing literature data. We then ask the question – specific to fields with sparse literature and strong reporting bias, can LLMs be used not only to parse the synthesis information, but also to produce a machine-learning-readable dataset?

Such an approach bypasses the need for specialized NLP tools, offering a streamlined method for text parsing that is more accessible to the scientific community. To further illustrate the applicability of GPT parsing, we focus on ternary chalcogenide-based materials because they are the state-of-the-art thermoelectric materials at intermediate temperatures [33], where the availability of synthesis literature is relatively smaller in size compared to the examples cited previously, meaning that the ability to tune the LLM is also limited. We consider a similar prompt engineering strategy reported by Zheng et al [18] to refine this workflow, which we describe further below.

3. What is the exact number of papers investigated and the data points mined? The manner in which these numbers are reported is convoluted (e.g., "publication set of 21 papers," "from over 100 PDF documents," "173 research articles"). This caused confusion during my reading, and others might similarly struggle. Additionally, upon checking the dataset on GitHub, I didn't find 173 lines in the posted datasets. I suggest integrating this data into a figure for clarity, akin to https://arxiv.org/abs/2109.08098 by Heather J. Kulik.

We thank the reviewer for pointing out the lack of clarity in the writing, which is crucial for the readability of our paper. To clarify, 168 research papers in total were parsed for the ExChSp dataset, of which relevant information was extracted from only 61 papers based on our optimized prompts. The dataset on GitHub reflects the final list of these 61 papers.

Changes made:
Additionally, we consider a secondary and larger dataset of solid-state synthesis, extended to ABX2 and Tl-based chalcogenide systems. Similarly, we performed a comprehensive manual download of English-language research papers published between 2000 and 2023 that discussed solid-state synthesis recipes of ABX2 compounds including AgInTe/Se2, CuGaTe/Se2, TlSbTe2, TlGdTe2, TlBiTe2, and KGdTe2, excluding methods such as solution-based synthesis (too many precursors and, generally speaking, lower phase purity) or the Bridgman method, which is for single-crystal growth. The same set of 21 CuInTe/Se synthesis papers was used to construct the Gold Standard and subsequently to parse the Silver Standard. Additionally, a total of 168 papers on other ternary chalcogenide compounds were compiled, but only 61 were successfully parsed by GPT-3.5; the rest failed the first prompt (they did not contain synthesis information, or PyPDF failed to format them). Table 1 below provides a list of the datasets applied in this work.

Table 1: Names and description of each dataset.


4. The statement, "we refined a set of prompts for GPT-3.5 to extract the same information," necessitates more detailed elaboration. Both the specifics and the strategy behind this refinement should be explored. Merely listing observations and questions doesn't suffice to ascertain if efficiency genuinely improved. Numerous recent papers (again, including arXiv papers and those in peer-reviewed journals) delve into prompt engineering strategy, with many pertaining to chemistry and materials science. A more comprehensive review is encouraged and more relevant work should be cited. In addition, an ablation study or an in-depth discussion on prompt refinement would bolster the paper's rigor.

We thank the reviewer for pointing out the need to clarify the refinement strategy. We have demonstrated the refinement strategy by attaching the notebook prompt_engineering_progress.ipynb in the GitHub repository, and also included a more thorough discussion of our prompt engineering strategy in the manuscript. As for a comprehensive review with relevant work cited, we have addressed that in comment 2 above with more substantial citations across different fields. However, an ablation study is beyond the scope of our current work.

Changes made:
Following this, we refined a set of prompts for GPT-3.5 to extract the same information, instructing the model to logically infer information when it is not explicitly provided, and giving examples from the Gold Standard. The prompt set was optimized iteratively based on the following principles:
1. All questions put together in a single prompt, without any standard formatting and based on human intuition.
2. All questions put together in a single prompt, with standard formatting.
3. Questions broken up into a sequence of prompts, without standard formatting.
4. Questions broken up into a sequence of prompts, with standard formatting.

We noticed that the LLM has a hard time reasoning about or extracting information from a paragraph when doing so requires human intuition. The answers become more consistent when we provide appropriate examples in the prompt. However, when we extract too much information in one go, the LLM sometimes misses certain information and at other times 'misbehaves' with unexpected output, not adhering to the formatting instructions in our prompt set. Overall, we found that sequentially extracting information one item at a time with standardized answers gives the best results. The iterative process is reported in prompt_engineering_progress.ipynb in the GitHub repository.
5. I couldn't find any mention of the ground truth. If it exists, how was it obtained? How is accuracy determined? What about precision, recall, and the F1 score? The study also misses a comprehensive error analysis. Recognizing the model's errors can yield insights into its advantages and constraints.

We thank the reviewer for pointing this out; it is indeed important to clarify the ground truth for synthesis recipes, as many details are not reported and must be inferred. Since this is not a classification problem with false positives and negatives, but rather a measure of whether each specific detail is correct or wrong, we can only report the fraction of correct labels (which is essentially the complement of the Hamming loss, which reports the fraction of wrong labels). In this case, wrong labels can be either 'NA', or cases where GPT-3.5 returns a completely wrong detail, for example indicating water-cooling for the quench type when the text did not specify this and a human would reasonably infer air-cooling instead.

Changes made in methodology:
We propose evaluating the accuracy of text parsing by reporting the fraction of correct labels, and the overall error (inaccurate parsing) rate. In total, there are four possible situations:

Table 2: Accuracy metric for specified and unspecified information


And figure 2:

Figure 2. Details on the Gold and Silver Standard. A) Accuracy of the GPT-3.5-extracted Silver Standard compared against the manually obtained Gold Standard, considering the accuracy of both specified (dark blue) and unspecified details (light blue), as well as the overall percentage of wrong details (orange). B) Heating curves reported for the Gold Standard dataset. C) Box charts for heating information in the Gold Standard with respect to phase purity (1 refers to pure, 0 to not pure).

6. The remark, "GPT-3.5 responds with a 'hallucinated' response with three sequential melt times ranging from low to high temperatures," requires a thorough exploration of "hallucination" in this context. Its definition, significance, and the importance of its minimization in text mining are crucial for readers.

We thank the reviewer for the comment. In our work, we further explored the possibility of using GPT-3.5 to extrapolate the synthesis recipe for AgSbTe2 by providing only the format of the synthesis (base compounds, primary, secondary, annealing, etc.). The predicted recipe is incorrect since it suggests an annealing temperature higher than the secondary melting temperature. This shows that GPT-3.5, without prior knowledge of synthesis data, cannot predict a reasonable synthesis recipe; a dataset is therefore required. We recognize the gravity of using the term "hallucination" in the context of LLMs and take the reviewer's point seriously. In this case, since we are explicitly asking GPT-3.5 to extrapolate beyond existing knowledge, strictly speaking we should not call this a hallucination. Hence, we have removed this word and rephrased it simply as: "inaccurate based on domain expertise".



Changes made:
According to Table 2, the synthesis temperature and time for AgInTe2 are reasonable, with a ~1200 K melting stage and a ~700 K annealing temperature, and no primary melting. One interesting observation is that the interpolation is able to suggest quenching as the cooling step, which is related to the phase precipitation tendency during the synthesis [39]. When extrapolating to predict a synthesis recipe for AgSbTe2, the GPT-3.5 model was not able to yield a proper synthesis recipe, merely guessing that the recipe for AgSbTe2 is similar to that of AgInTe2 or AgInSe2. This result is obtained only because the ExChSp dataset was provided to the GPT-3.5 API as an input – otherwise, GPT-3.5 responds with three sequential melt times going from low to high temperatures, which we know is inaccurate based on domain expertise. The suggested sequentially increasing temperature stages differ from domain-expert recipes, where a secondary high-temperature melting stage occurs before a mid-temperature annealing stage to allow for melt crystallization followed by phase homogenization.

We posit that GPT-3.5, which is trained on a corpus of mainly non-scientific text, contains incomplete information on solid-state synthesis recipes. It is possible that the Gold Standard used to prompt GPT-3.5 contradicts its prior knowledge; this is compounded by the fact that we generated the responses with a temperature of zero (i.e., no creativity, since we deemed this a non-creative writing task).

Minor Concerns:

7. It should be "GPT-3.5" instead of "GPT3.5."

All instances have been corrected.

8. The statement, "We first undertook a comprehensive manual download of all papers published between 2000 and 2023 that discussed solid-state synthesis recipes of CuInTe2" left me puzzled. Why focus on this specific compound's synthesis recipes?

We thank the reviewer for their comment. We limited ourselves to a particular class of materials because solid-state synthesis is specific to each class. To demonstrate the applicability of GPT parsing, we focused on ternary chalcogenide-based materials because they are state-of-the-art thermoelectrics at intermediate temperatures. If one wanted to apply the prompt set to e.g., oxide materials which have completely different synthesis conditions, a modified recipe format that matches the actual synthesis should be used instead of the current format which has multiple heating stages and a densification stage.

Changes made:
Such an approach bypasses the need for specialized NLP tools, offering a streamlined method for text parsing that is more accessible to the scientific community. To further illustrate the applicability of GPT parsing, we focus on ternary chalcogenide-based materials because they are the state-of-the-art thermoelectric materials at intermediate temperatures [33], where the availability of synthesis literature is relatively smaller in size compared to the examples cited previously, meaning that the ability to tune the LLM is also limited. We consider a similar prompt engineering strategy reported by Zheng et al [18] to refine this workflow, which we describe further below.

9. The terms "gold standard" and "silver standard" are ambiguous. The author should elucidate their naming and process either in the main text or figures. In essence, the gold and silver standards, as well as the Extended Chemical Space (ExChSp) datasets, are briefly touched upon but need detailed descriptions.

We appreciate the reviewer’s effort to improve the readability of the work. We have addressed this together with comment 3 above.

10. The text in Figure 3 is too small to discern.
We thank the reviewer for pointing out this mistake. We have replotted Figure 3 so that the text is discernible.

Changes made:



Figure 3. Decision tree classifier results. The decision tree architecture and accuracy reported for a) Gold Standard dataset b) ExChSp dataset. Feature importance of both decision trees for leave-one-out strategy are reported in c) and d) respectively.

11. There are sporadic grammatical errors and awkward phrasings that obstruct readability. Some examples include "By leveraging on this workflow," "... visualization from text-based literatures," "Our data analysis also suggest," and "We observe the inherent anthropogenic bias endemic in published papers – only positive results are mostly reported."

We thank the reviewer for their suggestion on improving the readability of our work. We have taken the reviewer's comments into consideration and corrected them accordingly. The abstract has been significantly rewritten to better reflect a summary of our work. We do not list every other grammatical fix here, as we have made a concerted effort to improve the clarity of writing across the whole manuscript; several are also addressed in other comments.

Changes made in abstract:
Optimally doped single-phase compounds are necessary to advance state-of-the-art thermoelectric devices, which convert heat into electricity and vice versa, and this requires solid-state synthesis of bulk materials. Data-driven approaches to learning these recipes require careful data curation from a large corpus of text, which may not be available for some materials, as well as a refined language-processing algorithm, which presents a high barrier to entry. We propose applying Large Language Models (LLMs) to parse solid-state synthesis recipes, encapsulating all essential synthesis information intuitively in terms of primary and secondary heating peaks. Using a domain-expert-curated dataset for a specific material (Gold Standard), we engineered a prompt set for GPT-3.5 to replicate the same dataset (Silver Standard), doing so successfully with 73% overall accuracy. We then proceeded to extract and infer synthesis conditions for other ternary chalcogenides with the same prompt set. From a database of 168 research papers, we successfully parsed 61 papers, which we then used to develop a classifier to predict phase purity. Our methodology demonstrates the generalizability of LLMs for text parsing, specifically for materials with sparse literature and unbalanced reporting (since usually only positive results are shown). Our work provides a roadmap for future endeavors seeking to amalgamate LLMs with materials science research, heralding a potentially transformative paradigm in the synthesis and characterization of novel materials.



Changes made in further discussion:
Our GPT-based framework's implications reach beyond solid-state synthesis recipes or thermoelectric materials. It showcases the adaptability of LLMs to handle niche domains with limited literature, without requiring highly tuned models or extensive data curation. Traditional NLP models are often closely linked to and tuned on their training data, risking a drop in performance when tasked with new domains. In contrast, we were able to successfully perform text extraction with very minimal initial training, as shown in the results for the Gold and Silver Standard. The feature importance reported in Figures 3 and 4 suggests that the secondary temperature is the most crucial step for solid-state synthesis, which would help scientists in developing temperature profiles.

12. How long does manual paper selection and download take? Are there plans for efficiency improvements?

We thank the reviewer for this insightful question. The download took one day and involved four people, for a total of 168 PDF files in the ExChSp dataset. This is because the papers come from different publishers and web links, which vary significantly. Automated download of the entire corpus of papers across all relevant publishers and journals was deemed impractical.

13. What are the costs associated with the OpenAI API in this study?
14. Out of curiosity, why not GPT-4? Its API is also accessible. How do the two versions compare?

We thank the reviewer for this line of inquiry; costs are indeed a point of consideration when performing text mining via commercial models such as GPT-3.5, which has a lower cost per token than GPT-4. We add the following text to the methodology section.

Changes made in methodology:
Our engineered prompts aim to provide a cost-efficient method for parsing solid-state synthesis recipes: the total budget of all prompt-refining experiments and the actual text parsing using GPT-3.5 was within 50 SGD (~36 USD), giving an effective cost of around 0.29 SGD (~0.20 USD) per PDF. While GPT-4 allows for higher accuracy in certain scenarios and enables more functionalities, we focused on GPT-3.5 because: (1) GPT-3.5 is more accessible than GPT-4, which requires a subscription; (2) it has a lower API cost than GPT-4; and (3) the parsing accuracy of GPT-3.5 and GPT-4 on such literature was found to be similar.









15. The claim, "...AgSbTe2 (chemically similar, but not seen by GPT-3.5)" raises questions. What evidence backs this? How can one ascertain if GPT-3.5 has encountered specific content?

We thank the reviewer for their question. The ExChSp dataset was extracted with a contextual prompt in which we presented the 'model answer' from the Gold Standard, formed from CuInTe2. AgSbTe2 is chemically similar in that sense, and we anticipate that GPT-3.5 should extract and parse it similarly (AgSbTe2 is chemically similar to CuInTe2 in the sense that both are ternary tellurides of ABX2 formula and share the same cubic crystal symmetry). Regarding the possibility of GPT-3.5 having encountered the specific papers, this is not possible because the detailed synthesis recipes require a journal subscription to access, whereas GPT-3.5 is trained on openly accessible data.

Changes made:
Leveraging the ExChSp dataset, we tested the possibility of interpolating synthesis conditions for AgInTe2 (part of the dataset), as well as extrapolating for AgSbTe2, both of which are chemically similar to the material studied in the Gold Standard, i.e., CuInTe2, provided as a contextual prompt. We anticipate that GPT-3.5 has no knowledge of their synthesis conditions, as they would only be found in subscription-based scientific journals and not in an open-source dataset.


Referee: 2

Comments to the Author
Thway et al. report an analysis of the use of GPT-3.5 for data extraction of synthesis conditions for ternary chalcogenides from papers. For this, they manually curate a small dataset and develop a prompting strategy for GPT-3.5. The paper reports some metrics, but technical details are often missing.

We thank the reviewer for their constructive comments. Indeed, one advantage of this work is that we do not require a large dataset to train a language model for text parsing. We proposed an effective prompt set and evaluated the parsing metrics; the parsed dataset can be further used for machine learning. Below, we provide a point-by-point response to the reviewer's queries.

# Technical comments

- *How are the accuracy scores computed?* The accuracy is perhaps one of the most important quantitative results of this work. However, it is not clear to me what those numbers mean. Neither the text nor the code clarifies how those numbers are computed:
- The plot in Fig. 2 contains scores for "literature unspecified" values. Those scores tend to be higher. However, if the literature is the ground truth, how do you compute an accuracy if this value is not specified in the literature?
- Average accuracy depends on how you average: Do you first average per "category" and then over samples, or vice versa? Or do you do something else?
- Why not report some other, perhaps more interpretable metrics, such as Hamming loss or the number of entries that are completely correctly extracted?

We thank the reviewer for their careful consideration of the metric. We address all 4 comments below.

Since this is not a classification problem with false positives and negatives, but rather a measure of whether each specific detail is correct or wrong, we can only report the fraction of correct labels (which is essentially the complement of the Hamming loss, which reports the fraction of wrong labels). In this case, wrong labels can be either 'NA', or cases where GPT-3.5 returns a completely wrong detail, for example indicating water-cooling for the quench type when the text did not specify this and a human would reasonably infer air-cooling instead.
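For concreteness, below is a minimal sketch of this scoring, assuming the Gold and Silver Standards are loaded as row-aligned pandas DataFrames; the file and column names are hypothetical and do not come from our actual code:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
FIELDS = ["base_compound", "dopant", "secondary_temperature", "annealing_time", "cooling_type"]

gold = pd.read_csv("gold_standard.csv")      # manually curated labels (ground truth)
silver = pd.read_csv("silver_standard.csv")  # GPT-3.5-extracted labels, row-aligned with gold

# Element-wise match: True wherever the extracted label equals the manual label.
correct = silver[FIELDS].astype(str).eq(gold[FIELDS].astype(str))

per_category_accuracy = correct.mean(axis=0)   # fraction of correct labels per category
overall_accuracy = correct.values.mean()       # fraction of correct labels over all cells
hamming_loss = 1 - overall_accuracy            # complement: fraction of wrong labels

print(per_category_accuracy)
print(f"overall accuracy = {overall_accuracy:.2f}, Hamming loss = {hamming_loss:.2f}")
```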

Changes made:
We propose evaluating the accuracy of text parsing by reporting the fraction of correct labels, and the overall error (inaccurate parsing) rate. In total, there are four possible situations:

Table 2: Accuracy metric for specified and unspecified information

And figure 2:

Figure 2. Details on the Gold and Silver Standard. A) Accuracy of the GPT-3.5-extracted Silver Standard compared against the manually obtained Gold Standard, considering the accuracy of both specified (dark blue) and unspecified details (light blue), as well as the overall percentage of wrong details (orange). B) Heating curves reported for the Gold Standard dataset. C) Box charts for heating information in the Gold Standard with respect to phase purity (1 refers to pure, 0 to not pure).


- *Where do those errors come from?* In a similar vein, it might be interesting for readers to have a better overview of where the errors come from: Why not create a table in which you summarize how often the model fails to find the relevant section, how often it is a syntactical error and how often semantical?

We thank the reviewer for their question. Figure 2a reports the accuracy and error rate for each category, as explained in our response to the comments above. Our results indicate that the highest error rates are found for ramping time, cooling type, and dopant. In particular, ramping and cooling times are often not reported in the literature but can be reasonably estimated from the descriptions of the experiment. Our Gold Standard dataset included human-expert-estimated ramping times, which GPT-3.5 found difficult to infer. This shows that GPT-3.5 does not perform as well as a human expert at inferring information that is only implicit in the text, but does well when the text clearly specifies the requisite information.

Changes made in results and discussion:
We first consider the comparison between the Gold and Silver Standard, which are based on the same set of CuInTe/Se papers. The GPT-based Silver Standard achieves a 73% overall accuracy as shown in Figure 2. In general, the highest accuracies were seen for all heating temperatures and times, base compound, and densification technique, which are among the most important pieces of information for obtaining high-purity products. We observe that errors in base compound and dopant often arise when papers discuss multiple types of compounds, or when the reactants reported are ternary compounds rather than base elements, which leads to confusion in parsing by GPT-3.5.

When applied to the expanded chemical space (ExChSp), >60% accuracy was achieved in correctly parsing the dopants from complex chemical formulas. Our approach demonstrates that even without complicated NLP tuning, information embedded in chemical formulas can be successfully extracted via optimized GPT-based prompting.

Additionally, being able to extract sequential heating stages is important for further materials engineering, such as tuning crystallinity or crystal structure, as it corresponds to the time–temperature profile. More sophisticated synthesis information that might not be explicitly documented in the literature, such as ramping rates, cooling type, and phase purity, is among the lower accuracies reported, as these details need to be inferred by GPT-3.5. In the Gold Standard, where information was manually parsed, we inferred the ramping rate and cooling type based on the technique used, and the phase purity via the diffraction plot, which is obviously not contained directly in the text. Most notably, details on secondary melt, cooling type, and dopant are often not explicitly reported in the text but could easily be inferred by a human expert. Consequently, these categories reported significantly poorer accuracies.

Changes made in further discussion:
Leveraging the ExChSp dataset, we tested the possibility of interpolating synthesis conditions for AgInTe2 (part of the dataset), as well as extrapolating for AgSbTe2, both of which are chemically similar to the material studied in the Gold Standard, i.e., CuInTe2, provided as a contextual prompt. We anticipate that GPT-3.5 has no knowledge of their synthesis conditions, as they would only be found in subscription-based scientific journals and not in an open-source dataset.

According to Table 2, the synthesis temperature and time for AgInTe2 are reasonable, with a ~1200 K melting stage and a ~700 K annealing temperature, and no primary melting. One interesting observation is that the interpolation is able to suggest quenching as the cooling step, which is related to the phase precipitation tendency during the synthesis [39]. When extrapolating to predict a synthesis recipe for AgSbTe2, the GPT-3.5 model was not able to yield a proper synthesis recipe, merely guessing that the recipe for AgSbTe2 is similar to that of AgInTe2 or AgInSe2. This result is obtained only because the ExChSp dataset was provided to the GPT-3.5 API as an input – otherwise, GPT-3.5 responds with three sequential melt times going from low to high temperatures, which we know is inaccurate based on domain expertise. The suggested sequentially increasing temperature stages differ from domain-expert recipes, where a secondary high-temperature melting stage occurs before a mid-temperature annealing stage to allow for melt crystallization followed by phase homogenization.

We posit that GPT-3.5, which is trained on a corpus of mainly non-scientific text, contains incomplete information on solid-state synthesis recipes. It is possible that the Gold Standard used to prompt GPT-3.5 contradicts its prior knowledge; this is compounded by the fact that we generated the responses with a temperature of zero (i.e., no creativity, since we deemed this a non-creative writing task).

- *What is the precise workflow?* While one can extract some details from the code repository, this is not what the reader needs to do to follow the article. A few questions remained unanswered to me:
- How have the chunks been chosen?
- How have the prompts been optimized (while avoiding data leakage)?
- *What is the impact of the "Is the given data following this format? If not re-format." prompt:* This is a very interesting prompt, and I would be very interested in an ablation study. I suppose this prompt was not there in the first iterations; what kind of problem did the introduction solve?

We thank the reviewer for this great question, which we had overlooked and which will improve the readability of our paper; we also thank them for the comment on prompt refinement, which was a key part of our work here. We address all four comments below.

We initially started by prompting from the ChatGPT interface, which has a limited context window. The original prompt strategy was to feed in as many synthesis paragraphs as possible and request GPT-3.5 to extract and parse them into a table without any further guidance. We found that the output was inconsistent in formatting, accuracy, and structure. From further analysis with domain expertise, we then determined the Gold Standard structure, which includes primary and secondary heating steps, since we observed that many recipes included a preliminary heating step for the binary system first. After we optimized the prompt, we obtained good responses from the LLM. However, from time to time, we observed that the output did not follow the format given in the prompt. We noticed that this only happened when we asked for more than one output (e.g., multiple temperatures). Hence, we introduced an additional format-check prompt to verify the response from the LLM. This is a strategy of sequencing LLM calls to obtain a desired output, as we have found in related work. Once we implemented the check, the LLM was able to identify output that did not follow the given format and re-format it.
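To illustrate the final strategy, below is a minimal sketch of such a sequential prompt chain with a closing format check, assuming the openai Python client (v1.x); the helper functions are hypothetical and the prompt text is abbreviated – the exact prompts and sequencing are in the GitHub repository:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(messages, question):
    """Append one question, query GPT-3.5 deterministically, and keep the reply in context."""
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,  # no creativity for a non-creative extraction task
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply


def parse_chunk(synthesis_text):
    """Sequentially extract synthesis details from one text chunk (illustrative only)."""
    messages = [
        {"role": "system", "content": "You extract solid-state synthesis details from papers."},
        {"role": "user", "content": synthesis_text},
    ]
    if "yes" not in ask(messages, "Does it include description of synthesis information?").lower():
        return None  # no synthesis paragraph found: skip this chunk / paper
    base = ask(messages, "What is the base compound used in the experiment? ...")
    profile = ask(messages, "What is the temperature profile of the experiment? Answer in a tabular format.")
    # Format-check prompt: ask the model to re-emit the table if it drifted from the expected format.
    profile = ask(messages, "Is the given data following this format? If not re-format. ...")
    return {"base_compound": base, "temperature_profile": profile}
```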


Changes made in introduction:
Such an approach bypasses the need for specialized NLP tools, offering a streamlined method for text parsing that is more accessible to the scientific community. To further illustrate the applicability of GPT parsing, we focus on ternary chalcogenide-based materials because they are the state-of-the-art thermoelectric materials at intermediate temperatures [33], where the availability of synthesis literature is relatively smaller in size compared to the examples cited previously, meaning that the ability to tune the LLM is also limited. We consider a similar prompt engineering strategy reported by Zheng et al [18] to refine this workflow, which we describe further below.

Changes made in methodology:
Following this, we refined a set of prompts for GPT-3.5 to extract the same information, instructing the model to logically infer information when it is not explicitly provided, and giving examples from the Gold Standard. The prompt set was optimized iteratively based on the following principles:
1. All questions put together in a single prompt, without any standard formatting and based on human intuition.
2. All questions put together in a single prompt, with standard formatting.
3. Questions broken up into a sequence of prompts, without standard formatting.
4. Questions broken up into a sequence of prompts, with standard formatting.

We noticed that the LLM has a hard time reasoning about or extracting information from a paragraph when doing so requires human intuition. The answers become more consistent when we provide appropriate examples in the prompt. However, when we extract too much information in one go, the LLM sometimes misses certain information and at other times 'misbehaves' with unexpected output, not adhering to the formatting instructions in our prompt set. Overall, we found that sequentially extracting information one item at a time with standardized answers gives the best results. The iterative process is reported in prompt_engineering_progress.ipynb in the GitHub repository.

It was observed that the use of simple questions and a restriction to no more than two questions per prompt contributed to improved accuracy in information extraction. The initial question is aimed at identifying synthesis information within the paper. If such information is absent, the paper is skipped and the next paper is processed. Once the synthesis paragraph is detected by the program, the subsequent question is employed to extract details regarding the base compound and dopant. Following that, the next question pertains to the temperature profile mentioned within the synthesis paragraph. However, from time to time, we observed that the output does not follow the format given in the prompt, especially for prompts which require multiple outputs in the same response. Therefore, it was necessary to include formatting checks in the sequence.

Presented below is a series of eight questions within six prompts:
1. Does it include description of synthesis information?
2. Does the experiment result in pure phase formation of crystal?
3. What is the base compound used in the experiment? Exclude dopant and do not include "x" when you mention the base compound.
4. What is the dopant used in the experiment to dope the base compound? Generally, it is written before "x". The dopant is not included in the base compound. Write chemical symbol (e.g. C for Carbon)
5. What is the temperature profile of the experiment? Answer in a tabular format.
6. Is the given data following this format? If not re-format.
7. Choose one of the cooling types whether it is left in the room, in water or immersed in something cold, or left in the furnace: "Room" or "Quenching" or "Furnace".
8. Choose what is the densification technique used to densify the powder: "Hot Press" or "Sintering" or "NA".

Our engineered prompts aim to provide a cost-efficient method for parsing solid-state synthesis recipes: the total budget of all prompt-refining experiments and the actual text parsing using GPT-3.5 was within 50 SGD (~36 USD), giving an effective cost of around 0.29 SGD (~0.20 USD) per PDF. While GPT-4 allows for higher accuracy in certain scenarios and enables more functionalities, we focused on GPT-3.5 because: (1) GPT-3.5 is more accessible than GPT-4, which requires a subscription; (2) it has a lower API cost than GPT-4; and (3) the parsing accuracy of GPT-3.5 and GPT-4 on such literature was found to be similar.

The re-generated set of the same dataset of papers is hereby named the 'Silver Standard'. This prompt set is paired with the PyPDF library to convert PDFs to machine-readable form, where we broke each PDF down into text chunks to fit within the token limit. For splitting the text string into chunks, we use the PyPDFLoader.load_and_split function, which in turn uses RecursiveCharacterTextSplitter with a default maximum chunk size of 4000 characters and an overlap of 200 characters between chunks.
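For readers wanting to reproduce the chunking step, a minimal sketch assuming the langchain PyPDFLoader interface described above (the file name is a placeholder):

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("paper.pdf")  # placeholder path to one downloaded article

# load_and_split defaults to RecursiveCharacterTextSplitter with chunk_size=4000
# and chunk_overlap=200 (characters); made explicit here for clarity.
splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
chunks = loader.load_and_split(text_splitter=splitter)

for chunk in chunks:
    print(len(chunk.page_content))  # each chunk is at most ~4000 characters
```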

- What temperature has been chosen for sampling (GitHub suggests 0, but is this really the case?)
Yes, we set the temperature to 0 for reproducibility, and because we deemed that randomness was not necessary for a non-creative writing task.

- Figure 2c) Since the dataset is not immense, it might be insightful to simply show the data (as swarmplots)
We thank the reviewer for the comment, we did consider swarm plots to present the data. However, we found that given such a small dataset for the Gold Standard (with only 21 papers), a swarm plot did not present meaningful information compared to a box and whisker plot.

- How is the feature importance exactly computed?
- If you do leave-one-out cross-validation, you end up with $n$ models, where $n$ is the length of the dataset. Does Figure 3 show the analysis for one model (which) or an average (how averaged?) model?

We thank the reviewer for this comment, and address these 2 questions together below.

The feature importance is computed by taking the .feature_importances_ attribute reported by the decision tree, following the visualization.ipynb notebook. We observed that, due to randomness in initialization, the feature importance rankings had significant differences, which is why we opted for the leave-one-out (LOO) cross-validation strategy repeated 200 times and calculated the average of the feature importances over the 200 x 61 (dataset size) models. We chose 200 repetitions as we observed this to be large enough to eliminate any randomness in the fitting.
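A minimal sketch of this averaging, assuming scikit-learn, with X (features) and y (phase-purity labels) as hypothetical arrays built from the parsed dataset:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier


def averaged_feature_importance(X, y, n_repeats=200, seed=0):
    """Average decision-tree feature importances over repeated leave-one-out fits."""
    rng = np.random.RandomState(seed)
    loo = LeaveOneOut()
    importances = []
    for _ in range(n_repeats):
        for train_idx, _ in loo.split(X):
            clf = DecisionTreeClassifier(random_state=rng.randint(10**6))
            clf.fit(X[train_idx], y[train_idx])
            importances.append(clf.feature_importances_)
    # n_repeats * len(X) models in total; the reported importance is their mean.
    return np.mean(importances, axis=0)
```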

- If you perform SHAP-based feature importance analysis you need to compute the SHAP values for a set of datapoints. For which datapoints did you compute the values? For the training set? For the test set? For the entire dataset?
- Why did you train a different kind of model for SHAP analysis? ("As an alternative means of analysis, an XGBoost classifier was implemented to derive SHAP values of the features")
- How are the hyperparameters chosen for the models built for the feature-importance analysis?

We thank the reviewer for careful consideration of the methodology. We address these 3 comments together below.

Given such a sparse and imbalanced dataset, we found that there were negligible differences in the SHAP results when following the tutorial shown in https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/Basic%20SHAP%20Interaction%20Value%20Example%20in%20XGBoost.html. This was in sharp contrast to our approach with the decision tree classifier.

Since the focus is on prompt engineering rather than a deep dive into the data analysis, we decided to go with default hyperparameters, training an XGBoost model for 5000 iterations, as also shown in the visualization.ipynb notebook.
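A minimal sketch of that analysis, assuming the xgboost and shap packages, with X and y as hypothetical feature/label arrays for the parsed dataset (the binary objective reflects the pure/not-pure phase label); evaluating the SHAP values on the entire dataset is shown here for illustration:

```python
import xgboost
import shap

# Default hyperparameters apart from the number of boosting iterations.
dtrain = xgboost.DMatrix(X, label=y)
model = xgboost.train({"objective": "binary:logistic"}, dtrain, num_boost_round=5000)

# TreeExplainer gives SHAP values for tree ensembles; evaluated here
# on the entire dataset for illustration.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X)  # feature-importance summary plot
```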

- “Therefore, we ask the question – can LLMs be used to parse the literature, but also produce a machine learning readable dataset?” —— Technically, this broad question has been addressed (in work coming out of Ceder’s group as well as with some examples at the LLM Hackathon — both are cited in the paper; https://arxiv.org/pdf/2212.05238.pdf might deserve a citation, too). Perhaps it might make sense to rephrase the research question to be more specific.

We thank the reviewer for their comment. We limited ourselves to a particular class of materials because solid-state synthesis is specific to each class. To demonstrate the applicability of GPT parsing, we focused on ternary chalcogenide-based materials because they are the state-of-the-art thermoelectrics at intermediate temperatures. If one wanted to apply the prompt set to e.g., oxide materials which have completely different synthesis conditions, a modified recipe format that matches the actual synthesis should be used instead of the current format which has multiple heating stages and a densification stage.

Changes made:
Large Language Models (LLMs) have recently emerged as an alternative tool to extract knowledge from scientific literature, enabling contextualization and summarization of information efficiently and robustly. This has been demonstrated across different materials science fields: chemistry [12,16–21], polymers [22], general materials [23–27], optical materials [28], crystal structures [29], and even other fields such as medicine [30–32]. These models can be used to identify and categorize key information such as research conclusions, methods, and trends, making it easier for researchers to access relevant insights rapidly. By contextualizing and summarizing information, these models provide an alternative route to potentially facilitate the efficient extraction of knowledge from existing literature data. We then ask the question – specific to fields with sparse literature and strong reporting bias, can LLMs be used not only to parse the synthesis information, but also to produce a machine-learning-readable dataset?

Such an approach bypasses the need for specialized NLP tools, offering a streamlined method for text parsing that is more accessible to the scientific community. To further illustrate the applicability of GPT parsing, we focus on ternary chalcogenide-based materials because they are the state-of-the-art thermoelectric materials at intermediate temperatures [33]; the body of available synthesis literature here is relatively small compared to the examples cited previously, meaning that the ability to tune the LLM is also limited. We consider a prompt engineering strategy similar to that reported by Zheng et al. [18] to refine this workflow, which we describe further below.


# Reproducibility
- I encourage the authors to
  - add license information to their repository
  - ensure it contains the code/scripts for the actual experiments
  - make the code citable by archiving it, for example, on Zenodo
- the link in the paper should point to the repository and not to the organization

We thank the reviewer for careful consideration of our work’s reproducibility. We have improved the Github page (https://github.com/Kedar-Materials-by-Design-Lab/Harnessing-GPT-3.5-for-Text-Parsing-in-Solid-State-Synthesis-case-study-of-ternary-chalchogenides) by including a separate directory for the ExChSp results and explaining the timeout error. The visualization.ipynb file has also been updated with annotations. If there are any further modifications the reviewer deems appropriate, we are happy to accommodate them.

We further thank the reviewer for the detailed probing of the timeout error in the original notebook. The timeout error is returned by the OpenAI package when a request is not answered in time, and it is a noted challenge and inconvenience for prompting. Other than necessitating a rerun of the code, it does not have any adverse effect on parsing. We chose to keep the error messages to illustrate an extant challenge which one might come across while attempting to reproduce our methods using the published OpenAI package.
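For readers who prefer to handle the error programmatically rather than rerunning the notebook, a minimal retry wrapper is sketched below; it assumes the legacy (pre-1.0) openai Python package used in this work, and the helper name and wait times are illustrative.

import time
import openai

def ask_with_retry(messages, model="gpt-3.5-turbo", max_retries=3):
    for attempt in range(max_retries):
        try:
            response = openai.ChatCompletion.create(model=model, messages=messages)
            return response["choices"][0]["message"]["content"]
        except openai.error.Timeout:
            # Request not answered in time; wait briefly before resending.
            time.sleep(5 * (attempt + 1))
    raise RuntimeError(f"Request still timing out after {max_retries} attempts")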

Referee: 3

Comments to the Author
Summary
The authors have gleaned synthesis recipes for chalcogenide-based thermoelectric materials by GPT-3.5, fine-tuned through optimized GPT-based prompts. They have generated a database, achieving an overall accuracy of approximately 73%. This database is subsequently utilized to infer synthesis conditions for ternary chalcogenides. A classifier model is constructed based on this database, achieving an accuracy of approximately 60% in predicting phase purity. This study offers an approach for information extraction by integrating Large Language Models (LLMs) into the realm of materials science research.

We thank the reviewer for the accurate summary of this work. A point-by-point response to the reviewer’s comments is given below.

Points of view
1. Methodologically, this work is limited in its scope to chalcogenide-based thermoelectric materials and to GPT-3.5 rather than the more capable GPT-4 for higher accuracy.

We thank the reviewer for their question. We limited ourselves to a particular class of materials because solid-state synthesis is specific to each class. To demonstrate the applicability of GPT parsing, we focused on ternary chalcogenide-based materials because they are the state-of-the-art thermoelectrics at intermediate temperatures. If one wanted to apply the prompt set to, e.g., oxide materials, which have completely different synthesis conditions, a modified recipe format that matches the actual synthesis should be used instead of the current format, which has multiple heating stages and a densification stage.

The introduction of GPT-4 allows for higher accuracy in certain scenarios and enables more functionalities. We focused on version 3.5 instead of version 4 because: (1) GPT-3.5 has widespread availability and accessibility compared to GPT-4, which requires a subscription, and incurs roughly 5% of the API cost of GPT-4, making our approach more accessible to researchers; and (2) the parsing accuracy of 3.5 and 4 from text is similar.

Changes made in introduction:
Large Language Models (LLMs) have recently emerged as an alternative tool to extract knowledge from scientific literature, enabling information to be contextualized and summarized efficiently and robustly. This has been demonstrated across different materials science fields: chemistry [12,16–21], polymers [22], general materials [23–27], optical materials [28], crystal structures [29], and even other fields such as medicine [30–32]. These models can be used to identify and categorize key information such as research conclusions, methods, and trends, making it easier for researchers to access relevant insights rapidly. By contextualizing and summarizing information, these models provide an alternative route to potentially facilitate the efficient extraction of knowledge from existing literature data. We then ask the question – specific to fields with sparse literature and strong reporting bias, can LLMs be used not only to parse the synthesis information, but also to produce a machine-learning-readable dataset?

Such an approach bypasses the need for specialized NLP tools, offering a streamlined method for text parsing that is more accessible to the scientific community. To further illustrate the applicability of GPT parsing, we focus on ternary chalcogenide-based materials because they are the state-of-the-art thermoelectric materials at intermediate temperatures [33]; the body of available synthesis literature here is relatively small compared to the examples cited previously, meaning that the ability to tune the LLM is also limited. We consider a prompt engineering strategy similar to that reported by Zheng et al. [18] to refine this workflow, which we describe further below.

Changes made in methodology:
Our engineered prompts aim to provide a cost-efficient method for parsing solid-state synthesis recipes; the total budget of all prompt-refining experiments and actual text parsing using GPT-3.5 was within 50 SGD (~36 USD), giving an effective cost of around 0.29 SGD (~0.20 USD) per PDF. While GPT-4 allows for higher accuracy in certain scenarios and enables more functionalities, we focused on GPT-3.5 instead because: (1) GPT-3.5 is more accessible than GPT-4, which requires a subscription; (2) it has a lower API cost than GPT-4; and (3) the parsing accuracy of 3.5 and 4 from text was found to be similar for such literature.

2. It does not present any prompting strategy in the fine-tuning process and just shows the final prompts. But it demonstrates the efficiency of automated data extraction and underscores the broader applicability of LLMs in materials science research, enabling users to conduct text mining and corpus curation without developing specific NLP algorithms.

We thank the reviewer for pointing out the need to clarify the refinement strategy. We have demonstrated the refinement strategy by attaching the notebook prompt_engineering_progress.ipynb in the GitHub repository, and also included a more thorough discussion of our prompt engineering strategy in the manuscript.

Changes made:
Following this, we refined a set of prompts for GPT-3.5 to extract the same information, instructing the model to logically infer information when it is not explicitly provided and giving examples from the Gold Standard. The prompt set was optimized iteratively through the following stages:
1. All questions put together in a single prompt, without any standard formatting and based on human intuition.
2. All questions put together in a single prompt, with standard formatting.
3. Questions broken up into a sequence of prompts, without standard formatting.
4. Questions broken up into a sequence of prompts, with standard formatting.

We noticed that LLMs have a hard time reasoning about or extracting information from a paragraph when doing so requires human intuition. The answers become more consistent when we provide appropriate examples in the prompt. However, when too much information is extracted in one go, the LLM sometimes misses certain details and at other times ‘misbehaves’ with unexpected output that does not adhere to the formatting instructions in our prompt set. Overall, we found that sequentially extracting information one detail at a time with standard answers gives the best results. The iterative process is reported in prompt_engineering_progress.ipynb in the GitHub repository.

It was observed that the use of simple questions and a restriction to no more than two questions per prompt contributed to improved accuracy in information extraction. The initial question aims to identify synthesis information within the paper; if such information is absent, the paper is skipped and the next paper is processed. Once the synthesis paragraph is detected by the program, the subsequent question is employed to extract details regarding the base compound and dopant. The following question then pertains to the temperature profile mentioned within the synthesis paragraph. However, from time to time, we observed that the output does not follow the format given in the prompt, especially for prompts which require multiple outputs in the same response. Therefore, it was necessary to include formatting checks in the sequence.
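A schematic sketch of this sequential prompting with a formatting check is given below; the prompt wording and helper names are illustrative only, and the actual prompt set is reported in prompt_engineering_progress.ipynb.

import re

def extract_recipe(ask, paragraph):
    # 'ask' is any callable that sends a single prompt to the LLM and returns its reply,
    # e.g. the ask_with_retry helper sketched earlier.
    found = ask("Does the following text contain solid-state synthesis information? "
                "Answer yes or no.\n" + paragraph)
    if "no" in found.lower():
        return None  # skip this paper and move on to the next one

    compound = ask("From the synthesis text below, give the base compound and dopant "
                   "in the format 'base: <formula>; dopant: <element or NA>'.\n" + paragraph)
    profile = ask("From the same text, give the heating temperature and time "
                  "in the format 'temperature: <value> K; time: <value> h'.\n" + paragraph)

    # Formatting check: re-ask once if the reply does not follow the requested format.
    if not re.search(r"temperature:\s*\S+", profile):
        profile = ask("Answer strictly in the format 'temperature: <value> K; time: <value> h'.\n"
                      + paragraph)

    return {"compound": compound, "profile": profile}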


3. However, the overall accuracy of approximately 73% is not high enough, resulting in the classification model not performing very well (60% accuracy). The extraction and machine learning accuracy should be improved. In the reviewer’s opinion, this work should be improved and is not enough for publication in its current state.

We thank the reviewer for their critical question. Since this is not a classification problem with false positives and negatives, but rather a measure of whether each specific detail is correct or wrong, we can only report the fraction of correct labels (the complement of the Hamming loss, which reports the fraction of wrong labels). In this case, wrong labels arise either when GPT-3.5 returns ‘NA’ or when it returns a completely wrong detail, for example indicating water-cooling for quench type when the text did not specify, which a human would reasonably infer as air-cooling instead.

The accuracy is heavily affected by categories which are often not explicitly mentioned in the original PDF text, such as “Secondary ramping time”, “Cooling Type”, and “Dopant”. Human experts can estimate or infer such information via domain expertise, which is challenging for GPT at the current stage. If we do not consider these 3 categories, the accuracy is approximately 82% for text parsing alone.
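To make the metric concrete, a minimal illustration is given below; the parsed details are represented as simple dictionaries and the values are made up purely for the example.

# Fraction of correct labels, i.e. the complement of the Hamming loss.
gold   = {"base": "CuInTe2", "dopant": "Ag", "cooling": "air-cooling"}
silver = {"base": "CuInTe2", "dopant": "NA", "cooling": "water-cooling"}

correct = sum(silver[key] == gold[key] for key in gold)
accuracy = correct / len(gold)       # fraction of correct labels (1/3 here)
hamming = 1 - accuracy               # fraction of wrong labels
print(f"accuracy = {accuracy:.2f}, Hamming loss = {hamming:.2f}")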



Changes made in methodology:
We propose evaluating the accuracy of text parsing by reporting the fraction of correct labels, and the overall error (inaccurate parsing) rate. In total, there are four possible situations:

Table 2: Accuracy metric for specified and unspecified information

And figure 2:

Figure 2. Details on the Gold and Silver Standard. A) Accuracy of the GPT-3.5-extracted Silver Standard compared against the manually obtained Gold Standard, considering the accuracy of both specified (dark blue) and unspecified details (light blue), as well as the overall percentage of wrong details (orange). B) Heating curves reported for the Gold Standard dataset. C) Box charts for heating information in the Gold Standard with respect to phase purity (1 refers to pure, 0 to not pure).


Changes made in results and discussion:
We first consider the comparison between the Gold and Silver Standards, which are based on the same set of CuInTe/Se papers. The GPT-based Silver Standard achieves a 73% overall accuracy, as shown in Figure 2. In general, the highest accuracies were seen for all heating temperatures and times, base compound, and densification techniques, which are among the most important pieces of information towards high-purity products. We observe that the errors in base compound and dopant are often due to cases where papers discuss multiple types of compounds, or where the reported reactants are ternary compounds rather than base elements, which leads to confusion in parsing by GPT-3.5.

Additionally, being able to extract sequential heating stages is important for further material engineering, such as tuning crystallinity or crystal structure, as they correspond to the time-temperature profile. In the Gold Standard, where information was manually parsed, we inferred the ramping rate and cooling type from the technique used, and phase purity from the diffraction plot, which is obviously not contained directly in the text. Most notably, details on secondary melt, cooling type and dopant are often not explicitly reported in the text but can easily be inferred by a human expert. Consequently, these categories show significantly poorer accuracies.

Issues with methodology
1) What do LITERATURE SPECIFIED and LITERATURE UNSPECIFIED in Figure 2(a) mean? Do they correspond to the GPT-prompted datasets of CuInTe/Se and ExChSp?

We thank the reviewer for this comment. We address this in our response above, where the accuracy metric for specified and unspecified information is defined (Table 2).

2) What is the meaning of TEMPERATURE on the vertical coordinate in Figure 2(b)? Is it the maximum temperature of the heating curve?

We thank the reviewer for their question. Here, we would like to clarify that the vertical coordinate denotes the furnace temperature during the heating-cooling-annealing process. The maximum temperature of the heating curve can be directly read from the graphs.

3) Why are there fewer points in Figure 2(c) for PURE than for NOT PURE?

We thank the reviewer for this question. In solid state chemistry, not all processes lead to a pure phase because of competing phases at different temperatures. Making a pure-phase material, as opposed to a mixed-phase material, can be challenging and typically requires careful process optimization.

4) The authors attribute the difference between the significance of the features obtained using the gold standard dataset and the ExChSp dataset to the different formats of the scientific papers, but it could also be due to the lower extraction precision. This can be further demonstrated by switching to a more balanced corpus for extraction.

We thank the reviewer for this question. The dataset which we have extracted for each chalcogenide composition already constitutes the maximal set of papers which can be processed by GPT. In that sense, the corpus is already as balanced as it can be.




Round 2

Revised manuscript submitted on 21 Nov 2023
 

08-Dec-2023

Dear Dr Hippalgaonkar:

Manuscript ID: DD-ART-10-2023-000202.R1
TITLE: Harnessing GPT-3.5 for Text Parsing in Solid-State Synthesis – case study of ternary chalcogenides

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************


 
Reviewer 1

I believe that the authors have successfully addressed nearly all of my comments with appropriate attention and care. However, I have a few follow-up minor comments that I believe the authors can address to further improve the paper. Therefore, I recommend the publication of this work following minor revisions.

Comment:

1. Regarding the performance metrics for this work on GPT-3.5, could improvements be made through approaches such as the chain of thoughts to enhance inference or via better prompt writing strategies? The author should discuss this in the manuscript, possibly as a future direction.
2. In response to my previous question about determining accuracy (comment #5), the author mentions that incorrect labels can result from "...or when GPT-3.5 returns a completely wrong detail." How is this detected and accounted for? To clarify, my question on this aspect is more about whether this accuracy determination can be automated. Regardless, this should be clearly stated in the manuscript for the reader's understanding.
3. Please check ArXiv references 18, 25, and 32 for their latest versions, as I believe these have been published in conferences or similar venues.
4. Please double-check the content in Figure 1, as the positioning of some elements does not seem to be correct.
5. The author mentioned, "(3) the parsing accuracy of 3.5 and 4 from text were found to be similar for such literature." This is an important observation but lacks supporting data.

Reviewer 3

The authors have carefully revised the manuscript in accordance with the opinions of reviewers and editors. The quality of the manuscript has been improved remarkably.
In my opinion, this manuscript is now acceptable.


 

Full response is appended as a separate file

This text has been copied from the Microsoft Word response to reviewers and does not include any figures, images or special characters:

Comments to the Author
I believe that the authors have successfully addressed nearly all of my comments with appropriate attention and care. However, I have a few follow-up minor comments that I believe the authors can address to further improve the paper. Therefore, I recommend the publication of this work following minor revisions.
We would like to show our appreciation to the reviewer for his/her meticulous effort in ensuring a high-quality publication. We address the minor revision comments below.

Comment:
1. Regarding the performance metrics for this work on GPT-3.5, could improvements be made through approaches such as the chain of thoughts to enhance inference or via better prompt writing strategies? The author should discuss this in the manuscript, possibly as a future direction.
Thank you for the suggestion; we agree that more consideration needs to be given to future directions and refinements of this work for further scientific contribution. We have included an additional paragraph in the further discussion section.
Changes made in further discussion section:
We also acknowledge the paucity of literature in this specific field. Even though we searched and extracted knowledge from three decades of literature (a total of 162 research papers, of which only 61 were successfully extracted), the dataset is severely biased towards positive results. Hence, we would emphasize to the community that there is a pressing need for balanced datasets in which negative experimental results are also reported. We hope that combining our framework with a domain-specific Gold Standard is the first step towards a transferable approach, applicable across different realms of materials science, that enables users to conduct text mining and corpus curation without developing specific NLP algorithms.
Apart from the need for high-quality and balanced datasets, the outlook of this work includes further refinement in prompt engineering and chain-of-thought inference to better tune responses for a given model. Further on, fine-tuning base models or even training new models on a sufficiently large corpus of scientific text is a more ambitious task. Finally, we also propose sequencing another LLM to cross-check the data extraction for better reliability of results, and to help compute and validate the accuracy metric that we use above.

2. In response to my previous question about determining accuracy (comment #5), the author mentions that incorrect labels can result from "...or when GPT-3.5 returns a completely wrong detail." How is this detected and accounted for? To clarify, my question on this aspect is more about whether this accuracy determination can be automated. Regardless, this should be clearly stated in the manuscript for the reader's understanding.
We apologise for the confusion here – our amended manuscript has attempted to clarify the methodology as much as possible. The accuracy determination was done manually, since the total number of details (papers × categories) is relatively low. However, we do acknowledge that automating this process could be immensely helpful, as we also note above as a potential refinement of this work.
Changes made in results and discussion:
To conduct a thorough analysis of the extracted data, we next employ a simple machine learning model, a decision tree classifier, to identify how temperature profiles may impact the production of the pure-phase compound. This model is trained on both the Gold Standard and ExChSp datasets, and the results are reported in Figures 3a and 3b, respectively. Referring to the four possible conditions listed in Table 2, we manually compute the accuracy metric for each entry (every detail for each paper).

3. Please check ArXiv references 18, 25, and 32 for their latest versions, as I believe these have been published in conferences or similar venues.
Thank you for pointing out this error; we have corrected references 18, 25 and 32 to their latest versions, as listed below. Note, however, that the exact numbering differs in the revised manuscript.
Z. Zheng, O. Zhang, C. Borgs, J.T. Chayes, O.M. Yaghi, ChatGPT Chemistry Assistant for Text Mining and the Prediction of MOF Synthesis, J Am Chem Soc. 145 (2023) 18048–18062. https://doi.org/10.1021/jacs.3c05819.
I. Beltagy, K. Lo, A. Cohan, SciBERT: A Pretrained Language Model for Scientific Text, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019: pp. 3615–3620. https://doi.org/10.18653/v1/D19-1371.
R. Nadkarni, D. Wadden, I. Beltagy, N.A. Smith, H. Hajishirzi, T. Hope, Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study, in: D. Chen, J. Berant, A. McCallum, S. Singh (Eds.), 3rd Conference on Automated Knowledge Base Construction, AKBC 2021, Virtual, October 4-8, 2021, 2021. https://doi.org/10.24432/C5QC7V.

4. Please double-check the content in Figure 1, as the positioning of some elements does not seem to be correct.
We thank the reviewer for pointing this out; we have realigned some of the components and made the figure better structured.
Changes:
(figure)

5. The author mentioned, "(3) the parsing accuracy of 3.5 and 4 from text were found to be similar for such literature." This is an important observation but lacks supporting data.
We thank the reviewer for their consideration of this point, and we agree that supporting data is important. We have included conversation histories for both GPT-3.5 and GPT-4 from our preliminary exploration and refinement of the methodology in the Github repository README: https://github.com/Kedar-Materials-by-Design-Lab/Harnessing-GPT-3.5-for-Text-Parsing-in-Solid-State-Synthesis-case-study-of-ternary-chalchogenides/tree/main.
Changes made in methodology section:
Our engineered prompts aim to provide a cost-efficient method for parsing solid-state synthesis recipes; the total budget of all prompt-refining experiments and actual text parsing using GPT-3.5 was within 50 SGD (~36 USD), giving an effective cost of around 0.29 SGD (~0.20 USD) per PDF. While GPT-4 allows for higher accuracy in certain scenarios and enables more functionalities, we focused on GPT-3.5 instead because: (1) GPT-3.5 is more accessible than GPT-4, which requires a subscription; (2) it has a lower API cost than GPT-4; and (3) the parsing accuracy of 3.5 and 4 from text was found to be similar for such literature. A preliminary comparison between both models is reported in the Github repository as conversation histories.




Round 3

Revised manuscript submitted on 21 Dec 2023
 

21-Dec-2023

Dear Dr Hippalgaonkar:

Manuscript ID: DD-ART-10-2023-000202.R2
TITLE: Harnessing GPT-3.5 for Text Parsing in Solid-State Synthesis – case study of ternary chalcogenides

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery


 
Reviewer 1

I have reviewed the responses and revisions from the authors, and they have satisfactorily addressed all of my comments. I believe this manuscript is now ready for acceptance.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.