From the journal Digital Discovery Peer review history

Flexible, model-agnostic method for materials data extraction from text using general purpose language models

Round 1

Manuscript submitted on 21 Jan 2024
 

01-Mar-2024

Dear Dr Polak:

Manuscript ID: DD-ART-01-2024-000016
TITLE: Model-Agnostic Materials Data Extraction from Text Using Language Models

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link :

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

The paper proposes a human-in-the-loop approach for extracting materials property data from text, with a focus on scientific literature. The authors first describe the general problem setup and their method, which involves 3 steps along with a 0th step of text processing. The method has 2 human-in-the-loop steps along with 2 automated language model steps that classify and extract text. The LM steps can also involve some fine-tuning with a small amount of human-labeled data during the extraction process. Following the description of the method, the paper shows results for different LMs on the extraction of bulk modulus data from the literature. Figure 3 provides general performance data for the different LMs, outlining recall-precision curves and related metrics. Figure 4 presents the most convincing results of the paper, as it shows the benefit of human-in-the-loop fine-tuning for the LMs with increased performance for all models studied. This shows that the proposed method does yield better results as proposed by the authors. Following the results, the authors provide a discussion of an additional use case and mention additional methods. This is followed by conclusions and more details on datasets. Overall, the paper provides a useful method for a practical materials problem. In its current form, the paper is somewhat verbose in the method, results and discussion sections and could benefit from condensing the main message to make it easier for the reader to understand the important parts. I favor accepting, assuming the authors make the minor revisions outlined below:

1. Do you have actual numbers on the human time it took to put the final results together? This is specific to the ones you describe in your paper.

2. From the compute perspective, can you comment on the possibility of running the LMs on a local machine? It seems like the CPUs you are using are data center or workstation CPUs. How feasible would it be to run the same procedures on modern laptops? This would make the usability argument significantly stronger.

3. Figure 1 is unclear in what it is trying to show. Do you want to show that your method allows processing of a larger number of entries with less human labor? If so, I suggest a different way of communicating that; maybe a table that shows the number of entries processed per unit of human labor along with a quality description attached.

4. The paper does not show results for the fully automated “modest” quality case (referring to the Figure 1 schema). It would be good to add a discussion of those in the results or discussion section to motivate the use of human-in-the-loop steps as proposed in the method. If the zero-shot setting meets this criterion, the authors should clearly state this. It would also be good to have a “baseline” result in the figures that describes the minimum acceptable performance to be useful. In that case, one can see which models meet that bar.

5. How does your method compare to CDE for the case mentioned in the discussion on critical cooling rates for metallic glasses?

Reviewer 2

The study presents a method for extracting materials data from research papers using large language models (LLMs) with human supervision. While the methodology is not highly novel considering the existing literature, this approach is relatively efficient, requires minimal coding or prior knowledge, and is suitable for small-sized databases. The third step (human-assisted data structurization) significantly limits the scalability of the approach for mid- to large-sized datasets. It achieves high recall and nearly perfect precision. The method is adaptable to new and improved language models, allowing higher performance. I believe this paper can be a good addition to Digital Discovery; however, it requires major work. Please see my comments below:

“In addition, fully automated approaches often require extensive retraining and building of parsers specific for different properties, as well as a significant amount of coding. These methods thus often require a significant initial investment of human time.”
Not necessarily. This is only true for conventional methods, where things needed to be hard-coded for any desired output. Please check out: https://arxiv.org/abs/2312.07559 and https://arxiv.org/abs/2312.11690 as an example.

GPT-3.5 Davinci? Do you mean GPT-3.5 as the LLM and text Davinci as the embedding model? It is important to distinguish between the LLM and the embedding model used for tokenization. Please explain.

“For example, if we know the data is numeric we can keep just sentences containing a number. In the case of bulk modulus (see Dataset 1 in Sec. VI)”
Considering the semantic search capabilities that come with models like GPT-4, using cosine similarity or maximum marginal relevance top-k retrieval, this seems to be a major under-utilization. If you include sentences that contain numbers, you’re likely to gather many irrelevant sentences containing entities like “Figure 4”, “Dataset 1” and such. And this just adds to the post-processing workload.
Another example of under-utilization: “However, with experience, this time quickly reduces as the user gets used to the process. In addition, more experienced users may employ simple computer codes, e.g. based on regular expressions, which would preselect possible candidates for values and units, reducing the time significantly”. You can simply describe the format of your desired output as natural language text to the LLM as an input prompt and avoid worrying about parsing or manually extracting that information from a sentence.

“Step 3 Data structurization”
This is called template filling in natural language processing. Please use NLP terminology for better coherence.

“The user will typically perform this step by first ranking the sentences by their probability of being relevant, which is the output from Step 1 (or Step 2 if performed).”
How do you get these probabilities? Are those the logits of the generated token for the outputs? Please elaborate.

“In general, human assisted data structurization, even when only the sentences containing the relevant data are given, may be a tedious and time-consuming task. However, at this point it is the only method that can guarantee an almost perfect precision.”
This is a strong statement and it is not true. There are emerging methods that do autonomous template filling with low or zero human assistance and still have reasonable performance. See these two papers for more details: https://arxiv.org/abs/2312.07559 and https://arxiv.org/abs/2312.11690.

Just to confirm, does the workflow involve breaking the research paper into sentences and running the model on a single sentence at a time, or do you run your model on the entire research article and find relevant/irrelevant sentences? If it is the first case, I don’t think we can really call it named-entity recognition, especially in a single sentence like: “We determined the bulk modulus to be 123 GPa.”

Comparing the ROCs in Figures 3b and 4b, it looks like the fine-tuned Bart with few examples (Fig. 4b, AUC is about 0.3 with 20 positive examples) is performing worse than in the zero-shot setting (Fig. 3b, AUC is 0.6). Why is that happening? Please explain.

Reviewer 3

General comment:  

The authors have satisfactorily addressed all the comments as per the review comments and the response submitted. Accordingly, I now have only the following queries.

1) In the VI. Datasets section of the paper, the authors write: "The bulk modulus is a benchmark dataset of sentences, a test dataset consisting of 100 papers randomly chosen from the over 10000 paper results of a search query “bulk modulus”+”crystal” returned from the ScienceDirect API. In the written text of these 100 papers, there are 18408 sentences in total, out of which 237 sentences mention the value of bulk modulus."

The ScienceDirect API returns only 6000 results for a given query, hence these numbers don't look consistent. Can the authors elaborate more on this aspect? https://dev.elsevier.com/api_key_settings.html

2) Since the authors have access to GPT-4, why did they not structure the data in JSON format using the built-in functionality of the OpenAI API for GPT-4?

3) What are the similarities and differences between this work and the authors’ own previous work: https://www.nature.com/articles/s41467-024-45914-8 ? This question is very important to address as it is not very evident what the differences are.

Finally, the authors have also uploaded their API key in the code (probably by mistake). They can remove it to prevent its misuse.


 

Dear Editor and Reviewers,

Thank you for your valuable feedback on our manuscript. The revisions made in response to your constructive comments have significantly improved the manuscript. We have carefully addressed all concerns raised during the review process.

In our previous revision, we shortened the title of the paper. However, we noticed from the comments we received in the current review that keeping the previous title might have been better, as it helped to emphasize the flexibility and generality of the approach. Therefore, with your permission, we would like to return to the previous title: Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models.

We are grateful for the opportunity to refine our manuscript based on your feedback and hope that the revised version meets your expectations.

Please find a point-by-point response to all reviewers' remarks below.

Best regards,

Maciej P. Polak and Dane Morgan




Referee: 1

Comments to the Author
The paper proposes a human-in-the-loop approach for extracting materials property data from text, with a focus on scientific literature. The authors first describe the general problem setup and their method, which involves 3 steps along with a 0th step of text processing. The method has 2 human-in-the-loop steps along with 2 automated language model steps that classify and extract text. The LM steps can also involve some fine-tuning with a small amount of human-labeled data during the extraction process. Following the description of the method, the paper shows results for different LMs on the extraction of bulk modulus data from the literature. Figure 3 provides general performance data for the different LMs, outlining recall-precision curves and related metrics. Figure 4 presents the most convincing results of the paper, as it shows the benefit of human-in-the-loop fine-tuning for the LMs with increased performance for all models studied. This shows that the proposed method does yield better results as proposed by the authors. Following the results, the authors provide a discussion of an additional use case and mention additional methods. This is followed by conclusions and more details on datasets. Overall, the paper provides a useful method for a practical materials problem. In its current form, the paper is somewhat verbose in the method, results and discussion sections and could benefit from condensing the main message to make it easier for the reader to understand the important parts. I favor accepting, assuming the authors make the minor revisions outlined below:

Do you have actual numbers on the human time it took to put the final results together? This is specific to the ones you describe in your paper.

Response: We added an approximate total human time required for compiling the final full critical cooling rate dataset described in the text.

Revised text (page 12, Sec. IV B):
The total human time required for gathering this dataset did not exceed 5 hours.

From the compute perspective, can you comment on the possibility of running the LMs on a local machine? It seems like the CPUs you are using are data center or workstation CPUs. How feasible would it be to run the same procedures on modern laptops? This would make the usability argument significantly stronger.

Response: We repeated the exercise on a typical, relatively modern laptop (MacBook Pro 2019 with an Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz). This proved to be even more efficient than our workstation CPUs (which are quite old); it took approximately 20 minutes. We added the relevant numbers and details to the text.

Revised text (page 6, Sec. II):
The fine-tuning itself, for the small locally hosted models (bart and DeBERTaV3), takes around 30 minutes on an older workstation CPU (Intel(R) Xeon(R) CPU E5-2670), 20 minutes on a modern laptop CPU (Intel(R) Core(TM) i9-9880H), and can be reduced to just a few minutes if GPUs are used.

Figure 1 is unclear in what it is trying to show. Do you want to show that your method allows processing of a larger number of entries with less human labor? If so, I suggest a different way of communicating that; maybe a table that shows the number of entries processed per unit of human labor along with a quality description attached.

Response: The figure is meant to communicate that, depending on the required data quality and the amount of data present, different methods are preferred. For small datasets, the quickest and most accurate (highest quality) option is manual data curation; for very large datasets, manual curation is not possible and automated approaches have to be adopted, but they require a certain upfront time investment and, as a downside, provide data of lower quality. For mid-sized databases, neither of these two methods is optimal, as fully manual curation is too time consuming, and fully automated data extraction is of lower quality (and also requires an upfront time investment). The method we present bridges that gap and is best applied to mid-sized databases, where it allows high quality data to be obtained significantly faster than fully manual curation, and up to a certain dataset size in the mid-to-high range it is still faster than automated data extraction (which requires an initial time investment). We hope this explanation clarifies the utility of this figure. In previous reviews we received, reviewers explicitly mentioned Fig. 1 as a helpful figure, which we also believe it to be, and therefore we would like to keep it.


The paper does not show results for the fully automated “modest” quality case (referring to the Figure 1 schema). It would be good to add a discussion of those in the results or discussion section to motivate the use of human-in-the-loop steps as proposed in the method. If the zero-shot setting meets this criterion, the authors should clearly state this. It would also be good to have a “baseline” result in the figures that describes the minimum acceptable performance to be useful. In that case, one can see which models meet that bar.

Response: The 'modest' quality results correspond to fully automated methods, such as CDE, which the reviewer mentions in the next comment. We do not show these results, as these methods are now well understood and the expectations of their performance are known (and mentioned in the introduction). The approach we describe in this paper is meant to bridge the gap between small, very high quality databases and automated 'modest' quality results, and is not meant to replace or substitute either of those.
Throughout the text we suggest 90% recall as the desired minimum performance (see Fig. 3 (d) as an example). The nature of the human-in-the-loop approach allows the user to tune the performance of the method, as the performance is directly correlated with the amount of human time spent.
We revised the text to clarify this point.

Revised text (p. 7 Sec II):
It is entirely up to the user to decide the quality they require from their database, and the quality of the results will be proportional to the amount of time spent in this step. A recall of 90% seems to be a reasonable value at which to stop the process, as the precision drops sharply for higher values, which diminishes the returns on the human time involved.


How does your method compare to CDE for the case mentioned in the discussion on critical cooling rates for metallic glasses?

Response: This statistic has been included in the text already on p. 11, Sec IV:

To provide comparison to other existing methods, we used ChemDataExtractor2 (CDE2) [14], a state-of-the-art named entity recognition (NER) based data extraction tool. With CDE we obtain a recall of 37% and precision of 52%, which are comparable to those reported for thermoelectric properties (31% and 78%, respectively) obtained in Ref. [15].

However, it is worth noting that the comparison of the method to CDE is slightly unfair, as CDE is a fully automated approach and therefore is inherently less accurate than our approach, which effectively provides data of quality comparable to that of the reference, fully manually curated dataset. The high quality comes at the cost of the human time spent in the human-in-the-loop steps, and therefore, as mentioned before, limits the extracted database size. CDE does not have those limitations, which comes at a cost in data quality, but allows for generating very large databases.



Referee: 2

Comments to the Author
The study presents a method for extracting materials data from research papers using large language models (LLMs) with human supervision. While the methodology is not highly novel considering the existing literature, this approach is relatively efficient, requires minimal coding or prior knowledge, and is suitable for small-sized databases. The third step (human-assisted data structurization) significantly limits the scalability of the approach for mid- to large-sized datasets. It achieves high recall and nearly perfect precision. The method is adaptable to new and improved language models, allowing higher performance. I believe this paper can be a good addition to Digital Discovery; however, it requires major work. Please see my comments below:

“In addition, fully automated approaches often require extensive retraining and building of parsers specific for different properties, as well as a significant amount of coding. These methods thus often require a significant initial investment of human time.”
Not necessarily. This is only true for conventional methods, where things needed to be hard-coded for any desired output. Please check out: https://arxiv.org/abs/2312.07559 and https://arxiv.org/abs/2312.11690 as an example.

Response: Thank you for this valuable comment. We expanded our discussion of automated methods to include modern LLM agent-based methods and included the mentioned references. We also clarified that the statement above is only true for conventional methods.

Revised text (p. 2, Sec. I):
Recently, fully automated agent-based LLM approaches to analyze scientific text have been proposed as well, which are capable of answering science questions with information from research papers [35], and generating customizable datasets [36].
(...)
In addition, conventional, fully automated NLP approaches often require extensive retraining and building parsers (...)


GPT-3.5 Davinci? Do you mean GPT-3.5 as the LLM and text Davinci as the embedding model? It is important to distinguish between the LLM and the embedding model used for tokenization. Please explain.

Response: We appreciate this comment. The OpenAI nomenclature is slightly confusing. There should be no mention of GPT-3.5 davinci; it was an unfortunate leftover typo from a previous revision. It should have said GPT-3 davinci (which is the official GPT-3 model name) or GPT-3.5. No embedding models were used. We corrected the model names across the paper in accordance with the official OpenAI nomenclature.


“For example, if we know the data is numeric we can keep just sentences containing a number. In the case of bulk modulus (see Dataset 1 in Sec. VI)”
Considering the semantic search capabilities that come with models like GPT-4, using cosine similarity or maximum marginal relevance top-k retrieval, this seems to be a major under-utilization. If you include sentences that contain numbers, you’re likely to gather many irrelevant sentences containing entities like “Figure 4”, “Dataset 1” and such. And this just adds to the post-processing workload.

Response: We appreciate this comment. We agree that modern LLMs can perform such tasks with much higher fidelity than a simple search for numbers. However, one of the key features we emphasize in our work is that the approach is very simple, accessible, and requires a minimal amount of coding. Removing sentences that do not contain numbers is a trivial and almost instantaneous operation, which removes a significant amount of irrelevant data without the need for any additional knowledge or effort. The classification performed later easily removes the other obviously irrelevant sentences (such as those containing phrases like "Fig. 1"), and since they are so obviously irrelevant, the model almost never misclassifies them as relevant. They therefore pose no risk of lowering the precision of the data extraction and do not add any workload to the human involvement in the process, while keeping this step simple.
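As a rough illustration of how lightweight this pre-filtering step is, a minimal sketch of such a numeric filter is given below; the function name and example sentences are hypothetical and this is not the exact code from the paper's repository.

```python
import re

# Minimal sketch of the numeric pre-filter described above: keep only
# sentences that contain at least one digit. Illustrative only.
def keep_numeric_sentences(sentences):
    has_number = re.compile(r"\d")
    return [s for s in sentences if has_number.search(s)]

example = [
    "We determined the bulk modulus to be 123 GPa.",
    "The samples were synthesized by arc melting.",
]
print(keep_numeric_sentences(example))
# -> ['We determined the bulk modulus to be 123 GPa.']
```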


Another example of under-utilization: “However, with experience, this time quickly reduces as the user gets used to the process. In addition, more experienced users may employ simple computer codes, e.g. based on regular expressions, which would preselect possible candidates for values and units, reducing the time significantly”. You can simply describe the format of your desired output as natural language text to the LLM as an input prompt and avoid worrying about parsing or manually extracting that information from a sentence.

Response: This is certainly true, and we mention right in the next sentence that this is possible. However, this is only relatively simple when modern, computationally expensive LLMs are used (like GPT-4); smaller LMs are unable to perform that task without significant training. In addition, as mentioned in the text (page 7, Sec. III; Step 3), we evaluated the performance of such automated template filling, and its performance, although impressive, is not sufficient to produce very high quality data. Therefore we suggest that even when LLMs are used, human involvement in template filling is necessary for obtaining high-quality results. Since the approach we present here is meant to be flexible and model-agnostic, we decided to use the simplest and most accessible approach, and only hint at more automated possibilities.
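For completeness, a hedged sketch of what the prompt-based alternative suggested by the reviewer might look like is shown below; the wording, output format, and expected reply are purely illustrative and are not the prompts evaluated in the paper.

```python
# Illustrative only: describing the desired output format in natural language
# and letting a capable LLM fill it in, as the reviewer suggests.
prompt = (
    "Extract the material, property value, and unit from the sentence below. "
    "Answer in the form: material; value; unit. If no value is present, answer: none.\n\n"
    "Sentence: The bulk modulus of MgO was measured to be 160 GPa."
)
# Sending this prompt to a capable LLM would be expected to return: "MgO; 160; GPa"
```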


“Step 3 Data structurization”
This is called template filling in natural language processing. Please use NLP terminology for better coherence.

Response: Thank you for this remark. The paper is mostly aimed at materials researchers, who are often unfamiliar with NLP terminology, which is why the term "data structurization" was used. We added the term "template filling" as an alternative to reach a wider audience.


“The user will typically perform this step by first ranking the sentences by their probability of being relevant, which is the output from Step 1 (or Step 2 if performed).”
How do you get these probabilities? Are those the logits of the generated token for the outputs? Please elaborate.

Response: The probabilities in the case of the small LMs used through Hugging Face are classification scores, and log probabilities of the output token (logprobs) in the case of the GPT-3 models. We expanded the relevant paragraphs to clarify this aspect.

Revised text (p. 7. Sec. II):
The user will typically perform this step by first ranking the sentences by their probability of being relevant (classification scores in the case of small LMs, bart and DeBERTaV3, or log probability of the output token in the case of GPT-3), which is the output from Step 1 (or Step 2 if performed) (...)
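To make the origin of these scores concrete, a minimal sketch of obtaining classification scores from a small locally hosted model is shown below, assuming the Hugging Face transformers library and the facebook/bart-large-mnli checkpoint; the candidate labels are illustrative and not necessarily those used in the paper.

```python
from transformers import pipeline

# Zero-shot classification with a small locally hosted model; the returned
# "scores" are the relevance probabilities that can be used to rank sentences.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentence = "We determined the bulk modulus to be 123 GPa."
result = classifier(sentence, candidate_labels=["bulk modulus value", "irrelevant"])

print(dict(zip(result["labels"], result["scores"])))
```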


“In general, human assisted data structurization, even when only the sentences containing the relevant data are given, may be a tedious and time-consuming task. However, at this point it is the only method that can guarantee an almost perfect precision.”
This is a strong statement and it is not true. There are emerging methods that do autonomous template filling with low or zero human assistance and still have reasonable performance. See these two papers for more details: https://arxiv.org/abs/2312.07559 and https://arxiv.org/abs/2312.11690.

Response: As the reviewer mentions, these methods offer "reasonable performance", which places them in the category we reserve for "large datasets", where no method exists for "almost perfect precision". While these methods are very effective and useful, in particular in the regime of large datasets, we do not see clear evidence that they are presently capable of providing an "almost perfect precision" for the data problems tackled here. We demonstrate that this "almost perfect precision" is achievable, albeit within a human-in-the-loop approach such as the one presented here, for the small and modest sized databases that are the goal of this paper.
We added the above-mentioned references to the text as another example of emerging fully automated methods.

Revised text (p. 2, Sec. I):
Recently, fully automated agent-based LLM approaches to analyze scientific text have been proposed as well, which are capable of answering science questions with information from research papers [35], and generating customizable datasets [36].


Just to confirm, does the workflow involve breaking the research paper into sentences and running the model on a single sentence at a time, or do you run your model on the entire research article and find relevant/irrelevant sentences? If it is the first case, I don’t think we can really call it named-entity recognition, especially in a single sentence like: “We determined the bulk modulus to be 123 GPa.”

Response: Yes, the workflow involves breaking the paper into sentences. We agree with the reviewer and do not call it named-entity recognition anywhere in the paper.
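For clarity, a minimal sketch of this sentence-level workflow is shown below, assuming NLTK for sentence splitting; the actual splitting code used in the paper may differ.

```python
import nltk
from nltk.tokenize import sent_tokenize

# Download the sentence tokenizer models if not already present.
nltk.download("punkt", quiet=True)

article_text = (
    "We studied MgO single crystals under pressure. "
    "We determined the bulk modulus to be 123 GPa."
)

# Each sentence is then passed to the language model independently for
# relevant/irrelevant classification.
for sentence in sent_tokenize(article_text):
    print(sentence)
```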


Comparing the ROCs in Figures 3b and 4b, it looks like the fine-tuned Bart with few examples (Fig. 4b, AUC is about 0.3 with 20 positive examples) is performing worse than in the zero-shot setting (Fig. 3b, AUC is 0.6). Why is that happening? Please explain.

Response: With few examples, the model is trained on a very specific and not very diverse set of information; therefore it positively classifies only sentences matching that very specific set of data. With too small a training dataset, the model weights are updated with information inadequate to constrain them, and the model does not perform well. A generic zero-shot classification model, given a broad class such as "bulk modulus", is much more likely to positively classify a sentence related to that class, although with lower specificity, which results in high recall but relatively low precision. Training the model on many representative examples helps increase the recall and increases the specificity, narrowing the positively classified sentences down to only those that contain full data points.

Revised text (p. 10, Sec. III):
One may notice that fine-tuned models trained with very small training sets perform worse than zero-shot (Fig. 3). When the model is fine-tuned on a very specific and not very diverse set of information, the model's weights are updated with information inadequate to constrain it, resulting in less accurate performance.




Referee: 3

Comments to the Author
General comment:  

The authors have satisfactorily addressed all the comments as per the review comments and the response submitted. Accordingly, I now have only the following queries.

In the VI. Datasets section of the paper, the authors write: "The bulk modulus is a benchmark dataset of sentences, a test dataset consisting of 100 papers randomly chosen from the over 10000 paper results of a search query “bulk modulus”+”crystal” returned from the ScienceDirect API. In the written text of these 100 papers, there are 18408 sentences in total, out of which 237 sentences mention the value of bulk modulus."
The ScienceDirect API returns only 6000 results for a given query, hence these numbers don't look consistent. Can the authors elaborate more on this aspect? https://dev.elsevier.com/api_key_settings.html

Response: Thank you for spotting this oversight. This sentence was confusing and unclear. What we meant to say was that there are over 10000 paper results (of which the API can return the count), and that 100 were then randomly selected via the API, which, as the reviewer noticed, is capable of returning the full text of only the first 6000 of them. We revised this sentence to be clear and explicit.

Revised text (p. 12 Sec. VI):
The bulk modulus is a benchmark dataset of sentences. From over 10000 paper results of a search query “bulk modulus”+”crystalline”, a subset of 100 papers from the first 6000 full-text papers available through ScienceDirect API was randomly selected. In the written text of these 100 papers, there are 18408 sentences in total, out of which 237 sentences mention the value of bulk modulus.


Since the authors have access to GPT-4, why did they not structure the data in JSON format using the built-in functionality of the OpenAI API for GPT-4?

Response: The aim of this paper is to show a simple, accessible, flexible, and model-agnostic method of gathering data. As the reviewer mentioned, the built-in JSON functionality is exclusive to GPT-4. We do, however, mention that, depending on the model used, automated structurization is possible, although we did not explicitly mention the JSON functionality. We added that as another comment to the text.

Revised text (p. 7 Sec. II):
(...) Some models, like GPT-4, offer structured format output, such as JSON, which may be used to assist the final data extraction step.
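As an illustration of the possibility mentioned in the revised text, a hedged sketch using the OpenAI Python client (v1.x) and its JSON output mode is given below; the model name, prompt, and field names are assumptions for illustration rather than the paper's configuration.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY to be set in the environment

# Ask the model to return the extracted data point as a JSON object. The
# field names and wording are illustrative, not the paper's actual prompt.
response = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": ("Extract the material, property value, and unit from the "
                     "user's sentence as JSON with keys 'material', 'value', 'unit'.")},
        {"role": "user",
         "content": "The bulk modulus of MgO was measured to be 160 GPa."},
    ],
)
print(response.choices[0].message.content)  # e.g. {"material": "MgO", "value": 160, "unit": "GPa"}
```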


What are the similarities and differences between this work and the authors’ own previous work: https://www.nature.com/articles/s41467-024-45914-8 ? This question is very important to address as it is not very evident what the differences are.

Response: Thank you for this comment. Both of these projects were being developed simultaneously, but with different goals in mind. In fact, they are intended to be complementary and address different areas of applicability for the same problem, much as we describe in the introduction and show in Fig. 1. This work is focused on accessibility, flexibility, and simplicity. It is meant to provide near-perfect, close to human performance for modest sized databases, at the cost of additional human time in some of the steps, and is meant to be used with any model, including, most importantly, small, free, and widely accessible language models that can be run on personal computers. The other work, cited by the reviewer, is a demonstration of a fully automated approach, which is only capable of providing good results using the most advanced LLMs such as GPT-4 (and is not capable of providing "near perfect" results or of working with smaller models), through complex multi-step queries, which, at least for now, are very expensive and often inaccessible to many researchers. It also falls into the "fully-automated/large datasets/lower quality" category described in the introduction. These papers are very much complementary to each other, as they target different sizes of datasets, at different levels of resultant data quality, and utilize entirely different approaches. The only similarities between this and the other paper are that, in an example where GPT-4 was used, a similarly structured prompt was used for the classification task, and that we used similar properties for some of the benchmarks, since we had prior experience with those and could provide a more accurate assessment. We added the reference to the other paper in the introduction along with a short description that places the other paper in a separate category and should make the distinction much easier.

Revised text (p. 2 Sec. I):
Other fully automated LLM-based methods, including those that leverage complex prompt engineering workflows within LLMs, have been proposed to curate large materials datasets of a higher quality than conventional automated NLP methods when used with state-of-the-art LLMs [37].


Finally, the authors have also uploaded their API key in the code (probably by mistake). They can remove it to prevent its misuse.

Response: We really appreciate this comment. We blocked the leaked API key and updated the code with a placeholder for the user's key.




Round 2

Revised manuscript submitted on 25 Mar 2024
 

26-Apr-2024

Dear Dr Polak:

Manuscript ID: DD-ART-01-2024-000016.R1
TITLE: Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link :

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 2

I have read the authors' responses to my comments. While most of them were covered and provided satisfactory updates/responses, a few of my comments were not addressed. I am happy to recommend the work for publication after a few minor revisions as below:

1. Please share the source code! The files bart-deberta_zeroshot.py and gpt3_oneshot_pn.py do not contain the code for many parts of the analysis (ChemDataExtractor2, DeBERTaV3, or how the performance of the models was evaluated and compared).

2. Please reorganize the text for better coherence of the document. As an example, this paragraph: “Fig. 3 shows the zero-shot result statistics for the different models, including GPT-3.5 (whose technical names are text-davinci-002 and text-davinci-003) and other GPT models including 3.5 (turbo) and 4, which underlie ChatGPT. The Chat models do not output probabilities so full precision recall curves cannot be plotted, only a single point, which for all Chat models has 100% Recall” discusses results and is misplaced in the methods. Another case is the description of Figure 4, which belongs in the results but appears in the Methods section. Also, the paper has two methods sections (Methods and DESCRIPTION OF THE METHOD). Please revise the formatting.

Reviewer 1

The authors addressed most of the feedback from prior reviews. Given the updated paragraphs, I suggest adding relevant references in Section 1: [1] provides a relevant example of instruction fine-tuning; [2] is a relatively recent addition that motivates the importance of data extraction for materials science.

I would encourage the authors to try to condense the most relevant information, especially in Section 3 and Section 4, to highlight the most important takeaways of those sections. One approach would be to use paragraph headers similar to Section 2 where the main descriptions of each step are summarized.

[1] Yu Song, Santiago Miret, Huan Zhang, and Bang Liu. 2023. HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5724–5739, Singapore. Association for Computational Linguistics.

[2] Miret, Santiago, and N. M. Krishnan. "Are LLMs Ready for Real-World Materials Discovery?." arXiv preprint arXiv:2402.05200 (2024).


 

Dear Reviewers,

We appreciate your insightful feedback on our manuscript. In response to your suggestions, we have made several revisions that have enhanced the quality of our work. We have thoroughly addressed each concern highlighted in the review process.

Thank you for the chance to improve our manuscript with your guidance. We trust that the updated version aligns with your expectations.

Attached is a detailed response to each of your comments.

Best regards,

Maciej P. Polak and Dane Morgan



Referee: 1

The authors addressed most of the feedback from prior reviews. Given the updated paragraphs, I suggest adding relevant references in Section 1: [1] provides a relevant example of instruction fine-tuning; [2] is a relatively recent addition that motivates the importance of data extraction for materials science.
[1] Yu Song, Santiago Miret, Huan Zhang, and Bang Liu. 2023. HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5724–5739, Singapore. Association for Computational Linguistics.
[2] Miret, Santiago, and N. M. Krishnan. "Are LLMs Ready for Real-World Materials Discovery?." arXiv preprint arXiv:2402.05200 (2024).

Response: Thank you for providing these additional references. We included them together with relevant descriptions in Section 1.

Revised text (p 1, Sec I):
While currently LLMs often fall short in practical applications, struggling with comprehending and reasoning over complex, interconnected knowledge domains, they offer significant potential for innovation in materials science and are likely to play a crucial role in materials data extraction [1].
(...)
In addition, the emergence of highly specialized LLMs underscores the rapid advancement in the field. In Ref. [36] an instruction-based process specifically designed for materials science enhanced the accuracy and relevance of data extraction. Such specialized fine-tuning shows significant advantages in dealing with niche materials science tasks.


I would encourage the authors to try to condense the most relevant information, especially in Section 3 and Section 4, to highlight the most important takeaways of those sections. One approach would be to use paragraph headers similar to Section 2 where the main descriptions of each step are summarized.

Response: We added short highlights of the most important takeaways to Sections 3 and 4, in a similar fashion to those in Section 2.

Revised text (p. 7, Sec III; p. 10, Sec IV):
Section III: Results
This section provides a comprehensive analysis of various language models' performance in classifying relevant sentences. The analysis highlights the superior performance of the GPT family of models in a zero-shot approach and demonstrates the effectiveness of fine-tuning, while also discussing the results in the context of the accessibility of different models. This section also addresses the challenges posed by highly imbalanced datasets and discusses strategies for reducing human effort in data processing.

Section IV: Discussion
The section discusses the practical application of the presented approach to curate an extensive and accurate database of critical cooling rates for metallic glasses by analyzing a large volume of scientific literature. A comparison to an existing, manually curated database and to other automated methods is provided. The utility for complex data-oriented tasks like machine learning and the method's potential to handle unrestricted searches effectively are then discussed.



Referee: 2

Comments to the Author
I have read the authors' responses to my comments. While most of them were covered and provided satisfactory updates/responses, a few of my comments were not addressed. I am happy to recommend the work for publication after a few minor revisions as below:


Please share the source code! The files bart-deberta_zeroshot.py and gpt3_oneshot_pn.py do not contain the code for many parts of the analysis (ChemDataExtractor2, DeBERTaV3, or how the performance of the models was evaluated and compared).

Response: Initially, the analysis was done partially by hand, without the use of robust and shareable code or scripts. We developed more robust code that allows users to reproduce the analysis performed within this work and uploaded it to Figshare, together with the raw data files and scripts to reproduce the figures.


Please reorganize the text for better coherence of the document. As an example, this paragraph: “Fig. 3 shows the zero-shot result statistics for the different models, including GPT-3.5 (whose technical names are text-davinci-002 and text-davinci-003) and other GPT models including 3.5 (turbo) and 4, which underlie ChatGPT. The Chat models do not output probabilities so full precision recall curves cannot be plotted, only a single point, which for all Chat models has 100% Recall” discusses results and is misplaced in the methods. Another case is the description of Figure 4, which belongs in the results but appears in the Methods section. Also, the paper has two methods sections (Methods and DESCRIPTION OF THE METHOD). Please revise the formatting.

Response: Thank you for this comment. The section titling convention was confusing. We revised the title of Section II to "Description of the approach"; this section is not meant to be the "Methods" section - it is a high-level description of how our approach is designed and meant to be used. It is necessary to discuss some figures both in this section and then in more depth in the following "Results" section. The "Methods" section (Section V) is devoted to technical details and is placed at the end of the manuscript. We additionally added key takeaways to the beginning of Sections 3 and 4 to facilitate easier navigation and interpretation of the paper (see response to point 2, Referee 1).





Round 3

Revised manuscript submitted on 10 May 2024
 

21-May-2024

Dear Dr Polak:

Manuscript ID: DD-ART-01-2024-000016.R2
TITLE: Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our LinkedIn account [https://rsc.li/Digital_showcase] please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


******
******

Please contact the journal at digitaldiscovery@rsc.org

************************************

DISCLAIMER:

This communication is from The Royal Society of Chemistry, a company incorporated in England by Royal Charter (registered number RC000524) and a charity registered in England and Wales (charity number 207890). Registered office: Burlington House, Piccadilly, London W1J 0BA. Telephone: +44 (0) 20 7437 8656.

The content of this communication (including any attachments) is confidential, and may be privileged or contain copyright material. It may not be relied upon or disclosed to any person other than the intended recipient(s) without the consent of The Royal Society of Chemistry. If you are not the intended recipient(s), please (1) notify us immediately by replying to this email, (2) delete all copies from your system, and (3) note that disclosure, distribution, copying or use of this communication is strictly prohibited.

Any advice given by The Royal Society of Chemistry has been carefully formulated but is based on the information available to it. The Royal Society of Chemistry cannot be held responsible for accuracy or completeness of this communication or any attachment. Any views or opinions presented in this email are solely those of the author and do not represent those of The Royal Society of Chemistry. The views expressed in this communication are personal to the sender and unless specifically stated, this e-mail does not constitute any part of an offer or contract. The Royal Society of Chemistry shall not be liable for any resulting damage or loss as a result of the use of this email and/or attachments, or for the consequences of any actions taken on the basis of the information provided. The Royal Society of Chemistry does not warrant that its emails or attachments are Virus-free; The Royal Society of Chemistry has taken reasonable precautions to ensure that no viruses are contained in this email, but does not accept any responsibility once this email has been transmitted. Please rely on your own screening of electronic communication.

More information on The Royal Society of Chemistry can be found on our website: www.rsc.org




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.