From the journal Digital Discovery
Peer review history

Extracting structured seed-mediated gold nanorod growth procedures from scientific text with LLMs

Round 1

Manuscript submitted on 23 Feb 2023
 

17-May-2023

Dear Dr Walker:

Manuscript ID: DD-ART-02-2023-000019
TITLE: Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Literature with GPT-3

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link :

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

This manuscript describes an application of fine-tuned GPT-3 (Davinci) on extracting gold nanorod growth procedures from literature as structured data. This is a timely study as transformers are transforming how scientific communities communicate. I hope the following comments can help improve the quality of this submission.

Model related:

1. Since the paragraphs are sent to GPT for completion individually, can GPT understand that a token is referencing externally (e.g., something not in the current paragraph but clearly defined in the preceding paragraphs)?
2. In the example in Fig. 4, it is interesting that the model is "guessing" the borohydride to be NaBH4. Is there any attempt to limit the model such that the information of completion is a proper subset of that of the prompt?
3. Have the authors noticed inconsistent completions? The same prompt should give (at least semantically or chemically) identical completions.


Data related:

4. The structured JSON template is provided in Fig. 2. The authors may want to provide a schema representation (e.g., JSON-schema) to make this structure reusable/machine-readable. This also helps validate completions.
5. Please include the training script and at least one inference example for the prompt sent directly to OPENAI. Please also indicate the OPENAI API version.


Typeset related:

6. Fig. 1 caption and main text mention a purple component that appears missing.
7. Fig. 4 caption should be top and bottom instead of left and right.
8. There is an unresolved figure reference in the last paragraph of the SI.

Reviewer 2

See attached PDF

Reviewer 3

The paper entitled "Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Literature with GPT-3" reports on the use of large language models (GPT-3) for extracting and structuring (into a JSON format) synthesis recipes for seed-mediated gold nanorods. The use of GPT-3 seems timely, as large language models are very popular and have been shown to be very effective in a multitude of applications. The paper is well written and mostly presents the results in a clear way. However, there are a few issues that would benefit from clarification.

It is unclear whether the information fed to the model is a pre-selected paragraph likely containing a recipe or any paragraph from a pre-selected paper, or any paragraph from any paper. In Sec. 3 the authors write "All input texts for each stage are drawn from the original dataset of synthesis text filtered down to paragraphs likely to describe seed-mediated gold nanorod growth", which suggests the first option for the assessment of the method, while in 4.3 they say "The fine-tuned GPT-3 model was applied to the full dataset of 1,137 papers", which implies all paragraphs from pre-selected papers were put through the model. The same approach should be used for assessment and final database or the results of the assessment would not correspond to the final data. If paragraphs without synthesis are put into the model, there is a risk that an inexistent/incorrect recipe will be extracted, as the fine-tuned model is incentivized to always extract data as it was trained on paragraphs with recipes. How often does that happen?

Related to this, the procedure to find relevant paragraphs seems to be very elaborate and involve not only many steps but many other natural language procedures and language models. Is there a way to involve GPT-3 in this step to simplify it? If not then perhaps rephrasing the title to include something like "scientific text" rather than "literature" would be more accurate.

The value of estimating "placement errors" and "transcription errors" is not clear to me. I'm not exactly sure how "if the same field contains information (as opposed to being empty) in both templates, that is considered a true positive prediction regardless of whether the information explicitly matches", could be considered a positive prediction. It seems to me to be the very definition of the worst possible outcome, and even more, one that is known to be an issue with LLMs, i.e. factual incorrectness and making up information. I understand that "transcription error" would later consider this as an error, but what was the point in assessing that the placement was a "positive"?
If we are considering the accuracy of extraction field-wise, it seems to me that entry is either correct (same in both, empty or not, a true positive), exists in prediction but not ground truth (value present in prediction, but empty or non-existing field in ground truth, a false positive), and not existing or empty in prediction, while present and non-empty in ground truth (false negative).

The authors report an accuracy of 86% in the abstract. I think this is misleading, as it suggests that 86% of recipes can be expected to be correct, while in reality, the 86% refers only to individual pieces of information, and only 40% of full recipes are correct. And that 40% figure is also under special constraints of what is necessary in a recipe. This is not meant to take away from the results of the authors' work, as 40% of such a complex set of information is still impressive, but I think a more relevant number should be reported as the main result, if the title of the paper were to remain as "extracting growth procedures". The 86% accuracy in individual fields would be an important metric supporting that information. It seems like the metric of the biggest value that would really describe the method would be: out of all complete and full recipes in source texts, how many are extracted entirely with all relevant fields accurately extracted? And these should be accompanied or maybe combined with "out of not complete recipes in the source how many are extracted entirely with all fields correct".

Which also leaves a question related to one of my previous comments - how do we distinguish the incorrectly extracted recipes that pass the "sanity check" for duplicates etc. from those that are fully correct? I fully expect errors in extraction of recipes, but I would also anticipate recipes for other synthesis or something else, even non-existent recipes to be extracted from paragraphs that may contain them. These may be a bigger issue. To my knowledge GPT-3 outputs probabilities of completion tokens, perhaps analysis of the values of probabilities would enable identification of inaccurately extracted entries? It may be preferred that fewer recipes are extracted, if the certainty that these recipes are accurate was high, in particular if they were to be used to reproduce the synthesis procedure. If each procedure has to be reverified by hand, that takes away from the value of the GPT-3 extracted data.

The authors say that the initial zero-shot question-answering framework does not scale well to a large number of papers due to a large number of requests, and later they say the zero-shot performance is poor. Why is a large number of requests an issue? I'm assuming this is done with a computer code so the process may be parallelized or otherwise sped up. If it's poor, that means it has been assessed, and I believe a full evaluation of a zero-shot approach is very valuable and should be reported, as it allows to accurately judge how much improvement the time consuming and difficult fine-tuning procedure actually provides, and if the zero-shot performance is poor, that would only emphasize the success of the presented approach. If price is the factor, and the actual performance would be good, other alternative LLMs could be applied. Which leads me to an additional comment; GPT-3 is no longer considered state-of-the-art. Models like, for example, LLaMA, are reported to not only outperform GPT-3 but are also free, which could solve the cost issue. Would it be possible to have at least one other model assessed for comparison?

Journal guidelines for data availability suggest sharing everything that can be shared, including computer codes. I understand that the text paragraphs cannot be shared due to copyright, but the only things shared seem to be the datasets. I would encourage sharing the computer codes as well if possible.

Reviewer 4

The article by Walker et al. describes a study utilizing GPT-3 to extract structured multi-step seed-mediated gold nanorod growth procedures and outcomes from unstructured text found in scientific articles. A dataset of 1,137 papers mentioning seed-mediated recipes and rod-like morphologies was collected and filtered from a database of gold nanostructure synthesis protocols. Trained on a randomly selected portion of this dataset, the fine-tuned GPT-3 model produced predictions in the form of synthesis templates with an F1-score accuracy of 86%. The model was then used to analyze the full dataset of 1,137 papers, identifying 268 papers that contain complete procedures and outcomes covering all three components of the protocol (seed-growth-characterization).
The advances of this work are 1) the creation of the dataset specifically for seed-mediated gold nanorod growth, while only generic database(s) of gold nanoparticle synthesis protocols exist, and 2) the development of GPT-3 model to generate synthesis templates using JSON structure from unstructured text, which is an important and successful advance in text-mining. The researchers have done their work with detailed and careful validation and analysis. Thus, I think the article will be a great benefit to the scientists.
Strengths:
• The first strength is that this work addresses gaps in modern knowledge pertaining to seed-mediated gold nanorod synthesis and the current limitations of the general database for gold nanomaterial synthesis. Strong motivation and justification are provided for the need for high-throughput text mining of synthesis procedures into structured templates.
• The second strength is the detailed and well-thought-out methodology for developing the GPT-3 model in this work. For instance, the researchers identify two distinct error types in their evaluation, placement errors in the template and transcription errors, which are equally important. This methodology can be applied to many other experimental protocols.
• The third strength is that, with the aid of their fine-tuned GPT-3 model, the researchers performed a detailed analysis of the dataset they present, identifying which papers contain only some of the three main components of the seed-mediated gold nanorod synthesis protocol and which contain all three with complete outcomes. Instead of presenting a dataset with only 268 papers, the researchers present the entire dataset, recognizing that some scientists may be interested in only one component of the protocol (e.g. characterization of purchased gold nanorods) and leaving it to them to filter accordingly.
Weaknesses
• In model training, it was mentioned that nearly 60% of the training and testing sets (141 out of 240 and 23 out of 40 respectively) contained "at least one paragraph with information that could be placed into a synthesis template." Then, in Figure 6, when the breakdown of the 1137 paper dataset was discussed, it did not mention whether those papers present some useful information beyond incidental appearance of keywords related to seed-mediated gold nanorod growth. I find it important to include such analysis while presenting the dataset.
• The discussion and caption do not match the image shown in Figure 1 and therefore appear a bit confusing. The caption states four parts, while the image itself contains only three. The missing part is said to be in 'purple' which does not exist in Figure 1.


 

Response to Reviewer 1

1. Since the paragraphs are sent to GPT for completion individually, can GPT understand that a token is referencing externally (e.g., something not in the current paragraph but clearly defined in the preceding paragraphs)?
Response: GPT-3 cannot infer outside of the immediately provided context, so cross-paragraph context is not supported, largely due to the token limit of 2048. This was not observed to be an issue during annotation. When multiple paragraphs in a paper were identified to contain synthesis information, it was observed that the context for each synthesis component was contained in a single paragraph. However, this was noted to be an issue in some of the predictions, which is why there is a discussion of resolving information conflicts (the same information being extracted from different paragraphs in the same paper). Even so, the vast majority of the extracted entities (~89%) are either uniquely identified in a single paragraph or exactly repeated across paragraphs. This is discussed in the ``Full Filtered Dataset'' subsection of the ``Results'' section (4.3, page 8).

2. In the example in Fig. 4, it is interesting that the model is ``guessing'' the borohydride to be NaBH4. Is there any attempt to limit the model such that the information of completion is a proper subset of that of the prompt?
Response: We thank the reviewer for this interesting question. The fine-tuning was performed such that the model was trained on (input, output) pairs consisting of (synthesis paragraph, annotated synthesis template) text. By including examples where ``borohydride'' in the input was resolved to ``NaBH4'' in the output, the model will generally be more likely to make this mapping, correct or otherwise. There is an inherent tradeoff of correctness vs. completeness in how the recipe annotations are done - i.e., do we want the model to annotate like a human, or like a robot? Including ``robot'' annotations with only proper subsets of tokens leads to many recipes with incomplete information; the model is much less useful. Therefore, our annotations are more human-like. They reflect how a domain expert in AuNPs would naturally annotate data, including conversions of very common reagents to their expected formulae, implicit normalizations, and common rephrasings - even if that information is not a true proper subset of the prompt. The downside to this strategy is that the model may occasionally hallucinate wrong answers (e.g., expand a borohydride to an erroneous formula). Additionally, there is an issue with explicitly including every representation of a precursor that may occur: the template would rapidly expand in token length beyond the model context window to accommodate these variations. There were no explicit instructions in the prompting to prevent hallucination, since prompting is generally not needed for fine-tuning (per OpenAI's own instructions), unlike what is seen with the later Instruct-GPT models as well as ChatGPT and GPT-4 (none of which are available for fine-tuning). Instruction-based prompting is important for zero- or few-shot learning across a variety of tasks, but not for fine-tuning on a single specific task. Through fine-tuning, the model learns how to fill the synthesis template based on existing examples. By providing more training examples, errors of these types should occur less often (e.g. if there were enough examples of ``borohydride'' attributes being placed under ``BH4'' in the template). In earlier tests using less training data, the model tended to make much more egregious errors, placing volumes and concentrations under the wrong precursors.

3. Have the authors noticed inconsistent completions? The same prompt should give (at least semantically or chemically) identical completions.
Response: We thank the reviewer for the question. Model inference was carried out with zero temperature, which should produce deterministic completions. However, there is inherent non-determinism in GPU calculations around floating point operations, which may result in small differences in the log probabilities. In cases where there are comparably small differences between the top two most likely token completions, the completions may differ, but this was not explicitly observed. This has been clarified in the manuscript in Section 3.1 (Methods: Overall Procedure) page 3/4: ``Default settings through the OpenAI API (v0.13.0) are used for all fine-tunes of the GPT-3 Davinci model, and a temperature of zero is used for all model predictions with a double line break as the stop sequence. By using a temperature of zero, the results should be deterministic assuming that floating point errors in the GPU calculations are smaller than the differences between the log probabilities of the next token prediction candidates.''
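
For concreteness, a minimal sketch of such a prediction request through the legacy OpenAI Python client (the 0.x series cited above) is shown below; the fine-tune identifier, API key, and input paragraph are placeholders, and parameter names may differ slightly between client versions.

    import openai  # legacy 0.x client, matching the API version cited above

    openai.api_key = "sk-..."                   # placeholder key
    paragraph_text = "A seed solution was ..."  # placeholder synthesis paragraph

    # Deterministic (temperature 0) completion with a double line break stop
    # sequence, mirroring the settings quoted from Section 3.1.
    response = openai.Completion.create(
        model="davinci:ft-placeholder",  # hypothetical fine-tune identifier
        prompt=paragraph_text + "\n\n",
        temperature=0,
        max_tokens=1024,
        stop="\n\n",
    )
    completion_text = response["choices"][0]["text"]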

4. The structured JSON template is provided in Fig. 2. The authors may want to provide a schema representation (e.g., JSON-schema) to make this structure reusable/machine-readable. This also helps validate completions.
Response: We thank the reviewer for the suggestion. An empty JSON template is now included with the other JSON files in the supporting data.
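
As a complement to the empty template, completions could also be checked against a machine-readable schema along the following lines; this is only an illustrative sketch using the third-party jsonschema package, and the field names shown are placeholders rather than the actual template of Fig. 2.

    import json
    from jsonschema import validate, ValidationError  # assumes the jsonschema package

    # Illustrative fragment only; the real template in Fig. 2 has many more fields.
    quantity = {"type": "object",
                "properties": {"value": {"type": ["number", "null"]},
                               "units": {"type": ["string", "null"]}}}
    schema = {"type": "object",
              "properties": {"seed_solution": {"type": "object"},
                             "growth_solution": {"type": "object"},
                             "gold_nanorod_dimensions": {
                                 "type": "object",
                                 "properties": {"length": quantity,
                                                "width": quantity}}}}

    completion = json.loads('{"seed_solution": {}, "growth_solution": {}}')  # parsed model output
    try:
        validate(instance=completion, schema=schema)
    except ValidationError as err:
        print("Completion does not conform to the template schema:", err.message)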

5. Please include the training script and at least one inference example for the prompt sent directly to OPENAI. Please also indicate the OPENAI API version.
Response: We thank the reviewer for this question. There isn't really a training script since the command-line interface of the OpenAI API is used to fine-tune the model. Here is the command: openai api fine_tunes.create -t <.jsonl file containing prompt/completion pairs> -m davinci. This has been included in the paper alongside the API version (v0.13.0) in section 3.4 (Methods: Fine-tuning Procedure and Dataset Construction) page 5.
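
To illustrate the expected training-data format, a sketch of writing one prompt/completion pair to such a .jsonl file is shown below; the paragraph text, template fields, and filename are invented for illustration.

    import json

    # One illustrative (prompt, completion) pair; not an actual annotation.
    example = {
        "prompt": "A seed solution was prepared by mixing 0.25 mL of 10 mM HAuCl4 "
                  "with 9.75 mL of 0.1 M CTAB ...\n\n",
        "completion": json.dumps({
            "seed_solution": {
                "HAuCl4": {"volume": "0.25 mL", "concentration": "10 mM"},
                "CTAB": {"volume": "9.75 mL", "concentration": "0.1 M"},
            }
        }) + "\n\n",  # trailing double line break matches the stop sequence
    }

    with open("train.jsonl", "w") as handle:  # hypothetical filename
        handle.write(json.dumps(example) + "\n")

    # Fine-tuning is then launched via the command given above:
    #   openai api fine_tunes.create -t train.jsonl -m davinci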

6. Fig. 1 caption and main text mention a purple component that appears missing.
Response: The reviewer is correct. The reference to the purple component has been removed since that was relevant to a prior version of the figure where the synthesis texts were labeled in purple instead of being included in each stage.

7. Fig. 4 caption should be top and bottom instead of left and right.
Response: The reviewer is correct. This has been corrected (the original figure was left-right oriented with a single-column article format).

8. An unresolved figure reference in the last paragraph of the SI.
Response: The reviewer is correct. The figure reference was relevant to a prior version of the document and has been removed.

Response to Reviewer 2

1. While I realize that none of the existing approaches can do the full set of tasks the GPT-3 based approach does, I am not convinced how meaningful the comparisons to the F1 scores from the paper by Cruse et al. are: From my understanding, the article at hand uses a different test set (and looks at some different extraction tasks), while the writing makes it appear as if it is an apples-to-apples comparison. For a fair comparison: Either use the same test set or clearly state that a quantitative comparison is not really applicable.
Response: The reviewer is correct. There is no real applicable direct comparison available since while both approaches extract similar information, the structure of the extracted information is fundamentally different, such as how this work includes relations between the extracted entities. This has been clarified in the manuscript in section 4.3 (Results: Model Performance) page 7: ``This is still an improvement over similar results, as the gold nanoparticle synthesis protocol and outcome database developed by Cruse et al. extracts morphology measurements, sizes, and units with F1-scores of 70%, 69%, and 91% via NER with MatBERT. However, these entities are not linked together, so while doing so would inevitably introduce additional sources of error and performance would be additionally constrained by the lowest performing extractions, a direct quantitative comparison is not applicable.''

2. Do you need GPT-3 davinci? Given that you fine-tune, is the largest model necessary or can you maybe already use ada or some other smaller model?
Response: We thank the reviewer for the question. There were some initial tests performed with Curie, but it was decided very early in the process to use Davinci since the smaller models were failing on simple tasks. The full training set was not used with any model other than Davinci.

3. How many entries are entirely correct? I wondered why the authors do not report the fraction of extracted paragraphs without any error. For scientific applications, this seems like a relevant metric.
Response: We thank the reviewer for the suggestion, as this is useful information for downstream scientific applications. Histograms of the paragraph-wise and paper-wise accuracies (aggregated over entities) are now included in the manuscript. It was found that the average adjusted F1-score aggregated by paragraph/paper was 64%/76%. These are lower since performance tended to be lower for paragraphs with less extractable information. It was additionally found that the information in 33% of the paragraphs and 15% of the papers was extracted perfectly. Furthermore, information was extracted from 48% of the paragraphs and 62% of the papers with >90% adjusted F1-score. Here is the relevant addition to the manuscript in section 4.2 (Model Performance) pages 7-8:

[Figure 6: Histograms showing the adjusted F1-score performances for the (a) paragraphs and (b) papers.]

``The adjusted F1-scores aggregated over extracted entities for the paragraph-wise and paper-wise predictions are shown in Figure 6. Instances in which there were no entities present in either the ground truths or the predictions are omitted from the results, giving a total of 66 paragraphs and 26 papers. For the paragraphs, the average adjusted F1-score was approximately 64% with 22 (33%) perfect predictions and 32 (48%) predictions with >90% adjusted F1-score. For the papers, the average adjusted F1-score was approximately 76% with 4 (15%) perfect predictions and 16 (62%) predictions with >90% adjusted F1-score.''

The paper-wise performance has been included in the abstract as well: ``GPT-3 prompt completions are fine-tuned to predict synthesis templates in the form of JSON documents from unstructured text input with an overall accuracy of 86% aggregated by entities and 76% aggregated by papers.''

4. ``Figure ??'' on page 2 of the SI
Response: We thank the reviewer for pointing this out. The erroneous reference has been removed.

5. In several instances, I wondered why the authors only provide qualitative instead of quantitative evaluation. For example: ``Unfortunately, the standard pre-trained GPT-3 Davinci model is not capable of providing consistent completed templates of high quality in one request.'' and ``The approach of using zero-shot GPT-3 question answering requests to fill the templates tended to produce poor results, but it offered an acceptable starting point for collecting structured recipes.'' – How accurate is the zero-shot inference?
Response: We thank the reviewer for this question. The zero-shot predictions were only used to provide an initial starting point for annotations. The accuracy was not calculated since, during annotation, they very clearly extracted data with low accuracy. Moreover, the zero-shot predictions are not practical to obtain since they require a long series of question/answer-style calls - one for each field in the JSON schema. This is in contrast to the fine-tuned model that produces fully-filled JSON documents in a single request. Due to these limitations, only commonly present information was queried for zero-shot prediction (e.g. HAuCl4 volumes and concentrations in the seed solution) rather than all of the entities in the full template, so a direct comparison is not appropriate. This has been clarified in the manuscript in section 3.3 (Question Answering Completions) page 5: ``However, this approach does not scale well to large numbers of papers, as each query is a separate model request, meaning that each paragraph in each paper would require a large number of requests in order to fill a single template. Therefore, this approach is used to construct an initial dataset consisting of synthesis templates for paragraphs from a small number of papers. Due to the small number of papers used, this initial dataset does not necessarily capture the variety of precursors or manners in which critical data can be communicated in text. As such, only information known to be commonly present in seed-mediated gold nanorod synthesis (e.g. the common precursor volumes/concentrations) were queried. Nevertheless, these initial templates, when corrected, provide a suitable starting point for fine-tuning GPT-3 to provide complete synthesis templates in single requests for each paragraph.''

6. It is nice that the authors list the price. Since the price/token changes, however, it might be useful to also add the number of tokens.
Response: We thank the reviewer for the suggestion. Token counts (input and output) are now included in the paper for the fine-tuning, testing set predictions, and full filtered dataset predictions. As follows:

- Section 4 (Results) page 5: ``Default parameters for the fine-tuning process were employed, incurring a cost of 85.30 USD (191,069 prompt tokens and 522,649 completion tokens). The predictions over the testing dataset (40 papers composed of 117 paragraphs) took around eighty minutes to complete and incurred a cost of 14.39 USD (27,327 prompt tokens and 92,126 completion tokens).''
- Section 4.3 (Results: Full Filtered Dataset) page 8: ``The fine-tuned GPT-3 model was applied to the full filtered dataset of 1,137 papers (2,969 paragraphs) at a total cost of 384.31 USD (838,901 prompt tokens and 2,332,796 completion tokens) over 33 hours.''

7. ``The performance of the fine-tuned model was then evaluated using the corrected testing dataset.'' – At this point in the text, I struggled to follow what ``corrected'' means (and how you ensure a fair evaluation given a ``correction'' procedure).
Response: The ``corrected testing set'' refers to the annotations containing ground-truth information used to evaluate the model performance by comparing the model predictions for the same paragraphs to the ground truth. This has been clarified in the manuscript to simply be referred to as the testing dataset.

8. ``quantitative relative error was calculated according to the function s(p, q) = 2 min(p, q)/(p + q)'' – the reader would benefit from defining/giving an example for p and q.
Response: It has been clarified in the text that p and q are non-negative numerical values where p is the predicted numerical value and q is the annotated numerical value. Section 4.1 (Results: Error Evaluation Definitions and Examples) page 7: ``For numerical values with units, the units must be exactly correct and the quantitative relative error was calculated according to the function s(p, q) = 2 min(p, q)/(p+q), which is derived from the absolute proportional difference r(p, q) = |p-q|/(p+q) and is bounded on [0,1] for non-negative numerical values p (predicted numerical value) and q (annotated numerical value).''
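
For reference, a direct sketch of this similarity function in Python is given below; the handling of the degenerate case p = q = 0 is our own assumption, as the text does not specify it.

    def relative_similarity(p, q):
        """s(p, q) = 2*min(p, q)/(p + q) for non-negative p (predicted value)
        and q (annotated value); bounded on [0, 1], equal to 1 only when p == q."""
        if p == 0 and q == 0:
            return 1.0  # assumption: treat two zero values as a perfect match
        return 2 * min(p, q) / (p + q)

    # Example: predicted 0.25 mL vs. annotated 0.20 mL (units already checked)
    print(relative_similarity(0.25, 0.20))  # ~0.889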

9. I’m not sure that ``adjusted'' F1 score is uniquely defined (in particular, since the article sometimes mentions F1, in other cases ``adjusted'' F1).
Response: The adjusted F1-score is defined as the product of the F1-score for information placement and the accuracy for information transcription. This was clarified in the manuscript such that all references are explicit and the ``adjusted F1-score'' is explicitly defined in Section 4 (Results) page 7: ``The combined accuracy (adjusted F1-score) is presented as the product of the F1-score for information placement and the transcription accuracy.''
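
A minimal sketch of this combined metric, using hypothetical placement counts and a hypothetical transcription accuracy, reads as follows.

    def adjusted_f1(true_positives, false_positives, false_negatives,
                    transcription_accuracy):
        """Adjusted F1-score: placement F1-score times transcription accuracy."""
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        f1 = 2 * precision * recall / (precision + recall)
        return f1 * transcription_accuracy

    # Hypothetical example: 80 correctly placed entries, 20 spurious, 10 missed,
    # and 95% of the placed entries transcribed correctly.
    print(adjusted_f1(80, 20, 10, 0.95))  # ~0.80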

10. In Figure 7, I recommend not using red and green for color coding (color blindness and no contrast when printed in B/W).
Response: We thank the reviewer for the suggestion. Alternative colors have been used for the figure. The red was made darker and the green was changed to a pale yellow.

11. What is ``outlier detection using an elliptic envelope'' – is this a manual procedure or a specific algorithm?
Response: We thank the reviewer for pointing this out. Using an elliptic envelope for outlier detection is an established procedure originally described in this paper: https://www.tandfonline.com/doi/abs/10.1080/01621459.1984.10477105. We used the implementation provided here: https://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html. References have been added. However, even after using this for outlier detection, the results are still manually verified, as mentioned in the manuscript.
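
For readers unfamiliar with the method, a minimal sketch of this style of outlier flagging with the scikit-learn implementation is shown below; the synthetic (length, width) data and the contamination fraction are placeholders rather than the values used in this work.

    import numpy as np
    from sklearn.covariance import EllipticEnvelope

    # Synthetic stand-in for extracted nanorod (length, width) pairs in nm.
    rng = np.random.default_rng(0)
    dimensions = rng.normal(loc=[45.0, 12.0], scale=[8.0, 2.0], size=(200, 2))

    detector = EllipticEnvelope(contamination=0.05, random_state=0)
    labels = detector.fit_predict(dimensions)  # +1 = inlier, -1 = outlier

    outliers = dimensions[labels == -1]
    print(f"{len(outliers)} candidate outliers flagged for manual verification")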

12. Speaking about the use of GPT-3 in chemistry, it is perhaps fair to also cite some other work in this field (e.g., works from the White lab)
Response: We thank the reviewer for pointing this out. More citations were added to the introduction:

- Assessment of chemistry knowledge in large language models that generate code: http://dx.doi.org/10.1039/D2DD00087C
- Bayesian Optimization of Catalysts With In-context Learning: https://arxiv.org/abs/2304.05341
- ChemCrow: Augmenting large-language models with chemistry tools: https://arxiv.org/abs/2304.05376

13. There are some issues in the references (e.g., check ref 71).
Response: We thank the reviewer for pointing this out. Various references were fixed.

14. In the table with metrics, it is probably good to indicate which metrics are used to measure placement and which are the ones to measure transcription accuracy. Otherwise, an accuracy of 100% below a precision of 81% might appear very confusing.
Response: We thank the reviewer for pointing this out. These distinctions are indicated in the table in the main manuscript (under the headers ``Placement'', ``Transcription'', and ``Combined''), but this has been corrected in the tables in the supplemental materials.

15. While the pipelining is described in the text, I (and I guess many readers) would love to see the code the authors used
Response: We thank the reviewer for this suggestion. The annotations were largely manually performed (albeit with zero-shot predictions to fill in some values) and the fine-tuning was performed with the command-line interface of the OpenAI API, so there is not much interesting code to present. However, the code used for evaluating model performance has been added to the supporting data and the fine-tuning command was added to the main manuscript.

16. The authors provide JSON files for the data. Still, they are admittedly of limited use for replicating the work (and cross-checking the evaluation), as the original text cannot be shared for legal reasons. Having some tests along with the codebase might give some more confidence.
Response: We thank the reviewer for this suggestion. The model weights additionally cannot be shared outside of an organization on OpenAI, so unfortunately, there is not really a convenient way to provide tests either. However, we have provided DOIs for all of the articles and the first and last 25 characters for each paragraph, so one should be able to reconstruct the dataset using the information we provided, despite copyright restrictions, presuming they have access to the full-text articles. We have additionally provided the scripts for accuracy evaluation between the annotated and predicted templates for the paragraphs in the testing dataset.

Response to Reviewer 3

1. It is unclear whether the information fed to the model is a pre-selected paragraph likely containing a recipe or any paragraph from a pre-selected paper, or any paragraph from any paper. In Sec. 3 the authors write ``All input texts for each stage are drawn from the original dataset of synthesis text filtered down to paragraphs likely to describe seed-mediated gold nanorod growth'', which suggests the first option for the assessment of the method, while in 4.3 they say ``The fine-tuned GPT-3 model was applied to the full dataset of 1,137 papers'', which implies all paragraphs from pre-selected papers were put through the model. The same approach should be used for assessment and final database or the results of the assessment would not correspond to the final data. If paragraphs without synthesis are put into the model, there is a risk that an inexistent/incorrect recipe will be extracted, as the fine-tuned model is incentivized to always extract data as it was trained on paragraphs with recipes. How often does that happen?
Response: The reviewer is correct in noting that the performance of the model is dependent on its fine-tuning distribution. In section 2 (Dataset) page 3, we state: ``Using the extracted information, 5,145 papers were identified to contain gold nanoparticle synthesis protocols, of which 1,137 filtered papers were found to contain seed-mediated recipes using the "seed_mediated" flag as well as rod-like morphologies ("rod" or "NR" in "morphologies" under "morphological_information") or aspect ratio measurements ("aspect" or "AR" in "measurements" under "morphological_information"). This was done to filter the total papers down to only those likely to contain seed-mediated synthesis recipes for gold nanorods.'' This filtering step applies to all of the data used with GPT-3. However, the filtering still provides many negative examples, as many of the paragraphs only include incidental mentions of these keywords. As stated in section 3.4 (Fine-tuning Procedure and Dataset Construction) page 5: ``For example, seed-mediated growth or nanorod measurements and morphologies may only be incidentally mentioned in a given paragraph that is otherwise not relevant to a specific seed-mediated gold nanorod growth procedure. Of the 240 filtered papers in the training set and the 40 filtered papers in the testing set, 141 and 23 papers respectively contained at least one paragraph with information that could be placed into a synthesis template.'' As such, we expect that paragraphs containing no relevant synthesis information will not result in nonexistent extractions to a greater extent than reported. In order to clarify what data was actually processed, where appropriate, the papers in the dataset are now referred to as ``filtered papers'' to reflect the initial filtering applied via prior cited work.
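
For clarity, the quoted filtering rule corresponds roughly to the following sketch; the record layout (dictionary keys with list-valued fields) is an assumption made for illustration and does not reproduce the actual data structures of the prior work.

    def is_seed_mediated_gold_nanorod_paper(record):
        """Approximate the quoted filter: seed_mediated flag plus rod-like
        morphologies or aspect ratio measurements."""
        morph_info = record.get("morphological_information", {})
        morphologies = " ".join(morph_info.get("morphologies", []))
        measurements = " ".join(morph_info.get("measurements", []))
        rod_like = "rod" in morphologies.lower() or "NR" in morphologies
        aspect_ratio = "aspect" in measurements.lower() or "AR" in measurements
        return bool(record.get("seed_mediated")) and (rod_like or aspect_ratio)

    # Example record (made up):
    paper = {"seed_mediated": True,
             "morphological_information": {"morphologies": ["nanorod"],
                                           "measurements": ["aspect ratio"]}}
    print(is_seed_mediated_gold_nanorod_paper(paper))  # True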

2. Related to this, the procedure to find relevant paragraphs seems to be very elaborate and involve not only many steps but many other natural language procedures and language models. Is there a way to involve GPT-3 in this step to simplify it? If not then perhaps rephrasing the title to include something like ``scientific text'' rather than ``literature'' would be more accurate.
Response: We thank the reviewer for pointing this out. Given that the paragraphs used were mined from academic literature and then processed using methods established by prior work, we have changed the manuscript title to ``Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Scientific Text with GPT-3'' in order to reflect the intermediary steps between the raw text of the literature and the information fed into the model.

3. The value of estimating ``placement errors'' and ``transcription errors'' is not clear to me. I'm not exactly sure how ``if the same field contains information (as opposed to being empty) in both templates, that is considered a true positive prediction regardless of whether the information explicitly matches'', could be considered a positive prediction. It seems to me to be the very definition of the worst possible outcome, and even more, one that is known to be an issue with LLMs, i.e. factual incorrectness and making up information. I understand that ``transcription error'' would later consider this as an error, but what was the point in assessing that the placement was a ``positive''? If we are considering the accuracy of extraction field-wise, it seems to me that entry is either correct (same in both, empty or not, a true positive), exists in prediction but not ground truth (value present in prediction, but empty or non-existing field in ground truth, a false positive), and not existing or empty in prediction, while present and non-empty in ground truth (false negative).
Response: We thank the reviewer for this interesting question. It is correct that the adjusted F1-score (the product of the F1-score and the transcription accuracy) is by far the most meaningful reported metric. To emphasize this, we added the statement ``The combined accuracy (adjusted F1-score) is presented as the product of the F1-score for information placement and the transcription accuracy. This is the most meaningful metric to evaluate the overall performance of the model.'' to section 4.1 (Error Evaluation Examples and Definitions) page 6. Due to the static nature of the rather large synthesis templates, most entries, even for a correct prediction, will be empty. Including true negatives as true positives would inappropriately inflate the accuracy scores, so we avoided doing that. This is conventional in the evaluation of NER predictions, and this task is essentially the combination of entity extraction and relation extraction, so we tested placement and transcription separately. While the precisions and recalls do not contain information about whether the extracted entities were ``transcribed'' correctly, they can give an indication of where the errors are coming from (e.g. a low recall indicates that relevant information is not being extracted at all regardless of transcription error).

4. The authors report an accuracy of 86% in the abstract. I think this is misleading, as it suggests that 86% of recipes can be expected to be correct, while in reality, the 86% refers only to individual pieces of information, and only 40% of full recipes are correct. And that 40% figure is also under special constraints of what is necessary in a recipe. This is not meant to take away from the results of the authors' work, as 40% of such a complex set of information is still impressive, but I think a more relevant number should be reported as the main result, if the title of the paper were to remain as ``extracting growth procedures''. The 86% accuracy in individual fields would be an important metric supporting that information. It seems like the metric of the biggest value that would really describe the method would be: out of all complete and full recipes in source texts, how many are extracted entirely with all relevant fields accurately extracted? And these should be accompanied or maybe combined with ``out of not complete recipes in the source how many are extracted entirely with all fields correct''.
Response: Yes, the 86% refers to the overall performance aggregated by entity. The 40%, however, does not refer to any performance accuracies; it refers only to the proportion of information-containing papers that fully describe the procedure and outcome of the seed-mediated gold nanorod synthesis. Across those papers, one can still only expect the accuracy (aggregated over the entities) to be 86% (as determined on the test set). In order to address your comment on the accuracy within the extracted templates for paragraphs/papers, histograms showing the adjusted F1-score by paragraph and paper have been included in the ``Model Performance'' section. It was found that the average adjusted F1-score aggregated by paragraph/paper was 64%/76%. These are lower since performance tended to be lower for paragraphs with less extractable information. It was additionally found that the information in 33% of the paragraphs and 15% of the papers was extracted perfectly. Furthermore, information was extracted from 48% of the paragraphs and 62% of the papers with >90% adjusted F1-score. The paper-wise performance has been included in the abstract.

5. Which also leaves a question related to one of my previous comments - how do we distinguish the incorrectly extracted recipes that pass the ``sanity check'' for duplicates etc. from those that are fully correct? I fully expect errors in extraction of recipes, but I would also anticipate recipes for other synthesis or something else, even non-existent recipes to be extracted from paragraphs that may contain them. These may be a bigger issue. To my knowledge GPT-3 outputs probabilities of completion tokens, perhaps analysis of the values of probabilities would enable identification of inaccurately extracted entries? It may be preferred that fewer recipes are extracted, if the certainty that these recipes are accurate was high, in particular if they were to be used to reproduce the synthesis procedure. If each procedure has to be reverified by hand, that takes away from the value of the GPT-3 extracted data.
Response: We thank the reviewer for this very interesting suggestion. Absolutely verifying the results of an extraction with 100% certainty is indeed a task that can only be performed manually. However, this is not unique to the GPT-3-based method we report. Even a simple text classifier (e.g., a binary classifier of whether a text contains any chemistry-related synthesis recipe) cannot be absolutely trusted unless its predicted results are examined manually. The probabilities of a classifier, however, can be inspected in order to set a classification threshold towards one performance goal or another (e.g., reducing false positives). Analogously, GPT-3 does output log probabilities of each output token; in principle, we could set a threshold on the token log probabilities in order to identify likely erroneous entries. In practice, however, the log probabilities do not necessarily directly correlate with the type of extraction accuracy we are interested in. The log probabilities essentially measure the likelihood of a token in the context of the other tokens and do not necessarily indicate the accuracy of the information contained by the token. For instance, lower probabilities may be assigned to novel albeit correct information and higher probabilities may be assigned to familiar but incorrect information. Thus, while one can use these log probabilities to estimate subjective confidence in the predicted tokens, this approach does not necessarily evaluate factual correctness. This is definitely an avenue of interesting further study, as has been discussed (https://towardsdatascience.com/exploring-token-probabilities-as-a-means-to-filter-gpt-3s-answers-3e7dfc9ca0c), and if one were to solve the relation between model confidence in predicted tokens and the factual extractive accuracy of those tokens, the oft-mentioned hallucination problem with autoregressive models could be addressed more rigorously, presenting a massive advance in application to information extraction tasks. As it stands, this is an open problem in need of more research beyond the scope of this work.
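
To make the idea concrete, a sketch of such a confidence filter is shown below; it follows the legacy Completions API response layout, and the model identifier, input paragraph, and 0.5 probability cutoff are placeholders.

    import math
    import openai  # legacy 0.x client

    paragraph_text = "A seed solution was ..."  # placeholder paragraph

    response = openai.Completion.create(
        model="davinci:ft-placeholder",  # hypothetical fine-tune identifier
        prompt=paragraph_text + "\n\n",
        temperature=0,
        max_tokens=1024,
        stop="\n\n",
        logprobs=1,  # also return the log probability of each generated token
    )

    logprobs = response["choices"][0]["logprobs"]
    flagged = [(token, math.exp(lp))
               for token, lp in zip(logprobs["tokens"], logprobs["token_logprobs"])
               if lp is not None and math.exp(lp) < 0.5]  # arbitrary cutoff
    print(f"{len(flagged)} tokens fall below the confidence cutoff")

As noted above, a token falling below such a cutoff signals low model confidence rather than factual incorrectness, so any such filter would still need to be validated against annotated data.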

6. The authors say that the initial zero-shot question-answering framework does not scale well to a large number of papers due to a large number of requests, and later they say the zero-shot performance is poor. Why is a large number of requests an issue? I'm assuming this is done with a computer code so the process may be parallelized or otherwise sped up. If it's poor, that means it has been assessed, and I believe a full evaluation of a zero-shot approach is very valuable and should be reported, as it allows to accurately judge how much improvement the time consuming and difficult fine-tuning procedure actually provides, and if the zero-shot performance is poor, that would only emphasize the success of the presented approach. If price is the factor, and the actual performance would be good, other alternative LLMs could be applied. Which leads me to an additional comment; GPT-3 is no longer considered state-of-the-art. Models like, for example, LLaMA, are reported to not only outperform GPT-3 but are also free, which could solve the cost issue. Would it be possible to have at least one other model assessed for comparison?
Response: The zero-shot question answering does not scale well because the entities are separately requested one-by-one rather than extracted all at once as with the fine-tuned model. This very quickly becomes cost-prohibitive since filling in the full template (rather than just common information) would require 109 requests for each paragraph, with more significant hallucination risk. If this were done for the full filtered dataset, the cost would be around 2k USD (using the same Instruct-GPT model), which is far more expensive than the fine-tuned model. Overall, the zero-shot predictions were not explicitly evaluated since they were only used to provide a starting point for the annotations. The intention from the beginning was always to evaluate the performance of the fine-tuned model, with the zero-shot question answering only included in the methodology to explain how the annotations were initialized. Indeed, only commonly present information was queried for zero-shot prediction (e.g. HAuCl4 volumes and concentrations in the seed solution) rather than all of the entities in the full template, so a direct comparison to the final result from fine-tuning is not really appropriate. This has been clarified in the manuscript. Other more state-of-the-art models such as LLaMA were released (24 February 2023) well after this work was performed (the first half of 2022) and indeed shortly after this manuscript was submitted (23 February 2023). While this would alleviate some cost issues, the tradeoff is that hardware requirements are rather restrictive for these models. While newer models such as LLaMA do provide interesting routes of investigation and comparison, we feel this is outside the scope of this work.
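
For scale, combining the 109 requests per paragraph with the 2,969 paragraphs in the full filtered dataset (Section 4.3) gives 2,969 × 109 ≈ 324,000 zero-shot requests, compared with 2,969 single-request completions for the fine-tuned model, i.e. more than a hundredfold increase in request volume before accounting for the per-request cost.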

7. Journal guidelines for data availability suggest sharing everything that can be shared, including computer codes. I understand that the text paragraphs cannot be shared due to copyright, but the only things shared seem to be the datasets. I would encourage sharing the computer codes as well if possible.
Response: We thank the reviewer for the suggestion. The annotations were largely manually performed (albeit with zero-shot predictions to fill in some values) and the fine-tuning was performed with the command-line interface of the OpenAI API, so there is not much interesting code to present. However, the code used for evaluating model performance has been added to the supporting data and the fine-tuning command was added to the main manuscript.

Response to Reviewer 4

1. In model training, it was mentioned that nearly 60% of the training and testing sets (141 out of 240 and 23 out of 40 respectively) contained ``at least one paragraph with information that could be placed into a synthesis template.'' Then, in Figure 6, when the breakdown of the 1137 paper dataset was discussed, it did not mention whether those papers present some useful information beyond incidental appearance of keywords related to seed-mediated gold nanorod growth. I find it important to include such analysis while presenting the dataset.
Response: We thank the reviewer for this question. In the data analysis section, it is stated that: ``To evaluate the completeness of the information this dataset contains, we examined 1,137 papers in the full filtered prediction dataset. Of these, 701 (62%) contained at least one paragraph with a non-empty synthesis template.'' Then further analysis includes discussion of what is present in those templates: ``Of these 701 papers, 678 (97%) fully specified at least one synthesis component: the seed solution, the growth solution, or the gold nanorod dimensions.'' And then further along: ``The vast majority of the papers reported gold nanorod dimensions, with 80% of the 678 papers with at least one fully specified synthesis component containing fully-specified gold nanorod dimensions. Additionally, the majority of the papers fully-specified the seed and growth solutions (respectively 61% and 67%). However, they are distributed such that 40% (268) of the papers fully specified all three components.''

2. The discussion and caption do not match the image shown in Figure 1 and therefore appear a bit confusing. The caption states four parts, while the image itself contains only three. The missing part is said to be in 'purple' which does not exist in Figure 1.
Response: We thank the reviewer for pointing this out. The reference to the purple component has been removed since that was relevant to a prior version of the figure where the synthesis texts were labeled in purple instead of being included in each stage.




Round 2

Revised manuscript submitted on 30 Jun 2023
 

25-Jul-2023

Dear Dr Walker:

Manuscript ID: DD-ART-02-2023-000019.R1
TITLE: Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Scientific Text with GPT-3

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after a minor revision: in response to the 3rd reviewer's comment - at the very minimum, a discussion about the evolving nature of these LLMs is warranted. If you can, of course, produce results based on LLAMA or LLAMA-2, that will be fantastic. I leave the choice to the authors.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link :

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

The authors have sufficiently addressed all my previous comments in this revision, I thus recommend publication in its current form.

Reviewer 2

The authors addressed my concerns.

Reviewer 3

Most of the comments were answered satisfactorily. However, both Reviewer #2 and I had questions about the use of other models, which I think has not been addressed well. While I understand that research takes time and money, it may be worth spending more time and money for the sake of improving the research - on January 4th 2024 (so just over 5 months from now), davinci will no longer be available and will be replaced by another version, davinci-002. And who knows how long that one will be available. This may mean that the impact of this paper will be greatly reduced in a very short time, and none of the results will ever again be reproducible.
The solution to this is to:
- as an addition, evaluate the use of a free and open LLM, such as LLaMA, or even LLaMA 2 (released recently). Not only are these models likely to be better performing than davinci, but they are also free and open and will never expire or deprecate. Also, the base model of LLaMA can be run on a single consumer-grade GPU, so it is not that prohibitive computationally, and it is quite fast at completions.
- do a full evaluation of a zero-shot approach, even though it is not working well. Once the results become impossible to replicate, the zero-shot evaluation would still provide valuable information for assessing other models on the same or similar tasks and judging performance.

My preference would be the first option, since it would make this work stand the test of time and keep it relevant for the foreseeable future. Otherwise, after Jan 4th, all of the methods and statistical quantities will become irrelevant, as no one will ever be able to take advantage of them. It is unfortunate, but this area is progressing at an astonishing rate, and things like this are bound to happen.

It is up to the authors and the editor to decide whether that change is necessary or not.
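
For illustration only, the kind of zero-shot prompt suggested above could be run against an open model with the Hugging Face transformers pipeline roughly as follows. The model name, prompt wording, and JSON fields here are assumptions for the sketch, not details taken from the manuscript.

from transformers import pipeline

# Illustrative zero-shot extraction sketch; model, prompt, and schema are assumptions.
generator = pipeline("text-generation",
                     model="meta-llama/Llama-2-7b-chat-hf",
                     device_map="auto")

paragraph = "..."  # a seed-mediated AuNR growth paragraph from a paper

prompt = (
    "Extract the gold nanorod growth procedure from the paragraph below as JSON "
    "with fields such as 'reagents' (name, concentration, volume) and "
    "'conditions' (temperature, time). Return only the JSON.\n\n"
    f"Paragraph: {paragraph}\n\nJSON:"
)

completion = generator(prompt, max_new_tokens=512, do_sample=False)
print(completion[0]["generated_text"])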


 

Referees 1 and 2:

Referee 1:
The authors have sufficiently addressed all my previous comments in this revision; I thus recommend publication in its current form.
Referee 2:
The authors addressed my concerns.

Response:
We thank the referees for taking the time to review this work and are happy that the revisions meet the referees' expectations.

Referee 3:

1. Most of the comments were answered satisfactorily. However, both Reviewer 2 and I had questions about the use of other models, which I think have not been addressed well. While I understand that research takes time and money, it may be worth spending more time and money for the sake of improving the research: on January 4th, 2024 (just over 5 months from now), davinci will no longer be available and will be replaced by another version, davinci-002. And who knows how long that one will be available. This may mean that the impact of this paper will be greatly reduced in a very short time, and none of the results will ever again be reproducible.

Response:
The reviewer is correct in pointing out that OpenAI announced the retirement of GPT-3 Davinci on July 6th, 2023. However, it is important to note that our submission, including the reviewer responses and the revised manuscript, was completed by June 2023; given this timeline, we were not privy to OpenAI's retirement plan for GPT-3, as it was announced after our submission. We acknowledge the reviewer's concern about the impact of these developments on the longevity of our approach and appreciate the commitment to the advancement of research. The revised paper includes results from an open model, Llama-2, as elaborated in the next response, which should address this concern.

2. The solution to this is to: - as an addition, evaluate the use of a free and open LLM, such as LLaMA, or even LLaMA 2 (released recently). Not only are these models likely to perform better than davinci, but they are also free and open and will never expire or be deprecated. The base model of LLaMA can also be run on a single consumer-grade GPU, so it is not computationally prohibitive, and it is quite fast at completions. - do a full evaluation of a zero-shot approach, even though it does not work well. Once the results become impossible to replicate, the zero-shot evaluation would still provide valuable information for assessing other models on the same or similar tasks and judging their performance. My preference would be the first option, since it would make this work stand the test of time and keep it relevant for the foreseeable future.

Response:
In response to this comment, we have taken the following actions to address the suggested improvements:

A. Added Llama-2 benchmark:
We have fine-tuned the recently released, free and open Llama-2 model as a benchmark against our fine-tuned GPT-3 in the revised paper. A paragraph on Llama-2 has been added at the end of the Introduction section, and specific details of the Llama-2 fine-tuning process are given at the end of Section 3.1 (a rough, illustrative sketch of this kind of fine-tuning follows this response).

B. Updated Tables:
Tables 1 and 2 in our paper have been updated to present the performance scores of Llama-2 alongside the existing GPT-3 scores. This enables a direct comparison of the two models and their respective results.

C. Discussion on Llama-2 vs. GPT-3:
To provide insight into the comparison between Llama-2 and GPT-3, we have included a dedicated discussion at the end of Section 4.2. We note that fine-tuned Llama-2 gives similar or slightly lower performance than fine-tuned GPT-3.

We believe that these additions and updates to our paper address the reviewer's suggestion.
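
As a rough, illustrative sketch of the kind of fine-tuning described in point A, one could adapt Llama-2 with LoRA using the Hugging Face transformers and peft libraries as below. The model name, hyperparameters, data file, and field names are assumptions for illustration and are not the exact pipeline reported in the paper.

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "meta-llama/Llama-2-7b-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Each record pairs a synthesis paragraph with its structured JSON target
# (hypothetical JSONL field names).
data = load_dataset("json", data_files="aunr_train.jsonl")["train"]

def tokenize(example):
    text = example["paragraph"] + "\n###\n" + example["structured_json"]
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-aunr", num_train_epochs=3,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, learning_rate=2e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()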

3. Otherwise, after Jan 4th, all of the methods and statistical quantities will become irrelevant as no one will ever be able to take advantage of them. It is unfortunate, but this area is progressing at an astonishing rate, and things like this are bound to happen. It is up to the authors and the editor to decide whether that change is necessary or not.

Response:
We note that, in addition to the new Llama-2 results, this work makes a significant contribution by providing a dataset of 11,644 entities extracted from 1,137 AuNR synthesis publications. This dataset remains available and valuable for further research and analysis, irrespective of the specific model used to construct it.
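
As an illustration of how a released dataset of this kind might be consumed, the following minimal sketch loads a hypothetical JSON export and filters records; the file name and field names are invented for illustration, and the actual schema is defined by the released dataset itself.

import json

# Hypothetical file and field names; the released dataset defines its own schema.
with open("aunr_procedures.json") as f:
    records = json.load(f)

# Example query: count records whose reagent list mentions CTAB.
ctab = [r for r in records
        if any(e.get("name", "").upper() == "CTAB"
               for e in r.get("reagents", []))]
print(f"{len(records)} records; {len(ctab)} mention CTAB")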




Round 3

Revised manuscript submitted on 12 Sep 2023
 

15-Sep-2023

Dear Dr Walker:

Manuscript ID: DD-ART-02-2023-000019.R2
TITLE: Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Scientific Text with LLMs

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry


******
******

Please contact the journal at digitaldiscovery@rsc.org





Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.