ChemDataWriter: a transformer-based toolkit for auto-generating books that summarise research †

Since the number of scientific papers has grown substantially over recent years, scientists spend much time searching, screening, and reading papers to follow the latest research trends. With the development of advanced natural-language-processing (NLP) models, transformer-based text-generation algorithms have the potential to summarise scientific papers and automatically write a literature review from numerous scientific publications. In this paper, we introduce a Python-based toolkit, ChemDataWriter, which auto-generates books about research in a completely unsupervised fashion. ChemDataWriter adopts a conservative book-generation pipeline to automatically write the book by suggesting potential book content, retrieving and re-ranking the relevant papers, and then summarising and paraphrasing the text within the paper. To the best of our knowledge, ChemDataWriter is the first open-source toolkit in the area of chemistry to be able to compose a literature review entirely via artificial intelligence once one has suggested a broad topic. We also provide an example of a book that ChemDataWriter has auto-generated about battery-materials research. To aid the use of ChemDataWriter, its code is provided with associated documentation to serve as a user guide.


Introduction
The world has witnessed a significantly growing corpus of scientific papers over recent years, through which scientists publish their research progress as a means of communication within the scientific community. 1 However, this large volume of scientific publications also makes it more difficult for researchers to follow research trends and gain insights into the latest scientific findings. In addition, writing a literature review based on numerous scientific papers is becoming very time-consuming. Thus, there is an urgent need to find an efficient way to read, review, and summarise scientific publications.
With the development of deep-learning and natural-language-processing (NLP) technologies, research efforts have been invested in the text mining of scientific publications. For example, literature-mining techniques have been used in the biomedical area to identify chemical records, 2,3 extract relational biochemical data, 4,5 and summarise the biomedical literature. 69][20][21][22] To enhance the text-mining performance, many NLP toolkits and models have been created over the past few years, such as ChemDataExtractor, 23,24 BatteryDataExtractor, 25 MatBERT, 26 MatSciBERT, 27 and BatteryBERT. 28 While most NLP-related research in chemistry and materials science focuses on natural-language understanding (NLU) and data extraction, 29 another main branch of NLP, natural-language generation (NLG), 30 is almost neglected in the text mining of chemical literature, even though such methods could significantly reduce the time that scientists need to review the literature. Yet, NLG could be used to generate scientific text if it is tailored to sophisticated scientific concepts and content. By contrast, other fields have already seen many applications of such forms of text generation; see, for example, the automatic generation of fiction, 31 sports news, 32 and dialogue conversations. 33 The slow progress in applying text generation to the scientific literature might be due to the difficulty of understanding sophisticated scientific concepts and content. The need to resolve chemical names from their associated labels that express the identity of a chemical in a scientific paper (chemical named entity recognition) is also a crucial consideration. Moreover, scientific writing requires high precision and a formal academic style compared to other types of writing. Automatically writing a research book is still a challenging task, research on which is in its infancy, especially for chemistry and materials science.
Mishra et al. studied the rst application of textsummarisation application in materials science using deep-Digital Discovery PAPER learning methods. 34The scientic literature of a specic area, friction stir-welded magnesium alloys, was summarised using a text-generation NLP algorithm.However, its data sets were only based on abstracts of research papers, which inevitably causes information loss from the full text during the text summarisation process.The rst machine-generated research book was published in 2019, 35 which provides a brief overview of Liion batteries that had been summarised from research papers.With certain controls from human users, the book was written relatively conservatively to preserve the original meaning of the source text and ensure scientic accuracy.However, the source code and toolkit that achieved this book generation are not open source, thereby posing the difficulty in using their technology within the academic community.In addition, the book-generation algorithm was mainly based on traditional NLP algorithms, while there is potential to improve its performance by introducing deep-learning models.Recently, Taylor et al. released a huge deep-learning-based language model for science that is called Galactica. 36Galactica is the rst tool to generate a literature review automatically; it is trained on a large scientic corpus of research papers, reference materials, knowledge bases and many other sources.However, the model demonstration of Galactica was removed soon aer its release owing to controversial issues surrounding the potential to generate inaccurate and unreliable output as well as causing inadvertent plagiarism.ChatGPT 37 has also been used to autogenerate literature reviews about "digital twins" in the healthcare sector.The review was generated by asking ChatGPT questions that it answered based on the inputted abstracts.The academic validity of the ChatGPT content is yet to be evaluated. 
38Language models that are pre-trained unidirectionally, such as generative pre-trained transformer (GPT), 39 face the disadvantage that its token representation only encodes the leward context, while bidirectional models such as bidirectional encoder representations from transformers (BERT) 40 and bidirectional and auto-regressive transformers (BART) 41 have stronger language representations and are more suited to tasks that require a deeper comprehension of context. 42,43While it is true that unidirectional language models can show better performance when the model size is much larger than that of bidirectional language models, larger models have demonstrated only a marginal advantage over smaller models while requiring much greater training resources. 44his paper releases an open-source Python toolkit, Chem-DataWriter, the rst toolkit in the area of chemistry to automatically generate research books that summarises the literature according to an input corpus that has been selected by the user.The core of the tool adopts state-of-the-art transformer models, including text clustering, text retrieval and re-ranking, text summarisation, and paraphrasing.Our toolkit enables users to generate research books in a completely unsupervised fashion: users only need to provide candidate research papers for ChemDataWriter to review and then produce a research book for the user about the summary of the input corpus.In the following sections, we will provide implementation details of ChemDataWriter, as well as three case studies about the analysis of critical parts of our toolkit.We also provide an example of a book about battery research that ChemDataWriter has autogenerated.While ChemDataWriter offers several advantages, we also recognise the importance of reecting upon the moral and philosophical implications of our toolkit.As AI continues to evolve, it is important to navigate its application responsibly.Meanwhile, we provide some recommendations for good practice when 
using ChemDataWriter.

System overview
Fig. 1 outlines the pipeline of ChemDataWriter, which includes seven main stages: paper downloading, paper screening, topic modelling, text retrieval & re-ranking, text summarisation, content organisation, and reference auto-generation. The final output is a research book about a specific topic. We employ a relatively conservative approach, by which we mean the generated summary is extracted and re-organised from multiple original sentences rather than written in a new and creative form; this ensures that ChemDataWriter generates an accurate and reliable book. Implementation details of each stage can be found below.

Paper downloading
ChemDataWriter uses the same web scrapers that are embedded within BatteryDataExtractor 25 to download papers from three publishers (the Royal Society of Chemistry, Elsevier, and Springer), as well as the same document processors to preprocess the HTML/XML files into plain text. Web scrapers allow users to download multiple papers on a specific topic, over a specific date range, or from a particular set of journals. ChemDataWriter also includes logic that differentiates and categorises sections of research papers, such as the abstract, introduction, conclusions, and references. Users can also use their own data sources for book generation by providing files in a certain format, i.e. a complete JSON file including the title, abstract, citation information, and full text (optional).

Paper screening
Papers are retrieved according to input keywords, through a query in the web scrapers, but the downloaded corpus can contain irrelevant papers: the query keyword is mentioned in the paper, but the paper does not belong to that exact topic. Hence, a paper-screening step must be completed to filter out irrelevant papers before generating a research book.
Since a high precision is preferred over a high recall for scientific book generation, ChemDataWriter adopts a prompt-based learning strategy to classify relevant and irrelevant papers. Prompt-based learning calculates the probability of a given text option, by directly modifying the original input with a prompt template, and can be used in a "few-shot" or "zero-shot" scenario. 45 For example, Yin et al. used a prompt template, "the topic of this document is [Z]", which was then inputted into masked pre-trained language models, to predict text that fills the slot [Z]. 46 In our study, we also used masked language models, such as BERT or domain-specific BERT, to screen papers according to their abstracts. A prompt template, "A paper in the area of [MASK] with an abstract.", is fed into the language model to predict the [MASK] word. For instance, if we want to obtain a collection of research papers about batteries, all the papers with "battery" or "batteries" as the masked output word will be saved. This way, some papers about battery research may be unintentionally filtered out, but with the key benefit that the resulting corpus will be very clean, with few noisy data.
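The screening rule described above can be sketched as follows. This is a minimal illustration rather than the ChemDataWriter implementation: the function names, the way the abstract is attached to the prompt, and the toy predictor are all assumptions. In practice, `predict_mask` would wrap a masked language model such as BERT and return its most probable word for the [MASK] slot.

```python
from typing import Callable, Dict, Iterable, List

# Prompt template quoted in the text; how the abstract is attached to it is an assumption.
PROMPT = "A paper in the area of [MASK] with an abstract."


def screen_papers(
    papers: Iterable[Dict[str, str]],
    predict_mask: Callable[[str], str],
    keep_words: Iterable[str] = ("battery", "batteries"),
) -> List[Dict[str, str]]:
    """Keep only the papers whose predicted [MASK] word is one of keep_words."""
    keep = {word.lower() for word in keep_words}
    kept = []
    for paper in papers:
        text = f"{PROMPT} {paper['abstract']}"
        if predict_mask(text).strip().lower() in keep:
            kept.append(paper)
    return kept


def toy_predictor(text: str) -> str:
    """Hypothetical stand-in for a masked language model, for illustration only."""
    return "batteries" if "electrolyte" in text.lower() else "medicine"


papers = [
    {"title": "A", "abstract": "A new solid electrolyte for Na-ion cells."},
    {"title": "B", "abstract": "Battery-powered pacemakers in cardiology."},
]
relevant = screen_papers(papers, toy_predictor)
print([paper["title"] for paper in relevant])
```

Paper B mentions the keyword "battery" but is not predicted to be about batteries, so it is dropped; this mirrors the preference for precision over recall.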

Topic modelling
In this stage, ChemDataWriter provides suggestions on potential topics that can be written based on text that is contained within the screened set of scientific papers. According to the output of suggested research topics, users can define titles and sub-titles of each chapter, after which ChemDataWriter will produce a full table of contents for the entire book. ChemDataWriter can also generate text in a mode that automatically provides chapter titles based on the output of suggested research topics, although the auto-generated titles are lists of words rather than full sentences. Since topic modelling can be time-consuming when the number of papers is large, we set up this step as an optional stage. Users who want to control the content of the auto-generated book themselves can instead manually provide the full table of contents to ChemDataWriter.
We use the BERTopic algorithm as the default topic model to cluster papers in ChemDataWriter. 47 BERTopic generates topics through several independent but sequential steps: creating document embeddings using a transformer-based language model; reducing the dimensionality of these embeddings; creating semantically similar clusters; and using a class-based version of a term frequency-inverse document frequency (TF-IDF) model to extract the topic representation from each topic. The output of the topic model is a list of keywords that represent the topic, where each paper is also categorised into a certain topic. While users can choose various alternative embeddings or models for each process, default models are embedded in ChemDataWriter as follows: document embeddings (sentence transformer 48), dimensionality reduction (UMAP 49), clustering (HDBSCAN 50), topic representations (c-TF-IDF 51).
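To make the last step more concrete, the sketch below scores words per cluster with the class-based TF-IDF weighting W(t, c) = tf(t, c) · log(1 + A / tf(t)), where tf(t, c) is the frequency of term t in cluster c, tf(t) is its frequency across all clusters, and A is the average number of words per cluster. It is a simplified, standard-library illustration that assumes the clusters have already been found; it does not call the BERTopic library, and the whitespace tokenisation and toy documents are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(class_docs: Dict[str, List[str]], top_n: int = 3) -> Dict[str, List[str]]:
    """Rank the words of each cluster with W(t, c) = tf(t, c) * log(1 + A / tf(t))."""
    # Concatenate the documents of each cluster into one bag of words.
    class_counts = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in class_docs.items()
    }
    # tf(t): frequency of each term across all clusters.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(class_counts)

    topics = {}
    for label, counts in class_counts.items():
        scores = {t: tf * math.log(1 + avg_words / total[t]) for t, tf in counts.items()}
        # Sort by descending score; break ties alphabetically for reproducibility.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        topics[label] = ranked[:top_n]
    return topics


clusters = {
    "cluster_0": ["sulfur cathode polysulfide shuttle", "polysulfide shuttle in sulfur cells"],
    "cluster_1": ["sodium anode hard carbon", "hard carbon anode for sodium cells"],
}
topics = c_tf_idf(clusters)
print(topics)
```

Words that are frequent inside one cluster but rare elsewhere receive the highest weights, whereas a word shared by both clusters ("cells") is ranked low.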

Text retrieval & re-ranking
The text-retrieval step retrieves relevant papers according to the inputted topic words, i.e. the names of chapters and sub-chapters, which are then re-ranked according to their relevance, from high to low. We embedded Haystack's retriever software module 52 into ChemDataWriter in order to perform the semantic search. The collection of papers is first saved into a database (in the form of a document store within Haystack, default: InMemoryDocumentStore), from which the retriever can quickly identify the relevant documents that need to be summarised in the next step and dismiss the irrelevant ones. The retriever that employs the TF-IDF model is the default text retriever in ChemDataWriter in order to maintain a good search efficiency, while language models are also accessible and can be used for embedding retrieval. Users can specify the number of relevant documents that need to be found by the text retriever, and the final output will automatically re-rank the extracted papers in terms of their relevance scores.
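The idea of TF-IDF retrieval followed by relevance ranking can be illustrated with a small standard-library sketch. It does not use Haystack's API; the tokeniser, the smoothed IDF formula, the cosine-similarity score, and the `top_k` parameter name are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, sorted from most to least relevant."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency of every term in the collection.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))

    def tf_idf(tokens: List[str]) -> Dict[str, float]:
        counts = Counter(tokens)
        # Smoothed IDF; query terms that are absent from the collection are ignored.
        return {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in counts.items()
            if t in df
        }

    def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    query_vec = tf_idf(tokenize(query))
    scored = [(i, cosine(query_vec, tf_idf(tokens))) for i, tokens in enumerate(doc_tokens)]
    # Re-rank by relevance score (high to low) and keep the top_k documents.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


docs = [
    "hard carbon anode materials for sodium-ion batteries",
    "layered oxide cathode materials for sodium-ion batteries",
    "state of charge estimation with neural networks",
]
ranking = retrieve("anode materials", docs, top_k=2)
print(ranking)
```

Here the query plays the role of a chapter or sub-chapter title, and `top_k` corresponds to the user-specified number of relevant documents.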

Text summarisation
Text summarisation is the core part of ChemDataWriter. The text-summarisation algorithms in NLP include extractive summarisation and abstractive summarisation. 53 Extractive summarisation is a relatively conservative way to summarise text by selecting and combining the most important parts of text or sentences from the original corpus without adding or modifying any information. By contrast, abstractive summarisation is used to rephrase the original text and generate new sentences based on machine comprehension. Abstractive summarisation can create more creative text than extractive summarisation, but this greater creativity comes at the expense of a more error-prone summarisation process.
In order to preserve the original meaning of the text that needs to be summarised, we tested several transformer-based extractive-summarisation language models. We selected the fine-tuned DistilBART that is available on Hugging Face 54 as our chosen model. DistilBART is a smaller version of BART, a transformer-based encoder-decoder model that has been pre-trained for natural-language generation. 55 BART models combine the features of BERT 40 and GPT 39 models and are particularly effective when they are fine-tuned for text generation. In this study, the CNN Dailymail data set 56,57 was used to fine-tune DistilBART, as this data set offers a diverse range of topics and writing styles, to aid improvements in the generalisation capabilities of our model. 54 The fine-tuned text-summarisation model results in a distillation of the most relevant information from each paragraph within original papers into several sentences, thus preserving the original content and meaning, and reducing the risk of creating mistakes.
One problem that conservative text-summarisation models face is that the summarised output can be very similar to, or even the same as, the original sentence, thus increasing the risk of inadvertent plagiarism. We mitigated this issue by paraphrasing the text of the summary before it is written into the book. To this end, we adopted a "back-translation" paraphrasing approach in order to ensure semantic and syntactic correctness. The "back-translation" model in ChemDataWriter employs two transformer-based language models that have been fine-tuned for translation: one to translate English into another language and another to convert the translated text back into English. We evaluated the performance of ChemDataWriter in paraphrasing text using four different foreign languages in the back-translation model, and selected English-to-German and German-to-English as the default paraphrasing models. Users can choose any language models and parameters for back-translation and paraphrasing in order to fulfil their specific needs.
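The back-translation idea reduces to composing two translators, as the sketch below shows. The function names and the toy word-by-word dictionaries are hypothetical stand-ins; in a real pipeline, each callable would wrap a fine-tuned translation model (for example, an English-to-German model and a German-to-English model).

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def back_translate(text: str, to_pivot: Translator, to_english: Translator) -> str:
    """Paraphrase English text by translating it into a pivot language and back."""
    return to_english(to_pivot(text))


def word_translator(table: Dict[str, str]) -> Translator:
    """Build a toy word-by-word translator (a stand-in for a real translation model)."""
    return lambda text: " ".join(table.get(word, word) for word in text.split())


# Hypothetical toy vocabularies, used only to make the example runnable offline.
en_to_de = word_translator({"the": "die", "battery": "Batterie", "shows": "zeigt",
                            "good": "gute", "performance": "Leistung"})
de_to_en = word_translator({"die": "the", "Batterie": "battery", "zeigt": "exhibits",
                            "gute": "good", "Leistung": "performance"})

original = "the battery shows good performance"
paraphrase = back_translate(original, en_to_de, de_to_en)
print(paraphrase)
```

Because the two translation directions are independent callables, swapping the pivot language or the training data of either model only requires passing different translators.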

Content organisation
The content-organisation stage of ChemDataWriter automatically organises the auto-generated text summary into a complete book with a certain format. ChemDataWriter contains pre-defined content selection and organisational logic in order to auto-generate a research book, the nature of which is best explained by illustration. For example, the auto-generation of a research book about "Na-ion batteries" will require the inclusion of several chapters that belong to this specific topic, with each chapter representing a sub-area such as anodes, cathodes, or electrolytes. Within each chapter, we generate an Introduction, a Literature-review section that is sourced from several sub-sections of each inputted paper, and a Conclusion section. The Introduction section is summarised from the abstract of each article in order to provide an overview, while the sub-sections of the main text (i.e. the literature review) are generated from the Introduction sections of each paper. Likewise, the Conclusion section of each book chapter consists of the summarised conclusion of each paper. Note that similar sentences that have been summarised from different papers will be merged into one; such sentences are identified using a similarity measure.
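The merging of similar sentences can be sketched as a greedy de-duplication. The text does not specify which similarity measure is used, so the word-level Jaccard similarity and the 0.7 threshold below are assumptions chosen purely for illustration.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two sentences."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def merge_similar(sentences: List[str], threshold: float = 0.7) -> List[str]:
    """Keep a sentence only if it is not too similar to a sentence that was already kept."""
    kept: List[str] = []
    for sentence in sentences:
        if all(jaccard(sentence, other) < threshold for other in kept):
            kept.append(sentence)
    return kept


summaries = [
    "Na-ion batteries are a promising alternative to Li-ion batteries.",
    "Na-ion batteries are a promising low-cost alternative to Li-ion batteries.",
    "Hard carbon is a widely studied anode material.",
]
merged = merge_similar(summaries)
print(merged)
```

The second sentence differs from the first by a single word, so only the first is retained; an embedding-based similarity could be substituted for `jaccard` without changing the surrounding logic.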
For each Literature-review section between the Introduction and Conclusion of a book chapter, we further simplify the title of each chapter sub-section using a transformer-based title-generation model to provide a clear view of the summarised paper. The default title-generation model in ChemDataWriter is the T5 model 58 which has been fine-tuned on the TitleWave 59 data set. By inputting the original long title, the fine-tuned T5 model will output a short title as the title of each chapter sub-section. Overall, this content-organisation approach of combining the summarised text of each paper separately is not ideal for combining the summary of research outputs, but its conservative nature significantly increases scientific correctness as it does not produce abstractive text.

Reference auto-generation
The last part of the machine-generated book is a list of references that comprises a bibliography. We include a reference auto-generator in ChemDataWriter for each publisher (the RSC, Elsevier, Springer) to produce an academic-style reference list. The format is "authors, title, journals, date, volume, issue, page, DOI". The relevant reference information is extracted from the metadata of the HTML/XML file. If users provide their own paper files, we suggest that they also provide the reference data in the correct format.

Results and discussion
We performed three case studies to evaluate the performance of ChemDataWriter. In case study 1, we assessed the performance of the topic-modelling stage of ChemDataWriter, where topics were extracted from three corpora of papers about battery research. Case study 2 compared and contrasted different paraphrasing models to improve the quality of book writing and to avoid potential plagiarism issues. Our third case study focused on providing scientific insights into an entire book about recent battery research that ChemDataWriter auto-generated.

Case study 1: suggesting chapter titles based on the topic modelling capabilities of ChemDataWriter
ChemDataWriter suggests the potential content to be written, using the BERTopic model to extract features and identify topics that are present in the original text. In this case study, we inputted three different corpora of scientific papers into ChemDataWriter to test the performance of the topic model: (1) the entire corpus of battery-research papers with title, abstract, and conclusion, (2) the full-text corpus of papers about Na-ion batteries, defined such that "Na-ion" is part of the title of a paper, and (3) the full-text corpus of papers about "Li-metal" batteries. Each topic consists of five words, and the minimum number of documents per topic was set as 10.
Table 1 lists the corpus size, the number of extracted topics, and two evaluation metrics (topic coherence and topic diversity) for each data set. Topic coherence is calculated based on the normalised pointwise mutual information (NPMI). 60 The topic-coherence score ranges from −1 to 1, where a high coherence score close to 1 means that the words in a topic are semantically similar. In contrast, a coherence score of −1 indicates that words are semantically dissimilar, while a zero coherence score means that no clear semantic relationship has been found within a topic. Topic diversity is the percentage of unique words across all topics within a data set, in which a higher score indicates that topics are varied and words do not overlap between classes of topics. Table 1 shows that the topic-coherence and topic-diversity metrics are similar across different data sets. In general, a high topic-coherence score ensures that our tool conveys information in a clear and understandable manner, while a high diversity is often required when the objective is to generate more creative content. Compared to the reported topic coherence and topic diversity scores in the original BERTopic paper, 47 the best topic coherence score in our study (0.178) is only slightly lower than the best value in the original paper (0.192), which indicates a reasonable coherence of the generated scientific topics. However, the best topic diversity score is much lower (0.669 < 0.886), which is as expected, as ChemDataWriter's ability to convey information is more valued than its creative abilities in the generation of scientific documents. In contrast, the number of generated topics is more varied owing to the difference in the input corpus size and paper types. For example, the entire mixed battery-paper corpus can generate 141 topics, whereas only 26 topics can be found for the Li-metal battery-paper corpus. Even though it involves 7 million words, the Na-ion battery-paper corpus only generates 58 topics, far fewer than the full corpus, which possesses 2.34 million more words.
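The two metrics can be illustrated with a short sketch. Topic diversity follows the definition given above. For NPMI, the sketch estimates word probabilities from document-level co-occurrence, which is a simplifying assumption (evaluation toolkits often use sliding windows over a reference corpus), and the toy topics and documents are hypothetical. The coherence of a topic would then be the mean NPMI over its word pairs.

```python
import math
from typing import List, Set


def topic_diversity(topics: List[List[str]]) -> float:
    """Fraction of unique words among all topic words."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


def npmi(word_a: str, word_b: str, documents: List[Set[str]]) -> float:
    """NPMI of two words, estimated from document-level co-occurrence counts."""
    n = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n
    p_b = sum(word_b in doc for doc in documents) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0  # the words co-occur in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


topics = [["sulfur", "polysulfide", "cathode"], ["sodium", "anode", "cathode"]]
documents = [
    {"sulfur", "polysulfide"},
    {"sulfur", "polysulfide"},
    {"sodium", "anode"},
    {"sodium", "cathode"},
]
diversity = topic_diversity(topics)
print(diversity)
print(npmi("sulfur", "polysulfide", documents))
print(npmi("sodium", "anode", documents))
print(npmi("sulfur", "sodium", documents))
```

In this toy example, "cathode" appears in both topics, so five of the six topic words are unique; word pairs that always co-occur reach the upper NPMI bound of 1, and pairs that never co-occur take the lower bound of −1.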
The importance of the data set that is used for input is also reflected in Fig. 2, in which representative topics in each corpus are illustrated. From the entire corpus of battery papers, we can observe that the topic-modelling process found some topics that are varied and diverse, such as "Li-S batteries", "supercapacitors", "Li-O2 batteries", and "SOC estimation"; taking these topics as an example, one choice would be to write a book about the different kinds of battery applications. For the Na-ion and Li-metal battery-paper corpora, specific topics were more likely to be found, such as a particular material (TiO2, MoS2, ScO2) or an application (electrolyte, cathode, impedance). ChemDataWriter offers suggestions of topics, based on the results of the topic-modelling process, and, by default, lets users define the titles of each chapter in the auto-generated research book. However, it can also produce a table of contents entirely automatically using the list of topic words as titles directly.
This case study demonstrates that hidden topics can be found using the BERTopic model, based on which ChemDataWriter can suggest the potential book content that can be auto-generated. The topic model works especially well on a large, diverse text-based data set, such as the full corpus of battery papers. However, this approach also has several weaknesses, which is why this step is optional in ChemDataWriter by default. First, the long computational time required for the topic-modelling stage prevents ChemDataWriter from finding optimal results, as the model needs a few hours to realise the necessary inference from an input corpus that contains around 10 million words. The long-running execution time makes it impossible to fine-tune parameters in each stage of the BERTopic modelling operational process (document embedding, dimensionality reduction, clustering, and generating topic representations). In addition, since the BERTopic model only finds topics in terms of the importance of words, as judged by their frequency of appearance in the text, a list of topics may contain words that are very different from each other. In Fig. 2c, for example, the first topic list consists of both broad topics about Li-S batteries (sulphur, lithium-sulphur, Li batteries) and very specific topics about the material polysulfide (polysulfides, polysulfide). This issue can make topics difficult to interpret; as such, the default option, to use human intervention in helping to choose section topics, will tend to afford an auto-generated document with a higher-quality output. Creating a more diverse table of contents is a subject in its own right that is still under development.

Case study 2: introducing paraphrasing control to reduce text similarity
We mitigated the issue of conservative extractive summarisation, in that the summary can be very similar to the original text, by adopting a "back-translation" strategy to paraphrase text that has been summarised. This strategy follows the notion that one can keep the meaning of a sentence but alter its original wording by translating it into a foreign language and back again into English text. We tested five language models that are used as machine-translation tools for back-translation, all of which convert English text to and (back) from a foreign language; specifically, we employ French, Russian, Arabic, and two German machine-translation models (trained on different data sets). 61 These models have been developed by the Helsinki-NLP group on open world translation data sets, 61 except for the German2 model, which was the same as the German model except that it was trained on a different data set: Facebook's news translation data set. 62
The performance of our paraphrasing models was evaluated using two well-accepted automatic quantitative evaluation metrics: the Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score. Both metrics compare the generated text with a reference text from a gold-standard data set. BLEU compares two texts by counting the number of words in the generated text that appear in the reference, where a high precision is preferred to a high recall. By contrast, ROUGE is a recall-oriented metric that checks how much text in the reference also occurs in the generated text. Common ROUGE metrics include ROUGE-1 (unigram/individual words co-occurrence statistics), ROUGE-2 (bigram overlap), and ROUGE-L (longest common subsequence, LCS). The LCS can be calculated for any pair of strings. For example, the LCS for "abcde" and "ace" would be "ace" with a length of 3. The BLEU and ROUGE scores range between 0 and 1, but are often represented as percentages with a range from 0 to 100, as is shown in Table 2.
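The LCS example above, and a word-level ROUGE-L score built on it, can be reproduced with a short dynamic-programming sketch. The function names and the example sentences are assumptions, and the ROUGE-L variant shown (LCS-based precision, recall, and F1 over whitespace tokens) is a simplified illustration rather than the exact scorer used for Table 2.

```python
from typing import Dict, Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(generated: str, reference: str) -> Dict[str, float]:
    """Word-level ROUGE-L precision, recall, and F1 based on the LCS."""
    gen, ref = generated.split(), reference.split()
    lcs = lcs_length(gen, ref)
    precision = lcs / len(gen) if gen else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(lcs_length("abcde", "ace"))  # 3, as in the example in the text

scores = rouge_l(
    "the cathode shows high capacity",
    "the cathode material shows a high capacity",
)
print(scores)
```

In the second example, all five generated words appear in order in the seven-word reference, so the precision is 1 and the recall is 5/7.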
Finding a suitable gold-standard data set of reference text determines whether or not the evaluation process can indeed reflect the model behaviour. In order to test the performance of paraphrasing in the scientific domain, we evaluated our models on the ParaSCI data set, the first large-scale scientific paraphrase data set that has been extracted mainly from arXiv papers. 63 The ParaSCI data set contains common patterns and characteristics of scientific papers. A high BLEU and ROUGE score could indicate that the paraphrasing models can work well in the scientific area of interest.
Table 2 lists the performance of our back-translation paraphrasing models on the ParaSCI data set. The English-to-German and German-to-English model called "German" scores the highest on all metrics, slightly better than the English-French back-translation model. Russian and Arabic paraphrasing models showed much worse performance due to the larger language difference between each of them and the English language. This result is also consistent with our evaluation of the battery-related text summarisation (Fig. 3 and 4). Both examples demonstrate that paraphrasing models using German and French only change several words or phrases compared to the original text, while there are major differences between the outputs of the Russian and Arabic language models and their original text. In addition, the Russian and Arabic back-translation models are more likely to produce paraphrased text with incorrect meanings, especially when a scientific term is mentioned (e.g. "insulate nature", "the shuttle effect", "LiPS").
We also observed that paraphrasing models can perform differently even when using the same foreign language during back-translation, if they are trained on different data sets. We selected the best performing model ("German") to investigate its model behaviour when trained on a different data set: Facebook's news translation data set. 62 As is shown in Table 2, the BLEU and ROUGE scores of the alternative model, which we call "German2", are lower than both the "German" and "French" back-translation models. However, "German2" also showed more differences between the paraphrased and original text (Fig. 3 and 4), while the scientific meaning still remains correct. Therefore, if users want to find models that differ more substantially from the original text, they could test this text on paraphrasing models which feature the same back-translation language but have been trained on different data sets.
To summarise, our paraphrasing models enable ChemDataWriter to produce text that differs significantly from the extractive text summary to reduce the risk of inadvertent plagiarism that has been reported in the use of other text-generation tools. In particular, the English-German back-translation model, "German", showed the best performance on the ParaSCI data set. Users can easily control the paraphrased output by changing the language models or the training sets that are associated with the back-translation process. However, problems also exist in this approach, such as a relatively high similarity between the original text and the paraphrased output. This text similarity issue is inevitable due to the nature of the extractive summarisation algorithm. A hybrid approach that employs a rule-based and transformer-based model could further improve model performance in this regard, but that would involve considerable human input, while our objective is to focus on achieving an automatic pipeline without any human effort. Abstractive summarisation can also mitigate the similarity problem, but the development of this methodology is not yet sufficiently mature to produce reliable scientific content.

Case study 3: analysis of an example auto-generated book
In this case study, we will analyse an example of a research book that ChemDataWriter has auto-generated. This book is entitled "Literature Summary of Recent Research About Na-ion, Li-S, and Li-O2 Battery Materials". It was generated by summarising text from 152 scientific papers about battery materials. These papers had been down-selected from 25 736 research papers about battery materials, each of which had been classified as relevant battery papers by the prompt-based binary classifier. Each file must also include the necessary information to auto-generate a book, including a valid title, abstract, introduction, conclusion, and bibliography. As a result, 152 scientific papers were extracted to compose the final book about the three battery applications. A copy of the full book can be found in the ESI. † Note that we have also obtained explicit permission from the publishers of the 152 papers to allow us to reproduce textual content based on the work of the original authors, just to safeguard ChemDataWriter from any inadvertent plagiarism.
Fig. 5 shows the high-level table of contents of this book, which has three main parts (Na-ion batteries, Li-S batteries, and Li-O2 batteries). Each part consists of three chapters on cathode materials, anode materials, and electrolytes. The auto-selection of these chapter contents follows the popular writing style of review articles on battery materials, such as a literature review of sodium-ion batteries. 64 The difference between machine-generated and human-written books lies in the titles of the sub-sections of each chapter. While ChemDataWriter can only name sub-sections according to the original titles of papers, humans can summarise them more abstractly. For example, Hwang et al. named the sub-sections of "Anode materials" in terms of the reaction mechanisms: insertion materials, conversion materials, and alloying reaction materials. 64 Achieving this high-level title generation using machines requires the further development of NLP algorithms that can understand and uncover the hidden meaning of scientific text.
The introduction of each book chapter was summarised from the abstract of each paper that is cited in that chapter. Fig. 6 shows one of these abstracts together with a paragraph of summarised text that is afforded as part of the auto-generated book. 65 The first three sentences in the original form of this abstract were extracted as sentences that contain important information, such as the background, current issue, and the main objective of the paper. The sentence structure and several words were changed by the paraphrasing model, and the term "we" was automatically transformed into "the authors" by ChemDataWriter. The "Conclusion" section in each chapter was written in the same way as the Introduction.
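The replacement of "we" by "the authors" can be illustrated with a short regular-expression sketch. This is a hedged illustration rather than the rule that ChemDataWriter actually applies: the function name `first_to_third_person` is hypothetical, and only the single pronoun mentioned in the text is handled.

```python
import re


def first_to_third_person(sentence: str) -> str:
    """Replace the standalone pronoun 'we'/'We' with 'the authors'/'The authors'."""

    def replace(match: re.Match) -> str:
        # Preserve the capitalisation of the matched pronoun.
        return "The authors" if match.group(0)[0].isupper() else "the authors"

    # \b word boundaries stop the pattern from matching inside words such as "new" or "Web".
    return re.sub(r"\b[Ww]e\b", replace, sentence)


example = "We report a new anode, and we test it in Na-ion cells."
rewritten = first_to_third_person(example)
print(rewritten)
```

Because "we" and "the authors" are both plural, the verb that follows needs no change. Contractions such as "we're" and related pronouns such as "our" are deliberately not covered by this sketch.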
More detailed scientific content of the input papers is provided in the second section, "Literature reviews", where the content for each chapter is summarised from the introduction of each original paper. The "Literature reviews" section consists of multiple sub-sections, each representing the summary of an individual paper. The title of each sub-section is a simplified version of the full title of each paper, produced by the transformer-based title-generation algorithm. 58 For example, the paper title "Layered tin sulphide and selenide anode materials for Li- and Na-ion batteries" can be simplified by ChemDataWriter to the sub-section title "Layered tin sulphide and selenide anode materials". Since the title-generation algorithm was not trained on a data set of scientific text, 59 there is a risk that the simplification process could cause the loss of crucial information in a full title. Hence, users may need to refer to the full-title information and other metadata in the last part of the book: the auto-generated references.
The number of sub-sections is determined by human input. In the example book presented herein, the maximum number of sub-sections is the default value, 30. However, the actual number of sub-sections in "Literature reviews" is much lower than 30, owing to the conservative filtering method whereby only papers whose title includes the query word are summarised. In this way, we ensure that each whole paper discusses the specific topic, e.g. anodes of Na-ion batteries. Once candidate papers are found, they are re-ranked according to their relevance scores before being written into the book.
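The selection logic described in this paragraph (title filter, re-ranking, and a cap on the number of sub-sections) can be sketched as follows. This is a minimal illustration under stated assumptions and not ChemDataWriter's code: the `Paper` class, the `select_papers` function, and the numerical relevance scores are hypothetical, and applying the cap after re-ranking is our assumption about the order of operations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paper:
    title: str
    relevance: float  # hypothetical relevance score; higher means more relevant


def select_papers(papers: List[Paper], query: str, max_subsections: int = 30) -> List[Paper]:
    """Keep papers whose title contains the query word, then re-rank and cap them."""
    query_lower = query.lower()
    # Conservative filter: the query word must appear in the title itself.
    matching = [p for p in papers if query_lower in p.title.lower()]
    # Re-rank the surviving candidates by relevance, most relevant first.
    ranked = sorted(matching, key=lambda p: p.relevance, reverse=True)
    # Never exceed the user-defined maximum number of sub-sections.
    return ranked[:max_subsections]


papers = [
    Paper("Hard carbon anode for Na-ion batteries", 0.7),
    Paper("Cathode design for Li-S cells", 0.9),
    Paper("Alloy Anode materials: a study", 0.8),
]
selected = select_papers(papers, "anode")
```

The sketch makes the trade-off visible: a paper that is highly relevant but does not name the query word in its title (the second entry above) is excluded, which is why the final number of sub-sections can fall well below the maximum of 30.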

Overall, ChemDataWriter provides an approach to automatically generate books that review a scientific area by merging text summaries from multiple papers on a given topic. Users only need to input the candidate paper corpus and, if they wish, a list of topics to form a table of contents, based on which ChemDataWriter can write a book in a completely unsupervised way. The current logic of automatically writing a book is to merge the single-document summaries of individual papers. However, we also see opportunities to further improve the text summarisation model by introducing multi-document summarisation. Large-scale data sets such as Multi-XScience have been created recently so that machines can auto-generate a single summary paragraph from multiple paper sources. 66 Multi-document summarisation models are still under development, and we expect that more domain-specific data sets will be created, in due course, that can improve text-summarisation performance in scientific areas. However, the creation of a custom annotated data set is a very time-consuming process that requires careful curation and substantial human resources to ensure quality and consistency; it is essentially a major research study in its own right. We expect further advances in this area once more data have been curated and the technology matures.

Conclusions and outlook
ChemDataWriter is the first open-source transformer-based toolkit for auto-generating books that summarise research in the area of chemistry. The book-writing pipeline involves conservative text summarisation approaches to ensure the correctness and trustworthiness of the auto-generated book. The book-generation process is implemented in a completely unsupervised way, where users only need to provide the corpus of research papers in order to generate a literature summary. ChemDataWriter can identify hidden topics from a large corpus of data and suggest the book content in terms of extracted topics. ChemDataWriter embeds a "back-translation" model to paraphrase the summarised text in order to alleviate the text similarity issue. We believe that ChemDataWriter has the potential to help scientists accelerate their literature-searching and screening processes. Researchers can also find the most recent and relevant information about research progress in a specific field using our toolkit.
While ChemDataWriter offers promising potential in scientific research, it is incumbent upon us to use such tools responsibly. There are two key issues to address: mitigating plagiarism and the role of review writing as a form of education.
As mentioned earlier in this paper, our tool has the potential to be misappropriated as a plagiarism mechanism. The paraphrasing style that ChemDataWriter produces, as demonstrated in Fig. 3 and 4, could be reasonably viewed as patchwork plagiarism (sometimes called mosaic plagiarism). 67,68 Such plagiarism is considered to be just as wrong as any other form of plagiarism.
It is therefore crucial that any users of ChemDataWriter solicit a priori explicit permission to reproduce text from the papers that they provide as input to our tool, as we have done through the Copyright Clearance Centre 69 for all 152 papers that fed the case study that produced our book. Otherwise, the result could be construed as plagiarism, as we exemplify by showing the patchwork plagiarism that would appear by applying a plagiarism checker 70 to Chapter 4 of our book, had we not sought and obtained reproduction rights (see ESI†). While we appreciate that it might feel laborious to seek such reproduction permissions for so many papers, it is imperative since it is the only way to ensure that the result is lawful; besides, this laborious process is still far quicker than writing the review manually.
Considering the role of review writing as a form of education, ChemDataWriter does not stop this process, although it may of course tempt some to avoid learning via this type of education. It is perhaps better to view this as responsible AI in the sense that AI can actively assist humans in writing a review. Indeed, the intention behind ChemDataWriter is not to supplant human authors, but to support them in navigating and consolidating vast amounts of data. While we recognise that paper reviewing can be an important process for a researcher who is new to the field, such as a PhD student, our tool equally enables many senior scientists, especially those in industry, to speed up the paper review process by implementing this latest technology to assist them. It normally requires several months for a human researcher to write a scientific review article in a given field, while ChemDataWriter can achieve this in only a few hours. The automation capabilities of our tool could relieve researchers from the burdensome and lengthy process of literature reviews, allowing them to dedicate more time to devising hypotheses, conducting experiments, and pursuing innovative ideas. Moreover, the advent of advanced AI tools like this can speed up the pace of scientific discovery and contribute to the democratisation of knowledge by making complex scientific literature more accessible.
More generally, the development and deployment of AI tools like ChemDataWriter are inevitable as we move forward in the era of AI. Thus, we consider it our responsibility to take the lead in presenting new tools, such as ChemDataWriter, together with clear recommendations on their use, as we have provided here, before others set inappropriate trends that could carry forward irrecoverably. The world has seen recent examples of this already. Indeed, the line between leveraging technology and maintaining human oversight remains a delicate balance that the scientific community must continue to navigate.
Regarding ChemDataWriter specifically, it is important to remember that its input is known and provided by the user. The user therefore has full control over their input decisions and their copyright choices, which determine their output. This is a key ethical difference from other tools that have been released in the public media recently. Moreover, we have released ChemDataWriter as an open-source tool, and have likewise provided its code, via this publication. This will allow others to expand upon our work as well as simply use it. In turn, this will democratise the development of these methods; a greater diversity of developers will encourage ethical behaviour and best practice through a collective effort in this research field.
Challenges still exist in terms of needing more data and more advanced models in order to further improve the book auto-generation pipeline. Most of the transformer-based models in our toolkit were trained on general English-language data sets, while the use of domain-specific data sets will likely enhance the performance of book-generation algorithms in generating scientific text. We also encourage the creation of more gold-standard chemistry data sets for the purpose of evaluation. In terms of models, multi-document summarisation models could be introduced in order to update the content organisation of an auto-generated book. In addition, a hybrid approach of transformer- and rule-based methods could syntactically and semantically improve the quality of the generated text, although such an approach would currently necessitate considerable human effort, thereby limiting its level of automation. Another direction for future work might be to incorporate creative writing in the book-generation process while preserving the trustworthiness of the text.

Fig. 1 Operational pipeline of ChemDataWriter, showing each of the seven steps; the specific models used in each stage are described as footnotes in each box.

Table 1
Fig. 2 Keywords in representative topics that were extracted from the corpus of research papers about (a) Na-ion batteries, (b) Li-metal batteries, and (c) all batteries.

Fig. 5 Table of contents for the example of the auto-generated book: "Literature summary of recent research about Na-ion, Li-S, and Li-O2 battery materials".

Fig. 6 Extractive text summarisation from the original abstract of an example paper to afford a paragraph of the Introduction section of the auto-generated book.

Table 2 BLEU and ROUGE scores (percentages) of paraphrasing on the ParaSCI data set. Paraphrasing models include five back-translation models: German, French, Russian, Arabic, and German2, which is trained on a different data set.

Digital Discovery © 2023 The Author(s). Published by the Royal Society of Chemistry