Zheren Wang† ab, Kevin Cruse† ab, Yuxing Fei ab, Ann Chia‡ a, Yan Zeng b, Haoyan Huo ab, Tanjin He ab, Bowen Deng ab, Olga Kononova§* a and Gerbrand Ceder* ab
a Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA. E-mail: olga_kononova@berkeley.edu; gceder@berkeley.edu
b Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
First published on 27th April 2022
Applying AI power to predict syntheses of novel materials requires high-quality, large-scale datasets. Extraction of synthesis information from scientific publications is still challenging, especially for extracting synthesis actions, because of the lack of a comprehensive labeled dataset using a solid, robust, and well-established ontology for describing synthesis procedures. In this work, we propose the first unified language of synthesis actions (ULSA) for describing inorganic synthesis procedures. We created a dataset of 3040 synthesis procedures annotated by domain experts according to the proposed ULSA scheme. To demonstrate the capabilities of ULSA, we built a neural network-based model to map arbitrary inorganic synthesis paragraphs into ULSA and used it to construct synthesis flowcharts for synthesis procedures. Analysis of the flowcharts showed that (a) ULSA covers essential vocabulary used by researchers when describing synthesis procedures and (b) it can capture important features of synthesis protocols. The present work focuses on the synthesis protocols for solid-state, sol–gel, and solution-based inorganic synthesis, but the language could be extended in the future to include other synthesis methods. This work is an important step towards creating a synthesis ontology and a solid foundation for autonomous robotic synthesis.
Scientific text mining has gained recognition in the past few years,4–7 providing the materials science community with datasets on a variety of materials and their properties8–10 as well as synthesis protocols.11–14 Nonetheless, a majority of these text mining studies have been focused on extracting chemical entities such as material names, formulas, properties, and other characteristics.15–19 There have only been a few attempts to extract information about chemical synthesis and reactions and compile them into a flowchart of synthesis actions.20–28 Hawizy et al.20 were early developers of such extraction, using a combination of rule-based regular expressions (regex)29 and syntax tree parsing to identify and classify action phrases in their tool, ChemicalTagger. This approach shows very good performance on organic synthesis procedures. Vaucher et al.21 used a combination of rule-based approaches and machine learning models trained on over 2 million procedural sentences to extract synthesis actions from organic chemistry patent texts and map them into well-defined language schemas. We found this work to be one of the most robust and accurate in describing organic synthesis procedures. Mehr et al.22 developed a semi-automated workflow that uses NLP-based approaches to translate human-written text into an internal Chemical Description Language (so-called XDL) and then map it into robotic operations. To the best of our knowledge, this is the only work that applied the developed synthesis ontology to robotic synthesis for organic molecules. Mysore et al.23 paved the way for synthesis action graph extraction from the inorganic synthesis text. For this, they applied several neural network-based models and used dependency tree parsing to combine the extracted information into synthesis graphs. Similarly, Kuniyoshi et al. used bi-LSTM combined with BERT word embeddings to construct synthesis graphs for solid-state battery fabrication,24 which showed excellent results on the extraction of operations using the science literature-specific SciBERT pretrained language model.
As is apparent from the above survey, the automation of synthesis procedures for organic molecules has made significant progress. This is mainly due to the facts that (a) organic synthesis is more deterministic and hence more common in materials science and biochemical domains, and (b) there exist large-scale databases and repositories of organic reactions30,31 and annotated sets32,33 that help to speed up development of the machine-learning approaches for interpretation and prediction of organic synthesis. Even with such data availability, to the best of our knowledge, there have been only a few attempts to create a publicly available annotated corpus containing materials synthesis protocols extracted from the text.13,21,25 The dataset created by Mysore et al.13 contains 230 labeled synthesis paragraphs with labels assigned to material entities, synthesis actions, and other synthesis attributes for inorganic synthesis, and is freely available to users. The dataset used by Vaucher et al.21 was obtained by augmenting the existing Pistachio dataset34 of organic synthesis procedures, and is available upon request. Kuniyoshi et al.25 annotated an in-house dataset of inorganic materials synthesis entities that is publicly available.
A major obstacle in annotating synthesis actions in the text corpora is the lack of a solid, robust, and well-established ontology for describing synthesis procedures in materials science.35 Indeed, researchers prefer to vaguely sketch “methods” sections of the manuscript in general human-readable language rather than follow a specific protocol. This significantly impacts reproducibility of the results, not to mention ambiguity in understanding even when read by a human expert.35 While such ambiguity is inconvenient for human readers, the growing interest in automated AI-guided materials synthesis demands a robust and unified language for describing synthesis protocols in order to make them applicable to autonomous robotic platforms.22,36,37
The previous works describing inorganic synthesis action extraction from the text23,24 have laid the groundwork for extending such methods to this materials field, and have made their datasets available for interested researchers; however, neither provides an ontology for the actions that their models extract. Although development of synthesis action extraction from the text in organic chemistry has significantly accelerated and some groups have developed specific ontologies21,22 for such vocabulary, we found that the existing models do not transfer well to the inorganic synthesis space due to the disparate natures of these two approaches. For example, we found that vocabulary unique to inorganic synthesis, like sintering and calcining, would be frequently misclassified. Additionally, existing models with developed ontologies do not include tags for important inorganic synthesis actions like shaping of samples into pellets.
In this work, we discuss a potential approach to the problem of inorganic synthesis ontology based on creating a unified language of synthesis actions (ULSA). We demonstrate an application of this approach in describing solid-state, sol–gel, precipitation, and solvo-/hydrothermal synthesis procedures, which cover the majority of inorganic synthesis procedures.38,39 Specifically, we created a dataset of 3040 synthesis sentences labeled according to the ULSA schema and trained a neural network-based model that identifies a sequence of synthesis actions in a paragraph, maps them into the ULSA, and builds a graph of the synthesis procedure (Fig. 1). We applied this model to thousands of synthesis paragraphs and analysed the resulting synthesis graphs. The obtained results show that our ULSA vocabulary is comprehensive enough to obtain high-accuracy extraction of synthesis actions as well as to identify the important features of each of the aforementioned synthesis types. Additionally, the ULSA as it is encoded in the labeled dataset can be easily customized and augmented to account for other inorganic synthesis methods. The dataset and the scripts for building such a synthesis flowchart are publicly available. We anticipate these results will be widely used by researchers interested in scientific text mining and will help (a) to achieve a breakthrough in predictive and AI-guided autonomous materials synthesis and (b) to build a robust materials synthesis ontology.
• Starting: a word or a multi-word phrase that marks the beginning of a synthesis procedure. Specifically, this often indicates which materials will be produced. For example: “PMN-PT was synthesized by the columbite precursor method”, “solid-state synthesis was used to prepare the target material”, “the powder was obtained after the aforementioned procedure”.
• Mixing: a word or a multi-word phrase that marks the combination of different materials (in a solid or liquid phase) to form one substance or mass. For example: “precursors were weighted and ball-milled”, “precursors were mixed in appropriate amounts”, “Sb2O3 is added to the solution”, “the solution was neutralized”, “the mixture was stabilized by the addition of sodium citrate”.
• Purification: a word or a multi-word phrase that marks the separation of the sample phases. This also includes drying of a material. For example: “samples were exfoliated from substrates”, “the liquid was discarded and the remaining product was filtered off and washed several times with distilled water”, “the precursors were heated in order to remove the moisture”, “the precipitate was collected by washing the solution in distilled water”.
• Heating: a word or a multi-word phrase that marks increasing or maintaining high temperature for the purpose of obtaining a specific sample phase or promoting a reaction rather than drying a sample. For example: “the powder sample was annealed to obtain a crystalline phase”, “the mixture was subjected to heating at 240 °C for 24 h”.
• Cooling: a word or a multi-word phrase that marks rapid, regular, or slow cooling of a sample. For example: “the product was cooled down to room temperature in the furnace”, “the sample was quenched rapidly in the solid CO2”, “the product was left to cool down to room temperature”.
• Shaping: a word or a multi-word phrase that marks the compression of powder or forming the sample to a specific shape. For example: “the powder was pressed into circular pellets”, “the powder was then pelletized with a uniaxial press”.
• Reaction: a word or a multi-word phrase that marks a transformation without any external action. For example: “the sample was left to react for 6 h”, “the temperature was kept at 1000 K”, “the solution was maintained at 200 K for 12 h”.
• Non-Altering: a word or a multi-word phrase that marks an action done on a sample that either does not induce any transformation of the sample or does not belong to any of the above classes. For example: “the pellets were placed in a sealed alumina crucible”, “the reaction vessel was wrapped with aluminum foil”, “the sample was sealed in a tube”, “the gel was transferred to an oven”.
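The eight action terms above form a closed vocabulary, so they can be represented as an enumeration. The following is a minimal sketch, not the authors' implementation: the names `UlsaAction` and `parse_tag` are hypothetical, and the exact tag strings used in the published dataset are an assumption.

```python
from enum import Enum
from typing import Optional


class UlsaAction(Enum):
    """The eight ULSA action terms listed above (class name is hypothetical)."""
    STARTING = "Starting"
    MIXING = "Mixing"
    PURIFICATION = "Purification"
    HEATING = "Heating"
    COOLING = "Cooling"
    SHAPING = "Shaping"
    REACTION = "Reaction"
    NON_ALTERING = "Non-Altering"


def parse_tag(tag: str) -> Optional[UlsaAction]:
    """Return the matching action term, or None when the tag is not a ULSA term."""
    try:
        return UlsaAction(tag)
    except ValueError:
        return None


print(parse_tag("Heating"))  # UlsaAction.HEATING
print(parse_tag("weighed"))  # None
```

Keeping the vocabulary in one place makes it straightforward to extend if further action terms are introduced for other synthesis methods.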
The 535 paragraphs consisted of 3781 tokenized sentences.16 First, each sentence was classified as either related to synthesis or not related to synthesis. The latter case usually contains sentences about product characterization and other details. Next, we isolated 3040 synthesis sentences and assigned labels to each word or multi-word phrase in the sentence on the basis of the ULSA protocol with annotation schema described in Section 2.1. Only words and phrases describing synthesis actions were annotated. The final dataset consists of these 3040 labeled synthesis sentences. All annotations were performed using a custom Amazon Mechanical Turk-based server.
It is important to keep in mind that we mapped words into the terms of synthesis action per sentence, meaning that we used only information in the context of a given sentence to make a decision about the annotation of a word, rather than the whole paragraph. The reason for this choice is the multiple and diverse possibilities to combine and augment sentences leading to different meanings of the terms. The interpretation of the whole text or paragraph is an entirely separate field of research that is outside the scope of this work.
We chose to annotate those words that are characteristic of a synthesis procedure or result in the transformation of a substance. For example, in the sentence “the precursors were weighed and mixed,” the term “weighed” is not a synthesis action since it is to be expected in synthesis, while “mixed” is a synthesis action because it may have a specific condition and transform the sample, or can be preceded by calcination of the precursors in other syntheses. The exception to this rule is the Starting action. Even though terms belonging to this action do not bring any special information or explicit action to the synthesis, we chose to distinguish Starting actions because in a substantial number of cases they can serve as flags to separate multiple synthesis procedures from one another. An illustration of this situation is when precursors are prepared prior to synthesizing a target material, as in sol–gel synthesis.
For the annotation of Mixing synthesis actions, we did not differentiate between powder mixing, ball milling (grinding), addition of droplets, or dissolving of substances. In many situations, this precise definition depends on the solubility of reactants and mixing environment, as well as on other details of the procedure that are never explicitly mentioned in the text. We leave it up to the user to create their own application-based definitions of these Mixing categories. Nonetheless, in the application below we provide a rule-based example of how these types of synthesis actions can be identified in the text.
The Non-Altering action term was introduced to make room for those synthesis actions that are not typical or do not fall into any other category but nevertheless appear as a synthesis action within our definitions. While Non-Altering action terms can be easily confused with Reaction actions or non-actions, the decision depends on the sentence context, and the category can be extended or removed by a user as needed. Comparing “the sample was kept in the crucible” and “the sample was kept overnight,” the former is not a synthesis action while the latter should be considered an important synthesis step.
Ambiguous situations such as those mentioned above are ubiquitous in descriptions of syntheses. A substantial number of these situations occur when authors are wordy or use flowery language when writing the synthesis methods. Unfortunately, this often presents a challenge for accurate machine interpretation of the text. We accounted for some of these cases when annotating the data as described below.
First, implicit mentions of synthesis actions (i.e. when a past participle form of a verb is used as a descriptive adjective referring to an already processed material) are the most frequent source of confusion. We chose to annotate these as synthesis actions. For example: “the calcined powder was pressed and annealed.” In this sentence, the descriptive adjective “calcined” could be either a restatement of the fact that there was a calcination step or it could be additional information which had not been mentioned previously. These situations can be later resolved with a rule-based approach, hence we leave it as a task for users of the data.
The situation when a method is specified along with the synthesis action is also common. In a phrase of the form “transformed by a specific procedure,” we consider only the key action (the transformation) as a synthesis action. For example: “the precipitates were separated by centrifugation.” When required, the method can be retrieved with a set of simple rules.
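As an illustration of what such a set of simple rules might look like, the sketch below pulls the method phrase that follows an action. The rule itself (the method follows “by”, “via”, or “using” and runs to the next punctuation mark) and the names `METHOD_PATTERN` and `extract_method` are assumptions made for this example, not rules taken from the annotated dataset.

```python
import re
from typing import Optional

# Hypothetical rule: the method is the phrase that follows "by", "via" or "using"
# and runs up to the next comma, period or semicolon (or the end of the sentence).
METHOD_PATTERN = re.compile(r"\b(?:by|via|using)\s+([^,.;]+)", re.IGNORECASE)


def extract_method(sentence: str) -> Optional[str]:
    """Return the method phrase attached to an action, or None if no rule fires."""
    match = METHOD_PATTERN.search(sentence)
    return match.group(1).strip() if match else None


print(extract_method("The precipitates were separated by centrifugation."))
# centrifugation
```

A rule this simple will cut phrases short at decimal points and will also fire on agents (“prepared by the authors”), so a practical version would need a vocabulary of known methods or a syntactic check.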
Redundant action phrases are also abundant in many descriptions of the procedures. In a phrase of the form “subjected to a process”, we considered only the processing verb as a synthesis action. For example: “the samples were subjected to an initial calcination process.”
Finally, phrases that state the purpose of the action, such as “left to react”, “brought to a boil”, “heated to evaporate,” are considered as one synthesis action. This is done to give users the flexibility to decide how to treat these cases.
{
  "annotations": [
    {
      "tag": token_tag,
      "token": token
    }
  ],
  "sentence": sentence
}
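A record in this format can be read with a standard JSON parser. The sketch below is illustrative only: the example record is invented, and both the tag strings and the use of an empty string for tokens without an action are assumptions about how the dataset encodes them.

```python
import json

# Hypothetical record in the format shown above; the tag strings, and the use of
# an empty string for tokens that carry no action, are assumptions.
record_text = """
{
  "annotations": [
    {"tag": "", "token": "The"},
    {"tag": "", "token": "powder"},
    {"tag": "", "token": "was"},
    {"tag": "Shaping", "token": "pressed"},
    {"tag": "", "token": "and"},
    {"tag": "Heating", "token": "annealed"}
  ],
  "sentence": "The powder was pressed and annealed"
}
"""


def action_tokens(record: dict) -> list:
    """Return (tag, token) pairs for every token that carries a non-empty tag."""
    return [(a["tag"], a["token"]) for a in record["annotations"] if a["tag"]]


record = json.loads(record_text)
print(action_tokens(record))
# [('Shaping', 'pressed'), ('Heating', 'annealed')]
```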
The repository also contains a script for training a bi-LSTM model that can be used to map words into action terms. Users are not limited to using only the provided dataset, but can augment their usage with other labeled data as long as they satisfy the data format described above. Finally, we also share scripts used for the inference of synthesis actions terms and for building synthesis flowcharts for a list of paragraphs. Examples of model application are available as well.
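To make the idea of a synthesis flowchart concrete, the sketch below links an ordered list of action terms into directed edges and counts the transitions. This is one possible encoding under our own assumptions; the function names are hypothetical and the shared scripts may represent flowcharts differently.

```python
from collections import Counter


def build_flowchart(actions):
    """Link consecutive actions into the directed edges of a linear flowchart."""
    return list(zip(actions, actions[1:]))


def transition_counts(flowchart):
    """Count how often each (source, target) pair of action terms occurs."""
    return Counter(flowchart)


# Hypothetical action sequence for a simple solid-state procedure.
actions = ["Starting", "Mixing", "Heating", "Reaction", "Cooling", "Mixing"]
flowchart = build_flowchart(actions)
for source, target in flowchart:
    print(f"{source} -> {target}")
```

Counting transitions rather than storing raw sequences gives a fixed-size description of each paragraph, which is convenient when many flowcharts are compared with one another.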
Item | Amount |
---|---|
Paragraphs used for annotation | 535 |
Per synthesis type | |
Solid-state synthesis | 199 |
Sol–gel synthesis | 51 |
Solvo-/hydrothermal synthesis | 148 |
Precipitation | 137 |
Total sentences | 3781 |
Synthesis sentences | 3040 |
Action tokens | 5547 |
Per action category | |
Starting | 619 |
Mixing | 1853 |
Purification | 1080 |
Heating | 973 |
Cooling | 259 |
Shaping | 225 |
Reaction | 232 |
Non-Altering | 306 |
To probe the robustness of ULSA and our annotation schema, we asked 6 human experts to annotate the same paragraphs in our dataset and used Fleiss' kappa score to estimate the inter-annotator agreement between the annotations.43 In general, the Fleiss' kappa score measures, on a scale from −1 to 1, the degree to which different annotators agree with one another beyond the agreement expected by pure chance. A positive Fleiss' kappa indicates agreement above chance level (the closer to 1, the stronger the agreement), scores close to zero indicate near randomness in categorization, and negative scores indicate conflicting annotations. This is a generalized reliability metric and is useful for agreement between three or more annotators across three or more categories. Table 2 lists the Fleiss' kappa scores for agreement between human experts annotating the sentences according to the schema described in Section 2.1. The table shows good agreement on distinguishing synthesis sentences from non-synthesis sentences, as well as for all and for each individual synthesis action, including non-actions. The agreement across all action terms is 0.83. Among those, the action terms with the lowest scores are Shaping (0.59) and Non-Altering (0.45). The low score for Non-Altering is expected since a wide range of actions which do not induce a transformation in the sample could be mapped into this category. The Shaping action term can also be associated with many synthesis operations. For instance, granulating procedures that break a sample into smaller chunks could be considered a Shaping action; at the same time, a bench chemist could consider “granulation” to be a Mixing action term since it requires performing a grinding operation to obtain the new shape. Less ambiguous action terms, such as Heating and Mixing, showed higher agreement.
Annotation task | Fleiss' kappa score |
---|---|
Identification of synthesis sentences | 0.69 |
Action terms tagging | 0.83 |
Per action terms | |
Starting | 0.82 |
Mixing | 0.86 |
Purification | 0.79 |
Heating | 0.84 |
Cooling | 0.88 |
Shaping | 0.59 |
Reaction | 0.66 |
Non-Altering | 0.45 |
No action | 0.87 |
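For readers unfamiliar with the metric reported in Table 2, the sketch below computes Fleiss' kappa from its standard definition: the mean observed pairwise agreement per item, corrected by the agreement expected from the overall category shares. The toy counts are invented for illustration and are not the annotations behind Table 2.

```python
def fleiss_kappa(counts):
    """Compute Fleiss' kappa from an items-by-categories table of rating counts.

    counts[i][j] is the number of annotators who assigned item i to category j.
    Every item must be rated by the same number (at least two) of annotators.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")

    # Observed agreement: mean fraction of agreeing annotator pairs per item.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the overall share of each category.
    n_categories = len(counts[0])
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(s * s for s in shares)

    # Undefined (division by zero) when all ratings fall into a single category.
    return (p_bar - p_e) / (1 - p_e)


# Toy example: 3 tokens, 6 annotators, 3 categories, full agreement on every token.
perfect = [[6, 0, 0], [0, 6, 0], [0, 0, 6]]
print(fleiss_kappa(perfect))  # 1.0
```

When the six annotators split evenly on every item, the same function returns a negative value, which corresponds to the "conflicting annotations" case described above.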
Model | Precision | Recall | F1 score |
---|---|---|---|
Baseline 1 | 0.54 | 0.61 | 0.57 |
Solid-state | 0.53 | 0.72 | 0.61 |
Sol–gel | 0.57 | 0.75 | 0.65 |
Hydrothermal | 0.54 | 0.53 | 0.54 |
Precipitation | 0.55 | 0.50 | 0.53 |
Baseline 2 | 0.84 | 0.50 | 0.63 |
Solid-state | 0.84 | 0.54 | 0.66 |
Sol–gel | 0.79 | 0.62 | 0.69 |
Hydrothermal | 0.84 | 0.47 | 0.61 |
Precipitation | 0.84 | 0.44 | 0.54 |
Bi-LSTM | 0.90 | 0.88 | 0.89 |
Solid-state | 0.90 | 0.90 | 0.90 |
Sol–gel | 0.88 | 0.86 | 0.87 |
Hydrothermal | 0.90 | 0.86 | 0.88 |
Precipitation | 0.90 | 0.91 | 0.91 |
These results moved us toward considering a recurrent neural network model for mapping paragraphs into ULSA. It is generally accepted that recurrent neural networks (RNNs), and specifically bi-LSTMs, can effectively process sequential data and keep track of past events.44 Indeed, bi-LSTM is simple enough and does not require exhaustive training and fine-tuning, as is common for BERT45 and GPT models.46–48 The bi-LSTM model combined with word embeddings (Section 2.4.3) was trained on the labeled dataset of 3040 sentences. The bi-LSTM model significantly improves mapping accuracy, yielding an F1 score of approximately 0.9 (0.89 overall; Table 3). It is important to note here that all the metrics for baseline and neural network models were computed per sentence, i.e. we evaluated the whole sentence being mapped correctly rather than individual terms.
Table 4 shows the outputs of the baseline models and the bi-LSTM model for exemplary solid-state and hydrothermal synthesis paragraphs; the bi-LSTM model performs significantly better than the baseline models. There are a few reasons why the bi-LSTM model outperforms plain dictionary lookup. First, researchers use diverse vocabulary to describe synthesis procedures, hence a lookup table can never be exhaustive. For instance, “heating” can be referred to as “calcining”, “sintering”, “firing”, “burning”, “heat treatment”, and so on. In this case, a word embedding model helps to significantly improve the score even for those terms that have never appeared in the training set (e.g. “degas”, “triturate”). Second, whether a given verb counts as a synthesis action term depends largely on the context. Prominent examples are “heating rate”, “mixing environment”, “ground powder”, etc. That is well captured by the recurrent neural network architecture. Lastly, synthesis actions are denoted not only by verb tokens, but also by nouns, adjectives, and gerunds. This can also be learnt by the neural network better than by a set of rules.
Paragraph | Baseline 1 | Baseline 2 | Bi-LSTM |
---|---|---|---|
Target was synthesized by solid state reaction. Precursors were milled together with a mortar and pestle for 10 min. The mixture was then placed into a alumina crucible, heated to 1200 °C at a heating rate of 5 °C min−1 in ambient air, held at 1200 °C for 2 h, and then cooled to room temperature at 20 °C min−1. The resulting powder was then ground for 5 min to break up agglomerates | Starting (synthesized); Mixing (milled); Mixing (mixture); Heating (heated); Heating (heating); Cooling (cooled) | Starting (synthesized); Mixing (milled); Cooling (cooled) | Starting (synthesized); Mixing (milled); Heating (heated); Reaction (held); Cooling (cooled); Mixing (ground) |
Titanium isopropoxide and isopropanol were mixed and stirred for 30 min. Then, the solution was added to 40 mL nitric acid solution. The mixture was heated at 80 °C for 10 h under stirring, the resulting solution was transferred into Teflon-lined autoclave and kept at 200 °C for 24 h. The precipitates were washed thoroughly with distilled water and ethanol, and dried overnight | Mixing (mixed); Mixing (stirred); Mixing (added); Mixing (mixture); Heating (heated); Mixing (stirring); Reaction (kept); Purification (dried) | Mixing (stirred); Mixing (added); Heating (heated); Reaction (kept); Purification (dried) | Mixing (mixed); Mixing (stirred); Mixing (added); Heating (heated); Mixing (stirring); Reaction (kept); Purification (washed); Purification (dried) |
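The context problem behind the dictionary-lookup errors can be reproduced with a few lines of code. The sketch below is a deliberately naive lookup, written under our own assumptions: the `LOOKUP` table is hypothetical and the baselines evaluated in this work may be constructed differently. It tags “heating” in “heating rate” as an action because it ignores the surrounding words.

```python
import re

# Hypothetical lookup table; the baselines evaluated in the paper may differ.
LOOKUP = {
    "synthesized": "Starting",
    "milled": "Mixing",
    "mixture": "Mixing",
    "heated": "Heating",
    "heating": "Heating",
    "cooled": "Cooling",
}


def dictionary_lookup(sentence):
    """Tag every word found in LOOKUP, ignoring the surrounding context."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return [(LOOKUP[w], w) for w in words if w in LOOKUP]


sentence = "The mixture was heated to 1200 °C at a heating rate of 5 °C min−1"
print(dictionary_lookup(sentence))
# [('Mixing', 'mixture'), ('Heating', 'heated'), ('Heating', 'heating')]
```

Only “heated” is a genuine action here; “mixture” is a noun naming the sample and “heating” modifies “rate”. Separating these cases requires the sentence context that a sequence model has access to.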
In summary, we designed a neural network-based model that maps any synthesis paragraph into ULSA with high accuracy and significantly outperforms a plain dictionary lookup approach.
First, we observe that the verbs mapped into ULSA and hence representing synthesis actions are all grouped in the top-left corner of the projection. Indeed, analysis of the individual words in the rest of the space showed that those are the words that generally appear in synthesis paragraphs but do not carry any information about the synthesis procedure. For instance, these are verbs denoting characterization of a material (“detect”, “quantify”, “examine”, “measure”), naming of a sample (“denoted”, “referred”, “named”, “labeled”) or referring to a table or figure. The cluster of dots in the middle of the plot consists of words that were either mis-tokenized during text segmentation or mistakenly recognized as verbs by the spaCy algorithm. In the embeddings mapping, these words are replaced with the <UNK> token.
A second interesting observation is that the embeddings related to sintering (blue dots), pelletizing (purple dots) and re-grinding (orange dots) are all located next to each other. This agrees well with the fact that those actions together describe solid-state synthesis processes. In contrast, the verbs describing solution mixing (orange dots) are in close proximity to the verbs referring to purification, such as filtering or drying (green dots). Similarly, verbs indicating cooling processes (magenta dots) and the verbs referring to reaction processes (red dots) are clustered together. This agrees with the often encountered constructions of “left to react” or “kept and then cooled” describing the final steps of a given synthesis.
Taken together, these results demonstrate that (a) the embeddings model we created reflects well the similarity of the verbs used for synthesis descriptions and (b) the vocabulary of ULSA covers all common synthesis actions used in inorganic synthesis.
Fig. 4 displays the projection of the 1st and 2nd principal components. Each data point here corresponds to one synthesis paragraph, i.e. one synthesis flowchart. Different colors highlight different types of synthesis. A few observations can be made from the plot. First, the clusters of synthesis procedures are well separated and aggregated according to the synthesis types. Specifically, the data points corresponding to solid-state synthesis are narrowly clustered along a line with negative slope while the other synthesis types are spread more widely and the slope of their linear fit is positive. Second, the clusters of data points for precipitation and hydrothermal synthesis almost completely overlap and partially overlap with sol–gel synthesis, while the overlap with solid-state synthesis is negligible.
These two observations agree well with the standard procedures associated with each of the four synthesis types. Indeed, solid-state syntheses usually operate with mixing powder precursors, firing the mixture, and obtaining final products; sol–gel synthesis is considered as a solid-state synthesis with solution-assisted mixing of precursors; hydrothermal and precipitation syntheses usually involve preparation of the sample in solution, then filtering (Purification) to separate the liquid and obtain the final product instead of including a firing step.
To get further insights, we manually sampled and compared synthesis procedures corresponding to the data points along each of the fitted lines. The results show that the 1st principal component correlates with the involvement of Solution Mixing for precursors in synthesis procedures. In other words, the larger a coordinate the data point has along the 1st principal component, the more steps of dissolving and mixing precursors in solution as well as Purification that data point involves. This agrees well with the fact that solid-state synthesis mostly operates with powders while hydrothermal and precipitation procedures are solution-based procedures, and sol–gel syntheses exist in between.
The 2nd principal component corresponds to the level of complexity of the synthesis procedure. The larger and more positive the data point along the 2nd principal component, the more steps are involved in the synthesis process. Interestingly, all four synthesis types exhibit simple synthesis procedures (fewer steps) and complex synthesis procedures (many steps). Nonetheless, solid-state synthesis has the largest deviation along the 2nd principal component compared to hydrothermal and precipitation synthesis since solid-state procedures can involve multiple heating and re-grinding steps for the sample to obtain the desired phase while in solution synthesis this can often be achieved in one or two steps.
Despite these promising results, the ULSA scheme is not a complete language and can be significantly improved in the future with contributions from other researchers. First, we only demonstrated that it works for specific inorganic synthesis methods, and introduction of synthesis techniques such as deposition, crystal growth, and others will require extending the ULSA vocabulary or reconsidering the definitions of some terms. Second, the scheme and methodology will benefit from a robust approach to distinguish between various mixing procedures since this is one of the defining items in understanding synthesis protocols. This includes separation between, for example, dissolving precursors and dispersive mixing in a liquid environment, using ball-milling to homogenize the sample and using high-energy ball-milling to actually achieve the final product, adding reagents to promote reaction and adding precursors to compensate for loss due to volatility, and other cases. Using simple heuristics, we have demonstrated that the details of mixing are important for distinguishing between inorganic synthesis methods; however, the scheme will benefit from a high-fidelity approach. Nonetheless, we anticipate that our results and the ULSA schema will help researchers to develop a data-oriented methodology to predict synthesis routes of novel materials.
Efficient and controllable materials synthesis is a bottleneck in technological breakthroughs. While predicting materials with advanced properties and functionality has been brought to a state-of-the-art level with the development of computational and data-driven approaches, the design and optimization of synthesis routes for those materials is still a tedious experimental task. Progress in inorganic materials synthesis is impeded mainly by (a) a lack of publicly available large-scale repositories with high-quality synthesis data and (b) a lack of ontology and standardization for communication on synthesis protocols. Indeed, the first issue arises from the fact that the vast majority of experimental data gets buried in lab notebooks and is never published anywhere. As a result, researchers are liable to perform redundant and wasteful experimental screening of synthesis parameters that someone has already explored but never reported. Even published experimental procedures face the problem of ambiguity of the language used by researchers. This creates a major challenge in acquiring synthesis data from publications by automated approaches including text mining.
The advantage of the paradigm we establish in this work is that it brings us closer to addressing important questions in materials synthesis: “How should we think about the synthesis process?”, “What is the minimum information required to unambiguously identify a synthesis procedure?”, and “Can synthesis be thought of as a combination of fixed action blocks augmented with attributes such as temperature, time, and environment, or are there other important aspects that have to be taken into account?”. These questions will become crucial when transitioning toward AI-driven synthesis.
Recent developments in autonomous robotic synthesis and the attempts to “close the feedback loop” in making decisions for the next synthesis step make the question of synthesis ontology and unification especially important.36,37,49 Indeed, while theoretical decision-making and AI-guided systems can operate with abstract synthesis representations, implementation of this methodology to an autonomous robotic platform will require well-defined and robust mapping onto a fixed set of manipulations and devices available to the robot. The unified language we propose in this work can become a solid foundation for the future development in this direction.
Footnotes
† Equal contribution.
‡ Present address: Nanyang Technological University, Republic of Singapore, 639798.
§ Present address: Roivant Sciences, New York, NY 10036, USA.
This journal is © The Royal Society of Chemistry 2022