Chemical data intelligence for sustainable chemistry

This study highlights new opportunities for optimal reaction route selection from large chemical databases brought about by the rapid digitalisation of chemical data. The chemical industry requires a transformation towards more sustainable practices, eliminating its dependencies on fossil fuels and limiting its impact on the environment. However, identifying more sustainable process alternatives is, at present, a cumbersome, manual, iterative process, based on chemical intuition and modelling. We give a perspective on methods for automated discovery and assessment of competitive sustainable reaction routes based on renewable or waste feedstocks. Three key areas of transition are outlined and reviewed based on their state-of-the-art as well as bottlenecks: (i) data, (ii) evaluation metrics, and (iii) decision-making. We elucidate their synergies and interfaces since only together these areas can bring about the most benefit. The field of chemical data intelligence offers the opportunity to identify the inherently more sustainable reaction pathways and to identify opportunities for a circular chemical economy. Our review shows that at present the field of data brings about most bottlenecks, such as data completion and data linkage, but also offers the principal opportunity for advancement.


Introduction
[11] However, one of the main problems when developing sustainable processes is the lack of access to information on the multiple co-existing options, which hinders a systematic way to shape early but key process decisions. [12] Novel process routes based on renewable or waste feedstock are in fierce competition with the petrochemical-based market, [13][14][15][16] where companies operate at economies of scale and have optimised both processes and supply chains for over a century. [17,18] Thus, a shift in industrial techniques, if not enforced through strict regulation, can only happen if sustainable alternatives are equally good or even economically superior solutions. Yet, even early-stage schemes of novel reaction routes require process modelling, pre-collected data, and, last but not least, chemical intuition, making route selection a long and manual process. Thus, there is a need for a systematic and fast tool to identify the most promising reaction routes. The three key aspects in developing such a chemical data intelligence tool are (i) data, (ii) assessment metrics, and (iii) decision-making approaches (see Fig. 1); they are discussed throughout this study.
A systematic picture of the available knowledge on reaction data can be illustrated through a network of chemical reactions, in which species are connected with one another through the reactions they take part in: the reactants and products of each reaction are connected. Fig. 2 illustrates how the evolving reaction network can connect feedstock molecules (e.g. from biomass) to target molecules (e.g. drug compounds) over a sequence of reactions involving intermediate molecules (e.g. chemical commodities). The increase in electronic data recording, and thus in data availability, has paved the way for rapid progress on reaction networks mined from large chemical databases, sometimes called the chemical universe or the network of organic chemistry (NOC). Fialkowski et al. first introduced the study of organic synthesis reactions with a network representation based on the Beilstein Database. [19] Studies on the topology and growth of the network, [20][21][22] synthesis planning through the network, [23,24] and applications to one-pot reactions followed. [25] [28] With rapid increases in digitalisation, it is worthwhile to revisit the NOC and identify future avenues for chemical reaction data. Information extraction and information representation play key roles, where tools such as natural language processing (NLP) can lead to more complete datasets and ontological representation, or knowledge graphs, constraints. [30] The approach was further extended by advanced metrics and supply chain considerations, [31,32] to represent process networks, [33] and was adopted for the assessment of routes to biopolymers. [34] An alternative to the steady state model for reaction network optimisation was presented in ref. 35. Methodologies are evolving quickly, yet their full potential will only be realised if connected to large chemical databases.
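To make the network picture concrete, the following is a minimal Python sketch of a reaction network in which a route from feedstocks to a target is found by forward expansion. It is an illustration only, not the method of any cited work: the reaction identifiers, the `find_route` helper, and the toy reaction set are assumptions introduced here, and stoichiometry and conditions are ignored.

```python
# Hypothetical toy network: reaction id -> (set of reactants, set of products).
REACTIONS = {
    "r1": ({"glucose"}, {"ethanol", "CO2"}),
    "r2": ({"ethanol"}, {"ethylene", "water"}),
    "r3": ({"ethylene", "O2"}, {"ethylene oxide"}),
    "r4": ({"ethylene oxide", "water"}, {"ethylene glycol"}),
    "r5": ({"naphtha"}, {"ethylene"}),
}


def find_route(reactions, feedstocks, target):
    """Return reaction ids (in a feasible order) that make `target`, or None."""
    available = set(feedstocks)
    produced_by = {}  # species -> reaction that first produced it
    fired = []        # reactions in the order they became feasible
    changed = True
    while changed and target not in available:
        changed = False
        for rid, (reactants, products) in reactions.items():
            # A reaction can fire once all of its reactants are available.
            if rid in fired or not reactants <= available:
                continue
            fired.append(rid)
            for species in products - available:
                produced_by[species] = rid
            available |= products
            changed = True
    if target not in available:
        return None
    # Walk back from the target to collect only the reactions that are needed.
    needed, stack = set(), [target]
    while stack:
        species = stack.pop()
        rid = produced_by.get(species)
        if rid is None or rid in needed:
            continue  # a feedstock, or a reaction that is already scheduled
        needed.add(rid)
        stack.extend(reactions[rid][0])
    return [rid for rid in fired if rid in needed]


bio_route = find_route(REACTIONS, {"glucose", "O2"}, "ethylene glycol")
fossil_route = find_route(REACTIONS, {"naphtha", "O2", "water"}, "ethylene glycol")
```

In this toy network, the same target is reachable from a bio-based and from a fossil feedstock, which is the situation in which route evaluation criteria become necessary.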
To evaluate routes in reaction networks, appropriate criteria must be used. The twelve principles of green chemistry, [36] the "productivity scheme", [37,38] and their extension towards green engineering, [39] the "improvement scheme", [40] have established a common understanding of environmental considerations in chemical engineering. Metrics such as the environmental impact factor (E-factor), atom economy (AE), or energy requirements enable us to assess environmental considerations within chemical processes. [38] The field is going through a transition from green to sustainable chemistry, which requires the consideration of wider system boundaries. [43] Sustainability criteria should be able to simulate the system boundaries (e.g. demand/supply outside the network) and should be retrievable in an automated manner from the data sources available at an early process development stage. Wider chemical reaction systems have previously been analysed based on exergetic efficiencies and sets of chemical heuristics. [26]
Only together can the areas of data, metrics, and decision-making make the most use of chemical data intelligence and enable practitioners to plan the most sustainable reaction routes. In this work, we explore the potential of semantic data for rich and structured chemical knowledge. Advances in fields such as NLP and recommendation systems are further reviewed, as they promise to tackle the challenge of data scarcity. For sustainability aspects in chemical reactions, we elucidate the importance, as well as the challenges, of systems thinking. We explore the idea of a navigation system for chemical space, similar to Google Maps, that shows the most sustainable pathways through the tangle of chemical reactions.
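As a small worked example of the green chemistry metrics named above, the sketch below computes atom economy and the E-factor from their standard textbook definitions. The function names and the numerical inputs are assumptions chosen for illustration, not values taken from this review.

```python
def atom_economy(product, reactants):
    """Atom economy in percent.

    `product` and each entry of `reactants` are (stoichiometric coefficient,
    molar mass in g/mol) pairs; only the desired product is counted.
    """
    product_mass = product[0] * product[1]
    reactant_mass = sum(coeff * mw for coeff, mw in reactants)
    return 100.0 * product_mass / reactant_mass


def e_factor(total_input_mass, product_mass):
    """E-factor: mass of waste per mass of product (same mass unit for both)."""
    return (total_input_mass - product_mass) / product_mass


# Dehydration of ethanol (46.07 g/mol) to ethylene (28.05 g/mol) plus water:
ae_ethylene = atom_economy((1, 28.05), [(1, 46.07)])  # about 60.9 %

# Hypothetical process: 10 kg of total inputs give 4 kg of product.
ef_example = e_factor(10.0, 4.0)  # 1.5 kg waste per kg product
```

Both metrics are confined to a single reaction or process; they say nothing about what happens outside that boundary, which is the limitation the transition towards sustainability metrics addresses.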
We provide a roadmap with our recommendations for the development of a systematic early-stage sustainability assessment tool in Fig. 3. Within the three research fields, we identify impact opportunities and provide action steps and approximate time frames. The foundation for the recommendations is explained throughout this work in the detailed sections on data, metrics, and decision-making.
available datasets where single entries are recorded with a uniform resource identifier, enabling linkage to the chemical context and to adjacent fields, such as substance emissions and market values of the molecules. As a second impact opportunity, we strongly recommend the inclusion of biological data. This includes molecular transformations from systems biology, purified enzymatic reactions, and whole-cell transformations, as well as biological feedstocks as primary raw material and as secondary raw material from waste streams. Lastly, we emphasise the importance of complete data structures. While for novel publications journal standards can enable the recording of stoichiometry, yield, and reaction conditions, the body of already published reactions needs to be revisited to extract such information into an electronic standard. NLP can gather previously stated information, and predictive models can be utilised for the cases where the original sources do not contain the information.
# 2 Metrics impact opportunities and action points. The first opportunity is the transition from green to sustainable metrics. Herein, we recognise the importance of integrating resource-based assessment, e.g. exergy, which complements emission-based metrics, as well as social and economic assessment. Furthermore, only cross-domain standards, e.g. on allocation and the system boundary, can lead to a successful tool, as the chemical sector spans multiple domains. Last, but not least, we recommend a focus on early-stage metrics, as only early-stage decision-making can lead to inherently sustainable pathways. The second impact opportunity is the measurement of circularity potentials. To evaluate the sustainability of a pathway, knowledge of potential uses of the waste streams generated throughout the process is indispensable. Considerations on possible upstream treatment and on the stability of molecular properties during multiple cycles of reuse are of interest. The last impact opportunity is the prediction of molecular properties relevant for sustainability assessment, e.g. prediction of chemical exergies from molecular structures.
# 3 Decision-making impact opportunities and action points. Future decision-making systems for early-stage sustainability assessment are required to evolve on two fronts: increasing model accuracy and decreasing model complexity. On the one hand, it is desirable to derive modelling frameworks which take, for instance, solvents, separations, upstream treatment, material circularity, and sequential manufacturing into account. On the other hand, linear models, or heuristic solving strategies for non-linear models, are essential to facilitate large-scale, and thus systematic, assessment. Last, but not least, sustainability is a dynamic systems problem, which requires decision-making support that is able to interact with the dynamic nature of the system. While metrics might be able to map current market prices and data can show the demand for or supply of certain materials at the geographic locations considered, the envisioned decision-making framework should work beyond these static snapshots of the system, e.g. following agent-based modelling approaches.
Enabling the three areas to evolve together will be the key aspect for well-reasoned reaction pathway development, tackling the different aspects of sustainability. We would like to stress the following interfaces between the areas in particular.
# Interface 1: data is the foundation for metrics and decision-making. Accessible and well-documented databases are a prerequisite to further development of metrics. Molecular properties such as chemical exergy, thermodynamic properties, or toxicity values are key aspects of sustainability assessment. Further development of computational tools to automatically generate metrics for given systems is needed. Additionally, data on reaction structures, i.e. the stoichiometric relationships, are essential to formulate mass balances as physical constraints of the decision-making formalism. We anticipate that only clear communication of data needs from the metrics and decision-making communities will enable sufficiently quick extensions to the current data sources to be developed. Alongside this, conversations between the communities should include ontology development to define the characterisation and relationships of data.
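The role of stoichiometric data in formulating mass balances can be sketched as follows. This is a hedged illustration under simple assumptions: the stoichiometric table, the reaction extents, and the helper names are hypothetical, and a real decision-making formalism would treat the extents as optimisation variables rather than fixed inputs.

```python
# Hypothetical stoichiometric coefficients per reaction
# (negative = consumed, positive = produced), in mol per mol of reaction extent.
STOICH = {
    "r1": {"glucose": -1, "ethanol": 2, "CO2": 2},
    "r2": {"ethanol": -1, "ethylene": 1, "water": 1},
}


def net_production(stoich, extents):
    """Net molar production of every species for the given reaction extents."""
    net = {}
    for rid, coeffs in stoich.items():
        for species, nu in coeffs.items():
            net[species] = net.get(species, 0.0) + nu * extents.get(rid, 0.0)
    return net


def unbalanced_intermediates(net, boundary_species, tol=1e-9):
    """Species that accumulate or deplete although they may not cross the boundary."""
    return sorted(s for s, v in net.items() if s not in boundary_species and abs(v) > tol)


net = net_production(STOICH, {"r1": 1.0, "r2": 2.0})
# Ethanol is an intermediate here: it is fully consumed, so nothing is flagged.
violations = unbalanced_intermediates(net, {"glucose", "CO2", "ethylene", "water"})
```

Without stoichiometric coefficients in the underlying records, neither function can be evaluated, which is why missing stoichiometry blocks this kind of constraint from being written down at all.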
# Interface 2: co-development of metrics and decision-making. Throughout this article, we argue that sustainability is a systems science. For a future sustainable chemical supply chain, it is essential to evaluate proposed reactions within their environment, rather than detached from it. Thus, the metrics required to assess the sustainability of novel reaction pathways need to capture the wider system interactions considered in the decision-making approach. Here, the foundations of assessing the greenness or the sustainability of reactions are provided by the metrics community, but new requirements for metrics will arise from within the systems modelling community. We anticipate large benefits if both domains come together.
# Interface 3: defining decision-making environments through regional and dynamic data. Data on the price of molecules illustrates the economic interaction of a system with its environment across the system's boundary. This interaction is further defined by data on the availability of and the demand for each molecule, thus mapping the chemical supply chain. However, market prices and supply chains, as well as the energy price, fluctuate in time and space. In the long term, modelling novel reaction pathways within a circular chemical supply chain requires dynamic, interactive environment descriptions rather than static snapshots. Here, joint research is needed to develop dynamic environments based on suitable data sources. Future advanced tools may even include regional policy insights, as well as costs associated with infrastructure and personnel.

Data
Large volumes of open and big data have revolutionised many fields of our modern society. [44] While enormous improvements have driven developments in areas such as computer vision or language recognition, chemical data has yet to overcome urgent challenges, such as establishing openly accessible data with standardised representations and resolving quality inconsistencies in existing data. [45] FAIR scientific data (findable, accessible, interoperable, and reusable), [46] evolving into BOLD concepts, [47][48][49] will push chemical discovery and is essential for cross-disciplinary tasks, such as sustainability assessments.

Databases and accessibility
To facilitate large-scale reaction route screening, access to large reaction databases is the stepping stone. When aiming for more sustainable process routes, it is worthwhile to discuss the extraction of reactions from conventional chemical transformation as well as from biosynthetic conversion strategies and alternative chemical conversion strategies.
Conventional chemical reaction databases. There exists a variety of chemical reaction databases with different sizes and accessibility rights as well as distinct coverage of the chemical space. Table 1 outlines a selection of common databases for organic reactions. CASREACT † and Reaxys™ ‡ (in the following called Reaxys) are by far the largest databases for chemical reactions. They include scientific literature and a selection of patents, but require users to buy licenses to work with large-scale data. The database called Pistachio, § developed by the company NextMove, stores reactions from US patents and has released a publicly available subset, CC-Zero, ¶ of over one million reactions. SPRESI ‖ is another database for organic reactions, which provides a subset of 500 000 reactions as a free app. The USPTO ** database is the smallest, yet it is entirely openly accessible. Open reaction databases are gaining increasing momentum. One recent example is the open reaction database, †† which is a multi-institution initiative to aid machine learning (ML) tasks in chemistry/chemical engineering by providing structured and freely available reaction data. The project sits on GitHub and its launch is planned for early 2021. Regarding data coverage, Thakkar et al. have outlined the differences in data coverage from multiple sources, including a dataset based on electronic notebooks from AstraZeneca. [50] They studied reaction templates within different databases and found that only 2% of templates were common to all considered data sources. [50] Notably, chemical databases are a rapidly developing field. Content breadth and depth are being constantly reviewed and further developed. Reaxys, for instance, now covers supplier information on price, supplier geolocation, package sizes, and much more.
Biosynthetic reaction databases. A hybrid system of biosynthetic and conventional chemical synthesis opens up opportunities for efficient (bio)chemical pathway search. Enzymatic reactions can lead to more efficient reaction pathways with reduced operational costs, as synthetic biology can enable shortcuts and flexible design of supply chains (see Fig. 4), improving redox efficiency. Moderate temperatures and pressures and the avoidance of metal catalysts or hazardous solvents can ease synthesis and lower operational costs. [51] Additionally, enzymatic reactions are well suited to utilise and further functionalise the biological structures in renewable feedstock. The advantages of metabolic alternatives have encouraged synthetic biologists to find biosynthetic routes that produce bulk and industrial chemicals such as ethanol, [52] benzoic acid, [53] and toluene, [54] and active pharmaceutical ingredients such as flavonoids [55] and tryptophan. [56] A metabolic map for the production of bio-based chemicals was summarised by Lee et al. [57] Databases such as the Kyoto Encyclopaedia of Genes and Genomes (KEGG), [58] ‡‡ Rhea, [59] §§ and the Enzyme Catalytic-mechanism Database [60] ¶¶ open up opportunities to obtain bio-information for metabolic reaction networks. The most comprehensive biological database is KEGG, with currently almost 13 000 recorded reactions. [58] However, in comparison to conventional chemical databases, enzymatic databases are relatively small and sparse at present. Synthetic biologists are actively working towards the prediction of metabolic reaction behaviours to populate the databases. [61,62]
Alternative conversion strategies. Further opportunities for discovering new reaction pathways include electrochemical or photochemical transformations. Electrocatalytic hydrogenations may hydrogenate molecules in water under ambient conditions and thus replace conventional hydrogenation steps, which typically require elevated pressures. [63] Besides, electrochemistry may also unlock entirely new synthetic pathways as novel molecular transformations are observed. [64] Harnisch and Urban illustrate the concept of an electrobiorefinery, where they anticipate that the synergies between microbial and electrochemical conversions are likely to have an impact by, amongst others, enlarging product portfolios and exploiting new feedstocks. [65] In particular, they outline electrochemistry for the decomposition of bio-based feedstocks, e.g. lignin pretreatment, to provide chemical feedstocks, e.g. H2, CH4, or C1- or C2-compounds, but also to electrochemically steer fermentation, e.g. CO2 can be used as a carbon source for fermentation cultures. [65] Another promising conversion strategy is photochemistry. Research efforts are focusing, for instance, on the utilisation of sunlight to produce CO, ethanol, or methane from CO2 in aqueous solutions, or on solar-driven organic synthesis, where the target is to obtain high-value products. [66] He and Janáky state that utilising solar energy and CO2 resources can be expected to yield both fuels and value-added chemicals. They list a range of possible chemical products in their work and compare the performance of different photochemical conversion strategies. [67] In large chemical databases, such as Reaxys, one can specify reaction types, also including electro- or photochemical transformations. This allows specific inclusion of alternative conversion strategies and potentially enlarges the toolbox for the development of more sustainable chemical reaction pathways.

Data formats
Linking data sources is often essential for decision-making, but requires rethinking of existing data practices. [68] A stepwise approach for data formats, the 5-star plan ‖‖ (cf. Fig. 5), was suggested by Tim Berners-Lee, the inventor of the World Wide Web. The plan describes a trajectory for data formats starting with open data in any format and resulting in the semantic web. [69] In the semantic web, data is accessible both for humans and machines, as data is stored with structure and context, generating meaningful content. The content is made comparable between sources through ontologies. [69] Ontologies exploit triple relationships, e.g. the statement acetone ⟨is used as⟩ solvent is broken into the two concepts "molecule" and "solvent", and their relationship is given by ⟨is used as⟩. These generate metadata structures, i.e. reusable knowledge representation. [70] Additional terminology for data structures is explained in Table 2. Scientific practices to record data openly and in machine-readable formats lag behind. Large chemical databases such as CAS *** are fundamental sources of chemical information; however, they do not fulfil the requirements of the semantic web when it comes to information access and representation. [70] This is due to current ways of publishing, e.g. providing PDF files, which support human readability but are ill-suited for data mining and analysis. [68,70] [73] The final product is envisioned to be a "structured, interlinked and semantically rich knowledge graph". [73]
Fig. 4 Illustration of enzymatic process pathways and conventional process pathways based on fossil feedstocks. In conventional processes, feedstocks are first broken into smaller building blocks and then reassembled and functionalised step by step. Renewable feedstock, however, is often already highly functionalised, and different enzymatic transformations can make use of this for direct transformation into different stages of the conventional supply chain.
Chemical data often lacks relational information and is stored in diverse formats. For example, molecule representations range from structural chemical formulas over numeric descriptors to string representations, e.g. SMILES or SMARTS. The chemical mark-up language was introduced to offer semantics for chemical data. [74] It allows the integration of various entities, e.g. molecules, spectra, and reactions, in mark-up text for electronic use. When relationships are added to the entities, an ontological structure of information emerges. A few early ontologies for chemical engineering have been developed. [75,76] OntoCAPE, for example, defines an ontology for chemical processes. [77,78] OntoCAPE introduces chemical primitives such as system property, physical dimensions, or units to define a system. System properties can have numerous values. The authors illustrate this through the system property "temperature", where different values can be distinguished from each other by the system property "time", both given in their respective units, here kelvin and hours, to record a temperature profile over time. [77] The extension of such formalisms to broader domains of chemical applications brings about the potential to gather chemical data in a structured and standardised way, even across different subject areas.
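The triple relationships that ontologies build on can be sketched with plain Python tuples. This is a minimal illustration and not the data model of OntoCAPE or of any database discussed here: the predicate names, the example facts, and the `query` helper are assumptions, and a production system would use URIs and a dedicated triple store.

```python
# Each fact is a (subject, predicate, object) triple; all entries are illustrative.
TRIPLES = [
    ("acetone", "is_a", "molecule"),
    ("acetone", "is_used_as", "solvent"),
    ("ethanol", "is_a", "molecule"),
    ("ethanol", "is_used_as", "solvent"),
    ("ethanol", "is_used_as", "reactant"),
    ("toluene", "is_a", "molecule"),
]


def query(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Which entities are recorded as being used as a solvent?
solvents = sorted(s for s, _, _ in query(TRIPLES, predicate="is_used_as", obj="solvent"))
```

Because every fact has the same three-part shape, new kinds of information (for example a price or an emission factor) can be attached to an existing entity without changing the schema.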
Linked data is much needed for sustainability assessment, as it allows for holistic and cross-disciplinary assessment. [79] The development of knowledge graph technologies potentially enables efficient data handling, including conditional data queries within or between subject areas, and more accurate data inference, due to high contextualising. [82] Similar semantic web structures will be essential to enable sustainability assessment within reaction networks, where temperatures, yields, solvents, and reaction stoichiometry should be recorded and linked with each reaction. [83] An entity-based data format would allow us to link further information (e.g. data on waste streams and their compositions, or on regional availabilities of renewable energy) in a modular way, paving the way for more holistic considerations in future reaction planning.

Data completion
Table 2 Additional terminology for data structures
Metadata: Metadata is data about data. One example is structural metadata, which provides schemes and order to data.
Ontology: An ontology provides a uniform approach to describe data semantically. It is a specific conceptualization in a format that allows for reasoning and inference. [79]
Semantics: Semantics provides methods to add meaning to information constructs. Adding meaningful tags to pieces of data brings about better readability, as a tag is an abstraction of what the stored piece of data represents in the real world.
Semantic Web: Tim Berners-Lee, the inventor of the World Wide Web, defines the Semantic Web not as a separate web, but as an extension of the existing one. His vision is a web of information which can be processed by computers and which brings structure to the meaningful content of web pages. [69]
Unified resource identifier:

This journal is © The Royal Society of Chemistry 2021

While text mining has enabled the gathering of large-scale chemical information, such as properties and structures of molecules and connections between reactants and products, to populate electronic chemical databases (cf. Section Databases and accessibility), the methods used in creating the datasets sometimes lack accuracy and misclassify important pieces of information. Missing stoichiometric data and the incorrect recording of multistep reactions as single-step reactions prevent mass balances within a reaction from being calculated and are major hurdles for decision-making based on automatically generated process options. Furthermore, records of reagents, solvents, and catalysts are often inconsistent, e.g. they are sometimes absent or incorrectly recorded as a reactant, and no information about required quantities is provided. This is challenging for sustainability assessments because the use of reagents, solvents, and catalysts has a significant influence on environmental impacts. [84] Additionally, in some cases, clear identification of specific chemical species within the databases is problematic, as mixtures of enantiomers are recorded as pure compounds or entries simply state "mixture out of C3 to C6 hydrocarbons", without reference to their molecular structures or database registry numbers. The same problem exists in the identification of complex feedstocks without exact structure and composition, such as lignin, chitin, or cellulose. Inconsistency in temperature and pressure recordings makes an energetic analysis of processes difficult. With rapid developments in the fields of NLP and high-throughput experiments (HTE), it is expected that data quality will quickly improve and that some of these current hurdles for the algorithmic use of chemical information will be overcome.
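One simple way to detect the missing-stoichiometry problem described above is to check whether a recorded reaction conserves every element. The sketch below is an assumption-laden illustration: `parse_formula` handles only flat molecular formulas such as C2H6O (no brackets, charges, or isotopes), and the example records are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a flat molecular formula such as 'C2H6O'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def element_imbalance(reactants, products):
    """Return {element: atoms on reactant side minus product side}, non-zero only.

    `reactants` and `products` are lists of (coefficient, formula) pairs.
    An empty result means the record is element-balanced.
    """
    balance = Counter()
    for coeff, formula in reactants:
        for element, n in parse_formula(formula).items():
            balance[element] += coeff * n
    for coeff, formula in products:
        for element, n in parse_formula(formula).items():
            balance[element] -= coeff * n
    return {element: n for element, n in balance.items() if n != 0}


# Complete record: ethanol -> ethylene + water.
complete = element_imbalance([(1, "C2H6O")], [(1, "C2H4"), (1, "H2O")])
# Incomplete record: the water by-product was not recorded.
incomplete = element_imbalance([(1, "C2H6O")], [(1, "C2H4")])
```

For the incomplete record, the leftover atoms (two hydrogen and one oxygen) hint at the unrecorded by-product; in practice such a check can only flag records for completion and cannot decide what the missing species actually is.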
Information extraction from scientific literature. NLP describes a range of computational techniques to analyse and represent natural text in a human-like manner for a variety of applications. [85] One application especially relevant to this review is information extraction, the task of gaining structured knowledge from text. NLP aims to aid (i) human-human communication, e.g. translation tasks, (ii) human-machine communication, e.g. conversational agents such as Apple's Siri, or (iii) both machines and humans, e.g. through learning from large amounts of data. [86] According to Hirschberg and Manning, NLP has seen an immense boost within the last few years due to an increase in computational power, the large availability of linguistic data, successful ML algorithms, e.g. the transformer model, [87] and a better understanding of human languages. [86] Information extraction from scientific literature brings about great benefits, not only to fill in gaps in data but also to keep track of the ever-growing body of literature. [86]
The last decade has shown immense progress of NLP techniques within chemistry and related fields, making NLP a promising avenue for tackling data completion tasks. Jessop et al. have developed the Open-Source Chemistry Analysis Routines (OSCAR) software to read entities and information in chemistry publications. [88] OSCAR4, a library onto which text-mining tools for chemistry can be built, was released, and the authors illustrated that OSCAR may also be applied to other areas of the physical sciences, as customisation through different dictionaries is possible. [88] Such transdisciplinary data systems are of increasing importance, as sustainability assessment requires a variety of data. Krallinger et al. present a review on the access of chemical information through text-mining techniques and especially value chemical entity recognition as well as the interlinkage to biological data. [89] NLP techniques have also gained a foothold in related fields, such as nanotechnology, [90] medical/clinical text documents, [91] and biomedical texts. [92] However, linking data from scientific publications of different chemical subject areas is potentially problematic, as definitions and reporting standards may vary. Automated assessment frameworks of data quality for linked open data [93,94] may offer the potential to identify pieces of misleading information between communities.
Data acquisition through high-throughput experiments. In HTE, multiple reactions are performed in parallel to quickly answer a specific chemical question. [95] High-throughput virtual experiments (HTVE) are often utilised to examine material or drug leads at a large scale, when experimental searching becomes impractical due to high cost or technical issues. [98] This iterative approach has proven to be very powerful, especially when combined with ML-guided exploration algorithms. [95,99] Eyke et al. employed data produced by HTE to train an ML model, which predicted reaction outcomes and selected the most informative experimental region for HTE to explore further. [96] Chen and Visco built a support vector machine (SVM) model on the basis of experimental data and molecular descriptors, which they trained for the identification of drug candidates, [98] and Li et al. demonstrated that artificial neural networks (ANNs) trained on density functional theory (DFT) data were able to capture complex adsorbate-metal interactions, providing guidance on the design of bimetallic catalysts. [100] [103][104] For instance, in 2004, King et al. reported an automatic experimental system called "robot scientist" that was able to independently conduct an entire research cycle, including planning, testing, analysis, and a re-run if hypothesis and results were inconsistent. [101] In the platform assembled by Coley et al., synthetic routes were proposed by retrosynthesis software and organic synthesis was conducted in flow reactors, automatically configured by a robotic arm. [104] In Cronin's group, a robotic platform was constructed along with a standardized architecture for organic synthesis, called the Chemputer. [105] By taking advantage of various smart hardware and programming languages, the Chemputer system showed the potential to standardize the whole automatic experimental process, from conducting synthesis to generating reports.
Advanced data analysis techniques, such as the transformer-based model developed by Schwaller et al., perform well over data from HTE, but are still not feasible for processing historical experimental data, which often suffers from high inconsistencies. [106] With the continuous improvements in HTE, HTVE, automation, and data analysis, highly consistent and reliable experimental data may quickly expand, bringing to light more reliable data standards in larger regions of the chemical space.

Data inference
Some data is never reported in primary research publications or patents, but data may be augmented through data inference, a cheap and fast alternative to experimental studies. In this section, we first outline recommender systems as a general way to complete data matrices and then highlight developments for data inference of the specific contents illustrated in Fig. 6. Relevant for reaction route search are, firstly, missing reaction conditions, e.g. temperature, pressure, and reagents; secondly, reaction outcomes, e.g. the yield, which is not always available; and thirdly, reaction structures, e.g. where reactants and/or products and the reaction stoichiometry are missing.
Recommender systems. Recommender systems have been an effective approach to deal with information overload and are seen as most promising in problems related to "over-choice" of options. A standard recommender system addresses a set of n users and m items and recommends items to the users according to their preferences. The relationships between users and items are commonly represented in an n × m matrix, which is the core element of the recommender system. 107 The matrices are often very sparse, as little information is originally available, and the aim is to predict the values of the missing cells. 107 The entries may range from single bits to unstructured text. 108 One prominent example has been the Netflix challenge, in which a prize of one million US$ was awarded to the first team to model the dataset and predict new ratings to a specified accuracy. 109 Matrix completion methods or graph recommender systems have been applied to such problems. 107,110,111 For a deeper understanding of recommender systems, their current trends and challenges, and the underlying deep-learning strategies, we refer the reader to these surveys. 107,112,113
Within the domain of chemistry, there are early works that recognise the impact of recommender systems for both experimental and computational data. Savage et al. recommend candidate molecules as reactants for the synthesis of desired products. They formulate the problem as link prediction over a graph base, where links represent reactant-product relationships, and provide chemical knowledge in the form of molecular fingerprints. 114 In 2020, Jirasek et al. showed an application of a recommender system for the prediction of binary activity coefficients. 115 Other examples within chemical engineering include the use of recommender systems to predict drug side effects, 116 to estimate the relevance of chemical compounds to form crystals, 117 and for material choices in polymerisation experiments. 118
Inference of reaction outcomes. For reaction outcomes, we focus on the inference of yield data. At present, yield records are far from sufficient in most databases. For instance, an exemplary dataset from the Reaxys database with 17 million reaction records contains around nine million reactions without any yield information. Additionally, among reaction records with yield information, the yields of only a few products are listed.
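The matrix-completion view of recommender systems described above can be pictured on exactly this kind of sparse yield record. The following is a minimal sketch, assuming a hypothetical table whose rows are substrates, whose columns are condition sets, and whose entries are fractional yields; all numbers are invented for illustration. It uses plain stochastic-gradient matrix factorisation, a generic technique, and is not the method of any of the cited works.

```python
import random

def predict(P, Q, i, j):
    """Predicted entry (i, j) as the dot product of row and column factors."""
    return sum(p * q for p, q in zip(P[i], Q[j]))

def factorise(observed, n_rows, n_cols, rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit low-rank factors P (rows) and Q (columns) to the observed cells by SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for (i, j), value in observed.items():
            err = value - predict(P, Q, i, j)
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q

def rmse(P, Q, observed):
    """Root-mean-square error over the observed cells only."""
    squared = [(v - predict(P, Q, i, j)) ** 2 for (i, j), v in observed.items()]
    return (sum(squared) / len(squared)) ** 0.5

# Hypothetical sparse yield table: (substrate index, condition index) -> fractional yield.
observed = {(0, 0): 0.9, (0, 1): 0.7, (1, 0): 0.8, (1, 2): 0.3,
            (2, 1): 0.5, (2, 2): 0.2, (3, 0): 0.6, (3, 2): 0.2}

P0, Q0 = factorise(observed, 4, 3, epochs=0)   # untrained factors, same seed
P, Q = factorise(observed, 4, 3)               # trained factors
estimate = predict(P, Q, 0, 2)                 # inferred value for an unobserved cell
```

The estimate for an unobserved cell is only as good as the low-rank assumption; in practice, held-out known entries would be needed to judge whether such inferred yields are usable.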
Data-driven approaches are promising avenues for yield predictions. Through advances in HTE techniques, detailed and structured experimental data have become more easily available. Simultaneously, ML methods have evolved, and machine-readable representations of molecules and reactions are under constant development. One of the earliest ML approaches for yield prediction was presented by Kito et al., who predicted the selectivity of catalytic oxidative dehydrogenation reaction products by using an ANN. 119 Two trends emerged for yield prediction afterwards. On the one hand, models are based on more accurate but expensive inputs through DFT-based descriptors and focus on specific reaction types; on the other hand, models aim to identify more generic relationships for multiple reaction types based on cheaper molecular representations. DFT-based descriptors were employed by Yada et al. for tungsten-catalyzed epoxidation of alkenes by using a linear function ensembled in a logistic regression model. 120 Eyke et al. utilised reaction fingerprints built by concatenating Morgan fingerprints to guide their experimental design for two specific types of reactions through an ANN. 96 Sandfort et al. present a broader approach through their structure-based platform for reactivity prediction in organic chemistry. 123 The idea was to use molecular fingerprints as the only type of input for ML models to solve all kinds of reaction predictions. While no universally applicable fingerprints were found for all applications, an accuracy comparable to the work of Estrada et al. was achieved when focusing only on C-N cross-coupling reactions. Skoraczyński et al. showed reaction examples where subtle changes in molecular structure or reaction conditions led to distinct reaction results, and thus argued that general descriptors for diverse sets of organic reactions are difficult to set. 124 A very recent development in the area is an algorithm based on NLP. 106 Their model consists of an encoder and a regression layer to predict yields and is based on reaction SMILES as inputs. On a dataset based on HTE reactions for Buchwald-Hartwig and Suzuki-Miyaura reactions, a high prediction accuracy was achieved, while for a generic dataset, the open-source USPTO, poorer accuracy was obtained.

While the prediction of accurate yields for diverse reaction types remains a challenge at present, the new methods bring about promising outlooks.
This journal is © The Royal Society of Chemistry 2021
Inference of reaction conditions. Methods for the prediction of reaction conditions, also known as reaction context, 125 have evolved from specific methods valid only for certain reaction types and contexts towards more holistic approaches. The reaction context is made up of discrete decisions, e.g. catalyst and solvents, and continuous decisions, e.g. temperature, pressure, and pH value, which influence one another. Marcou et al. predicted catalysts and solvents for the Michael reaction by formulating binary classification problems on a set of 198 reactions. 126 The prediction of both catalysts and solvents was correct for only eight out of 52 reactions from an external validation set. 126 Lin et al. focused on catalyst recommendations for deprotection reactions and demonstrated their work on catalytic hydrogenation reactions. 127 They used the condensed graph of reaction methodology to reduce a reaction to a single graph, allowing for descriptors and fingerprints, and employed similarity searches to suggest catalysts. 127 Gao et al. also recognise such approaches based on chemical similarity, but highlight their computational cost in sufficiently large databases. 128 Segler and Waller aim for condition recommendations for different types of reactions utilising a knowledge graph. 129 The graph consists of two node types (i.e. reactions and molecules) and a variety of edge types describing the reaction conditions, e.g. ⟨is reactant in⟩, ⟨is catalyst in⟩, ⟨is solvent in⟩. New conditions are predicted through node and link completion tasks. 129 The group of Jensen also aimed for a generic model to predict reaction context, here catalysts, solvents, reagents, and temperature. 128 They trained an ANN on a dataset of about 10 million single-step and single-product reactions and allowed for hierarchical ANN structures which take interdependencies between the reaction contexts into account. In 69.6% of cases, a close match to the recorded conditions is found within the top 10 predictions. 128 Also, the optimal selection of reaction conditions can be solved as an inverse design problem by using an optimisation algorithm to vary the inputs of a reaction outcome model.
Inference of reaction structures. To evaluate synthetic reaction routes, the masses of products, reactants, and side-products need to be quantified. These can only be computed when all substances and their stoichiometric coefficients are known. However, three main aspects of current data recording hinder this analysis: (i) stoichiometric coefficients are lacking, (ii) reaction co-participants are missing, and (iii) multiple reaction steps are integrated into a single reaction entry.
In the current literature, there exists a limited number of methods for reaction structure completion. Firstly, reaction templates can be utilised. Grzybowski et al. manually curated around 100 000 reaction rules with complete understanding of reaction participants and stoichiometry. 24,25 Their templates are now linked with the commercial software SYNTHIA ††† and can guide retrosynthesis and analyse carbon efficiency based on mass conservation. 130 However, manual curation of reaction rules is far from covering the entire chemical space. Secondly, atom mapping, which tracks the rearrangement of atoms in chemical transformations, is a promising way to tackle this problem. The completion of an exemplary reaction is shown in Fig. 7, where in (a) stoichiometric coefficients are absent and mass balances are not obeyed, and in (b) atom mapping describes the exact transformation and reveals the missing species on the product side. The existing atom mapping methods, described in recent reviews, 131,132 often convert molecules into graphs and compare their maximum common subgraphs. However, this results in an NP-hard problem where computational time increases exponentially with the number of atoms in the molecules. Jaworski et al. utilise graph-theoretical considerations and chose 20 chemical rules/heuristics to correct the mapping of reactions. 133 This method attempts to complete the stoichiometry, firstly, by adding small molecules such as acetaldehyde, ammonia, and others to balance the reactions and, secondly, by fitting reactions into popular reaction templates and adding the missing parts. Only if such attempts fail is atom mapping employed. The work by Schwaller et al. utilised NLP to infer reaction structures. 134 A neural network (transformer) was trained on a set of mapped reactions and was shown to complete the mapping task more quickly and with confidence scores. 134 Despite the aforementioned improvements in atom mapping, inferring complete reaction structures remains a challenge at present, as a precise prediction of functional transformations is required.
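The first step of any such completion, detecting that a recorded reaction does not obey the mass balance, can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: species are given as simple molecular formulas without brackets or charges, the function names are hypothetical, and the code only counts atoms on each side. It is not an atom-mapping algorithm and does not reproduce any of the cited tools.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no brackets, no charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def side_total(side):
    """Sum atom counts over (stoichiometric coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in parse_formula(formula).items():
            total[element] += coefficient * n
    return total

def imbalance(reactants, products):
    """Return the atoms missing on the product side and on the reactant side."""
    left, right = side_total(reactants), side_total(products)
    # Counter subtraction keeps only positive differences.
    return dict(left - right), dict(right - left)

# Esterification recorded without its co-product:
# acetic acid + ethanol -> ethyl acetate
missing_products, missing_reactants = imbalance(
    [(1, "C2H4O2"), (1, "C2H6O")],
    [(1, "C4H8O2")],
)
```

Here the product side lacks two hydrogen atoms and one oxygen atom, which points to water as the unrecorded co-product. Deciding which molecule actually accounts for the missing atoms is the harder problem that templates and atom mapping address.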
Notably, computational capacities have increased immensely over the last few decades, and new computational approaches, such as graphics processing units (GPU) or quantum computing, are promising avenues for complex computational tasks. GPU computing has enormously advanced the field of deep learning in areas such as computer vision and speech recognition. 135 Quantum computing has been shown to speed up different search algorithms 136 and its potential for complex optimisation tasks, such as energy system optimisation, has recently been highlighted. 137 However, it has been argued that the power of quantum computing is limited, especially when it comes to NP-hard problems. 138 Nevertheless, new computing approaches provide promising avenues for computationally expensive algorithms such as large-scale atom mapping challenges.
Besides the data-driven techniques to infer missing species and stoichiometry, it is worthwhile to discuss the automated generation of entire reaction networks based on chemical rules, which also yields the stoichiometric relationships of reaction networks. 139,140 For instance, the rule network input generator (RING) can construct complex networks based on a set of input reactants in their SMILES representation and a set of defined reaction rules and constraints. 139,140 RING has been shown to reproduce mechanisms reported in the literature for systems such as the dehydration of fructose to produce HMF or the acid-catalysed hydrolysis of HMF to levulinic acid. 139 Most notably, Marvin et al. combine the network generation method RING with a mixed-integer linear programming (MILP) model for the optimisation of pathways towards biofuel-gasoline blends. 141 To generate reaction structures for large-scale networks, encoding all permitted reaction rules and constraints may be tiresome, as literature and databases can in theory readily provide this information. However, rule-based network generation can be of particular interest in sparse regions of chemical knowledge. Here, methods such as RING can contribute significantly.
With all types of prediction algorithms, it is worthwhile to keep the stochastic nature of the results in mind. While information from the literature often contains measurement uncertainties, introducing predictive algorithms adds model uncertainties. In Section 4 on decision-making, we briefly sketch the influence of uncertainties on finding optimal solutions.

Metrics
The assessment of sustainability through metric values is not trivial. As a system challenge, it calls for the use of large data sets. At the same time, it is strongly biased by our subjective view of the dynamic concept of sustainability. Metrics to assess sustainability are diverse and difficult to set. Within the United Nations framework of the Sustainable Development Goals (SDGs), there exist 231 unique indicators to measure sustainability in the 17 dimensions of the SDGs. This illustrates the necessary diversity in dimensions and indicators utilised to assess sustainability. Additionally, strong interconnectivity between dimensions has been acknowledged by many, [142][143][144][145][146] possibly leading to synergetic effects, but also to trade-offs. 143,144 There are limited possibilities to measure the sustainability of reaction routes to the extent that, for instance, the SDGs would indicate. Yet, the transition from green to sustainable chemistry has identified key aspects which should be considered. 0][151][152] On the reaction network level, this means considering entire supply chains as well as moving away from purely environmental concerns towards implicit inclusion of societal and economic ones, their trade-offs, and synergies. The European Technology Platform for Sustainable Chemistry, SusChem, ‡‡‡ focuses on projects which combine all three sectors, and the International Sustainable Chemistry Collaborative Centre, ISC3, §§§ highlights inter alia systems thinking, ethical and social responsibility, and circularity as key characteristics for chemists to focus on. 153

Systems thinking
In the search for a more sustainable future, systems thinking and systems modelling are powerful tools. A system can be defined as a whole made up of interlinked, possibly nested, subsystems. 154 Sustainability is often referred to as the property of a system which sustains itself, making the notion of systems thinking even more relevant for the discussion of sustainability assessment. In the following, we will visit the importance of system boundaries with regard to LCA and to circularity.
System boundaries and life cycle assessment. System boundaries describe the interfaces between the system and its exterior, the environment. The work by Nabavi et al. describes how system boundaries influence the assessment of sustainability aspects in dynamic systems. 155 While system dynamic models should have a broad boundary including all variables considered important, 156 it is essential to set boundaries somewhere for practical reasons. 155 However, the implications of having set these exact boundaries will play an important role throughout the entire modelling process, in particular when dealing with complex sustainability considerations.
Modelling chemical reactions from a systems perspective also strongly relies on the chosen boundaries. 2][43] In LCA, a functional unit, e.g. one kg of the desired product, is taken as reference, and boundaries are drawn, e.g. cradle-gate-grave, often as wide as possible to follow the material and energy flows. Associated with these are the environmental burdens, which can be summed up for impact categories such as the cumulative energy demand (CED), the global warming potential (GWP), human toxicity, or land use. Social or socio-economic life cycle assessment (S-LCA) incorporates social indicators and describes impacts such as working hours and local employment. 157,158 LCA requires the specification of a system boundary on various levels, the ultimate one being the boundary between nature and the technical system. 159 The scope of investigation requires further boundaries, such as the temporal and the geographic dimensions. Additionally, boundaries are chosen when deciding on metrics of interest (will impacts on aquatic life be within or outside the system boundary?). The tool named "strategic life cycle management" utilises sustainability principles as system boundaries and aims to provide an even wider overview. 160 A reaction network is a subsystem which is in exchange with its environment across a system boundary, see Fig. 8. Decision-making requires the assignment of assessment aspects to flows that are in exchange with the environment, e.g. what are the monetary values of mass flows, or what is the availability of or the demand for mass flows in the geographic context. The choice of system boundary thereby strongly frames the problem, and the assessment aspects at the system boundary, e.g. how useful, toxic, or expensive the streams crossing the boundary are, strongly impact the results of any study. As of now, it is difficult to associate these aspects with the large numbers of molecules that large-scale network data involve. However, semantic web and knowledge graph technologies are envisioned for scientific and chemical data. They can lead the community towards a future where assessment may be easily associated with a large diversity of chemical species.
At present, a chemical reaction network is commonly built based on: (i) one or more feedstock molecules of interest, and (ii) one or more product molecules of interest. Reactions connecting feedstock(s) and product(s) are then introduced manually from literature review or automatically from electronic databases or reaction generators, sometimes constrained by a maximum number of reaction steps. Some queries are open-ended on the product or the feedstock side, allowing questions such as which is the best feedstock to produce product X, or which valuable products can be produced from feedstock Y. The system boundary can now be further specified, e.g. which species can be exchanged with the environment. Assessment metrics describe the exchange at the system boundary.
System boundaries and circular economy. Nowadays, many industries strive for different levels of circularity in their supply chains. 161,162 Through industrial symbiosis, companies can exchange material flows, allowing by-products from one industrial process to become the feedstock of another and thereby closing material loops. 163 In contrast to closing the loop through technology, biodegradable products close the biological loop. 163 Including circularity in system modelling requires careful evaluation of system boundaries. Fig. 9 outlines that the aspects of circularity may lie within the system boundary, e.g. recycle streams within multi-step reactions, or outside of the current system boundary, e.g. similarly to BASF's Verbund system, ¶¶¶ companies or geographical regions can exchange material flows. While internal system circularity, see Fig. 9 (left), influences the necessary exchange of flow quantities, external circularity, see Fig. 9 (right), influences the assessment metrics. Utilising a material as system input which is an output from another system can contribute to the overall reduction of waste and minimisation of raw material use, e.g. by substituting for either fossil or renewable feedstocks. This should be taken into account when evaluating reaction pathways.
Molecular circularity indicators are required to inform on alternative utilisation possibilities. The Ellen MacArthur Foundation has shaped the discussion on circularity indicators, introducing the material circularity indicator to aid assessing material flows both at product and at company level 164 and its tool Circulytics, ‖‖‖ which measures circularity for businesses. Additionally, the World Business Council for Sustainable Development has introduced the circular transition indicators as a quantitative framework to measure sustainability for businesses. With respect to material circularity in chemical engineering, Razza et al. provide metrics for biobased and biodegradable products, emphasising the biological cycle (extraction of renewable feedstock - use - composting or biodegradation in soil) in contrast to technological recycle steps. 165 Most notably, Lokesh et al. have extended common green chemistry metrics towards capturing circularity aspects. 166 We note the need for a waste stream database which records large waste streams over various industries and maps functional molecules and pretreatment to the streams, allowing circularity assessment of chemical loops.
Fig. 8 Exchange of mass and energy at the system boundary from reaction networks (systems) and the economy, society and planetary boundaries. Mass and energy exchange with the environment can be assessed in multiple dimensions, e.g. the value of a mass flow leaving the system can be determined based on its monetary value, the demand for it or its environmental impact. Assessment criteria are influenced by the wider environment, e.g. energy markets and availability of renewable sources vary at different geographic locations and in time.
Knowledge of the demand for and supply of material streams across the boundary (their usefulness) will allow for the design of circular chemical solutions. For example, in view of a circular chemical supply chain, it is impossible to measure and compare the circularity of two competing reaction pathway options without considering their respective environments. One reaction pathway may seem less competitive due to a high amount of waste material generated throughout the processes. If, however, such material is in demand across the system boundary, the reaction pathway becomes increasingly competitive and the entire system more circular. Note that circularity is not a molecular property and hardly a reaction property, but rather a system property, which can only be quantified through data-based descriptions of the environment of a system.
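The dependence of a pathway's apparent wastefulness on its environment can be made concrete with a small sketch. It assumes a hypothetical variant of the E factor in which output streams that are in demand across the system boundary are not counted as waste. The function name, the stream names, and all masses are invented for illustration, and this boundary-aware variant is our assumption rather than an established metric.

```python
def boundary_aware_e_factor(output_streams, product, demanded=frozenset()):
    """Waste mass per mass of product, where streams in demand across the
    system boundary are treated as co-products instead of waste.

    output_streams: dict mapping stream name -> mass (same unit for all)
    product:        name of the desired product stream
    demanded:       names of streams with a known demand outside the system
    """
    product_mass = output_streams[product]
    waste = sum(
        mass
        for name, mass in output_streams.items()
        if name != product and name not in demanded
    )
    return waste / product_mass

# Invented output streams of one reaction pathway, in kg.
streams = {"product": 10.0, "by-product A": 6.0, "by-product B": 4.0}

isolated = boundary_aware_e_factor(streams, "product")
embedded = boundary_aware_e_factor(streams, "product", demanded={"by-product A"})
```

The same pathway scores 1.0 when viewed in isolation and 0.4 once a neighbouring system takes up by-product A, which illustrates why the ranking of pathways cannot be separated from a description of their surroundings.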

Sustainability metrics for reaction networks from large databases
Ideally, a detailed analysis of social and environmental impacts in an (S-)LCA should be performed to evaluate reaction routes, yet early-stage reaction data is at present not sufficient for such scopes of analysis. Sustainability assessment is ideally performed early on, as it allows for cheaper and faster implementation of process improvements [167][168][169] and for designing inherently sustainable pathways. 12 However, detailed process knowledge is often only available at the end of the process development pipeline. 170,171 3][174][175] Yet, only a few take differences in process conditions into account 167,168,176 and the possibility of completely different reaction routes within the reaction network is not discussed at present. The literature on reaction network optimisation has employed various metrics, much simpler than a full LCA, to assess the sustainability of reaction routes. The metrics utilised by Voll and Marquardt cover mass balances, energy, and cost criteria. 30 Zhang et al. utilised the enthalpy of reaction as an energy criterion, but conclude that it is not sensitive enough; they recommend the inclusion of separation processes for better performance. 34 In later works, some considerations of environmental impacts were included through energy consumption, resource consumption, emission impact, and toxicity potential, or through the CED and the GWP in the objective function. 31,32 While the previous metrics of assessment were built on manually curated data, large-scale reaction network optimisation can only work with metrics obtainable for millions of molecules and reactions in an automated manner.
Mass-based evaluation. Most early metrics within the field of green chemistry are mass-based evaluations of the reactions. The most influential ones are the AE, 177 the E factor, 178 and the reaction mass efficiency (RME). 38

\[
\mathrm{AE} = \frac{\text{mass of useful product}}{\text{mass of all reactants}} \times 100\% \tag{1}
\]

\[
\text{E factor} = \frac{\text{mass of total waste}}{\text{mass of useful product}} \tag{2}
\]

\[
\mathrm{RME} = \frac{\text{actual mass of useful product (yield)}}{\text{mass of all reactants}} \times 100\% \tag{3}
\]

Eqn (1)-(3) require knowledge of the following: all participating species and their molecular weights must be known, the reaction stoichiometry is required, and a differentiation between products and waste must be possible. If circularity is a premise for sustainable processes, we will need to reassess the binary classification into waste and product for such metrics. Eqn (3) also requires information about product yield. More detailed mass-based metrics, e.g. the mass intensity (MI) or process mass intensity (PMI), include the use of solvents, catalysts, and other substances, leading to a more holistic assessment of the reactions. 150

\[
\mathrm{MI} = \frac{\text{mass of all materials excluding water}}{\text{mass of product}} \tag{4}
\]

\[
\mathrm{PMI} = \frac{\text{mass of all materials including water}}{\text{mass of product}} \tag{5}
\]

As discussed in the section on Inference of reaction structures, stoichiometry and participating species are often not known and require computationally expensive atom mapping for completion. The section on Inference of reaction outcomes explains that predictions of yield are at present impossible for generic sets of reactions. Furthermore, the differentiation between waste and product is complicated in the context of larger system boundaries and circular economy, where waste streams are seen as potential feedstocks. Improvements in information extraction and data inference can make much more data available in the future.
Evolving towards a linked data structure will make evaluations of molecules at system boundaries much easier in reaction networks. Eqn (4) and (5) require additional information on the masses of all involved materials, which also necessitates advanced information extraction techniques for chemical entity recognition.
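To make the data requirements of eqn (1)-(5) tangible, the metrics can be written as short functions. This is a minimal sketch under stated assumptions: the function names and input layout are hypothetical, the AE is evaluated from stoichiometric coefficients and molar masses, and the batch masses in the usage example are invented. Only the molar masses of acetic acid, ethanol, and ethyl acetate are real values, rounded to two decimals.

```python
def atom_economy(product, reactants):
    """Eqn (1) from stoichiometry; entries are (coefficient, molar mass in g/mol)."""
    n_p, mw_p = product
    return 100.0 * n_p * mw_p / sum(n * mw for n, mw in reactants)

def e_factor(waste_mass, product_mass):
    """Eqn (2): mass of total waste per mass of useful product."""
    return waste_mass / product_mass

def reaction_mass_efficiency(actual_product_mass, reactant_masses):
    """Eqn (3): isolated product mass relative to the mass of all reactants, in %."""
    return 100.0 * actual_product_mass / sum(reactant_masses)

def mass_intensity(material_masses, product_mass, include_water=False):
    """Eqn (4) with include_water=False (MI), eqn (5) with include_water=True (PMI).

    material_masses maps material name -> mass; the entry named 'water'
    is skipped for MI.
    """
    total = sum(
        mass for name, mass in material_masses.items()
        if include_water or name != "water"
    )
    return total / product_mass

# AE of the esterification acetic acid + ethanol -> ethyl acetate (+ water).
ae = atom_economy((1, 88.11), [(1, 60.05), (1, 46.07)])

# Invented batch record in kg, for illustration only.
materials = {"reactant 1": 10.0, "reactant 2": 8.0, "solvent": 30.0, "water": 12.0}
rme = reaction_mass_efficiency(12.0, [10.0, 8.0])
mi = mass_intensity(materials, 12.0)
pmi = mass_intensity(materials, 12.0, include_water=True)
```

Each argument corresponds to a piece of information that must be extracted or inferred per reaction: stoichiometry and molar masses for the AE, yields for the RME, and solvent and auxiliary masses for the MI and PMI.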
Assuming reaction data is available, some automated tools can be used to determine mass-based metrics in reaction synthesis plans, e.g. the environmental assessment tool for organic synthesis (EATOS), **** the American Chemical Society PMI (prediction) calculator, †††† or Andraos' algorithm. 179,180 The EATOS and Andraos' method were found to be the most rigorous for material efficiency metrics 180 and the Green star 181,182 and the University of Toronto green chemistry initiative method 183 were recommended for environmental and hazard impacts in introductory analysis. 184 Key challenges for any of the given algorithms at present are: firstly, approximations if data is missing, e.g. general scaling factors for the required masses of organic solvents and aqueous washes, 180 secondly, the comparability to biotransformation synthesis, 180 thirdly, the evaluation of only linear synthesis trees or synthesis networks, 185,186 and last but not least, the integration of recycles for solvents and catalysts. 179 Simplified algorithms for linear and tree cases were introduced 179,186 and applied in a reaction network. 26
Exergy-based evaluation. Exergy is the maximum amount of work that can be extracted from a system when the system is brought to thermodynamic equilibrium with components of the natural environment through reversible processes. 187 It is a measure of energy quality, as it quantifies the ability of a form of energy to do physical work. Exergy destruction is proportional to the entropy generated due to irreversible processes. 188 Thus, exergy destruction is a measure of degradation of both energy and material in a system. 189 Exergetic analysis has been linked to both the environmental and the economic aspects of sustainability. From the environmental perspective, the concept of exergy has been positively highlighted as it takes the natural environment into account as a reference state. 190 Ao et al., however, stress that more work needs to be done before exergy is widely accepted as an environmental impact indicator. 191 From the economic perspective, it has been noted that exergy can be strongly linked to costs through exergoeconomics. 192 Labour and capital costs for processes can be included in exergy evaluation 193 and it may be the most useful function for solving cost-optimisation problems. 194 For more information on exergy as a process and/or sustainability indicator, we refer the reader to the reviews by Dewulf et al. and Romero and Linares. 190,193 In the context of reaction network optimisation, in our previous work, we utilised an exergy assessment for ranking reaction routes. 26 To describe a reaction, we included both the physical and chemical exergy of participating species and further evaluated the exergy requirements for process heating and separation. 26 Exergetic analysis was applied to rank 15 reaction route options after a priori removal of reactions with insufficient data.
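The quantities named in this subsection can be stated compactly using the standard textbook relations. The symbols below are our notation and an assumption of this sketch: $b$ denotes specific exergy, $h$ and $s$ specific enthalpy and entropy, the subscript $0$ the reference environment at temperature $T_0$, and kinetic and potential contributions are neglected.

```latex
% Specific exergy of a stream as the sum of physical and chemical parts
b = b_{\mathrm{ph}} + b_{\mathrm{ch}}

% Physical exergy relative to the reference environment (T_0, p_0)
b_{\mathrm{ph}} = (h - h_0) - T_0 \, (s - s_0)

% Gouy-Stodola relation: exergy destruction is proportional to entropy generation
\dot{B}_{\mathrm{destroyed}} = T_0 \, \dot{S}_{\mathrm{gen}}
```

The last relation is the quantitative form of the proportionality between exergy destruction and generated entropy, with the environmental temperature $T_0$ as the proportionality constant.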
Exergy-based analysis at large scales requires automated retrieval of thermodynamic data. Physical exergies can be computed based on specific heat capacities retrieved from the software COSMOtherm RS, while the computation of chemical exergies poses a larger challenge. Approaches utilising linear regression models for specific types of molecules, e.g. solid or liquid fuels, [195][196][197][198] more advanced ML models, [199][200][201][202] and group contribution techniques 203 have been proposed. Promising for large-scale data, an atomic contribution model was shown to provide a generic framework for simple, yet relatively accurate estimations of the standard molar chemical exergies. 204 An alternative is a prediction of the Gibbs free energy of formation for compounds, e.g. through the Joback method as in ref. 26, from which the chemical exergy can be calculated based on the tabulated exergies of elements. 187,205 In the future, we expect graph convolutional neural networks to predict the necessary properties to a high accuracy. 206,207
Early-stage assessment. To a certain extent, simple chemical rules can substitute for the computation of data-intensive metrics at present. Especially for large-scale datasets, manual data curation from simulations and/or experiments is not an option. In our previous work, we have hence introduced a few simple chemical heuristics, which can be utilised to provide a rough filtration of reaction routes. 26 For instance, datasets may be screened for reactions that have a minimum number of records, making them more reliable, or which report a yield value above a certain threshold, making them more efficient. Further heuristics utilise the chemical structure of the materials and may be applied, for example, to exclude aromatics or certain heteroatoms. We outline an extended list of example heuristics in Table 3. While efficiency potentials may contribute to the environmental and economic dimensions, toxicity potentials shed light on social and environmental issues, and the reliability of the data can bring advantages in social and economic perspectives through faster and safer process development. Note that one heuristic can also cover multiple potentials.
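The screening heuristics described above lend themselves to a simple filter over reaction records. The following is a minimal sketch, assuming a hypothetical record layout (`n_records`, `yield`, and a list of SMILES strings under `species`) and arbitrary thresholds. The aromaticity and heteroatom checks are deliberately crude string tests on SMILES (lowercase `c` for aromatic carbon, `Cl` and `Br` as example excluded atoms), not a substitute for a cheminformatics toolkit; aromatic rings written in Kekulé form would slip through.

```python
import re

# Lowercase 'c' that is not the second letter of an element symbol such as 'Sc'.
AROMATIC_CARBON = re.compile(r"(?<![A-Z])c")
# Example heteroatoms to exclude; the choice is an assumption for illustration.
FORBIDDEN_ATOMS = ("Cl", "Br")

def passes_heuristics(record, min_records=3, min_yield=50.0):
    """Return True if a reaction record survives all screening heuristics."""
    if record["n_records"] < min_records:                        # reliability
        return False
    if record["yield"] is None or record["yield"] < min_yield:   # efficiency
        return False
    for smiles in record["species"]:                             # structure-based
        if AROMATIC_CARBON.search(smiles):
            return False
        if any(atom in smiles for atom in FORBIDDEN_ATOMS):
            return False
    return True

# Invented records for illustration.
records = [
    {"id": "R1", "n_records": 5, "yield": 82.0,
     "species": ["OCC", "CC(=O)O", "CCOC(C)=O"]},
    {"id": "R2", "n_records": 1, "yield": 95.0, "species": ["CCO"]},
    {"id": "R3", "n_records": 8, "yield": 70.0,
     "species": ["c1ccccc1", "CC(=O)Cl"]},
    {"id": "R4", "n_records": 4, "yield": None, "species": ["CCO"]},
]

kept = [r["id"] for r in records if passes_heuristics(r)]
```

Because none of these tests needs stoichiometry or balanced reactions, such a filter can run over incomplete database records before any of the more data-intensive metrics are attempted.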

Decision making
Optimisation algorithms have proven to be a reliable tool for optimal decision-making in complex problems, in particular in complex network structures. Within reaction network optimisation, decisions on the sequence of reactions from feedstock molecules to target species are required. There commonly exists a variety of possible reaction sequences to connect different molecules, see Fig. 10, and appropriate algorithms can make decisions based on the metrics discussed in the previous section. Strategies to solve the optimisation problem depend on its underlying network structure. What characterises the problem of reaction network optimisation, however, is that several products and reactants connect to one reaction, cf. Fig. 10, which can lead to complex and cyclic network structures.
Decision-making in network structures has been broadly explored in many different fields. 4][215] In the following, we will briefly review a selection of related fields and highlight their similarities and differences to reaction network optimisation. We will then visit recent literature on reaction network optimisation, which takes the fully connected structure of chemical reaction networks into account. In Table 4 we explain domain-specific terminology.

Decision-making in network structures
Many decision-making problems can be represented by different network structures where fluxes or connections are optimised. The bipartite reaction network may be approximated by graph projections to reactions or molecules, see Fig. 10, which in turn allow for certain search strategies.
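The two projections of a bipartite reaction network mentioned above can be sketched as follows. The network, the species labels, and the function names are hypothetical and chosen only for illustration; each reaction is stored as a pair of reactant and product tuples.

```python
from collections import defaultdict

# Hypothetical bipartite network: reaction -> (reactants, products).
reactions = {
    "R1": (("A", "B"), ("C", "D")),
    "R2": (("C",), ("E",)),
    "R3": (("D", "E"), ("F",)),
}

def molecule_projection(reactions):
    """Directed molecule graph: an edge reactant -> product for every reaction."""
    edges = defaultdict(set)
    for reactants, products in reactions.values():
        for reactant in reactants:
            edges[reactant].update(products)
    return dict(edges)

def reaction_projection(reactions):
    """Directed reaction graph: R_i -> R_j if a product of R_i is a reactant of R_j."""
    edges = {name: set() for name in reactions}
    for upstream, (_, products) in reactions.items():
        for downstream, (reactants, _) in reactions.items():
            if upstream != downstream and set(products) & set(reactants):
                edges[upstream].add(downstream)
    return edges

molecules = molecule_projection(reactions)
steps = reaction_projection(reactions)
```

Both projections discard information: the molecule graph no longer records that A and B are needed together for R1, and the reaction graph no longer shows which species link two steps. This loss is the reason such projections are approximations of the bipartite network.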
Navigation systems, such as Google Maps, explore the shortest path between two endpoints in a weighted network, where the weighting is the distance or the time required to travel between the points. The algorithm behind navigation systems is often based on the Dijkstra algorithm invented by Edsger Dijkstra in 1959. The algorithm works on a weighted graph, visits each node of the graph, and updates a table of the shortest distances from a selected starting node to all others. Its time complexity is O(|E| + |V|log|V|), where |E| is the number of edges and |V| is the number of vertices. 208 An extension of the Dijkstra algorithm is the A* (A-star) algorithm, which changes the way the algorithm selects the next node to visit. While the Dijkstra algorithm makes this selection based on the cost between the start node and the next node, the A* algorithm adds a heuristic function which estimates the cost from the node to be chosen to the target node. 209
Table 3 Outline of possible heuristics for large-scale screening. Note that all heuristics are independent from stoichiometry and need to be adjusted based on the problem formulation. The list is by no means complete and the functions and potentials listed are exemplary
While these routing algorithms bring about benefits in implementation and scale-up, they are not directly applicable to reaction networks. Fig. 10 illustrates simplifications of an illustrative reaction network to a directed and weighted network focusing either on molecules or on reactions. The shortest pathway search may be able to capture parts of the sustainability considerations in the weights, but it lacks the systems perspective, in which co-products/waste and the source of all co-reactants are regarded as inherently connected to the chosen route. Furthermore, the meaning of weights in reaction networks depends on the case study, as they could be the emissions produced, the costs generated, chemical similarity, or any valid combination reflecting our understanding of sustainability.
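A shortest-path search on such a weighted molecule projection can be sketched with a binary heap. The graph, the node names, and the weights below are invented for illustration; the weights stand for whatever single cost the case study assigns to a step. Note that this binary-heap variant runs in O((|E| + |V|) log|V|); the bound quoted above is attained with a Fibonacci heap.

```python
import heapq

def dijkstra(graph, source, target):
    """Return (cost, path) of the cheapest route from source to target.

    graph maps node -> {neighbour: non-negative edge weight}.
    Returns (inf, []) if the target cannot be reached.
    """
    dist = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    visited = set()
    while queue:
        cost, node = heapq.heappop(queue)
        if node in visited:
            continue                      # stale queue entry
        visited.add(node)
        if node == target:
            break
        for neighbour, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return dist[target], path[::-1]

# Hypothetical weighted molecule projection (weights in arbitrary cost units).
graph = {
    "feedstock": {"A": 2.0, "B": 5.0},
    "A": {"B": 1.0, "target": 6.0},
    "B": {"target": 2.0},
    "target": {},
}

cost, route = dijkstra(graph, "feedstock", "target")
```

The route found is the cheapest linear sequence only. Co-reactants that each step requires and co-products that each step releases do not appear in the result, which is the missing systems perspective noted above.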

Tree-based network searches are utilised in the field of automated retrosynthesis planning. Retrosynthetic analysis describes the task of transforming the structure of a synthetic target molecule into known and simpler starting molecules by constructing a sequence of molecule deconstructions. 216 Traditionally, chemists applied iterative cycles of logical analysis and perception to the target compound and the available data space. 217 In automated, or computer-aided, retrosynthesis, an algorithm proposes the most suitable synthesis route. The search space resembles a tree with molecules as nodes and reactions as edges and is intractably large; the decision-making task is the identification of the most suitable branches within the tree. Solving retrosynthetic trees has been largely inspired by tree problems in games such as Chess or Go; however, retrosynthetic trees are considerably different, as they are usually shallower (approximately 10-20 steps) but have a higher branching factor (around 200 options at each node). 210 Segler et al. demonstrated a Monte Carlo Tree Search (MCTS) with three deep neural networks: two ANNs for reaction rule extraction and one as a reinforcement framework. 210 Kishimoto et al. investigate two common search techniques within the domain: MCTS and depth-first proof-number (DFPN) search. They find that the enhanced MCTS by Segler et al. outperforms common DFPN; however, they propose a new DFPN with heuristic rules for edge initialisation, which outperforms Segler's algorithm regarding time complexity and delivers equivalent success rates. 211 Coley et al. utilise molecular similarity to inform edge choices 221 and demonstrate how a learned synthetic complexity metric can help to navigate the exponentially growing search space. 212 After further development, Segler et al. presented an algorithm based on ANNs and symbolic artificial intelligence, which was shown to produce routes that chemists found, on average, equivalent to the routes reported in the literature. 222 Their algorithm has recently been commercialised by Elsevier and Pending.AI as Reaxys Predictive Retrosynthesis.
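To make the tree structure of the retrosynthetic search space concrete, the sketch below expands a target molecule backwards until only purchasable building blocks remain. It is a plain uniform-cost best-first search, not one of the neural-network-guided methods cited above, and the disconnection rules, molecule names, and step costs are all hypothetical.

```python
import heapq
from itertools import count

# Hypothetical one-step disconnections: product -> [(precursors, step cost)].
RULES = {
    "target": [(("I1", "bb1"), 1.0), (("I2",), 1.0)],
    "I1": [(("bb2", "bb3"), 1.0)],
    "I2": [(("I3",), 1.0)],
    "I3": [(("bb1", "bb2"), 1.0)],
}
# Hypothetical set of available starting molecules.
BUILDING_BLOCKS = {"bb1", "bb2", "bb3"}

def best_first_retrosynthesis(target, max_depth=10):
    """Return (cost, route) of the cheapest route found, or None."""
    tie = count()  # tie-breaker so the heap never compares routes
    queue = [(0.0, next(tie), (target,), [])]
    while queue:
        cost, _, open_molecules, route = heapq.heappop(queue)
        unsolved = [m for m in open_molecules if m not in BUILDING_BLOCKS]
        if not unsolved:
            return cost, route          # every leaf is a building block
        if len(route) >= max_depth:
            continue                    # prune overly deep branches
        molecule = unsolved[0]
        rest = tuple(m for m in open_molecules if m != molecule)
        for precursors, step_cost in RULES.get(molecule, []):
            step = (molecule, precursors)
            heapq.heappush(
                queue,
                (cost + step_cost, next(tie), rest + precursors, route + [step]),
            )
    return None

retro_result = best_first_retrosynthesis("target")
```

In a realistic setting the branching factor makes exhaustive expansion impossible, which is why the cited works replace the uniform step cost with learned policies or heuristics that rank the candidate disconnections.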
Despite differences in network structure, fast search strategies from the field of computer-assisted retrosynthesis will become immensely valuable once data and metric hurdles are overcome. For sustainability considerations, the task in reaction networks is the optimisation of the entire system, including co-products and co-reactants, rather than of one synthetic pathway. While retrosynthetic analysis aims towards any known, simple, and cheap starting molecules, requiring starting molecules to fulfil sustainability considerations largely constrains the search space for sustainable pathway identification. Thus, the network topology resembles two branching trees that meet in the middle (see the description of the "forward-backward" network built by ref. 223), and the overall aim is to optimise the entire system. Nevertheless, techniques from the field of ML-based retrosynthesis will inspire the development of new methods to handle large reaction networks. To take advantage of the full potential of ML-based techniques for decision-making in the chemical domain, the need for explainable artificial intelligence has been emphasised in a recent review article on drug discovery. 224 The authors identify the current lack of an open-community platform but highlight the potential of explainable artificial intelligence for the discovery of novel bioactive compounds. 224 Similarly, for computational tools to identify novel reaction pathways, we would expect a faster uptake within the community if solution strategies are comprehensible to chemists and chemical engineers.
Other network systems from which algorithms can be explored are batch/job scheduling problems. Here, multiple inputs and multiple outputs are taken into account per batch, and the sequential order of the performed reactions is considered. In process industries, consumer products are produced by sequential processing of chemical and physical tasks (in our case a chemical reaction, but generally any kind of task). Tasks require different process units and different storage facilities for inputs and outputs, which constrain the solution space. 213 In single-machine batching and scheduling problems, a set of jobs needs to be processed by one machine, where jobs of a similar type can be processed together and jobs from different families separately. 215 The capacity of the machine as well as processing time and heating or cooling requirements strongly constrain the feasible region. 215 When working with multiple pieces of equipment, problems are further constrained by sequential requirements, e.g. some pieces of equipment are always used before others, and by equipment interference problems, e.g. certain tasks cannot be performed simultaneously. 214

While reaction networks and batch scheduling jobs exhibit similarities in their network structure, it is worthwhile to note their divergence in problem specifics, such as interference constraints, a large variety of different types of tasks, and task-specific constraints such as cooling/heating. However, algorithms from the mature field of batch/task scheduling will become beneficial when automated large-scale reaction network optimisation develops from conceptual early-stage design towards different implementation levels, considering supply chains and production planning. Very promising concepts for this are integrated decision-making strategies. 225

Table 4 Explanation of specific terminology in decision-making (continued)
Deterministic optimisation: Deterministic optimisation means all algorithms based on a rigorous mathematical approach, which will lead to the same solution space when run multiple times with the same system parameters.
Heuristic function: A heuristic function approximates certain parts of a problem in order to solve the problem more quickly. Precision is traded for speed.
Linear programming: Linear programming (LP) problems consist of a linear objective function and linear constraints.
Mixed-integer programming: In mixed-integer programming, discrete variables are added to the continuous variables used within the objective function and the constraints.
Objective function: The objective function describes the value to be optimised. It is a real-valued function, which by general convention is minimised over the system variables.
Relaxation: A relaxation is an underestimation of a more complicated system by a simpler system. In optimisation, relaxations can transform hard problems into approximated, yet solvable ones.

Pathway optimisation in integrated biorefineries
Identifying the most promising pathway alternatives for the production of chemicals from renewable feedstock has been the focus of superstructure optimisation for integrated biorefineries. A superstructure describes a network of technologies, in particular a process diagram with all hypothetically useful units and connections. 226,227 The advantage of optimising the superstructure, e.g. of processes and streams in a biorefinery, is that complex interactions between different design choices are considered. However, a rich structure is necessary, which requires much data and often leads to large-scale, non-convex, mixed-integer nonlinear programming models. 226 Superstructure problems can be formulated by distinct programming models (i.e. disjunctive programming). 228 One approach is a formulation as a mixed-integer nonlinear programming (MINLP) problem. 226,227 Giuliano et al. optimise a superstructure for levulinic acid, succinic acid, and ethanol production from lignocellulosic biomass. 227 In their approach, rigorous process models account for significant nonlinearities, leading to the MINLP formulation. Their problem is linearised to a MILP problem through variable discretisation methods. Kong et al. optimise a superstructure including heat integration and utility plant design as an MINLP problem, for which they propose a set of solution methods to speed up the computation. 226 Nonlinearities are introduced by processing unit models, in which outlet material flows and outlet temperatures, as well as heat and electricity requirements, are nonlinear functions. Alternatively, Garcia and You describe their product and process network by an NLP. 229 Nonconvex terms are caused by economic considerations such as capital expenditures. They utilise a piecewise linear approximation, leading to a more easily solvable MILP problem. 229 Some works formulate the interdependencies through linear models. 223,230

Additionally, most works handle conflicting objective functions through a multi-objective framework. Andiappan et al. formulate a multi-objective optimisation of the superstructure for an integrated biorefinery, addressing possible trade-offs between economic and environmental objectives through two approaches. A bi-level formulation maximises the gross profit on the upper level, subject to the minimisation of the environmental burden and the reaction heat on the lower level. Alternatively, fuzzy optimisation is extended by introducing upper and lower bounds for the factor lambda, which accounts for the satisfaction of all three objectives. 230 Garcia and You utilise the epsilon constraint method to allow for multiple objectives. 229
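The epsilon constraint method mentioned above can be illustrated with a small sketch: one objective is optimised while the other is bounded by a value epsilon, and sweeping epsilon traces the trade-off curve. The candidate designs and their profit and emission values below are hypothetical, and the inner optimisation is a simple enumeration standing in for the MILP or NLP solve that the cited works would perform.

```python
# Hypothetical candidate designs: name -> (profit, emissions), arbitrary units.
DESIGNS = {
    "route_A": (10.0, 8.0),
    "route_B": (8.0, 5.0),
    "route_C": (6.0, 5.5),
    "route_D": (4.0, 2.0),
}

def epsilon_constraint(designs, epsilons):
    """For each epsilon, maximise profit subject to emissions <= epsilon."""
    front = []
    for eps in epsilons:
        feasible = {n: v for n, v in designs.items() if v[1] <= eps}
        if not feasible:
            continue  # no design satisfies this emission bound
        # Highest profit wins; ties are broken in favour of lower emissions.
        best = max(feasible, key=lambda n: (feasible[n][0], -feasible[n][1]))
        if best not in front:
            front.append(best)
    return front

pareto_routes = epsilon_constraint(DESIGNS, epsilons=[2.0, 4.0, 6.0, 8.0])
```

Designs that are never selected for any epsilon, such as the hypothetical route_C here, are dominated: another design offers more profit at lower or equal emissions.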

Early-stage pathways optimisation in reaction networks
Early-stage evaluation methods stand in contrast to rich superstructures with rigorous unit operation models, technologies, and utility integration. The most promising reaction pathways are estimated at an early stage without rigorous process models of different technologies. One example of such an early-stage approach is Bao et al.'s short-cut method for the preliminary synthesis of process technology pathways. 231 They propose a chemical species/conversion operator diagram, which they optimise through an NLP model. Nonlinearities are introduced through the flowrates of species entering and leaving conversion operators and through annualised costs of conversion. Instead of rigorous models, they assess various conversion technologies through characteristics such as yield and cost. 231 Further early-stage methods will be discussed in the following three sections.
Reaction network flux analysis. Optimal reaction pathways for the conversion of renewable feedstocks are often examined by the approximate method RNFA. 30,232 The RNFA is inspired by earlier works on metabolic networks 29,233 and models mass flows and reactions through linear balance equations for all components. Here, sink and source terms represent supply and demand. To model the reactions, all participating species and the stoichiometry of all reactions need to be known. The RNFA does not account for mixing and separation.
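The linear balance equations can be written as b = S v, where S is the stoichiometric matrix, v the vector of reaction fluxes, and b the net source or sink of each species. The sketch below evaluates this balance for two candidate flux distributions in a hypothetical four-species network; the species, coefficients, and the choice of waste as the objective are assumptions for illustration, and a real RNFA would let an LP solver search over v instead of comparing hand-picked candidates.

```python
SPECIES = ["F", "I", "P", "W"]  # feedstock, intermediate, product, waste
# Stoichiometric matrix: rows follow SPECIES, columns are the reactions
# r1: F -> I,  r2: I -> P + W,  r3: F -> P + 2 W  (hypothetical coefficients).
S = [
    [-1,  0, -1],
    [ 1, -1,  0],
    [ 0,  1,  1],
    [ 0,  1,  2],
]

def net_production(stoich, fluxes):
    """Return S*v: the net amount of each species produced by the fluxes."""
    return [sum(s * v for s, v in zip(row, fluxes)) for row in stoich]

def is_feasible(balance, demand):
    """Intermediates must balance to zero; the product sink must meet demand."""
    by_name = dict(zip(SPECIES, balance))
    return by_name["I"] == 0 and by_name["P"] >= demand

# Two hand-picked flux distributions (fluxes of r1, r2, r3).
CANDIDATES = {"route_1": [1, 1, 0], "route_2": [0, 0, 1]}
balances = {name: net_production(S, v) for name, v in CANDIDATES.items()}
feasible = {n: b for n, b in balances.items() if is_feasible(b, demand=1)}
# Among the feasible candidates, prefer the one generating the least waste.
best_route = min(feasible, key=lambda n: feasible[n][SPECIES.index("W")])
```

Negative entries of the balance correspond to source terms (feedstock drawn into the network) and positive entries to sink terms (product and waste leaving it).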
While the core of the problem formulation lies in an LP formulation for mass balances that can be efficiently solved in polynomial time (e.g., using state-of-the-art solvers like CPLEX), 234,235 integers have been introduced to account for the activity of fluxes, resulting in a MILP problem. 232 Also, alternative optima were identified through the integer constraints, 30,232,236 while at present CPLEX can already account for alternative solutions in LPs without manual extension to MILP formulations. In the work of Besler et al., knowledge on active fluxes has also been used to describe non-flux-related costs for reaction pathways, e.g. toxicity. In the work by Dahmen and Marquardt, the RNFA is combined with a model for computer-aided molecular design for mixtures, which resulted in an NLP model. Nonlinearities were introduced through the consideration of mixtures, which requires the solver to recompute properties of compositions at each step. 237 Notably, the RNFA can lead to degenerate solutions when components are consumed and generated in cycles (e.g., equilibrium reactions or protecting groups). Similarly, huge recycle streams can occur as separation is simplified. The RNFA has been successfully applied to identify optimal reaction pathways for biofuel and biopolymer synthesis. 31,34

Process network flux analysis. An extension of the RNFA is the process network flux analysis (PNFA), 33 in which pseudo-components and pseudo-reactions are introduced to represent mixing and separation fluxes. For this, all possible mixtures and potential separation tasks are identified a priori, modelled through short-cut methods, and included as pseudo-components and pseudo-reactions. The PNFA resembles the superstructure optimisation problems as it aims to include more detailed process knowledge. Operating cost or energy demand of separations are considered through pre-computed energy demands. In addition, binary variables that are active when the respective flux is greater than zero are introduced for all equipment using big-M formulations, allowing the number of process units to be estimated. The investment costs are considered through binary variables and nonlinear cost correlations. Multiple objectives are taken into consideration by the epsilon constraint method. The overall problem results in an MINLP problem that can be solved using deterministic global solvers like BARON 238,239 or MAiNGO. 240 However, solving nonlinear programs is often NP-hard and thus limited to small problem instances. The PNFA, formulated using GAMS 241 and solved by BARON, has been successfully used for biofuels production, 33 also including the biomass supply chain, 32 and for pathway considerations for biofuel product design. 242

Petri net optimisation. An alternative modelling approach for the optimisation of pathways in reaction networks is a Petri net. A Petri net explicitly takes the reaction sequence into account, which can be an important factor during optimisation. Petri nets were first introduced by Carl Adam Petri 243 and are a type of directed bipartite network. In bipartite networks, two node types exist and links can only connect nodes of different types. In Petri nets, the node types are places (resembling molecules) and transitions (resembling reactions). Their input and output relations are shown by links, called arcs, and an incidence matrix, which records the stoichiometry. A flow between places via transitions is given by a marking of places with tokens. Such a marking describes a state of a Petri net. Tokens change from one place to another through the firing of transitions, leading to a change in state. 244,245 The Petri net optimisation (PNO) problem determines an optimal sequence of firing certain transitions, and a formulation for reaction route optimisation was presented by ref. 35 after an extension of the formulation from ref. 246. Petri nets have been used to model chemical and biological reaction networks, 243,245,247,248 while the use of PNO in chemical engineering has to date mostly focused on batch scheduling. 246,249 The MILP problem has higher model complexity than the LP core of the RNFA. Working with a PNO formulation allows a more detailed analysis of the solution space, e.g. the reaction sequence is considered, degenerate solutions are prevented, the maximum number of reaction steps is controlled, and non-flux-dependent costs can be introduced. 35 The number of continuous variables is higher by a factor of the number of reaction steps, and the MILP formulation introduces binary variables per reaction and reaction step. Additionally, the number of constraints is higher, due to constraints on the firing of the transitions in sequence.
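The token mechanics of a Petri net can be sketched in a few lines. In the example below, places hold token counts, a transition is enabled only if its input places hold enough tokens, and firing moves tokens according to the arc weights. The two-reaction network and all names are hypothetical, and the sketch only simulates a given firing sequence; it does not solve the PNO problem, which would search for the optimal sequence.

```python
# Tokens consumed (PRE) and produced (POST) per transition, keyed by place.
# Hypothetical network: r1 converts F into I, r2 converts I into P.
PRE = {"r1": {"F": 1}, "r2": {"I": 1}}
POST = {"r1": {"I": 1}, "r2": {"P": 1}}

def enabled(marking, transition):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking.get(p, 0) >= n for p, n in PRE[transition].items())

def fire(marking, transition):
    """Return the new marking after firing; the input marking is unchanged."""
    if not enabled(marking, transition):
        raise ValueError(f"transition {transition} is not enabled")
    new = dict(marking)
    for place, n in PRE[transition].items():
        new[place] -= n
    for place, n in POST[transition].items():
        new[place] = new.get(place, 0) + n
    return new

def run(marking, sequence):
    """Fire the transitions in the given order and return the final marking."""
    for transition in sequence:
        marking = fire(marking, transition)
    return marking

initial = {"F": 1, "I": 0, "P": 0}
final = run(initial, ["r1", "r2"])
```

Because r2 is not enabled in the initial marking, the order of firing matters, which is the sequence information that a purely flux-based balance does not capture.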

Uncertainty in decision-making
Data underlying decision-making algorithms often carry uncertainties, e.g. through experiments and measurements, through data inference to build complete datasets, through real-life scenarios of market prices, and through dynamic changes in supply and demand. To account for uncertainties in key parameters, either deterministic models, which describe parameter uncertainties by bounds of anticipated deviations, or stochastic programming, which takes probability distribution functions for parameters into account, are applied. 250 The field of optimisation under uncertainty already contains well-established methods, e.g. stochastic programming, robust optimisation, or fuzzy programming, which can be applied to reaction networks with uncertain data. 250,251 While stochastic programming approaches generate comprehensive solutions based on probabilities, they are often computationally expensive. Robust programming defines uncertainties as inequality constraints and is often a good alternative if probability distributions are not known. 251 For pathway selection in integrated biorefineries, some works have integrated uncertainties. Morales-Rodriguez et al. have applied stochastic process optimisation for lignocellulosic ethanol production, and Kasaš et al. outline a strategy based on stochastic programming merging four distinct solution techniques for a bioethanol production case study. 252,253 Tay et al. and Tang et al. solve MINLP problems for integrated biorefineries using robust optimisation. 251,254 Uncertainties in the aforementioned studies cover, amongst others, the market price, supply of biomass, and demand for products, as well as technological constraints.
Including uncertainties during decision-making is beneficial as it avoids non-optimal or infeasible solutions, but it requires models that inform on uncertainties within the prediction task. 250 For reaction networks, this means that uncertainties for key parameters, e.g. stoichiometry and helper species, or reaction conditions, need to be collected during the data inference stages.
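The difference between a probability-based and a worst-case treatment of uncertainty can be shown with a small scenario sketch. The two pathways, the three price scenarios, their costs, and their probabilities are hypothetical; the expected-value rule stands in for stochastic programming and the min-max rule for robust optimisation, both in a strongly simplified form.

```python
# Hypothetical cost of each pathway under three market price scenarios.
COSTS = {
    "bio_route": {"low": 4.0, "base": 6.0, "high": 14.0},
    "waste_route": {"low": 7.0, "base": 8.0, "high": 9.0},
}
# Hypothetical scenario probabilities (needed only for the stochastic rule).
PROBABILITIES = {"low": 0.3, "base": 0.5, "high": 0.2}

def expected_cost(costs):
    """Probability-weighted cost over all scenarios."""
    return sum(PROBABILITIES[s] * c for s, c in costs.items())

def worst_case_cost(costs):
    """Cost in the least favourable scenario; no probabilities required."""
    return max(costs.values())

stochastic_choice = min(COSTS, key=lambda r: expected_cost(COSTS[r]))
robust_choice = min(COSTS, key=lambda r: worst_case_cost(COSTS[r]))
```

With these assumed numbers the two rules select different pathways, which illustrates why the choice of uncertainty model is itself a design decision.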

Conclusions and perspective
The identification of sustainable reactions is a highly complex and interdisciplinary challenge. In this review we present the first multidisciplinary perspective, integrating the fields of data, metrics, and decision-making to guide and accelerate further developments. We highlight synergies between the fields and potential for future developments.
Currently, the field of data brings about the most bottlenecks and therewith the greatest potential for advancement. Data is, at present, incomplete, lacking information necessary to perform mass balances over large numbers of reactions. Furthermore, enabling linkages of various data sources, e.g. regional waste stream compositions, pretreatment options, or end-of-life use, is essential when dealing with questions of sustainability. For the field of sustainability metrics, we envision that molecular property prediction, e.g. by graph convolutional networks, will allow more accurate evaluations of different environmental metrics and that linked and accessible data sources will allow assessments across the system boundary. In the area of decision-making, we highlight the structure of reaction networks (multiple inputs and outputs, circular interactions) and the scalability of the previously suggested algorithms as the main factors of importance. Methods in the field are well-established and most likely to evolve further through smarter heuristics or ML-guided approximations, enabling the solution of system-level problems.
Our findings elucidate the interfaces between the three areas. This allows scientists to take into account possible improvements within other fields so that we can jointly work towards a more sustainable use of present resources. This also highlights the need for targeted interdisciplinary funding across the three domains. Through such targeted interventions, society will achieve a faster transition towards developing truly sustainable solutions. The contribution of this work, while conceptual, provides a roadmap towards systematic reaction pathway planning based on the rapid digitalisation of chemical data.

# 1
Data impact opportunities and action points. The first opportunity is the development of a chemical big open linked data (BOLD) structure. Emphasis lies on the coverage of freely
This journal is © The Royal Society of Chemistry 2021
by Estrada et al. for palladium-catalyzed Buchwald-Hartwig cross-coupling reactions, 121 and by Fu et al. to predict yields of Pd-catalyzed Suzuki-Miyaura reactions in a microfluidic system. 122 While Yada et al. only trained their model on 14 data points, Estrada et al. worked with a set of 4140 reaction results obtained with the aid of high-throughput screening.

Fig. 7
Fig. 7 The principle of atom mapping and how it can aid in the completion of reaction structures. The sample reaction is retrieved from Reaxys (Reaxys reaction ID: 615722). In (a), participants and stoichiometric coefficients are missing; in (b), the reaction is balanced with the helper species HCl and stoichiometric coefficients. The labels from C:1 to Cl:19 denote the type of atom and its identifier.

Fig. 9
Fig. 9 Circularity within one system (left) and between multiple systems (right). Circularity within one system affects the flow quantity exchanged at the system boundary, e.g. less solvent is needed as input and generated as output if solvent recovery takes place. Circularity between multiple systems should affect the assessment of the exchanged flow quantity, e.g. output flows that can be used as input flows for other processes should be preferred over waste output flows.

Fig. 10
Fig. 10 Illustration of reaction routes in chemical reaction networks. The bipartite reaction network represents reactions as bar nodes (r1 to r4) and molecules as circular nodes. Decision-making is required to decide between two alternative reaction sequences (r1 and r3 vs. r2 and r4), which connect the same feedstock molecule to the same target molecule. However, different co-reactants are required and different co-products/waste are generated. The directed and weighted molecule network and reaction network are projections of the bipartite network and can serve as simplifications for shortest path search algorithms. The quantities on the edges illustrate possible weighting schemes.
while at present CPLEX can already account for alternative solutions in LPs without manual extension to MILP formulations. In the work of Besler et al., knowledge on active fluxes has also been used to describe non-flux-related costs for reaction pathways, e.g. toxicity. In the work by Dahmen and Marquardt, the RNFA is combined with a model for computer-aided molecular design for mixtures, which resulted in an NLP model. Nonlinearities were introduced
Soc. Rev., 2021, 50, 12013-12036

Table 1
Selection of large databases recording chemical reactions

Table 2
Explanation of specific terminology in data representations
uses data in graph structure. Data entities and their semantic types and properties are linked with each other. Knowledge graphs can allow machines and humans to reason from the data.
Metadata
Segler et al. demonstrated a Monte Carlo Tree Search (MCTS) with three deep neural networks: two ANNs for reaction rule extraction and one as a reinforcement framework. 210 Kishimoto et al. investigate two common search techniques within the domain: MCTS and depth-first proof-number (DFPN) search. They find that the enhanced MCTS by Segler et al. outperforms common DFPN; however, they propose a new DFPN with heuristic rules for edge initialisation, which outperforms Segler's algorithm regarding time complexity and delivers equivalent success rates.

Table 4
Explanation of specific terminology in decision-making
Big O notation: Big O notation describes how the time or memory needed to run an algorithm scales with the size of its input. It is a theoretical measure of the asymptotic behaviour of an algorithm.
Constraints: Constraints determine the feasible space in which variables can lie. They impose limitations, e.g. that material flows cannot be less than zero.
Deterministic optimisation