Towards automation of chemical process route selection based on data mining †

A methodology for chemical routes development and evaluation on the basis of data-mining is presented. A section of the Reaxys database was converted into a network, which was used to plan hypothetical synthesis routes to convert a bio-waste feedstock, limonene, to a bulk intermediate, benzoic acid. The route evaluation considered process conditions and used multiple indicators, including exergy, E-factor, solvent score, reaction reliability and route redox efficiency, in a multi-criteria environmental sustainability evaluation. The proposed methodology is the first route evaluation based on data mining, explicitly using reaction conditions, and is amenable to full automation.


Introduction
In the field of process and synthetic chemistry 'clean synthesis' has become one of the standard criteria for good, commercially viable synthesis routes.As a result synthetic and process chemists must be equipped with adequate methodologies for quantification of 'cleanness' or 'greenness' of alternative routes at the early phases of the development cycle.These new criteria, and the traditional criteria of cost, security of supply, health and safety (H&S), and risk, provide a balanced picture of sustainability of a future technology.Thus, there are two separate aspects to process chemistry: developing the chemistry and the process, and evaluating the overall process, which must occur in parallel.Evaluation of the proposed routes requires data.As data science rapidly evolves, chemistry will inevitably use more of the new tools of data mining and data analysis to automate the routine tasks, such as evaluation of process metrics.In this paper we show some initial results in automation of process evaluation based on deep data mining of process chemistry and multi-criteria decision making.
The evaluation of greenness is a mature field, with a large number of published and standardised approaches, of which many are adopted by industry. 1However, all published methods are highly case-specific and rather labour-intensive.In the field of synthetic routes development one of the most exciting new areas is the potential for automation of synthesis planning using data mining. 2What has never been attempted before is to automate route generation and evaluation in a coherent methodology, which would aid process development at the early, data-lean, stages.For this we show how to automatically generate process options using a network representation of a section of Reaxys database, 3 followed by their screening using multi-criteria decision making, see Fig. 1.As the methods mature and become commercially available, such integration and automation will produce significant savings of time, and would deliver a far more detailed view of the competing synthesis route options than is generally possible at the early stages of design.
To date, obtaining the data, assembling the network and finding potential synthesis routes can already be carried out in a fully automated fashion.Due to issues around data availability the connection to the analysis of the routes still has to be initiated manually, involving a data curation step.The subsequent analysis and multi-criteria decision making have been largely automated in this study.To our knowledge this is the first example of the analysis of synthesis routes generated from the network representation of Reaxys obtained through datamining, using reaction conditions and process data.

Process evaluation
Ultimately, life cycle assessment (LCA) should be used for rigorous evaluation of the environmental aspect of sustainability.However, for the early stage of process development mass-and energy proxy measures are frequently used.5][6] It is well understood, however, that assessment of a given synthesis route or a process by a single criterion would almost certainly fail to deliver a holistic picture of the process's sustainability.This has been widely recognised in the literature.[9][10] The indicator-based evaluation methods frequently artificially separate mass and energy streams.Fundamentally, every mass stream carries an associated energy and, thus, also is an energy stream.This separation is avoided by exergy analysis which expresses all streams crossing the system boundaries as streams of energy available to perform useful work, relative to the environment. 11In the chemical industries the use of exergy is particularly attractive, since optimisation for exergy simultaneously considers yield and energy utilisation.
Analysis of the availability of energy (exergy) is gradually finding uptake in the process industry (e.g.ref. 12 and 13) but methodologies combining exergy analysis with other, noneconomic metrics are still being developed.Li et al. attempt  this by combining exergy analysis with safety and economic assessment.Each criterion was evaluated in turn, eliminating processes if they fail a single criteria, rather than adopting a holistic approach. 14Similarly in ref. 15 a methodology is proposed to expand a set of metrics from simple mass-based indicators to include H&S, critical elements, catalysts, energy and life cycle assessment (LCA).This approach relies on generation of experimental data, which is infeasible for the screening of large datasets, the topic addressed in this paper.Dewulf has applied LCA to exergy analysis such as e.g. in ref. 16 and Fan et al. used a graph theoretical approach to generate paths from a reaction network and then sequentially screened them according to exergy dissipation, profit potential and toxicity indices. 17In addition to mass and exergy efficiency, the H&S and environmental impacts of a process, the redox performance of the reactions and their reliability are also important factors in decision making on selection of novel process routes.

Data mining in chemical route development
In 1990 Lawson and Kallies stated that the "Beilstein data […] forms an explicit network […] being equivalent to a map of practically all known synthetic pathways from almost any starting material to almost any product". 18The group of Grzybowski explored this idea, termed the Network of Organic Chemistry (NOC), based on the Beilstein database encompassing a total of 7 million reactions and 7 million substances. 19he initial versions of the NOC had simple directed edges; subsequent versions used a bipartite representation, which had separate nodes for species and for reactions.This is important in the planning of syntheses where reactions have several feedstocks or products, thus displaying the true dependency of the various species involved. 20The NOC is a timeevolving scale free network and thus exhibits behaviour very similar to that of the World Wide Web. 21As a consequence, the molecules contained within the network can largely be divided into either those belonging to the "core" or the "periphery", the two regions varying greatly in size and connectivity.The core is a cluster several thousand times larger than the next largest cluster.Although it only contains 3.5% of all organic substances registered in the NOC, these substances are involved in 35% of all reactions and give rise to 60% of all registered substances. 22or illustrative purposes a section of a network of roughly 1 million reactions centred around (3R)-3-isopropyl-6-methyl-7oxabicyclo[4.1.0]heptane,as downloaded from Reaxys 3 during an iterative search, and assembled using Python 2.7 and the toolbox graph-tool 23 by us, is shown in Fig. 2.
Using the network representation of the chemical reactions data allowed the automatic identification of reaction sequences. 21By making explicit use of the existing data it is possible to plan synthesis routes, optimise parallel synthesis routes, identify one-pot conversions or identify purchasing patterns of precursors to controlled substances. 19,20,24In addition to algorithmic connection of molecules, it was possible to deduce from the network structure information about the reactivity of functional groups and how the functional groups present in a molecule influence each other's reactivity with the results matching theory very well. 25The NOC could also be employed to identify "maximally useful" compounds in a man-Fig. 2 A section of a network of organic chemistry.Dots are species and arrows represent reactions.
ufacturing context. 22More recently the network analysis of the dataset of 10 million reactions was combined with retrosynthetic approaches in Chematica. 2ombination of data mining with retrosynthesis has seen some exciting recent developments. 26Bøgevig et al. report a commercial implementation of this link in a suite from InfoChem by taking a database of 4.4 million reactions and abstracting it into retrosynthetic transforms.During the retrosynthetic analysis it is able to identify papers carrying out similar transformations and use this information in synthesis planning. 27Since September 2015 Wiley offers a commercial platform in the form of ChemPlanner, which includes explicit use of the reported reactions and the retrosynthetic approaches to predict new reactions. 28,29ChemPlanner at present comprises 2 million reactions. 30For a more detailed description of the methodology employed by ARChem, the predecessor of ChemPlanner, the reader is referred to ref. 31.
1.5 million new compounds are claimed to be discovered annually 32 and today Reaxys already contains in excess of 74.9 million compounds (27 million compounds from Reaxys itself and the remainder from integrated databases such as PubChem), 40.7 million reactions and 500 million published experimental facts, 33 yielding an enormous source of data for analysis.To date, the use of datamining in combination with predictive and heuristic tools in chemistry is at its infancy and very few tools are available.However, such tools are being rapidly developed and advances in data science, automation of experiments, wide adoption of electronic lab-books and development of machine readable formats of chemical data exchange, which are all taking place at present, will undoubtedly make enormous impact on chemical process development.
A missing direction of research is the link of network tools and synthesis planning tools with their evaluation at the early stages of process development.A network search or forward reaction planning will always return a number of possible pathways, raising the question how the paths compare under multi-criteria considerations, including environmental and H&S factors.The approach that we present in this paper attempts to combine the use of chemical data available in databases, such as Reaxys, with automated evaluation of synthesis routes, thus combining the two tasks of industrial process development.
As a case study a hypothetical synthesis route from limonene to benzoic acid will be studied.5][36] Two major sources of natural terpenes are limonene as a byproduct of the citrus industry and crude turpentine from tree resin, which can in turn be converted into limonene, with an annual production of 70 000 and 350 000 tons, respectively, 36,37 making it a reasonably abundant feedstock.Though the conversion of limonene to benzoic acid may appear to be destroying economic value due to their relative costs, if limonene is derived from waste and benzoic acid is in turn converted into higher value products, the cost basis would change.The considered route is a hypothetical example of a bio-based route selection and was used to avoid any commercial sensitivity of using current, industrially relevant substances.

Multi-criteria decision making
An implementation of the PROMETHEE methodology (Preference Ranking Organization Method for Enrichment Evaluations) was chosen in the form of the Visual PROMETHEE software suite to perform multi-criteria analysis.Following expert interviews and evaluation of the conflicting criteria, the following weightings were chosen for the specific case of pharmaceutical and speciality chemicals manufacturing: reliability of technology 0.35, exergetic efficiency 0.25, redox economy 0.10, and mass-based environmental indicator 0.15.We should emphasise that weightings will change between different sectors of chemical industries.

Evaluation of exergy
As reference environment in this study the annual average temperature at the Teesside refinery complex in Stockton-on-Tees, UK, was taken, which is 9.1 °C, 38 and 1 atmosphere of pressure.The kinetic and potential exergy were ignored due to lack of data on the eventual process layout.Thus, the exergy of a species i will be given by their chemical and physical exergies.The chemical exergy of species i, Ex i,ch is given by eqn (1).
where n i represents the number of moles, or molar flow rate if flow conditions are used, of species i and ex °i is the standard chemical exergy of species i on a molar basis; subscript 0 represents the reference environment conditions; x stands for a mole fraction, T is temperature and R the universal gas constant.
The standard chemical exergy of species i is given by the following equation: 39,40 where ΔG °f is the Gibb's free energy of formation of a compound i and ν j the number of atoms of the constitute element j, each with a standard chemical exergy of ε j .
The physical exergy on the other hand is given by the difference in enthalpy and entropy of the stream at its conditions relative to that at the environmental conditions. 40 i;ph ¼ ½H i ðT; PÞ À H i ðT 0 ; P 0 Þ À T 0 ½S i ðT; PÞ À S i ðT 0 ; P 0 Þ ð3Þ More detailed mathematical descriptions can be found in the ESI.†

Calculation of process heating
We assumed that the reactants for each step are introduced separately at ambient conditions.This assumption is obviously false for a highly integrated chemical plant.However, it is reasonable for the case of batch processes as might be encountered in the pharmaceutical or fine chemicals industries.For simplicity we assumed that any heat is provided to the process via an electric resistance heater and that heat capacities of all mixtures are constant throughout.Though using electricity for heating purposes is less efficient than burning liquid or gaseous fuels directly, 41 it is considered an adequate baseline for the purpose of this study.Crucially, changing the fuel source for process heating for the purposes of this model is thus facile.This leads to the following expression for exergy input due to heating (the derivation can be found in the ESI †): where Q is the heat supplied.

Separation energy
Comparing the cost of separation across different processes in monetary terms can be challenging due to the fact that very different processes can be required for different separations.
Instead the thermodynamic limit is used as a proxy to quantify the effort due to the required separation.Thus, Gibb's energy change of mixing was considered a proxy to the cost of separation of a reaction mixture.The Gibb's energy change of mixing is the difference in the Gibb's energy of a mixture compared to that of its constituents as pure components (as given by eqn ( 5) 42 ) and can thus be used as a measure of the absolute minimum energy required to separate a mixture.
where G i is the partial molar Gibb's energy of species i.For a binary system of ideal gases consisting of species a with a concentration of y a and species b with a concentration of y b eqn (5) can be expanded to the following form: 42 The Gibb's free energy of mixing of the gaseous streams was found using eqn (6) where y a is the mole fraction of gas a in the mixture and y b that of gas b, respectively, n the number of moles of gas present, T is temperature and R the ideal gas constant.For the liquid streams activity coefficients of each species in the mixture (G ˉres i ), as well as, as pure compounds (G ˉpure i ), were calculated using COSMOthermX Version C30_1401.The Gibb's free energy change of mixing for a given species i was then found using eqn (7) where x i is the mole fraction of species i in the mixture:

Exergetic efficiency
Exergetic efficiency of a process is the ratio of useful exergy output to exergy input as illustrated in Fig. 3.When defined as above, solvents and unreacted starting material will appear in the process output without actually having been "produced" by the process, thus artificially inflating the efficiency. 43Instead a more useful approach would be to only consider the produced, utilisable exergy in relation to the consumed exergy, thus excluding the transiting exergy associated to the parts of the mixture not taking part in the reaction.Thus, the transiting exergies were excluded from the efficiency calculation.We also assumed that catalysts do not deactivate and thus could be recycled allowing for their exclusion from the analysis.Further details can be found in ref. 43 and the ESI.† Based on this description, a process step schematic could be represented as shown in Fig. 4. The streams entering the control volume are the reaction mixture input stream, the heat provided by the heat exchanger, the heat required by the reactor (labelled "1" in Fig. 4) in the case of an endothermic reaction and the energy required by the separation unit (labelled "2" in Fig. 4) corresponding to the Gibb's energy of mixing.The streams exiting the control volume are the exergy stream of the product(s), the exergy stream of the unreacted reactants and solvents (which could be recycled) and the exergy stream of the waste species.Then, exergetic efficiency can be described by eqn (8).The overall efficiency of a route, consisting of j steps is the product of the efficiency of each of the individual steps as shown in eqn (9).

Mass-based indicators
We use the E-factor as defined by Andraos, but re-derive the expression to account for reaction stoichiometry, thus extending the applicability of the approach.The E-factor was initially proposed by Sheldon in 1992 44 and reviewed in 2007. 45In contrast to Sheldon's ratio of mass of waste to mass of product, Andraos derived an equivalent methodology to calculate the E-factor, along an entire synthesis route, largely based on yields, atom economy and stoichiometric factors considering all species, including water, and thus making automation and computation more facile as explicit mass balances are not required.The original approach and derivations of Andraos could not be used in the automated route evaluation.In the earlier derivation atom economy, AE j , was defined on the basis of molecular weights.This does not allow inclusion of the actual stoichiometry into the E-factor calculation and was changed in this work.Re-deriving Andraos's equations to account for reaction stoichiometry yields the following expression (details of the derivation can be found in the ESI †): where E is the E-factor, MR p the molecular weight of the product, ε is the yield with respect to the limiting reactant.The subscripts j and n relate to a reaction step number j in the synthesis route and the final step, respectively, where the sequence of steps is (1,…, j, …, n); tion yields along the reaction route from the current step to the final step, ignoring any steps carried out prior to the current step, c is the mass of catalyst, s is the mass of solvent, ω is the mass of all other materials used in work-up and purification and n mr is the number of moles of the limiting reagent in step j.SF is the stoichiometric factor, i.e. the ratio of the mass of excess reagent to stoichiometric reagent plus one, ν mr j is the stoichiometric coefficient of limiting reagent in step j and ν p is the stoichiometric coefficient of the desired product in step j.The atom economy is defined as: where SW r, j is the sum across the products of stoichiometric coefficients and molecular weights for all reagents in step j and yield is defined as: where n P is the number of moles of product produced and n mr the number of moles of the main reagent fed.The derivation assumed that the feed contains only reactant, reagents and auxiliaries, i.e. no products or byproducts.The derivation also assumes that the product of the previous reaction step is the limiting reactant in the current step and uses it to normalise quantities.Care must be taken if this is not the case.The calculation of the E-factor was automated by implementation in a Python script.The E-factor as defined here equals to the commonly used Process Mass Efficiency minus 1. 6 However, this definition is an easily automatable mass metric, whose evaluation requires the commonly reported data only.
During the network traversal, described subsequently, a list of all routes connecting limonene to benzoic acid was generated.The list of the reactions comprising the synthesis routes for further analysis were written to a file.Similarly, the reaction conditions and data for each reaction in this list were written to another file.The Python script imports these data, computes the required parameters for each individual reaction and then proceeds to combine them for the given synthesis routes in line with eqn (10).Subsequently, E total for each route is written to a file (along with further data on each route) to allow comparison.

Heuristics
In addition to calculation of the exergetic efficiency and E-factor, a number of heuristics were extracted from literature and expert knowledge, which were used to calculate a score for different criteria in the route.The performance of a synthesis route under environment, health and safety considerations is a vital factor when assessing the sustainability of a route.Solvents are of crucial importance in much of the pharmaceutical and fine chemical industry having a very large impact on the sustainability balance of a process, 4,5 but receive no explicit treatment in the methodology so far.The "CHEM21 selection guide of classical-and less classicalsolvents" 46 approximates solvents' performance under health, safety and environmental criteria through use of physical data and, where available, their hazard clauses under the Global Harmonised System and is used in this study to evaluate the solvents used.Seeing as solvents carry some of the greatest toxicological, safety and sustainability concerns as far as pharmaceutical processes are concerned 4,5 this is a good proxy, stopping short of a full LCA.Some treatment of how reliable a published reaction is, is desirable.This was achieved by using the number of publications reporting a given reaction as a proxy, following the argument that a reaction reported more frequently is better established and more reliable.This is introduced specifically for process development, rather than discovery.
Another environmental parameter introduced in the evaluation is redox economy.The term redox economy appears to first have been coined by Richter, 47 though some of the thinking can be traced back to the works of Hendrickson in 1971 and 1975, 48,49 and is being discussed in more detail by Burns. 50One of the key principles of redox economy is that unnecessary changes of the oxidation state may lead to significant environmental impacts. 7,50,51To track changes in the oxidation state the set of atoms involved in the target bondforming reactions across a route is determined.The oxidation state of each of those atoms is then calculated at each reaction step as described in ref. 9 and then the oxidation state of all such atoms at a given reaction step is summed (Ox i ) and the difference ΔOx i = Ox i − Ox target is found.Should both ΔOx i and ΔOx i−1 be positive (or both negative) and |ΔOx i | > |ΔOx i−1 | a penalty for step i will be applied according to eqn (13) as an oxidation or reduction that needs to subsequently be corrected has taken place.If, however both ΔOx i and ΔOx i−1 are of opposite signs an overshoot has taken place and the penalty will be given by eqn (14).The overall penalty for a given route will be given by eqn (15).
Penalty ¼ Again, every part of this methodology can easily be implemented in a fully automated fashion.Calculation of the oxidation state changes requires some atom mapping, however, tools for this exist.

Network traversal
A network of reactions was constructed based on the reactions contained in Reaxys.To this end all reactions starting from limonene were downloaded.All product species from these reactions were in turn individually queried to obtain all reactions starting from each of these species.This was repeated to obtain data containing three reaction steps as it was known that the desired conversion could be carried out in three steps.This was written to a file, incomplete reactions were deleted and the remaining data was then used to construct a network using "the graph-tool python library" 23 in Python 2.7.
It was decided to remove acetic acid, formaldehyde, formic acid and isoprene from the network.Due to the architecture of the network any routes via these molecules would mean that limonene would have been decomposed into one of these substances to use them as building blocks to obtain benzoic acid.This destruction of functionality and thus any synthesis routes via these molecules as main reactants was deemed undesirable leading to their exclusion.Using a k-shortest paths algorithm implemented in graph-tool all routes connecting limonene to benzoic acid were found up to a maximum path length of three synthesis steps.

Property prediction
Heat capacities.Liquid heat capacities were predicted using the Chueh-Swanson group contribution method which predicts molar liquid heat capacities at 293 K. 64 The method is reported to be accurate for most conditions 65,66 and has errors mostly ranging between 2-3%, rarely exceeding 5%. 64Gas heat capacities were predicted using DFT with a B3LYP functional and using the 6-31G(d) basic set in Gaussian09.
Gibb's free energy of formation.The Gibb's free energy of formation was calculated using the Joback group contribution method. 67Reid et al. states that the accuracy of the method is within 10 kJ mol −1 of the literature value though cautions its use for complex materials. 65nthalpy of reaction.Hess' law states that the enthalpy change associated with a reaction at standard conditions is equal to the difference in enthalpies of formation of the products, at standard conditions, and that of the reactants, at standard conditions, as shown in eqn (16). 68This was used to calculate enthalpies of reaction.
where ΔH °f denotes the standard enthalpy of formation of a given compound.Standard chemical exergies.Several values of standard chemical exergies had to be calculated.Using eqn (2) this was possible given knowledge of the Gibb's free energy of formation of a given compound which was either given in literature, or could be calculated from the Joback method.The standard chemical exergies of the atomic species constituting the compounds, ε H 2 , ε O 2 , and ε C , are those of H 2 , O 2 and C, which, according to Morris and Szargut 69 are 410.26kJ mol −1 , 236.09 kJ mol −1 and 3.97 kJ mol −1 , respectively, based on their relevant atmospheric reference species.
Gibb's free energy of mixing.The Gibb's free energy of mixing was calculated as outlined under "Separation Costs".The actual values calculated can be found in the ESI in Tables S1-13.†

Network search
The network traversal algorithm returned a total of 228 unique paths.This set contained four two-step syntheses connecting limonene to benzoic acid via 4-ethenylcyclohexene, cumol, maleic anhydride or methyl 4-methylphenyl ketone.At this stage of analysis it is not claimed that these are ideal, or even good, routes, but merely that Reaxys contains a synthesis path involving these molecules.The remaining 224 paths of the result set required three synthesis steps.In order to give the reader an idea of the path taken these paths were classified according to the product of the first step.In total there were 36 different reactions that could be carried out during the first step.The six most frequently encountered options for different products of the first step, together accounting for over 75% of possible routes, and the number of routes (of the 224 remaining) that use this step can be found in Table 1.
Further analysis of this set revealed that many papers were either unavailable online or their reporting of data was insufficient for the desired level of analysis.Thus, two routes for which the required data could be reconstructed, briefly illustrated in Scheme 1 with sources of the reaction routes given, were chosen for proof-of-concept: the first route led via p-cymene and toluene to benzoic acid, while the second route utilised p-cymene and p-toluic acid as intermediates.
A further issue that was encountered was the fact that most of the selected papers did not report balanced equations for the reactions, making reconstruction of the reaction systems very difficult when also only the yield of the main product was reported.An attempt was made at balancing the equations purely by using atom balances.It is realised that these may not be the actual equations and are not necessarily based on actual reaction mechanisms but they yield a solvable system, which for this case study is deemed sufficient.The stoichiometric coefficients were then normalised with respect to the main reactant in order to simplify later calculations to give Scheme 2.

Exergetic efficiency
The in and out flows of each species for each reaction were calculated and, using eqn (1) and (3), it was possible to calculate the exergy associated to each species entering and exiting the system for each reaction.Based on these results the exergetic efficiencies were calculated according to eqn (8).The heating duties for endothermic processes and for stream heating were converted into exergies using eqn (4).The exergetic input required to supply the free energy of mixing was taken to be equivalent to the energy of mixing.The results of the exergy calculations, as well as efficiencies, for each reaction can be found in the ESI in Table S15.† Building on this, the full results for all possible permutations of reaction sequences can be seen in Table 2.
From Table 2 it becomes apparent that the most efficient route would be the sequence of reactions 1.3, 2.2. and 3.2, in that order.In the second place lies the sequence 1.2, 2.2, 3.2 and in the third place 1.3, 2.2, 3.1.This is in contrast to ranking based on the overall yield, where the first two positions are the same but where the third-best option would be Scheme 2 A set of balanced equations used in the case study.
1.3, 2.1, 3.2 as can be seen clearly in the comparison in Table 2.This is caused by the fact that reaction 2.1 both uses a significant amount of hydrogen as carrier gas as well produces a broad range of byproducts, despite a reasonable selectivity towards toluene, which is penalised in the exergy analysis, while reaction 2.2 is carried out without solvent and produces only one byproduct, reducing the separation penalty greatly.

E-Factor
The E-factors for the given reactions were calculated using the developed Python code, based on eqn (10), and ranked in the order of increasing E-factors, i.e. in the order of decreasing environmental efficiency.To ensure the method's accuracy, all calculations were also checked manually.The results can be found in Table 3. Industrial E-factors range between 5 and 50 for the fine chemicals industry across usually 3-4 synthesis steps, while the pharmaceutical industry can reach E-factors exceeding 100 over 6+ steps 70 putting most of the computed E-factors into the range.

Heuristics
Analysing the associated records for each of the reactions stored in Reaxys some variability can be detected but, crucially, the variation in records for the alternative reactions for a given step is very low.Given the fact that a very limited set of reactions was considered carrying out very similar chemistry this was to be expected.The number of records for each reaction can be found in Table 4.
It was decided that a step having 1-4 associated records would be given a score of 3. A reaction having 5-25 associated records would be given a 2, while any reaction having in excess of 25 associated records would be given a 1.Any route involving a reaction with a score of 3 will be assigned a score of 3, increased by 1 for each additional step scoring 3. Any route involving more than two steps with a 2, and no 3s, will be scored with 2, while any route involving only reactions scoring 1 (and at most one 2)) will be assigned a 1 as overall score.
Only four reactions in all the analysed routes use solvents.Their rating was determined according to the "CHEM21 selection guide of classical-and less classical-solvents".For the purpose of MCDM the rating was converted to a numeric value as follows: "Recommended" = 1, "Problematic" = 2, "Hazardous" = 3, "Highly hazardous" = 4. Additionally, a   Table 5 The score assigned to the solvents used in a route according to ref. 46.0 = no solvent, 1 = "recommended", 2 = "problematic", 3 = "hazardous", 4 = "highly hazardous" Reaction Solvent score score of 0 was given to any reaction that did not use a solvent.If a reaction used more than one solvent the worst solvent score was used for the overall reaction.The results are shown in Table 5.Any route involving more than one step with a score greater than or equal to 3 will have its overall score increased by 1 for each additional step exceeding this threshold.

Multi-criteria decision making
The MCDM was carried out using the preference function parameterisation given in Table 6.PROMETHEE generates three rankings.Firstly, there are the partial rankings according to Phi+ (the positive flow) and Phi− (the negative flow).The sum of the two yields the complete ranking.If a given option ranks more preferably under several criteria but another more preferable on the remaining it is possible for the two to have different positions in the Phi− and Phi+ rankings (which is not apparent from the Phi score).According to the results shown in Table 7 there is a clearly preferred route option: 1.2, 2.2, 3.2.It is also possible to unambiguously determine the four worst options.The ranking of the remainder of the field is somewhat complicated by the fact that when ranking according to the Phi+ and Phi− scores the ordering of the options changes.It is however possible to isolate a field of five choices that are clearly preferable over the remainder as is shown in Table 7.The top entry remains top under both rankings.The following three options are tied for the second place and together with the fifth option occupy places 2-5 under both rankings and thus clearly outperform the non-highlighted block.The bottom four options represent the worst options; their relative position remains unchanged irrespective of the ranking method.It can therefore be concluded that the presented methodology allows differentiation of the different route options investigated and yields a clear favourite followed by a number of equally good alternatives as far as this proofof-concept study is concerned.The redox economy penalty of the routes via toluene is 1/3, while that of those via p-toluic acid is 2.
It is possible to carry out a sensitivity analysis and determine the range of values for each of the criterion weightings for which the order of a given number of the top ranked choices remains unchanged according to their complete ranking score.PROMETHEE calls this 'stability interval'.In this case stability intervals according to the top five choices were calculated.Considering the top five ranks the stability interval with respect to exergetic efficiency is 0.2254-0.3142and with respect to the E-factor is 0.1138-0.1681.For the solvent score the stability interval lies between 0 and 1 and that for the number of records between 0.2808 and 0.4109 while that of the oxidation length spans from 0 to 1.The results are thus independent of the solvent score and the oxidation length.This was to be expected in this case as all but the worst three choices have the same score and thus the interroute variation is very low rendering these two criteria potentially redundant in terms of their impact on the final result, though it must be emphasised that they do offer further insights into the underlying problems with some of the routes.This problem is highly specific to this case study.Within a typical process chemistry setting a large number of solvents would be encountered across a number of routes and thus it is expected that the differentiability would be greater in a network with more reactions and a greater number of different solvents, resulting in a greater impact of the solvent score on the final ranking.As expected the results are most sensitive to the weighting of the exergetic efficiency and E-factor, though the stability interval is deemed large enough not to be of concern.This is caused by the fact that due to the different assumptions within the two assessment criteria with regards to solvents, catalysts and separation costs they prefer different options, creating at times conflicting rankings.As a consequence, it is very important to pay close attention to the choice of weightings for the different parameters as their impact on the final outcome is pronounced and the sensitivity of the ultimate result to the relative difference in weightings can be nonnegligible.Uncertainty of the scores and its impact on rankings must be considered.Uncertainty is associated with measurement errors reported or not in the original papers as well as with the accuracy of the estimated or literature properties data.Where experimental errors have not been reported the comparison of different process route options is complicated by the uncertainty in the evaluated metrics.In the ESI † a plot of the exergetic efficiency and E-factors for each route option can be seen in Fig. S1 and S2, † showing uncertainties derived from the property prediction methods and estimates of the uncertainty on experimental data.The mean values were taken forward for MCDM, whereas the overlap of uncertainty intervals provides further information for the final decision making.
Comparing the results obtained using the MCDM methodology to the performance of individual criteria taken in isolation some differences can be observed.One commonly used, much easier to calculate, criterion to assess the performance of a route is the overall yield.Taking this metric compared to the ranking based on exergetic efficiencies the top two performing routes are identical, however, thereafter deviations between the two criteria can be seen.This is caused by the fact that the exergetic efficiency penalises impure product mixtures and the changing of temperature and pressure of the reaction mixture.Comparing performance of yield and E-factor as key metrics the deviations are less well pronounced.However, in this case too, the impact of the yield is soon overridden by differences in the amounts of waste produced.Comparing the top-scoring route using solely the E-factor or solely the exergetic efficiency, 1.3 2.2 3.2 (c.f.Tables 2 and 3), to the MCDM results one notes that due to the inclusion of further decision criteria the route 1.3 2.2 3.2 now only appears in the fifth place.Combining the metrics in the MCDM methodology thus provides a more balanced picture and, as expected, the preferred route under the MCDM approach is different to the preferred route using the more conventional individual indicator metrics, both when comparing it to yield, E-factor or exergetic efficiency taken in isolation.The very high weighting of the industrial reliability plays an important factor in the specific case evaluated.The methodology can be easily adapted to the needs of different scenarios to an extent that the use of a single metric would not be able to, making it an approach yielding not only greater insight but also being more versatile than conventional evaluation methodologies.
A necessary fact that needs to be born in mind when carrying out any assessment of the sustainability of a process is that of system boundaries 71 as any metric only focused on the process at hand can be skewed, and outsourcing seemingly encouraged, if sustainability of materials purchased is not considered. 11,70Due to the nature of the database used this, at present, is impossible and thus the system boundaries start at the factory gate, not the cradle.Seeing as the routes compared in this paper necessarily start from the same feedstock the impact of this might be slightly less pronounced but it is a factor to be born in mind when applying the methodology in other circumstances and future development of the methodology will be directed towards life cycle assessment approach.

Performance of algorithm
The automated methodology can be split into several distinct algorithms/tools, mapped onto Fig. 1.The step of data retrieval depends on how access to data is arranged.In the present study a network access to a server was used to download reaction data.Due to the very large size of the network considered, downloading the data and assembly of the network benefits from parallel computing.Running in the order of low tens of parallel processes, the time for this step ranges in the order of a few days.Once the network has been obtained all further analyses take significantly less time and depending on how broadly the network has been defined it can be used for multiple case studies.The network search takes a few minutes to a few hours on a normal desktop machine (albeit requiring a few GBs of RAM).Considering the computational intensity of the analysis carried out, it can be observed that once automated, each of the analysis steps takes in the order of seconds to carry out on a normal desktop machine for the entire set of route options.Setting up the MCDM parametrisation as well as carrying out data curation where still required.These steps add a manual overhead.In addition to the knowledge of the amounts of all substances being fed, yields of all the species and conversion of main reactants, balanced stoichiometric equations need to be available.Calculation of the chemical exergies requires knowledge of the standard chemical exergy of the species or their Gibb's free energy of formation, while calculation of the physical exergy requires temperature and pressure at which the reaction was carried out in addition to the heat capacities for all states encountered in the experiment of all species involved, along with their phase transition enthalpies, if phase transitions are observed.If the analysis is to consider heating then, in addition, knowledge of reaction heat, or alternatively heat of formation of all reaction species, is necessary.Thus, final full automation of the tools described in this paper requires close integration with thermodynamic databases and computational tools.

Data scarcity
Reviewing the calculations carried out, a list of properties required for the calculations can be drawn up.First and foremost it is necessary for the reaction stoichiometry to be reported along with at least two out of the list of conversion (of the main reactant), selectivity (for all products) and yield (for all products).Furthermore, it is required that all reactant and solvent species as well as their molar amounts are reported.
A brief analysis of the Reaxys database illustrates a crucial problem.33 526 757 reactions were downloaded from Reaxys along with some associated data for each reaction entry.This amounts to roughly 80% of all reactions contained in Reaxys 33 and should yield a reasonably reliable sample.Due to the way Reaxys reports data any multistep reactions (17 686 694 reactions) were removed from the analysis set for this section.Additionally, any reactions that were incomplete, i.e. contained either no reactants or products, were pruned from the set.2.6% of the single step reactions were incomplete.The remaining 15 414 520 reactions were analysed to determine how scarce the entries associated to these reactions were.As can be seen from Table 8 this data was incredibly scarce.On the one hand this is due to ambiguity of the reported data: 'ambient temperature' cannot be easily associated with any specific temperature value and, hence, is not translated into a database entry.On the other hand, this is due to lack of reporting of critically important values, e.g., the yield of all product species is not always clearly reported.These issues could be addressed by the use of clear and accepted data reporting standards.
The list of properties required for the presented analysis is reasonably long and in some cases quite specialised as far as the experimentalist might be concerned, and perhaps perceived as adding significant additional effort to publishing without an immediately visible pay-off.This however overlooks two important points: (i) publishing such "incomplete" papers prevents their maximum use, and (ii) much of the required data is in fact already available, just spread across different sources.
Regarding the first observation: facilitated by ever greater computational resources and growing online repositories, more and more data is being used primarily by algorithms, and not humans, which needs to be born in mind during the publishing process. 72,73Though a chemist is able to make judgement calls and interpret the data presented in a paper a computer is, in most cases, unable to do so."Big Data" is a buzzword that is penetrating ever more disciplines and even though some of the hopes placed into it may ultimately prove unfounded it does bear great potential in allowing the unlocking of insights previously impossible purely by leveraging computational resources and available data.Though the possibility of applying automated evaluation routines, such as exergetic and E-factor efficiency evaluations of a synthesis route, are potentially highly useful tools to the process engineer, this is only possible where complete datasets are available.This is a necessary conclusion borne out of the fact that teaching algorithms how to deal with "missing data" is a highly complicated operation.As a consequence, papers not making all required data available would end up being excluded from the result set as a matter of necessity, reducing the practical importance of the initially reported work and results in everyday practice.The responsibility here is necessarily born jointly by author, publisher and database provider.This discussion is being picked up in a forthcoming paper developing an extension to RInChI standard 74 to include some of the required data and make their publication and future algorithmic use more straightforward.

Conclusions
In this paper we presented the concept of automated multi-criteria evaluation of new process routes combined with the route development on the basis of data mining.We are convinced that chemical process development will soon be largely based on automated experiments and will make significant use of data sciences and algorithmic decision support tools.This work highlights the possibilities and the necessary components of the methodology, which provides multiple numeric criteria for a balanced decision on the suitability of a novel synthesis route; we also highlight current limitations.Here we have shown the first, to our knowledge, example of a combined development and evaluation of synthesis routes on the basis of datamining of existing chemical knowledge.It is a proof-ofconcept demonstration, making use of a hypothetical reaction scenario.At present the actual numerical results are of little significance, whereas the elements of the methodology and the pathway to full automation are important to stimulate further work in this emerging area of chemistry.The emphasis is made on algorithmic tools and multi-criteria decision making, which is particularly important if future chemical processes are considered from the point of view of sustainability.The developed methodology was able to rank the different route options by employing a multi-criteria decision making approach and to identify a preferred option as well as several alternatives.Due to the modularity of the method it is easy to extend the number of criteria considered, to replace some or to readjust the weighting of individual criteria, to match different scenarios or aims, making the methodology highly versatile.The method can be adapted to include results of life cycle assessment instead of individual gate-to-gate indicators.In order to allow wide implementation of the suggested development and evaluation methodology, discussions about the method of publishing reaction data currently being held in the chemoinformatics community need to be picked up by the wider chemistry and chemical engineering community and enforced by publishers to ensure that critical data is electronically available.

Fig. 1
Fig.1The proposed automated workflow based on deep data mining and multi-criteria decision making.

Fig. 3
Fig. 3 Graphical representation of exergy balance within a chemical process (adapted from ref. 43).

Fig. 4
Fig. 4 Schematic representation of a control volume drawn around a basic process step.

Table 1
An overview of 75% of the possible three-step synthesis routes connecting limonene to benzoic acid according to the product of the first synthesis step, ranked in decreasing order of occurrence.The table lists the number of routes via a given product 82heme 1 Schematic of the reaction route.Sources: 1.1,751.2,751.3,762.1,772.2,783.1,793.2, 8015.82

Table 2
Exergetic efficiencies and overall yields across all possible synthesis routes, ranked in the order of decreasing efficiency."Step"denotes which step in the synthesis route is shown in the column.∏η is the exergetic efficiency of the route, normalised to 1. ∏Y is the yield of the route normalised to 1

Table 3
Calculated route E-factors, ranked in the order of decreasing efficiency.The "step" column shows which reaction is being carried out in that step of the synthesis route

Table 4
The number of records stored in Reaxys for a given reaction.Where reactions have been obtained from a source other than Reaxys a value of 1 has been assigned to the record count and marked with *

Table 6
Parameterisation of the PROMETHEE model

Table 7
Route options ranked according to their Phi+ and Phi− score

Table 8
Percent of reactions, taken from a sample of 15 414 520 reactions downloaded from Reaxys, that do not have a value stored in the Reaxys database for the relevant property