Automated stopped-flow library synthesis for rapid optimisation and machine learning directed experimentation

For the discovery of new candidate molecules in the pharmaceutical industry, library synthesis is a critical step, in which library size, diversity, and time to synthesise are fundamental. In this work we propose stopped-flow synthesis as an intermediate alternative to traditional batch and flow chemistry approaches, suited for small molecule pharmaceutical discovery. This method exploits the advantages of both techniques, enabling automated experimentation with access to high pressures and temperatures, flexible reaction times, and minimal use of reagents (μmol scale per reaction). In this study, we integrate a stopped-flow reactor into a high-throughput continuous platform designed for the synthesis of combinatorial libraries with at-line reaction analysis. This approach allowed ∼900 reactions to be conducted in an accelerated timeframe (192 hours). The stopped-flow approach used ∼10% of the reactants and solvents compared to a fully continuous approach. This methodology demonstrates a significantly improved synthesis success rate for smaller libraries by simplifying the implementation of cross-reaction optimisation strategies. The experimental datasets were used to train a feed-forward neural network (FFNN) model, providing a framework to guide further experiments, which showed good model predictability and success when tested against an external set with fewer experiments. As a result, this work demonstrates that combining experimental automation with machine learning strategies can deliver optimised analyses and enhanced predictions, enabling more efficient drug discovery investigations across the design, make, test and analyse (DMTA) cycle.


S1. Continuous flow limitations
Despite its benefits, continuous flow chemistry still presents some disadvantages for medicinal chemistry, such as the relatively large synthesis scale: large reagent volumes must be pumped to achieve steady-state conditions [1], with unnecessary environmental and economic costs (e.g. when using hazardous or expensive reagents) [2]. For small volumes (microreactors), dedicated and expensive equipment is required to achieve the low flow rates needed for relatively long residence times [3]. Figure S1 illustrates some of these limitations.

S2. Stopped-flow reactor operation
The reactor used was a 1000 μL coil (0.04 mm diameter) wound around, and in full contact with, an aluminium cylinder block (Figure 1b, 5-8). The reactor inlet was connected to a cross-piece linked to each of the sampling-loop lines, while the reactor outlet was connected to a stainless-steel back-pressure regulator (BPR) equipped with a 750 psi cartridge (Figure 1b, 8). The reactor temperature was controlled externally by a Eurotherm temperature controller connected to a K-type thermocouple, with heat supplied by a pair of cylindrical heating cartridges; all of these elements were embedded in the centre of the aluminium block (Figure 1b, 5). In addition, fast reactor cooling between experiments was achieved with a copper pipe wound around the outside of the aluminium cylinder (in contact with the inner reactor coil) and connected to a cooling water supply, which was triggered automatically at the end of each reaction. Finally, the reactor block was thermally insulated with a cotton jacket covering all external parts.

S3. SNOBFIT Optimisation
The first approach applied was to search for the optimum reaction conditions, i.e. temperature and reaction time, using an automated self-optimisation algorithm. In this case, an iterative single-objective self-optimisation method was programmed (SNOBFIT [37]; Figure S3, a), minimising the ratio of the internal standard to the product (using their respective HPLC UV peak areas) while varying temperature and reaction time. A summary of the results obtained for two amide coupling reactions is discussed (Figure S3, b and e). For these reactions, the self-optimisation method successfully identified the best reaction conditions when the chemical variables were kept constant (Figure S3, c and f). The HPLC-MS data provided an in-depth visualisation of the interactions between species (products and side-products), revealing their influence on the reaction path to the target molecule. These interactions were consistent with the optimum reaction conditions identified by the self-optimisation (Figure S3, d and g).
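As an illustrative sketch (not the code used in this work), the single-objective function minimised by the self-optimisation loop can be written as follows; the peak-area arguments are hypothetical inputs, and the SNOBFIT driver itself (which proposes the next temperature/time point) is not reproduced here.

```python
def objective(area_internal_standard: float, area_product: float) -> float:
    """Function value minimised by the self-optimisation loop:
    the ratio of the internal-standard peak area to the product
    peak area (HPLC DAD, 254 nm). Smaller values mean more product
    relative to the fixed amount of internal standard."""
    if area_product <= 0.0:
        # No detectable product: return a large penalty so the
        # optimiser moves away from these conditions.
        return float("inf")
    return area_internal_standard / area_product

# A condition giving twice as much product area as internal standard
# scores better (lower) than one giving equal areas.
assert objective(100.0, 200.0) < objective(100.0, 100.0)
```

Each iteration evaluates this value from the at-line HPLC trace, and the optimiser proposes new (temperature, reaction time) points until the minimum is located.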
In the library synthesis context, the experimental process driven by the algorithm was a relatively time-consuming task, requiring a large number of experiments: ~45 conditions tested for each reaction, requiring a total of 10 hours (similar to other self-optimisation algorithms reported [28]). An unrestricted number of experiments can be particularly problematic when the search diverges, e.g. when the reaction does not proceed under any circumstances. In addition, the random exploration path differs from reaction to reaction, making comparisons difficult and frustrating the identification and modelling of global underlying trends across the whole library. Finally, the algorithm implementation also required the use of an internal standard to calculate the relative increase of the target molecule, and prior knowledge of the chromatographic retention characteristics of the target product.
Consequently, we found self-optimisation methods more applicable towards the end of the DMTA cycle, at the hit-to-lead optimisation stage, when libraries are small.

Figure S3. Automated self-optimisation reaction sequences driven by the SNOBFIT algorithm (a). Two amide coupling reactions were subjected to the self-optimisation algorithm (b and e), designed to identify the best reaction conditions (temperature and reaction time) for the targeted product molecule. (c) and (f) show the respective contour plots, obtained by minimising the function value (the ratio between the internal standard and product peak areas, calculated from the HPLC DAD 254 nm signal). In both cases, the maximum yield was obtained when the target molecule was competing with the generation of side-products under a strong temperature dependency (d and g), and with minimal effect of the reaction time (all Rt points are plotted).

S4. Function values for optimisation
For the self-optimisation (Figure S3), the ratio between the internal standard and the desired product was calculated using the HPLC signal at 254 nm. For each reaction, the acquired data are presented in Tables S4.1 and S4.2, respectively.

S5.1 Feed-Forward Neural Net Model architecture
The feed-forward neural network (FFNN) consists of 1-3 hidden layers (the number is set by the hyperparameter optimisation); see "def build_keras_model" in the code provided on GitHub (github.com/MolecularAI/HTE_Publication_Avila_et_al). Dropout is applied at each layer.
The output layer uses a sigmoid activation function. 'binary_accuracy' was chosen as the metric to assess performance, as the two classes are balanced. The optimiser is 'rmsprop' (in some initial testing, the Adam optimizer was used but did not lead to significantly different results).
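The architecture described above can be sketched in plain Python as follows. This is an illustrative stand-in, not the Keras implementation in the repository ("def build_keras_model"); in particular, the ReLU hidden activation and the single-output layout are assumptions made for the sketch, and the rmsprop training loop is omitted.

```python
import math
import random

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, output_weights, dropout_rate=0.0, training=False):
    """Forward pass of a small fully connected network with a sigmoid
    output unit. `hidden_weights` is a list of weight matrices, one per
    hidden layer (1-3 layers in the model described); dropout is applied
    after each hidden layer, and only during training."""
    activations = x
    for W in hidden_weights:
        # ReLU hidden units (an assumption for this sketch)
        activations = [max(0.0, sum(w * a for w, a in zip(row, activations)))
                       for row in W]
        if training and dropout_rate > 0.0:
            # Inverted dropout: zero a unit with probability
            # dropout_rate, rescale the survivors.
            activations = [0.0 if random.random() < dropout_rate
                           else a / (1.0 - dropout_rate)
                           for a in activations]
    z = sum(w * a for w, a in zip(output_weights, activations))
    return sigmoid(z)  # predicted probability of a successful reaction

def binary_accuracy(y_true, y_score, threshold=0.5):
    """Metric used to assess the balanced two-class problem:
    fraction of predictions on the correct side of the threshold."""
    hits = sum(int((s >= threshold) == bool(t)) for t, s in zip(y_true, y_score))
    return hits / len(y_true)
```

Because the two classes are balanced, binary accuracy at a 0.5 threshold is a reasonable headline metric; on imbalanced data it would be misleading, which is why the paper also reports ROC AUC and precision.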

S5.4 Cross-validation study results and best model selection
In this section we present the cross-validation study performed on the primary set of 836 experiments. The strategy followed is described in the main manuscript (see Figure 8). This study aims to select the model that best classifies experiments as "successful" or "failed".

Firstly, the average ROC AUC for the hold-out test set is very similar to that for the training sets in the case of the three 60/40% cross-validation models (see Set_1, Set_2, and Set_3 in Figure S5.1 and the "Model_performance.xlsx" table). To a first approximation, this demonstrates that these models do not overfit the training data; much poorer performance on the test sets would indicate that the models fit the training data in a way that cannot generalise to external or new data such as the test set. Secondly, all the 60/40%-based models performed better than their corresponding random models, and better than the 'leave-one-amine-out' models. For the latter, the model performance on the training sets is much higher than on the hold-out test sets. This has important consequences. The results on the 'leave-one-amine-out' models reveal that: (i) the amine structure has a great influence on model quality, as the amine in the test set does not seem to be well predicted from those in the training set; (ii) the held-out amine in Set_AMINE_3 (Table S5.4) is overall not well predicted by the different model feature types; in fact, the average ROC AUC is lower for these models than for the corresponding random models. The presence of the nitro group seems to make the held-out amine in Set_AMINE_3 unique and not well covered by the property profile of the other amines.
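The 'leave-one-amine-out' splitting strategy can be sketched as follows; the record layout and amine labels are hypothetical, and this mirrors (but is not) the code used in this study.

```python
def leave_one_group_out(records, group_key):
    """Yield (held_out_group, train, test) splits: all experiments
    using one amine form the test set, and the remaining experiments
    form the training set. Note that fold sizes differ, because each
    amine contributes a different number of experiments."""
    groups = sorted({r[group_key] for r in records})
    for g in groups:
        test = [r for r in records if r[group_key] == g]
        train = [r for r in records if r[group_key] != g]
        yield g, train, test

# Illustrative records (amine labels are hypothetical):
experiments = [{"amine": "A1", "ok": 1}, {"amine": "A1", "ok": 0},
               {"amine": "A2", "ok": 1}, {"amine": "A3", "ok": 0}]
splits = list(leave_one_group_out(experiments, "amine"))
```

The same generator covers the coupling-agent case by changing `group_key`; scikit-learn's `LeaveOneGroupOut` provides an equivalent off-the-shelf splitter.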
However, it should be stressed that the training and test set sizes differ slightly between the different 'leave-one-amine-out' splits, as each amine contributes a different number of experiments. The ROC AUC gave useful information about the modelling quality, but as all the models seem to perform similarly on this score, another quality measure was used to select the best model to be applied to the temporal test set. This measure is the 'precision' (defined in the main manuscript, Figure 5). This value is strongly correlated with the ROC AUC score, but maximising it could help reduce the number of failed experiments.
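For reference, the ROC AUC used for model comparison equals the probability that a randomly chosen successful reaction receives a higher score than a randomly chosen failed one. A minimal pairwise-ranking sketch (equivalent to the normalised Mann-Whitney U statistic, not the authors' evaluation code):

```python
def roc_auc(y_true, y_score):
    """AUC computed as the fraction of correctly ordered
    (positive, negative) score pairs; ties count as one half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("need both classes to compute ROC AUC")
    ordered = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in positives for n in negatives)
    return ordered / (len(positives) * len(negatives))
```

A random model scores ~0.5 on this measure, which is why the random baselines in "Model_performance.xlsx" serve as the reference point.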
Thorough analysis of the performance results presented in "Model_performance.xlsx" showed that the performance differences between many models are subtle. Reassuringly, the model based only on the reaction conditions as features (Model #1, see Model_performance.xlsx) shows the weakest performance compared to the models incorporating the molecular structures in different ways. We observe that model #11 is among the model set-ups leading to overall high ROC AUC and precision. The fact that its feature set combines different components, such as the reaction fingerprint, the product fingerprint, and physico-chemical properties, led us to believe that it has a better chance of predicting new data well, although this was not clearly shown in the different 'leave-one-amine-out' cases.

S5.5 Model analysis
The first observation that can be made is that the performance in predicting the temporal test set is significantly better than that of the random models, which show erratic behaviour.
For example, the 'precision', which reflects the ability of the model to find the successful reactions (true positives) without including too many failed reactions (false positives), is significantly higher for the model than for the random models, and it is combined with a high 'recall', which measures how well the model retrieves the successful reactions. The recall is higher for the 'Random 1' model, but in that case the accuracy and precision are poor, which indicates that the model overestimates the chance of a productive reaction. However, the performance is lower than that observed during the cross-validation study, indicating that the current modelling could not generalise enough from the training dataset to produce highly accurate temporal test predictions. We find it understandable that a model based on only 5 different amines and 6 acids would not be able to predict any amide coupling with high accuracy. Indeed, quite different levels of performance were observed for the various 'leave-one-amine-out' datasets during the cross-validation study, which can be explained by the following analysis. The classification accuracy for the different amines is displayed in Figure S5.2. It can be observed that, for 2 out of 5 amines, a significant number of the related experiments were not well predicted; the diaryl-amine and the ortho-pyridine amine are associated with failed reactions (Figure S5.2).
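The precision, recall, and accuracy referred to above follow the standard confusion-matrix definitions; a minimal sketch (illustrative, not the evaluation code from the repository):

```python
def confusion_counts(y_true, y_pred):
    """True/false positives and negatives for binary labels (1 = successful)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)  # high precision -> few failed reactions run

def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)  # high recall -> few successes missed
```

A model predicting success everywhere (the 'Random 1' pattern described above) attains recall of 1.0 while its precision collapses to the base success rate, which is exactly the overestimation behaviour noted.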

S5.6 Coupling-agent prediction analysis
The coupling agent plays a major role in the amide coupling reaction. Similarly to the amine analysis discussed above, the classification accuracy for the different coupling agents is depicted in Figure S5.3. As in the amine case, some good trends as well as some key learnings for improvement can be identified. Firstly, there is no coupling agent for which the related experiments are systematically badly predicted by the model. However, failed predictions outnumber good predictions in the case of failed experiments using CA2, meaning the model tended to overestimate the success of the reaction when this coupling agent is employed (see peak "a" in Figure S5.3). The opposite trend is observed for CA4, where the model preferred to predict those experiments as failures and consequently mispredicted 18 experiments (see peak "b" in Figure S5.3). This can be explained by the fact that the dataset is not sufficiently enriched in successful reactions using this coupling agent; in other words, the dataset is somewhat unbalanced, which has a direct consequence on model learning.
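The over- and under-estimation described for CA2 and CA4 can be quantified by splitting each coupling agent's mispredictions into false positives and false negatives. A sketch (field names and labels are hypothetical, not the analysis code used for Figure S5.3):

```python
def mispredictions_by_group(records, key="agent"):
    """Per coupling agent, count false positives (model predicted
    success, reaction failed -> the CA2-style overestimation) and
    false negatives (model predicted failure, reaction succeeded ->
    the CA4-style underestimation)."""
    out = {}
    for r in records:
        counts = out.setdefault(r[key], {"fp": 0, "fn": 0})
        if r["pred"] == 1 and r["true"] == 0:
            counts["fp"] += 1
        elif r["pred"] == 0 and r["true"] == 1:
            counts["fn"] += 1
    return out
```

An excess of one error type for a given agent points to class imbalance in that agent's training examples, as argued above for CA4.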

S5.7 Percentage of conversion analysis
Figure S5.4 below shows the average conversion percentage with respect to the products for two different predicted-score thresholds: 'Score > 0.0' (i.e. no limitation using the score) and 'Score > 0.8', meaning that only the experiments with a score equal to or greater than 0.8 are considered.
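The thresholded averaging can be sketched as follows; the record fields are hypothetical, and this is an illustration of the computation rather than the script behind Figure S5.4.

```python
def mean_conversion(experiments, score_threshold=0.0):
    """Average percentage conversion over the experiments whose
    predicted score meets the threshold. A threshold of 0.0 keeps
    every experiment; 0.8 keeps only the most confidently
    predicted ones."""
    kept = [e["conversion"] for e in experiments
            if e["score"] >= score_threshold]
    if not kept:
        return 0.0  # no experiment passed the filter
    return sum(kept) / len(kept)
```

A higher average conversion at the stricter threshold indicates that the model's score ranks the more productive reactions first, which is the behaviour the figure is meant to probe.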