CoPriNet: Graph Neural Networks provide accurate and rapid compound price prediction for molecule prioritisation.



Introduction
The drug design process can be thought of as a multiobjective optimization problem in which potential drug compounds need to satisfy a wide set of properties from binding affinity and toxicity to selectivity and solubility 1 .One property that is key when developing potential drug molecules is their availability, since no matter how promising a design might be, if it is not available, it is doomed to fail.In order to estimate compound availability, several computationally calculated synthetic accessibility (SA) scores have been developed.These approaches can be roughly classified as retrosynthesis-based predictions [2][3][4][5][6] , binary classifiers [7][8][9] , and complexity-based estimations 10- 13 .Retrosynthesis-based approaches aim to identify suitable synthetic routes for a given molecule using distinct types of search algorithms over databases of building blocks and chemical transformations.
State-of-the-art methods 2,6,14,15, which are based on deep learning, are able to integrate information from millions of reactions and building blocks, suggesting feasible synthetic routes for the majority of the benchmarked compounds in a matter of seconds to minutes 2. However, their outputs strongly depend on the employed databases 16, they tend to suggest multiple solutions that are difficult to rank 17 and, more importantly, even the fastest are computationally demanding and therefore ill-suited for high-throughput computational pipelines 8. Binary classifiers are machine learning algorithms trained to distinguish between compounds that are easy or difficult to make. Although the available approaches may differ in terms of learning algorithms (support vector machine, neural network, etc.) and compound featurization (descriptors, fingerprints, etc.), it is the definition of the training set, consisting of compounds labelled as easy or difficult to make, that most impacts the behaviour of these methods. Some strategies for compiling training datasets include retrosynthesis 7,8, presence in commercial catalogues 9 or virtually edited compounds 9. Although binary classifiers tend to be much faster than retrosynthesis-based methods, they are also less accurate 9 and their performance is highly dependent on the training dataset 8. Binary classifiers also, by definition, cannot distinguish between different levels of difficulty 8,9. Complexity-based methods try to define an empirical metric under the assumption that complex molecules are more difficult to synthesize 13,18,19. Most methods define complexity as a function of the presence of features deemed to be complex or infrequent, such as chiral centres, uncommon moieties, or unusual molecular fragments. One of the most popular measures of SA 20-24, SAscore 10, is a complexity-based method that uses the rarity of fragments found in PubChem 25 and a set of predefined properties such as the ring
complexity or the number of stereocentres to calculate its score. Another commonly used SA score, SCScore 11, employs an indirect estimation of complexity, assuming that the complexity of the reactants is never larger than the complexity of the products. Due primarily to their simplicity and speed, SAscore and SCScore have been used extensively in drug development pipelines, including for compound screening 20-22,26, dataset preparation 23,24 and molecule generation/optimization 27-30. SAscore is one of the most popular metrics for biasing or discarding potentially infeasible compounds in methods for computational generation of de novo molecules 27,31-34. However, as described above, SAscore and SCScore are simple approximations for SA and, as such, present several limitations. For instance, it is well known that these scores tend to underestimate the SA of difficult compounds that can be synthesized from complex commercially available building blocks 35,36. It has also been shown that structurally similar compounds, which tend to have similar complexity-based scores, can require synthetic strategies of different difficulty levels 35, leading to incorrect SA estimations. Independent of their nature, the aim of all the methods described above is to computationally filter compounds, ruling out those difficult to make or purchase. This suggests the need for an alternative metric that directly estimates the actual measure of compound availability, namely its price. This is, after all, what influences many of the decisions in drug discovery, particularly in the early stages, when the cost of the compounds to be experimentally tested is often of central importance. Current SA metrics exhibit poor correlation with prices: Fukunishi et al. 37 found that the Pearson correlation coefficient (PCC) between their SA measurement and the logarithmic sales prices of the compounds, in $/mmol, was ~0.3. Fernandez et al.
38 observed only a weak correlation between prices and two complexity-based SA scores: the Ring Complexity Index 39 and SAscore. This is perhaps not surprising, since SA scores were never intended to capture price information. Nevertheless, most methods for automatic compound design try to optimize their molecules against an SA metric, which leads to the suggestion of many feasible yet prohibitively expensive compounds. For the hundreds of millions of compounds in commercial catalogues, price estimation is merely a database search question that, in real situations, can be severely delayed by the quotation process. The real challenge, however, is estimating price for the rest of chemical space: the advent of machine learning-based molecular generation techniques and large virtual compound collections makes this problem increasingly acute. Compound cost prediction has previously been addressed using QSAR-like methods 38 or retrosynthesis-based approaches 40. Fernandez et al. 38 developed the QS$R approach, a classical machine learning method aimed at learning the relationship between the structure of compounds and their prices in QSAR-like setups. As a proof-of-concept, the QS$R model was trained on ~4000 pairs of compound descriptors and prices, performing particularly well for compounds in the lower price range. Badowski et al.
40 estimated the cost of a molecule as the cost of the cheapest retrosynthetic route, considering the cost of the initial reactants, reaction yields, and fixed costs. While this formulation captures the different terms involved in the final price, it relies on the assumption that the cheapest retrosynthetic route is the one that determines the final cost, which does not necessarily hold, and on estimations of reaction yields and fixed costs, information that is only available for a limited number of reactions and that, in many cases, is not in the public domain. With the aim of overcoming these problems, in this manuscript we present CoPriNet, a retrosynthesis-free method to obtain price predictions using only the compound itself. Our method is based on a graph neural network (GNN) trained on a dataset of >6M pairs of molecules and their prices collected from the Mcule 41 catalogue (https://mcule.com/) and can be directly employed to assess novel compounds. Our approach follows that of SA binary classifiers trained on retrosynthesis predictions: given enough data, machine learning methods should identify patterns in the input molecules that are relevant to synthetic planning (or price) without the need to explicitly undergo retrosynthetic decomposition. Although retrosynthesis-based computations tend to be more accurate, our predictions exhibit a far stronger correlation with catalogue prices than any SA metric, with comparable running times. Consequently, our approach can be employed as a complementary metric to fast SA estimations for high-throughput assays and, more importantly, for de novo molecule generation, in which the large number of required assessments prevents retrosynthesis-based approaches from being used.

Datasets
Two main sources of compounds were employed in this work. The first is the Mcule catalogue 41, which contains more than 40 million compounds and their up-to-date prices compiled from more than a hundred vendors. In order to avoid common errors that may arise from the integration of different catalogues (misdrawn and incorrect structures), the catalogue is curated using the Mcule Advanced Curation process (MAC), which involves a rigorous molecule registration system including structural checks and various steps of standardization, preparation and correction, ensuring that the information contained in the catalogue is highly reliable. From this catalogue we extracted the subset of in-stock compounds (~7M), which was divided randomly into train, validation, and test partitions. The price of each compound was collected in March 2021 from the Mcule database as the best price for 1 g of compound. Prices were then converted from $/g to $/mmol because, as found by Fukunishi et al. 37, correlations with SA measurements were stronger. All statistics and figures included in this work are derived from the compounds in the test set except when explicitly stated. The test set is also referred to as the purchasable compounds dataset (PC) throughout this manuscript, as it only contains purchasable compounds. For the generalizability study, a random subset of 100K virtual compounds was also extracted from the Mcule catalogue as a separate independent test set. The second source of compounds was the ZINC database 42, from which we extracted a subset comprising only non-commercially available natural products, which we refer to as the NPNP (Non-Purchasable Natural Products) dataset. We use this dataset as an approximate set of non-synthesizable compounds. We also employed two of the datasets compiled by Gao and Coley 35. These were their dataset of molecules compiled from five different sources: MOSES 43, ZINC 42, ChEMBL 44, Sheridan et al.
45, and GDB-17 46; and their dataset of de novo generated molecules, comprising two subsets of molecules optimized against multiple properties, with and without the SAscore.
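The $/g to $/mmol conversion used above depends only on the molecular weight of each compound. A minimal sketch of the arithmetic (the function name is ours, not from the CoPriNet code base; in practice the molecular weight would come from RDKit):

```python
def price_per_mmol(price_per_g: float, mol_weight_g_per_mol: float) -> float:
    """Convert a catalogue price from $/g to $/mmol.

    One mmol of a compound weighs mol_weight / 1000 grams, so the
    price of 1 mmol is price_per_g * mol_weight / 1000.
    """
    return price_per_g * mol_weight_g_per_mol / 1000.0

# Example: a compound with MW 180.16 g/mol offered at 2 $/g
print(price_per_mmol(2.0, 180.16))  # ~0.36 $/mmol
```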

Synthetic accessibility calculations
Four distinct SA metrics were employed in this work: SAscore 10, SCScore 11, the AstraZeneca RAscore 8 and SYBA 9. All of them were executed using default parameters. Additionally, the retrosynthesis-based score ManifoldSA was computed using the Postera Manifold API v1 (https://api.postera.ai/api/v1/docs/). ManifoldSA summarizes retrosynthesis results into a number between 0 (easy) and 1 (difficult) that estimates the synthetic accessibility of a compound. For Figure 2, compounds were classified as synthesizable if their ManifoldSA < 0.5 and non-synthesizable otherwise. Manifold first performs a tree search to compute possible retrosynthetic routes from the target molecule to purchasable starting materials, using the Molecular Transformer 14,47 to predict the probability of success of each step. The ManifoldSA score is then computed by considering the feasibility and robustness of multiple routes to the molecule, taking into account the probability of success at each step of a route. The Manifold algorithm has been used in synthesis-driven de novo design 48.
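The binarization used for Figure 2 amounts to thresholding the ManifoldSA score at 0.5. As a trivial sketch (the function name is ours; score values are illustrative):

```python
def classify_manifold_sa(score: float, threshold: float = 0.5) -> str:
    """ManifoldSA ranges from 0 (easy) to 1 (difficult); compounds below
    the threshold are labelled synthesizable, as done for Figure 2."""
    return "synthesizable" if score < threshold else "non-synthesizable"

print(classify_manifold_sa(0.12))  # synthesizable
print(classify_manifold_sa(0.81))  # non-synthesizable
```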

Retrosynthesis calculations
Retrosynthesis prediction was carried out using the Postera Manifold API, which implements the Molecular Transformer approach 14,47. We employed the v1 retrosynthesis endpoint using a search depth of four and the Mcule catalogue as the source of building blocks.

Comparison with other methods
Price estimation from retrosynthesis predictions was performed using a simple heuristic as a replacement for the non-publicly available method proposed by Badowski et al. 40. This heuristic considers only the cost of the building blocks, neglecting any additional cost. Since the retrosynthesis results obtained for each compound tend to include several pathways, potentially involving multiple building blocks, we employed two simple strategies. The first assumes ideal route ranking, and thus overestimates the performance (while still ignoring non-reactant costs), by selecting the route that best matched our price records. The second strategy simply reports the minimum-price route and should provide an underestimation of the performance. The QS$R multilayer perceptron model was reimplemented as indicated in Fernandez et al. 38, with the exception of the descriptor calculation, which was performed using RDKit 49. The model was retrained using the same training data as used for CoPriNet.
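The two route-selection strategies can be sketched as follows, assuming each retrosynthetic route is summarized by the total catalogue price of its building blocks (the data structures and function names here are ours, chosen for illustration):

```python
def min_price_route(route_prices):
    """Underestimating strategy: report the cheapest route's
    building-block cost, ignoring all non-reactant costs."""
    return min(route_prices)

def best_match_route(route_prices, recorded_price):
    """Overestimating (ideal-ranking) strategy: pick the route whose
    building-block cost best matches the recorded catalogue price."""
    return min(route_prices, key=lambda p: abs(p - recorded_price))

routes = [12.0, 4.5, 30.0]           # building-block cost ($) per route
print(min_price_route(routes))        # 4.5
print(best_match_route(routes, 11.0)) # 12.0
```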

CoPriNet Graph Neural Network
To create our price prediction GNN we represented compounds as 2D graphs, with atoms corresponding to nodes and bonds to edges. Nodes are encoded using five features: atomic number, valence, formal charge, number of neighbours, and aromaticity. Edges are represented with four features: bond type, aromaticity, conjugation, and ring membership. Our GNN first embeds the node and edge features using a learnable linear transformation, from dimension five and four to 75 and 50, respectively. After that, ten blocks consisting of a PNA layer 50, batch normalization 51 and ReLU 52 activation are applied one after another. Then, an embedding for the graph is obtained by applying a Set2Set layer 53. Finally, two dense layers with batch normalization and ReLU activation and one last linear layer with a single unit are applied to the graph embedding. A schematic of our GNN architecture is depicted in Figure 1. The hyperparameters were selected by cross-validation over the validation dataset, exhibiting robust behaviour; see Supplementary Information Section 9 for more details. Our network was trained using the Adam optimizer 54 with a batch size of 512 graphs. The initial learning rate was set to 10^-5 and decreased by a factor of 0.1 when the validation loss did not improve for 25 epochs. The mean squared error was used as the loss function.
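The input featurization above uses five node and four edge features per atom and bond. A pure-Python sketch of how one molecule could be packed into such feature vectors (a stand-in for the RDKit/PyTorch Geometric pipeline; all names and encodings are ours):

```python
def node_features(atomic_num, valence, formal_charge, n_neighbours, is_aromatic):
    """Five node features, as used by CoPriNet, packed into one vector."""
    return [float(atomic_num), float(valence), float(formal_charge),
            float(n_neighbours), float(is_aromatic)]

def edge_features(bond_type, is_aromatic, is_conjugated, in_ring):
    """Four edge features; bond_type encoded numerically (e.g. 1=single, 2=double)."""
    return [float(bond_type), float(is_aromatic),
            float(is_conjugated), float(in_ring)]

# Ethene (C=C): two carbons joined by a double bond
nodes = [node_features(6, 4, 0, 1, False), node_features(6, 4, 0, 1, False)]
edges = [edge_features(2, False, False, False)]
print(len(nodes[0]), len(edges[0]))  # 5 4
```

In the real model these vectors are then linearly projected to dimensions 75 and 50, respectively, before the PNA blocks.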

Evaluation metrics
The correlation between continuous variables was measured using the absolute value of the Pearson Correlation Coefficient (PCC, Eq. 1) and the Spearman's Rank Correlation Coefficient (SRCC, Eq. 2):

$$PCC = \mathrm{abs}\left(\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\right) \qquad \text{(Eq. 1)}$$

$$SRCC = \mathrm{abs}\left(1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}\right), \qquad d_i = R(x_i) - R(y_i) \qquad \text{(Eq. 2)}$$

where $x_i$ and $y_i$ are the $i$-th observations of variables $x$ and $y$, $\bar{x}$ is the average of variable $x$, $R(x_i)$ is the ranking of the $i$-th observation of variable $x$, and $n$ is the number of observations. Binary classification performance was evaluated using the Matthews Correlation Coefficient (MCC, Eq. 3) at the threshold that maximizes its value:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad \text{(Eq. 3)}$$

where $TP$ is the number of true positive predictions, $TN$ the number of true negatives, $FP$ the number of false positives and $FN$ the number of false negatives.
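The three metrics can be computed directly from their definitions; a minimal self-contained sketch (ties in the Spearman ranking are not handled, and absolute values are taken as in Eqs. 1-2):

```python
import math

def pcc(xs, ys):
    """Absolute Pearson correlation coefficient (Eq. 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (sx * sy))

def srcc(xs, ys):
    """Absolute Spearman rank correlation (Eq. 2): Pearson on the ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0.0] * len(vals)
        for rank, idx in enumerate(order, 1):
            r[idx] = float(rank)
        return r
    return pcc(ranks(xs), ranks(ys))

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient (Eq. 3)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(round(pcc([1, 2, 3, 4], [2, 4, 6, 8]), 3))  # 1.0
print(round(srcc([1, 2, 3], [10, 30, 20]), 3))    # 0.5
print(round(mcc(50, 40, 10, 5), 3))               # 0.716
```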

Synthetic accessibility estimations present limitations
Historically, SA scores have been employed as proxies for compound accessibility, assuming that synthesizability implies availability and ignoring other relevant aspects such as compound cost. In practice, SA scores have been used to classify compounds as synthesizable or non-synthesizable by selecting different thresholds. We sought to test how this strategy performs on two sets of molecules, a dataset of purchasable compounds (PC) and a dataset of non-purchasable natural products (NPNP), using four different SA scores: SAscore 10, SCScore 11, SYBA 9, and RAscore 8 (Figure 2, a-d). As a first approximation, all the PC molecules should be classified as synthetically feasible, and most of them as highly accessible, whereas most of the NPNP compounds should be considered hard to synthesize. As Figure 2 a-e shows, none of the methods, including CoPriNet, which will be discussed later, perfectly separates the two compound sets. However, this is neither surprising nor necessarily undesirable, since not all NPNP compounds are synthetically infeasible, nor are all purchasable compounds easy to make. In order to get a better estimate of the actual number of synthetically feasible NPNP compounds, we computed ManifoldSA, a pure retrosynthesis-based score (see Methods for more details). Retrosynthesis-based approaches tend to be far more accurate than standard fast SA scores and, as such, we can consider them an (imperfect) ground truth when evaluating simple SA scores. For the NPNP dataset, ManifoldSA estimates that ~24% of the compounds are synthesizable, with 4.6% of those being easily synthesizable (see Figure 3), whereas for the PC dataset, ~6% of the compounds were regarded as infeasible despite being commercially available. While these numbers suggest that retrosynthesis predictions are not perfect either, they are more accurate than fast SA scores, so it is informative to consider them as ground truth (Figure 2 f-j). The results obtained in this case, although worse for all methods, are similar to the ones of the previous experiment.
The reliability of the different approaches can be influenced by the dataset used. We therefore also tested their behaviour on the datasets compiled by Gao and Coley 35, which include typically used catalogues of compounds as well as de novo generated molecules for which retrosynthesis predictions were computed (see Methods). Overall, the SAscore and the RAscore better reproduced the retrosynthesis results (see Supplementary Information Section 4), but the different data subsets show quite different results. One case of special interest is the dataset of de novo generated molecules that were optimized against several multiproperty objective functions (see Supplementary Information Figures 9-10). In this case, RAscore performance drops when the properties used to optimize the molecules do not account for SA. These results are in line with what would be expected for a machine learning approach, since the molecules obtained, although biased to replicate catalogue properties, do not necessarily represent viable instances. The results for the PC and NPNP datasets and those from the Gao and Coley datasets suggest that the SAscore, with all its imperfections, is currently the best fast heuristic for retrosynthesis-based SA estimation. However, leaving aside that synthesizability may not be the most useful proxy for compound availability, there are also several reported examples where the SAscore severely underperforms (for visual examples see Supplementary Information Figure 2). Moreover, retrosynthesis-based methods, despite being computationally demanding, are not perfect at identifying synthetically accessible compounds. The high degree of variability, and the fact that the agreement between the different estimations depends on the dataset used, suggest that all methods are far from perfect (see Supplementary Information Figure 11).

Price and synthetic accessibility have a complex relationship
Though synthetic accessibility is an important criterion, molecules to be tested, particularly early in the drug discovery pipeline, are often selected based on price, effective availability and ease of synthesis. Given that, we next examined the relationship between SA metrics and price. We compared the price in the Mcule catalogue for the compounds in the PC dataset to our set of SA scores. All SA metrics had only a weak correlation with price (see Figure 4), with PCC values ranging from 0.16 to 0.35 and SRCC values ranging from 0.16 to 0.41. Even a combination of all scores in the form of a linear regression model still performs poorly when trying to predict the price, with a PCC of 0.46 (see Supplementary Figure 15). These numbers agree with the value of ~0.3 reported by Fukunishi et al. 37 and suggest that the synthetic difficulty of a molecule may have only a small impact on its final cost. Although this conclusion seems counterintuitive, there are many reasons why this might be the case; for example, compounds that are in high demand will benefit from economies of scale, lowering their price regardless of their synthetic accessibility. For the same reason, it is not unusual to find complicated building blocks at low prices in multiple catalogues, which allows the easy synthesis of otherwise difficult compounds. Nevertheless, while cheap compounds comprise both easy and difficult compounds, expensive compounds in general tend to be less synthetically accessible than their cheaper counterparts (see Figure 5).

CoPriNet predicts compound prices using a graph neural network
We hypothesised that a graph neural network (GNN) model should be able to detect patterns in molecules that indirectly reflect the drivers of pricing by automatically combining simple features such as atomic number, aromaticity, or bond type across the different atoms of the molecules. Deep learning models are well suited to identifying complex patterns in raw data provided enough data points are used during training, and GNNs are especially effective for molecular graphs, which are moreover fast to compute. We therefore built a model, CoPriNet, that can predict compound prices using molecular graphs as input. CoPriNet was trained as a regression model against catalogue prices and produces far more accurate price predictions on our test set than any of the considered SA measurements (Figure 4 f): its PCC of 0.77 and SRCC of 0.80 are far higher than the best achieved by any of the other methods.

CoPriNet exhibits generalizability
The performance of all machine learning methods is dominated by their training set 8, so one of the most important questions for CoPriNet is how well it generalises across different compounds. The specific challenge is that we can only obtain prices for the tiny fraction of chemical space that is contained in the Mcule catalogue, and that prices for commercial catalogues are not generally in the public domain. We first tested whether predictions are consistent independent of the random train/validation split. To do so, we trained CoPriNet on three distinct train/validation partitions and found high consistency, with mean PCC and SRCC of 0.73 and 0.74 and a standard deviation of 0.04 for PCC and 0.07 for SRCC. Next, we tested generalisability by analysing a set of non-catalogue compounds, namely the non-purchasable natural products (NPNP), which are both quite different from the training/validation set (see Supplementary Information Figure 1 for dataset comparison) and can reasonably be assumed to be, in general, more expensive than purchasable compounds (PC), as they are much more difficult to synthesize. Figure 2 e shows that CoPriNet tends to predict larger prices for the NPNP compounds than for the compounds of the PC dataset, suggesting that it generalises well beyond its training set. In addition, we also studied CoPriNet performance using as test data the subset of compounds that were not present in the database snapshot used for training (March) but that were included in the next release (June). CoPriNet predictions exhibit a PCC of 0.65 (see

Figure 1 .
Figure 1. Price prediction Graph Neural Network architecture. Molecules are represented as 2D graphs consisting of nodes (blue, grey, and red circles), encoded as node vectors of dimension five (blue, grey and red rectangles), and edges (yellow and green lines), encoded as vectors of dimension four (yellow and green rectangles). The node and edge vectors are projected into higher-dimensionality embeddings (coloured rectangles within the GNN box) using independent learnable weights for the nodes (Linear-nodes) and for the edges (Linear-edges). After that, the node and edge embeddings are processed by ten blocks of PNA layer, batch normalization and ReLU activation, updating the state of the node embeddings after each block. Then, the processed embeddings of all the nodes are combined into one single graph embedding using a Set2Set layer 53. Finally, the graph embedding is processed by two blocks of linear layer, batch normalization and ReLU activation, from which the price prediction is obtained using a linear layer with one single unit.

Figure 2 .
Figure 2. Synthetic Accessibility scores are no better approximations for compound availability than CoPriNet, suggesting that CoPriNet generalizes beyond its training set of purchasable compounds. Left: score distributions for purchasable compounds (PC, blue) and non-purchasable natural products (NPNP, orange) computed with a) SAscore 10, b) SCScore 11, c) SYBA 9, d) RAscore 8, and e) CoPriNet. This tests the premise that NPNP compounds are synthetically less accessible and more expensive than PC compounds. Right: score distributions for PC and NPNP compounds labelled according to their predicted synthesizability (synthesizable, ManifoldSA < 0.5, green; non-synthesizable, ManifoldSA > 0.5, red), computed with f) SAscore 10, g) SCScore 11, h) SYBA 9, i) RAscore 8, and j) CoPriNet. Note that compounds predicted as synthesizable can also have expensive prices. The Matthews correlation coefficient for each score is displayed on top of each subplot.

Figure 3 .
Figure 3. Retrosynthesis-based ManifoldSA scores for the set of Purchasable Compounds (PC, blue) and Non-Purchasable Natural Products (NPNP, orange). PC compounds are expected to be far more synthetically accessible.

Figure 4
Figure 4. Synthetic accessibility scores correlate poorly with compound price, whereas CoPriNet predictions are strongly correlated. a) Histogram of the compound prices of the CoPriNet test set; b-e) density heatmaps of CoPriNet test set compound prices against four different SA scores: SAscore 10, SCScore 11, SYBA 9 and RAscore 8; f) density heatmap of CoPriNet test set compound prices against CoPriNet predictions. Compound prices are displayed as the natural logarithm of catalogue prices. The absolute value of the Spearman's Rank Correlation Coefficient (SRCC) is displayed in parentheses. g) Colour bar for subplots b-f displaying the percentage of the PC dataset in each bucket.

Figure 5 .
Figure 5. Expensive compounds tend to exhibit higher Synthetic Accessibility (SA) scores, but SA metrics are unhelpful for price prediction as they correlate weakly across all price ranges. Distributions of different synthetic accessibility estimations (SAscore 10, SCScore 11, SYBA 9 and RAscore 8) for catalogue compounds of different price ranges. The last price range comprises all compounds with prices above 80 $/mmol.