Deconvoluting low yield from weak potency in direct-to-biology workflows with machine learning

High-throughput, rapid biological evaluation of small molecules is essential to drug discovery and development. Direct-to-biology (D2B), in which compound purification is forgone, has emerged as a viable technique for time-efficient screening, particularly for PROTAC design and biological evaluation. However, one notable limitation is that reactions must be high-yielding to ensure that the desired compound is indeed responsible for the observed biological activity. Herein, we report a machine-learning-based yield-assay deconfounder capable of deconvoluting low yield from low potency to identify false negatives. We validated this approach by identifying promising SARS-CoV-2 main protease inhibitors with nanomolar activity that rivaled the potency observed from the standard D2B workflow. Furthermore, we show how our framework can be used in a broad in silico screen to produce compounds of similar potency to those from a D2B assay.


General Materials and Methods:
Unless otherwise noted, all chemicals and reagents for chemical reactions were purchased at the highest commercial quality and used without further purification. Random Forest and Gaussian Process models were run with default scikit-learn parameters (version 1.0.2). Morgan fingerprints of the amine fragments were created using the default parameters in RDKit (version 2020.09.1).
Code and Data Availability [1]: The associated code can be found at: https://github.com/wjm41/deconvoluting_low_yield
Peptide Bond Coupling General Procedure: The amide library was made by reacting the carboxylic acid under the optimized reaction conditions (2 eq. amine; 2 eq. EDC; 2 eq. HOAt; 5 eq. DIPEA; DMSO; RT; 24 h) with 300 amines (202 aromatic, 49 primary aliphatic, and 49 secondary aliphatic amines). For library production, we used Echo LDV plates and an Echo 555 acoustic dispenser for liquid handling. Plate copies were made after diluting the reaction mixture with 4 μL DMSO. For yield estimation, 1 μL of the diluted library was transferred to an LC/MS-ready 384-well plate, followed by dilution with 20% acetonitrile in water to a final volume of 50 μL. The desired product was identified in 60% of wells.
General Fluorogenic Assay Procedure [2]: Compounds were seeded into assay-ready plates (Greiner 384 low volume, cat. no. 784900) using an Echo 555 acoustic dispenser, and dimethylsulfoxide (DMSO) was back-filled for a uniform concentration in assay plates (DMSO concentration maximum 1%). Screening assays were performed in duplicate at 20 μM and 50 μM. Hits of greater than 50% inhibition at 50 μM were confirmed by dose response assays. Dose response assays were performed in 12-point twofold dilutions, typically beginning at 100 μM. Highly active compounds were repeated in a similar fashion at lower concentrations, beginning at 10 μM or 1 μM.
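The 12-point twofold dose response series described above can be written out explicitly. The following is a minimal sketch; the function name and its arguments are illustrative and not taken from the assay software, and the concentrations are expressed in whatever unit the top concentration is given in.

```python
def dilution_series(top_conc, n_points=12, fold=2.0):
    """Return the concentrations of a serial dilution, highest first.

    The values are in the same unit as `top_conc`.
    """
    return [top_conc / fold**i for i in range(n_points)]


# Typical series: 12 points, twofold steps, starting from a top concentration of 100.
series = dilution_series(100.0)
for point, conc in enumerate(series, start=1):
    print(f"point {point:2d}: {conc:.4g}")
```

Under these assumptions the lowest point of the series is 2^11 = 2048 times more dilute than the highest, which is why highly active compounds need a second series started from a lower top concentration.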
Reagents for Mpro assay were dispensed into the assay plate in 10 μL volumes for a final volume of 20 μL. Final reaction concentrations were 20 mM HEPES pH = 7.3, 1.0 mM TCEP, 50 mM NaCl, 0.01% Tween-20, 10% glycerol, 5 nM Mpro, and 375 nM fluorogenic peptide substrate ([5-FAM]-AVLQSGFR-[Lys(Dabcyl)]-K-amide). Mpro was preincubated for 15 min at room temperature with compound before addition of substrate and a further 30 min incubation. The protease reaction was measured in a BMG Pherastar FS with a 480/520 excitation/emission filter set. Raw data were mapped and normalized to high (Protease with DMSO) and low (No Protease) controls using Genedata Screener software. Normalized data were then uploaded to CDD Vault (Collaborative Drug Discovery). Dose response curves were generated for IC50 using nonlinear regression with the Levenberg-Marquardt algorithm with minimum inhibition = 0% and maximum inhibition = 100%. The assay was calibrated at different enzyme concentrations to confirm linearity and response of protease activity, and buffer components were optimized for the most stable and reproducible assay conditions. Substrate concentration was chosen after titration to minimize saturation of signal in the plate reader while obtaining a satisfactory and robust dynamic range of typically five- to six-fold over the control without enzyme. As a positive control, nirmatrelvir has an IC50 of 2.6 nM under our assay conditions.
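For orientation, the normalization and curve model implied by this procedure can be written as follows. This is a sketch of the standard forms, not the exact expressions used by Genedata Screener or CDD Vault, which are not stated here. S is the raw fluorescence of a well, S_high the mean of the high controls (Protease with DMSO, 0% inhibition), S_low the mean of the low controls (No Protease, 100% inhibition), c the compound concentration, and h a Hill slope that is assumed to be a free fit parameter.

```latex
\begin{align}
\text{Inhibition}(\%) &= 100 \times \frac{S_{\text{high}} - S}{S_{\text{high}} - S_{\text{low}}} \\
I(c) &= \frac{100}{1 + \left( \dfrac{\mathrm{IC}_{50}}{c} \right)^{h}}
\end{align}
```

With the minimum fixed at 0% and the maximum at 100%, the second expression gives I = 50% at c = IC50, so only IC50 and h remain to be estimated by the nonlinear regression.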

Modelling, Training, and Validation Methodology:
The Gaussian Process (GP) and Random Forest (RF) models were trained using a dataset comprising 300 SMILES-inhibition readings from a high-throughput, direct-to-biology assay. The objective was to model inhibition as a regression problem, aiming to minimize the root-mean-square error between the models' predictions and the experimental data by employing an L2 loss function. Given the limited size of the dataset, we adopted a leave-one-out cross-validation approach to achieve a reliable estimation of the models' generalization error. In this method, the machine learning model is trained on all but one data point (i.e., 299 in our case) and then makes a prediction for the excluded data point. This process is iterated for each data point in the dataset, with the results presented in Figure 1.
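The leave-one-out procedure can be illustrated with a short, dependency-free sketch. Everything below is hypothetical: the toy fingerprints, the inhibition values, and in particular the nearest-neighbour stand-in model, which replaces the GP and RF models only to keep the example self-contained.

```python
import math


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def predict_nn(train_x, train_y, query):
    """Stand-in model (assumption): label of the most similar training point."""
    best = max(range(len(train_x)), key=lambda i: tanimoto(train_x[i], query))
    return train_y[best]


def leave_one_out_rmse(xs, ys, predict):
    """Hold out each point in turn, fit on the rest, and pool the squared errors."""
    sq_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        pred = predict(train_x, train_y, xs[i])
        sq_errors.append((pred - ys[i]) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Toy data (hypothetical): fingerprints as sets of on-bit indices, inhibition in %.
fingerprints = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}]
inhibition = [80.0, 70.0, 10.0, 20.0]

loo_rmse = leave_one_out_rmse(fingerprints, inhibition, predict_nn)
print(f"LOO RMSE: {loo_rmse:.2f}")
```

With 300 data points the same loop would fit the model 300 times on 299 points each, which is affordable at this dataset size and uses almost all of the data for every fit.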
Both models leverage Morgan fingerprint features with a radius of 2 and 2048 bits for molecular representation. To identify the optimal hyperparameters for the models, such as the GP kernel bandwidth and the number of RF estimators, we utilized 5-fold cross-validation implemented through the GridSearchCV function in scikit-learn.
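The idea behind a cross-validated grid search can be sketched in plain Python. This is not the GridSearchCV call itself: the fold construction, the Gaussian-kernel weighted-average model (a stand-in for a model with a kernel bandwidth), the one-dimensional toy data, and the candidate grid are all assumptions made for illustration.

```python
import math


def kfold_indices(n, k=5):
    """Split range(n) into k interleaved folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]


def kernel_predict(train_x, train_y, query, bandwidth):
    """Stand-in model (assumption): Gaussian-kernel weighted average of labels."""
    weights = [math.exp(-((x - query) ** 2) / (2 * bandwidth**2)) for x in train_x]
    total = sum(weights)
    if total == 0.0:  # query too far from every training point
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


def cv_rmse(xs, ys, bandwidth, k=5):
    """Mean RMSE over the k held-out folds for one hyperparameter value."""
    fold_scores = []
    for fold in kfold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        errors = [
            (kernel_predict(train_x, train_y, xs[i], bandwidth) - ys[i]) ** 2
            for i in fold
        ]
        fold_scores.append(math.sqrt(sum(errors) / len(errors)))
    return sum(fold_scores) / k


def grid_search(xs, ys, grid, k=5):
    """Score every candidate value and return the one with the lowest CV RMSE."""
    scores = {value: cv_rmse(xs, ys, value, k) for value in grid}
    return min(scores, key=scores.get), scores


# Toy one-dimensional data (hypothetical) and a small candidate grid.
xs = [i * 0.1 for i in range(50)]
ys = [math.sin(x) for x in xs]
grid = [0.05, 0.2, 1.0, 5.0]

best_bw, scores = grid_search(xs, ys, grid)
print("best bandwidth:", best_bw)
```

For a 300-point dataset, 5-fold cross-validation of this kind trains on 240 points and scores on the remaining 60 in each fold, and the hyperparameter value with the lowest mean held-out error is kept.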

Dose Response Curves for Isolated Compounds:
Associated curve is on the left.

Figure S1: Gaussian Process regression results of initial 300 amide modelling. Each dot represents one potential Mpro inhibitor. Dotted diagonal line represents perfect model accuracy. Dot color corresponds to yield.

Figure S2: Random Forest regression results of initial 300 amide modelling. Each dot represents one potential Mpro inhibitor. Dotted diagonal line represents perfect model accuracy. Dot color corresponds to yield.

Figure S4: IC50 values for the top 20 most potent compounds as determined by the "Swiss Cheese" model of the in silico screen. Coupling location is highlighted in blue.

Table S1: Model metrics for each model. Our "Swiss Cheese" model is the mean of the Random Forest and Gaussian Process models.
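A minimal sketch of such an ensemble follows, assuming that "mean" refers to the unweighted average of the two models' per-compound predictions. The function name and the numbers are illustrative only.

```python
def swiss_cheese_prediction(rf_preds, gp_preds):
    """Average Random Forest and Gaussian Process predictions compound by compound."""
    if len(rf_preds) != len(gp_preds):
        raise ValueError("both models must predict the same set of compounds")
    return [(rf + gp) / 2 for rf, gp in zip(rf_preds, gp_preds)]


# Hypothetical predicted inhibition values (%) for three compounds.
rf_preds = [62.0, 15.0, 88.0]
gp_preds = [58.0, 25.0, 92.0]

ensemble = swiss_cheese_prediction(rf_preds, gp_preds)
print(ensemble)
```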

Figure S3: IC50 values for the top 20 direct-to-biology amide hits formed through the shown amines (Ar = isoquinoline). Coupling location is highlighted in blue.