Calibration-free reaction yield quantification by HPLC with a machine-learning model of extinction coefficients

Reaction optimization and characterization depend on reliable measures of reaction yield, often measured by high-performance liquid chromatography (HPLC). Peak areas in HPLC chromatograms are correlated to analyte concentrations by way of calibration standards, typically pure samples of known concentration. Preparing the pure material required for calibration runs can be tedious for low-yielding reactions and technically challenging at small reaction scales. Herein, we present a method to quantify the yield of reactions by HPLC without needing to isolate the product(s) by combining a machine learning model for molar extinction coefficient estimation, and both UV-vis absorption and mass spectra. We demonstrate the method for a variety of reactions important in medicinal and process chemistry, including amide couplings, palladium catalyzed cross-couplings, nucleophilic aromatic substitutions, aminations, and heterocycle syntheses. The reactions were all performed using an automated synthesis and isolation platform. Calibration-free methods such as the presented approach are necessary for such automated platforms to be able to discover, characterize, and optimize reactions automatically.


Collecting training data and building Reaxys 38k model
The data used to construct the more general model is all available via Reaxys, a proprietary chemical database.To access these data and build a model with the same performance as the one discussed in the main text, requires a Reaxys license.The comma separated value (csv) file "reaxys_registry_numbers.csv" contains all of the Reaxys registry numbers required to query these data and the python file "extract_logE_from_xml.py" can be used to extract the chemical structures and appropriate data fields used to train the chemprop model.To query the data, navigate to the Reaxys "Query builder" tab, select the "Identification" drop down menu on the right, and add "Reaxys Registry Number" (near the bottom of the list) to the query.For the query, copy-paste the Reaxys registry numbers from the csv into the "<> Reaxys Registry Number =" search bar.
The query should consist of a list of numbers separated by semicolons (the final entry should not be followed by a semicolon, this will cause query errors).We found that Reaxys occasionally failed on queries of more than 1,000 Reaxys registry numbers, and therefore have broken the input file into 999 molecule chunks, so that you can copy-paste a single column of the csv at a time.It may take several minutes for Reaxys to search the registry numbers, especially if searching more than 1,000 at a time.
After the results are found, export the data as an XML file.To do this, click the "Export" button above the list of results, from the "Choose a format" dropdown select "XML", for the "Range" select "All results", for "Export" select "Choose specific data" then "+ Add datapoints".This will launch a new panel from which you should select the "Spectra" dropdown, then check "UV/VIS Spectroscopy", and then click "Add datapoints >".Now back at the export substances window, make sure that "Include structures" is checked and click "Export".Download the exported data into a common folder, repeating this process for all of the registry numbers in the csv.
The python script "extract_logE_from_xml.py" should now be used to extract the relevant data from the XML files into a new csv file that can be directly used to train chemprop.In addition to some common dependencies, the python script requires RDKit be installed to do some initial screening, formatting, and cleaning of the data (chemprop also requires RDKit, and this script should run cleanly in a chemprop environment).From the terminal, run: python extract_logE_from_xml.py<folder_of_xml_files> <output.csv> The generated output.csv is ready to be used to train the chemprop model.The next section details training the chemprop model, with the "test_set.csv"file corresponding to "output.csv", but with 10% of the data removed to form a validation set.From these data, the breakdown of solvents is 50.7% acetonitrile, 30.7% dichloromethane, 14.5% ethanol, and 4.1% water.

Model training details and hyperparameter optimization
All models were trained using Chemprop version 1.5.The models were trained using the MIT SuperCloud, a Linux environment with Intel Xeon Gold 6248 CPU (40 cores) and a Nvidia Tesla V100 GPU.For reproducibility, the training command line argument was: Chemprop_train --data_path train_set.csv--save_dir model_x\ --num_folds 10 --dataset_type regression --config_path\ hyperopts.json-epochs 200 --number_of_molecules 2 The train_set.csvfile consists of four columns of data, the solute and solvent (each encoded as a SMILES), the log10 of the molar extinction coefficient, and the wavelength of maximum absorption (in nm).The configuration file was generated using the built in chemprop_hyperopt command on a subset of 1,000 randomly selected molecules from the training set.Using a subset of the training data significantly decreased the time required for hyperparameter optimization.We observed that the best validation error occurred at approximately epoch 150, so 200 epochs were chosen to ensure learning completed.The optimized parameters are in the following table.

Yield calculation
The yield is directly proportional to the integral of the absorbance over the specified peak and maximum wavelength (∫   ).The proportionality constant depends on instrument parameters, such as the path length of the detector (ℓ), flow rate of the eluent ( ̇), and injection volume (  ), internal standard true peak area (  ) and measured peak area (∫   257 ), which for 4,4'-di-tertbutylbiphenyl is taken at 257 nm, the initial concentration of the reactant ( 0 ), and the molar extinction coefficient ().
For the method developed in this study, the internal standard (4,4'-di-tert-butylbiphenyl) had a standard concentration of 1.25 mM, resulting in a standard peak area of 0.21 A.U. s/µL (absorbance unit seconds per microliter injected).The flow rate was set to 0.5 mL/min, the injection volume varied depending on the concentration of the reaction, and the initial concentration varied depending on what step in each multistep synthesis was being performed.In general, for each step in the retrosynthesis the concentration of reactants is doubled, starting from 15 µmol per reaction, to account for incomplete yield in early reaction steps.Error bars are calculated by propagating the uncertainty in the predicted extinction coefficient, calculated as the variance in predictions among the ensemble of models.

Liquid handling error analysis
The uncertainty associated with pipetting via the Tecan Evo LiHa 350 µL pipette tips was measured following the procedure in Bessemans et al. 1 A specified amount of DMSO was pipetted into a single well in a 96-well plate, and then the actual amount of DMSO dispensed was measured gravimetrically.This was repeated several times at different volumes to develop a relationship between specified volume and relative pipetting error, shown in figure S1.A power-law was found to best fit the error data, showing that the error increases exponentially at smaller specified volumes, however the power-law is not a perfect fit for the performance of the Tecan Evo LiHa.

Simulated reactions details
The data shown in Figure 3, as well as the molecules those data correspond to, are provided in tabular form below. Table 2. Details of the simulated reactions shown in Figure 3.The molecules are represented as SMILES, with the measured peak area (Figure 3 x-axis), predicted area (Figure 3   The use of ensemble variance of a prediction as a means of determining the confidence in the value of that prediction is supported by the correlation between the variance (reported in figure S2 as standard error) and prediction error (reported as relative error).However, the correlation is not strong, making it difficult to quantify the confidence in the value.

Model scope and outliers
The outliers in this dataset are highlighted in figure S3. 4-(Dimethylamino)benzoic acid was severely overpredicted.Chemprop indicates that the overprediction is from the dimethylaminobenzene part of the molecule (purple), as this moiety is very common in dye molecules.Indocyanine green also caused issues with the prediction of extinction coefficient, likely because it is very strongly absorbing and has a positive charge that can be delocalized over a large, conjugated substructure.Chemprop does not calculate partial charges on atoms, instead placing all the charge on the nitrogen atom indicated in the SMILES (and in figure S3), which leads to poor predictive power as the delocalization of charge is one of the keys to the molecule's optical properties.Luckily, structures like this are not very common in drug molecules (despite indocyanine green being used as a diagnostic stain for measuring blood flow).The prediction of insulin was also quite poor, as the training data excluded molecules with a molecular weight above 800 g/mol.Macromolecules such as proteins can have a huge range of molar extinction coefficients depending on the number of light-absorbing side chains present in the structure and folding of the macromolecular structure.

Analysis of test reactions
The peak associated with camostat, shown in Figure S4, is very broad compared to typical peaks.The synthesis of enalapril was found to result in a substantial amount of overreacted product under some reaction conditions (namely using EDC as coupling reagent as opposed to HATU).
The over reacted product (shown in orange in Figure S5) has a nearly identical absorption spectrum as enalapril (in blue), causing a lower confidence in the resolution of their respective peaks by MCR.In the future, sensor fusion techniques 2 could enable use of MS data with the PDA data and curve resolution to better separate all compounds Normalized ion counts @ +399 Elution time (min) µAU @ 279 nm Figure S5.Reaction scheme (top) and MS chromatograms (bottom) from the synthesis of enalapril showing the target product (blue) and overreacted impurity (orange).The products overlap substantially, but not so much that they could not be resolved by MCR.The performance of MCR was not as robust as for other compounds because the product and impurity have nearly identical absorption spectra.

Reproducing figures 3 and 4 with provided data and HPLC_analysis.py
The discussed method was designed as an integrated part of the automated molecular discovery platform detailed in Koscher et al. 3 As part of the platform, the data generated by the HPLC were automatically analyzed in real time, that is, a reaction was analyzed as soon as the HPLC finished running that sample.To achieve this level of integration, the method as implemented within the platform makes use of a proprietary API provided by Shimadzu to control the HPLC instrument.
Since access to this API is restricted, we have provided a version of the code that performs the same analysis, but on preexisting data.This code, contained in HPLC_analysis.py, can be used to recreate figures 3 and 4 with the chemprop model built using Reaxys data.We provided the data plotted in figures 3 and 4 in Figure3_data.xlsxand Figure4_data.xlsx,respectively.Using a Elution time (min) Ion counts (millions)

Figure S1 .
Figure S1.Relative pipetting error versus the specified pipetting volume, plotted on semi-logarithmic axes, over a range of 2 to 200 µL.The dashed curve is a power-law regressed curve.

Figure S2 .
Figure S2.The error in the prediction of the log10 of the molar extinction coefficient versus the standard deviation of the prediction value, both normalized by the log10 of the molar extinction coefficient

Figure S3 .
Figure S3.The structures of 4-(dimethylamino)benzoic acid (left) and indocyanine green (right).The part of the 4-(dimethylamino)benzoic acid molecule interpreted by chemprop as being important to its predicted extinction coefficient is highlighted in purple.

Figure
Figure S4 (top) shows the MS chromatogram at m/z = +399 (top) and the absorption chromatogram

Figure S4 .
Figure S4.Chromatograms for the synthesis of camostat.MS chromatogram (top) at +399 m/z and PDA chromatogram (bottom) at 279 nm.Points indicated automatically extracted peaks.The orange area represents the area integrated of the camostat peak.

Table S1 .
Summary of hyperparameter optimization results y-axis), and accompanying uncertainties.