Discovery of SARS-CoV-2 Mpro peptide inhibitors from modelling substrate and ligand binding

The main protease (Mpro) of SARS-CoV-2 is central to viral maturation and is a promising drug target, but little is known about structural aspects of how it binds to its 11 natural cleavage sites. We used biophysical and crystallographic data and an array of biomolecular simulation techniques, including automated docking, molecular dynamics (MD) and interactive MD in virtual reality, QM/MM, and linear-scaling DFT, to investigate the molecular features underlying recognition of the natural Mpro substrates. We extensively analysed the subsite interactions of modelled 11-residue cleavage site peptides, crystallographic ligands, and docked COVID Moonshot-designed covalent inhibitors. Our modelling studies reveal remarkable consistency in the hydrogen bonding patterns of the natural Mpro substrates, particularly on the N-terminal side of the scissile bond. They highlight the critical role of interactions beyond the immediate active site in recognition and catalysis, in particular plasticity at the S2 site. Building on our initial Mpro-substrate models, we used predictive saturation variation scanning (PreSaVS) to design peptides with improved affinity. Non-denaturing mass spectrometry and other biophysical analyses confirm these new and effective ‘peptibitors’ inhibit Mpro competitively. Our combined results provide new insights and highlight opportunities for the development of Mpro inhibitors as anti-COVID-19 drugs.


S1.1 QM/MM studies on catalytic dyad protonation state
A crystal structure of the TSAVLQ↓SGFRK (↓ indicating scissile amide bond) substrate peptide-bound inactivated H41A variant of SARS-CoV BJ01 M pro (PDB entry 2q6g; 2.50 Å resolution) 1 was used as a starting point for modelling SARS-CoV-2 M pro (M pro ) in complex with the substrate peptide s01. The substituted residue Ala-41 was converted back to His-41 using the automodel routine in Modeller. 2 Crystal waters were conserved, and hydrogens were added in silico. Protonation states of titratable residues were assigned using PROPKA3; 3 the hydrogen bonding (HB) network was optimised using the protein preparation wizard in Maestro (Schrodinger). 4 Neutral and zwitterionic catalytic dyad models were prepared for the three possible protonation states of His-163 ( -, -, or both nitrogens protonated), resulting in six scenarios. The FF14SB force field 5 was used to describe the protein and substrate. A solvation shell of TIP3P water molecules 6 was created 5 Å around the protein using SOLVATE. 7 Further solvation was achieved by construction of a truncated octahedral cell of TIP3P water using LEaP (AmberTools19), 8 with a 10 Å distance from the initial solvation cell to the edge of the box. Na + and Clions were added randomly throughout the solvent with a concentration of 0.1 M NaCl.
Each system was minimised by restraining the protein and allowing relaxation of the solvent and ions. This was followed by a minimisation restraining only the backbone atoms, a further minimisation restraining only Cas, and a final minimisation allowing the full system to relax. Each minimisation included 1000 steps of steepest descent, followed by 10,000 steps of conjugate gradient minimisation. The systems were heated by increasing the temperature to 310 K over 100 ps. Langevin dynamics was used with a collision frequency of 5 ps -1 , and backbone atoms were restrained with a force constant of 5 kcal mol −1 Å −2 . SHAKE was used to restrain bonds to hydrogen atoms. 9 A timestep of 2 fs was used. NPT equilibration was carried out for a total of 9 ns, slowly releasing backbone restraints, with a Monte Carlo barostat maintaining the pressure at 1.01325 bar. Three repeat simulations of 250 ns of production MD were carried out on each of the six systems using pmemd.cuda (AMBER18). 8 The following QM/MM protocol was followed to carry out umbrella sampling simulations of the proton transfer (PT) between Cys-145 and His-41. Simulations were performed using sander.MPI (AMBER18). 8 The QM region consisted of the sidechains of Cys-145 and His-41. The reaction coordinate was defined as the difference between the S-H and N-H distances and was restrained with a force constant of 50 kcal mol -1 Å -2 . The reaction coordinate describing the PT was varied between -1.0 Å and 1.0 Å in steps of 0.1 Å, where a value of -1.0 Å denotes the neutral catalytic dyad and a value of 1.0 Å denotes the zwitterionic catalytic dyad. The reaction coordinate was followed in the forward and reverse directions, with starting snapshots selected from the MD trajectories of the neutral and zwitterionic catalytic dyad, respectively, to ascertain if hysteresis affects the free energy surface for PT. A total of 100 ps of sampling at the DFTB3/MM 10 level of theory was carried out in each reaction coordinate window, but the first 25 ps of sampling was treated as equilibration and discarded. A further backward run of 5 ps of sampling per window was carried out at the ωB97X-D/6-31G*/MM [11][12][13] level of theory for the structure with HID-163 using the interface to Gaussian 16. 14

S1.2 Comparative modelling of the SARS-CoV-2 M pro -peptide complexes
A crystal structure of the TSAVLQ↓SGFRK 11-mer peptide substrate bound to the H41A mutant of dimeric SARS-CoV M pro (PDB entry 2q6g) 1 was superimposed with one of unmodified dimeric SARS-CoV-2 M pro (PDB entry 6yb7; 1.25 Å resolution). 15 The substrate was transferred over to the chain A active site of the catalytically-competent SARS-CoV-2 M pro structure. The sequences of the 11 native cleavage sites processed by SARS-CoV-2 M pro (s01-s11) were identified by aligning the sequences of the ORF1ab polyproteins of both SARS-CoV (GenBank accession code NC_004718.3) 16 and SARS-CoV-2 isolate Wuhan-Hu-1 (accession code MN908947.3) 17 using MUSCLE. 18 For each of the 11 cleavage sites, atomic models of an 11-mer peptide matching positions P6 to P5ʹ and charged N-and Ctermini were constructed using the mutagenesis tool of the open source version of PyMOL (v. 2.3.0). 19 For every sidechain from positions P6 to P5ʹ, apart from Gly and Ala, the highest-probability backbone-dependent conformer with the least steric clash and the most chemical complementarity was selected. 20 Using CCG MOE version 2019.0104, 21 each of the resulting 11 models of the SARS-CoV-2 M pro dimer complexed with each 11-mer substrate in the A-chain active site underwent structure preparation protonation using Protonate 3D. Each model was then solvated using 0.1 M NaCl and explicit water and subjected to energy minimization using the AMBER10:EHT force field 22,23 and periodic boundary conditions, until convergence with an RMS of 0.4184 kcal mol -1 per iteration was reached.

S1.3 Explicit-solvent molecular dynamics
Pre-solvation models of the dimeric M pro -peptide complexes constructed as described above ( § S1.2) were used as starting points for MD simulations. All additives and crystallographic water molecules were removed from PDB entry 6yb7, except HOH 644 which provides bridges between His-41, His-164 and Asp-187 (see main text for details). Protonation and rotameric states of histidines and other titratable residues were assigned at pH 7.4 based on a combination of Reduce (MolProbity, Duke University), 24 H++ (Virginia Tech), 25 PROPKA3 (PDB2PQR), 3 and visual inspection, with a final M pro monomeric charge of -4. His-41 (protonated on its -nitrogen) and Cys-145 were assigned neutral.
MD simulations were performed using GROMACS (v. 2019.2) 26 employing the AMBER99SB-ILDN force field. 27 Each of the constructed complexes was solvated (TIP3P water model) 6 in a rhombic dodecahedral box (1.0 nm buffer), neutralised, and minimised using the steepest descent algorithm until the maximum force was below 1000 kJ mol -1 nm -1 . For each peptide sequence, three independent simulations were initiated by random velocities at 298.15 K. In each case, the system was equilibrated under NVT (200 ps; 1 fs timestep) and NPT (200 ps; 1 fs timestep) conditions, before being subjected to 200 ns MD simulation (2 fs timestep) at 298.15 K and 1 bar, during which protein-peptide interactions were monitored. All simulations were performed with three-dimensional periodic boundary conditions. Long-range electrostatics was calculated using the smooth particle mesh Ewald method. 28 All bond lengths involving hydrogen atoms were constrained with the LINCS algorithm. 29 Hydrogen bonds between M pro and the peptides were monitored over the course of the simulations, defined using a combined criteria on the donor-acceptor distance (dD-A ≤ 3.5 Å) and the proton-donoracceptor angle (∠(H-D-A) ≤ 30°).
Models of the designed sequences p12 and p13 complexed with SARS-CoV-2 M pro (PDB entry 6yb7) were built using a comparative modelling approach similar to that described above ( § S1.2), starting from the previously constructed model of the M pro -s02 complex. Each constructed complex was then solvated, minimised, equilibrated, and subjected to 3 ´ 200 ns MD as described above, except the retention of a backbone restraint during NPT equilibration to allow longer relaxation of the non-native peptide side chains and the M pro binding pockets.
To generate representative structures of M pro -peptide complexes for interaction analysis, frames extracted every ns from the concatenated 3 ´ 200 ns MD trajectories were fitted using the M pro backbone, before performing clustering based on the heavy-atom RMSD of the peptide, using the gromos algorithm as implemented in GROMACS (v. 2019.2). 30 A cut-off of 2.0 Å (for native substrates) or 2.5 Å (for p12 and p13, due to heavier residues in their terminal regions) was used.

S1.4 Interactive Molecular Dynamics in Virtual Reality (iMD-VR) and subsequent implicit-solvent MD
The crystal structure of apo dimeric SARS-CoV-2 M pro (PDB entry 6yb7) 15 was used as the target for substrate and peptide inhibitor docking using iMD-VR. Protonation states of histidines and other titratable residues were the same as described in main text § 2.1. The M pro , three natural substrates (s01, s02, s05), and two peptide inhibitors (p12 and p13) tested were parameterised using the LEaP programme (AMBER19) 8 employing the AMBER99SB-ILDN force field 27 and the OBC2 implicit solvent water model (igb=5). 31 M pro was minimised using OpenMM 32 prior to iMD-VR simulation.
For all iMD-VR simulations, a temperature of 300 K was used with a timestep of 0.5 fs. M pro , all substrates, and both peptide inhibitors remained fully flexible. Whilst in VR, each substrate and peptide inhibitor was docked to M pro following the guidance of 'trace atoms' representing where the s01 substrate should bind; this visual representation was taken from the positions of the s01 backbone atoms in the crystal structure of the H41A mutant of SARS-CoV M pro . 1 These 'trace atoms' were used as a rough visual guide to aid the docking, and the main focus whilst in VR was on establishing key hydrogen bond contacts between the protease and the three substrates and the protease and the two peptide inhibitors.
Once each substrate was docked in VR, a structure where the oxyanion hole interactions were successfully reformed was extracted. These structures were minimised, equilibrated, and subjected to 3 x 200 ns replicates of production MD in implicit solvent to ensure the substrates remained bound. In the case of the docked peptide inhibitors, the docking was repeated 5 times, and structures where the oxyanion hole interactions were successfully reformed were extracted, resulting in 5 docked structures per peptide inhibitor. The 5 docked structures were minimised, equilibrated, and subjected to 500 ns of production MD in implicit solvent. The process of minimisation and equilibration was the same for all substrate and peptide inhibitor structures, and is as follows: First, the structures were iteratively energy minimised at 10 K using slowly decreasing degrees of positional restraint. Restraints of 5 kcal mol -1 Å -2 , 2.5 kcal mol -1 Å -2 , and 1.25 kcal mol -1 Å -2 were applied to all backbone atoms for the first three rounds of minimisation respectively, and no restraints were applied for the final round. The system was heated by running 10 stages totaling 20 ps of MD, starting at 0 K and linearly increasing the temperature by 30 K at each stage until a temperature of 298 K was reached (each step had a backbone atom restraint of 5 kcal mol -1 Å -2 ). 8 rounds of 500 ps of NPT MD with slowly decreasing backbone restraints were run. Restraints were initially 5 kcal mol -1 Å -2 and halved after each step; once backbone restraints were below 1 kcal mol -1 Å -2 , only the restraints on Ca atoms were retained. The eighth and final stage had no restraints in the system at all.
For the docked substrate structures (s01, s02, and s05), following minimisation and equilibration, 3 x 200 ns replicates of production MD in OBC2 implicit solvent 31 was run for each docked substrate structure, with a protein backbone restraint of 5 kcal mol -1 Å -2 , resulting in 3 x 200 ns MD trajectories for each substrate. In the case of the peptide inhibitors, following minimisation and equilibration, 500 ns of production MD in OBC2 implicit solvent 31 was run for each iMD-VR docked peptide structure, with a protein backbone restraint of 5 kcal mol -1 Å -2 , resulting in 5 x 500 ns MD trajectories for each peptide inhibitor (due to 5 independent docked structures from iMD-VR). S1.5 Contact interaction mapping S1.5.1 Contact maps Snapshots from MD models as well as XChem derived crystal structures and covalent docking poses were analysed using Arpeggio. 33 The ligand-M pro complex was processed as described by Jubb et al. 33 by cleaning the PDB file using PDBtools 34 and running Arpeggio on all ligand-M pro contacts. For the MD snapshots of the substrate and designed peptides, a representative snapshot for each complex was chosen by selecting the highest populated cluster and the conformation within the cluster that has the lowest RMSD to all other snapshots in the cluster. From the docked covalent Moonshot submission compounds, the lowest energy pose of the highest populated cluster was chosen. The analysis of the XChem fragments and Moonshot designs was done using the published crystallised conformation. [35][36][37] The resulting Arpeggio contact map consists of a bit vector for each identified atom-atom contact and classifies them as "Clash", "Covalent", "VdW Clash", "VdW", "Proximal", "Hydrogen Bond", "Weak Hydrogen Bond", "Halogen Bond", "Ionic", "Metal Complex", "Aromatic", "Hydrophobic", "Carbonyl", "Polar" or "Weak Polar". 33

S1.5.2 Hydrophilicity maps
To calculate whether a given protein subsite corresponds to a hydrophilic or hydrophobic pocket, a hydrophilicity score was introduced. All identified atom-atom contacts that interact with a given residue in the substrate are classified as either hydrophobic (Hydrophobic, Aromatic, Halogen Bond) or hydrophilic (Hydrogen Bond, Weak Hydrogen Bond, Ionic, Carbonyl, Polar), excluding the "VdW" and "Weak Polar" interaction types since they were deemed too insignificant and usually redundant as individual atom-atom contacts. The sum of all hydrophobic atom-atom contacts was then subtracted from the sum of all hydrophilic atom-atom contacts to create a hydrophilicity score for each subsite.

S1.5.3 Interaction fingerprints
A bit vector was created for every analysed protein-ligand complex, denoting the absence (0) or presence (1) of an interaction of a single ligand with every protein residue that was found to interact with any of the known actives (namely the substrates or the XChem fragments). In order to compare the interaction networks of the ligands, a Tanimoto distance can be calculated between the fingerprint bit vectors using the Jaccard distance 38 and the ligands clustered by their calculated Tanimoto distance.
To investigate potential fragment elaboration pathways, the atom-atom contacts present in each fragment cluster were used as a baseline to investigate fragment growth. To identify if a designed small molecule ligand exhibits the same binding profile as one of the identified fragment clusters, a standardised cluster profile was created for each fragment cluster which records the presence of a residue level contact if it was classified by Arpeggio as one of the following major contacts: Aromatic, Hydrophobic, Halogen Bond, Polar, Hydrogen Bond, Ionic, Carbonyl. If more than 70% of all recorded residue level contacts of a particular cluster are occupied for an individual ligand, we classify the ligand as a member of that cluster.

S1.6 MM-GBSA calculations
The contribution of each residue to protein-peptide binding was evaluated quantitatively using per-residue decomposition of binding energy, 39 estimated using the molecular mechanics-generalised Born surface area (MM-GBSA) method as implemented in MMPBSA.py (v. 14.0) 40 in combination with sander (Amber18). 8 The single trajectory protocol was employed, with the M pro dimer defined as the receptor and the 11-mer peptide as the ligand. Snapshots were extracted every 5 ns from the 3 ´ 200 ns MD trajectories for each substrate (120 frames per substrate). The polar solvation term was calculated using the OBC2 model (igb=5) with mbondi2 radii at 0.15 M salt concentration. 31 Non-polar solvation terms were computed from surface area (recursive approximation from icosahedra) 39 and a surface tension of 0.005 kcal mol -1 Å -2 . 31 S1.7 BigDFT calculations Snapshots generated from MD (see main text § 2.2) were studied by Quantum Mechanical (QM) modelling, as implemented in the BigDFT suite. 41 The approach employs Daubechies wavelets to express the electronic structure of the assemblies in the framework of Kohn-Sham (KS) formalism of Density Functional Theory (DFT). 42 With such an approach, the code provides QM results for full systems of large sizes, thanks to the systematic approach offered by wavelets. The electronic structure is expressed by both the density matrix and the KS hamiltonian operator in an underlying basis set of so-called support functions, which are a set of localised functions that are adapted to the chemical environment of the system. Such functions are then expressed in Daubechies wavelets and there are only a few per atom (between 1 and 4 according to the chemical species). The code delivers excellent performance on massively parallel supercomputers and provides the user with the possibility of treating the entire system with the same QM level of theory. A single calculation on one MD-clustered snapshot at the PBE-D3 DFT level requires about 2 h of walltime on 2048 CPU cores (16 nodes) of the IRENE-Rome supercomputer at the TGCC Supercomputing centre in Saclay (Paris). We employed this computational setup, with the inclusion of frozen-core approximation enforced by norm conserving pseudopotentials, for all the DFT calculations presented here. The information to set up the full QM calculation (input file, code version) is available in the GitHub project associated with this publication.
The electronic density matrices as well as the KS hamiltonians expressed in the BigDFT basis were analysed to provide quantum observables on the various portions of the systems. Such a method of analysis has been employed previously and shown to be able to i) evaluate reliable physico-chemical observables on the systems' moieties, thereby decomposing an observable into fragment-based pseudo-observables and ii) assess the pertinence of a given partitioning, by providing an indicator of the quality of the pseudoobservables. 42 In particular, we analysed the strength of the QM interaction on each of the systems' residues, calculated as the matrix elements of the KS hamiltonian reduced on the amino acids. Such analysis provides a linear-response approximation of the energetic contribution by the corresponding residue to the enzyme-peptide interaction and enables a characterisation of the chemical bonding between portions of the system.
For the XChem crystallographic positions, a scheme to equilibrate the position was used, as follows. The pdbfixer program 43 was employed to optimise the crystallographic positions. Water molecules were removed, as well as the hydrogen lost when a ligand formed a covalent bond. Only the M pro monomer-ligand complex was considered for this preliminary dataset. The resulting positions were then optimised with the GFN-FF force field provided by the XTB program. 44 S1.8 Experimental studies on M pro activity and inhibition S1.8.1 Protein production and purification Recombinant M pro protein was produced and purified as reported. 45

S1.8.3 Substrate turnover analysis under denaturing conditions
20 µM stock of all 11 native substrate peptide sequences were prepared in the assay buffer (20 mM HEPES, pH 7.5, 50 mM NaCl). E1-ClipTip™ Bluetooth™ Electronic multichannel pipette (ThermoFisher) was used to dispense 5 µL/well (x24) of each peptide in a single row of a 384 well plate. The first column was treated with a final concentration of 1% (v/v) aqueous formic acid to obtain a 0 min time point. M pro was dispensed using Multidrop to obtain a final concentration of 0.15 µM M pro with 2 µM peptides in all wells. Each column was sequentially quenched (every minute) with 1% (v/v) aqueous formic acid. Samples were analysed by solid-phase extraction (SPE) coupled to mass spectrometry (MS) using a RapidFire Mass Spectrometer. The operating parameters in the positive ion mode were: capillary voltage (4000 V), nozzle voltage (1000 V), fragmentor voltage (365 V), drying gas temperature (280 °C), gas flow (13 L min -1 ), sheath gas temperature (350 °C) and sheath gas flow (12 L min -1 ). Samples were loaded onto a SPE C4-cartridge, which was then washed with 0.1% (v/v) aqueous formic acid to remove non-volatile buffer salts (5.5 s, 1.5 mL min -1 ) followed by elution with aqueous 85% (v/v) acetonitrile in 0.1% (v/v) formic acid (5.5 s, 1.25 mL min -1 ). The cartridge was equilibrated with 0.1% (v/v) aqueous formic acid (0.5 s, 1.25 mL min -1 ) prior to every sample injection. Data were exported in a plate list mode and processed in Excel to calculate percentage product turnover.

S1.8.4 Substrate binding and turnover analysis under non-denaturing conditions
Non-denaturing mass spectra were obtained using a Waters Synapt HDMS Q-TOF mass spectrometer coupled with an automated chipbased nano-electrospray ion source (TriVersa Nanomate, Advion). A larger concentration of M pro than the one used in the denaturing MS assays was used to provide sufficient sensitivity. 5 μM of M pro was mixed with 13-fold molar excess of a substrate (s01-s11) in 200 mM of ammonium acetate (pH 6.9) at room temperature and electrosprayed (1.77 kV spray voltage, 0.55 psi spray backing gas pressure and 4.3 mbar inlet pressure). The sample and extractor cone voltages were maintained at 180 V and 1 V, respectively; no in-source dissociation of M pro dimers was observed at these voltages. Mass spectra were recorded after 1, 3, 6, 9 and 12 min incubation. Measurements were taken in duplicate for each substrate. Data collection and analysis were carried out using Waters MassLynx software. Integrated peak areas of the substrate ions and cleavage product ions were compared at different time points: the sum of substrate and product ions intensities was set at 100% for each measurement, and the level of depletion of the substrate ions was used as a measure of the turnover efficiency.

S1.8.5 Dose response curve analysis
Methods for SPE coupled RapidFire MS-based inhibition assay were as reported. 45 In brief, in a 384 polypropylene well plate, 100 µL of 2.5 mM stocks of the designed peptides were transferred. 11 point 3 fold serial dilutions of the peptides were performed in 60 µL using E1-ClipTip™ Bluetooth™ Electronic multichannel pipette (ThermoFisher) in DMSO with 5 mix cycles of 30 µL volume for mixing. 10 µL was drawn from each well and 5 µL was transferred to two wells of a new destination 384 well polypropylene plate. 5 µL of DMSO (positive turnover control) and 5 µL of 10% (v/v) aqueous formic acid (negative turnover control) were added to 16 wells each on every destination plate. 25 µL/well of x2 stock of enzyme in assay buffer (20 mM HEPES, pH 7.5, 50 mM NaCl) was dispensed using a Multidrop Combi machine; incubation for 15 minutes was followed by dispensation of x2 stock of substrate in each well to obtain 0.15 µM M pro and 2 µM s01 concentration. The reaction was allowed to progress for 10 min (~ 50% turnover in DMSO control), then quenched with 5 µL of 10% (v/v) aqueous formic acid. The plates were centrifuged for ~15 s after addition of each reagent at 2500 rpm (Star lab) to ensure all dispensed solutions were pooled at the bottom of the plate. The plates were analysed by SPE coupled MS under the conditions specified in § S1.8.3. RapidFire integrator software was used to extract and integrate abundance peaks of the +1 charge states of the substrate (1191.68 Da) and N-terminal cleaved product (617.34 Da). Data was exported in a plate list mode and processed in Excel to calculate percentage product turnover, normalisation of percentage activity followed by deduction of percentage inhibition. Normalised percentage inhibition data were exported to GraphPad Prism 8 and non-linear regression analysis was performed to obtain IC50 values. Top and bottom constraints of 100% and 0% were applied respectively for the analysis of reported IC50 values curves. Z' of the assay was always ≥0.8.

S1.8.6 Dose response curve analysis with varying substrate concentrations
The designed peptides were dispensed using an Echo 550 acoustic liquid handling robot. Samples were prepared as described above ( § S1.8.5) with final substrate concentrations of 2 µM, 10 µM, 20 µM and 40 µM TSAVLQ/SGFRK-NH2 (s01) with 10, 10, 15 and 20 minutes of incubation with substrates, respectively. S1.8.7 Designed peptide turnover analysis under denaturing conditions 100 µM stocks of p12, p13, p15, p16, p13-WP2L, s01-LP2W and s01-QP1W were prepared. 0.15 µM of enzyme was dispensed and incubated with 2 µM peptide ( § S1.8.3); the reaction was allowed to proceed overnight at 37°C, 300 rpm in a thermomixer. Samples were analysed by SPE coupled MS. After integration using RapidFire Integrator, the data was analysed in Excel and presented using GraphPad Prism 8. S1.8.8 LCMS analysis for designed peptides LCMS experiments were performed using an Agilent Infinity Series II System attached to QTOF 6650 using an Agilent Zorbax C-18 Extend column. Solvent A: LCMS grade water with 0.1% formic acid, and solvent B: 100% acetonitrile in 0.1% (v/v) formic acid was used at 0.2 mL min -1 flow rate to elute the peptides over a gradient of 22-55% of solvent B over 8 minutes. The operating parameters for the LCMS were the same as above ( § S1.8.3). In a 96 well plate, samples consisting of 0.15 µM M pro were prepared. p12, p13, p15, p16 and s01 were transferred from source wells to destination wells with M pro using the multi injector programme and samples injected immediately after mixing. 30 min, 3 h, 6 h, 1 day and 2 days time points were obtained for peptides. Samples were covered with a polypropylene cover to limit evaporation.

S1.8.9 Designed peptide binding and turnover analysis under non-denaturing conditions
The binding of designed peptides p12, p13, p15 and p16 to M pro dimers and their effects on substrate turnover were investigated using non-denaturing mass spectrometry ( § S1.8.4). 5 μM of M pro was mixed with designed peptides at different levels of peptide excess in 200 mM of ammonium acetate (pH 6.9) at room temperature. Non-denaturing mass spectra were recorded for different proteinpeptide molar concentration ratios (up to 16-fold excess of peptide relative to the protein). At the final step, the native s01 substrate was added to the protein-peptide mixture at 4-fold excess over the protein, and its turnover recorded after 3-and 6-min incubation.

S1.9 Peptide docking
Docking of substrate and inhibitor peptides was performed using AutoDock CrankPep (ADCP) in the ADFRsuite (v. 1.0) package. 46 For redocking trials, the structure of s01-bound H41A SARS-CoV M pro (PDB 2q6g) 1 chain A was prepared in ADFRsuite as the receptor. For docking to SARS-CoV-2 M pro , the N3 inhibitor-bound (PDB 7bqy; 1.70 Å resolution) 47 or the C-terminal autocleavage site product-bound C145A (PDB 7joy; 2.00 Å resolution) 48 M pro structure was used. The dimeric M pro structure was used following processing with MolProbity, 24 correction of histidine states (Table S2.1), and conversion to pdbqt format with ADFRsuite. The most probable peptidebinding site on the receptor surface was predicted with AutoSite (v1.1). 49 While the peptide-binding site was successfully identified by AutoSite with the chain A monomer extracted from the dimeric structure in PDB 2q6g, with the monomeric asymmetric unit structure from PDB 7bqy binding site identification by AutoSite was unsuccessful. When the dimeric structure (PDB 7bqy) was inputted, however, the active site was successfully identified. Hence, subsequent docking with SARS-CoV-2 M pro structures was performed with the dimer. Affinity maps on the receptor were calculated with AutoGridFR (v. 1.2). 50 Each docking run was performed with 100 replicas and 11 million steps, starting from the extended peptide conformation, with solutions internally clustered by a native contact threshold of 0.8. The clustered solutions were then evaluated against the binding mode found in the original structure (SARS-CoV) or in the minimised comparative model (SARS-CoV-2). To allow for flexibility in the less tightly bound terminal regions, and to filter out solutions where the peptide positioning was offset by one residue, solutions were assessed on the criteria of < 2 Å deviation in at least three Ca atoms, after fitting to the M pro backbone with VMD (v. 1.9.4). 51 Out of the top 10 poses, the highest-scoring filtered pose, or if none of the poses passed the filter, the pose with the lowest Ca RMSD, was presented. S1.10 Protein-ligand docking S1.10.1 Dataset A large-scale crystal-based fragment screen against M pro has been conducted using the Diamond synchronton. 37 >500 fragments were screened leading to the discovery of 92 active fragments, 44 of which are covalently bound to Cys-145. The structures are available on Fragalysis. 52 The Poster.AI Moonshot project crowdsourced the design of M pro inhibitors based on the original fragment screen. All submissions are made available on the Moonshot project GitHub. 36 For our covalent docking workflow, the dataset as of the 12th of July, 2020 was used, which features 10001 submissions. A subset was created by selecting only submissions with a matching covalent warhead that cite one covalent fragment as their inspiration, correcting for duplicate and incorrect structures, which gave a final dataset of 540 compounds.

S1.10.2 Docking workflow
The goal of the workflow was to match each compound design to the corresponding covalent origin fragment and to include the binding pose information of the fragment into docking. The inspiration fragment was cited by the designer of the compound; a list of all designs and inspirations can be found on the Moonshot project GitHub. 36 Each design was matched with the corresponding fragment, the maximum common substructure (MCS) between them was identified, and the conformation of the design was aligned to the fragment before docking. The alignment was performed using a custom alignment script similar to the constrained alignment method in RDKit 53 to force the corresponding atom positions of the MCS into the same conformation, followed by a constrained energy minimisation, keeping the conformation of the MCS constant. Docking was performed using AutoDock4 (AD4), which considers ring conformations to be rigid when sampling ligand conformations before docking. 54 As a result, all rings present in the MCS are already aligned to the crystallographically observed binding pose. Each design was docked to the corresponding M pro crystal structure of the origin fragment, after generation of the homodimer and charge optimization using Protonate3D in MOE. 21 We used the FlexRes method in AD4 for covalent docking. 54 The covalent adduct of the COVID Moonshot design after reaction with the active site Cys-145 was selected as the flexible residue and a water molecule included as the "dummy" ligand. Docking and grid parameter files were generated for each Cys-145-inhibitor adduct individually with the rest of the corresponding co-crystallised dimeric M pro structure treated as the rigid receptor molecule. Docking with AD4 was performed using the Lamarckian Genetic Algorithm (LGA) and the following AD4 hyperparameters: population size 300; maximum number of energy evaluations 250 000; maximum number of generations 27 000; number of dockings 100.
The scoring function used by AD4 54 includes pairwise evaluation of intermolecular interactions of the ligand and the protein, intramolecular interactions between residues of the protein and covalent adduct, and an estimation of the conformational entropy lost upon binding. For the evaluation of the covalent docking procedure, only the intramolecular terms are relevant, since they correspond to the changes in the energy of the flexible residues plus covalent adduct. Since AD4 automatically clusters docking results by the total estimated free energy of binding, covalent docking results must be re-clustered using the "Final Total Internal Energy" instead (as reported in the DLG docking log file). For clustering the docked poses of the covalent adducts, the same hierarchical clustering procedure as used in the native AD4 method is employed. A new cluster is seeded with the lowest energy pose, and all remaining poses within a threshold (< 2 Å RMSD) are added to that cluster. The procedure is repeated for the next lowest energy pose, until all docked poses have been assigned to a cluster. RMSD values between poses were calculated using the Open Drug Discovery Toolkit (ODDT), 55 to account for intramolecular symmetry, such as equivalent methyls in tertiary butyl groups.
Docked poses were compared to the original origin fragment crystal structure using SuCOS. 56 SuCOS produces normalised scores to a value between 0 and 1, where 1 indicates perfect overlap and identical molecules. Both the shape and pharmacophoric feature overlaps are weighted equally in the SuCOS score. Based on work by Leung et al., 56 a SuCOS score of 0.55 between two molecules was found to be equivalent to a pose-pose RMSD of 2 Å.
Covalent docking for PF-07321332 was performed identically to that for the other covalent Moonshot designs with the exception that no pre-alignment of the ligand was performed prior to docking. Instead, a random conformation of the ligand was used to seed the docking process. The azanide nitrogen was assigned a negative charge prior to docking.     Table S2.1: SARS-CoV-2 M pro histidine protonation states adopted in this study, based on the apo crystal structure (PDB entry 6yb7). 15 Using standard nomenclature in the AMBER force fields, -protonated His residues are denoted "HID" and -protonated His residues "HIE".

HID
-NH forms a hydrogen bond with Asn-63 side chain oxygen.

HIE
-NH forms a hydrogen bond with substrate P1 Gln side chain oxygen. In the apo M pro structure (PDB 6yb7), the sulfoxide oxygen of a DMSO molecule takes the place of this oxygen.

HIE
Protonation state uncertain based on inspection of crystal structure, but preliminary MD simulations (vide infra) suggested HIE is more stable.

HIE
-NH forms a hydrogen bond with Glu-166 side chain oxygen.

HIE
-N accepts a hydrogen bond from Thr-243 backbone NH.

S2.3.1 MM-GBSA analysis
We analysed van der Waals and electrostatic contributions to protein-substrate interactions by employing the molecular mechanicsgeneralised Born surface area (MM-GBSA) method. 39,[62][63][64] The ten M pro residues contributing most to the binding energy were identified for each of the 11 substrate complexes (Figure S2.13). These residues are all proximate to the complexed substrate (Figure S2.14) and were also identified by Arpeggio as conserved contacts (main text Figure 4).
As anticipated, residues that form the most stable HBs (HBs 2, 3, 10, 11), namely Glu-166 and Thr-26, display large favourable contributions (Figure S2.14); Glu-166 sidechain has a larger total contribution due to its interaction with P1 Gln. Other HB-forming residues (backbones of Cys-145, Thr-190, and Thr-24; and sidechains of His-163 and Gln-189) are also identified as hotspots. For the remaining consistently interacting residues, contributions to binding are dominated by van der Waals interactions (Figure S2.15), exemplified by Met-165 and His-41, both of which engage in non-polar contacts with the hydrophobic P2 residue. Optimising interactions with these hotspot residues could help guide the design of optimal M pro inhibitors.
By applying a similar per-residue decomposition of MM-GBSA binding energy at each substrate position (Figure S2.16), contributions from the P2 sidechain are significant (Figure S2.17). The S2 site appears to tolerate Phe (s02) well, in addition to the more common Leu that is found in nine of the eleven substrates, while Val (s03) is less favourable. The possibility of filling the S2 pocket with a larger, aromatic moiety is of interest in designing inhibitors.

S2.4.1 Comparison of energetic panorama between DFT and force field
Below we compare the energetic panorama that is provided by DFT and the force field (FF) and show that the results correlate (supporting a reasonable MD configurational sampling). Enzyme-substrate interactions were analysed by post-processing QM results, which enable insight into the possible interaction patterns (or "interaction signatures") for each of the substrates. Note that the ability to treat the whole system with a consistent level of theory enables analysis of both short-range and long-range interactions on an equal footing.
Using this approach, energies were computed for the substrate (S), the protein dimer (P) and the total assembly (A) from which an interaction energy was calculated as follows: Eint = EA -(ES + EP); solvent molecules were omitted for easy comparisons. While absolute estimates of physical quantities were not our focus (due to the omission of entropy and solvent contributions), this approach provides an unbiased first-principles approach, complementary to that provided by classical FFs, to identify the main features involved in substrate-protein binding. Figure S2.18: Comparison of the three-point interaction energies (kcal mol -1 ) coming from full DFT calculation of the clustered MD snapshots. DFT interaction energies (x-axis) are represented together with FF three-point interaction energies (y-axis). We regroup the data on a per-peptide basis, for peptides s01 (blue), s02 (red), s05 (green), and p13 (orange). Statistic distributions are averaged taking into account the weights of the clusters in the trajectories.
The results (Figure S2.18) show a good correlation between the different approaches. The energetic panorama offered by the MD of the assemblies is likely similar to what would have been found by employing a first-principle approach, supporting a reasonable conformational sampling by MD. This also indicates that charge-polarisation, which would be captured (at least partially) by DFT, does not play a major role for the peptide-enzyme interaction. This is related to the fact that the most charged peptides are ones which exhibit more attractive interactions. Within this assumption the QM interaction energy can be efficiently approximated by the electrostatic peptide-enzyme interaction. Indeed, we show (Figure S2.19, left) that there is a remarkable correlation between the threepoint interaction and such an approximated term (which is calculated by a multipole expansion of the DFT charge density result). To validate this, we performed the same analysis by employing the polarizable FF Polaris(MD), 65 the results of which show almost perfect correlation between the two FFs (Figure S2.19, right). Note that the contribution to the interaction energy that is provided by van der Waals terms (the semi-empirical D3 dispersion term) is approximately constant (around 100 kcal mol -1 ), and therefore the trends for the various interactions are provided by considering only the DFT-PBE contributions.

S2.4.2 Contact map derived from electronic structure
A quantity of particular interest in determining the interaction network is the system's Hamiltonian, which gives a measurable indication of the strength of the chemical bonding among the electron clouds of each of the fragments. The partial traces of this operator, once projected on the system's fragments, provide an indication of the fragments which participate in bonding during the trajectories. Such "contact interaction energy", Econt, can be interpreted as short-range sharing of electrons between two fragment residues. For each system, we have plotted this term for the ten most stabilising interactions (Figure S2.20). Figure S2.20: Violin plot distributions of contact interaction energies (kcal mol -1 ) between substrate/designed peptides (s01, s02, s05, p13) and selected M pro residues (which rank in the top ten in terms of stabilising interactions with the peptide in at least one of the systems), displayed (left) on the peptide or (right) on the enzyme residues.

S2.4.3 Long-range electrostatic interaction patterns
The QM-FF comparison above (Figure S2.19, left) suggests that peptide binding trends can be estimated, at first approximation, by only considering the long-range electrostatic interactions. As we have employed a full QM calculation on the entire system it is interesting to show which are the emerging patterns of these interactions for the different peptides. This is helpful in defining other "interaction signatures" which are based on long-range patterns. Below (Figure S2.21) we show how such interactions behave during the dynamics. For clarity, we only represent residues whose magnitude of the (MD-averaged) interaction is larger than 7 kcal mol -1 for at least one of the systems. Some residues belonging to the other M pro monomer (chain B) also appear to be relevant, due to their relative geographical proximity with the substrates. Obviously, the main overall pattern is dictated by the total charge of the bound peptide, which disfavours the neutral s05 in comparison with the other positively charged peptides (s01, s02, p13). The strong binding role of Glu-166 is clearly visible. Figure S2.21: Distribution of long-range electrostatic interaction energies (kcal mol -1 ) between substrate/designed peptides (s01, s02, s05, p13) and selected M pro residues (which show a >7 kcal mol -1 interaction with the peptide in at least one of the systems), displayed as (left) a heatmap and (right) a box plot.   (Figure S2.22). Depletion of s04 was also observed, but its likely hydrophilic products were not observed potentially due to weak retention by the SPE C4 cartridge. Conditions: 0.15 µM M pro , 2 µM substrate peptide in 20 mM HEPES, pH 7.5, 50 mM NaCl at 37°C, 300 rpm.    66 This approach involves determining the IC50 ratio (R) at two substrate concentrations (S1 and S2). The value of R is then used to assign the compound a (likely) competitive (R1), non-competitive (R2), uncompetitive (R3) or mixed (R4) inhibition mode. Using S1 = 2 µM and S2 = 40 µM (highlighted with a rectangle), the calculated R value for competitive inhibition (R1) is 3.38. For the designed peptides (p12, p13, p15, p16), the results suggest competitive inhibition with respective R values of (3.61), (6.02), (4.56) and (4.85).     RMSD of p12 and p13 heavy atoms (i.e. not including hydrogens) across 5 ´ 500 ns MD simulations of iMD-VR docked structures, compared with the starting structure (i.e. the docked structure). The RMSD of p12 stays around 2.5-5.5 Å across all simulations, indicating that a common, stable bound structure has been found. By contrast, the RMSD of p13 is higher, with values ranging 4.5-8.5 Å. Figure S3.8: RMSF of p12 and p13 backbone atoms across 5 ´ 500 ns MD simulations of iMD-VR docked structures. In simulations of p12, the Pʹ side of the peptide (residues -YSQFY) is more flexible than the P side (residues KYTFWQ-). In simulations of p13, the P' and P sides of the peptide seem equally flexible, with residues near the catalytic dyad (residues -WQN-) having a low RMSF.      .14: Hydrophilicity map of the eleven 11-mer substrate peptides (s01-s11) and the designed peptides (p12-p16). Hydrophilicity scores were calculated as a sum of all hydrophilic contacts subtracted by the sum of all hydrophobic contacts of each residue in the peptide, as identified by Arpeggio. More positive scores correlate to hydrophilic sites and negative scores to hydrophobic pockets. Note the yellow at the P2 Trp for p12, p13, and p14 show that the P2 residue can become more deeply buried within S2 than the analogous P2 residues in the natural substrates, forming more than double the number of hydrophobic contacts in the cases of p12 and p13.     Table S3.5: Assessment of selected binding poses from ADCP docking of each substrate (s01-s11) and designed (p12, p13, p15, p16) sequence in the crystal structure of SARS-CoV-2 M pro , originally in complex with the N3 inhibitor (PDB 7bqy). 47 For each peptide sequence, the 10 highest ranked solutions were evaluated by comparison to the M propeptide complex resulting from comparative modelling followed by MM minimisation. For every sequence, the highest ranked solution that passed the filter of < 2 Å deviation in at least three Ca atoms, or if none of the solutions passed the filter, the solution with the lowest Ca RMSD, was presented below.  Table S3.5; number in parentheses = pose rank) from ADCP docking of each native substrate (s01-s11) sequence in the crystal structure of SARS-CoV-2 M pro , originally in complex with the N3 inhibitor (PDB 7bqy). The catalytic dyad His-41 and Cys-145 are in magenta. Docked structures with the P4 and P2 residues positioned correctly in their respective S4 and S2 pockets were consistently obtained. However, greater variation was observed in the positioning of residues from P1 onwards, likely due to the poorer definition of the Sʹ subsites.  Table S3.5; number in parentheses = pose rank) from ADCP docking of each designed (p12, p13, p15, p16) sequence in the crystal structure of SARS-CoV-2 M pro , originally in complex with the N3 inhibitor (PDB 7bqy). The catalytic dyad His-41 and Cys-145 are in magenta. Table S3.6: Assessment of selected binding poses from ADCP docking of each designed (p12, p13, p15, p16) sequence in the crystal structure of C145A SARS-CoV-2 M pro , originally in complex with the s02 cleaved product (PDB 7joy). 48 For each peptide sequence, the 10 highest ranked solutions were evaluated by comparison to the M propeptide complex resulting from comparative modelling followed by MM minimisation. The highest ranked solution that passed the filter of < 2 Å deviation in at least three Ca atoms, or if none of the solutions passed the filter, the pose with the lowest Ca RMSD, was presented below.  Table S3.6; number in parentheses = pose rank) from ADCP docking of each designed (p12, p13, p15, p16) sequence in the crystal structure of C145A SARS-CoV-2 M pro , originally in complex with the s02 product (PDB 7joy). The catalytic dyad His-41 and Ala-145 (in place of Cys-145 that is present in the catalytically competent protein) are in magenta. Figure S4.1: Views from crystal structures of all XChem fragments and their binding site on the M pro dimer. 37 All binding sites are on chain A (white). As a representative structure the fragment x0830 co-crystal structure was used. There are 66 fragments that bind into the active site (green fragments) and 25 allosteric fragments (fragment 1101 binds in two different allosteric sites).

S4.1.1 Fragment clustering by interaction fingerprints
A fingerprint bit-vector was constructed for every active-site binding fragment, with each bit denoting the presence or absence of an interaction with every M pro protein residue found to interact with either a substrate or fragment. A distance matrix was created by calculating the Tanimoto distance 38 between interaction fingerprints. Fragments were then clustered using the interaction Tanimoto similarity index. A similarity score of 1 corresponds to perfect overlap of all fragment-residue level contacts, and 0 denotes no overlap. Note that the Tanimoto index is computed between Arpeggio-derived contact fingerprints rather than ligand structure extendedconnectivity fingerprints (ECFPs). For this analysis, a tighter (0.7) and broader (0.5) threshold were chosen. The former was found to be better at distinguishing distinct binding modes, while the broader threshold was more useful in grouping binders around major significant interactions (see Table S4.1 and Figures S4.2-3 for further details).    Among the clusters using the 0.5 threshold clustering method (Figure S4.5), cluster 5 stands out as it the only major cluster with fragments that bind deeply into the S1 pocket, having one of the main conserved contacts identified for the substrate peptides. This cluster shows a distinct binding motif primarily driven by (i) hydrogen bonding between a carbonyl oxygen on the fragment and the Glu-166 backbone NH-group, and (ii) a strong polar interaction between His-163 and the fragment. Notably, the position of the hydrogen on the imidazole sidechain of His-163 appears to depend on the HB fragment partner.

Tanimoto threshold Number of clusters
Based on the presence or absence of either a HB donor or acceptor on the fragment, the protonation state of His-163 can be inferred. This suggests that for x0107, x0434, x0540, x0678 and x0967, the His-163 -nitrogen is protonated, forming a HB to the pyridine nitrogen (x0107, x0434, x0540 and x0678) or phenol oxygen (x0967). For x1093, the -nitrogen is protonated, leaving the -nitrogen free to form a HB with the indole -NH of x1093, reversing the HB polarity compared to the other fragments in the cluster. Nonetheless, the same binding geometry is observed in both cases and the clustering algorithm correctly assigns the molecules into the same cluster.

S4.1.2 Descriptor based on contact and long-range interactions
To further understand how the XChem fragments interact with M pro , we performed linear scaling DFT calculations, as done for the substrate/designed peptides. Both short-range and long-range interaction terms were analysed and the computed coupling strengths between fragments and enzyme residues were compared to those obtained for the natural substrates, with the aim to identify potentially interesting inhibitor candidates. We have made these results publicly available in the following links: Contact Interaction: https://maayanlab.cloud/clustergrammer/viz/603f6484d0867e01721c4820/contact_interactions_subs_XChem.tsv Electrostatic Interaction: https://maayanlab.cloud/clustergrammer/viz/603f624cd0867e01721c47e5/electrostatic_interactions_subs_XChem.tsv A hierarchical visualisation of contact and long-range interactions enables identification of families of different inhibitors (Figure S4.6 below; left and right panels respectively). Within this clustering approach, the fragments/inhibitors belonging to the same family as the native substrates (s01, s02, s05) are highlighted in red and the cluster 5 compounds identified with Arpeggio in bold red (main text § 4.1), by considering the short-range DFT interaction as a descriptor. Remarkably, all cluster 5 compounds belong to the same family as the native substrates. Although this analysis is at a preliminary stage, we believe that this is potentially a powerful direction of investigation as it enables agnostic comparison between compounds of different size and nature based on first-principle considerations.
Using long-range electrostatic interactions as a descriptor, cluster 5 compounds split into two main subfamilies. One of the compounds (x0540) belongs to the same subgroup as the substrate peptides, while the remaining compounds are grouped into another subfamily with compound x0874. Compounds that show similar long-range interaction patterns but different contact interaction patterns as substrates s01/s02 are highlighted in purple.
This analysis can be interpreted in two ways: on one hand, it enables a criterion to single out some XChem compounds which provide interaction patterns similar to those exhibited by native substrates. We can thus generalize the cluster 5 group of compounds into a larger family, which also includes the compounds highlighted in red (Figure S4.6). On the other hand, when applying such a criterion to long-range interactions, other compounds (in purple) with interaction patterns similar to those shown by the native substrates can be identified. It is of interest to further investigate such interaction pattern categorisation with the aim of identifying potent inhibitors with novel modes of action. Figure S4.6: Dendrograms showing the hierarchical clustering of XChem fragment inhibitors, substrate peptides (s01, s02, s05), and designed peptide p13, considering descriptors defined on (left) short-range contact and (right) long-range DFT electrostatic interactions. The average correlation distance between interaction patterns is indicated on the x-axis. A family of compounds (red) which contains the peptides (s01, s02, s05, p13) as well as cluster 5 binders (bold) is identified using short-range interactions (left). Fragments that belong to the same family as s01 and s02 based on long-range (right) but not short-range interactions are highlighted in purple.

S4.2 Interaction analysis of COVID Moonshot compounds
As of the 11 th January, 2021, the COVID Moonshot M pro inhibitor project had reported fluorescence assay data for 798 inhibitors and SPE MS assay data for 784 inhibitors; 35,36 X-ray crystal structures for 245 inhibitors in complex with M pro were reported. To test our hypothesis that the interactions identified in the substrate and fragment analyses are important for inhibitor design, the Moonshot structures were processed with Arpeggio and ligand-M pro interaction fingerprints were generated and compared to the most important interactions identified in fragment cluster 5. A Moonshot compound was classified as a cluster 5 binder if it shared at least 70% of the atom contacts found in cluster 5 binders. This analysis resulted in 101 assayed cluster 5 binders (main text Figure 16). Since the sensitivity of the activity assays are capped at 99 μM, weak binders could not be accurately quantified and were labelled with IC50 values of 99 μM. 36 In the following analysis the fluorescence assay results for 798 inhibitors were used.
When comparing the average IC50 of cluster 5 binders with the rest of the dataset, cluster 5 binders bind slightly more tightly with an average IC50 of 42.2 μM (95% confidence interval: 34.55-49.87 μM), while the average IC50 of all compounds is 54.0 μM (95% confidence interval: 50.95-57.07 μM). However, given the bimodal distribution of the data (Figure S4.7), a better metric is the hit rate of good compounds and the number of 'inactive' compounds in the dataset: 2 of the 10 best Moonshot compounds are cluster 5 binders, as are 10 of the top 10% (81 compounds) of Moonshot compounds. Out of the 101 cluster 5 binders, only 15 were identified as weak binders, while out of all 798 assayed compounds 263 (33%) were weak or non-binders. As a result, the observed IC50 difference between cluster 5 binders and other inhibitors might be higher than currently calculated. Note, only 245 of the 798 assayed compounds had been crystallographically analysed so it is unknown how many of the unsolved compounds are also cluster 5 binders.
Thus, by analysing fragment and substrate interactions with M pro , we were able to extract and cluster structural interaction fingerprints and found in cluster 5 a promising non-covalent binding mode in the M pro active site. Analysis of the large set of assayed Moonshot compounds reveals that cluster 5 binders are potent inhibitors with a higher hit rate of actives than the rest of the dataset. Cluster 5 binders represent highly promising starting points for further optimisation. The kernel density estimation (KDE) and histogram of the distribution of the COVID Moonshot dataset for the fluorescence and RapidFire dataset are shown in Figure S4.7. 35

S4.3 Covalent docking of COVID Moonshot compounds
We selected the 540 covalently reacting compounds of the 10,001 designed Moonshot compounds and docked them using the knowledge-based pre-alignment docking method using AutoDock4 ( § S1.10). 54 These compounds were designed by the Moonshot consortium, usually using one or more of the 44 covalent fragments as a core structure. 36 Only compounds with a matching covalent warhead to the inspiration fragment that also cite a single covalent fragment as their inspiration were selected to form the dataset of 540 compounds. To take advantage of the many diverse induced-fit conformations of M pro , each designed compound was docked into the M pro structure of the corresponding covalent "inspiration fragment". 36 The lowest energy docked pose in the highest populated cluster of each docking run was used to identify contacts using Arpeggio and generate the interaction Tanimoto distance matrix using a Tanimoto similarity threshold of 0.5 or 0.7 between poses of each cluster as described above for the XChem fragments. The broader clustering threshold of 0.5 leads to a total of 5 clusters, with the first cluster containing 477 of the 540 poses (88%) and no single-pose clusters; while the tighter threshold of 0.7 results in 46 clusters with 12 single molecule clusters. As expected, the diversity of binding modes of these compounds is much lower than in the original XChem fragment set, due to the limited number of fragments (44) and the reduced structural diversity of the designs, all being covalent S1/S1ʹ binders.
We subsequently analysed the ability of the procedure to recapitulate the binding pose of the parent fragment when docking the fragment-based Moonshot designs. We compared the shape and pharmacophoric (SuCOS) overlap of the lowest energy pose of the highest populated cluster for each Moonshot compound with the inspiration covalent XChem fragment referenced by the designers (Figure S4.8). A SuCOS score of 0.5 and higher was sufficient to consider the binding poses of the crystallographic fragment and docked design as conserved. 56 Due to creative freedom in the design process, some of the designed compounds do not overlap significantly with the inspiration fragments and in some extreme cases only have the covalent warhead in common. When controlling for the smallest maximum common substructure (MCS) that encompasses at least the covalent warhead and one additional atom in the compound, 379 docked designs remain, from which 132 (34.8%) recovered the binding mode of the inspiration fragment. Given the high similarity between the fragments and the docked designed compounds, it is likely that these binding modes are more representative of the actual binding mode of the ligand. Furthermore, by selecting docked compounds that did not have significant modifications to the original fragment (less than 10 atoms difference to the MCS between fragment and docked compounds), the number of compounds regaining the binding pose was 87 (54.7%) of the 159 compounds. Interestingly, the distribution of SuCOS over the data subsets resembles a bimodal distribution in all cases, with one peak around a SuCOS of 0.25 and a second peak between 0.7-0.8 depending on the subset (Figure S4.8). This suggests that the AD4 docking process performs well generally in identifying binding poses similar to the original fragment (especially in cases where few modifications were made), but fails completely in some cases, resulting in almost no overlap between the pose and the fragment and an extremely low SuCOS (<0.3). This observation is in line with the expected changes in binding mode of a molecule when subjected to large changes in structure and does not necessarily correspond to incorrect docked poses. As a result, when filtering for compounds where only minor changes were made to the inspiration fragment and the docking pose generated regained the same binding pose as the fragment, we increased the probability of gaining relevant poses. However, thorough validation of this hypothesis is yet to be done.
At the point of analysis, only 6 of the 540 docked covalent compounds have been analysed crystallographically. 52 These structures were used as a limited benchmark for the docking method; an overlay of the crystallographic conformation, the lowest energy pose of the highest populated cluster from docking, and the corresponding crystallographic structure of the inspiration fragment is shown ( Figure  S4.9). x10899 (Figure S4.10) was excluded from further analysis since it binds via a crystal contact to a third symmetry-related M pro molecule, rather than the biologically relevant dimeric state. 60 The binding modes of two compounds, x3077 and x10306 (Figure S4.9a and S4.9e respectively), were reproduced perfectly. The method places the aromatic sidechain of x3324 (Figure S4.9b) correctly into the S2 pocket of M pro but varies on placement of the linker when compared to the crystal structure. However, in this case, the original fragment that was cited as inspiration has no overlap with the designed compound. As a result, the induced fit shape of the active site of the M pro structure used for docking complements a completely different molecule and the alignment before docking pointless. For x3325 and x10172 (Figure S4.9c and S4.9d respectively), the selected lowest-energy pose of the highest populated cluster did not match the binding pose of the crystal structure.