Identification of SARS-CoV-2 Mpro inhibitors through deep reinforcement learning for de novo drug design and computational chemistry approaches

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused a global pandemic of coronavirus disease (COVID-19) since its emergence in December 2019. As of January 2024, there has been over 774 million reported cases and 7 million deaths worldwide. While vaccination efforts have been successful in reducing the severity of the disease and decreasing the transmission rate, the development of effective therapeutics against SARS-CoV-2 remains a critical need. The main protease (Mpro) of SARS-CoV-2 is an essential enzyme required for viral replication and has been identified as a promising target for drug development. In this study, we report the identification of novel Mpro inhibitors, using a combination of deep reinforcement learning for de novo drug design with 3D pharmacophore/shape-based alignment and privileged fragment match count scoring components followed by hit expansions and molecular docking approaches. Our experimentally validated results show that 3 novel series exhibit potent inhibitory activity against SARS-CoV-2 Mpro, with IC50 values ranging from 1.3 μM to 2.3 μM and a high degree of selectivity. These findings represent promising starting points for the development of new antiviral therapies against COVID-19.


Introduction
Identified in China in December 2019, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) quickly escalated into a global pandemic as the infectious agent of COVID-19, spreading rapidly across the world, resulting in more than 774 million reported cases and causing 7 million deaths. 1 In parallel, vaccines and monoclonal antibody treatments became accessible within a year to combat the severe coronavirus disease 2019 (COVID-19).However, the urgent need persisted for therapeutic solutions for individuals who were unvaccinated, immunocompromised, or experiencing diminished vaccine immunity. 2One promising approach is the development of small molecule inhibitors targeting the main protease of the virus, also known as Mpro.This enzyme plays a critical role in the replication of the virus, making it an attractive target for drug discovery. 3,4irmatrelvir-ritonavir (Paxlovid) and ensitrelvir are Mpro inhibitors which were approved for the treatment of COVID-19 in 2021 and 2022, respectively.6][7] Moreover, the growing emergence of drug resistance underscores the necessity for the continued development of Mpro inhibitors, despite the already significant progress. 8,9n this context, the Covid Moonshot effort is a collaborative and open-science initiative 10 aimed at discovering drugs to target the SARS-CoV-2 main protease.The campaign utilized crowdsourcing, high-throughput structural biology, machine learning (ML), and molecular simulations to identify new chemical series with potent nanomolar activity.Through this effort, a comprehensive understanding of the structural plasticity of the main protease was obtained, along with extensive structure-activity relationships for multiple chemotypes and a large dataset of biochemical activities.Notably, the initiative achieved a significant milestone by openly sharing all compound designs, crystallographic data, assay data, and synthesized molecules, creating a large knowledgebase for future anti-coronavirus drug discovery that is accessible and free from intellectual property restrictions.The Covid Moonshot efforts have been highly beneficial to our research.They discovered several potent chemical series and provided valuable insights into the structure of the SARS-CoV-2 main protease.The open sharing of data has created a valuable resource for our anti-coronavirus drug discovery and helped accelerating our progress.
In addition, in recent years, ML methods have gained popularity in the field of drug discovery due to their ability to accelerate the identification of novel drug candidates. 11,12One such approach is deep reinforcement learning for de novo drug design, which involves a trained neural network, scoring components and a reward function.The trained neural network or deep generative model (DGM) generates novel chemical structures, the generated molecules are ranked by scoring components and a reward function combines all the scores.The generative model iteratively adapts its policy to maximize the reward and therefore generates, over time, molecules with desirable properties, such as high affinity, good pharmacokinetics, and low toxicity.
Researchers have applied a similar approach to the identification of epidermal growth factor (EGFR) inhibitors, using a generative model pre-trained with ChEMBL 13 molecules and an EGFR predictive ML model.The generated structures were experimentally validated and led to the identification of several new bioactive molecules. 14Another similar approach employing reinforcement learning has been applied retrospectively to demonstrate that a DGM could generate active molecules where these molecules were not included in the training dataset. 15Finally, de novo design approaches, without reinforcement learning, were applied to discover new SARS-COV-2 Mpro inhibitors for which two approaches were experimentally validated 16,17 while two others were not. 18,19n this study, we focused on the design of non-covalent inhibitors and have used, to this end, REINVENT 2.0, 20 an AI tool for de novo drug design developed by Thomas Blaschke et al.The generative engine of the tool is directed by a reinforcement learning (RL) module.We customized REINVENT by two additional components which we developed ourselves: a 3D pharmacophore/shape-alignment scoring component and a substructure match count scoring component.After the generation process, we employed several filtering steps to prune the number of hits.These primary hits underwent a hit expansion followed 3D structure-based selection through molecular docking.
As a result of these efforts, four new Mpro inhibitors were identified.One of them underwent optimization using a 3Dstructure based approach, while another one was optimized using a classical structure activity relation (SAR) approach.The affinities of the compounds were measured by a Mpro FRET IC 50 assay, with values ranging from 1.3 μM to 2.3 μM.However, one of the four compounds exhibited close similarity to known Mpro inhibitors and was therefore excluded from further consideration.
Based on these findings, the three remaining compounds were selected as starting points for additional refinement in the Idorsia hit-to-lead pipeline.
While our approach could probably be used for the discovery of covalent inhibitors, we decided to solely concentrate our effort in the identification of non-covalent inhibitors.

Results and discussion
The details of our selection workflow can be found in the "Experimental section".However, the different steps can be identified in Fig. 1 to which we will frequently refer in the remaining manuscript.A hit expansion and structure-based approaches follow this selection workflow.

Scoring components for the reinvent RL
A 3D pharmacophore and shape alignment scoring component from single active conformer queries was adapted as a REINVENT scoring component.This component was derived from PheSA 21 (pharmacophore-enhanced shape alignment), a tool developed at Idorsia.The reference Mpro inhibitor dataset was used to query the PDB files (obtained from the https://fragalysis.diamond.ac.uk/ website) and to extract the active conformers of the exact same ligand of the 225 reference inhibitors.69 active ligands were identified and extracted from PDB files.PheSA descriptors were generated for these query active conformers.The 69 inhibitors were clustered, and non-covalent representatives of each cluster were selected.In total, 23 query compounds were selected.The 23 PheSA queries (single active conformers) were used to screen the test set made of 69 active and 6000 compounds considered inactive as randomly picked from the enamine HTS collection.Molecules with a PheSA score > = 0.63 were considered predicted "active".

RSC Medicinal Chemistry Research Article
The SMC component is a privileged fragment substructure match scoring component.A matched molecular pair analysis was performed on the reference Mpro inhibitor dataset.Fragments that allowed a 2-fold gain in potency were selected.Fragments containing covalent warheads were discarded.The selected fragments were then combined with the noncovalent fragment hits discovered in the frame of the Covid Moonshot crystallographic screen. 22Only fragments with Mw < = 350 g mol −1 were retained and duplicates were removed.In total a collection of 265 privileged fragments were obtained.
The following analysis was performed with the Knime analytical platform 23 combined with Python: The COVID Moonshot FRET activity dataset was split into two groups: • 354 active molecules < = 10 μM (FRET IC 50 ).
A substructure match was performed using these compounds against a set of selected privileged fragments.
The Python script counts how many fragments would match a COVID Moonshot compound.
Again, the reference compounds were split into two groups.
• 125 matching counts > 15 fragments.The results of this validation are depicted in Fig. S1 of the ESI † and demonstrate the beneficial impact on inhibitory activity for molecules displaying a high SMC as we can observe an enrichment of "active" molecules for compounds with a matching count >15 fragments.All datasets are described in the "Experimental section" and are electronically available as part of the ESI.† The Python script was written to allow generated molecules to be evaluated against a list of privileged fragment substructures.The script counts how many privileged fragments match a molecule.Then a membership trapezoidal scoring function is applied as depicted in Fig. S2 of the ESI.† The script can score molecules based on their substructure match count against privileged fragments.The script was implemented as a scoring component in the DGM RL allowing the generation of molecules containing many privileged fragments.

AI generated molecules
The quantitative estimate of the drug-likeness QED 24 scoring component was used to help the DGM in generating "druglike" molecules.The Jaccard distance component was employed to assist the model in generating molecules dissimilar to already known Mpro inhibitors.Two additional self-developed scoring components, the 3D pharmacophore/ shape alignment (PheSA) component and the SMC scoring component, were applied to help the DGM to generate Mpro bioactive molecules.
Two DGM scenarios were considered for generating molecules, for the first scenario, "exploration", the pre-trained DGM was used as such to explore the chemical space for the generation of new Mpro inhibitors.For the second scenario, "exploitation", the DGM was retrained with a set of known Mpro inhibitors collected from the Open Science consortium COVID Moonshot and the ChEMBL database to exploit the known Mpro chemical space.
The two scenarios were then used in an RL fashion employing the scoring components described above.Both systems were trained for 500 epochs for the exploitation mode and 1000 epochs for the exploration mode until the scores of each individual scores plateaued out.
During the exploitation approach, the originally trained prior with ChEMBL molecules was retrained for 40 epochs with the DGM-TL dataset containing 338 known Mpro inhibitors.The purpose of this "Transfer Learning" approach is to bias the original DGM with compounds from the Mpro chemical space with the expectation that the generated molecules will be similar to the ones it was trained for.
Once the prior was retrained, the system was used in a RL context for 500 epochs where the PheSA, Jaccard distance, SMC, and QED scoring components were utilized to score the generated molecules and assist the system in understanding the desired properties.The best RL agent was obtained with the following parameters: • Epochs: 500.
The individual scores of the exploitation mode were monitored with Tensorboard as depicted in Fig. S7 of the ESI.† The graphs show a score increase across all scoring components.However, a plateau is reached for PheSA after 350 epochs and for QED, while unstable, after 150 epochs.
During the exploration approach, we utilized the original pre-trained prior based on ChEMBL molecules without any retraining.This model was directly applied in a reinforcement learning setting, incorporating scoring components like PheSA, Jaccard distance, SMC, and QED.We ran this setup for 1000 epochs to help the system grasp the targeted molecular properties.The aim of this second approach was to ascertain if the system could produce the desired molecules with reduced bias.The best RL agent was obtained with the following parameters: • Epochs: 1000.and 400 epochs for SMC.For the Jaccard distance, a steep increase was observed then a decline and again an increase.For QED, the score improved until 400 epochs and then deteriorated.
By comparing the 2 approaches, focused and exploration, it is noticeable that the focused approach allowed an overall better and quicker learning process.It is also interesting to see that for PheSA and Jaccard distance a similar score can be obtained for both approaches.For QED, it was difficult to improve or even maintain the original scores while using the exploration approach.Finally, the transfer learning significantly helped the system to generate structure containing privileged fragments as we can see by comparing both SMC score figures.
The above trained agents were used to generate about 400 000 molecules.This design step is depicted in the top left of Fig. 1.

AI generated molecule selection workflow
A machine learning model for bioactivity predictions was developed and validated with the Knime analytical platform for the selection of the AI generated molecules.The ML dataset was labelled as follows: • Label 1 (active) < = 10 000 nM.
Validation method: a random molecule (arbitrarily assigned as compound 1) was picked, and its closest analogue (highest Tanimoto/Morgan FP) is identified.The identified closest analogue is assigned to compound 2 and, again, its closest analogue (excluding its predecessor) is identified as the following compound.The same logic is applied until all molecules got a compound number.Since there is no time stamp on this dataset, a virtual chronology was simulated by sorting the compounds by similarity.Molecules were sorted from 1 to 739 and a 10-fold linear sampling was applied.Random Forest and XGBoost were evaluated with either all possible RDKit 25 fingerprints (FPs) or calculated physico-chemical properties.DeepChem 26 graph convolutional neural network (GCNN) was also evaluated with 25, 50, 75 and 100 epochs.
• RF-Global_Torsion.Then the following split was applied 80/10/10 for training/ validation/testing of a model ensemble (see the list above).
ML performance on the validation set: • ROC AUC: 0.881.
The above results suggest that a ML (RF-FPs, XGBoost-FPs and GCNN) approach can predict compounds being similar to compounds in the training set (ESI, † Fig. S3) but is not able to make useful prediction for dissimilar compounds (ESI, † Fig. S4).In these conditions, a ML classification applied to RL or even post-processed generated molecules would be suboptimal.
In order to identify better ML models, the extended 3-dimensional fingerprint 27 (E3FP default parameters) was evaluated on the previously assembled ML dataset.A similarity sorting split 80/10/10 was performed with an ensemble of models (XGBoost and RF).
ML performance on the validation set: • ROC AUC: 0.898.
The model ensemble XGBoost-E3FP and RF-E3FP shows better predictive performance (ESI, † Fig. S5 and S6) compared to the previous ML model ensemble (RF-FPs, XGBoost-FPs and GCNN).While the performance, still weak, would not speak in favor of using this model in RL, it has been considered that a post-filtering of DGM generated molecules could be useful with a class probability threshold >0.359 as it would help to exclude compounds with a higher probability of being inactive.
The generated molecules were filtered with the following filter cascade: • Initial sampling: 409 600 molecules.
The SMARTS filters were employed to remove molecules containing reactive functions or to remove molecules that are in general not desired by medicinal chemists.These SMARTS strings were assembled in a previous work. 28n analysis was realized right after the bioactivity ML model prediction filter where two smaller datasets of 1007 molecules for the exploration mode and 729 for the exploitation mode were obtained.The two datasets (1736 molecules) were compared to each other as follows: a Tanimoto nearest neighbor analysis was performed on both datasets; exploitation and exploration.Each molecule of each dataset was compared with each molecule of a Mpro reference compound dataset (DGM-TL).Each comparison is based on the Tanimoto similarity coefficient (1-Jaccard distance) and allows the identification of the most similar molecule in the DGM-TL.The similarity values were then binned with a range of 0.05 and a distribution bar chart was plotted (Fig. 2) with the Python library seaborn. 29This analysis gives a solid understanding of how similar or dissimilar a dataset compared to a reference dataset looks like.
Fig. 2 shows that the exploration dataset exhibits a greater abundance of dissimilar compounds when compared to the DGM-TL compounds, highlighting the capacity of the exploration mode to generate more novel compounds.This discrepancy can be attributed to the absence of transfer learning in the exploration approach, which encourages greater diversity in the generated molecules.Conversely, the exploitation mode involves retraining the generative model on the Mpro chemical space, resulting in a reduced potential for novelty.In this mode, the generative model predominantly produces molecules that align with its retraining, limiting the exploration of new chemical space.
Then, a Tanimoto self-nearest neighbor analysis was performed on both datasets, exploitation and exploration respectively.Each molecule of each dataset was compared against each other molecules of the same dataset to evaluate the self-diversity of the dataset.Each comparison is based on the Tanimoto similarity coefficient (1-Jaccard distance) and allows identification of the most similar molecule within the same training set.The similarity values were then binned with a range of 0.05 and a distribution bar chart was plotted with the Python library seaborn.This analysis allows assessment of the self-diversity of a dataset.
Fig. 3 illustrates a distribution shift between the two datasets, revealing an interesting finding.The exploitation datasets demonstrate a reasonable level of self-diversity, while surprisingly, the exploration mode appears to have achieved a lower level of self-diversity.One potential explanation for this unexpected result is that the exploration generative model tends to generate compounds that are dissimilar to the Mpro DGM-TL dataset but exhibit self-similarity.However, further investigation is required to confirm this hypothesis and gain a better understanding of the underlying factors at play.
According to the selection process of Fig. 1, all steps until the bottom right have been described and discussed.

Docking with SeeSAR and GOLD
Three crystal structures corresponding to three non-covalent chemical series were selected for the molecular docking (SeeSAR and GOLD): • MPro-×11294 (quinolone).
• MPro-P0009 (Ugi-non-covalent).These three crystal structures were selected for the diversity of their ligands covering the main non-covalent series identified by the Covid Moonshot and also to help overcome the high plasticity of Mpro. 30he docking validation dataset was docked into MPro-×12692, MPro-P0009 and MPro-×11294.As observed from the crystal structure, pharmacophore constraints (acceptor spheres) were added to guide the docking for key H-bond interactions with the conserved H163 and E166 residues (MPro-×12692 and MPro-P0009) or with the H163 and either N142 and/or G143 residues (MPro-×11294).
After docking, a visual inspection allowed us to filter out poses for which one or two of the key interactions were missing.
The compounds with acceptable poses were predicted to be active, and the other compounds (either non-acceptable binding poses or when SeeSAR could not generate poses) were predicted as negative.

RSC Medicinal Chemistry Research Article
The evaluation of the predictive power led to the following results: • Precision: 0.64 (for 22 predicted active, 14 were actually active).
• ROC AUC: 0.62.As we can see from the results, a docking approach does not represent an ideal classifier nevertheless, it allows, among other filters (PheSA, ML, SMC), enrichment of the final dataset with possible new inhibitors.
Filtering: the combined 1736 compounds were docked with SeeSAR following the above description (validation).Approximately 188 compounds were selected after visual inspection and according to their binding poses.Then, a hard cut-off was applied on the SeeSAR estimated affinity and compounds below 3500 μM were retained.Finally, 118 were selected after docking.
These 118 molecules were docked again into Mpro with the software Gold, 31 this time not to exclude any further molecules but in order to get an additional score which was used in a final consensus scoring.Aizynthfinder 32 was then employed to assess the synthesizability of these same 118 compounds.60 compounds were considered synthesizable.Out of these 60 molecules, 7 compounds found to be commercially available (in the Enamine REAL space 33 ) were directly selected and ordered.Finally, a consensus scoring, budget considerations and synthesizability constraints led to a selection of 10 additional compounds.In total 17 compounds were ordered and 16 could be synthesized, delivered, and biologically assessed.
Below is the selection workflow from the docking to the final selection as depicted in Fig. 1: • Docking: 118 molecules.
• 17 compounds selected: 7 commercially available (Enamine REAL space) compounds and 10 additional compounds with a high consensus score.

Biological results of the AI generated molecules
The 16 AI generated, synthesized, and delivered molecules were measured (ESI, † Table S1) in Mpro FRET (1 μM DTT), cathepsin L FRET, MCA quenching and for some cases in SPR (Biacore) assays.As depicted in Fig. 4 and 6 hits belonging to 3 clusters were identified.Cluster 1 (piperazine) showed activities between 3.3 and 63.5 μM.For cluster 2 (cyclized urea) the Mpro FRET IC 50 could not be determined but a K d of 185 and 368 μM could be measured by SPR demonstrating a weak binding affinity.Finally, the single compound in cluster 3 (N-benzoimidazol-1-yl-acetamide) showed an IC 50 of 22.6 μM.Compound 21, depicted in the ESI † Table S1 and displaying a Mpro FRET IC 50 of 66.9 μM, was not further considered due to a late delivery from our provider.All compounds (the 6 hits described above, compound 21 and the inactive compounds) and their biological data are shared in Table S1 of the ESI.† While the 6 hits showed good to low affinity, they all, except for compounds 4 and 5 (not conclusive at this point), demonstrated a good selectivity profile as indicated by the cathepsin L IC 50 values.All hits turned up not being autofluorescent as demonstrated by the MCA quenching assay results.

Hit expansion and docking with glide
As mentioned above, these 6 hits were categorized in 3 clusters which we defined as 3 hit series: a piperazine, a cyclized urea and an N-benzoimidazol-1-yl-acetamide series.The 2D similarity was evaluated between the 6 hits and known Mpro inhibitors from ChEMBL and Covid Moonshot.Overall, a limited novelty was observed as compared to known Mpro inhibitors, therefore, to bring additional novelty, a hit expansion was performed for the 3 hit series.
Analogues, based on 2D similarity, were identified in either the Chemspace 34 stock screening compounds or in the Idorsia corporate collection.The analogues were docked with H-bond constraints using grids generated from publicly available crystal structures.
The hit expansion (2D similarity analogue search) was performed for the 3 hit series (piperazine, cyclized urea and N-benzoimidazol-1-yl-acetamide).The hit expansion was done by searching the Idorsia corporate collection for the piperazine (cluster 1) and the N-benzoimidazol-1-yl-acetamide compound (cluster 3).For the cyclized ureas (cluster 2) a hit expansion was performed in the Chemspace in stock screening compounds as no analogues could be identified from the corporate collection.
For cluster 1, the 1400 analogues were docked with Schrodinger Glide. 35g. 4 Primary hits with biological results of molecules originated from AI generative models, prioritized with a selection workflow, synthesized and delivered.
The fragalysis complex Mpro-×11812 (co-crystalized ligand) was selected for its similarity with the cluster 1 hits and used to generate a docking grid.
The analogues were docked with grid-based constraints (at least two constraints): • H-bond with H163.
• The docking poses were inspected, the best poses were starred, and 115 compounds were selected.
The positive control molecules were present in the top docking score list and provided some in silico evidence on the validity of the approach.110 compounds were selected from the Idorsia corporate collection and submitted to the biological measurements.
The fragalysis complex Mpro-×11513 (co-crystalized ligand) was used for its similarity with the cluster 2 hits.
The docking (SP followed by XP) was performed on the analogues with grid-based constraints: • H-bond with H163.
• H-bond with E166.The docking poses were inspected, and the best poses were starred, and finally 43 compounds selected, and a quote was requested at Chemspace.38 compounds were delivered and submitted to biological measurements.
For cluster 3, 128 compounds were docked with Schrodinger Glide.
The fragalysis complex Mpro-×11612 (co-crystalized ligand) was used for its relative similarity with the cluster 3 hit.
The analogues were docked (SP followed by XP) with gridbased constraints: • H-bond with H163.
• H-bond with E166.The docking poses were inspected, and the best poses were starred, and finally 25 compounds were selected, and ordered from the Idorsia store for a biological assessment.
After visual inspection of the docking poses, the best compounds were selected, ordered, and measured in a FRET assay.

Biological results of the compound selected after the hit expansion and glide docking
In total, 173 compounds were considered for a biological assessment and then measured (ESI, † Table S2) in Mpro FRET (1 μM DTT), cathepsin L FRET, Mpro MCA quenching and for some cases in SPR (Biacore) assays.This hit expansion led to the identification of 4 interesting and diverse inhibitors (compounds 7, 8, 9 and 10) as depicted in Fig. 5. Compounds 8 and 10 originated from the hit expansion of cluster 1, compound 7 originated from the hit expansion of cluster 2 and compound 9 originated from the hit expansion of cluster 3. Additional Mpro inhibitors were identified with the approach but cannot be disclosed due to corporate restrictions.
The X-ray structures of SARS-CoV-2 Mpro in complex with compounds 7 and 8 (PDB code 8r11 and 8r12 respectively) were solved and utilized to further optimise compound 8 into compound 11.The binding mode of compound 7 is depicted in Fig. 6.The following observation can be made with this figure, the m-methylpyridine sits in the S1 pocket where the pyridine nitrogen makes a key H-bond interaction with H163.The amide carbonyl makes a key interaction with the E166 backbone-NH and a water-mediated interaction with the E166 backbone-carbonyl.The m-chlorophenyl sits in the hydrophobic S2 pocket making face-to-edge aryl interaction with H41.
The binding mode of compound 8 is depicted in Fig. 7.The following observation can be made with this figure: the m-chloropyridine sits in the S1 pocket where the pyridine nitrogen makes a key H-bond interaction with H163.The amide carbonyl makes a key interaction with G143.The

RSC Medicinal Chemistry Research Article
m-chlorophenyl sits in the hydrophobic S2 pocket making face-to-edge aryl interaction with H41.
As depicted in Fig. 8, the 3D structure observations allowed us to understand the importance of a chlorine atom in the S2 pocket.Indeed, for example, compound 7 (magenta) has an Mpro FRET IC 50 of 5.6 μM or the COVID Moonshot compound EDJ-MED-92e193ae-1 (yellow) has an Mpro FRET IC 50 of 0.230 μM for which available crystal structures were aligned with the crystal structure of compound 8 (cyan).This picture shows the possibility of replacing the cyano moiety of compound 8 with a chlorine in the ortho position.Furthermore, the nitrile bore by compound 8 should engage in polar interactions while the S2 pocket is known as a hydrophobic pocket. 36Therefore, the substitution by a chlorine atom should be beneficial as this latter should form hydrophobic interactions, hence the potency should be improved.This SBDD suggestion was then synthesized and led to compound 11 with an Mpro FRET IC 50 of 1.3 μM (28fold improvement).
The crystal structure of compound 11 (PDB code: 8r14) underlines the relevance of the chlorine atom and is depicted in Fig. 9.The following observation can be made with this figure: the m-chloropyridine sits in the S1 pocket where the pyridine nitrogen makes a key H-bond interaction with H163.The amide carbonyl makes a key interaction with G143.The m-cyanophenyl sits in the hydrophobic S2 pocket making face-to-edge aryl interaction with H41.
The tetrahydroisoquinoline molecule (compound 10) was improved by a standard SAR approach which consisted of exploring a substitution (methyl) in position 5 of the pyridine as this position was substituted for all other hits (compounds 7, compound 8 and compound 9) of Fig. 5, and additionally, the SAR consisted of the synthesis of a chemical library of several bi-substituted tetra-hydroisoquinolines.This first round of optimization allowed a 28-fold improvement for the diazepane series to 1.3 μM and a 27-fold improvement for the tetrahydroisoquinoline series to 2.3 μM (compound 12).
The crystal structure of compound 12 (PDB code: 8r16) is shown in Fig. 10.The following observation can be made with this figure: the m-methylpyridine sits in the S1 pocket where the pyridine nitrogen makes a key H-bond interaction with H163.The amide carbonyl makes a key interaction with E166 as well as water mediated interaction with G143.The two chlorine atoms of the tetrahydroisoquinoline sit in the hydrophobic S2 pocket.
Eventually, the AI generated hits followed-up by hit expansion, docking and a first round of optimization for compound 8 and 10 resulted in the identification of 3 novel Mpro inhibitors, a diazepane (compound 11), a pyrrolidine (compound 7) and a tetrahydroisoquinoline (compound 12) with IC 50 s ranging from 1.3 to 2.3 μM.Compound 9, while more potent than its queries (compound 4 and 5), was   considered too similar to the compounds identified in the frame of the COVID Moonshot.
Our work showcases the potential of combining deep reinforcement learning for de novo drug design with additional computational chemistry techniques, as we successfully identified three novel Mpro inhibitors that display high potential for further development.We believe that our stepwise approach is key for slowly but surely identifying new inhibitors.
There are, however, several limitations and challenges that need to be addressed to improve the efficiency and effectiveness of RL-based de novo molecule design.
One of the main limitations is the similarity of the generated molecules with reference Mpro molecules.Since the retrained agent of our exploitation approach learns from existing molecules, it tends to generate molecules that resemble the training set.Additionally, both for the exploitation or exploration, the SMC scoring component certainly influenced the training of the agent towards the known Mpro chemical space.As a result, the generated molecules may not be sufficiently diverse or novel, leading to limited success in identifying new lead compounds.
To overcome this issue, training the generative model without the SMC scoring component in full exploration mode might constitute an interesting perspective or considering a docking scoring component instead of the SMC scoring components might also have the potential to lead to additional novelty.
Another significant challenge is the synthetic accessibility of the generated molecules.The models may generate molecules that are structurally complex or contain building blocks that are difficult to synthesize, making them impractical for further experimental evaluation.Despite the aid of Aizynthfinder in eliminating challenging or infeasible molecules, we failed to fully consider the true availability and cost of the building blocks for the interesting molecules we considered.Indeed, Aizynthfinder allows one to make its own collection of building blocks.This stock collection is then used by the tool when it recursively breaks down the molecule into available precursors.In our situation, we used the building block collection of aggregators (regrouping many vendors) which made the approach not fully effective.Indeed, the molecules were successfully considered synthesizable by Aizynthfinder but as the building blocks were available from a plethora of vendors it made it very difficult for one Contract Research Organization (CRO) to make a reasonable offer.Lesson learned: make an Aizynthfinder stock from a preferred vendor/CRO and not from an aggregator.Consequently, some of the compounds that were considered turned out to be excessively costly.

Conclusion
In summary, we have used a de novo method for the generation of molecules with the goal of identifying new SARS-COV-2 Mpro inhibitors.The method uses a reinforcement learning approach that guides the generative model in creating desired molecules.It is based on the open-source code REINVENT 2.0, an AI tool for de novo drug design.The presented method is designed to generate molecules which contain key features of known Mpro inhibitors such as 3D pharmacophore/shape and privileged fragments.Additional computational chemistry techniques, used as post filters, such as molecule sanitization, QED, SMC, Jaccard distance, PheSA, ML affinity, Aizynthfinder, molecular diversity, molecular docking, and final consensus scoring, were employed to further enrich the generated molecules toward higher likelihood of being active against SARS-COV-2 Mpro.
We believe that the identification of SARS-CoV-2 Mpro inhibitors through reinforcement learning de novo design combined with 3D pharmacophore modeling, shape-based alignment, hit expansion, and molecular docking approaches represents a promising avenue for the identification of new starting points that could lead to new therapies for COVID-19.We are convinced that combining modern ML techniques together with state-of-the-art computational chemistry techniques has the potential to accelerate the identification of effective treatments for this severe disease.
The success of our workflow speaks for itself: six AI generated primary hits demonstrated weak to decent affinity.These molecules could be easily and quickly improved by hit expansion and showed novel Mpro chemotypes.
In our case study, we show that analogues picked out in the frame of the hit expansion and further prioritized by molecular docking led to the identification of three novel inhibitors.Two of them underwent a first round of optimization in the lab.
We then additionally report the identification of three novel SARS-COV-2 inhibitors displaying inhibitory activity ranging from 1.3 μM to 2.3 μM fulfilling our initial goal.
• ML dataset: ChEMBL and COVID Moonshot molecules with IC 50 data (FRET and RapidFire assay data) were pooled together.For molecules with several measurements, a standard deviation was calculated together with the difference between the max and min value.Measurements showing a difference greater than 2 times standard deviation were discarded.The ML dataset was made up of 739 datapoints.
• Privileged fragments: a list of privileged fragments was assembled as described in the following SMC scoring component validation section.

Docking with GOLD
To get the input molecules ready for docking, their threedimensional (3D) structure needs to be determined.This step was accomplished using the OEChem package. 37Similarly, the proteins were prepared with the same package.This process involved removing any atoms not part of the protein, reconstructing missing side chains and loops, and adding hydrogen atoms to the protein structure.The ligands, which are smaller molecules that bind to proteins, were retrieved from Protein Data Bank (PDB) files using the Biopython package.These ligands were then processed using the rdkit module, a process that involved removing any salts and neutralizing charges in the ligands.
The idea behind the following workflow is that compounds with similar 3D structures would perturb the binding pocket in the same way.To identify a reference compound that closely resembles the ligands, the PheSA program was used.PheSA was employed to calculate the similarity between all the ligands to evaluate and the active conformations of reference compounds.
For the docking process, hydrogen atoms were also added to the input molecules.The docking itself was conducted using the GOLD docking program, which allows for automatic non-covalent docking.This procedure was integrated into a script for efficiency.The docking utilized both the prepared PDB files of the ligands most similar to the reference compounds and the prepared input molecules.

Consensus scoring (CS)
It has been reported that CS, which combines multiple scoring functions in binding affinity estimation, leads to higher hit rates in virtual library screening studies. 38It has also been demonstrated that consensus scoring outperforms, for statistical reason, any single scoring. 39 consensus scoring was developed taking PheSA, SeeSAR & Gold docking, and SMC scores into account.
• The SMC and PheSA scores were already normalised from 0 to 1 so the scores were kept unchanged.
• The CS was calculated with the following equation: The CS score was used to make the final selection of compounds to be purchased.

Fig. 2
Fig. 2 Nearest neighbour similarity of the exploration and exploitation datasets against the DGM-TL dataset.Each subset (exploration and exploitation) was normalized independently.

Fig. 3
Fig.3In orange, self-nearest neighbor similarity of the exploration dataset against itself.In blue, self-nearest neighbor similarity of the exploitation dataset against itself.Each subset (exploration and exploitation) was normalized independently.

Fig. 5
Fig. 5 Compounds 7, 8, 9 and 10 with biological results were identified through hit expansion of the primary hits and molecular docking.Compounds 11 and 12 underwent a first round of SAR or SBDD of optimization.