ChemSpaX: exploration of chemical space by automated functionalization of molecular scaffold

Exploration of the local chemical space of molecular scaffolds by post-functionalization (PF) is a promising route to discover novel molecules with desired structure and function. PF with rationally chosen substituents based on known electronic and steric properties is a commonly used experimental and computational strategy in screening, design and optimization of catalytic scaffolds. Automated generation of reasonably accurate geometric representations of post-functionalized molecular scaffolds is highly desirable for data-driven applications. However, automated PF of transition metal (TM) complexes remains challenging. In this work a Python-based workflow, ChemSpaX, that is aimed at automating the PF of a given molecular scaffold with special emphasis on TM complexes, is introduced. In three representative applications of ChemSpaX by comparing with DFT and DFT-B calculations, we show that the generated structures have a reasonable quality for use in computational screening applications. Furthermore, we show that ChemSpaX generated geometries can be used in machine learning applications to accurately predict DFT computed HOMO–LUMO gaps for transition metal complexes. ChemSpaX is open-source and aims to bolster and democratize the efforts of the scientific community towards data-driven chemical discovery.

GFN2-xTB(THF): optimization and hessian calculation in THF using the GBSA solvation model using Grimme's xTB (6.3.3) package. BP86(GAS): DFT calculations were performed at various levels of theory. Geometry optimizations were performed in the gas phase using BP86 XC functional and def2-SVP basis set. Free energy corrections were obtained via hessian calculations within the harmonic approximation at the same level of theory. This method is denoted as BP86(GAS).
PBE1PBE(thf) (or PBE0(THF)) and BP86(THF): To include the solvent effects the electronic energies were further refined via single point energy calculations using SMD solvation method with THF as solvent. Two XC functionals namely BP86 and PBE1PBE were used with a triple zeta quality basis set (def2-TZVP). The free energy corrections obtained via hessian calculations at BP86(gas) level of theory were added to the electronic energies obtained during the singlepoint energy calculations. These methods are denoted as PBE0(THF) and BP86(THF).
We have calculated pearson correlation coefficient (R 2 ) related to a linear fitting (̂= + ) for all case where two methods/quantities ( , ) were compared in a scatter plot. We computed the root mean squared error (RMSE) related to the linear fit as : The XYZ files of structures and datasets used for this publication are attached.

S2. Calculation of the centroid vector and the rotation matrix
The centroid vector starts at the central atom of the substituent group and points towards the centroid of the shape (triangle in the case of a tetrahedral substituent) formed by the atoms at the edges. In ChemSpaX an automated calculation of this centroid vector is done per substituent group. This can be visualized using a hypothetical tetrahedral substituent ( Figure 1). The asymmetrical substituent is transformed into a hypothetical symmetrical tetrahedral substituent, where the centroid (yellow) is used to calculate the centroid vector ( Figure 1, top). This centroid vector is then used to rotate and translate the whole substituent group.
The correct rotation matrix is determined by two unit vectors, 1) the bond that will be functionalized (a unit vector a between bonded_atom and atom_to_be_functionalized) and 2) the substituent's centroid Where [ ] is the skew-symmatric cross-product matrix of v.
The rotation matrix (R) is applied to the whole substituent group, followed by a translation of the whole group to the specified bonding distance from the skeleton. The complete workflow is summarized in the bottom part of figure 1.

S3. Observed issues with FF optimization
When functionalizing a complex recursively, it was observed that using Openbabel's universal force field optimization (UFF) for geometry optimization would often misalign some of the hydrogen atoms. as illustrated in figure 2. In the same scenario, but using Openbabel's GAFF only for geometry optimization, bonds between carbon and a halogen would have an incorrect angle. Which is shown in figure 3, with the incorrect angle on the right side and the correct angle on the left side of the figure. Using a combination of GAFF and UFF for geometry optimization resulted in a highly increased probability of an error-free geometry.

S4.
Comparison of calculated energies of reaction using xTB or DFT for pincer complexes RuPNP     The distribution of ΔΔ for the reactive adsorption of HBr for all the ligands and functionalization sites (R1 and R2) is shown in Figure 7. For the CNC ligand complexes variation of the functional group at the R2 position with fixed R1 (Figure 7a) led to similar µ(ΔΔ ) < 10 kcal mol −1 for all groups except for R1 = CF3. Since functionalization site R2 is primarily a ligand backbone site, it seems that the effective relaxation of the ligand during DFT optimization is similar for all the CNC complexes. Variation of the functionalization group at R1 results in a wider spread in the ΔΔ values for CF3, H and cy groups, while i-Pr substituents show most narrow distribution with µ(ΔΔ ) = 9.01 kcal mol −1

S8
. All PNN ligand complexes showed high and widespread µ(ΔΔ ) values. Most notable is R1 = H which shows a rather large spread indicating large geometric relaxations when different ligands are introduced on the backbone sites. PCP ligand complexes are interesting for their almost linearly increasing µ(ΔΔ ) as the size of the functional group on the R1 site is increased while the spread in the energies is minimal (Figure 7e). Furthermore, the variation at site R2 shows an almost constant µ(ΔΔ ) with similar spread ( Figure 7f). This observation indicates the source of spread is related to a larger geometric relaxation near the metal site during DFT based geometry optimization. Site R2 being farther away from the metal center seems to have minimal and approximately constant impact on the geometries, and is rather well optimized by the force-field. We also compared the individual difference in energy between FF and DFT optimized geometries for M(X)-L(H) and M-L complexes. In general, we found that the energy of FF optimized M(Br)-L(H) complexes were closer to their DFT counterparts when compared to corresponding M-L complexes. This observation reveals that the M-L complexes have a larger contribution to the ΔΔ . This is consistent with the distribution of hRMSDs which are generally higher for M-L (pristine) complexes (see Figure 6 in the main text).
It must be noted here that the hRMSDs and ΔΔ can be reduced by introducing intermediate optimization with a higher level of theory as discussed in the main text.

S6.
Distribution of Gibbs free energy for RuPNP

Mn-pincers
For the Mn-pincers, the correlation between HOMO-LUMO gaps computed using GFN2-xTB(THF) and BP86(THF) was worse ( 2 = 0.3). The correlation for various analyzed adducts on the metal site and for the various backbones are shown below.

S11.
HOMO-LUMO gap prediction via OLS for functionalized Co porphyrins The correlation between DFT and GFN2-xTB calculated HOMO-LUMO gaps for 280 selected structures was computed. Three features (1. number of atoms in the structure, 2. hRMSD (ChemSpaX generated FF versus GFN2-xTB structures) and 3. GFN2-xTB calculated HOMO-LUMO gap) were used to apply linear regression via OLS fitting and predict the DFT calculated HOMO-LUMO gap. These features were chosen to select relevant and easily computable features from xTB calculations for HTS applications. 75% of the dataset was applied to learn the DFT calculated HOMO-LUMO gap, 25% of the dataset was used for testing the model. It was observed when using the hRMSD as the only feature (training: R 2 = 0.23, test: R 2 = 0.19) that this feature can be useful in more extensive machine learning methods. Its importance for electronic property prediction at DFT level of theory is currently not utilized by the simplicity of OLS.