Combining machine learning and quantum chemical calculations for high-throughput virtual screening of thermally activated delayed fluorescence molecular materials: the impact of selection strategy and structural mutations

In view of the theoretical importance and huge application potential of Thermally Activated Delayed Fluorescence (TADF) materials, it is of great significance to conduct High-Throughput Virtual Screening (HTVS) on compound libraries to find TADF candidate molecules. This research focuses on the computational design of pure organic TADF molecules. By combining machine learning and quantum chemical calculations, using cheminformatics tools, and introducing the concept of selection and mutation from evolutionary theory, we have designed a computational program for HTVS of TADF molecular materials, especially the impact of selection strategy and structural mutations on the results of HTVS was explored. An initial compound library (size = 103) constructed by enumeration of typical donors and acceptors was used to evolve by successively applying selection and 10 different structural mutations. And a group fingerprint similarity (ΔMSPR) index was proposed to account for the similarity between two compound libraries with comparable sizes. Based on the computed data, we have found that the mix of selection and mutations into the evolution map does have great impact on the HTVS results: (a) except the fast mutation Sub2, all the rest of the mutations can effectively concentrate ‘good’ molecules in a compound library, and hence give large material abundance (typically >0.8) for high mutation generations (ng ≥ 6). (b) The mean energy gap can exhibit a fast convergent trend toward very low values, hence the studied mutations (except Sub2) can cooperate very well with the studied DA substrates to generate optimal molecules, and the group fingerprint similarity can retain high enough values for large ng, which can be associated with the apparent convergence in molecular skeletons as ng increases. (c) The distribution of skeleton frequencies for a specific mutation is generally uneven with one dominant skeleton. The overall numbers of common and generic cores for all mutations are 11 and 7 as ng = 9. Hence, in a sense, the ‘optimal’ skeletons seem unique and useful in realizing low energy gaps. With these observations and the development of related HTVS software, we expect to provide insight and tools to the research community of HTVS of molecular (TADF) materials.


Generation of initial compound library G0
The initial compound library G0 is constructed by combining donors with acceptors at preset connection sites (symbol * is used to denote the connection site).The structures of 30 donors is shown in Figure S1, and that of 43 acceptors is shown in Figure S2.After the enumeration of fragments, a random pick routine is used to further restrict the size to 1000.

is constructed by combining donors with a
ceptors at preset connection sites (symbol * is used to denote the connection site).The structures of 30 donors is shown in Figure S1, and that of 43 acceptors is shown in Figure S2.After the enumeration of fragments, a random pick routine is used to further restrict the size to 1000.


Effect of varied ground state geometry optimization methods

The difference in ground state geometries calculated by B3LYP/6-31G(d) and PM6-D3 levels of theory is measured by their root-mean-square deviations (RMSDs).The distribution of frequency of RMSDs is given in Figure S3.For a total of 72 molecules, only 25% of them have RMSD values larger t

Effect of varied ground state geometry optimization methods
The difference in ground state geometries calculated by B3LYP/6-31G(d) and PM6-D3 levels of theory is measured by their root-mean-square deviations (RMSDs).The distribution of frequency of RMSDs is given in Figure S3.For a total of 72 molecules, only 25% of them have RMSD values larger than 0.69, hence, from a point of view of probablity, the replacement of B3LYP/6-31G(d) by PM6-D3 method seems acceptable.The effect of varied ground state geometry optimization methods on the HTVS results have been briefly tested.The evolution of material abundance (ω M A ), average number of aromatic aCH bonds (n aCH ) and number of accumulated optimal molecules (n acc_opt_mols ) with increase of mutation generation (n g ) for Sub3 using B3LYP/6-31G(d) level of theory as ground state geometry optimization method have been listed in Table S1.As compared with the semi-empirical method, the investment on a relatively high accuracy computational one seems to have the effect to speed up the convergence of ω M A .Hence is beneficial for harvesting more optimal molecules at relatively low n g .

n 0.69, hence, from a point of view of probablity, the repla
ement of B3LYP/6-31G(d) by PM6-D3 method seems acceptable.The effect of varied ground state geometry optimization methods on the HTVS results have been briefly tested.The evolution of material abundance (ω M A ), average number of aromatic aCH bonds (n aCH ) and number of accumulated optimal molecules (n acc_opt_mols ) with increase of mutation generation (n g ) for Sub3 using B3LYP/6-31G(d) level of theory as ground state geometry optimization method have been listed in Table S1.As compared with the semi-empirical method, the investment on a relatively high accuracy computational one seems to have the effect to speed up the convergence of ω M A .Hence is beneficial for harvesting more optimal molecules at relatively low n g .

Table S1: The evolution of material abundance (ω M A ), average number of aromatic aCH bonds (n aCH ) and number of accumulated optimal molecules (n acc_opt_mols ) with increase of mutation generation (n g ) for Sub3 using B3LYP/6-31G(d) level of theory as ground state geometry optimization method n g ω M A n aCH n acc_opt_mols 0 0.016 18.3 16 1 0.212 16.9 220 2 0.892 12.5 988 3 0.969 11.7 1808 4 0.960 11.8 2637 5 0.962 9.7 3482 6 1.000 8.0 4306


S5

The evolution of skeleton (generic core) with mutation generation for Sub3

Figure S4: The evolution of skeleton (generic core) with mutation generation for Sub3 (the numbers below structures denote the corresponding frequencies).[ 3,10,30,100 ], 'criterion': ["mse", "mae"], 'max_depth': [ 2, 5, 10, 50 ], 'max_features': ["auto", "sqrt", "log2"] Th Table S1: The evolution of material abundance (ω M A ), average number of aromatic aCH bonds (n aCH ) and number of accumulated optimal molecules (n acc_opt_mols ) with increase of mutation generation (n g ) for Sub3 using B3LYP/6-31G(d) level of theory as ground state geometry optimization method n g ω M A n aCH n acc_opt_mols 0 0.016 18.3 16 1 0.212 16.9 220 2 0.892 12.5 988 3 0.969 11.7 1808 4 0.960 11.8 2637 5 0.962 9.7 3482 6 1.000 8.0 4306

S5
The evolution of skeleton (generic core) with mutation generation for Sub3 Figure S4: The evolution of skeleton (generic core) with mutation generation for Sub3 (the numbers below structures denote the corresponding frequencies).[ 3,10,30,100 ], 'criterion': ["mse", "mae"], 'max_depth': [ 2, 5, 10, 50 ], 'max_features': ["auto", "sqrt", "log2"] The ECFP fingerprints as the X featurization vector, and the computed energy gaps (∆E ST ) as the Y object vector.The X;Y is fed to the RFR model, by applying the Grid-SearchCV with above grid parameters, the cross_val_score method and 5-fold cross validation, the best ML model (best_reg) is screened out from the grid search hyper-parameter space.The best_reg is retrained with the training data, and is further evaluated by a 5-fold cross validation using the same scoring method (neg_mean_squared_error).The newly learned ML model will be used for subsequent prediciting property of unseen molecules in the original compound library (n g keeps unchanged).

ECF
fingerprints as the X featurization vector, and the computed energy gaps (∆E ST ) as the Y object vector.The X;Y is fed to the RFR model, by applying the Grid-SearchCV with above grid parameters, the cross_val_score method and 5-fold cross validation, the best ML model (best_reg) is screened out from the grid search hyper-parameter space.The best_reg is retrained with the training data, and is further evaluated by a 5-fold cross validation using the same scoring method (neg_mean_squared_error).The newly learned ML model will be used for subsequent prediciting property of unseen molecules in the original compound library (n g keeps unchanged).


Details for training and evaluation of machine learning model

To have a taste on the relative accuracy of the ML models, the related data for Sub3 has been given in Table S2.Obviously, the mean and standard deviation are small enough, hence t

Details for training and evaluation of machine learning model
To have a taste on the relative accuracy of the ML models, the related data for Sub3 has been given in Table S2.Obviously, the mean and standard deviation are small enough, hence the ML model could be safely used to predict the energy gaps of unseen molecules with considerable confidence.

ML model could be safely used to predict the energy gaps of u
seen molecules with considerable confidence.


Optimal molecules sorted by SAS

Figure S5: The structures of 9 optimal molecules with lowest SAS for all mutations except Sub2 (each row corresponds to a mutation, row 1 → Sub1)

The (energy gap) optimal molecules for all mutations have been sorted by Synthetic Accessibility Scores from low to high, so as to give recommendation for TADF materials candidate.The structures of 9 molecules

Optimal molecules sorted by SAS
Figure S5: The structures of 9 optimal molecules with lowest SAS for all mutations except Sub2 (each row corresponds to a mutation, row 1 → Sub1) The (energy gap) optimal molecules for all mutations have been sorted by Synthetic Accessibility Scores from low to high, so as to give recommendation for TADF materials candidate.The structures of 9 molecules with lowest SAS for all mutations except Sub2 have been depicted in Figure S5.Notably, duplication between different rows exists due to the origin setup of the computational road map.

west SAS for all mutations excep
Sub2 have been depicted in Figure S5.Notably, duplication between different rows exists due to the origin setup of the computational road map.

Figure S1 :
S1
Figure S1: The donors (D) used as fragments for construction of donor-acceptor (DA) molecules (* is used to denote the connection site)


Figure S3 :
S3
Figure S3: The distribution of frequency of RMSDs for Sub3.




016 0.023 0.021 0.0

Figure S1 :
Figure S1: The donors (D) used as fragments for construction of donor-acceptor (DA) molecules (* is used to denote the connection site)

Figure S3 :
Figure S3: The distribution of frequency of RMSDs for Sub3.
016 0.023 0.021 0.026 0.073 0.032 0.021 The featurization of structures of training molecules is carried out by utilizing the ECFP fingerprint (size = 2048) computing tool of the DeepChem package.By introducing the related tools of the open source machine learning Scikit-Learn package, a Random Forest Regressor (RandomForestRegressor, RFR) from the ensemble module is adopted as the ML model, Grid Search Cross Validation (GridSearchCV) and relevant score function and method (cross_val_score and neg_mean_squared_error) as tools for model selection, simple imputer (SimpleImputer) as data imputer, and a min max scaler (MinMaxScaler) for data scaling.The following grid parameters have been used for the Grid Search Cross Validation step: 'bootstrap': [ True, False ], 'n_estimators':

.015
0.032 0.012 0.045 0.025 0.0129 0.
Maximum Similarity Pa ring RuleThe calculation of group fingerprint similarity (∆ M SP R ) is based on the following algorithm, which we name it the Maximum Similarity Pairing Rule (MSPR).Suppose we have two