Open Access Article
Zong-Rong Ye (a), I.-Shou Huang (a, b), Yu-Te Chan (a, c), Zhong-Ji Li (a), Chen-Cheng Liao (a), Hao-Rong Tsai (a), Meng-Chi Hsieh (a), Chun-Chih Chang (a, d) and Ming-Kang Tsai* (a)
(a) Department of Chemistry, National Taiwan Normal University, Taipei, 11677, Taiwan. E-mail: mktsai@ntnu.edu.tw
(b) Department of Chemistry, The University of Chicago, Chicago, IL 60637, USA
(c) Theoretical Chemistry and Catalysis Research Center, Technical University Munich, Lichtenbergstr. 4, 85747 Garching, Germany
(d) Department of Chemical and Materials Engineering, Chinese Culture University, Taipei, 11114, Taiwan
First published on 23rd June 2020
Organic fluorescent molecules play critical roles in fluorescence inspection, biological probes, and labeling indicators. More than ten thousand organic fluorescent molecules were imported in this study, and a machine learning based approach was used to extract the intrinsic structural characteristics that correlate with the fluorescence emission. A systematic informatics procedure was introduced, proceeding from descriptor cleaning through descriptor space reduction to statistically meaningful regression, in order to build a broad and valid model for estimating the fluorescence emission wavelength. The least absolute shrinkage and selection operator (Lasso) regression coupled with the random forest model was finally reported as the numerical predictor, and it satisfied the statistical criteria. Such an informatics model appeared to offer comparable predictive ability and to be complementary to the conventional time-dependent density functional theory method in emission wavelength prediction, at a fraction of the computational expense.
All these fluorescent molecule applications and properties are closely related to the interactions of different chemical bonding structures in different electronic states, which are conventionally represented schematically by the Jablonski diagram.4 Chemists have long searched for fluorescent core structures, ranging from biological proteins and peptides to small organic molecules and other synthetic oligomers or polymers.5 One of the organic fluorescent core structures is coumarin, isolated from tonka beans in 1820.6 In 2012, Chen et al. studied coumarin-related chromophores and discovered two isomers with inverse tetracyclic pyrazolo[3,4-b]pyridine structures and distinct fluorescence emission wavelengths.7 In addition to coumarin, various organic core structures have been introduced for optical applications, e.g. xanthene, cyanine, squaraine, naphthalene, oxadiazole, anthracene, pyrene, oxazine, acridine, arylmethine, and tetrapyrrole.3 To build a new fluorescent molecule, chemists tend to start from an available core structure and modify its chemical bonding environment to improve the corresponding photophysical and photochemical properties.
From the computational perspective, chemists have also developed new approaches to describe chemical properties, building upon successes in medicinal chemistry, i.e. quantitative structure–activity relationship (QSAR) and quantitative structure–property relationship (QSPR). QSAR and QSPR are designed for predicting complex physical, chemical, and biological properties of molecules from experimental or calculated fundamental characteristics.4 Such an approach originated from an early toxicology study on the primary aliphatic alcohols and the water solubility in 1863.8
Following the successes in predicting many physical and chemical properties of compounds, molecular descriptors have also been developed to characterize and classify structural patterns. Molecular descriptors encode molecular physicochemical information and include constitutional, structural, lipophilicity, electronic, geometrical, hydrophobic, solubility, quantum chemical, and topological descriptors. Another type of descriptor is the fingerprint, a binary (yes/no) descriptor indicating the presence of certain functional groups within a molecule.9 With modern computing capability and capacity, chemists are able to assess chemical space using large-scale molecular descriptors and fingerprints. Machine learning plays a key role in analyzing the complexity of chemical databases and building chemically intuitive mathematical models, and it has been applied in QSAR studies since the early 1980s. Machine-learning methods were used to derive logical and numerical rules from the samples together with the relevant background knowledge.10 King et al. described neural network (NN) and inductive logic programming (ILP) models in QSAR and compared multiple linear regression (MLR) with these machine learning methods on drug design problems.11 The authors observed poorer statistical characteristics for the NN model and higher interpretability for the ILP model. Wang et al. reported an extreme learning machine neural network model to predict the electronic energies of 4,4-difluoro-4-bora-3a,4a-diaza-s-indacene (BODIPY) dyes.12 Li et al. successfully established a QSAR between overall power conversion efficiency and quantum mechanical descriptors using a cascaded support vector machine (SVM) model for 400 organic dye sensitizers.13 Recently, 109 fluorescent proteins were analyzed using neural networks, decision trees (DT), random forests (RF), and SVM, where the RF algorithm outperformed the others.14
Recent advances using deep neural networks to develop data-driven continuous representations have demonstrated state-of-the-art performance in describing molecular structures and predicting properties for drug discovery applications.15 Noh et al. introduced an inverse design pipeline based upon invertible image-based featurization to design new functional inorganic solid-state materials.16 In addition to predicting structural functionality, Häse et al. extracted fundamental knowledge of the excited electron transfer properties of light-harvesting systems using an artificial neural network to facilitate the development of excitonic devices.33 Although the mathematical forms (or data structures) for the optimal representation of molecules and materials have been actively explored by several pioneering studies,17–22 molecular physicochemical phenomena are commonly interpreted in terms of stoichiometry, local valence chemical bonding, and the presence of functional groups. Therefore, this study attempts a prediction model that assesses the valence bonding patterns of the ground and excited electronic states across a large pool of organic fluorescent molecules. A fast and accessible predictor could substantially boost high-throughput screening for the design of organic fluorescent molecules, to be followed by refinement with quantum mechanics (QM) characterization before entering the synthetic process. We therefore conducted the present study using a systematic and statistical approach to build a machine-learning QSAR model for emission wavelength prediction.
The dataset comprised 11 460 experimentally synthesized fluorescent organic molecules from the Reaxys database,23 with emission wavelengths between 200 and 900 nm and molecular weights between 30 and 3203 g mol−1, as shown in Fig. 1. More details of the dataset construction are provided in the ESI.† Solvatochromic effects arising from the use of different solvents in the experiments were not filtered out, in order to maintain dataset diversity. Subsequently, we fed the simplified molecular input line entry system (SMILES) files of these molecules into the descriptor generator.
Fig. 1 (a) Emission wavelength and (b) molecular weight distribution of the 11 460 fluorescent molecules.
141. The complete list of the seventy categories of the generated descriptors from PaDEL is summarized in Table S1.†
411 to 6208, and most of the removed cases were fingerprint descriptors. We subsequently conducted multiple linear regression (MLR) with the 6208 descriptors, and denoted the result as the VTS-MLR model.
Additionally, we built two more MLR-based models for comparison. The first, denoted the VTSsel-MLR model, used 4300 of the 6208 descriptors, namely the dominant descriptors (|coefficient| > 5) determined by the VTS-MLR model. The second, denoted the VTSfp-MLR model, contained only the 5158 fingerprint descriptors of the VTS descriptor ensemble (the remaining 1050 features are 2D descriptors).
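The two descriptor subsets described above amount to simple filters over the fitted model. The following is a minimal sketch, assuming the regression coefficients are available as a name-to-value mapping; the descriptor names, the `FP_` prefix used to mark fingerprints, and the coefficient values are hypothetical and only illustrate the selection rule.

```python
def select_dominant(coefficients, threshold=5.0):
    """Return descriptor names whose |coefficient| exceeds the threshold."""
    return [name for name, coef in coefficients.items() if abs(coef) > threshold]


# Hypothetical fitted coefficients, keyed by descriptor name (illustration only).
coefficients = {
    "nAromBond": 12.4,
    "ALogP": -7.9,
    "FP_0412": 0.8,
    "FP_1033": -5.0,  # exactly at the threshold, so it is not selected
    "FP_2210": 6.3,
}

# Subset in the spirit of VTSsel: descriptors with |coefficient| > 5.
dominant = select_dominant(coefficients)

# Subset in the spirit of VTSfp: fingerprint descriptors only
# (identified here by an assumed "FP_" name prefix).
fingerprint_only = [name for name in coefficients if name.startswith("FP_")]

print(dominant)
print(fingerprint_only)
```

The threshold comparison is strict, matching the |coefficient| > 5 condition stated in the text.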
The predicted results of the VTS series MLR models are shown in Fig. 2 in comparison with the experiments. The R2 values of these MLR models are summarized in Table 1, with all cases giving R2 > 0.86. To examine the overfitting problem, we divided the descriptor dimension into 10 equivalent partitions and conducted cross validation to calculate the mean of the 10-fold R2 (denoted as Q2), as shown in Table 1 (see Table S2† for the R2 of each partition). The Q2 results indicate that all three MLR models lack predictive ability and do not generalize. We attribute the poor Q2 results to the diverse character of the collected dataset.
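The gap between a fitted R2 and a cross-validated Q2 can be illustrated with a short sketch. It assumes sample-wise 10-fold partitioning and uses a single-descriptor least-squares line as a stand-in for the full MLR; the synthetic data, function names, and random seeds are illustrative assumptions, not the procedure used in this study.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a*x + b with a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def q2_kfold(x, y, k=10, seed=0):
    """Mean held-out R2 over k folds (called Q2 in the text)."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        pred = [a * x[i] + b for i in held_out]
        scores.append(r2_score([y[i] for i in held_out], pred))
    return sum(scores) / k


# Synthetic descriptor values and "emission wavelengths" (nm), illustration only.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [400 + 30 * xi + rng.gauss(0, 20) for xi in x]

a, b = fit_line(x, y)
r2_full = r2_score(y, [a * xi + b for xi in x])  # fitted on all samples
q2 = q2_kfold(x, y)                               # averaged over held-out folds
print(round(r2_full, 3), round(q2, 3))
```

For this well-behaved toy problem the two scores stay close; a large drop from R2 to Q2, as reported for the VTS series models, is the usual signature of overfitting.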
Fig. 2 The 2D histograms of the prediction vs. experiment comparison using the VTS series MLR models.
460 fluorescence molecules. The predicted inertia and Silhouette scores for k = 2–100 group partitions are shown in Fig. 3 (see ESI† for the details of both scoring functions). The Silhouette score of the PCA-transformed VTS ensemble, labeled PCA-VTS (Fig. 3b), is substantially different from its counterparts, while no apparent elbow points could be identified from the three inertia score curves. The highest Silhouette score at k = 15 suggests that the whole 11 460 molecules may be reasonably categorized into 15 groups using the PCA-VTS descriptor ensemble. In Fig. S1,† we also examined the non-PCA-transformed cases using the VTS ensemble, and the corresponding results suggested that the PCA transformation was not trivial for classifying these complicated descriptor ensembles.
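The scan over k can be sketched as follows: a minimal example, assuming Euclidean distances, a plain Lloyd's K-means with deterministic farthest-point seeding, and three synthetic 2D blobs in place of the PCA-transformed descriptors. All names and data are illustrative; the exact scoring definitions used in this work are given in the ESI.

```python
import random


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding: repeatedly add the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centers)."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers


def inertia(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))


def silhouette(points, labels):
    """Mean silhouette coefficient (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(sqdist(p, q) ** 0.5 for q in own) / (len(own) - 1)
        b = min(sum(sqdist(p, q) ** 0.5 for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


# Three well-separated synthetic blobs (illustration only).
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(20)]

scores, inertias = {}, {}
for k in range(2, 7):
    labels, centers = kmeans(points, k)
    inertias[k] = inertia(points, labels, centers)
    scores[k] = silhouette(points, labels)

best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
print(best_k)
```

Inertia tends to keep decreasing as k grows, which is why an elbow can be hard to identify, whereas the silhouette score peaks when the partition matches the separation in the data.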
Fig. 3 The calculated inertia (a, c and e) and Silhouette (b, d and f) scores with respect to the k groups in the K-means clustering analysis using the PCA-transformed descriptors of the VTS series ensembles.
By introducing K-means clustering, we intended to identify the general, yet subtle, structural characteristics that categorize the collected fluorescent molecules. The 15 sub-groups of the 11 460 molecules obtained with the PCA-transformed VTS, VTSsel, and VTSfp ensembles are projected onto the ternary plots in Fig. 4 (see Fig. S2† for the corresponding 3-dimensional plots), where the data distribution of the VTS ensemble (Fig. 4a) appeared to be relatively distinguishable, though not optimal.
Golbraikh and Tropsha30 suggested a combination of statistical criteria for demonstrating a statistically meaningful regression result, as shown in Table 2. The ideal values for these statistical criteria are also summarized in Table 2, with k0 denoting the linearity of predicted over experiment through the origin, and (R2 − R02)/R2 representing the predictive ability of the regression if the corresponding value is <0.1. Both the Lasso-LR and Lasso-RF models gave reasonable predictability with MAE at 40 and 24 nm, respectively, and the transferability of these models also appeared to be significantly better than that of the VTS series models (see Q2 values in Tables 2 and S2†). Although the Lasso-RF model gave better predictability than its LR counterpart, the LR model still provided qualitative results in addition to its interpretability (Fig. 6).
| Criterion | R2 (a) | Rtest2 (MAE) (a) | Q2 | k0 (k0′) (b) | R02 (R0′2) (c) | (R2 − R02)/R2 ((R2 − R0′2)/R2) (c) |
|---|---|---|---|---|---|---|
| Ideal values | >0.6 | >0.6 | >0.5 | 0.85 ≤ k ≤ 1.15 (0.85 ≤ k′ ≤ 1.15) | Close to R2 (close to R2) | <0.1 (<0.1) |
| Lasso-LR | 0.6632 | 0.5685 (44) | 0.5800 | 0.9868 (0.9999) | 0.6627 (0.4984) | 0.0009 (0.2486) |
| Lasso-RF | 0.9227 | 0.7004 (36) | 0.6205 | 0.9919 (1.0025) | 0.8565 (0.7933) | 0.0717 (0.1402) |

(a) Only 80% of the 11 460 samples were selected (randomly) as the training set for the Lasso-LR and Lasso-RF models. The remaining 20% of the samples were used as the testing set, with the MAE (in nm) shown in parentheses.
(b) The value of k0 denotes the slope of the predicted over experimental data through the origin (intercept equal to zero), and k0′ is the inverse of k0. The detailed information is summarized in the ESI.
(c) R02 denotes the correlation coefficient of k0, and R0′2 denotes the case of k0′. See the ESI for more details.
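The criteria in the table can be evaluated with a few sums. The sketch below assumes one common formulation of the Golbraikh and Tropsha quantities (regression through the origin in both directions, with R02 computed from the through-origin residuals); the definitions actually used in this work are given in the ESI and may differ in detail. The wavelengths are made-up values, and accepting a model when either the forward or the reversed set of conditions holds is likewise an assumption.

```python
def golbraikh_tropsha(exp, pred):
    """Through-origin regression statistics for predicted vs. experimental values."""
    n = len(exp)
    mean_x, mean_y = sum(exp) / n, sum(pred) / n
    sxx = sum((x - mean_x) ** 2 for x in exp)
    syy = sum((y - mean_y) ** 2 for y in pred)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(exp, pred))
    r2 = sxy * sxy / (sxx * syy)            # squared Pearson correlation

    dot = sum(x * y for x, y in zip(exp, pred))
    k0 = dot / sum(x * x for x in exp)      # slope of pred vs. exp through origin
    k0p = dot / sum(y * y for y in pred)    # slope of exp vs. pred through origin

    # Coefficients of determination of the two through-origin fits.
    r02 = 1.0 - sum((y - k0 * x) ** 2 for x, y in zip(exp, pred)) / syy
    r02p = 1.0 - sum((x - k0p * y) ** 2 for x, y in zip(exp, pred)) / sxx

    return {
        "R2": r2,
        "k0": k0,
        "k0_prime": k0p,
        "R02": r02,
        "R02_prime": r02p,
        "ratio": (r2 - r02) / r2,
        "ratio_prime": (r2 - r02p) / r2,
    }


def meets_criteria(m, r2_min=0.6, k_range=(0.85, 1.15), ratio_max=0.1):
    """Check R2 plus either the forward or the reversed slope/ratio conditions."""
    lo, hi = k_range
    forward = lo <= m["k0"] <= hi and m["ratio"] < ratio_max
    reverse = lo <= m["k0_prime"] <= hi and m["ratio_prime"] < ratio_max
    return m["R2"] > r2_min and (forward or reverse)


# Made-up experimental and predicted emission wavelengths (nm), illustration only.
experiment = [350, 420, 480, 530, 610, 660]
predicted = [362, 410, 495, 520, 598, 671]

metrics = golbraikh_tropsha(experiment, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print("meets criteria:", meets_criteria(metrics))
```

The ratio column follows directly from the two R2-type entries; for instance, the tabulated Lasso-RF value of 0.0717 equals (0.9227 − 0.8565)/0.9227.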
Fig. 6 The 2D histograms of the regression results of Lasso-LR and Lasso-RF models. The legend shows the linear equation fitting to the predicted values.
460 samples, 5 compounds per 100 nm interval between 300–700 nm for carrying out the emission wavelength predictions using time-dependent density functional theory (TDDFT) calculations, one of the common methods for predicting emission wavelengths with a QM approach (see Table S4† for the corresponding SMILES). We employed the wB97XD functional31 and the 6-31+G(d) basis set under the implicit solvation model (PCM of ethanol) with the Gaussian16 package32 for the S1 state optimization. The emission wavelength was further calibrated by the vertical S0 to S1 excitation energy computed at the wB97XD/6-311+G(d,p) level using the prior minimum structure of the S1 state. In Fig. 7 and Table S4,† the Lasso-RF model appears to give a reasonable R2 value (R2 = 0.655, MAE = 48 nm) in comparison with the selected DFT predictions (R2 = 0.778, MAE = 60 nm), at significantly less computational expense. The Lasso-RF approach is consequently recommended for large-scale and high-throughput screening of the emission wavelength of organic fluorophores, being complementary to the TDDFT calculations.
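For comparing computed energies with measured wavelengths, a vertical S0–S1 energy gap in eV converts to a wavelength through λ = hc/E, and the mean absolute error (MAE) is then taken against experiment. The following is a minimal sketch of that arithmetic; the energies and experimental wavelengths are made-up values, not results from this study.

```python
H = 6.62607015e-34          # Planck constant, J s
C = 299792458.0             # speed of light, m/s
E_CHARGE = 1.602176634e-19  # joules per electronvolt


def ev_to_nm(energy_ev):
    """Convert a transition energy in eV to a wavelength in nm via lambda = h*c/E."""
    return H * C / (energy_ev * E_CHARGE) * 1e9


def mean_absolute_error(reference, predicted):
    """Average absolute deviation between paired values."""
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


# Made-up vertical energy gaps (eV) and made-up experimental wavelengths (nm).
energies_ev = [3.10, 2.48, 2.07]
experiment_nm = [420.0, 510.0, 590.0]

predicted_nm = [ev_to_nm(e) for e in energies_ev]
print([round(w, 1) for w in predicted_nm])
print(round(mean_absolute_error(experiment_nm, predicted_nm), 1))
```

Because wavelength is inversely proportional to energy, a fixed error in eV corresponds to a larger error in nm at longer wavelengths, which is worth keeping in mind when comparing MAE values across the 300–700 nm range.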
Footnote
† Electronic supplementary information (ESI) available: All of the descriptor categories from PaDEL, the selected compounds for DFT vs. Lasso-RF comparison, the schematic representation of Lasso-RF model. See DOI: 10.1039/d0ra05014h
This journal is © The Royal Society of Chemistry 2020