Jiangcheng Xua,
Wenbo Yua,
Nan Zhoub,
Jiamin Zhongb,
Jintian Lyuc,
Zhihao Sud,
Yu Chen*a and
Kui Du*b
aHangzhou Vocational & Technical College, Hangzhou 310014, P. R. China. E-mail: 2003010002@hzvtc.edu.cn
bChemistry and Chemical Engineering, Shaoxing University, Shaoxing 312000, P. R. China. E-mail: dkui@usx.edu.cn
cL.E.K. Consulting, 75 State Street 19th Floor Boston, MA 02109, USA
dCollege of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, P. R. China
First published on 27th June 2025
Fluorescent drug molecules play a pivotal role in biomedical research and precision medicine. Their intrinsic fluorescence enables real-time tracking of drug distribution, target engagement, and metabolic pathways, while avoiding interference from external labeling. However, traditional fluorescent drug discovery relies heavily on trial-and-error approaches, which are inefficient and resource-intensive. To address this, we developed DyeLeS (Dye-Likeness Scoring), a web platform designed to rapidly evaluate molecular fluorescence potential and predict key photophysical properties such as Stokes shift and quantum yield. DyeLeS utilizes a curated dataset of fluorescent and non-fluorescent compounds, applies a Naive Bayes-inspired algorithm for fluorescence classification (AUC = 0.995), and employs a LightGBM model for quantitative prediction of fluorescence properties, achieving an R² of 0.88 in absorption wavelength (λabs) prediction. Leveraging DyeLeS, this study constructed FluoBioDB, the first publicly available library of fluorescent bioactive compounds, encompassing 32865 structurally diverse molecules, including kinase inhibitors and GPCR modulators. Case analyses indicate that FluoBioDB compounds typically possess polycyclic conjugated frameworks, donor–acceptor (D–A) structures, and rigid planar cores, endowing them with strong potential for applications in bioimaging, targeted therapy, and theranostics. This work presents a robust computational framework and a valuable molecular resource to facilitate the rapid discovery and optimization of fluorescent drug candidates. All source code and datasets are available at https://github.com/MolAstra/DyeLeS, and the web server can be accessed at https://dyeles.molastra.com.
The development of fluorescent drug molecules typically requires a complex, multi-step process, including lead identification, fluorescence and bioactivity characterization, structural optimization, mechanistic studies, pharmacokinetic evaluation, and safety assessment.8–10 Each stage demands extensive experimental effort and relies heavily on manual trial-and-error approaches, resulting in high resource consumption and limited efficiency. Streamlining this workflow is crucial for accelerating fluorescent drug discovery.11
With the advancement of artificial intelligence technologies and the expansion of pharmaceutical big data, the speed of drug development has significantly increased.12–14 For example, the development of machine learning (ML) models and algorithms for predicting drug-like properties has enabled the rapid construction of drug-like compound libraries,15–17 thereby streamlining the drug discovery process. Similarly, models for assessing natural product-likeness have facilitated the identification of compounds with natural origins and the creation of natural product libraries,18,19 improving development efficiency. Although databases such as ChEMBL,20 MedChemExpress,21 and ZINC22 offer extensive collections of bioactive molecules to support drug discovery, to the best of our knowledge, there have been no reports focusing on methods for the rapid assessment of molecular fluorescence properties or on the establishment of fluorescent bioactive molecule libraries.
To enable the rapid identification of fluorescent properties in drug molecules and to construct a dedicated library of fluorescent drug candidates, we developed DyeLeS, a web-based application for Dye-Likeness Scoring. DyeLeS allows for the efficient evaluation of molecular fluorescence potential and predicts key properties such as Stokes shift and fluorescence quantum yield. Furthermore, by applying DyeLeS to the virtual screening of known bioactive molecules, we constructed the first publicly available fluorescent drug molecule library, comprising 32865 compounds annotated with detailed fluorescence characteristics, including maximum absorption wavelengths, quenching wavelengths, and fluorescence lifetimes. This study leverages ML to accelerate the development of fluorescent drugs, utilizing the inherent labeling effect of fluorescent molecules to advance drug pharmacokinetics research, target validation strategies, and the integration of therapeutic and diagnostic (theranostic) applications.
The fluorescent molecule database was built from two primary sources: (a) approximately 3200 fluorescent compounds manually curated based on extensive experimental experience in the field of fluorescent molecules;23–25 (b) an additional ∼25000 fluorescent compounds collected from the Dye database26 developed by the Song research group. After deduplication, RDKit standardization, and quality control, the final library contained 26255 unique fluorescent molecules represented by SMILES. Among them, 6703 were annotated with properties such as Stokes shift, quantum yield (Φfl), absorption (λabs), and emission wavelengths (λem).
The non-fluorescent molecule database was primarily sourced from the Collection of Open Natural Products27 (COCONUT) database. Based on our group's expertise in fluorescence research,23–25 we performed fluorescence screening and manually excluded molecules with potential fluorescent properties. This resulted in a non-fluorescent dataset containing 38991 compounds with no observable fluorescence activity.
Comparative analyses of molecular weight (MW, Fig. 1a), atom-based octanol–water partition coefficient (ALOGP, Fig. 1b), number of rotatable bonds (ROTB, Fig. 1c), structural alerts (ALERTS, Fig. 1d), hydrogen bond acceptors (HBA), and hydrogen bond donors (HBD) (Fig. S1a and b, ESI†) between the fluorescent and non-fluorescent compound libraries revealed that the two datasets share comparable key physicochemical properties, such as molecular weight, thereby minimizing the risk of data-type bias. Moreover, both libraries exhibit approximately normal distributions with uniform data coverage, reducing the likelihood of bias arising from the overrepresentation of structurally similar fluorescent molecules.
To further explore the data characteristics of the two databases, we employed the TMAP28 dimensionality reduction tool for analysis. Two key observations can be drawn from Fig. 2a: First, on a global scale, there is a clear distinction between fluorescent molecules (blue) and non-fluorescent molecules (orange), as evidenced by the separate clustering of the two colors. A possible explanation is that, while the molecules in both datasets share similar physical properties, the fluorescent molecule dataset contains a higher proportion of aromatic rings (Fig. S2, ESI†), which differentiates it from the non-fluorescent dataset. This distinct clustering in chemical space suggests that machine learning algorithms may be able to effectively distinguish between the two classes of molecules. Second, at a more detailed level, we observe several orange points (non-fluorescent molecules) scattered among the blue cluster (fluorescent molecules), particularly in the left-central region of the map, as well as in the upper right and left areas. This indicates that some fluorescent and non-fluorescent molecules are very close to each other in chemical space. A representative example is the pair of adjacent blue and orange dots located in the lower right of the plot. Upon examination, these correspond to carbazole, a fluorescent molecule (Fig. 2b), and N-methylcarbazole, a non-fluorescent molecule (Fig. 2c).
In carbazole, the nitrogen atom is sp2-hybridized, and its lone pair can participate in π-conjugation within the aromatic system, facilitating fluorescence. However, when the hydrogen on the nitrogen is replaced by a methyl group, the nitrogen's lone pair can no longer delocalize, thereby disrupting the electronic conjugation across the molecule. The only structural difference between the two molecules is the presence of a single methyl group, but this substitution increases the molecular flexibility and enhances non-radiative decay pathways in the excited state, leading to a significant reduction in fluorescence.29 In the TMAP dimensionality reduction plot, it is also easy to identify cases where molecules with highly similar structures exhibit markedly different fluorescence behaviors (Fig. S3, ESI†).
Based on structural analyses of molecules from the fluorescent and non-fluorescent molecule databases, we found that fluorescent compounds typically possess key structural features such as aromatic ring systems, donor–acceptor architectures, and specific functional group substitutions. Therefore, in designing DyeLeS for fluorescence property prediction, we first developed DyeLeS-DyeS to extract characteristic fluorescent substructures from the molecular datasets, enabling a coarse screening of fluorescence potential. Subsequently, we implemented DyeLeS-DyeP, a regression-based machine learning model, which considers additional structural factors—such as molecular rigidity and the extent of π-conjugation—to perform refined and quantitative predictions of fluorescence properties.
Next, inspired by the Ertl18 and Sorokina19 research groups, we adopted a log-likelihood ratio scoring algorithm (statistical fragments) combined with the Morgan fingerprint approach, providing a highly efficient, interpretable, and scalable solution ideal for large-scale fluorescent molecule screening and early-stage library construction. We refer to this approach as DyeLeS-DyeS. This strategy has two key advantages: (1) it does not require modeling complex dependencies among features, instead focusing solely on the relationship between fragment presence and fluorescence properties; (2) it is resistant to overfitting in high-dimensional spaces because each molecular fragment is modeled independently.
After processing the positive and negative sample datasets, a log-likelihood ratio (F-score) is calculated by DyeLeS-DyeS for each fingerprint bit to identify features more common in dye molecules. This score is based on fragment frequencies in both datasets, with Laplace smoothing applied. A molecule's fluorescence score is then computed by summing the contributions of its fragments and normalizing by molecular size. DyeLeS-DyeS assigns a fluorescence-likeness score from −5 to +5, where higher scores indicate a greater likelihood of fluorescence.
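The scoring scheme described above can be illustrated with a minimal Python sketch. It is not the DyeLeS implementation: molecules are represented here as plain sets of fragment identifiers (standing in for Morgan fingerprint bits), and the function names, the smoothing constant `alpha`, the use of fragment count as the measure of molecular size, and the clipping to the −5 to +5 range are all assumptions made for illustration.

```python
import math
from collections import Counter


def fragment_scores(dye_mols, non_dye_mols, alpha=1.0):
    """Per-fragment log-likelihood ratio with Laplace (add-alpha) smoothing.

    Each molecule is a set of fragment identifiers, so a fragment is
    counted at most once per molecule (presence/absence).
    """
    dye_counts = Counter(f for mol in dye_mols for f in mol)
    non_counts = Counter(f for mol in non_dye_mols for f in mol)
    n_dye, n_non = len(dye_mols), len(non_dye_mols)
    scores = {}
    for frag in set(dye_counts) | set(non_counts):
        # Smoothed probability that a molecule of each class contains frag.
        p_dye = (dye_counts[frag] + alpha) / (n_dye + 2 * alpha)
        p_non = (non_counts[frag] + alpha) / (n_non + 2 * alpha)
        scores[frag] = math.log(p_dye / p_non)
    return scores


def molecule_score(mol, scores, lo=-5.0, hi=5.0):
    """Sum fragment contributions, divide by fragment count, clip to [lo, hi]."""
    if not mol:
        return 0.0
    # Fragments never seen in training contribute nothing (assumption).
    raw = sum(scores.get(f, 0.0) for f in mol) / len(mol)
    return max(lo, min(hi, raw))


# Toy data with hypothetical fragment labels, not real fingerprint bits.
dyes = [{"aromatic", "amine"}, {"aromatic", "carbonyl"}]
non_dyes = [{"alkyl", "hydroxyl"}, {"alkyl", "carbonyl"}]
table = fragment_scores(dyes, non_dyes)
print(molecule_score({"aromatic", "carbonyl"}, table))
print(molecule_score({"alkyl", "hydroxyl"}, table))
```

In this toy example, fragments seen only among the dyes receive positive scores, fragments seen only among the non-dyes receive negative scores, and a fragment equally common in both sets scores zero, which is the behaviour the log-likelihood ratio is meant to capture.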
To evaluate the performance of the DyeLeS-DyeS model, we conducted both classification and scoring tasks across multiple molecular datasets (Fig. 3). In the binary classification task distinguishing fluorescent molecules from non-fluorescent ones, the model achieved an area under the ROC curve (AUC) of 0.995 (Fig. 3a), indicating excellent predictive accuracy. To further assess the model's generalizability, we tested its classification ability on three additional datasets where fluorescence properties were uncertain: ZINC, NPAtlas, and ChEMBL. These were compared against a dye molecule database composed of known fluorescent compounds. The resulting AUC values were 0.911, 0.995, and 0.917, respectively (Fig. 3b), demonstrating the robustness and transferability of the model across chemically diverse datasets. Notably, SHapley Additive exPlanations (SHAP) analysis (Table S1, ESI;† additional cases available at https://github.com/MolAstra/DyeLeS) revealed that aromatic systems, donor–acceptor motifs, and functional group substituents (e.g., hydroxyl, amino, and carbonyl groups) consistently ranked as top-contributing features. This finding aligns with established fluorescence mechanisms, validating both the model's predictive accuracy and chemical interpretability.
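For readers less familiar with the metric, the AUC reported above can be read as the probability that a randomly chosen fluorescent molecule receives a higher score than a randomly chosen non-fluorescent one. The sketch below computes it directly from that definition; the function name and the toy labels and scores are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores: 1 = fluorescent, 0 = non-fluorescent.
labels = [1, 1, 1, 0, 0, 0]
scores = [2.1, 0.8, -0.2, 0.3, -1.5, -3.0]
print(roc_auc(labels, scores))
```

Here eight of the nine positive/negative pairs are ordered correctly, so the toy AUC is 8/9; a value of 0.995 means almost every such pair is ordered correctly.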
In addition to classification, we also applied the DyeLeS-DyeS model to fluorescence scoring tasks across datasets of varying sizes (Fig. 3c). The scoring distributions reflected the expected fluorescence tendencies of each dataset. Specifically, the Dyes database, consisting of fluorescent molecules, showed score distributions primarily in the range of 0 to 5. The COCONUT database, generally regarded as containing non-fluorescent natural products, had scores concentrated between −5 and 0. Meanwhile, the ZINC database, which contains molecules with ambiguous or unknown fluorescence behavior, exhibited a broader score distribution from approximately −4 to 4 (Fig. 3d). These results confirm the effectiveness of the pseudo-Bayesian scoring strategy implemented in the model, which enables both quantitative ranking and qualitative screening of molecular fluorescence potential.
The relatively lower R² value for Φfl indicates that fluorescence quantum yield remains a more difficult target to model. This is consistent with literature reports, as Φfl is affected by a variety of subtle and non-structural factors, including molecular rigidity, intramolecular motions, solvent polarity, and the presence of non-radiative decay pathways. Although the model's predictive accuracy for Φfl is limited, it still offers valuable first-pass screening capability, enabling prioritization of candidates for experimental validation.
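As a reminder of what the coefficient of determination measures, the sketch below computes it as one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the wavelength values are hypothetical and are not taken from the DyeLeS evaluation.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. predicted absorption maxima in nm.
measured = [400.0, 450.0, 500.0, 550.0]
predicted = [410.0, 440.0, 505.0, 545.0]
print(r_squared(measured, predicted))
```

A property with a narrow spread of true values, or one that depends on factors absent from the input features, yields a larger residual fraction and hence a lower R², which is one way to read the weaker result for Φfl.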
Together, these results highlight the complementary strengths of DyeLeS-DyeP and DyeLeS-DyeS. While DyeLeS-DyeS enables rapid classification of fluorescent versus non-fluorescent candidates, DyeLeS-DyeP provides fine-grained and property-specific predictions, enhancing the overall utility of the DyeLeS platform for fluorescent molecule design and screening.
As illustrated in Fig. 5, the construction of FluBioDB involved three key steps: (1) Fluorescence-likeness scoring: we first evaluated compounds from ChEMBL and NPAtlas using DyeLeS-DyeS, which assigns a fluorescence-likeness score ranging from −5 to +5 based on fragment-level enrichment. As shown in Fig. 5a, only ChEMBL compounds with scores above 0.5 were selected. All NPAtlas compounds fell below this threshold and were excluded from the final dataset. (2) Fluorescence property annotation: the filtered ChEMBL compounds were then analyzed using DyeLeS-DyeP, which predicts four key photophysical parameters: absorption wavelength (λabs), emission wavelength (λem), Stokes shift, and fluorescence quantum yield (Φfl). Notably, DyeLeS-DyeP has been implemented as a web application, allowing users to conveniently obtain fluorescence-related predictions based on molecular structure (Fig. 5b). (3) Database compilation: all bioactive compounds in ChEMBL were first encoded as Morgan fingerprints (radius = 2, 2048 bits) using the RDKit cheminformatics toolkit. The selected subset of compounds (those scoring above 0.5 in DyeLeS-DyeS) was then assembled into the final dataset. This resulted in a total of 32865 fluorescent bioactive molecules, each annotated with predicted fluorescence properties from DyeLeS-DyeP (Fig. 5c). The resulting FluBioDB serves as an open-access, structurally diverse resource for accelerating fluorescent drug discovery, enabling virtual screening, experimental validation, and downstream theranostic application development.
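The three steps above amount to a score, filter, and annotate loop, which can be sketched as follows. This is an illustrative outline only: `score_fn` and `predict_fn` are hypothetical stand-ins for DyeLeS-DyeS and DyeLeS-DyeP, the record layout is an assumption, and the toy identifiers and values are invented for the example.

```python
def build_library(candidates, score_fn, predict_fn, threshold=0.5):
    """Keep candidates scoring strictly above threshold and annotate them.

    score_fn   : callable mapping a molecule identifier to a likeness score
    predict_fn : callable mapping a molecule identifier to a dict of
                 predicted photophysical properties
    """
    library = []
    for mol in candidates:
        score = score_fn(mol)
        if score <= threshold:
            continue  # below or at the cutoff: excluded from the library
        record = {"molecule": mol, "score": score}
        record.update(predict_fn(mol))
        library.append(record)
    return library


# Toy stand-ins; real use would pass SMILES strings and trained models.
toy_scores = {"mol_A": 1.2, "mol_B": 0.5, "mol_C": -2.0}
toy_props = {"lambda_abs_nm": 450.0, "quantum_yield": 0.3}

library = build_library(
    toy_scores,
    score_fn=toy_scores.get,
    predict_fn=lambda mol: dict(toy_props),
)
print(library)
```

Passing the scorer and predictor as callables keeps the filtering logic independent of the models, so the same loop could be reused with a different cutoff or a different property predictor.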
We selected nine representative compounds from the FluBioDB with high DyeLeS fluorescence-likeness scores for detailed analysis. These compounds exhibit promising potential for applications in bioimaging probes, drug delivery tracking, and related fields.
For instance, the compound shown in Fig. 6a is a potential kinase inhibitor or signaling pathway modulator. Its structure includes a pyrimidine ring, enone moiety, and urea group, forming an extended conjugated system. Molecules with similar scaffolds are often designed as fluorescent probes.36 The compound in Fig. 6b features a [1,2,4]triazolo[1,5-a]pyridine fluorophore—a conjugated core commonly found in fluorescent probes,37 structurally reminiscent of quinoline. It also includes a thiazole ring, chlorophenyl aromatic system, and a flexible linker combining a piperidine ring and fluorocyclopropylamine. The presence of a difluoromethoxy group, a strong electron-withdrawing unit, contributes to a donor–acceptor (D–A) configuration, promoting charge transfer (CT) fluorescence. The predicted fluorescence quantum yield is 0.37. The molecule shown in Fig. 6c features a structure commonly associated with kinase inhibitors or GPCR (G Protein-Coupled Receptor) modulators, characterized by a fused dibenzofuran, benzoyl, and methoxyphenyl moiety forming an extended π-conjugated system that facilitates efficient electronic transitions. Its rigid and planar architecture indicates strong potential for development as a fluorescent drug candidate.38,39 The compound in Fig. 6d is a multi-heterocyclic covalent-binding candidate with rich aromatic conjugation, significant intramolecular charge transfer (ICT) capability, and high structural rigidity. According to DyeLeS predictions, it exhibits a Stokes shift of 83.41 nm and a fluorescence quantum yield of 0.35.
Fig. 6 Case studies of fluorescent compounds from FluBioDB. Most FluBioDB compounds feature polycyclic aromatic or heterocyclic cores, extended π-conjugation, and donor–acceptor (D–A) architectures.
Structural analysis of additional compounds in Fig. 6e–i further supports these findings. Most FluBioDB hits possess characteristic polycyclic aromatic or heterocyclic cores, extended π-conjugation, and donor–acceptor (D–A) architectures that facilitate intramolecular charge transfer (ICT) and enhance Stokes shift. These features validate the effectiveness of DyeLeS in enriching for fluorescence-active molecules and demonstrate that FluBioDB can significantly accelerate the discovery of fluorescent drug candidates and bioimaging probes.
This work fills a critical gap in fluorescent drug discovery by providing a scalable computational framework and a curated molecular resource. DyeLeS and FluBioDB offer practical tools to accelerate the design of fluorescent probes and theranostic agents, facilitating the translation of fluorescent drugs toward real-world biomedical applications.
We have deployed a fully-featured standalone version of the tool on GitHub (https://github.com/MolAstra/DyeLeS), complete with comprehensive installation guidelines and usage documentation. The web interface offers a streamlined subset of functionalities designed for browser-based accessibility, while the GitHub repository provides the complete analytical toolkit, including extensively annotated Jupyter notebooks with step-by-step code explanations. Both platforms will receive continuous maintenance and updates.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5ra03164h
This journal is © The Royal Society of Chemistry 2025