Yaoyao Tang ab, Quan Xu *ab, Peide Zhu b, Rongye Zhu b and Juncheng Wang *c
aCollege of Artificial Intelligence, China University of Petroleum-Beijing, Beijing, 102249, China. E-mail: xuquan@cup.edu.cn
bState Key Laboratory of Heavy Oil Processing, China University of Petroleum-Beijing, Beijing, 102249, China
cDepartment of Stomatology, First Medical Center, Chinese PLA General Hospital, Beijing, 100042, China. E-mail: wbhwjc527@126.com
First published on 1st November 2023
As a novel type of nanomaterial, carbon dots (CDs) are widely used in biology owing to their optical properties, biocompatibility, and intrinsic theranostic properties. Taking advantage of these features, CDs serve as color agents, fluorescence probes, and anti-cancer drugs. Machine learning (ML) has progressed dramatically and is now widely used in the biological field. In this review, we introduce the ML workflow and its leading models, and then demonstrate the application of CDs in bioimaging, biosensing, and cancer treatment. Next, we summarize how ML has been combined with CDs in each of these areas. Finally, we briefly summarize the challenges and expectations for the future. This review provides new thoughts and guidance on the application of CDs and their integration with machine learning.
Carbon dots (CDs) are zero-dimensional carbon materials with particle sizes below 10 nm,16 which brings multiple desirable properties,17–22 such as excellent optical properties, chemical stability, solubility, and biocompatibility. CDs have attracted increasing attention, and significant progress has been made in their preparation, synthesis,23–28 and applications.29–31 In particular, CDs are a powerful tool in the biological field, with widespread application in imaging, sensing, and cancer treatment. They offer strong fluorescence intensity, good chemical stability, and a long fluorescence lifetime, and can therefore serve as fluorescent agents in bioimaging.31 Moreover, owing to their large surface area and the wealth of chemical groups on their surface, CDs bind specifically to particular substances, so that the fluorescence intensity is enhanced or diminished depending on the concentration of the target molecule. This optical characteristic has been exploited in fluorescence probes that sense pH,32 ions,33–35 and biomolecules.36–38 CDs possess good hydrophilicity owing to their amino, hydroxyl, and carboxyl groups, and can readily be loaded with drugs. Thus, they have strong potential as therapeutic drug delivery agents.39–41
The integration of ML into the biological application of CDs has great potential and falls into two major categories: ML guides the synthesis of CDs, and ML assists in data processing and analysis. ML algorithms can be applied to various optimization problems by sampling and training on limited samples to optimize the synthesis parameters and improve the material properties. For example, Tang et al.42 used ML to guide the synthesis of CDs with enhanced optical properties and established a progressive, adaptive model to minimize the number of practical trials. As a data-driven approach, ML quickly and efficiently identifies the characteristics of the data, improving the accuracy of the analysis results. This paper reviews the application of CDs in the biological field, and each section then introduces the corresponding use of ML, as displayed in Fig. 1. Finally, a brief conclusion and expectations about ML are given at the end of every section.
AI was born at the Dartmouth conference in 1956, and its development has since gone through three major waves, as shown in Fig. 2. The first wave centered on logical reasoning, and its hallmark was the building of the simple Wabot1 robot. Because computational capacity was low, complex and significant problems could not be handled, and the development of AI stalled. The neural network and the backpropagation (BP) algorithm drove the second wave, which was characterized by expert systems built on accumulated prior knowledge. However, the lack of practical applications led to a quick decline in its use. Owing to the explosive increase in data, powerful computing ability, and constantly refined algorithms, AI has experienced a burst of growth in the past decade. In particular, AlphaGo defeated the world Go champion Lee Sedol in 2016.44 Nowadays, finance, medical treatment, autonomous driving, and other industries45,46 have mature applications and have established relatively complete databases.
Fig. 2 History of artificial intelligence.43–46 |
With the recent rapid advances in data availability, computing power, and new algorithms, ML has emerged as one of the key methods for realizing AI. It builds models by using algorithms to reveal inner connections in data, which allows better decisions to be made without human intervention. Most industries possess large amounts of data and have recognized the value of ML technology. Banks and other businesses in the financial industry use ML to identify important insights in data and prevent fraud, which helps to identify profit opportunities or avoid unknown risks. ML can assess patient health in real time and help medical professionals analyze data to identify diagnoses and treatments. Websites use ML to recommend items that a user might like based on previous purchase history. Moreover, we enjoy many conveniences brought by ML, such as text and speech recognition software, web search engines, personalized movie recommendations, and prediction of delivery times, and ML is playing an increasingly important role in people's daily lives. Compared with earlier rule-based approaches, ML can easily handle massive amounts of data to extract patterns and regularities, and then use this information to make predictions and decisions. The process is automated, efficient, and reliable, which saves time and improves returns.
Applying ML can be divided into four processes: data collecting, feature engineering, model selecting, and model application, as displayed in Fig. 3a. This section reviews the workflows and some common ML models.
Fig. 3 (a) The workflow of ML. Schematic diagram of the linear regression (b), K-nearest neighbor (c), decision tree (d), and support vector machine (e). |
Feature selection screens useful features and deletes redundant (and even irrelevant) features from a large number of features.51 If all features are input into the ML model, the curse of dimensionality often arises, and the model's accuracy can even be reduced. Therefore, it is essential to perform feature selection so that invalid features are excluded from the training data.52 Feature extraction is the process of reducing the dimensionality of the data while retaining significant information and generating new features.53 A large raw feature matrix increases computational cost and time, so reducing the dimension is important. Principal component analysis (PCA) and linear discriminant analysis (LDA) show excellent ability and potential to deal with high-dimensional data.52 Feature construction processes the original data, combines existing features, and generates new features.53 Newly introduced features need to be verified to improve the prediction accuracy; otherwise, they are useless features that only increase the complexity of the algorithm. This requires researchers to spend considerable time on the sample data and to think about the nature of the problem, the structure of the data, and how best to use them in the prediction model. For example, the body mass index (BMI) represents the body's fatness and indicates whether it is healthy. It is calculated from mass and height and is constructed as a new feature. The initial data, mass and height, can also describe the body, but the BMI shows the health condition more directly and can help with disease prevention.
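The BMI example of feature construction can be sketched in a few lines. This is a minimal illustration, not a method taken from the cited works; the record keys `mass_kg` and `height_m` and the helper `add_bmi` are hypothetical names chosen for the example.

```python
def add_bmi(records):
    """Return copies of the records with a constructed 'bmi' feature.

    Each record is a dict with the hypothetical keys 'mass_kg' and 'height_m'.
    """
    enriched = []
    for rec in records:
        # BMI = mass (kg) divided by the square of height (m)
        bmi = rec["mass_kg"] / rec["height_m"] ** 2
        enriched.append({**rec, "bmi": round(bmi, 1)})
    return enriched


# Two illustrative (made-up) samples
samples = [
    {"mass_kg": 70.0, "height_m": 1.75},
    {"mass_kg": 90.0, "height_m": 1.60},
]
samples_with_bmi = add_bmi(samples)
```

The original mass and height columns are kept alongside the new feature, so a later feature-selection step can decide which of them are actually useful to the model.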
Model | Linear regression | Polynomial regression | Logistic regression | K-Nearest neighbor | Decision tree | Support vector machine | Random forest | Adaptive boosting | Gradient boosting decision tree | XGBoost | LightGBM |
---|---|---|---|---|---|---|---|---|---|---|---|
Regression problem | √ | √ | | √ | √ | √ | √ | √ | √ | √ | √ |
Classification problem | | | √ | √ | √ | √ | √ | √ | √ | √ | √ |
Output | Continuous | Continuous | Discrete | Continuous and discrete | Continuous and discrete | Continuous and discrete | Continuous and discrete | Continuous and discrete | Continuous and discrete | Continuous and discrete | Continuous and discrete |
Basic principle | Linear function | Polynomial function | Activation function | Distance measure | Binary tree | Decision surface | Decision tree | Decision tree | Decision tree | Decision tree | Decision tree |
Advantage | Easy to understand and avoids overfitting | Easy to implement and understand; fits a nonlinear relationship between variables | Easy to implement and understand; suited for binary classification | Simple technique; suited for multi-modal classes | Handles a variety of data; good generalization ability | Resists overfitting; appropriate kernel function; handles high-dimensional data | Fast, scalable, robust to noise, does not overfit, easy to interpret and visualize with no parameters to manage | No overfitting and no need to filter features; high accuracy | Handles a variety of data; robust to outliers | Higher precision compared with GBDT; training in parallel; automatic processing of missing-value features | Histogram algorithm reduces time complexity; less memory consumption compared with XGBoost |
Disadvantage | Cannot deal with a non-linear relationship | Easily overfits data; outlier sensitivity | Difficult to deal with data imbalance; more sensitive to multicollinear data | Outlier sensitivity; intensive computation with huge data | Difficult to handle high-dimensional data; data fragmentation problem | Cannot solve multiclassification problems | Slow for real-time prediction | Time-consuming for training; data imbalance leads to loss of classification accuracy | Difficult to train data in parallel; computational complexity | Excessive space complexity | More sensitive to noise |
Linear regression (LR) predicts the target variable by fitting a linear function to the data set; it is the basis of regression problems and the most widely used model in industry.54 Its function is as follows:
$$y = \omega^{\mathrm{T}} x$$
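As a minimal sketch of how such a linear function is fitted, the single-feature case can be solved in closed form by ordinary least squares. The helper `fit_line` and the toy data are illustrative assumptions; the explicit intercept is the bias term that the vector form above absorbs into ω.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

For several input variables the same idea applies, but the weights are usually obtained with a linear-algebra routine rather than by hand.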
Polynomial regression develops from linear regression, with more input terms and higher power exponents,55 so that, if computational effort and overfitting are disregarded, it can fit an arbitrary data set. Logistic regression is a linear classifier. Employing a logistic function,56 it maps the data features to a probability value in the interval from 0 to 1, which is compared with a threshold of 0.5 to assign the class.
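The mapping from a linear score to a probability and then to a class can be sketched as follows. The weights, bias, and feature values are made-up numbers for illustration, and `predict_class` is a hypothetical helper rather than part of any cited model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(weights, bias, features, threshold=0.5):
    """Return (probability, label) for one sample of a binary problem."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    # The probability is compared with the threshold to assign the class
    return p, int(p >= threshold)


# Illustrative weights and two samples on either side of the decision boundary
weights, bias = [1.0, -1.0], 0.0
p_pos, label_pos = predict_class(weights, bias, [2.0, 0.5])
p_neg, label_neg = predict_class(weights, bias, [0.5, 2.0])
```

Because the logistic function is monotonic, thresholding the probability at 0.5 is equivalent to checking the sign of the linear score, which is why logistic regression remains a linear classifier.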
Bagging63 is a classical parallel method whose weak learners can be trained and make predictions independently. Bagging uses bootstrap sampling, in which each randomly selected sample is put back into the data set before the next draw (sampling with replacement), so that each weak learner obtains its own data set. The random forest (RF) is based on the DT and uses the bagging algorithm to form a strong learner.64 RF has many advantages over DT, such as decreasing the variance, reducing overfitting, and evaluating feature importance.
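Bootstrap sampling and the aggregation step of bagging can be illustrated with a short sketch. The helpers `bootstrap_sample` and `majority_vote` are hypothetical names; training of the individual weak learners is omitted, and the fixed seed is used only to make the example repeatable.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement, so items may repeat or be absent."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def majority_vote(predictions):
    """Combine the class predictions of several weak learners for one sample."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)      # fixed seed for repeatability
data = list(range(10))      # stand-in for ten training samples
bags = [bootstrap_sample(data, rng) for _ in range(3)]  # one data set per learner
```

Each bag has the same size as the original data set but a different composition, which is what allows the weak learners to be trained independently and in parallel.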
Boosting65 trains weak learners iteratively to learn the intrinsic rules of the data; it is a dynamic process that adjusts the weights of the data and enhances the performance of the model. Once all the learners are trained, the performance of each learner also determines its weight in the final model. Adaptive Boosting (AdaBoost),66 Gradient Boosting Decision Tree (GBDT), XGBoost, and LightGBM all use the boosting method.
AdaBoost adapts to the data by changing the sample weights according to the weak learners.66 The weights of samples misclassified by the previous learner are increased, and the re-weighted samples are then used to train the next learner. GBDT67 calculates the negative gradient of the loss to identify the error and correct the model. Building on GBDT, XGBoost68 adds column subsampling, parallel computation, and automatic handling of missing values. LightGBM69 is designed for the huge data volumes found in industry; compared with XGBoost, it consumes less memory, trains faster and more accurately, and supports distributed computing.
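The re-weighting step of AdaBoost can be sketched for a single boosting round. This follows the standard discrete AdaBoost update with labels coded as +1 and -1, which is an assumption of this example rather than a detail given in the text; `adaboost_round` is a hypothetical helper.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost update for labels in {-1, +1}.

    Returns the learner weight alpha and the normalized sample weights.
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # A learner with lower error receives a larger say in the final vote
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones scaled down
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Four equally weighted samples; only the last one is misclassified
weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.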
Data division is closely related to model evaluation and selection: the data are randomly divided into training and test sets according to a certain proportion. Because only part of the data is used to train the model, the choice of split can affect which model is selected. Cross-validation was proposed to make model validation more robust. The most widely used form is K-fold cross-validation, which divides the data into K disjoint parts of equal size, selects one part as the validation set, and uses the remaining K − 1 parts as the training set. After K separate rounds of training and validation, the validation error of the model is taken as the average of the K validation results. Leave-one-out cross-validation (LOOCV) is a particular instance of K-fold cross-validation in which a single data point is selected as the validation set in each round. The number of models built therefore equals the size of the data set, and the computational cost is tremendous. In practice, cross-validation is usually combined with grid search, which evaluates different combinations of hyperparameters on the same validation sets and selects the setting corresponding to the best-performing model.
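The splitting step of K-fold cross-validation can be sketched as follows. The helper `k_fold_indices` is a hypothetical name; shuffling before the split is omitted for brevity, and when the data cannot be divided evenly the first folds receive one extra sample.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, validation_indices) pairs for K folds."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds


folds = k_fold_indices(10, 5)   # 5-fold split of ten samples
loocv = k_fold_indices(4, 4)    # LOOCV is the special case k == n_samples
```

A model would be trained once per pair, and the K validation errors averaged; in a grid search this loop is repeated for every hyperparameter setting.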
Suitable evaluation metrics are needed to compare the effectiveness of different models and select the most appropriate model. According to the different types of problem-solving, the evaluation metrics are also different, which are mainly categorized into two types: classification and regression.
The common evaluation metrics for classification problems mainly include accuracy, precision, recall, F1 score, ROC curve, and AUC score. The evaluation indexes of regression problems mainly include MAE and MSE.
In a binary classification problem, samples can be classified according to the combination of their true and predicted categories into four types: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). In general, these are displayed by the confusion matrix.
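Accuracy is the percentage of all samples that are predicted correctly:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision evaluates the accuracy of the positive examples predicted by the model:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall is the proportion of actual positive samples that are predicted as positive:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Precision and recall often trade off against each other, so the F1 score, their harmonic mean, combines them to find a balance:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$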
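The confusion-matrix counts and the four metrics described above can be computed directly from the true and predicted labels. This is a minimal sketch with made-up labels (1 for positive, 0 for negative); `classification_metrics` is a hypothetical helper, and returning 0.0 for an undefined ratio is a convention chosen for this example.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1/0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```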
The ROC curve is a graphical tool used to represent the performance of a classification model. It shows the relationship between the true positive rate (TPR) and false positive rate (FPR) of the classifier under different thresholds.
The AUC score is the area under the ROC curve, and is used to measure classifier performance. The value of AUC ranges from 0 to 1, which is positively associated with the classifier performance.
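One way to see what the AUC measures is its pairwise interpretation: it equals the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below uses this equivalent definition instead of integrating the ROC curve; the scores are made-up values and `auc_score` is a hypothetical helper.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(y_true, scores)
```

The double loop is quadratic in the number of samples, so this form is suitable only for illustration; library implementations sort the scores instead.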
For regression problems, the common evaluation metrics are MAE and MSE.
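The formulas for the mean absolute error (MAE) and the mean squared error (MSE) are
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where n is the number of samples, y_i is the true value, and \hat{y}_i is the predicted value.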
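Both regression metrics average a per-sample error over the data set: MAE averages the absolute differences between true and predicted values, and MSE averages the squared differences. A minimal sketch with made-up values follows; the function names are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (true - predicted)^2 over all samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
```

Because the errors are squared, MSE penalizes a few large deviations more heavily than MAE does, which matters when the data contain outliers.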
ML accelerates the discovery of new recipes, in which the materials and results can be treated as parameters. As shown in the subsequent sections, many models have been applied to the synthesis of CDs. The recipe and experimental conditions of CDs are variable, leading to volatile results. Therefore, a property of the CDs, such as the quantum yield (QY), emission center, or Stokes shift, is set as the target. Models that fit the data well help to obtain high-quality CDs. The ML approach is novel for synthesizing CDs; it provides new ideas for similar materials and accelerates development in materials science.
ML also provides novel analysis methods to enhance interpretability, especially for image data. In general, 2D image data are used as the input of the ML model, which can effectively learn the corresponding features. The fluorescence intensity of CDs is related to concentration and is normally measured with a specific instrument. However, ML can directly use the fluorescence image to calculate the concentration, which is more convenient. Computer vision is one of the fastest growing and most widely used technologies in artificial intelligence, and ML, as a powerful tool, will promote its development.
Fig. 4 (a)–(c) Confocal fluorescence MCF-7 cells imaging in green, yellow, and red under the 450 nm laser light.77 Reproduced from ref. 47 with permission from John Wiley and Sons, copyright 2015. (d) The zeta potential spectra for CDs, CDs + PEI, and CDs@PEI.79 (e) λ = 450 nm excitation and (f) λ = 550 nm excitation fluorescence spectra of CDs, CDs + PEI, and CDs@PEI aqueous solutions.79 Reproduced from ref. 49 with permission from Elsevier, copyright 2022. (g) Confocal images of the mixture of Escherichia coli and Staphylococcus aureus cells after incubation with 50, 100, 200, 300, and 500 μg mL−1 T-SCDs.82 (h) Average intensity statistics of fluorescence corresponding to confocal images processed by the image.82 Reproduced from ref. 52 with permission from American Chemical Society, copyright 2021. (i) Fluorescence imaging was performed in vivo at 490 nm excitation and 560 nm emission at 5, 15, 30, 60, and 120 minutes before and after injection of N-CQDs.83 Reproduced from ref. 53 with permission from Elsevier, copyright 2023. |
Doping with heteroatoms also improves the fluorescence properties of CDs. For example, Liu et al.80 proposed the synthesis of nitrogen (N) and sulfur (S)-doped CDs by a hydrothermal method. Their biocompatibility and excellent fluorescence emission indicated that the NS-CDs can be used for fluorescence imaging. Bouzas-Ramos et al.81 successfully synthesized CDs doped with nitrogen and two lanthanide elements, Gd and Yb, using a one-pot microwave-assisted hydrothermal method. Notably, the doped CDs, with a QY of 66 ± 7%, had intense fluorescence emission, low cytotoxicity, and magnetic resonance (MR) and computed tomography (CT) contrast properties, and were successfully applied in in vitro fluorescence, MR, and CT cell imaging.
Similarly, S. aureus exhibited a notable increase in fluorescence intensity with increased incubation time (Fig. 4h), while Gram-negative E. coli did not respond to the extended incubation period. T-SCDs were remarkable for the rapid enrichment of Gram-positive bacteria and performed well in the real-time tracking of Gram-positive bacteria. Bharat Bhushan et al.84 developed a simple, inexpensive, and environmentally friendly method to prepare CDs by hydrothermal treatment of commercial casein. The CDs can specifically label Gram-negative bacteria and were able to enter fibroblasts, which remained effectively labeled over the subsequent 3–4 generations of cells.
Malignant tumors have become a group of common diseases that endanger human health.85 There have been attempts to use CDs to trace tumor cells, label mesenchymal stem cells, track intracellular nucleic acids, and target organelles.40 The assembly of CD-based bio-probes enables highly sensitive and specific detection, providing a new way for the early diagnosis of malignant tumors. Liu et al.86 injected 100 μL of prepared CDs solution into nude mice through the tail vein for systemic circulation. At 0.5 hours after injection, a strong fluorescence signal was observed in the brain, suggesting great potential for the therapeutic diagnosis of certain brain diseases through real-time monitoring. Zhang et al.83 used a molecular fusion strategy to generate a novel range of nitrogen-doped carbon quantum dots (N-CQDs). Fig. 4i shows the fluorescence images after the N-CQDs were injected into the body, demonstrating that the N-CQDs were effectively delivered to the tumor site in a mere 5 minutes and produced a noteworthy fluorescent signal owing to their ultra-small size (1.5 nm). The fluorescence intensity remained strong over a prolonged period, indicating the superior permeability and retention of N-CQDs and enabling real-time visualization of the tumor with optimal fluorescence imaging.
Fig. 5 (a) The maximum fluorescence intensity histogram shows the 400 CDs synthesized under different reaction conditions. Each block represents the reaction times of CDs for 3 (i), 7 (ii), 14 (iii), and 24 (iv) h. The column and row were controlled by MPBQ and various solvents, respectively. The colors of different columns in each square lattice show different volumes of VEDA.88 (b) Imaging of whole-cell fluorescence by CDs-3. (i) The diagram shows the internalization of CDs-3 into the cell. (ii) Confocal images from left to right showing the bright field, blue channel, green channel, and merging of cells. Reproduced from ref. 58 with permission from American Chemical Society, copyright 2022. |
Bioimaging is currently developing many exciting resources to promote the adoption of supervised machine-learning models. Image noise is inevitable in fluorescence microscopy and can arise from the detectors, acquisition speed, imaging resolution, scattering, etc., so reducing noise is essential for image analysis. ML-driven analysis can optimize and simplify microscopic image processing, improve signal-to-noise ratios, deconvolve images, and dramatically increase resolution.89 Romain et al.90 explored self-supervised ML methods to denoise images, which place lower demands on data quality than supervised ML. Wang et al.91 combined transfer learning and U-Net for image denoising. Transfer learning brings prior knowledge about the noise model, which helps to improve the denoising performance.
Weigert and co-workers developed content-aware image restoration (CARE) networks to restore missing training data.92 Chai et al.93 applied the U-Net to the 3D image to inpaint and denoise, achieving dynamic volumetric imaging and improving the signal-to-noise ratio (SNR).
Image segmentation refers to the separation of objects in a picture from the background, at the level of sets of pixels. Image segmentation can automatically separate the different tissues in microscopy images and has attracted much attention. Schmidt et al.94 used star-convex polygons to localize cell nuclei and enhance the detection performance. Based on the U-Net, the model can segment overlapping nuclei. Researchers have also applied this approach to 3D images, where star-convex polyhedra represent the cell nuclei. The method can accurately detect and segment cell nuclei in 3D images.95 Segmenting the membrane is more difficult than segmenting nuclei, owing to the various morphologies and different sizes. Khameneh et al.96 combined a superpixel-based tissue classifier (SVM) and a modified U-Net to segment cell membranes with 0.94 segmentation and 0.87 classification accuracy. Eschweiler et al.97 modified the 3D U-Net for 3D confocal microscopy images. It showed an advantage in segmentation quality for deep tissue without tedious parameter adjustment. The generalization capability of such algorithms is restricted by the lack of sufficiently large datasets for training a whole-cell segmentation model. Stringer et al.98 introduced a generalized model named Cellpose, which was trained on over 70 000 segmented objects. It can precisely segment cells in various image types and does not require model retraining or parameter adjustments. The model also supports 3D images without labeled 3D data. Cellpose is packaged as free software, and the community can contribute online.99,100 Next, Eschweiler and co-workers101 extended the Cellpose method to improve the classification accuracy in 3D images. The extended Cellpose could simplify the preparation of training data and instance reconstruction.
In addition, Cutler et al.102 proposed another generalized algorithm named Omnipose, which reaches high cell segmentation accuracy and expands the field of application to non-bacterial subjects, varied imaging modalities, and three-dimensional objects, demonstrating excellent performance in image segmentation. Organelle segmentation is also required in the biological field; Heinrich and co-workers worked on this problem, which is impeded by time-consuming manual annotation.103 The open-source web repository, ‘OpenOrganelle’, was created to share data, accelerating the development of segmentation in the biological domain.
Model | Result | Ref. |
---|---|---|
XGBoost | Produce the highest quality red CDs to increase the cost and have the potential to bioimaging | 87 |
XGBoost | Maximum fluorescence intensity and emission center's CDs generated and applied in bioimaging | 88 |
Self-supervised | Denoise images without the data quality demands | 90 |
U-Net | Denoise images | 91 |
CARE | Restore the missing training image | 92 |
U-Net | Inpaint and denoise the 3D image; achieve dynamic volumetric imaging | 93 |
U-Net | Use star-convex polygons to separate the overlap nuclei | 94 |
U-Net | Utilize star-convex polyhedral to segment the nuclei in the 3D image | 95 |
SVM + U-Net | Segment cell membrane with 0.94 segmentation and 0.87 classification accuracy | 96 |
3D U-Net | Segment the cell in confocal microscopy image and deep tissue | 97 |
Cellpose | Segment the cell in various types for 2D and 3D image | 98 |
Extended Cellpose | Improve the classification accuracy in 3D image | 101 |
Omnipose | Segment the cell and expand into any 3D objects | 102 |
Fig. 6 (a) Schematic of P-CDs sensing the pH.104 Reproduced from ref. 59 with permission from the Royal Society of Chemistry, copyright 2019. (b) The schema of synthesizing RCDs and its “off–on” property for the H+/OH−.106 Reproduced from ref. 61 with permission from Elsevier, copyright 2019. (c) Schematic diagram of a mechanism for the pH-based CDs response on the inner filter effect of Aniline Blue on the emission of Rhodamine B.107 Reproduced from ref. 62 with permission from American Chemical Society, copyright 2021. (d) The schematic diagram of Ca, N, and S-CDs with pH-responsive properties applied into pH detection, in vivo pH image, and Nucleus-targeting pH image.108 Reproduced from ref. 63 with permission from Elsevier, copyright 2023. |
Fig. 7 (a) The color change and analysis of CDs with Fe3+ added gradually.109 (b) The curve between the ratio of yellow/green and the concentration of Fe3+. Reproduced from ref. 64 with permission from American Chemical Society, copyright 2022. The absorption (c) and fluorescence spectrum (d) of CDs upon adding the Cu2+ and Hg2+.111 Reproduced from ref. 66 with permission from Springer Nature, copyright 2021. (e) The map of FL spectra for the CDs under different concentrations of Cu2+ (0–150 μmol L−1). (f) The relative changes of the fluorescence intensity (F/F0) of the CDs with different concentrations of Cu2+ (inset: the linear relationship between F/F0 and the concentration of Cu2+).112 Reproduced from ref. 67 with permission from the Royal Society of Chemistry, copyright 2022. (g) The schematic diagram of CDs detecting Cl− based on the “on–off–on” mechanism.113 Reproduced from ref. 68 with permission from Frontiers, copyright 2021. |
Fig. 8 (a) The schematic diagram of CDs sensing glucose and lactate in saliva samples.114 Reproduced from ref. 69 with permission from Elsevier, copyright 2021. (b) The illustration of the preparation of rGO-PBA and use in detecting glucose.116 Reproduced from ref. 71 with permission from Elsevier, copyright 2022. (c) The schematic image of CA-CDs detecting cysteine through oxidation.117 Reproduced from ref. 72 with permission from Elsevier, copyright 2022. |
There was a linear relationship between the concentration of the analytes and the intensity of fluorescence. The limits of detection for glucose and lactate were 2.60 × 10−6 and 8.14 × 10−7 mol L−1, respectively. Yang et al.115 fabricated green hydrophilic CDs to detect glutamic acid and aspartic acid, with detection limits of 1.69 and 1.24 μM, respectively.
To avoid the cost of the enzyme, some studies used the “on–off” mechanism to detect glucose. Zhou et al.116 used phenylboronic acid functionalized reduced graphene oxide (rGO-PBA) and polyhydroxy-modified CDs to detect glucose. As shown in Fig. 8b, the modified CDs were attached to the surface of rGO-PBA owing to the affinity between the phenylboronic acid and hydroxyl groups. The quenching of fluorescence was related to photoinduced electron transfer from the CDs to graphene oxide. When glucose was added to the solution, its stronger binding displaced the CDs from the graphene surface, leading to fluorescence recovery. The amount of glucose added had a linear relationship with the recovered fluorescence, with a detection limit of 0.01 M. Furthermore, Lin et al.117 designed carbon-based enzymes to detect cysteine (Cys) through their oxidase-mimicking property. CA-CDs were prepared by a one-step solvothermal method using citric acid (CA) as the single carbon source, and had a higher affinity for Cys and better catalytic ability than the natural enzyme. Fig. 8c shows that Cys was oxidized by CA-CDs to form cystine and H2O2. Subsequently, the Cys can facilitate the decomposition of H2O2 to produce the OH˙ radical. Then, terephthalic acid (TA) in the solution captured the radicals to form TA-OH, leading to strong fluorescence. Therefore, the combination of CA-CDs and TA accurately detected Cys with a detection limit of 0.036 μM through the intensity of fluorescence. More importantly, this research not only provides a new approach to detecting the concentration of analytes with a single simple raw material, but also establishes a foundation for the proposal and synthesis of carbon-based enzymes.
Fig. 9 (a) The process of establishing the ML model and application.118 Reproduced from ref. 73 with permission from American Chemical Society, copyright 2020. (b) The input and output features for the XGBoost model.88 Reproduced from ref. 58 with permission from American Chemical Society, copyright 2022. (c) The CDs-1 can be quenched by Fe3+, and has high selectivity and low limit of detection with Fe3+. (d) The schematic diagram of utilizing the XGBoost model to enhance the QY.119 Reproduced from ref. 74 with permission from the Royal Society of Chemistry, copyright 2022. |
Compared with traditional sensing methods, approaches based on machine learning (ML) can detect analytes directly and quickly, saving time. Pandit et al.120 combined ML with a fluorescent array to detect eight proteins based on their different optical patterns. Seven ML algorithms were applied to the data, and the recognition accuracy was greater than that of traditional linear discriminant analysis (LDA), reaching 100% prediction efficiency. Shauloff et al.121 used interdigitated electrodes (IDEs) coated with CDs to record capacitance differences, distinguishing and sensing bacteria through the distinctive gases to which the CDs respond.
ML was used to study the capacitive response data for sensing different vapors. It exhibited excellent performance in distinguishing single and mixed groups, which can further identify the kinds of bacteria. The investigation facilitated the real-time detection of bacteria and the application of ML. Xu et al.122 constructed a fluorescence sensor array containing CDs and a lanthanide complex (EDTA-Tb) to sense multiple heavy metal ions through ML. Different ions combined with the sensor array produced different multi-dimensional data (Fig. 10a), from which the SX model could distinguish the ions and predict their concentrations. The SX model was precise, with an accuracy of up to 95.6% in the experimental samples. Liu et al.123 developed a smartphone-based nanoprobe sensing platform using machine learning to detect glutathione (GSH) and azodicarbonamide (ADA). YOLOv3, a deep learning algorithm, was used to process the images and establish a model relating the concentration of the analytes to the fluorescence ratio, as shown in Fig. 10b. The method was integrated into a WeChat app, which is convenient for smartphone-based handheld devices. The detection showed excellent sensitivity in the concentration ranges of 0.1–200 and 0.5–160 μM, with detection limits of 0.07 and 0.09 μM for GSH and ADA, respectively. Lu et al.124 designed a tri-color fluorescent optical device based on an ML algorithm and a smartphone to detect tetracycline antibiotics (TC), as shown in Fig. 10c. The trichromatic sensor consisted of blue CDs (BCDs) and dual-emission red CDs, the latter being sensitive to TC. Upon gradually increasing the concentration of TC, the sensor color changed from red to cyan. The YOLOv3 algorithm processed photographs of the solution, enabling the visual detection of TCs. The technique was integrated into the smartphone, providing real-time antibiotic detection.
Xu and co-workers125 used a carbon quantum dot-based fluorescence sensor array to sense tetracyclines, which mainly comprise four kinds: tetracycline (TC), oxytetracycline (OTC), doxycycline (DOX), and metacycline (MTC). An SVM was applied to process and analyze the sensor array dataset and could distinguish the four kinds of tetracyclines. Moreover, the model is extensible and can sense binary mixtures. In river and milk samples, the model also showed excellent detection ability. Wang et al.126 applied ML to a dual-emission fluorescence/colorimetric sensor array to detect nine antibiotics. The input parameters were the differences in fluorescence intensity and the maximum emission wavelengths, while the antibiotic category and corresponding concentration were the output parameters. The whole pipeline was optimized by the tree-based pipeline optimization technique (TPOT). The ML model detected nine antibiotics at 0.5–50 μM with 95% accuracy. For unknown samples, the model could also distinguish the different kinds and quantify their concentrations. Jafar et al.127 used an SVM to sense the concentration of nitrate and found that the polynomial kernel with γ = 0.20 performed best, with an MSE of 0.0016 and an R2 of 0.93. The ML model was integrated into a smartphone application, which is convenient for online detection. Applying the SVM also extended the usable lifetime of the biosensor to at least 10 days. Gonzalez-Navarro et al.128 modeled the amperometric response to glucose as a function of variables such as temperature, benzoquinone, and pH under different conditions. Four kinds of ML regression models were compared, and the radial basis function-based SVM (SVM-R) performed best, with an R2 of 0.999. 
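The regression results above are reported as MSE and R2. For readers less familiar with these metrics, the sketch below shows how both are computed from measured and predicted concentrations; the numbers are made up for illustration and are not taken from the cited studies.

```python
def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. model-predicted concentrations.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

error = mse(measured, predicted)
score = r_squared(measured, predicted)
```

A low MSE together with an R2 close to 1 indicates that the model explains most of the variance in the measured signal.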
Because the sensitivity of the sensor response is strongly related to these variables, their interactions should be optimized to maximize the output signal; a genetic algorithm and simulated annealing were used for this purpose, resulting in low generalization error. Rong et al.129 used an SVM to analyze sensing data to detect small proteins. SVMs with four kernels (polynomial, sigmoidal, linear, and radial basis function) were compared to find the optimal model, and the radial basis function kernel showed the best performance. The model achieved an accuracy of 98% on the binding interactions between proteins and DNA. In general tests, the corresponding code outperformed equivalent circuit analysis. The model can be integrated into a smartphone for quick detection.
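The four SVM kernels compared above differ only in how they measure similarity between two feature vectors. The sketch below writes them out in their usual textbook form; the default γ = 0.2 simply echoes the value quoted for the nitrate study, while `coef0` and `degree` are illustrative assumptions rather than settings reported in the cited works.

```python
import math


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=0.2, coef0=1.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=0.2):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=0.2, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


# Two hypothetical sensor response vectors.
sample_a = [1.0, 2.0]
sample_b = [3.0, 4.0]
similarities = {
    "linear": linear_kernel(sample_a, sample_b),
    "polynomial": polynomial_kernel(sample_a, sample_b),
    "rbf": rbf_kernel(sample_a, sample_b),
    "sigmoid": sigmoid_kernel(sample_a, sample_b),
}
```

In practice the kernel and its parameters are chosen by cross-validation on the sensing data, which is the kind of comparison the studies above describe.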
Fig. 10 (a) Schematic of the multi-emission array sensor used to detect heavy metal ions based on the SX model.122 Reproduced from ref. 77 with permission from Elsevier, copyright 2022. (b) Diagram of GSH and ADA determination utilizing YOLO v3 and WeChat.123 Reproduced from ref. 78 with permission from Elsevier, copyright 2022. (c) Illustration of the prepared tri-color CDs and their application in TCs detection based on deep learning and a smartphone.124 Reproduced from ref. 79 with permission from Elsevier, copyright 2023. |
Model | Result | Ref. |
---|---|---|
XGBoost | Improve the QY to 39.3% and apply CDs to sense Fe3+ | 118 |
XGBoost | Max PL intensity and emission center and applied CDs to sense Fe3+ | 88 |
XGBoost | Enhanced the QY 200% higher than the pristine and applied CDs to sense H2O2 | 119 |
LDA | Detect eight proteins with 100% accuracy | 120 |
PCA | Detect the kinds of bacteria | 121 |
SX-model | Sense seven heavy metal ions with 95.6% accuracy in the experimental sample | 122 |
SX-model | Sense nine antibiotics at 0.5–50 μM with 95% accuracy | 126 |
SVM | Sense tetracyclines and the binary mixture | 125 |
YOLO V3 | Detect GSH and ADA in the ranges of 0.1–200 and 0.5–160 μM with detection limits of 0.07 and 0.09 μM, respectively | 123 |
YOLO V3 | Detect tetracycline antibiotics | 124 |
SVM | Sense the concentration of nitrate with R2 of 0.93 | 127 |
SVM | Sense the concentration of glucose with R2 of 0.9999 | 128 |
SVM | Detect small proteins with higher accuracy than the equivalent circuit analysis | 129 |
Fig. 11 (a) Schematic diagram of the FA-Cβ-MSCDs getting through the cell and releasing ETO.130 Reproduced from ref. 82 with permission from Elsevier, copyright 2018. (b) The principle of CDs combined with DOX to form CDs-DOX.131 (c) The cell uptake of CDs-DOX and application.131 Reproduced from ref. 83 with permission from Elsevier, copyright 2019. (d) The illustration of the MSNs-CDs@DOX preparation and cancer cell targeted-delivery.133 Reproduced from ref. 85 with permission from Springer Nature, copyright 2019. |
Permatasari et al.137 synthesized CDs with blue-yellow emission using urea and citric acid as precursors through microwave-assisted hydrothermal treatment, as displayed in Fig. 12a. The study showed that a high pyrrolic-N content is key to enhancing the NIR absorption of CDs, whose photothermal efficiency reached 54.2% (Fig. 12b). Kim et al.138 obtained sulfur-doped CDs (S-CDs) from Camellia japonica flowers as the principal raw material, for use as cancer therapeutic agents. The S-CDs had high NIR absorption, with a photothermal conversion efficiency of up to 55.4% under 808 nm irradiation. Under NIR laser irradiation, tumors in mice treated with S-CDs gradually shrank and formed a black scar.
Fig. 12 (a) Scheme illustrating the CDs with strong NIR absorption.137 (b) The absorption spectrum of CDs with rich Pyrrolic-N.137 Reproduced from ref. 89 with permission from American Chemical Society, copyright 2018. (c) Illustration of the working mechanism for the two-photon excitation S-Se-CDs.139 Reproduced from ref. 91 with permission from Springer Nature, copyright 2017. (d) Illustration of CD@MSN/ICG targeting the tumor and killing cells based on the PTT.141 Reproduced from ref. 93 with permission from the Royal Society of Chemistry, copyright 2019. |
To further enhance the performance of CDs, researchers have also started to dope them with various metals and their derivatives. Lan et al.139 prepared S, Se co-doped CDs (S-Se-CDs) through hydrothermal treatment, with polythiophene and diphenyl diselenide as precursors in alkaline solution. The co-doped CDs had two NIR emission peaks at 731 nm and 820 nm, and a photothermal conversion efficiency of 58.2%, as shown in Fig. 12c. Nanoparticles acting as carriers or agents can also be combined with CDs to enhance the photothermal property and improve anti-cancer effectiveness. Qian and collaborators140 used hydrogen bonding to assemble CDs into the framework of mesoporous silica nanoparticles (MSNs), obtaining the composite CD@MSNs. The CD@MSNs can degrade in the cell and aggregate the dispersed CDs, which enhances the photothermal efficacy, kills cancer cells, and inhibits tumor metastasis. Benny Ryplida et al.141 loaded CDs into MSNs with pH-responsive indocyanine green (ICG) through electrostatic interactions. Under acidic pH, the CD@MSN/ICG released the CDs, which absorbed the excitation light and converted it into heat to kill cancer cells, as displayed in Fig. 12d. Peng and co-workers142 used a one-step microwave-assisted carbonization method to deposit CDs on Prussian blue nanoparticles (CDs/PBNP). The PBNPs had a photothermal property and the CDs exhibited strong green emission, so the CDs/PBNP can act as both an imaging agent and a PTT agent.
CDs are used as PS and imaging agents in the PDT field, owing to their non-toxicity and excellent optical properties. Zhao and co-workers doped N and S into CDs, obtaining N, S-CDs with red emission.147 These CDs accumulated in the lysosomes and mitochondria of tumor cells and produced singlet oxygen (1O2) to induce cell death under light irradiation. Wang et al.148 synthesized Cu-doped CDs (Cu-CDs) via pyrolysis, which can be used in imaging-guided PDT for tumor treatment, as shown in Fig. 13a. The Cu-CDs showed a high 1O2 QY (36%) and photoinduced toxicity, which can induce cell apoptosis. To enhance cancer treatment performance, CDs serving as PS that target nucleic acids were developed. Xu et al.149 synthesized Se and N co-doped CDs (Se/N-CDs), which bound to RNA and were transported near the nucleus to generate ROS, as shown in Fig. 13b. Pang et al.150 designed novel CDs that target the nucleolus and produce ROS, which improved performance at low CD concentrations.
Fig. 13 (a) The illustration of Cu-CDs synthesis and imaging-guided PDT application.148 Reproduced from ref. 100 with permission from American Chemical Society, copyright 2019. (b) Schematic of light-induced entry into the nucleus.149 Reproduced from ref. 101 with permission from Elsevier, copyright 2020. (c) The illustration of PDT for CDs/Ce6-HA conjugate.152 Reproduced from ref. 102 with permission from Elsevier, copyright 2015. |
CDs also act as carriers that enhance solubility and biocompatibility, thereby improving the photodynamic effect. Yang et al. loaded the insoluble chlorin e6 (Ce6, a PS) into CDs to obtain CDs/Ce6, which had good solubility, biocompatibility, and high fluorescence resonance energy transfer (FRET) efficiency.151 The system enhanced the photodynamic process, with the 1O2 QY increased by 200% compared with pristine Ce6. Beack et al.152 formed a CDs/Ce6-HA conjugate by dehydration condensation between the amino group of diaminohexane-modified hyaluronate (HA-DHA) and the carboxyl group of CDs/Ce6. The conjugate could target tumor cells and generated more 1O2 than Ce6 and CDs/Ce6, which was related to its enhanced solubility (Fig. 13c).
Xie et al.153 used the fast correlation-based filter (FCBF) algorithm to find five important biomarkers for detecting early lung cancer in plasma data. Ail and co-workers used an artificial intelligence diagnostic platform to recognize cancer malignancy, invasiveness, and grading in blue light cystoscopy images.154 The ML models achieved high classification sensitivity and specificity, up to 95.77% and 87.44%, respectively.
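Sensitivity and specificity, as quoted for the cystoscopy classifier, are derived from the confusion matrix of a binary classifier. The sketch below computes both from true and predicted labels; the label encoding (1 for malignant, 0 for benign) and the example data are assumptions made for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = positive)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # malignant case correctly flagged
        elif truth == 1 and pred == 0:
            fn += 1  # malignant case missed
        elif truth == 0 and pred == 0:
            tn += 1  # benign case correctly cleared
        else:
            fp += 1  # benign case wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical ground truth and model output for nine images.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0]
model = [1, 1, 1, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(truth, model)
```

High sensitivity means few malignant cases are missed, while high specificity means few benign cases are flagged unnecessarily.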
AI-assisted diagnosis of this kind showed potential for cancer diagnosis, improved cancer detection rates, and can support doctors' treatment decisions. ML also helps discover the relationship between drugs and tumor tissue, which benefits drug design and shortens the drug development cycle. Grisoni et al.155 used long short-term memory (LSTM) cells to design anticancer peptides (ACPs) that selectively kill cancer cells. Half of the ACPs killed cancer cells without affecting human erythrocytes, with more than threefold selectivity for MCF7 cells. Furthermore, an ensemble model was proposed to design effective ACPs.156 Four counter-propagation artificial neural networks (CPANNs) were combined into one ensemble model, in which half belonged to the lung cancer model and the other half to the breast cancer model, as shown in Fig. 14a. The ensemble model designed 14 peptides de novo for testing on MCF7 cells and A549 cells (lung cancer cells). The results showed that six peptides had anticancer ability, five of which were effective against both types of cancer cells.
Fig. 14 (a) Schematic representation of the ensemble prediction model.156 Reproduced from ref. 108 with permission from Springer Nature, copyright 2019. (b) The illustration of the model for identifying the relationship between gene expression and drug sensitivity.157 Reproduced from ref. 109 with permission from Springer Nature, copyright 2018. (c) Diagram of the correlation analysis and differential expression analysis on patient-derived xenograft (PDX) tumors.158 Reproduced from ref. 110 with permission from Springer Nature, copyright 2020. |
Nowadays, some ML models are applied to predict patients' drug responses, which differ because of individual variability. Lee et al.157 investigated the relationship between gene expression and drug sensitivity in acute myeloid leukemia (AML), as shown in Fig. 14b. They collected genome-wide gene expression profiles and chemotherapy drug sensitivity data, and created a novel method that uses prior anticancer knowledge to identify robust gene expression biomarkers. Kim and partners conducted a correlation analysis between gene expression and dose–response in patient-derived xenograft (PDX) models (Fig. 14c), and built a drug response model via random forest.158 The model predicted biomarkers for different types of cancer, such as breast cancer, pancreatic cancer, colorectal cancer, and non-small cell lung cancer, for application to drug therapy. Kong et al.159 used ridge regression to identify biomarkers from data obtained with organoid culture models. The identified biomarkers accurately predicted drug responses in 114 colorectal cancer patients treated with 5-fluorouracil and 77 bladder cancer patients treated with cisplatin. The consistency between the experimental and control groups verified the predicted results. This work combined gene data and ML to efficiently predict drug responses in cancer patients. To explain the drug mechanism within the ML model, Deng et al.160 constructed a deep neural network (DNN) that introduces a layer of pathway nodes as a more interpretable hidden layer. The performance of the DNN was evaluated on different independent drug sensitivity datasets, and the results showed that it had clear advantages over eight other standard regression models. 
More importantly, they found that the activity of disease-related nodes decreased as the drug input propagated forward, revealing that treating cancer with drugs suppresses the disease pathway.
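Ridge regression, used above to identify drug-response biomarkers, shrinks regression coefficients with an L2 penalty. The sketch below fits the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy on a tiny hypothetical expression matrix and ranks genes by coefficient magnitude. The ranking step, the gene names, and the data are illustrative assumptions, not the procedure of the cited study.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_fit(X, y, lam=1.0):
    """Ridge weights without intercept; assumes X and y are centered."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical centered expression of two genes in four samples,
# and a hypothetical centered drug-response score for each sample.
genes = ["gene_A", "gene_B"]
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.5, -2.0, -0.5]

weights = ridge_fit(X, y, lam=2.0)
ranked = sorted(zip(genes, weights), key=lambda gw: abs(gw[1]), reverse=True)
```

Increasing `lam` pulls all weights toward zero, which stabilizes the fit when there are many more genes than samples.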
Using a single anticancer drug is uncommon because tumor cells develop drug resistance easily and quickly, so multiple drugs are adopted in realistic therapeutic schedules. Combining multiple drugs can produce three kinds of effect: additive, antagonistic, and synergistic. Drug synergy is the most desirable of these, as it can greatly enhance the anticancer effect, so predicting drug synergy is important for the application of multiple drugs. Liu and co-workers developed the TranSynergy model based on a knowledge-enabled, self-attention transformer.161 In addition, the Shapley Additive Gene Set Enrichment Analysis (SA-GSEA) method was proposed and applied to deconvolute genes, which improved the interpretability of the model. The combination of TranSynergy and SA-GSEA provided new insight into the discovery of anticancer therapies. Ail et al.162 used an in silico biological network to recognize synergistic anticancer drug pairs. Based on drug-perturbed transcriptome profiles and biological network analysis, they proposed six relevant network biology features and trained the model on a public drug synergy dataset. The model could discern whether two drugs were synergistic and explain the result in terms of the molecular networks they act through. Regan-Fendt et al.163 designed a new computational approach based on gene expression and network mining for disease analysis and drug combination prediction. The new method considered connectivity mapping and network centrality for transcriptomics. Its predictions were tested on public gene expression and mutation data, and the drug combination results were confirmed by high-throughput experiments.
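The three outcomes named above (additive, antagonistic, synergistic) are defined relative to a reference model of non-interacting drugs. The sketch below uses Bliss independence as that reference, which is one common choice and an assumption here, since the text does not state which reference the cited models were trained against. Effects are fractional inhibitions between 0 and 1, and the tolerance is an arbitrary illustrative value.

```python
def bliss_expected(effect_a, effect_b):
    """Expected combined effect if the two drugs act independently."""
    return effect_a + effect_b - effect_a * effect_b


def classify_combination(effect_a, effect_b, observed, tol=0.05):
    """Label a drug pair by its excess over the Bliss expectation."""
    excess = observed - bliss_expected(effect_a, effect_b)
    if excess > tol:
        return "synergistic"
    if excess < -tol:
        return "antagonistic"
    return "additive"


# Hypothetical single-drug inhibitions of 40% and 50%.
expected = bliss_expected(0.4, 0.5)
labels = [classify_combination(0.4, 0.5, obs) for obs in (0.90, 0.72, 0.50)]
```

Synergy prediction models are typically trained to predict such an excess score for each drug pair and cell line.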
Model | Result | Ref. |
---|---|---|
FCBF | Detect early lung cancer in the plasma data | 153 |
CNNs | Recognized the tumor with 95.77% sensitivity and 87.44% specificity | 154 |
LSTM | Design the anticancer peptides and reach the selective killing of the cancer cell. | 155 |
CPANN | De novo designed 14 peptides for the MCF7 cell and A549 cell | 156 |
RF | Find the relationship between gene expression and dose–response on the PDXs; predict the biomarkers for different types | 158 |
Ridge regression | Predict drug responses from identified biomarkers in colorectal and bladder cancer patients | 159 |
DNN | Reveal the suppression of the disease path by treating cancer with drugs | 160 |
TranSynergy | Predict the drug synergy and combine with SA-GSEA to discover anticancer therapy | 161 |
In silico biological network | Recognize synergistic anticancer drug pairs | 162 |
SynGeNet | Analyze disease and predict the drug synergy | 163 |
However, two significant problems may impede the development of ML in the biological field. First, data from prepared CDs take a lot of work to gather. Laboratories urgently need to establish databases, which can effectively address missing experimental conditions. A database provides unified management and control of data and dramatically improves data integrity and security. The balance of data is also a crucial issue, as it directly influences model performance. Research tends to be biased toward successful results, and failed experiments often go unrecorded, so more attention should be paid to the proportion of successful and failed data. Second, current research focuses on changing the experimental conditions while keeping the recipe for synthesizing CDs the same, which limits how far the material performance can be tuned. In the future, different precursors and various solvents can be used as experimental conditions to predict the relevant properties by ML. This requires familiarity with the chemical structure and physical properties of the raw materials, and a set of descriptors or features to represent the material in the dataset.
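One simple way to compensate for the imbalance between successful and failed experiments is to weight each class inversely to its frequency during training. The sketch below uses the common heuristic weight = n_samples / (n_classes × class_count); the labels and counts are hypothetical, and reweighting is only one of several possible remedies.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical synthesis records: 8 successful and 2 failed runs.
records = ["success"] * 8 + ["failure"] * 2
class_weights = balanced_class_weights(records)
```

With these weights, each rare failed run contributes as much to the training loss as four successful runs, so the model is not rewarded for ignoring failures.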
The automation of acquisition pipelines and new microscopy technologies have broken the limits of temporal and spatial resolution, dramatically increasing the amount of bioimage data. The major challenge is to interpret these image datasets in a quantitative, automatic, and efficient way. Computer vision and image analysis are applied to microscopic images to extract biological information and generate databases, which expedites the development of bioimaging. For supervised machine learning algorithms, the images need to be labeled manually, which is time-consuming and laborious. In turn, the labeled data determine the performance of the model, so special attention should be paid to data selection and labeling.
Biosensors inevitably produce some irregular signal noise, which leads to poor stability and limits their commercialization. Determining the corresponding analyte concentration is essential and relies on analyzing the sensing data. ML provides a new method of data analysis: it can perform a series of noise-processing steps on the data and automatically predict the type or concentration of analytes according to the decision system. In addition, biosensors are easily affected by the sample environment and operating conditions, which can greatly disturb the results. ML can detect anomalies and exclude outliers by rule. At present, combining biosensors and machine learning for health monitoring is a worthy challenge. Biosensors can continuously detect the corresponding indicators and provide sequential data, while ML analyzes the data and evaluates biological health using algorithms such as RNN and LSTM.
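The noise processing and rule-based outlier exclusion described above can be illustrated with two basic preprocessing steps: flagging readings by a modified z-score based on the median absolute deviation, then smoothing with a moving average. This is a generic sketch, not a method taken from the cited works; the threshold of 3.5, the window size, and the readings are assumptions.

```python
from statistics import median


def flag_outliers(signal, threshold=3.5):
    """Flag readings whose modified z-score (MAD-based) exceeds threshold."""
    med = median(signal)
    mad = median([abs(x - med) for x in signal])
    if mad == 0:
        return [False] * len(signal)
    return [abs(0.6745 * (x - med) / mad) > threshold for x in signal]


def moving_average(signal, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        segment = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# Hypothetical sensor readings with one spurious spike.
readings = [1.0, 1.1, 0.9, 1.0, 5.0, 1.1, 0.9]
flags = flag_outliers(readings)
cleaned = [x for x, is_outlier in zip(readings, flags) if not is_outlier]
smoothed = moving_average(cleaned)
```

Removing outliers before smoothing keeps a single spike from contaminating its neighbors in the averaged signal.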
ML has been extensively studied in tumor therapy to diagnose tumors and predict a patient's condition, with considerable success. However, some limitations and obstacles must be overcome before it can be widely used in the clinic. As the amount of CT and MR imaging continues to grow, the management of medical data is a major barrier. Collating the imaging data requires trained professionals for labeling, annotation, segmentation, quality control, and application, which makes the process expensive in both time and cost. It is therefore important to develop automated image-processing software. The use of data is also restricted by permission and privacy concerns, which makes wide clinical application of ML hard. Moreover, ML is a black-box model that cannot be fully explained, which limits its application. The current development of data visualization can help explain the principles of machine learning to a certain extent, accelerating the integration of ML and cancer therapy.
This journal is © The Royal Society of Chemistry 2023 |