Ali Nabi Duman*a and Almaz S. Jalilov*b
aDepartment of Mathematics and Statistics, University of Houston-Downtown, Houston, USA. E-mail: dumana@uhd.edu
bDepartment of Chemistry and Interdisciplinary Research Center for Advanced Materials, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia. E-mail: jalilov@kfupm.edu.sa
First published on 29th July 2024
Carbon dots (CDs) are currently one of the most active topics in nanoparticle research. To be used in applications such as medical imaging and diagnostics, pharmaceutics, optoelectronics, and photocatalysis, CDs must be synthesized with carefully controlled properties. This is often a tedious task because nanoparticle syntheses frequently involve multiple chemicals and are carried out under complex experimental conditions. Emerging data-driven methods from artificial intelligence (AI) and machine learning (ML) provide promising tools to go beyond the time-consuming and laborious trial-and-error approach. In this review, we focus on recent uses of ML to accelerate exploration of the CD chemical space. Future applications of these methods could address the current limitations in CD synthesis, expanding the potential uses of these intriguing nanoparticles.
Almaz Jalilov obtained his PhD in Chemistry from the University of Wisconsin. After postdoctoral positions at Northwestern University and Rice University, he joined King Fahd University of Petroleum and Minerals, where he is now an Associate Professor of Chemistry. His research interests include physical organic chemistry and materials chemistry, with an emphasis on the mechanism-based design of carbon nanoparticles for sustainable energy and catalysis.
Ali Nabi Duman received a PhD from the University of British Columbia in 2010. He was an Assistant Professor of Mathematics at King Fahd University of Petroleum and Minerals from 2015 to 2023. In 2023, he joined the University of Houston-Downtown, where he is now an Assistant Professor of Data Science. His research focuses on topological data analysis with applications in neuroimaging, microscopy and genomics.
CDs can be synthesized by bottom-up or top-down methods.16,23 In top-down methods, large carbon materials are cut into carbon structures smaller than 10 nm. The demanding physical procedures used to break down the carbon materials (e.g., graphite, graphene oxide, carbon nanotubes, activated carbon, soot) involve laser ablation, arc discharge and nanolithography under unfavorable conditions such as strong oxidants, concentrated acids, and high temperatures.24–33 The more adaptable and accessible bottom-up methods usually include ultrasound synthesis, chemical oxidation, room-temperature synthesis, and hydrothermal and solvothermal processing of relatively small molecular precursors.34 Although these methods may involve high temperatures/pressures, long reaction times, or toxic solvents, the use of microwaves in solvothermal synthesis partially solves these issues by reducing the reaction time and the amount of solvent.35–40 The room-temperature method is another advantageous technique because it does not require complicated machinery or harsh synthesis conditions, making it environmentally friendly and sustainable.41–43 Hence, the simple setup, low cost and access to a wide variety of precursors make bottom-up methods more favorable than top-down methods.
CDs can be divided into four main classes according to their carbon core structure, surface functionalities, and performance features: (i) graphene quantum dots (GQDs), (ii) carbon nanodots (CNDs), (iii) carbon quantum dots (CQDs), and (iv) carbonized polymer dots (CPDs).44,45 GQDs are usually synthesized by using a top-down approach, while preparation of CNDs and CQDs is mainly done by using bottom-up methods.46–48 A variety of models (e.g. polycyclic aromatic hydrocarbons, molecular fluorophores, or sp2/sp3 hybrid spherical structures) are used to explain the different structures of CDs.49–52
Because bottom-up synthesis of CDs requires high temperatures, multiple reaction pathways occur and form a considerable amount of by-products. Along with irregular mass transfer, low reproducibility is also common, as reported in earlier studies.35,53 One way to optimize the target properties is to scan a large space of experimental synthesis conditions, including the reaction temperature, the mass of precursor, ramp rate, and reaction time. However, the high complexity of the extracted data, the repetitive experimental procedures, and the lack of predictability make such a scan very time-consuming. For example, it is still unclear how CQDs emit their fluorescence because the process is very complicated. It is customary to analyze the pH-dependent photoluminescence (PL) spectra of CQDs at a fixed excitation and ignore all other potential excitations; however, this approach extracts only a portion of the available data.39,54–60 On the other hand, the complexity of data analysis can rise along with the number of PL measurements. Similarly, CDs reported in the literature were frequently optimized by varying one reaction parameter while fixing the others, without considering the complex relationships between reaction parameters during CD synthesis. Therefore, there is a need for methods that accelerate the screening of the necessary parameters in order to create CDs with enhanced features and applications.
Quantum mechanics methods such as density functional theory (DFT) provide a reliable computational solution to search a reasonably designed parameter space.61,62 These semi-empirical approaches can be used to explore the electronic structure and chemical reactivity of CDs.63–67 Density-functional-based tight binding (DFTB) is another semi-parametric method which approximates DFT in a tight binding framework.68–71 DFTB requires fewer empirical parameters and is computationally more efficient than DFT. The mechanism of graphene formation and single-walled carbon nanotube nucleation are examples studied using DFTB.72,73 However, these semi-empirical methods are computationally too costly for a large search space. The alternative approaches to reduce the entire search space include optimization and gradient based algorithms. The accuracy and computational performance of these methods depend on the initially determined parameters; hence, they might return different results from the different initial values and potentially end up in local minima.
Data-driven approaches based on machine learning (ML) algorithms provide an alternative to the abovementioned computational methods for describing the structure and properties of CDs. As a branch of artificial intelligence, ML employs statistical and probabilistic methods to learn from a given dataset by optimizing performance measures for particular tasks.74,75 Certain methods can detect the relationship (correlations/inference) between input variables and the target variable. Instead of screening the entire parameter space, ML methods learn hidden patterns from a limited amount of data. The trained models are then used to predict the target variables from previously unseen input variables. As a result of the increasing amount of experimental data and accessible computational power, ML has been applied successfully in a variety of fields including image/speech recognition, cancer research, chemical synthesis, and protein structure prediction.76–84
In materials science, ML has attracted a lot of interest in applications such as materials discovery, materials structure/property prediction, performance optimization, and acceleration of the protocols for nanoparticle synthesis.85–95 Using ML, the reaction parameters and their effects on nanoparticle synthesis can be revealed objectively,96,97 and the synthesis process can be made more efficient by choosing appropriate evaluation criteria including shape, size, polydispersity, and surface chemistry.98 ML accelerates not only the experimental protocols but also the search for new semiconductor, metal, carbon-based, and polymeric nanoparticles with superior features at low computational cost.8 The large amount of data needed for ML algorithms can be obtained using computational or experimental methods. Numerous databases, such as the Materials Project, Automatic Flow for Materials Discovery, Open Quantum Materials Database, and Novel Materials Discovery, provide access to data on many materials, complementing data generated by computer simulations.
In particular, the use of ML in the field of CDs has generated a lot of research interest in recent years. The proper adjustment of a variety of variables, including precursors, temperature, and reaction time, is necessary for the successful preparation of CDs. These variables can readily be used as input parameters for an ML model, which is trained with the available experimental data and then generates new predictions. Therefore, ML can help establish the relationship between precursors and desired properties, which may lead to design principles for further study, significantly shorten the synthesis cycle, and lower the cost of CDs.
Many outstanding reviews on the applications of computational and ML methods to the nanoparticle synthesis have been published.98–106 Although some of them focus on CD synthesis, they partially cover the development of ML methods along with the experimental techniques.99 The more general reviews include CDs as a subcategory of nanoparticles,98,100 quantum dots,103,104 and graphene-based101,105 or polymer-based106 materials. Theoretical methods such as quantum mechanics and/or molecular mechanics approaches applied to CDs are also available in the literature.102 To the best of our knowledge, a thorough review of ML applications specifically for CD synthesis is lacking. In this review, we outline the primary ML algorithms in the context of CD research, discuss recent studies on ML applications for CD synthesis, and enumerate potential future directions for this rapidly expanding field of study (Table 1).
ML models | Input | Output | Samples | Ref. |
---|---|---|---|---|
MLR, poly. reg. | Microstructural features | Thermoelectric performance | 322 | 107 |
Lin. reg., poly. reg. | Synthesis process parameters | UV-visible and PL spectra | 44 | 108 |
MLR, KNN | Synthesis process parameters | PLQY, PL peak position | 227 | 109 |
MLR | PL characteristics | Temperature sensing accuracy | 121 | 110 |
Random forest | Reaction parameters | Emission wavelength, Stokes shift, PLQY | 480 | 111 |
Log. reg., KNN, SVM | CD fluorescence sensor array | Protein classification | 48 | 112 |
CNN | Synthesis process parameters | Spectral properties and FL colors | 170 | 113 |
ANN | Synthesis process parameters | Color classification, emission wavelength | 407 | 114 |
ANN | CDs fluorescence variation maps | Amino acid classification | 90 | 115 |
Multilayer perceptron | Synthesis process parameters | PLQY | 30 | 116 |
CNN | Emission/PL decay data of CDs | Ethanol content prediction | 597 | 117 |
XGBoost | Synthesis process parameters | PLQY | 391 | 82 |
XGBoost | Synthesis process parameters | PLQY | 467 | 118 |
PCA, XGBoost | Synthesis process parameters | FL intensity, emission centers | 400 | 119 |
XGBoost | Reaction parameters of CD catalysts | Failure/success of oxidation of C–H bonds | 652 | 120 |
GBDT | Biochar preparation parameters | Fluorescence quantum yield | 480 | 121 |
Random forest | Precursor combinations | PL wavelength intensity | 202 | 122 |
Random forest, GA | Synthesis process parameters | Corrosion inhibition efficiency | 102 | 123 |
PCA, MCR, NMF | Wavelengths, pH | Unsupervised clustering | 401 | 124 |
LDA, SVM | CD fluorescence sensor array | Tetracycline classification | 92 | 125 |
$$y = x^{T}\beta.$$
The most common objective function to determine the coefficients β is the residual sum of squares:
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \left(y_i - x_i^{T}\beta\right)^{2}.$$
Applying multiple linear regression (MLR), Armida et al. explored the relationship between the size, dimensionality, concentration, doping and other microstructural features of carbon dots and their thermoelectric performance.107 The conversion efficiency of a thermoelectric material is quantified by the thermoelectric figure of merit, $ZT = S^{2}\sigma T/\kappa$, where S is the Seebeck coefficient, σ is the electrical conductivity, κ is the thermal conductivity, and T is the absolute temperature. MLR is performed for each of ZT, S, σ and κ using 10 input variables characterizing size, dimensionality, concentration, doping and other features. The results revealed a strong negative relationship between functionalization and S, as well as a strong positive relationship between the type of carbon nanostructure and σ. Polynomial regression highlighted significant impacts of six input parameters on the Seebeck coefficient S, the electrical conductivity σ and the thermal conductivity κ, while no combination of parameters significantly affected the figure of merit ZT.
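To make the figure of merit concrete, the following minimal sketch evaluates ZT = S²σT/κ in SI units. The function name and the numerical values are illustrative assumptions and are not data from ref. 107.

```python
def figure_of_merit(seebeck, sigma, kappa, temperature):
    """Return the dimensionless thermoelectric figure of merit ZT.

    seebeck:     Seebeck coefficient S in V/K
    sigma:       electrical conductivity in S/m
    kappa:       thermal conductivity in W/(m K)
    temperature: absolute temperature T in K
    """
    if kappa <= 0 or temperature <= 0:
        raise ValueError("kappa and temperature must be positive")
    return seebeck ** 2 * sigma * temperature / kappa


# Illustrative (assumed) values, not taken from the reviewed study.
zt = figure_of_merit(seebeck=200e-6, sigma=1.0e5, kappa=1.5, temperature=300.0)
print(f"ZT = {zt:.2f}")
```

Because S enters quadratically, doubling the Seebeck coefficient at fixed σ, κ and T quadruples ZT, which is one reason the individual quantities are worth modelling separately.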
Zhang et al. utilized linear and polynomial regression models to investigate the core synthesis process parameters of B,N-GQDs (synthesis temperature, H₂O₂ additional volume, and synthesis time).108 The models were trained using the optical properties of B,N-GQDs derived from UV-visible and PL spectra (i.e. 675/500 peak intensity ratio and PLQY). Although the authors also employed other models such as bagging regression, random forest regression, LASSO regression, and ridge regression, the highest R² score was obtained with the polynomial model of degree 7 (R² = 0.9860). The polynomial and linear regression models indicated that a high H₂O₂ additional volume, a low synthesis temperature, and an appropriate synthesis time within the selected process conditions contribute to a high 675/500 peak intensity ratio (see Fig. 1).
Fig. 1 Machine learning-assisted evaluation of B,N-GQDs. (A) Schematic of machine learning-assisted evaluation of the optical properties of B,N-GQDs. Optical properties of B,N-GQDs in varied synthesis conditions and the corresponding predicted value sets with (B) linear regression, (C) polynomials 1–30, (D) polynomial regression 7, (E) bagging regression, and (F) random forest regression. The R² scores of linear regression, polynomial regression 7, bagging regression, and random forest regression models are 0.6751, 0.9860, 0.9473, and 0.9469, respectively. Reprinted with permission from ref. 108. Copyright 2022 American Chemical Society.
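The kind of degree comparison reported for ref. 108 can be illustrated with a small sketch that fits polynomials of several degrees and scores them with R² on held-out points. The data below are synthetic (a noisy sine curve in a single variable), and the split and degrees are assumptions chosen only for illustration; with a dataset as small as 44 samples, scoring on held-out data is a useful guard against overfitting at high degree.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination R^2."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic single-variable data (assumed), 44 samples.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 44)
y = np.sin(3.0 * x) + rng.normal(scale=0.05, size=x.size)

# Hold out every fourth point; both end points stay in the training set.
test_mask = np.arange(x.size) % 4 == 2
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

scores = {}
for degree in (1, 3, 7):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    scores[degree] = r2_score(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree}: held-out R^2 = {scores[degree]:.4f}")
```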
Tuchin et al. analysed a dataset on the synthesis parameters and optical characteristics of carbon dots focusing on their optical behavior within the red and near-infrared wavelengths.109 A predictive model using multiple linear regression has been developed to forecast the spectral attributes of these carbon dots. The validity of this model was confirmed by comparing its predictions with the actual optical properties observed in carbon dots synthesized in three distinct laboratories.
Doring et al. applied a multiple linear regression model that combines steady-state and time-resolved luminescence data from carbon dots to enhance temperature sensing accuracy to 0.54 K.110 This research illustrates the significant advancements in temperature sensing using optical probes through multidimensional machine learning techniques.
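The least-squares fit underlying the multiple linear regression models discussed in this section can be sketched with NumPy as follows. The data are synthetic, and the coefficient values and variable names are assumptions made for the example; `numpy.linalg.lstsq` returns the coefficients that minimize the residual sum of squares.

```python
import numpy as np

# Synthetic data (assumed): three input variables and a known linear relation.
rng = np.random.default_rng(42)
n_samples = 200
X = rng.normal(size=(n_samples, 3))
true_intercept = 0.7
true_beta = np.array([1.5, -2.0, 0.5])
y = true_intercept + X @ true_beta + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so the intercept is estimated with the slopes.
X_design = np.column_stack([np.ones(n_samples), X])

# Least-squares solution: minimizes sum_i (y_i - x_i^T beta)^2.
beta_hat, _, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)

rss = float(np.sum((y - X_design @ beta_hat) ** 2))
print("estimated coefficients:", np.round(beta_hat, 3))
print("residual sum of squares:", round(rss, 3))
```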
Several machine learning algorithms are dedicated to classification tasks. Logistic regression, a generalized linear model, is specifically designed for predicting probabilities associated with discrete (categorical) variables. For a binary outcome, the probability p(x) of a sample belonging to a particular category is expressed as
$$p(x) = \frac{1}{1 + e^{-x^{T}\beta}}.$$
ANNs have found applications in various carbon dot studies. Wang et al. have used a convolutional neural network (CNN) model tailored for predicting the optical characteristics of carbon dots such as spectral properties and fluorescence (FL) colors under ultraviolet (UV) irradiation.113 The model is trained with CD synthesis features (precursor, mass, temperature, solvent, and reaction time) from 170 prototypical studies. The output layer is a feature vector that indicates spectral properties and FL color under UV irradiation. Subsequently, CDs with distinct emission properties were synthesized, and their experimental data were compared with the predicted outcomes from the trained model. These synthesized CDs were employed in cell imaging, demonstrating good performance. These findings suggest that the implementation of CNNs can assist researchers in achieving effective CD design without the need for extensive manual processes. Within the same study, alternative classification models, such as support vector machines, K-nearest neighbors, random forests, decision trees, and extreme gradient boosting, demonstrated inferior performance compared to CNNs. These outcomes underscore the significant potential of CNNs in guiding the synthesis of CDs.
Senanayake et al. conducted a parallel study, employing ANNs, to characterize the influence of synthesis parameters on and make predictions for the emission color and wavelength of CDs.114 The machine analysis indicated that the selection of the reaction method, purification method, and solvent is more closely correlated with CD emission characteristics compared to factors like reaction temperature or time, which are often adjusted in experimental settings. A total of 407 data examples were gathered from the literature, with 379 of them constituting the training database. The remaining 28 data examples were reserved as an external test set to validate the model. The color prediction from the classification model, which does not include reaction temperature and time as features, attained a training accuracy value of 0.94. The accuracy of emission prediction is enhanced from MAE = 38.4 to 25.8 when a combination of both classification and regression methods is employed. To overcome the limitations associated with a small dataset in an ANN model, the authors used an ANN k-ensemble model which outperformed XGBoost, K-nearest neighbor (KNN), and support vector machine (SVM). The hybrid models employed a two-step approach: initially, a classification model was utilized to predict the color, and subsequently this predicted color (combined with the actual color during training) served as an input to predict the emission wavelength using a regression machine learning model. The tools developed in this study, particularly the hybrid models, are expected to be valuable in predicting the emission of novel carbon dots (CDs). This approach allows for the selection of promising reaction examples from the model, streamlining the synthesis of CDs with specific colors and significantly reducing the effort required in the optimization process.
In another classification problem, Tuccito et al. employed CD fluorescence as a nanochemosensor to detect different amino acids.115 The modification of CD surfaces can alter fluorescence properties, including emission intensity and excitation and emission wavelengths. In this study, carboxyl groups on nanoparticle surfaces were activated and subsequently reacted with various amino acids. The nanochemosensors demonstrated the ability to distinguish between amino acids within a mixture, showcasing their potential in complex amino acid analyses. ANNs were trained with fluorescence variation maps of activated CDs to predict if the amino acid is alanine (ALA) or not alanine. The resulting model had 0.8 sensitivity and 0.91 specificity. These discoveries will contribute to the advancement of cost-effective nanochemosensors for investigating specific diseases that are presently diagnosed through basic amino acid detection methods.
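Sensitivity and specificity, the two metrics quoted above for the binary alanine classifier, can be computed from true and predicted labels as in the sketch below. The labels are hypothetical and were invented for illustration; they are not the confusion matrix of ref. 115.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: 1 = alanine, 0 = not alanine.
y_true = [1] * 8 + [0] * 12
y_pred = [1] * 6 + [0] * 2 + [0] * 9 + [1] * 3
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```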
In a regression task, Pudza et al. applied multilayer perceptron (MLP) to predict the photoluminescent quantum yield (PLQY) of fluorescent carbon dots synthesized from tapioca powder.116 The training data (n = 30) were collected from the experiments. MLP trained with temperature, time, dosage and the solvent ratio predicted the PLQY with high accuracy. The optimization and prediction processes have yielded sustainable, efficient, and reliable fluorescent carbon dots. This approach not only saves energy within a manageable timeframe but also reduces the required dosage while maintaining an optimal quality output.
Doring et al. applied CNNs and deep neural networks on the emission/PL decay data of CDs to improve ethanol content determination in ethanol/water mixtures (n = 578) as well as in alcohol-containing beverages (n = 19).117 The models are trained by PL excitation/emission maps, PL decay spectra, and extracted features (i.e. PL intensities, PL peak positions, and PL lifetimes) to predict the ethanol content. The utilization of time-resolved spectral information (PL decays and lifetimes) as the input for CNNs enables more accurate prediction of ethanol content compared to steady-state emission data. Using entire optical spectra, namely PL decays and PL excitation/emission maps, advanced deep learning models demonstrated their applicability in the analysis of beverages. In contrast to CNN models with only a few predictor variables, which struggled due to autofluorescence of the beverages, advanced deep learning models enabled better predictions of ethanol content. Although CDs serve as excellent candidates for showcasing deep learning in optical sensing, the methods outlined in this study hold promise for enhancing chemical sensing across a range of luminescent materials (see Fig. 2).
Fig. 2 Multi-channel deep learning model: (a) structure of the PL decay channel. The input layer takes 1024 intensity integers as the input. After normalization, data are passed through a dense layer (64 neurons), a dropout layer (dropout = 0.01), and a second dense layer (16 neurons). (b) Structure of the PL map channel. The input layer takes a 16 × 217 × 1 matrix as the input. It is passed through a series of convolution, maximum pooling, and dropout layers before it is flattened and fed through another dropout layer and dense layer (32 neurons). (c) Example of a multi-channel model with 9 inputs. The respective input data are passed through either a PL decay channel or a PL map channel. These channels are concatenated before being passed through a dense layer (32 neurons) and a dropout layer (dropout = 0.3) to predict the ethanol concentration as the target variable. Reprinted with permission from ref. 117. Copyright 2022 American Chemical Society.
Extreme gradient boosting (XGBoost), a powerful ensemble learning algorithm, has emerged as a dominant force in the realm of machine learning, demonstrating remarkable success across various domains. Developed as an extension of traditional gradient boosting techniques, XGBoost has garnered widespread popularity due to its efficiency, scalability, and superior predictive performance. At its core, XGBoost operates by sequentially training a series of weak learners, typically decision trees, and iteratively refining their predictive capabilities. Unlike traditional gradient boosting, XGBoost incorporates a regularization term and employs a second-order Taylor expansion to optimize the objective function, enhancing its ability to capture complex patterns within the data.
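For reference, the regularized second-order objective described above is commonly written as follows. The notation follows the usual presentation of XGBoost rather than this review; in this block only, T denotes the number of leaves of the tree added at boosting round t, and w_j are its leaf weights.

$$
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\left[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^{2} \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2},
$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss $l(y_i, \hat{y}_i^{(t-1)})$ with respect to the previous prediction $\hat{y}_i^{(t-1)}$. For a fixed tree structure with leaf index sets $I_j$, minimizing this expression gives the leaf weights

$$
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},
$$

so a larger λ shrinks the leaf weights toward zero, and γ penalizes each additional leaf.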
One of the defining features of XGBoost is its versatility, making it applicable to both regression and classification tasks. The algorithm excels in handling large datasets and high-dimensional feature spaces, showcasing robustness in the face of noisy or missing data. Moreover, XGBoost provides a comprehensive set of hyperparameters that can be fine-tuned to accommodate diverse modeling scenarios, fostering adaptability to different applications. The success of the algorithm is further underscored by its ability to balance bias and variance, mitigating overfitting and ensuring generalizability across unseen data. As a result, XGBoost has become a method of choice in various fields, ranging from finance and healthcare to image processing and natural language processing, showcasing its broad utility and effectiveness in extracting meaningful patterns from complex datasets.
XGBoost has demonstrated considerable efficacy in numerous CD studies. Han et al. reported a machine learning-assisted approach for synthesizing highly fluorescent CDs using a hydrothermal route.82 XGBoost outperformed multilayer perceptron, support vector machine, and Gaussian process regressor in predicting the QY using five input variables: the volume of ethylenediamine, the mass of precursor, reaction temperature, ramp rate and reaction time. The data were collected from 391 experiments with different combinations of growth parameters, and respective QYs ranged from 0 to 1. XGBoost unveiled a noteworthy correlation between outstanding optical properties and the mass of the precursor and the volume of the alkaline catalyst. This observation aligns well with experimental findings. The methodology introduced in this study serves as a foundational step toward the advancement of artificial intelligence techniques for the analysis and optimization of material preparation methods (see Fig. 3).
Fig. 3 Application of ML for guided synthesis of CDs. (a) Design framework for the guided synthesis of CDs with a large QY based on ML and hydrothermal experiments. (b) The heat map of the Pearson correlation coefficient matrix among the selected features of hydrothermally grown CDs. (c) Feature importance retrieved from XGBoost-R that learns from the full data set. The most important features are EDA and M. (d) Predictions from the trained model, which is represented by the matrix formed by the two most important features. Reprinted with permission from ref. 82. Copyright 2020 American Chemical Society.
Tang et al. developed a regression model to improve the PLQY of carbon quantum dots (CQDs) grown through hydrothermal methods.118 Six hydrothermal parameters were identified as input features: the pH value (pH), reaction temperature (T), reaction time (t), the mass of precursor A (M), ramp rate (Rr), and solution volume (V). A total of 467 experimental records with different growth parameters were used, with the respective PLQYs ranging from 0 to 1. To best infer the PLQY from the features, several regression algorithms were evaluated with nested cross-validation, including an XGBoost regressor, a support vector machine regressor, and a Gaussian process regressor. XGBoost outperformed the other algorithms by a significant margin, with an R² value of 0.8402. The most critical factor influencing the PLQY was shown to be the pH value, followed closely by reaction temperature and reaction time. The trained XGBoost model was then employed to predict the PLQY for 1,555,840 potential synthesis conditions generated from various parameter combinations. The model recommended the eleven synthesis conditions with the highest predicted PLQY. Subsequent laboratory experiments yielded a remarkably high PLQY of 55.5%. This achievement is particularly noteworthy given the ultra-low heteroatom doping precursor ratio employed, making it one of the highest reported PLQY values under such conditions. The findings support the promising potential of ML in optimizing and expediting the material synthesis process, and suggest that ML can facilitate the development of advanced inorganic materials, contributing to practical applications through reduced processing time and enhanced material properties.
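The screening step described above (enumerate candidate conditions, predict the PLQY for each, and keep the top few) can be sketched as follows. The grid values are assumed, and `predict_plqy` is a hypothetical stand-in for a trained regressor; in practice it would be replaced by the fitted model's prediction method.

```python
import itertools

import numpy as np

# Assumed candidate values for each hydrothermal parameter (illustrative grid).
grid = {
    "pH": [3.0, 5.0, 7.0, 9.0],
    "T_celsius": [140.0, 160.0, 180.0, 200.0],
    "t_hours": [2.0, 6.0, 10.0],
    "M_grams": [0.1, 0.5, 1.0],
    "Rr_celsius_per_min": [5.0, 10.0],
    "V_ml": [10.0, 20.0],
}


def predict_plqy(conditions):
    """Hypothetical surrogate for a trained model; returns values in (0, 1]."""
    ph, temp, time, mass, ramp, vol = conditions.T
    score = (
        -((ph - 7.0) / 4.0) ** 2
        - ((temp - 180.0) / 40.0) ** 2
        - ((time - 6.0) / 8.0) ** 2
        - 0.1 * mass
        - 0.01 * ramp
        - 0.001 * vol
    )
    return np.exp(score)


# Enumerate every combination of the candidate values.
names = list(grid)
candidates = np.array(list(itertools.product(*grid.values())))

# Predict for all candidates and keep the eleven best, highest first.
predicted = predict_plqy(candidates)
top_k = 11
best = np.argsort(predicted)[::-1][:top_k]

print(f"{len(candidates)} candidate conditions screened")
for idx in best[:3]:
    print(dict(zip(names, candidates[idx])), round(float(predicted[idx]), 3))
```

Exhaustive enumeration is feasible here because prediction is cheap compared with an experiment; for much larger spaces, random or adaptive sampling of candidates would be the usual alternative.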
Hong et al. utilized the XGBoost model to predict the maximum fluorescence (FL) intensity and emission centers of CDs synthesized at room temperature using p-benzoquinone (PBQ) and ethylenediamine (EDA) as starting materials.119 They successfully synthesized a variety of CDs with tailored optical properties. These CDs were effectively employed for applications such as detecting Fe³⁺, facilitating sustained drug release, enabling whole-cell imaging, and preparing poly(vinyl alcohol) (PVA) films. The input dataset comprises four hundred types of CDs prepared under different reaction conditions, encompassing the mass of p-benzoquinone (MPBQ), the volume of ethylenediamine (VEDA), reaction duration, and solvent type. The predicted target variables are the FL intensity and the location of the emission centers. Principal component analysis (PCA) was employed to create new, relatively independent variables, and PC1 and PC2 were then used as input features for training the model. XGBoost showed superior performance compared to K-nearest neighbor, decision trees, random forest, support vector machine and convolutional neural networks. Leveraging the significant features (i.e. VEDA and MPBQ) extracted from the XGBoost model, the authors fabricated a series of novel CDs with customizable FL intensity and emission center properties. This study demonstrates that the XGBoost algorithm is effective in identifying crucial factors in CD synthesis. It provides chemists with a rapid and reliable means to access optimal reaction parameters for synthesizing desired CDs (see Fig. 4).
Fig. 4 Schematic illustration of machine learning guiding the synthesis of CDs. (a) Synthetic process of CDs. (b) Prediction of CD optical properties using machine learning models. Reprinted with permission from ref. 119. Copyright 2022 American Chemical Society.
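The PCA step used in ref. 119 to obtain relatively independent input variables can be sketched with a singular value decomposition. The data below are synthetic, the variable names are assumptions, and the categorical solvent type is left out for simplicity; the sketch shows only that the resulting PC1 and PC2 scores are uncorrelated.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Project standardized data onto its first principal components.

    Returns the component scores and the fraction of variance each explains.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize columns
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return scores, explained[:n_components]


# Synthetic reaction conditions (assumed): mass and volume are correlated.
rng = np.random.default_rng(1)
n = 400
mass = rng.uniform(0.1, 1.0, n)
volume = 2.0 * mass + rng.normal(scale=0.05, size=n)
duration = rng.uniform(1.0, 24.0, n)
X = np.column_stack([mass, volume, duration])

scores, explained = pca_scores(X)
pc_correlation = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]
print("explained variance ratio:", np.round(explained, 3))
print("correlation between PC1 and PC2:", round(float(pc_correlation), 6))
```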
Using ML, Wang et al. successfully predicted and synthesized metal-free CD homogeneous catalysts for the oxidation of C–H bonds.120 The dataset for cyclohexane oxidation was compiled from literature sources and laboratory notebooks, comprising a total of 652 entries. This dataset consists of 113 positive samples (17.3%) and 539 negative samples (82.7%). The boundary between success and failure is defined by achieving a 10% conversion of cyclohexane and a 70% selectivity towards adipic acid (AA). The input features are O (content of oxygen), Mw (weight-average molecular weight of the nonmetal catalyst), G (O2 or not), p (homogeneous catalysis or heterogeneous catalysis), T (catalytic temperature), P (pressure), and t (reaction time). Of the four classical models considered (multilayer perceptron, naive Bayes, SVM, and XGBoost), the XGBoost model was chosen for its high performance. The feature importance analysis derived from the XGBoost model indicates that the molecular weight (Mw) takes precedence over the other features; in order of importance, the features rank as Mw, O, T, P, and t. Subsequently, the established XGBoost model was applied to unexplored conditions to predict the probability of success or failure. All predictions of “success” agreed with the outcomes of the experiments actually conducted, affirming the accuracy of the model. This study illustrates a novel approach to C–H bond activation, employing metal-free CDs as quasi-homogeneous catalysts.
Chen et al. explored the relationship between biochar preparation parameters and the fluorescence quantum yield of CDs in biochar using a dataset of 480 samples and six machine learning models: decision trees (DT), random forest (RF), gradient-boosting decision trees (GBDT), extra trees (ET), K-nearest-neighbor (KNN) regression, and XGBoost.121 The input parameters for the biochar production experiment comprised the type of farm waste and sample characteristics such as the cellulose, hemicellulose, lignin, ash, moisture, nitrogen (N), and carbon (C) contents and the carbon-to-nitrogen ratio (C/N). Parameters related to the pyrolysis process, including pyrolysis temperature (T) and residence time (t), were also considered. The GBDT model performed best among the six models, as GBDT is resilient to missing values and outliers, is less susceptible to extreme values, and handles high-dimensional sparse data effectively. Four features, namely pyrolysis temperature, residence time, N content, and C/N ratio, had the most significant impact on the accuracy of QY predictions. The methodology introduced in this study can serve as a foundation for new techniques that leverage artificial intelligence for the analysis and prediction of CDs generated during biochar production.
Chen et al. explored the relationship between reaction parameters and the photoluminescence characteristics of CDs, achieving controllable synthesis of multicolor CDs with the aid of ML.111 Five input parameters are used: precursor type and quantity (for example, p-phenylenediamine with urea or p-phenylenediamine with citric acid), solvent type (anhydrous ethanol, water, or N,N-dimethylformamide), reaction time, and temperature. A total of 270 experiments with different parameter combinations were conducted to train the ML algorithms. The 3D fluorescence spectra (maximum emission wavelength, Stokes shift) and fluorescence quantum yield were used as the output variables. The RF model outperformed the other models, including extreme gradient boosting (XGBoost), light gradient boosting machine (LGBM), ridge regression (ridge), least absolute shrinkage and selection operator (LASSO), and support vector regression (SVR), in predicting the maximum emission wavelength, the fluorescence quantum yield, and the Stokes shift of multicolor CDs. The authors also ranked the input features by their computed importance. The solvent was the primary factor influencing the maximum emission wavelength of multicolor CDs, the precursor ratio was the key determinant of the fluorescence quantum yield, and the precursor type was the main factor influencing the Stokes shift.
Xing et al. employed RF to facilitate the synthesis of CDs with predictable photoluminescence (PL).122 Rather than treating the precursors as constants, the authors varied them: the inputs are 202 randomly chosen three-precursor combinations drawn from a pool of 24 precursors. The output parameters were the wavelengths of the strongest-intensity peak and the longest-wavelength peak under excitation at 365 and 532 nm. The other reaction parameters were fixed at 200 °C and 10 h. Among the six models tested (KNN, AdaBoost, bagging, DT, RF, and SVM), RF performed best. With predictions covering the entire precursor combination space, CDs with specific PL wavelength features can be screened much more effectively than through random trials.
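To give a sense of scale for the precursor combination space, the following sketch counts the three-precursor combinations available from 24 precursors and the share covered by 202 sampled combinations. It assumes unordered combinations without repetition; the precursor labels `P01` to `P24` and the random seed are placeholders, not the actual precursors or sampling used in the study.

```python
from itertools import combinations
from math import comb
import random

N_PRECURSORS = 24      # size of the precursor pool (from the text)
COMBINATION_SIZE = 3   # three-precursor combinations (from the text)
N_SAMPLED = 202        # randomly chosen combinations used as data (from the text)

# Assumption: order does not matter and no precursor is repeated.
n_possible = comb(N_PRECURSORS, COMBINATION_SIZE)
sampled_fraction = N_SAMPLED / n_possible

# Placeholder labels standing in for the real precursors.
precursors = [f"P{i:02d}" for i in range(1, N_PRECURSORS + 1)]
all_combinations = list(combinations(precursors, COMBINATION_SIZE))

# Draw a random subset to play the role of the experimentally tested set.
rng = random.Random(0)
training_set = rng.sample(all_combinations, N_SAMPLED)
unexplored = n_possible - len(training_set)

print(n_possible, f"{sampled_fraction:.1%}", unexplored)
# 2024 10.0% 1822
```

Under these assumptions, roughly one tenth of the space is measured, and a trained model is used to rank the remaining combinations before any further experiments are run.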
He et al. established an RF regression model to predict the inhibition efficiency of corrosion inhibitors based on hydrothermally synthesized CDs.123 The model reveals the relationship between the synthesis parameters and the inhibition efficiency of the CDs. The dataset combines 102 data points on CD synthesis and inhibition efficiency, drawn from reported studies and the authors’ own experiments. The input parameters were CD concentration in HCl, precursor type and quantity, solvent type and volume, and reaction time and temperature. The inhibition efficiency of the CDs, calculated from potentiodynamic polarization (PDP), served as the output variable. Feature importance derived from the RF model identified the critical factors in the synthesis of CD-based corrosion inhibitors: the concentration of CDs in HCl is the most influential factor, followed by N atomic content and reaction time. The synthesis route was then optimized with a genetic algorithm (GA), a population-based optimization technique inspired by natural selection and genetics that applies genetic operators to evolve candidate solutions iteratively. Controlled preparation of CD-based corrosion inhibitors was achieved. By identifying and filtering out unsatisfactory synthesis conditions, this approach significantly enhances the synthetic efficiency of CD-based corrosion inhibitors (see Fig. 5).
Fig. 5 Application of ML for controlled synthesis of CD-based corrosion inhibitors: (a) establishment of the dataset; (b) modelling for inhibition efficiency prediction; and (c) synthetic optimization of CDs. Reprinted with permission from ref. 123. Copyright 2023 Elsevier.
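The genetic algorithm used for synthetic optimization can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three parameters and their bounds, the operators (tournament selection, uniform crossover, Gaussian mutation, elitism), and above all `toy_surrogate`, which is a made-up function standing in for a trained regression model. It is not the model or the GA configuration of He et al.

```python
import random

# Hypothetical search bounds for three synthesis parameters (illustrative only).
BOUNDS = {
    "temperature_C": (120.0, 240.0),
    "time_h": (2.0, 12.0),
    "concentration_mg_L": (10.0, 200.0),
}
KEYS = list(BOUNDS)


def toy_surrogate(candidate):
    """Stand-in for a trained regressor: peaks at an arbitrary, made-up optimum."""
    target = {"temperature_C": 180.0, "time_h": 8.0, "concentration_mg_L": 150.0}
    score = 0.0
    for key in KEYS:
        low, high = BOUNDS[key]
        score -= ((candidate[key] - target[key]) / (high - low)) ** 2
    return score


def random_candidate(rng):
    return {k: rng.uniform(*BOUNDS[k]) for k in KEYS}


def tournament(population, fitness, rng, size=3):
    """Selection: the fittest of a few randomly drawn individuals becomes a parent."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitness[i])]


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each parameter is inherited from either parent."""
    return {k: (parent_a[k] if rng.random() < 0.5 else parent_b[k]) for k in KEYS}


def mutate(candidate, rng, rate=0.2, scale=0.1):
    """Gaussian mutation, clipped to the search bounds."""
    child = dict(candidate)
    for k in KEYS:
        if rng.random() < rate:
            low, high = BOUNDS[k]
            step = rng.gauss(0.0, scale * (high - low))
            child[k] = min(high, max(low, child[k] + step))
    return child


def run_ga(objective, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [objective(c) for c in population]
        elite = population[max(range(pop_size), key=lambda i: fitness[i])]
        children = [elite]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            a = tournament(population, fitness, rng)
            b = tournament(population, fitness, rng)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    return max(population, key=objective)


best = run_ga(toy_surrogate)
print(best, toy_surrogate(best))
```

In a workflow like the one in Fig. 5, the objective passed to `run_ga` would be the prediction of the fitted model rather than a toy function, so that only the most promising candidate conditions are taken to the laboratory.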
Xu et al. used linear discriminant analysis (LDA)130 and support vector machine (SVM)131 to analyze multidimensional data from a CD-based sensor array fabricated for the detection and differentiation of four tetracyclines (TCs): tetracycline (TC), oxytetracycline (OTC), doxycycline (DOX), and metacycline (MTC).125 A training data set was built from I/I0 values as a matrix of 2 CDs, 4 TCs, and 5 replicates. The reliability of the fluorescence sensor array was confirmed by testing 52 unknown samples. At a concentration of 1.0 μM, the four TCs are effectively clustered by SVM and LDA. Furthermore, the sensor array can differentiate between individual TCs as well as binary mixtures of TC and DOX. SVM offers an innovative option for array sensing systems handling diverse data sets. The research illustrates the potential of the fluorescence sensor array for environmental monitoring and antibiotic quantification (see Fig. 6).
Fig. 6 Two-dimensional LDA score plot of the fluorescence sensor array for the discrimination of the four TCs at different concentrations: (a) 1.0 μM; (b) 10 μM; (c) 25 μM; (d) 50 μM; (e) 100 μM; and (f) 150 μM (QR-CDs, 13.3 μg mL−1; CPC-CDs, 60 μg mL−1). Reprinted with permission from ref. 125. Copyright 2020 Elsevier.
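The layout of the sensor-array training data can be sketched as follows, reading "2 CDs, 4 TCs, and 5 replicates" as 20 samples (4 analytes × 5 replicates) with one I/I0 value per CD channel; that reading is an assumption. The I/I0 values are synthetic numbers invented for illustration, not measured data, and the nearest-centroid rule is a deliberately simpler stand-in for the LDA and SVM classifiers used in the study.

```python
import random

SENSORS = ["QR-CDs", "CPC-CDs"]          # the two CD channels named in Fig. 6
ANALYTES = ["TC", "OTC", "DOX", "MTC"]   # the four tetracyclines
N_REPLICATES = 5

# Made-up mean I/I0 responses per analyte, one value per sensor (NOT measured data).
SYNTHETIC_MEANS = {
    "TC": (0.80, 0.60),
    "OTC": (0.65, 0.85),
    "DOX": (0.45, 0.55),
    "MTC": (0.30, 0.90),
}


def build_training_set(seed=0, noise=0.01):
    """Return (rows, labels): one row of I/I0 values per analyte replicate."""
    rng = random.Random(seed)
    rows, labels = [], []
    for analyte in ANALYTES:
        for _ in range(N_REPLICATES):
            rows.append([m + rng.gauss(0.0, noise) for m in SYNTHETIC_MEANS[analyte]])
            labels.append(analyte)
    return rows, labels


def class_centroids(rows, labels):
    """Average the replicate rows of each analyte, column by column."""
    centroids = {}
    for analyte in ANALYTES:
        members = [r for r, lab in zip(rows, labels) if lab == analyte]
        centroids[analyte] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def classify(row, centroids):
    """Assign the analyte whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda a: sum((x - c) ** 2 for x, c in zip(row, centroids[a])),
    )


rows, labels = build_training_set()
centroids = class_centroids(rows, labels)
print(len(rows), len(rows[0]))             # 20 samples, 2 features each
print(classify([0.46, 0.54], centroids))   # a hypothetical unknown sample
```

With real data, the same 20 × 2 matrix and label vector would be passed to an LDA or SVM implementation, and the unknown samples would be classified by the fitted model.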
While artificial neural networks and gradient boosting algorithms have shown superior performance in several studies, the optimal machine learning model can vary even when the input and target features are identical. Hence, future research is needed to understand, either theoretically or numerically, the performance of various ML methods applied to CDs. Although finding the true optimal experimental parameters remains a challenge in the field, ML is expected to play an important role in addressing this problem. A promising avenue for improvement is to establish a more comprehensive model that incorporates both synthesis process-related and chemistry-related features.
The median sample size across the studies covered in this review is 357; for the studies using gradient boosting algorithms, it is 467. To enhance the accuracy and applicability of the ML approach, future work should focus on collecting high-quality data to refine and update the models currently in use. This continual improvement is crucial for developing more efficient CD synthesis strategies.
This journal is © The Royal Society of Chemistry 2024