Yuzhi Xu†ad, Jiankai Ge†*b and Cheng-Wei Ju*c
aDepartment of Chemistry, New York University, New York, New York 10003, USA
bChemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. E-mail: jiankai2@illinois.edu
cPritzker School of Molecular Engineering, The University of Chicago, Chicago, Illinois 60637, USA. E-mail: cwju@uchicago.edu
dNYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
First published on 24th April 2023
With the development of industrialization, energy has been a critical topic for scientists and engineers for centuries. However, due to the complexity of energy chemistry in various areas, such as materials design and fabrication of devices, it is hard to obtain rules beyond empirical ones. To address this issue, machine learning has been introduced to refine the experimental and simulation data and to form more quantitative relationships. In this review, we introduce several typical scenarios of applying machine learning to energy chemistry, including organic photovoltaics (OPVs), perovskites, catalytic reactions and batteries. In each section, we discuss the most recent and state-of-the-art progress in descriptors and algorithms, and how these tools assist and benefit the design of materials and devices. Additionally, we provide a perspective on the future direction of research in this field, highlighting the potential of machine learning to accelerate the development of sustainable energy sources. Overall, this review article aims to provide an understanding of the current state of machine learning in energy chemistry and its potential to contribute to the development of clean and sustainable energy sources.
Fig. 1 Change in the number and percentage of publications with clean energy and containing energy topics in the last decade (source: Web of Science).
Photovoltaics (PV) is one of the clean energy technologies that utilize solar energy and has gained more and more attention. More particularly, solar energy is a source of energy that can be considered as physically infinite. Therefore, adopting solar energy is regarded as the most promising solution to address the energy crisis.10 PV can be divided into three categories: silicon-based solar cells, organic solar cells (OSCs), and the increasingly popular perovskite solar cells (PSCs).
Currently, over 90 percent of the global PV market is dominated by crystalline silicon solar cells.11 Silicon-based solar cells are mature and commercially available for large-scale manufacturing. However, in most buildings and agricultural production sites, the opaqueness and weight of crystalline silicon prevent it from being used as a cover.12 OSCs and PSCs are considered potential alternatives to silicon-based solar cells, as they have the potential to be lightweight, flexible, and produced at a lower cost. Nevertheless, compared to silicon-based solar cells, OSCs and PSCs are still in the development stage.13 Achieving high power conversion efficiency (PCE), long-term stability and low efficiency loss during scale-up are the main challenges for OSCs and PSCs.14
Computational design of materials has become an essential part of PV design.15–17 By using computational methods, researchers can simulate and predict the performance of different materials and devices without costly and time-consuming experimental trials. Such methodology can greatly accelerate the materials design and device optimization process. In the past few decades, computational chemistry has mainly relied on first-principles methods.18,19 One of the main advantages of first-principles methods is that they can accurately and non-empirically predict the electronic and optical properties of materials, including the energy levels of electrons and holes, band gaps, and the absorption spectrum. In recent years, machine learning, a statistics-based technology, has become an important part of computation-aided materials design.20,21 Through machine learning, researchers can not only bypass complex formulas to explore relationships between different values but also generate novel compounds and materials. Many parameters of PV materials or devices cannot be simply derived from theoretical equations or calculated using first-principles methods, and machine learning greatly facilitates research in these areas.
In addition to PV, the catalytic reaction is also an efficient tool in sustainable and energy chemistry. A lot of catalysts have been developed to accelerate different types of chemical reactions, like the oxygen reduction/evolution reaction (ORR/OER), hydrogen evolution reaction (HER), CO2 reduction reaction (CO2 RR), etc.22–24 However, the catalytic ability is limited by multiple factors such as synthesis conditions, morphology, measurement methods, etc.25–27 It is hard to develop theories or models to describe catalytic systems. Machine learning, with the benefits of multi-factor fitting and identifying trends, has great potential in catalytic fields. Similarly, with the help of machine learning, researchers may be able to improve the understanding of the structure–function relationship and predict the catalytic ability, thereby guiding the synthesis of catalysts.
To date, there are already many websites, books, and reviews on machine learning in chemistry or energy that describe machine learning algorithms and the related research process.28–30 To gain a better understanding of basic machine learning concepts, such as what an algorithm is, what a dataset is, and how to manipulate and use machine learning, referring to these excellent works may help in comprehending the machine learning process.31–33 This review provides a perspective on input types, task types and state-of-the-art (SOTA) performances when using machine learning in energy chemistry. We expect that this review could help chemists and materials scientists to gain more insight into how machine learning empowers the development of energy chemistry.
Here, we will introduce how machine learning could optimize and accelerate the development of energy chemistry, especially in the design of materials. More specifically, this review focuses on the recent advancements, applications, and future prospects of machine learning in the fields of PV, catalysis and batteries. These are essential areas of energy chemistry, where machine learning techniques have been applied to improve the prediction, design, and optimization of material properties. In Section 2, we provide a brief introduction to organic photovoltaics (OPV) and delve into the structural and electronic descriptors. We further discuss the development of de novo design of OPV materials. In Section 3, we summarize multiple types of features and prediction tasks for perovskites. After that, we discuss how machine learning helps in perovskite discovery through experiments and auto-synthesis. In Section 4, the development of atomistic potentials and prediction for heterogeneous catalytic reactions are presented. As a practical application, battery design and management are reviewed with typical examples in Section 5. Finally, a perspective on developing novel machine learning-based methodologies aimed at solving chemistry problems and their applications in energy chemistry is proposed. With the assistance of machine learning, scientists can benefit in materials design on different scales, from predicting and optimizing properties and fitting atomistic force fields to managing systems.
There are multiple types of descriptors used in the process of converting raw data into features for OPVs, and these types can be divided into two categories: physical property representation and chemical structure representation. In physical property representation, researchers usually directly use electronic structure parameters or measurement values as the features. When constructing features using physical property data from different papers, it is important to ensure similar measurement conditions to avoid model bias or deviation. In chemical structure representation, various methods based on different structure description dimensions can be employed to describe the structure. It is important to note that these methods are often accompanied by a loss of information when converting a chemical structure to a feature. Therefore, the method of converting chemical structures into features may vary depending on the specific situation.
It is important to choose a suitable number and type of descriptors as features because adding more features to the model does not guarantee better prediction results. Inappropriate descriptors can add redundant information to the model, and too many features may lead to model overfitting. Besides, more features also increase the dimensionality of the data, making model training more expensive and difficult. Hence, feature engineering is usually required after manually selecting the descriptors to optimize the performance of the model.43 In actual practice, chemical structure descriptions and electronic descriptions typically complement one another.44
Fig. 2 (a) Specific representation coding and differences between SELFIES and SMILES, with SELFIES having a more complex structure than SMILES.50 Reproduced from ref. 50 under the terms of the Creative Commons Attribution 4.0 license from IOP Publishing. (b) Simple representation of molecular fingerprints.54 Reprinted (adapted) with permission from ref. 54. Copyright 2014 Elsevier. (c) The multidimensional fragmentation descriptors strategy in a linear alternative polymer; A and B represent different fragments in the polymer and A–B stands for a monomer in the polymer. The features from different input dimensions work together and boost the prediction accuracy of the linear alternative polymers.51 Reprinted (adapted) with permission from ref. 51. Copyright 2021 American Chemical Society.
In the chemical structure representation of OPVs, the whole polymer cannot be described directly using the machine learning algorithm for its complex components. As an approximation method, researchers usually adopt individual monomers to represent the polymers. Therefore, the process of describing small molecule and polymer structures is similar. In many cases of small molecule or protein design, 3D coordinates or 4D descriptors are useful as they provide detailed information about the molecular structure and conformation.55 However, the OPV structures are often more complex, and the use of these high dimensional descriptors may not be as effective. Besides, researchers are more concerned with OPV functionality than with subtle changes in the absolute position. Thus, 3D or 4D descriptors are rarely employed in the OPV machine learning work. In actual practice, structural descriptors are mainly classified into 2D and 3D categories. 2D descriptors can be mainly divided into SMILES (including different kinds of SMILES, such as canonical SMILES and SMILES with atomic mapping), InChI, Tensors and others. 3D descriptors can be divided into voxels, Coulomb matrix, tensor field networks and potential energy surface method. Using the concept of physical descriptors, Elton et al. introduced general descriptor production rules in two and three dimensions.56 Recently, the majority of OPV machine learning studies have used molecular fingerprints as the primary structure representation. This is because molecular fingerprints not only contain the original structure of the molecule but also capture some information from the surrounding atomic environment at a 2D level. 
This type of environmental information is generally in the range of 4–6 atoms such as extended-connectivity fingerprints (ECFP4 and ECFP6).57 Although molecular fingerprints lose long-range structural information, it is believed that with a large enough dataset, the environmental information supplemented by sub-segments can still capture this information. This is because OPV molecules have relatively fixed building blocks of the backbone structure. Note that very few OPV works simply use structures as the only descriptor. Generally, multiple descriptors including PV parameters and chemical structure information are featured together.
In many cases using fingerprints as descriptors, random forest (RF) and boosting decision tree (BDT) are the most suitable models in OPVs.58,59 Sun et al. manually constructed a database of more than 1700 experimentally tested donor materials, including both polymers and small molecules (with a median PCE of 3.48%).60 They compared the performance of image representations, ASCII strings, and seven molecular fingerprints in the binary classification of “high” or “low” PCE (with two thresholds, 3.00% and 10.00%) using a back propagation neural network (BPNN), a deep neural network (DNN), RF and support vector machines (SVMs). They also studied the influence of fingerprint length. They found that RF with daylight fingerprints61 achieved an average prediction accuracy of 86.67%, and explored whether fingerprints longer than 1000 bits, which contain sufficient chemical information, are suitable candidates for building descriptors in PCE prediction models. Furthermore, Wu et al. established a small database of 565 donor/acceptor (D/A) combinations and used a fragment fingerprint method to build a PCE prediction model.62 They employed five different methods, namely linear regression (LR), boosted regression trees (BRT), multinomial logistic regression (MLR), RF and an artificial neural network (ANN), to perform ternary classification of PCE using thresholds of 7% and 11%. Among these methods, RF achieved the highest performance, with 65.2% accuracy in classifying high-level PCE (>11%). They discovered six novel D/A combinations after virtual screening. They then compared the measured PCE values of the new materials with the model predictions and found that the RF predictions were closest to the experimental values.
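The thresholding step behind such classification tasks can be sketched in a few lines. The example below is only an illustration under stated assumptions: the class names, the handling of values that fall exactly on a threshold, and the sample PCE values are hypothetical and are not taken from the cited works.

```python
from bisect import bisect_left

# Thresholds in percent PCE, as quoted in the text (7% and 11%).
THRESHOLDS = (7.0, 11.0)
# Hypothetical class names; the text only calls the top class "high-level".
LABELS = ("low", "medium", "high")


def pce_class(pce_percent, thresholds=THRESHOLDS, labels=LABELS):
    """Map a PCE value in percent to one of len(thresholds) + 1 classes.

    A value exactly on a threshold falls into the lower class (an assumption),
    so that "high" means strictly greater than 11%.
    """
    return labels[bisect_left(thresholds, pce_percent)]


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Hypothetical measured and predicted PCE values (in percent).
measured = [3.48, 8.2, 11.6]
predicted = [4.1, 11.3, 12.0]
true_cls = [pce_class(v) for v in measured]
pred_cls = [pce_class(v) for v in predicted]
print(true_cls, pred_cls, accuracy(true_cls, pred_cls))
```

Binning a continuous PCE into classes discards information, so the reported accuracy depends strongly on where the thresholds are placed.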
As mentioned before, small molecules are treated as monomers using fingerprints in OPVs without considering the molecular weight and binding situation. In some cases, if the structure is complex, this can lead to large deviations in material predictions. Nagasawa et al. used MACCS fingerprints together with the bandgap, the HOMO level, and the weight-average molecular weight as the input to ANN and RF methods to build a PCE prediction model (Fig. 3).63 However, the predicted PCE of selected OPV molecules in the Harvard Clean Energy Project dataset was about 5.0%–5.8%, while the experimental device values were 0.47 ± 0.04%. They concluded that the deviations between the machine learning and experimental values are due to the direction of combining and low molecular weight. Additionally, as an extension of the molecular structure, molecular graphs converted from SMILES can also serve as structural descriptors in building PCE machine learning models. Eibeck et al. used a graph neural network (GNN) model and the attention fingerprint model developed by Xiong et al.64 for PCE prediction and reported Pearson correlation coefficients of 0.68 and 0.57.65 In fields such as retrosynthesis66 and chemical reaction prediction,67 molecular graphs have performed well, and with the development of GNNs and machine learning in chemistry, this method will contribute more to OPVs.
Fig. 4 (a) Using the values of the HOMO and LUMO as features in the ternary OPV VOC prediction model. (b) The RF model and (c) SVM model.69 Reproduced from ref. 69 under the terms of the Creative Commons Attribution License from Wiley.
By combining electronic descriptors with decision tree visualization, researchers can gain physical insights and experimental guidance from the trained models.51,70 To visualize a decision tree, the model is represented graphically as a tree-like structure. In the tree-like structure, there are four main types of nodes: root nodes, leaf nodes, internal nodes, and branch nodes. To be more specific, input data are located at the root nodes and the predictions are at the leaf nodes. Each branch of the tree represents a decision made on the attributes, and each internal node shows the test on the attributes. The tree is constructed by repeatedly splitting the data on the attribute that provides the most information gain. This process continues until the stopping criteria are met. Decision tree visualization can help to understand the structure and decisions made by the model and can provide direct insights into the relationships between the input variables and the output predictions. For example, Lee used a ternary OPV machine learning model to verify the correlation between the electronic properties (HOMO and LUMO of donors, acceptors, and third components, respectively) of various materials and their PCE69 (Fig. 4). He employed the feature ranking mechanism of the RF algorithm to rank the contribution of each feature to the VOC and concluded that the donor's LUMO and HOMO had a primary impact on the VOC. Furthermore, he established an RF model for binary classification, with a threshold of 9%. After analyzing the logical flowchart of the classification of the two groups, it was found that the LUMO and HOMO of the donor, as well as the HOMO of the acceptor, were the key values that contributed to the classification model. This result is consistent with previous work in the field, which also identified these features as important for the classification of the two groups.
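The splitting criterion mentioned above can be made concrete with a short sketch. It computes the information gain (reduction in Shannon entropy) of a threshold split on a single numeric feature and searches the midpoints between observed values. The feature values and class labels below are invented for illustration and do not come from ref. 69; practical tree implementations may also use other impurity measures.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting samples at value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, gain)."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    if not candidates:
        raise ValueError("need at least two distinct feature values")
    return max(((t, information_gain(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])


# Hypothetical donor LUMO levels (eV) and PCE classes, for illustration only.
lumo = [-3.9, -3.7, -3.5, -3.3]
pce_label = ["low", "low", "high", "high"]
threshold, gain = best_threshold(lumo, pce_label)
print(threshold, gain)
```

A tree builder repeats this search over all features at every node and keeps the split with the largest gain, which is why the features chosen near the root are often read as the most informative ones.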
Fig. 5 An overview of STONED. (a) shows how STONED obtains the chemical subspaces. Reordering the SMILES string of the same molecule can generate different SELFIES orderings. Random mutation in SELFIES can yield vastly different molecules with similar structures. (b) and (c) demonstrate how to form the path generated by two reference molecules and how to find the median molecule along the path, respectively.71 Reproduced from ref. 71 with permission from the Royal Society of Chemistry.
Nigam et al. proposed a molecular generative model based on the SELFIES description method for inverse molecular design, named superfast traversal, optimization, novelty, exploration and discovery (STONED); the workflow is shown in Fig. 5. This model enables structure enumeration in chemical space and the discovery of transformation trajectories between any two molecules.71 The researchers exploited the fact that adding, deleting and changing random characters in a SELFIES string generates multiple structures while maintaining a rational chemical structure. This allowed them to form local chemical subspaces. They also defined a chemical space path as a finite sequence of changes that transforms one molecule into another. Along the path between two reference molecules, they searched for the median molecule, which is similar to several reference molecules. In the application to the discovery of new OPV molecules, they took molecules representing three properties (high LUMO energy, high dipole moment and high HOMO–LUMO energy gap) as end molecules and tried to find median molecules among them.
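The idea of building a local subspace from random point mutations can be illustrated with a toy sketch on generic token sequences. This is not the STONED implementation: the token alphabet below is hypothetical, no SELFIES decoding is performed, and nothing here guarantees chemical validity, which in the real method comes from the SELFIES grammar itself.

```python
import random

# Hypothetical token alphabet for illustration; real SELFIES symbols differ
# and come with decoding rules that this toy does not implement.
ALPHABET = ["[C]", "[N]", "[O]", "[=C]", "[Ring1]", "[Branch1]"]


def mutate(tokens, rng, alphabet=ALPHABET):
    """Return a copy of `tokens` with one random insert, replace, or delete."""
    tokens = list(tokens)
    ops = ["insert"]
    if tokens:
        ops.append("replace")
    if len(tokens) > 1:
        ops.append("delete")  # never delete the last remaining token
    op = rng.choice(ops)
    if op == "insert":
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(alphabet))
    elif op == "replace":
        tokens[rng.randrange(len(tokens))] = rng.choice(alphabet)
    else:
        del tokens[rng.randrange(len(tokens))]
    return tokens


def local_subspace(tokens, n_samples, seed=0):
    """Collect distinct single-mutation neighbours of a token sequence."""
    rng = random.Random(seed)
    start = tuple(tokens)
    neighbours = set()
    for _ in range(n_samples):
        candidate = tuple(mutate(tokens, rng))
        if candidate != start:  # a replace may pick the same token again
            neighbours.add(candidate)
    return neighbours


start = ["[C]", "[C]", "[O]"]
neighbours = local_subspace(start, n_samples=50, seed=0)
print(len(neighbours), "distinct neighbours of", "".join(start))
```

In a real workflow each neighbour would be decoded to a molecule and filtered by similarity to the seed structure before any property evaluation.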
For the de novo design of a molecule, it is important to select the benchmarking task for the design. Currently, most of the insight into which threshold in machine learning to use derives from the researcher's own judgment and experience. A public benchmark is helpful in comparing the performance of different models. Nigam et al. reported a series of benchmarks named TARTARUS for designing molecules including OPV molecules.78 They demonstrated the utility of the TARTARUS benchmarks by evaluating several mature algorithms such as VAEs, long short-term memory hill climbing (LSTM-HC) models, REINVENT, JANUS, and a graph-based genetic algorithm (GB-GA). After the algorithm evaluation, they proposed six benchmarks based on the properties of the HOMO–LUMO gap, LUMO energy, molecular dipole moment and PCE, and put forward more detailed function combinations of these properties. According to these benchmarks, they designed a small organic donor and an acceptor molecule to be used in bulk heterojunction devices with [6,6]-phenyl-C61-butyric acid methyl ester (PCBM) and poly[N-9′-heptadecanyl-2,7-carbazole-alt-5,5-(4′,7′-di-2-thienyl-2′,1′,3′-benzothiadiazole)], respectively. Although these benchmarks should not be viewed as the final performance judgments of any method used (design issues should be handled case by case), they can still provide preliminary insights. They also claimed that there is currently no champion algorithm capable of performing tasks on all benchmarks.
To date, de novo molecular design using deep learning still has much room for improvement. More advanced algorithms, broader and more comprehensive datasets, and more sophisticated guidance for design models are worthy of further consideration by researchers. Also, automated synthesis of molecules is currently a hot topic in high-throughput screening.79,80 In the near future, we believe that a series of on-demand designs for automated design and synthesis should take chemistry to the next level.
There are other common descriptors besides structural features that can be used to represent perovskites, such as atomic- or element-level properties and macroscopic measured properties. After considering the A-sites and B-sites, a large number of microscopic descriptors can be extracted from the atoms, such as atomic mass, radius, electron affinity, Pauling electronegativity, etc. Similarly, macroscopic properties such as density, volume and space group can also be included in the feature input. Li et al. built a perovskite bandgap energy prediction model, which uses five structure-related factors (such as the tolerance factor) and an initial atomic feature set with 77 atomic physical, chemical and spatial properties.102 Isayev et al. proposed the concept of property-labeled material fragments (PLMFs), which combine the geometric structure with atom/element properties, including products and ratios of general element/atomic values, measured values and derived properties.102 Generally, expert knowledge can be used to combine existing descriptors into new ones. Additionally, a data-driven method such as the sure independence screening and sparsifying operator (SISSO) can also be used to explore new descriptors. To be more specific, SISSO is a mathematical model based on the LASSO approach. Given physical quantities as input, it combines them with unary or binary operators and selects the best descriptor from the resulting large space of expressions (potential features).103,104 Many examples show that this method can produce numerous novel descriptors.105 Usually, SISSO is accompanied by feature engineering (such as generating several potential descriptors and selecting the most suitable one). For example, Xie et al. used SISSO with atomic radius, valence, electronegativity, permittivity, and nine operators to yield over 182 million descriptors (equations).106 Finally, after cross-validation, they selected the best features and successfully applied them to octahedral tilting prediction with 81.7% accuracy. Notably, more combinations of material properties do not necessarily give better predictions; feature selection requires a tailored approach. Xu et al. showed that for predicting the properties of ferroelectric perovskites, the traditional machine learning workflow can perform better than the SISSO-based method in specific surface area, bandgap and Curie temperature prediction.107 Therefore, when choosing electronic descriptors for materials, several approaches may need to be explored to find the optimal solution.
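The descriptor-construction idea can be sketched as a two-stage toy: expand a few primary features with unary and binary operators, then rank the resulting expressions by their absolute Pearson correlation with the target. This covers only a screening-style step; the sparsifying operator of SISSO is omitted, unit consistency between features is not checked, and the operator set, feature names and data are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def build_candidates(primary):
    """Expand {name: values} with a small set of unary and binary operators."""
    candidates = dict(primary)
    for name, vals in primary.items():
        candidates[f"({name})^2"] = [v * v for v in vals]
        if all(v > 0 for v in vals):
            candidates[f"sqrt({name})"] = [sqrt(v) for v in vals]
    for (a, va), (b, vb) in combinations(primary.items(), 2):
        candidates[f"({a}+{b})"] = [x + y for x, y in zip(va, vb)]
        candidates[f"({a}-{b})"] = [x - y for x, y in zip(va, vb)]
        candidates[f"({a}*{b})"] = [x * y for x, y in zip(va, vb)]
        if all(y != 0 for y in vb):
            candidates[f"({a}/{b})"] = [x / y for x, y in zip(va, vb)]
    return candidates


def screen_by_correlation(candidates, target, top_k=3):
    """Return the names of the top_k candidates most correlated with target."""
    scored = [(abs(pearson(vals, target)), name) for name, vals in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]


# Hypothetical primary features and a target that secretly equals r_A / r_B.
primary = {"r_A": [1.0, 2.0, 3.0, 4.0], "r_B": [2.0, 1.0, 4.0, 3.0]}
target = [a / b for a, b in zip(primary["r_A"], primary["r_B"])]
top = screen_by_correlation(build_candidates(primary), target, top_k=3)
print(top)
```

Because the candidate space grows combinatorially with the number of operators and primary features, the screening step matters as much as the expansion step, which is consistent with the very large descriptor counts quoted above.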
With the development of deep learning (DL), a number of descriptors have been adapted to fit DL inputs, which has brought new developments in this area.108,109 Chen et al. proposed a novel neural network, the materials graph network (MEGNet), and represented a series of crystal perovskite structures as graphs.110 To be more specific, as shown in Fig. 6, they followed the usual GNN representation and defined V, E, and U as the atomic (node/vertex), bond (edge), and global state attributes, respectively. The original graph information is first updated from the original bond states to the new bond states. Subsequently, the atomic states are updated based on their previous states and the bond states. Finally, a new graph structure is generated using the previous global states. These steps are repeated in cycles until the final result is obtained. For 11 of the 13 predicted properties, the MAE was below the generally accepted threshold of chemical accuracy and better than in previous work using the QM9 database. This method has also been extended to predict the synthesizability and bandgap of perovskites.111,112
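The bond, atom, global-state update order can be illustrated with a toy sketch on scalar attributes. This is not the MEGNet code: the learned update networks are replaced by plain averaging purely so that the data flow is visible, and the function and variable names are hypothetical.

```python
def mean(values, default=0.0):
    """Arithmetic mean of an iterable, or `default` if it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else default


def graph_update_step(atoms, bonds, state):
    """One bond -> atom -> global-state pass on scalar attributes.

    atoms: list of floats, one attribute per atom (V)
    bonds: list of (i, j, attr) tuples, attr belongs to the bond i-j (E)
    state: float, the global attribute (U)
    Averaging stands in for the learned update functions of a real model.
    """
    # 1) Bonds first: each bond sees itself, its two atoms and the global state.
    new_bonds = [(i, j, mean([attr, atoms[i], atoms[j], state]))
                 for i, j, attr in bonds]

    # 2) Atoms next: each atom sees itself, its updated bonds and the global state.
    new_atoms = []
    for idx, attr in enumerate(atoms):
        pooled = mean(b for i, j, b in new_bonds if idx in (i, j))
        new_atoms.append(mean([attr, pooled, state]))

    # 3) Global state last: it sees itself plus pooled updated bonds and atoms.
    new_state = mean([state, mean(b for _, _, b in new_bonds), mean(new_atoms)])
    return new_atoms, new_bonds, new_state


# Hypothetical two-atom graph with one bond; values are arbitrary.
atoms, bonds, state = [1.0, 3.0], [(0, 1, 2.0)], 2.0
for _ in range(3):  # the update is applied repeatedly, as described in the text
    atoms, bonds, state = graph_update_step(atoms, bonds, state)
print(atoms, bonds, state)
```

The point of the ordering is that each level consumes the freshly updated attributes of the level before it, so information from bonds reaches the global state within a single pass.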
Fig. 6 Overview of a MEGNet module, showing how a graph represents a molecule in MEGNet. The attributes fall into three categories: bond, atom, and state. The figure also shows the updating steps in MEGNet.110 Reprinted (adapted) with permission from ref. 110. Copyright 2019 American Chemical Society.
Besides, automatic unsupervised learning methods can extract hidden information from the original input in the field of perovskites, leading to descriptors that are not manually defined.113 In an encoding–decoding model such as a variational autoencoder (VAE), the original features are embedded into a series of compressed latent vectors, which can capture more in-depth features. Using features extracted only from the chemical formula in a VAE model, Ihalage et al. defined the mean vector generated by the VAE (μ) as the perovskite material fingerprint (Fig. 7).114 Furthermore, they verified this new fingerprint with the k-nearest neighbor method and found that in the fingerprint space, similar materials are located close to each other. The 5 nearest neighbors (5-NNs) can determine the correct experimental crystal system of the parent composition with a success rate of 71.8%. Furthermore, such descriptors can be used for the de novo design of perovskites. Based on the generative adversarial network (GAN) and transformer models in machine learning, Dan et al. and Wei et al. proposed material design models named MatGAN and crystal transformer.115,116 Wei et al. compared these two models and found that the transformer-based model is more suitable for exploration in known chemical spaces due to its ability to capture element interrelationships, whereas the GAN-based model is more appropriate for discovering new molecules in uncharted chemical spaces.
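The nearest-neighbor check in fingerprint space can be sketched as a plain majority vote over Euclidean distances. The two-dimensional fingerprints and crystal-system labels below are invented for illustration; real latent vectors from a VAE would have many more dimensions, and the tie-breaking rule here is simply whatever `Counter.most_common` returns.

```python
from collections import Counter
from math import dist


def knn_predict(query, fingerprints, labels, k=5):
    """Majority label among the k fingerprints closest to `query`."""
    if len(fingerprints) != len(labels):
        raise ValueError("fingerprints and labels must have the same length")
    # Sort sample indices by Euclidean distance to the query vector.
    order = sorted(range(len(fingerprints)),
                   key=lambda i: dist(query, fingerprints[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2D "fingerprints" with crystal-system labels.
fingerprints = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
systems = ["cubic", "cubic", "cubic",
           "tetragonal", "tetragonal", "tetragonal"]

print(knn_predict((0.2, 0.2), fingerprints, systems, k=3))
print(knn_predict((5.5, 5.5), fingerprints, systems, k=3))
```

If similar materials really do cluster in the latent space, this simple vote recovers the label of the cluster the query falls into, which is the property the 5-NN test above is probing.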
Fig. 7 An overview of using a VAE to make perovskite material fingerprints. (a and b) In the descriptor creation section, the authors explored the periodic table for elements that may fully or partially fill octahedral and interstitial positions and specified the following conditions: (1) in the generation, the average oxidation state of site A should not be larger than that of site B; and (2) the average ionic radius of the A site should be more than or equal to the average ionic radius of the B site. (c and d) In the model training section, they trained the VAE model on over 2000 unlabeled experimental data sets and calculated Euclidean distances in fingerprint space between experimental components and potential perovskites.114 Reproduced from ref. 114 under the terms of the Creative Commons CC BY license.
Predicting the basic physical information of perovskites can be of great help in the exploration of new materials, the mapping of experimental parameters and the understanding of structure–function relationships. Many models have been developed to predict the physical properties of perovskites, such as bandgaps,109,119 oxide ionic conductivity,120 thermodynamic stability,121,122 dielectric breakdown strength,123,124 lattice parameters125 and crystal structures.126 For example, Zhang et al. established a model to predict lattice constants based on cubic perovskites.127,128 Besides, Li et al. predicted formation energy, thermodynamic stability, crystal volume and oxygen vacancy formation energy using a variety of machine learning models.129 Saidi et al. constructed a convolutional neural network (CNN) model for deriving relevant physical properties (e.g., lattice constants, octahedral tilt angles, etc.) from a given perovskite material.130 Compared to first-principles methods such as density functional theory (DFT), these kinds of machine learning models can be used for low-cost, large-scale screening of physical properties in perovskite materials.
A significant concern in perovskite research is identifying potential perovskite structure types. This includes determining which elements can form perovskites and understanding the different structural and compositional variations that are possible within the perovskite structure.131,132 Many models have been successfully established for different kinds of perovskites.133–135 Taking electrical and geometrical factors into account, machine learning models established by Li et al. were used to predict the formation of perovskite structures and showed 96.55% and 91.83% accuracy on the single and double perovskite databases, respectively.136 Combining first-principles calculations and machine learning, Talapatra et al. proposed using an energy above hull of 50 meV as the threshold criterion separating stable from unstable entries for perovskite screening.137 Based on 68 elements from the periodic table, they built a virtual database of 437828 stable perovskite structures. Based on SHAP analysis, Zhang et al. proposed that the formation of hybrid organic–inorganic perovskite (HOIP) structures is more likely to occur when the A site radius falls within the range of 1.95–3.25 Å and the B site radius falls within the range of 0.60–1.20 Å.138 Besides, machine learning can also be adopted for identifying the perovskite structure in experimental characterization. Massuyeau et al. built RF/CNN models capable of identifying XRD peaks using XRD patterns as the training set, which can directly distinguish between perovskite and non-perovskite materials.139 All of these works provide good paradigms for identifying perovskites.
Searching for stable, high-PCE lead-free halide perovskites140 is also an important downstream task of machine learning applications in perovskites. Using the property density distribution function (PDDF), Stanley et al. constructed features and applied them to predict the bandgap, formation energy, and convex hull distance of lead-free halide perovskites.141 Besides, machine learning can also be used in the design of new types of lead-free halide perovskites. Lu et al. reported a HOIP prediction model trained on 212 reported bandgap values.142 Using a combination of DFT optimization and machine learning prediction, they determined the range of tolerance factors, octahedral factors, metal electronegativity, and polarizability of potentially promising HOIP organic molecules and selected three thermally and environmentally stable lead-free HOIPs with appropriate bandgaps from 5158 candidates. In addition, there have been research efforts that combine machine learning and DFT143 for discovering lead-free hybrid perovskites,144 two-dimensional lead-free perovskites145 and others.118,146 These works provide a solid foundation for discovering more efficient and stable lead-free halide perovskites.
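The tolerance and octahedral factors mentioned here are not defined in the text. Assuming the standard Goldschmidt definitions for an ABX3 perovskite, t = (r_A + r_X) / (√2 (r_B + r_X)) and μ = r_B / r_X, they can be computed as in the sketch below. The ionic radii and the screening windows used in the example are hypothetical placeholders, not values from the cited works.

```python
from math import sqrt


def tolerance_factor(r_a, r_b, r_x):
    """Goldschmidt tolerance factor t = (r_A + r_X) / (sqrt(2) * (r_B + r_X))."""
    return (r_a + r_x) / (sqrt(2) * (r_b + r_x))


def octahedral_factor(r_b, r_x):
    """Octahedral factor mu = r_B / r_X."""
    return r_b / r_x


def within(value, low, high):
    """True if low <= value <= high."""
    return low <= value <= high


# Hypothetical ionic radii in angstroms, for illustration only.
r_a, r_b, r_x = 2.2, 1.1, 2.2
t = tolerance_factor(r_a, r_b, r_x)
mu = octahedral_factor(r_b, r_x)

# Hypothetical screening windows; a real study would fit or cite its own.
T_WINDOW = (0.8, 1.1)
MU_WINDOW = (0.4, 0.9)
passes = within(t, *T_WINDOW) and within(mu, *MU_WINDOW)
print(round(t, 3), round(mu, 3), passes)
```

Both quantities depend only on tabulated radii, which is why they are cheap first-pass filters before any DFT calculation or machine learning prediction is run.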
One of the major challenges that remains to be addressed in perovskite applications is the stability of devices. The stability of perovskite devices falls short of that of mainstream silicon devices. Odabas et al. analyzed the hysteresis and reproducibility of perovskite solar cells and proposed materials and alternatives for perovskite deposition with low hysteresis and high reproducibility.147 For materials, in addition to thermodynamic stability, another important aspect to consider is mechanical stability (or mechanical strength). Jaafreh et al. investigated the mechanical strength of perovskite-based materials using the AdaBoost algorithm with the bulk (B) and shear (G) moduli and a ductility criterion (G/B smaller than 0.57 at room temperature (RT)). Based on the model, they identified about 770 perovskites with mechanical strength.148 Howard et al. proposed a reap-rest-recovery (3R) cycle machine learning framework to avoid permanent failure of perovskites due to exposure to water vapor and oxidation.149 Because device stability depends on many complex factors, more effective and interpretable models still need to be developed to help find suitable devices.150
Compared with virtual datasets, real-world datasets are considered more appropriate for predicting the properties of perovskite materials. However, one obstacle is the time-consuming and labor-intensive process of manually collecting and cleaning large datasets from thousands of perovskite-related articles. Thanks to the development of natural language processing (NLP), many chemical text and information extraction toolkits have been proposed, such as ChemDataExtractor, OSCAR4, ChemicalTagger and others.151–154 It is therefore possible to use text mining to build relatively large real-world datasets for perovskite prediction.155,156 Beard et al. adopted ChemDataExtractor to build two datasets from 25720 articles on dye-sensitized solar cells (DSCs) and perovskite solar cells (PSCs).157 Furthermore, automatically collected datasets can be used directly to train machine learning models. Kim et al. proposed a language-model-based approach for linking the scientific literature to material synthesis insights and successfully performed perovskite synthesizability screening (prediction of two precursors).158 Although there are still few examples of text mining in this area, the method is likely to play an important role in building perovskite datasets as the amount of scientific literature continues to grow.
Fig. 8 High throughput experiments conducted by Sun et al. Precursor solutions were prepared and a high throughput experimental cycle was designed. Three experiments and characterization were carried out to examine the structural and optical properties according to thin film deposition, X-ray diffraction and UV-visible spectroscopy.161 Reprinted (adapted) with permission from ref. 161. Copyright 2019 Elsevier. |
More specifically, machine learning-assisted accelerated experiments can be divided into two parts: exploring experimental conditions and performing verification.163 Many studies have been reported on the selection of materials or experimental reagents. Yu et al. used machine learning to study the reactivity trends of different types of amines and suggested five property recommendations for amines used in the post-treatment of MAPbI3.164 In developing capping layers, Hartono et al. used RF regression and SHAP values to find the features contributing most to stability, and found that the most important properties for delaying the onset of degradation were a low number of hydrogen bond donors and a small topological polar surface area.165 Based on their model, they proposed and experimentally validated phenyltriethylammonium iodide (PTEAI) as the best capping layer material. They found that the stability lifetime of PTEAI-capped MAPbI3 was 4 ± 2 times that of bare MAPbI3 and 1.3 ± 0.3 times that obtained with octylammonium bromide (OABr), the SOTA capping layer at that time. They also explained, based on XPS and FTIR results, that the capping layer stabilizes MAPbI3 by changing its surface structure and chemistry, which is consistent with previous experimental findings.166,167 In addition, through machine learning and experimental verification, Cai et al. confirmed the Sn:Pb ratio in MASnxPb1-xI3 that holds an Sn–Pb alloy within the perovskite crystal.168
Regarding the exploration of conditions for device stability, Hu et al. investigated the factors affecting the stability of perovskite solar cell devices through a combination of experiments and machine learning.169 Using machine learning models, they selected five factors affecting efficiency and stability (grain size, defect density, bandgap, fluorescence lifetime and surface roughness) and proposed that roughness and crystal size have a strong influence on long-term stability. Subsequently, based on a self-built PCE model, they designed different conditions to vary the surface roughness and achieve the best stability of perovskite devices at 25% humidity and 25 degrees Celsius.
Machine learning-assisted high-throughput experiments for automated synthesis play an important role in replacing manual synthesis and enabling large-scale exploration of perovskites.170–172 More specifically, human operations are replaced with fully automated robotic work, and the process iterates between automated experiments and machine learning-based experiment planning. This approach can speed up experiments considerably compared with human labor. For example, Li et al. reported a high-throughput robotic perovskite synthesis system that takes 20-fold less time than manual synthesis.173 Bayesian optimization is the most commonly used algorithm and performs well in low-dimensional parameter spaces.174,175 As shown in Fig. 9, MacLeod et al. developed an 8-step modular thin-film robotic platform called ‘Ada’, which automatically synthesizes, processes, and characterizes thin-film samples. Using ChemOS from their previous work, a Bayesian optimization algorithm designs the next sample after each characterization.176,177 In addition, Higgins et al. used a pipetting robot to build a perovskite combinatorial library and used Gaussian process regression (a Bayesian regression method) to analyze the physical properties of the compositional series.178 However, compared with the mature and widely available APIs of machine learning software, machine learning-assisted high-throughput experiments still require the development of general-purpose experimental systems and software.
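The iteration between automated experiments and machine learning-based planning can be illustrated with a toy Bayesian optimization loop. This is only a sketch under stated assumptions: a one-dimensional, normalized process parameter, a zero-mean Gaussian process surrogate with a squared-exponential kernel, an upper-confidence-bound acquisition rule, and a made-up `toy_objective` standing in for a real measurement. It is not the ChemOS implementation.

```python
import math


def rbf(x1, x2, length_scale=0.2):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    n = len(x_obs)
    k = [[rbf(x_obs[i], x_obs[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_new) for x in x_obs]
    alpha = solve_linear_system(k, y_obs)
    v = solve_linear_system(k, k_star)
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, max(var, 0.0)


def bayesian_optimization(objective, candidates, x_init, n_rounds=8, kappa=2.0):
    """Propose, 'measure', and update for n_rounds; return the best sample."""
    x_obs = list(x_init)
    y_obs = [objective(x) for x in x_obs]
    for _ in range(n_rounds):
        untested = [x for x in candidates if x not in x_obs]
        if not untested:
            break

        def ucb(x):
            mean, var = gp_posterior(x_obs, y_obs, x)
            return mean + kappa * math.sqrt(var)

        x_next = max(untested, key=ucb)     # planning step
        x_obs.append(x_next)
        y_obs.append(objective(x_next))     # stand-in for the robotic experiment
    best = max(range(len(y_obs)), key=lambda i: y_obs[i])
    return x_obs[best], y_obs[best], x_obs


def toy_objective(x):
    """Hypothetical film-quality score with its optimum at x = 0.3."""
    return -(x - 0.3) ** 2


grid = [i / 20 for i in range(21)]
x_best, y_best, history = bayesian_optimization(toy_objective, grid, [0.0, 1.0])
print("best parameter:", x_best, "score:", y_best)
```

On a real platform the call to `objective` would be replaced by a synthesis and characterization run, and the candidate grid by the experimentally accessible parameter space.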
Fig. 9 8-Step automated platform combining synthesis, characterization and machine learning software into a self-driving workflow for making thin film samples, as reported by MacLeod et al.176 Reproduced from ref. 176 under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC). |
Machine learning potentials are obtained by fitting machine learning models to energies and forces from DFT calculations. A classical method was proposed by Behler and Parrinello in 2007.194 As is typical for empirical potentials, the total energy E is expressed as a sum of atomic contributions Ei:
$$E = \sum_{i} E_{i} \tag{1}$$
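A minimal sketch of this atomic decomposition is given below. Each atom's environment is described by a single radial symmetry function with a cosine cutoff, and a made-up `toy_atomic_energy` function stands in for the trained atomic neural network. The parameter values `eta`, `r_s` and `r_c` are assumptions chosen for illustration.

```python
import math


def cutoff(r, r_c):
    """Cosine cutoff that decays smoothly to zero at r = r_c."""
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)


def radial_symmetry_function(i, positions, eta, r_s, r_c):
    """Radial descriptor of atom i: sum over neighbours of a Gaussian times the cutoff."""
    total = 0.0
    for j, pos_j in enumerate(positions):
        if j == i:
            continue
        r_ij = math.dist(positions[i], pos_j)
        total += math.exp(-eta * (r_ij - r_s) ** 2) * cutoff(r_ij, r_c)
    return total


def toy_atomic_energy(g):
    """Hypothetical stand-in for a trained atomic neural network E_i(G_i)."""
    return -1.0 * math.tanh(0.5 * g)


def total_energy(positions, eta=1.0, r_s=0.0, r_c=6.0):
    """Total energy as the sum of atomic contributions E_i."""
    return sum(
        toy_atomic_energy(radial_symmetry_function(i, positions, eta, r_s, r_c))
        for i in range(len(positions))
    )


# Three atoms in a water-like geometry (coordinates in angstrom, illustrative).
atoms = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
print(total_energy(atoms))
```

Because the descriptor depends only on interatomic distances and the energy is a sum over atoms, the result is unchanged when the atoms are reordered or the whole structure is translated.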
Another important method for calculating PESs was proposed by Bartok et al.,196 in which a kernel-based approach was applied. They started by forming a local atomic density from the neighboring atoms, and converted the PES into an interpolation of the atomic energy in the truncated bispectrum space. Using Gaussian process (GP) regression, they obtained a good approximation to the atomic energy function. Then, with different sparse configurations, they proposed a final expression called the Gaussian approximation potential (GAP) model. This model also performs well for bulk crystals at high temperature. This line of work later led to the smooth overlap of atomic positions (SOAP) descriptor.197 That study showed that several widely used descriptors can be derived from a general approach in which a finite set of basis functions is used to expand the atomic neighborhood density function. To make a best estimate of the atomic energy function, it assumes a Gaussian basis function as below:
(2) |
Thompson et al.198 proposed a new interatomic potential for solids and liquids called the spectral neighbor analysis potential (SNAP). Like the GAP model proposed by Bartok et al.,196 SNAP uses the bispectrum as its descriptor, but it assumes a linear relationship between the atomic energy and the bispectrum components. In SNAP, the coefficients are determined by weighted least-squares linear regression, which allows the model to be fitted to a full set of quantum mechanical calculations. Symmetry properties are also exploited to reduce the computational cost.
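The weighted least-squares step can be sketched as follows. Each training configuration is summarized by a feature vector (here the number of atoms followed by two summed descriptor components), and the coefficients minimize the weighted squared error of the energies. All numbers are synthetic placeholders; a real SNAP fit also includes forces and stresses and uses actual bispectrum components.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def weighted_least_squares(features, targets, weights):
    """Minimise sum_k w_k (y_k - beta . x_k)^2 via the normal equations."""
    n_feat = len(features[0])
    xtwx = [[sum(w * x[p] * x[q] for x, w in zip(features, weights))
             for q in range(n_feat)] for p in range(n_feat)]
    xtwy = [sum(w * x[p] * y for x, y, w in zip(features, targets, weights))
            for p in range(n_feat)]
    return solve_linear_system(xtwx, xtwy)


# Synthetic data: [n_atoms, summed component 1, summed component 2] per configuration.
features = [[2, 1.0, 0.5], [3, 2.5, 0.1], [4, 1.5, 2.0], [2, 0.3, 1.2], [5, 4.0, 3.0]]
true_beta = [-3.0, 0.5, -0.2]          # assumed coefficients used to generate energies
energies = [sum(b * x for b, x in zip(true_beta, row)) for row in features]
weights = [1.0, 1.0, 2.0, 2.0, 1.0]    # e.g. larger weight for more important configurations

beta = weighted_least_squares(features, energies, weights)
print(beta)
```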
By utilizing graph convolutional neural networks (GCNNs), Schutt et al.199 developed the deep learning architecture SchNet to model atomic systems. Using continuous-filter convolutional layers, SchNet can predict the potential energy surfaces and energy-conserving force fields of small molecules, which can be used in MD simulations. GCNNs have also been applied to overcome the limitations of traditional methods that do not consider spatial information. Gasteiger et al.200 constructed a directional message passing neural network (DimeNet) that embeds the messages passed between atoms while taking directional information into account.
Deep learning is promising for improving the efficiency of calculating many-body potential energies. Zhang et al.201 developed DeePMD-kit to build potential energy models and force fields using deep learning methods. The model uses a function of the coordinates and element types as descriptors. After the DeePMD model is trained on AIMD data, MD simulations can accurately reproduce the results of the original AIMD data. DeePMD-kit is written in Python/C++ and interfaced with TensorFlow, which improves both training efficiency and user-friendliness.
Many software packages have also been developed to build machine learning potential energy surfaces, and the spread of these tools has greatly benefited the development of atomistic potentials. Several typical packages are listed below:
1. LASP:202 the learning-based atomistic simulation package (LASP) is a software platform that merges the stochastic surface walking (SSW) method and the global neural network (G-NN) potential for exploring and evaluating the PES. LASP provides various simulation techniques for PES data building, exchange, and G-NN potential generation within a single platform.
2. AMP:203 the atomistic machine-learning package (AMP) is a software package for building and using machine learning models for atomistic simulations. It is designed to handle large-scale simulations and includes features for parallelization and for incorporating diverse training data.
3. SchNetPack:204,205 SchNetPack is an open-source software package for building neural network potentials for molecular and material simulations. It includes a variety of NN architectures, as well as tools for generating and analyzing training data.
4. MLIP:206–208 the machine learning interatomic potentials (MLIP) software package is a Python library for building interatomic potentials using machine learning models. It includes a variety of machine learning algorithms, as well as tools for generating and analyzing training data.
In conclusion, machine learning has been widely applied to find the PESs and atomic forces. With the development of machine learning algorithms, different descriptors and regression methods have been applied, leading to great progress. Compared with ab initio calculations, adopting machine learning could largely reduce the computational cost and maintain acceptable accuracy. It is promising to apply machine learning methods in MD/MC calculations.
Adsorption determines the interaction strength between active intermediates and the catalyst, and the overpotential determines how the catalyst behaves once assembled in a battery. Lian et al.211 investigated single-atom catalysts (SACs) for lithium–sulfur (Li-S) batteries. First, they classified the adsorption process by the presence or absence of S–S bond breaking. Then, a crystal graph convolutional neural network (CGCNN) was applied to perform the classification and regression.212 From 812 adsorption configurations on 203 SACs, they sorted the configurations into 4 categories and excluded unstable catalytic configurations. After training, the machine learning model achieved a mean absolute error of 0.14 eV (Fig. 10a). As shown in Fig. 10b, for different elements they calculated the free energy changes of two elementary steps, ΔG1 and ΔG2. Which of these two steps is potentially limiting depends on the LiS* adsorption energy, which gives rise to the volcano plot shown in Fig. 10c. By calculating the overpotential for different combinations of metal site and support (Fig. 10d), they concluded that a higher overpotential leads to limited catalytic activity. The volcano plot and overpotential results can be used to optimize the synthesis of SACs and to predict their catalytic activities.
Fig. 10 Taken from Fig. 3 and 4 in Lian et al.211 (a) DFT calculated and machine learning predicted adsorption energy of Li-polysulfides, (b) predicted adsorption energy of LiS*, (c) volcano plots for catalysts with an overpotential lower than 0.1 V, and (d) heat map of the predicted overpotential of different SACs. Reprinted (adapted) with permission from ref. 211. Copyright 2021 American Chemical Society. |
The oxygen reduction reaction (ORR) and oxygen evolution reaction (OER) are key reactions for fuel cells and metal–air batteries.213–216 Both are limited by four-electron transfer and sluggish kinetics, so it is important to design efficient electrocatalysts for these two reactions. In recent years, SACs have been widely applied in the ORR/OER and have achieved high activities.217–219 This has increased interest in investigating the key factors for these reactions, which could assist in understanding the mechanism. Ying et al.220 found a volcano-shaped relationship between the catalytic activity and ΔGO, and applied a machine learning model based on the RF algorithm. Considering the scaling relationship and the feature importance, they identified the outer electron number and the oxide formation enthalpy as the two most important factors. The machine learning model could also predict ΔGO accurately and efficiently.
However, SACs still have some limitations, such as relatively low stability and simple adsorption configurations. Researchers therefore introduced dual-metal-site catalysts (DMSCs),221,222 which can both enhance the activity and optimize surface adsorption. The additional metal site also makes it more difficult to identify the specific factors that contribute to the reaction activity. Machine learning methods have therefore been applied to rank the important factors,223 which benefits the optimization of catalyst design.224,225
Zhu et al.226 performed DFT calculations of the adsorption free energy to screen DMSCs with high ORR activity, following the flowchart shown in Fig. 11a. Models trained with the GBR algorithm227 showed a low RMSE of 0.036 eV. The mean impact value (MIV) was used as an indicator of feature importance. With this tool, they identified the 7 features most closely related to the catalytic activity of DMSCs and determined that the electron affinity of the metal atoms is the most important of them. This result provides valuable insight for synthesizing DMSCs.
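The mean impact value can be sketched in a few lines. In its commonly described form, each feature is increased and decreased by a fixed fraction for every sample, and the MIV is the mean difference between the two sets of model predictions. The 10% perturbation, the `toy_model`, and the samples below are assumptions for illustration and are not taken from Zhu et al.

```python
def mean_impact_value(model, samples, feature_index, delta=0.10):
    """Mean change in prediction when one feature is scaled by (1 + delta) vs (1 - delta)."""
    diffs = []
    for x in samples:
        up = list(x)
        down = list(x)
        up[feature_index] = x[feature_index] * (1.0 + delta)
        down[feature_index] = x[feature_index] * (1.0 - delta)
        diffs.append(model(up) - model(down))
    return sum(diffs) / len(diffs)


def rank_features(model, samples, delta=0.10):
    """Feature indices sorted by decreasing |MIV|."""
    n_features = len(samples[0])
    mivs = [mean_impact_value(model, samples, j, delta) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: abs(mivs[j]), reverse=True)
    return order, mivs


def toy_model(x):
    """Hypothetical trained regressor: strong in x0, weak in x1, independent of x2."""
    return 2.0 * x[0] - 0.5 * x[1] + 0.0 * x[2]


samples = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5], [3.0, 4.0, 1.0]]
order, mivs = rank_features(toy_model, samples)
print("ranking:", order)
print("MIVs:", mivs)
```

The sign of the MIV indicates the direction of the influence and its absolute value the relative importance, so features are ranked by |MIV|.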
Fig. 11 Taken from Fig. 4 by Zhu et al.226 (a) Schematic plot of screening highly efficient DMSCs from DFT calculation and trained with the machine learning model, (b) training results of ΔGOH, (c) feature importance based on the MIV. Reprinted (adapted) with permission from ref. 226. Copyright 2019 American Chemical Society. |
Much effort has been devoted to finding new chemistry for battery electrodes and electrolytes. However, each material has different electrochemical properties, which makes direct optimization difficult. Machine learning has become a powerful tool for dealing with such complex factors and for relating structures to their functions. It assists the design of batteries and accelerates the discovery of energy storage materials.
Besides the voltage of electrode materials, the Li-ion conductivity is also a significant factor in battery performance. Across different materials, the conductivity can differ by tens of orders of magnitude,247,248 so developing materials with high Li-ion conductivity is of great importance. Sendek et al.249 discovered many crystalline solid materials through density functional theory simulations guided by machine learning. They compared the machine learning-guided method with a random search of the materials space and achieved at least a 44-fold improvement in the log-average of the room temperature Li-ion conductivity. In terms of the F1 score, the model was 3.5 times better than completely random guesswork and also much better than human intuition. The screening results show that most of the high-conductivity materials were found by the machine learning-guided search, which demonstrates its superiority over the traditional guess-and-test method.
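The two metrics mentioned above, the F1 score and the log-average of the conductivity, are easy to compute. The sketch below uses the standard definitions; the counts and conductivities are made-up examples rather than values from Sendek et al.

```python
import math


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; returns 0.0 if undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2.0 * precision * recall / (precision + recall)


def log_average(values):
    """Average taken in log10 space (the geometric mean), suited to data spanning decades."""
    return 10.0 ** (sum(math.log10(v) for v in values) / len(values))


# Hypothetical room-temperature conductivities in S/cm for two search strategies.
guided = [1e-3, 1e-4, 1e-2]
random_search = [1e-6, 1e-8, 1e-4]
print("improvement factor:", log_average(guided) / log_average(random_search))
print("F1:", f1_score(true_positives=8, false_positives=2, false_negatives=2))
```

Averaging in log space prevents a single highly conductive material from dominating the comparison, which matters when conductivities span many orders of magnitude.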
With the development of the Materials Project database, a large amount of material data has become available for training machine learning models. However, the quality of models is strongly limited by the quality and quantity of data, and not all properties are available for every material. Developing unsupervised learning methods is therefore valuable, as they avoid the need for labeled data and require fewer data points. Zhang et al.250 proposed an unsupervised learning model to find materials for solid-state Li-ion conductors. As shown in Fig. 12a, they applied an agglomerative hierarchical clustering method to an mXRD dataset, and the resulting groups share the characteristics of the real mXRD patterns (Fig. 12b and d), which indicates a good quality of classification. This model was then used to find solid-state Li-ion conductors (SSLCs) with high Li-ion conductivities and to group them accordingly (Fig. 12c). To confirm the effectiveness of the unsupervised learning, they conducted AIMD simulations, and the results show that the model discovered 16 new fast Li-conductors with conductivities of 10⁻⁴ to 10⁻¹ S cm⁻¹.
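Agglomerative hierarchical clustering starts with every sample in its own cluster and repeatedly merges the two closest clusters. The sketch below uses Euclidean distance with average linkage on short made-up vectors standing in for mXRD patterns; the linkage rule and distance metric are assumptions and may differ from those chosen by Zhang et al.

```python
import math


def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise Euclidean distance between members of two clusters."""
    total = sum(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_clustering(points, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy "patterns": each row is a hypothetical intensity vector for one material.
patterns = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.8],
    [0.1, 0.9, 0.9],
    [0.5, 0.5, 5.0],
]
print(agglomerative_clustering(patterns, n_clusters=3))
```

No labels are needed at any point: known conductors can afterwards be mapped onto the clusters to see which groups are enriched in high-conductivity materials.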
Fig. 12 Plots remade from Fig. 2 and 3 by Zhang et al.250 (a) The tree diagram of the agglomerative hierarchical clustering method, (b) the dendrogram to the conductivity reveals grouping of known solid-state Li-ion conductors, (c) violin plots of σRT data for each group, (d) mXRD of materials, (e) crystal structures (left) and (right) Li sites (green sphere) determined by local anion (yellow/red sphere) configurations, and (f) σRT vs. activation energy; ion conducting properties of newly predicted materials are shown as filled symbols. Reprinted (adapted) with permission from ref. 250. Copyright 2019 Springer Nature. |
Previous works mainly focused on finding new compounds as electrode materials. Besides screening and designing electrode materials by composition, optimizing heterogeneous electrode microstructures is also a powerful tool in battery design.251 Starting from the microstructure can clearly reveal the relationship between structure and function, and with the help of machine learning, complex structures can be designed. The reconstruction routine consists of two major strategies: statistical sampling and optimization. A common method is to sample descriptors of different microstructures and then minimize the difference between the reconstructed and real structures. Based on this method, different modeling approaches can be applied to build 3D electrode models, such as the physics-inspired Monte Carlo method and hierarchical reconstruction.252,253
Machine learning is a powerful tool for large-scale screening of materials and their properties. Jalem et al.260 proposed an NN method for screening potential solid state electrolyte (SSE) materials, which they used to search the LiMXO4 group. The screening focused mainly on two properties, the Li diffusion barrier and the cohesive energy, which are important for Li-ion conductivity and bonding information. They revealed the relationship between the diffusion barrier, the cohesive energy and the structure descriptors in the materials space. Compared with traditional partial least squares, the multi-output-node architecture increased the prediction accuracy.
To discover new chemistry, the structure–function relationship also needs to be investigated carefully. Kireeva et al.261 applied support vector regression to investigate the composition–structure–Li ionic conductivity relationships. The model can be used to define the parameters that lead to high Li-ion conductivity and to search a large materials space for potential SSE materials.
As shown in Fig. 13a, the predicted conductivity agrees well with the experimental results. The accurate results provide significant insight into the co-doping effect, which is not fully addressed by DFT calculations. In general, doping with different cations shows the same trend, without outliers in the predictions. Fig. 13b shows a model with an additional descriptor pool, which reveals the impact of different parameters on the property space.
Fig. 13 (a) The prediction accuracy of Li conductivity by machine learning models and (b) the conductivity results categorized by synthesis method using t-stochastic triplet embedding, where experimental results serve as an extra descriptor pool. Reproduced from ref. 261 with permission from the PCCP Owner Societies. |
Machine learning can also be applied to predict mechanical properties relevant to dendrite growth. Dendrite formation is a serious problem that affects the safety of batteries.262,263 Using solid electrolytes is a promising way to reduce dendrites, as it can greatly suppress their formation. Ahmad et al.264 calculated the properties of mechanically isotropic and anisotropic interfaces as criteria for dendrite initiation. A GCNN was then trained to predict the shear and bulk moduli, and a gradient boosting regressor and kernel ridge regression were trained to predict the elastic constants. With these machine learning methods, 20 mechanically anisotropic interfaces between Li metal and four solid electrolytes were predicted as candidates.
SSEs are also promising for addressing flammability concerns, which requires forming a high-quality SSE layer. The quality of SSE films is evaluated by both conductivity and uniformity. Chen et al.265 proposed a machine learning-guided method for synthesizing high-quality SSE films. They adopted three algorithms (principal component analysis, K-means clustering, and support vector machine) to analyze the relationship between fabrication parameters and film quality. Principal component analysis was used to project the manufacturing conditions onto a low-dimensional subspace. The K-means algorithm was then applied to classify the films and define their performance. Finally, a support vector machine revealed the effect of the fabrication parameters on film quality. When assembled into a full cell, the SSE film showed good stability, which demonstrates the usefulness of the method. The machine learning-assisted method thus successfully optimized the production of SSE films.
In conclusion, machine learning accelerates the design of electrolytes and the screening of materials with superior physical properties. Compared with traditional first-principles methods, machine learning methods can consider more of the factors that determine the behavior of electrolytes and can directly guide the fabrication process of batteries.
Estimating the state of charge and state of health requires efficient battery models for battery management systems. The main type of model used in battery systems is the equivalent circuit model (ECM),270,271 which simplifies a complex system to a circuit whose parameters are fitted to data. Going further, physics-based models (PBMs) are being developed for battery systems,272 which can take into account multi-dimensional information such as real time scale analysis and battery dynamic parameters.273,274 However, these models are limited by their complexity and require large computational resources to solve. With the development of machine learning, many methods and algorithms have been applied to simulate battery systems, including multiple regression, NNs, and Bayesian methods.275–277
Early prediction is an important strategy for predicting the state of health (SoH) and remaining useful life (RUL) of batteries, as it shortens experiments and improves the efficiency of optimization. As shown in Fig. 14, Attia et al.278 proposed a closed-loop optimization (CLO) system that combines an early-prediction model with a Bayesian optimization algorithm to reduce the time needed to identify charging protocols. The strategy uses data from the first 100 cycles as input to a linear model trained via elastic net regression to predict the cycle life of each charging protocol.279 Compared with full-cycle experiments, early prediction accelerated the process more than 30-fold. A Bayesian optimization algorithm was then applied to the early-prediction data to select the charging protocols for the next round.280,281 With these two strategies, the authors obtained good estimates of the average cycle life of the protocols and its uncertainty. Using early-prediction results also reduced the total optimization cost, which is beneficial for the wide application of the CLO system. Furthermore, the early-prediction strategy can be extended and integrated with Monte Carlo simulation to predict the remaining useful life of batteries.282 Tong et al.282 proposed a deep learning algorithm named adaptive dropout long short-term memory (ADLSTM). They trained the model on early-cycle capacity data and used long-term cycles as testing data. With the trained model, MC simulation was applied to quantify the uncertainty of the battery data and enhance the robustness of the model. This method showed the lowest errors compared with other algorithms.
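Elastic net regression combines L1 and L2 penalties on the coefficients of a linear model, which shrinks the model and can set uninformative coefficients exactly to zero. The sketch below is a plain coordinate-descent implementation of the usual objective (1/2n)·Σ(y − b0 − x·b)² + λ[α‖b‖₁ + (1 − α)/2·‖b‖₂²]. The two "early-cycle features" and the targets are synthetic assumptions, not the features used by Attia et al.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(features, targets, lam, alpha, n_iter=200):
    """Fit intercept and coefficients by cyclic coordinate descent."""
    n = len(targets)
    p = len(features[0])
    coef = [0.0] * p
    intercept = sum(targets) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            z = 0.0     # mean squared value of feature j
            for x, y in zip(features, targets):
                pred_without_j = intercept + sum(coef[k] * x[k] for k in range(p) if k != j)
                rho += x[j] * (y - pred_without_j)
                z += x[j] * x[j]
            rho /= n
            z /= n
            coef[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
        intercept = sum(
            y - sum(c * v for c, v in zip(coef, x)) for x, y in zip(features, targets)
        ) / n
    return intercept, coef


def predict(intercept, coef, x):
    return intercept + sum(c * v for c, v in zip(coef, x))


# Synthetic, centred early-cycle features; only the first one carries signal.
features = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
targets = [3.0 + 2.0 * x[0] for x in features]   # stand-in for (log) cycle life

intercept, coef = elastic_net(features, targets, lam=1.0, alpha=0.5)
print(intercept, coef)
```

In a closed loop, predictions of this kind for each tested protocol would be passed to the optimizer, which then chooses the protocols for the next round.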
Fig. 14 Schematic plot of the closed-loop optimization (CLO) system applied to predict the cycle life. Data from the first 100 cycles were used as the initial input, and Bayesian optimization was applied to determine the parameters. This method provides insights for designing battery parameters. Reprinted (adapted) with permission from ref. 278. Copyright 2019 Springer Nature. |
As a classical energy storage system, Li-ion batteries have been widely applied in daily life because of their high energy density.283–285 However, the degradation of Li-ion batteries, which is complex and non-linear, has caused many issues for recycling.286,287 To determine the state of health and remaining useful life of batteries, traditional methods mainly rely on multiscale simulations.288 However, conventional simulation tools do not perform well over wide length scales and long time scales. It is therefore more practical to combine different characterization and machine learning methods to generate a large amount of data and build an efficient statistical model.
For investigating battery systems, electrochemical impedance spectroscopy (EIS) is a classical method for measuring the relationship between input and output, from which quantities such as capacity and resistance can be inferred.289 However, it is hard to predict battery properties using EIS, since the spectrum contains both real and imaginary parts and there is still debate on whether an electrical model can describe a complex battery system.290–292 Zhang et al.293 built a battery forecasting system with a GP model. Trained on over 20000 EIS spectra of commercial Li-ion batteries, the GP model could successfully predict degradation and remaining useful life. With one of the largest datasets of its kind, they could estimate the capacity and RUL of batteries from a single impedance measurement under different conditions, such as different temperatures and different stages of life. Another article also applied EIS to estimate the state of charge (SoC) and obtained a highly accurate model.294 Using a sensitivity analysis of the data, the authors extracted the most reliable features for predicting the SoC. These methods help improve the prediction of battery conditions and also benefit EIS sampling methods.
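To make the real and imaginary parts of an impedance spectrum concrete, the sketch below evaluates the impedance of a simple equivalent circuit, a series resistance followed by a resistor and capacitor in parallel, and flattens the spectrum into a feature vector that a regression model could take as input. The circuit and its parameter values are illustrative assumptions and do not represent any specific cell or the exact input format used by Zhang et al.

```python
import math


def impedance(freq_hz, r0, r1, c1):
    """Complex impedance of R0 in series with (R1 parallel to C1)."""
    omega = 2.0 * math.pi * freq_hz
    return r0 + r1 / (1.0 + 1j * omega * r1 * c1)


def spectrum_features(freqs_hz, r0, r1, c1):
    """Concatenate real and imaginary parts at each frequency into one flat vector."""
    features = []
    for f in freqs_hz:
        z = impedance(f, r0, r1, c1)
        features.extend([z.real, z.imag])
    return features


# Illustrative parameters: ohmic resistance, charge-transfer resistance, capacitance.
R0, R1, C1 = 0.05, 0.10, 1.0
frequencies = [0.01, 0.1, 1.0, 10.0, 100.0]

for f in frequencies:
    z = impedance(f, R0, R1, C1)
    print(f"{f:8.2f} Hz  Re(Z) = {z.real:.4f}  Im(Z) = {z.imag:.4f}")
```

At low frequency the impedance approaches R0 + R1 and at high frequency it approaches R0, which traces the familiar semicircle in a Nyquist plot.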
In conclusion, researchers have designed many machine learning algorithms and models for predicting the state of charge and state of health, combining them with different characterization methods. These models provide valuable insights for designing and optimizing battery systems and accelerate prediction, which is helpful for high-throughput screening. Furthermore, by combining more physical insight with current machine learning algorithms, researchers can create models that better explain the results.
More recently, small-molecule drug design has kept pace with SOTA machine learning technology, such as transformer-based and GNN-based methods. In contrast, OPV machine learning models, which are also based on representations of small organic molecules, remain relatively limited, and many current works still rely on molecular fingerprints with traditional machine learning. Compared with deep learning, traditional machine learning requires less data and is more robust. This advantage suits the small size of OPV datasets and the low standardization of experimental data collected from the literature, and leads to better performance in that setting. However, traditional machine learning struggles to model complex relationships. As OPVs develop and OPV data grow, deep learning could lead to a deeper understanding of the underlying relationships in OPVs.
Transferring molecule-based models from one application to another is often simple, since many of them are broadly applicable. For example, some models based on small molecules can be used not only in drug discovery but also in other areas such as materials science. Flam-Shepherd et al. showed that their fragment-based 3D molecule model295 can be used to design both drug molecules and materials for organic light-emitting devices (OLEDs). In addition, with relatively minor modifications, generalizable molecule-based models such as the SMILES transformer reported by Shion Honda et al.296 and the SSVAE deep generative model reported by Kang et al.297 could potentially be used for other molecule-based prediction tasks in energy chemistry.
In general, several tools can help improve the interpretability of models. Visualization tools for NLP models or NNs can display the weights of particular layers in a deep learning model, which is easy to do in well-known machine learning packages such as TensorFlow and Keras. Such visualization can make a model more acceptable. For example, Rives et al. proposed the ESM-1b model, which showed promising results in protein structural and functional prediction.298 They used the t-SNE technique to visualize the trained weights of the ESM-1b model and illustrated that the model can learn physical and chemical information from the protein sequence. Besides visualization, another useful class of tools is model-agnostic methods,299 including local interpretable model-agnostic explanations (LIME), SHAP, and recursive feature elimination (RFE). The key attribute of model-agnostic methods is their independence from specific model structures, which allows them to be applied across many model types.
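As an illustration of what "model-agnostic" means in practice, the sketch below computes exact Shapley values for a single prediction by treating the model as a black-box function. SHAP is built on this quantity but uses efficient approximations; the brute-force enumeration here is feasible only for a handful of features. The `toy_model`, the input, and the all-zero baseline are assumptions for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature.

    Features outside a coalition are set to their baseline values.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi


def toy_model(z):
    """Hypothetical black-box predictor with an interaction between z1 and z2."""
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, "sum =", sum(phi))
```

The function only ever calls `model(z)`, so the same code applies unchanged to a random forest, a neural network, or any other predictor, and the attributions always sum to the difference between the prediction and the baseline prediction.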
In addition, experimental validation is an important part of ensuring that a work is accepted by other researchers. As machine learning in chemistry develops, different researchers will come up with different solutions for the same task. Machine learning models with experimental validation are more acceptable to those who want to apply them.
In many fields of machine learning, such as linguistics and biochemistry, many benchmarks have been established. Benchmarks are among the most important tools for comparing models, as they make it possible to compare the performance of different models on different datasets clearly. For molecules, several benchmarks were set by Wu et al.300 and Nigam et al.78 However, more benchmarks still need to be developed, because many novel machine learning models in chemistry focus on the descriptor, and different works claim that their strategy is excellent while using different training and test environments. It is therefore hard to compare different strategies. Although there are many public datasets for energy chemistry, as mentioned above, we hope to see more public datasets in the future. Because most current public datasets are collected by different methods, the average validation accuracy across different datasets can be considered as an evaluation benchmark.
In summary, establishing more benchmarks helps to compare the quality of the corresponding algorithms or descriptors, which is more conducive to the development of chemical machine learning especially in energy chemistry.
Some LLMs, such as ChatGPT, are designed to help computers understand human language and generate natural language responses, making them valuable tools for various natural language processing (NLP) tasks. Such a chatbot could give researchers an accessible way to interact with and leverage machine learning models, even if they are not familiar with programming or high-performance computing systems. Unfortunately, a materials-specific chatbot has not yet been developed. Nevertheless, the emergence of materials-specific chatbots similar to ChatGPT, which could interface with various downstream tasks and allow users to apply pre-trained models, is expected to lower the barrier to entry for machine learning research.
There is still much work to be done in this area. One challenge in building an LLM is the need for large amounts of high-quality data to train the model. Another challenge is to develop more specialized algorithms and architectures to handle the complex nature of materials science data. Despite these challenges, the development of LLMs has opened up new possibilities in the field of chemistry, and we can expect to see more exciting applications in the future.
Footnote |
† These authors contributed equally to this work. |
This journal is © The Royal Society of Chemistry 2023 |