Open Access Article
Yu Jin,a Hang-Biao Lv,a Shisheng Zheng*a and Jian-Feng Li*ab
aCollege of Energy, State Key Laboratory of Physical Chemistry of Solid Surfaces, iChEM, College of Chemistry and Chemical Engineering, Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, Fujian, China. E-mail: zhengss@xmu.edu.cn; Li@xmu.edu.cn
bInnovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen 361000, Fujian, China
First published on 29th January 2026
Machine learning is revolutionizing the field of heterogeneous catalysis, transitioning from a supporting tool to a central force in materials discovery and mechanistic understanding. At the heart of this transformation lies feature engineering, which bridges the catalyst structure with predictive modeling capabilities. In this review, we provide a systematic overview of the evolution of feature engineering in heterogeneous catalysis. This progression spans hand-crafted descriptors, symbolic regression methods, graph-based features that capture intricate chemical and geometric relationships, topological data features encoding multiscale structural invariants, and most recently, multimodal representations that integrate textual data and structure into unified feature spaces. Despite these advancements, several challenges remain in feature engineering, including the underdevelopment of multimodal representations, limited model interpretability, and the absence of cross-scale structural descriptors. Emerging strategies aimed at addressing these issues are discussed in detail. We hope that this review will inspire further innovation in feature engineering methodologies tailored to the continued advancement of heterogeneous catalysis.
In ML applications for heterogeneous catalysis, model performance, whether predicting catalytic activity, adsorption energies, or material stability, critically depends on the features used to represent the system.15–17 Features are numerical encodings of key physicochemical properties that bridge the catalyst structure with predictive models.18–22 Feature engineering involves selecting, transforming, and constructing descriptors that capture essential determinants of catalytic behavior, such as electronic structure, atomic geometry, and composition. Well-designed features enhance predictive accuracy, generalization, and computational efficiency by focusing on the most informative variables.23 For example, Vinchurkar et al. found that the “effective coordination number” and the “catalyst electronegativity” were the most important features in their model, and through symbolic regression deduced that the adsorption energy is approximately proportional to the square of the catalyst electronegativity.24 Consequently, feature engineering is not merely a preprocessing step but a decisive factor shaping the reliability and interpretability of ML-driven catalysis research.
This review traces the evolution of feature engineering as a central driver of ML in heterogeneous catalysis with a primary focus on how to encode catalyst structures. Early studies relied on handcrafted features such as electronegativity, atomic radius, and other fundamental properties, forming the basis for high-throughput screening using density functional theory (DFT). The subsequent introduction of symbolic regression methods, exemplified by the sure independence screening and sparsifying operator (SISSO), enabled automated discovery of low-dimensional, interpretable descriptors from vast candidate spaces. With the emergence of graph neural networks (GNNs), structural representation advanced toward end-to-end learning of atomic configurations and local chemical environments. More recently, topological data analysis (TDA) has provided mathematical tools such as persistent homology to quantify multiscale geometric and topological invariants, while multimodal fusion strategies have begun integrating structural, compositional, and textual information into unified representations. This review discusses the application of these feature paradigms to catalytic activity prediction and stability analysis. Finally, we outline emerging opportunities in developing more advanced multimodal feature representations, enhancing model interpretability, and establishing cross-scale feature engineering frameworks, all aimed at accelerating the rational design of high-performance catalytic systems.
A key application of ML in electrocatalysis is the rapid prediction of performance metrics, including adsorption energies, activity, and selectivity, enabling high-throughput screening of candidate materials.26–30 For example, Rosen et al. showed that the crystal graph convolutional neural network (CGCNN) can directly predict fundamental properties, such as band gaps, from crystal structures.31 By incorporating descriptors such as intermediate adsorption energies, active-site coordination, and reaction conditions (e.g., potential and pH), AI models can accurately predict product selectivity in complex reaction networks.32 For the oxygen evolution reaction (OER), symbolic regression approaches like SISSO have been used to extract concise physical descriptors from large feature spaces, facilitating precise activity trend predictions.33 In studies of g-CN-supported double-atom catalysts (DACs), Bian et al. employed feed-forward neural networks trained on DFT data to predict limiting potentials for various bimetallic combinations, efficiently identifying superior catalysts for the catalytic CO2RR to CO and HCOOH.28 More machine-learning-driven catalyst discovery studies can be found in previous review articles.16,34–36
Another key application of ML lies in machine-learning interatomic potentials (MLIPs), which enable atomistic simulations of catalysts with near-DFT accuracy at unprecedented computational efficiency.37,38 MLIPs achieve this by learning high-dimensional potential energy surfaces directly from reference DFT data using flexible, data-driven representations such as neural networks, graph-based models, and other regression frameworks, thereby overcoming the functional limitations of classical empirical force fields and extending simulations to larger system sizes and longer timescales. To ensure the reliability and transferability of such potentials across vast configurational spaces, uncertainty-aware active learning frameworks have been developed as a systematic strategy to iteratively identify poorly sampled regions and enrich training datasets.39,40 In the electrocatalysis domain, a growing body of research has shown that MLIP-driven molecular dynamics can significantly reduce computational cost compared with conventional DFT simulations.16,41,42 For instance, Lian et al. used high-accuracy MLIP-driven molecular dynamics simulations to investigate oxide-derived copper electrocatalysts and showed that subsurface oxygen diffusion occurs over spatiotemporal scales extending from seconds to hours, a regime that is experimentally relevant but effectively unreachable by conventional DFT due to prohibitive computational cost.43
ML has emerged as an essential tool in heterogeneous catalysis, offering powerful capabilities for predicting material properties, adsorption energies, catalytic activity, and stability. It is fundamentally reshaping the paradigms of catalyst design and discovery, driving the field toward a more rational and data-driven future. Central to these advances is the construction of effective system representations and feature sets, with feature engineering serving as a critical link between underlying physicochemical mechanisms and ML models. The continued development of feature engineering is thus pivotal to the effectiveness and depth of AI applications in heterogeneous catalysis.
Among numerous handcrafted features, the d-band center theory represents a paradigmatic example of successfully establishing a quantitative correlation between the electronic structure and catalytic activity.45,46 By describing the relative position of metallic d-band centers with respect to the Fermi level, this theory provides a mechanistic understanding of how adsorption strengths of reaction intermediates are regulated on transition metal surfaces. Originating from Newns et al.'s quantum model47 and systematically elaborated by Nørskov et al.,48 the d-band center theory has undergone continuous refinement. It has been employed to rationalize adsorption and reactivity variations across different metal surfaces,49 guide the design of alloy catalysts,50 and elucidate the influence of strain or surface modification on catalytic performance,51 ultimately giving rise to the well-known “volcano plot” for predicting activity trends.52 Beyond the d-band center itself, additional descriptors such as d-bandwidth, filling factor, and coupling matrix elements, introduced by Nilsson53 and Ruban et al.,50,54 have been incorporated to enhance theoretical precision. Experimentally, Stamenkovic et al. demonstrated that the formation of a Pt-skin structure on Pt3Ni(111) lowers the d-band center of surface Pt by approximately 0.34 eV, resulting in a tenfold enhancement of the oxygen reduction reaction (ORR) activity relative to Pt(111). This work provides direct evidence of how electronic structure modulation can govern catalytic behavior, highlighting the predictive power and practical relevance of d-band-based descriptors in electrocatalysis.
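As a minimal illustration of this descriptor, the d-band center is commonly evaluated as the first moment of the d-projected density of states (DOS) measured from the Fermi level. The sketch below assumes a projected DOS sampled on an energy grid; the function name `d_band_center` and the synthetic Gaussian DOS are illustrative assumptions and are not taken from the cited works.

```python
import numpy as np

def _trapezoid(y, x):
    """Trapezoidal integration of y(x) on a (possibly non-uniform) grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def d_band_center(energies, d_dos, e_fermi=0.0):
    """First moment of the d-projected DOS, referenced to the Fermi level.

    energies : energy grid (eV), absolute scale
    d_dos    : d-projected DOS on that grid (arbitrary units)
    e_fermi  : Fermi level (eV) on the same absolute scale
    """
    e = np.asarray(energies, dtype=float) - e_fermi
    rho = np.asarray(d_dos, dtype=float)
    return _trapezoid(e * rho, e) / _trapezoid(rho, e)

# Synthetic example (assumed data): a Gaussian d band centred 2 eV below E_F.
grid = np.linspace(-10.0, 6.0, 1601)
dos = np.exp(-0.5 * ((grid + 2.0) / 0.5) ** 2)
center = d_band_center(grid, dos)  # close to -2.0 eV for this symmetric band
```

In practice the integration window (for example, whether unoccupied states are included) is a choice that varies between studies and should be reported alongside the value.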
As research in electrocatalysis has advanced, handcrafted features have gradually evolved from single-parameter descriptors to multi-parameter combinations forming systematic descriptor frameworks. A representative example is provided by Tran et al., who integrated four categories of physicochemical descriptors: the atomic number of the element (Z), the Pauling electronegativity (χ), the coordination number of the element with the adsorbate (CN), and the median adsorption energy between the adsorbate and the pure element (ΔE), as illustrated in Fig. 2a. This approach constructs a 32-dimensional feature vector, offering a comprehensive digital representation of the local chemical environment at adsorption sites. By establishing effective mapping from simple elemental properties to complex catalytic performance, it enables accurate predictions of adsorption energies for the CO2RR and the hydrogen evolution reaction (HER).44 Building on this framework, hand-crafted features have been widely applied in machine-learning-assisted high-throughput screening of electrocatalysts.58,59 For example, Back et al. further expanded the feature system (Fig. 2b) by incorporating additional atomic properties, including periodic table position, electronegativity, atomic volume, valence electron count, first ionization energy, electron affinity, and atomic radius. Moreover, by introducing a Voronoi polyhedron-based neighborhood solid angle descriptor, this enhanced feature set achieved a mean absolute error of only 0.15 eV in predicting *CO and *H adsorption energies, thereby facilitating high-throughput screening across the vast catalyst design space.57 In the study of single-atom and diatomic catalysts, handcrafted features have proven to be highly effective in elucidating structure–activity relationships. By carefully selecting descriptors such as d/p electron count, oxide formation enthalpy, and electronegativity, researchers successfully predicted the selectivity of single-atom catalysts (SACs) for H2O2 generation.57 Similarly, active sites for the CO2RR on dealloyed gold surfaces were identified,60 and the critical role of interatomic spacing in governing HER activity within g-CN systems was established.61 Subsequently, using hand-crafted geometric and electronic descriptors combined with random-forest models, a materials genome containing 279 bi-atom catalysts was constructed, from which 9 HER-active, 3 OER-active, and 5 ORR-active catalysts were identified by high-throughput screening, with AuCo/g-CN emerging as a rare trifunctional HER/OER/ORR catalyst.62 Xu et al., using physically interpretable hand-crafted descriptors combined with an XGBoost model, screened 196 S/N-coordinated SACs and uncovered 17 promising NRR catalysts, among which Mo@S3N1 and W@S3N1 exhibited the best performance.63
Fig. 2 (a) Fingerprint of the coordination site. Adsorption sites are reduced to numerical representations, or fingerprints, and these fingerprints are used as model features by TPOT55 to predict ΔECO. Reprinted with permission.44 Copyright 2018, Springer Nature. (b) Nine basic atomic properties are presented by one-hot encoding56 to prepare the atomic feature vectors. Reprinted with permission.57 Copyright 2019, American Chemical Society.
To ensure robustness, many hand-crafted features are physics-informed and mathematically designed. Notably, the smooth overlap of atomic positions (SOAP) descriptor (Fig. 3) represents local atomic environments by expanding a smooth atomic neighbor density on the basis of radial functions and spherical harmonics, yielding a continuous, high-dimensional representation that is invariant to translation, rotation, and permutation of identical atoms.65 Among structural descriptors tested in kernel ridge regression models for hydrogen adsorption energies on MoS2 and Cu–Au nanoclusters, Jäger et al. found that SOAP achieved the lowest mean absolute error in modeling HER activity.66 In the screening of catalysts for CO2 hydrogenation at complex metal–oxide interfaces, Nielsen et al. found that combining SOAP descriptors with the WWL-GPR model can efficiently predict catalyst adsorption energies.67 When combined with sparse Gaussian process regression (SGPR), SOAP can be used to develop data-efficient machine-learned potentials with built-in uncertainty quantification, enabling active-learning-driven refinement of training datasets during molecular dynamics simulations.68,69 To improve scalability over large configurational and chemical spaces, sparse Bayesian committee machine (BCM) schemes partition the descriptor space into multiple local SGPR experts and integrate their predictions within a Bayesian framework, preserving uncertainty estimates while reducing computational cost;70 such BCM-based potentials have been explored as a route toward constructing transferable potentials spanning wide materials spaces, enabling high-throughput molecular dynamics simulations of multicomponent and multiphase systems.70
Fig. 3 (a) Illustration of the construction principle of the smooth overlap of atomic positions (SOAP) descriptor. First, the neighbor atomic density ρ around a central atom is expanded in a local basis composed of radial basis functions and spherical harmonics $Y_{lm}$. Then, the squared moduli of the expansion coefficients $c_{nlm}$ are summed over $m$ to obtain the power spectrum vector p, thereby ensuring the rotational invariance of the descriptor; (b) elucidation of this construction from a mathematical structural perspective. Reprinted with permission.64 Copyright 2021, American Chemical Society.
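The construction described for SOAP can be written compactly as follows. This is a sketch in one common convention: the Gaussian width $\sigma$, the radial basis $g_n$, and the omitted normalization prefactor are assumptions that differ between implementations. The terms with $n = n'$ are the squared moduli summed over $m$.

```latex
\rho(\mathbf{r}) = \sum_{i} \exp\!\left(-\frac{|\mathbf{r}-\mathbf{r}_i|^{2}}{2\sigma^{2}}\right)
                 = \sum_{nlm} c_{nlm}\, g_n(r)\, Y_{lm}(\hat{\mathbf{r}}),
\qquad
p_{nn'l} \propto \sum_{m=-l}^{l} c_{nlm}^{*}\, c_{n'lm}.

% Under a rotation R, the coefficients mix only within each l through the
% unitary Wigner matrices D^l(R), so the sum over m is unchanged:
c_{nlm} \;\to\; \sum_{m'} D^{l}_{mm'}(R)\, c_{nlm'},
\qquad
\sum_{m} D^{l\,*}_{mm'}(R)\, D^{l}_{mm''}(R) = \delta_{m'm''}
\;\Longrightarrow\; p_{nn'l} \;\to\; p_{nn'l}.
```

Translational invariance follows from centering the density on the atom of interest, and permutation invariance from summing over all neighbors $i$ of the same species.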
Manual feature design in heterogeneous catalysis faces intrinsic limitations. Expert-crafted descriptors often struggle to explore high-dimensional feature spaces, capture nonlinear multi-factor interactions, or generalize to novel materials such as high-entropy alloys, complex oxides, and metal–organic frameworks. These challenges have motivated interpretable automated feature engineering methods, such as SISSO, which efficiently identify optimal feature combinations from large pools of primary descriptors. By revealing subtle interactions inaccessible to human intuition, such approaches enhance descriptor discovery and enable rational design of complex catalytic systems.
The fundamental framework of SISSO is illustrated in Fig. 4a. Its workflow consists of two sequential steps: first, “sure independence screening” rapidly reduces the candidate feature space; second, a “sparsifying operator” precisely identifies optimal low-dimensional descriptors.72 By integrating compressive sensing with symbolic regression, SISSO enables the automated extraction of descriptors from complex expressions involving numerous primary features. The resulting descriptors are presented as analytical formulas, providing a transparent link between data-driven modeling and the underlying physical mechanisms.72,73
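The two-step workflow can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the reference SISSO implementation: the operator set (squares, products, sums), the function names, and the synthetic target are all hypothetical, and the sparsifying operator is an exhaustive ℓ0 search by least squares.

```python
import itertools
import numpy as np

def build_candidates(X, names):
    """Expand primary features with a small, assumed operator set."""
    feats, labels = [], []
    for j, n in enumerate(names):
        feats.append(X[:, j]);      labels.append(n)
        feats.append(X[:, j] ** 2); labels.append(f"({n})^2")
    for (i, a), (j, b) in itertools.combinations(enumerate(names), 2):
        feats.append(X[:, i] * X[:, j]); labels.append(f"({a}*{b})")
        feats.append(X[:, i] + X[:, j]); labels.append(f"({a}+{b})")
    return np.column_stack(feats), labels

def sis(F, target, k):
    """Sure independence screening: indices of the k features most
    correlated (in absolute value) with the target or residual."""
    Fc = F - F.mean(axis=0)
    tc = target - target.mean()
    denom = np.linalg.norm(Fc, axis=0) * np.linalg.norm(tc)
    corr = np.abs(Fc.T @ tc) / np.where(denom > 0, denom, np.inf)
    return np.argsort(corr)[::-1][:k]

def sparsify(F, y, pool, dim):
    """Sparsifying operator: best `dim`-feature linear model within the pool."""
    best = (np.inf, None, None)
    for combo in itertools.combinations(pool, dim):
        A = np.column_stack([F[:, list(combo)], np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        rmse = float(np.sqrt(np.mean((A @ coef - y) ** 2)))
        if rmse < best[0]:
            best = (rmse, combo, coef)
    return best

def sisso(F, y, max_dim=2, k=5):
    """Alternate SIS on the current residual with an l0 search on the
    union of all screened subspaces."""
    residual, pool = y.copy(), []
    for dim in range(1, max_dim + 1):
        ranked = sis(F, residual, k + len(pool))
        pool.extend([i for i in ranked if i not in pool][:k])
        rmse, combo, coef = sparsify(F, y, pool, dim)
        A = np.column_stack([F[:, list(combo)], np.ones(len(y))])
        residual = y - A @ coef
    return rmse, combo, coef

# Synthetic usage (assumed data): recover y = 2ab - 0.5c^2 + 1.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(60, 3))
y = 2.0 * X[:, 0] * X[:, 1] - 0.5 * X[:, 2] ** 2 + 1.0
F, labels = build_candidates(X, ["a", "b", "c"])
rmse, combo, coef = sisso(F, y)
descriptor = [labels[i] for i in combo]
```

The cost of the ℓ0 step grows combinatorially with the pool size and descriptor dimension, which is why the screening step is needed before it.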
Fig. 4 (a) The method SISSO combines unified subspaces having the largest correlation with residual errors (or P) generated by sure independence screening (SIS) with a sparsifying operator (SO) to further extract the best descriptor. Reproduced with permission.72 Copyright 2018, American Physical Society. (b) Distribution of the collected experimental OER activity data on perovskite catalysts in publications up to the year 2020 (the time of starting this project). (c) Idea of sign-constrained MTL. The $i$th coefficients $\beta_i^t$ have the same sign for all tasks $t$. (d) Comparison of the activity data between the identified 2D descriptor (dB, nB) and the experiments. The colors denote the source of the data sets. Reproduced with permission.74 Copyright 2023, American Chemical Society.
With methodological advancements, SISSO has demonstrated substantial applicability in the study of catalytic materials. For example, in predicting the relative stability of octahedral binary compounds, SISSO derived analytical formulas linking energy stability to complex feature spaces, thereby establishing clear mappings between material properties.73 Building on this foundation, the method has been extensively applied to catalytic systems.33 Wang et al. introduced the sign-constrained multi-task learning (SCMT-SISSO) framework (Fig. 4b and c), which addresses discrepancies in experimental data by enforcing sign consistency of descriptor coefficients across multiple sources. Using 182 data points from 13 independent studies, they identified an effective two-dimensional descriptor, (dB, nB), where dB corresponds to the number of d electrons in the B-site metal and nB denotes its oxidation state (Fig. 4d). This descriptor enabled the screening of 36 660 perovskite materials, successfully predicting several high-performance OER catalysts whose activity was subsequently validated experimentally.74 In addition, Fung et al. employed compressed sensing to extract key descriptors for the HER in SACs.75 Similarly, in studies on selective alkene oxidation, Foppa et al. applied SISSO to reveal intrinsic correlations between key features and catalytic performance, based on 12 vanadium/manganese catalysts and 55 physicochemical parameters.76
To address challenges in practical applications, researchers have continued to optimize and extend the SISSO methodology to identify universal descriptors across diverse catalytic reactions. For example, Gong et al. proposed the physically meaningful feature engineering and selection (PFESS) framework, inspired by SISSO, and developed the ARSC descriptor with explicit physical interpretation, expressed analytically as Φ = (1 + kα) × ϕxy [62]. As illustrated in Fig. 5, the descriptor construction follows a systematic four-step process: (i) establishing primary atomic properties (A) based on the d-band shape of homonuclear sites; (ii) selecting optimal parameters (R) by incorporating reactant effects; (iii) introducing heteronuclear intermetallic synergistic effects (S) via the PFESS framework; and (iv) integrating coordination environment influences (C) to form the final ARSC descriptor. This approach demonstrates how complex descriptors can be progressively built from fundamental physical properties. Importantly, ARSC successfully unified independent experimental data from 17 types of diatomic sites across 28 publications, with activity data exhibiting high consistency on the ARSC volcano plot, thereby validating the descriptor's universality and reliability.77
Fig. 5 General workflow of our work. Firstly, a primitive descriptor (φxx) for atomic property effects through d-band shape analysis. Secondly, screening principle (φopt) of potential desirable heteronuclear DACs based on reactant effects. Thirdly, ML-based descriptors (φxy) for synergistic effects through physically meaningful feature engineering based on φxx and feature selection/sparsification algorithms. Fourthly, the final universal descriptor model (Φ) with quantification of coordination effects and corresponding experimental verifications. Reproduced with permission.77 Copyright 2024, Springer Nature.
SISSO has also shown exceptional capability in handling complex catalytic systems. Nair et al. developed an SISSO-guided active learning workflow, in which a closed-loop “predict-validate-update” mechanism enabled efficient screening of stable catalysts under acidic conditions.33 To further enhance algorithmic practicality, the RF-SISSO model was introduced, achieving a 265-fold improvement in regression efficiency compared to the original SISSO model with only 45 samples.78 Concurrently, manual feature engineering is increasingly integrated with interpretable ML; for instance, physically informed descriptors such as the “topological undercoordination number” have successfully revealed structural sensitivities in metal catalysts.79
These methodological refinements have substantially broadened the applicability of SISSO in machine-learning-assisted high-throughput screening of SACs across key electrocatalytic reactions. For example, high-throughput first-principles calculations combined with SISSO have been used to screen 192 transition-metal atoms anchored on 1T-TMD substrates for the CO2RR, where SISSO-derived descriptors linking intrinsic features to limiting potentials guided the identification of promising catalysts such as Fe@CoS2, Pt@TiTe2, and Co@CoS2 with low overpotentials and selective pathways to fuels like formic acid and methane.80 In studies focusing on the HER, SISSO has been integrated into machine-learning workflows to derive interpretable descriptors correlating adsorption energetics and electronic structure with catalytic activity, enabling rapid screening and prediction of high-performance SACs.81 Similarly, in ORR screening, SISSO-generated features such as combinations of d-electron count and Bader charge have been shown to play a critical role in predicting overpotentials and activity trends of MXene-supported single atoms.82
The SISSO method offers a robust and interpretable approach for small-sample datasets, automatically extracting optimal descriptors from large feature spaces while generating concise analytical expressions instead of black-box models. Its performance depends on the coverage of primary features, and large feature pools require pre-screening due to computational demands. Although its ability to capture highly nonlinear relationships is limited, SISSO serves as a powerful bridge between data-driven modeling and physical mechanism understanding, providing clear, actionable insights for rational design of heterogeneous catalysts.
The development of chemical graph features originated in computational chemistry and chemoinformatics, where molecular systems were first abstracted as graphs with atoms as nodes and covalent bonds as edges. A notable milestone in this field was the establishment of the GDB-9 dataset by Ramakrishnan et al., which systematically provided quantum chemical structural and property data for over 130 000 molecules. This dataset revealed the scaling behavior of chemical space with molecular size and the distribution of isomeric properties, validating the feasibility of predicting molecular properties through graph-based representations.84 With subsequent advances, the concept of chemical graphs has been extended from molecular systems to crystalline and solid-state materials, supporting broad applications in materials science and heterogeneous catalysis. Concurrently, the advent of deep learning and GNNs has transformed chemical graph features from static descriptors into end-to-end, learnable representations. This evolution has culminated in the emergence of chemical graph neural networks, establishing a new paradigm for feature engineering in heterogeneous catalytic systems.
A breakthrough in chemical graph features for materials science and heterogeneous catalysis was achieved with the introduction of the CGCNN. In this framework, crystal structures are represented as graphs (Fig. 6a), where nodes correspond to atoms in the unit cell and edges denote chemical bonds or interatomic interactions. Graph convolutional layers iteratively update atomic features by aggregating information from neighboring atoms and bonds, while pooling layers integrate these local features into global crystal descriptors for property prediction. This end-to-end learning approach effectively eliminates the dependence on manually designed descriptors. Applied to ∼47 000 crystal structures from the Materials Project, CGCNN achieved a mean absolute error of 0.039 eV per atom in formation energy prediction, surpassing conventional ML models.85 Subsequently, Chen et al. developed the multi-task crystal graph convolutional neural network (MT-CGCNN), enabling efficient and accurate simultaneous prediction of multiple material properties. This multi-task framework is particularly advantageous in limited-data scenarios and high-throughput material screening.86
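The convolution-and-pooling idea can be illustrated with a small NumPy sketch. It follows the gated update commonly associated with CGCNN, in which each atom adds a sum over neighbors of a sigmoid gate multiplied by a softplus-transformed message built from the concatenated atom and bond features. The weights here are random placeholders, normalization layers are omitted, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.logaddexp(0.0, x)

def cgcnn_conv(v, edges, e_feat, Wf, bf, Ws, bs):
    """One gated graph-convolution step.

    v      : (N, d) atom feature matrix
    edges  : list of directed pairs (i, j); message flows from j to i
    e_feat : (E, de) bond features, one row per entry in `edges`
    Wf, Ws : (2*d + de, d) gate and message weights; bf, bs : (d,) biases
    """
    out = v.copy()
    for (i, j), e in zip(edges, e_feat):
        z = np.concatenate([v[i], v[j], e])  # atom i, neighbor j, bond ij
        out[i] += sigmoid(z @ Wf + bf) * softplus(z @ Ws + bs)
    return out

def mean_pool(v):
    """Average atom features into a single crystal-level vector."""
    return v.mean(axis=0)

# Toy usage with assumed sizes: 3 atoms, 4 atom features, 2 bond features.
rng = np.random.default_rng(0)
v = rng.normal(size=(3, 4))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
e_feat = rng.normal(size=(4, 2))
Wf, Ws = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))
bf, bs = np.zeros(4), np.zeros(4)
crystal_vec = mean_pool(cgcnn_conv(v, edges, e_feat, Wf, bf, Ws, bs))
```

Because the update sums over neighbors and the pooling averages over atoms, the crystal-level vector does not depend on how the atoms are numbered.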
Fig. 6 (a) Construction of the reaction network for urea electrosynthesis on a nitrogen-doped carbon catalyst. Reproduced with permission.56 Copyright 2018, American Physical Society. (b) A scheme of the overall reaction network, the typical elementary steps, and the example of adsorption configuration enumeration. Green dots represent reaction intermediates in the classical mechanism, while orange dots represent those in the MvK mechanism. The average DFT calculation costs (percentage of reaction intermediates calculated by DFT) for the prediction of the (c) classical mechanism and (d) MvK mechanism with and without the GSP algorithm. Reproduced with permission.95 Copyright 2025, Chinese Chemical Society.
Researchers have leveraged the high efficiency of CGCNN in predicting energetic properties to enable high-throughput screening and prediction of catalysts.25,87–90 Kim et al. introduced the Surface Graph Convolutional Neural Network (SGCNN), tailored to predict the binding energies of key adsorbates (*H, *N2, *N2H, *NH, and *N2) relevant to the NRR. Using only low-dimensional inputs such as elemental properties and atomic connectivity, SGCNN achieved a mean absolute error of 0.23 eV on a dataset of 3040 DFT-calculated surfaces.91 For HER catalysts, Zheng et al. employed an improved CGCNN model (ASB-GCNN) that partitions crystal geometry into active, surface, and bulk layers to screen 600 MA2Z4-based materials, identifying five promising SACs, including V1/HfSn2N4(S) with a near-ideal ΔGH* of 0.06 eV, thereby demonstrating efficient structure–activity mapping.92 For OER catalysts, Back et al. combined DFT and CGCNN to identify low-index IrO2 surfaces with lower overpotentials than the rutile(110) benchmark, highlighting GNN-assisted screening of active facets.93 In the CO2RR, Gu et al. used labeled site representations within a GNN framework to predict CO adsorption energies with an MAE of 0.116 eV, enabling rapid evaluation of diverse PdxTi1−xHy surfaces.94 Collectively, these developments highlight how graph-based neural networks have redefined chemical feature engineering, providing scalable and physically grounded representations for heterogeneous catalysis.
Chemical graph features have also been extended to the automated construction and exploration of complex reaction networks. Zheng et al. combined graph theory with active learning to model the electro-synthesis of urea, representing reactants as molecular graphs (atoms as nodes, bonds as edges; Fig. 6b). Graph editing operations enabled automated simulation of elementary steps, constructing a reaction network with hundreds of intermediates. The graph stability prediction (GSP) algorithm reduced DFT computational cost by ∼40% while maintaining accurate pathway identification (Fig. 6c and d).95
As model scale and catalytic system complexity increase, traditional GNNs face challenges in accurately capturing multiple adsorbates and diverse bond interactions. To address this, Bang et al. proposed the bond-type embedded crystal graph convolutional neural network (BE-CGCNN), which explicitly distinguishes and embeds four bond types: covalent, metallic, chemisorption, and nonbonded interactions, allowing more precise representation of nanoparticle surface chemistry (Fig. 7a and b). This approach discards distance-dependent features in favor of one-hot encoded bond types, enhancing robustness for unrelaxed structures and achieving a MAE of 0.07 eV in Pt *OH adsorption predictions (Fig. 7c).96 For limited-data systems, Xu et al. developed a simplified crystal graph neural network with adaptive feature encoding (S-CGCNN), maintaining high predictive accuracy under small-sample conditions.97 Chemical graph features are increasingly recognized as universal descriptors linking atomic structures to macroscopic properties, providing a foundation for high-throughput screening, multi-task learning, and active site mechanistic analysis.98,99
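The bond-type embedding can be illustrated with a very small sketch. It assumes that each edge already carries one of the four labels named above; how BE-CGCNN assigns those labels is not reproduced here, and the names `BOND_TYPES`, `bond_one_hot`, and `encode_bonds` are hypothetical.

```python
import numpy as np

# The four categories named in the text; the ordering is an arbitrary assumption.
BOND_TYPES = ("covalent", "metallic", "chemisorption", "nonbonded")

def bond_one_hot(kind):
    """Return a one-hot vector for a bond-type label."""
    vec = np.zeros(len(BOND_TYPES))
    vec[BOND_TYPES.index(kind)] = 1.0
    return vec

def encode_bonds(bonds):
    """bonds: list of (i, j, kind) tuples -> (E, 4) edge-feature matrix."""
    return np.stack([bond_one_hot(kind) for _, _, kind in bonds])

# Toy example (assumed labels): *OH on a Pt site, atoms 0-1 = Pt, 2 = O, 3 = H.
bonds = [(0, 1, "metallic"), (0, 2, "chemisorption"), (2, 3, "covalent")]
edge_features = encode_bonds(bonds)
```

Since these edge features depend only on the labels and not on interatomic distances, they stay the same when the atomic positions of an unrelaxed structure shift slightly, provided the labels are unchanged.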
Fig. 7 (a) A schematic representation of the graph convolution neural network model to predict the adsorption energy; (b) representation of bond embedding. Each bond is embedded into a bond vector by one-hot encoding of the bond type; (c) comparison of BE-CGCNN and CGCNN predictions for adsorption energy differences on the OH adsorbate dataset. Reproduced with permission.96 Copyright 2023, Springer Nature.
SchNet, introduced by Schütt et al., pioneered the use of continuous-filter convolution to model quantum interactions in molecular and crystalline systems, enabling direct learning of potential energy surfaces from atomic 3D coordinates. Unlike traditional graph-based chemical descriptors, SchNet represents atoms as nodes and uses interatomic distances as edge features, expanded continuously via radial basis functions. As illustrated in Fig. 8a, a learnable filter network processes this geometric information through multiple interaction blocks, progressively updating atomic features to output rotationally invariant total energies and rotationally covariant atomic forces.100 Building on this approach, Chen et al. developed the MEGNet (Materials Graph Network) framework, which unifies the treatment of molecules and crystals within a graph neural network by incorporating multi-level updates for atoms, bonds, and global state variables (e.g., temperature, pressure) (Fig. 8b). Trained on ∼69 000 crystals from the Materials Project, MEGNet outperformed prior models such as CGCNN in predicting crystal formation energies and bulk moduli.101
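The continuous-filter convolution can be sketched as follows, as a simplified illustration rather than the SchNet reference code. Distances are expanded in Gaussian radial basis functions, a small network turns each expansion into a filter, and the filter multiplies the neighbor features elementwise before summation. The two-layer filter network, the absence of a cutoff, the plain softplus activation, and all names are assumptions of this sketch.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def rbf_expand(d, centers, gamma):
    """Gaussian radial-basis expansion: (...,) distances -> (..., K) features."""
    return np.exp(-gamma * (d[..., None] - centers) ** 2)

def cfconv(x, pos, W1, W2, centers, gamma):
    """Continuous-filter convolution over all atom pairs.

    x   : (N, F) atom features
    pos : (N, 3) Cartesian coordinates
    W1  : (K, H) and W2 : (H, F) weights of the filter-generating network
    """
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)  # (N, N)
    filt = softplus(rbf_expand(d, centers, gamma) @ W1) @ W2        # (N, N, F)
    not_self = 1.0 - np.eye(len(pos))                               # drop i == j
    return np.einsum("ij,ijf,jf->if", not_self, filt, x)

# Toy usage with assumed sizes: 4 atoms, 3 features, 8 basis functions.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 3))
pos = rng.normal(size=(4, 3))
centers = np.linspace(0.0, 4.0, 8)
W1, W2 = rng.normal(size=(8, 5)), rng.normal(size=(5, 3))
updated = cfconv(x, pos, W1, W2, centers, gamma=10.0)
```

Because the filter depends on positions only through interatomic distances, the updated features are unchanged by rigid rotations and translations of the structure, and they vary smoothly as atoms move.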
Fig. 8 (a) The discrete filter (left) is not able to capture the subtle positional changes of the atoms, resulting in discontinuous energy predictions E (bottom left). The continuous filter captures these changes and yields smooth energy predictions (bottom right). Reproduced with permission.100 Copyright 2017, the Authors. (b) A MEGNet module starts with atomic attributes $V = \{v_i\}_{i=1:N^v}$, bond attributes $E = \{(e_k, r_k, s_k)\}_{k=1:N^e}$, and global state attributes. Through sequential updates of bonds, atoms, and the global state, information flows among all three, yielding a new graph representation. Reproduced with permission.101 Copyright 2019, American Chemical Society.
The DimeNet model (Fig. 9a) overcomes the limitations of distance-only geometric representations by introducing directional message passing. Using a two-dimensional basis of spherical Bessel and spherical harmonic functions, it explicitly encodes both distances and bond angles, enabling precise modeling of local directional interactions.102 This approach is particularly effective for analyzing conformational evolution in multi-step reactions such as the OER and CO2RR. Building on this, Li et al. proposed LEPool-DimeNet++, which incorporates local environment pooling to improve adsorption energy predictions. The model achieved mean absolute errors (MAEs) of 0.096 eV and 0.073 eV for *CO and *H adsorption energies, respectively, outperforming previous state-of-the-art models.104 Further advancing geometric graph representations, Choudhary et al. developed ALIGNN (Atomistic Line Graph Neural Network) (Fig. 9b). ALIGNN performs message passing simultaneously on atomic graphs (nodes = atoms, edges = bonds) and line graphs (nodes = bonds, edges = bond angles), collaboratively updating atomic, bond, and bond-angle features. This allows accurate capture of local geometric configurations at surface active sites and shows exceptional performance in adsorption energy prediction, conformational stability analysis, and modeling reaction intermediates.103 Geometric graph neural networks also extend to complex organic molecules. GAME-Net, developed by Pablo-García et al. (Fig. 10), predicts adsorption energies of organic molecules on metal surfaces with an MAE of 0.18 eV, achieving approximately six orders of magnitude faster computation than traditional DFT, thereby enabling high-throughput screening of heterogeneous catalysts.26
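The line-graph construction used by ALIGNN can be illustrated with a short sketch. Each bond becomes a node of the line graph, and two bonds are connected when they share an atom, with the enclosed bond angle as the edge feature. The function name `line_graph` and the idealized SiO4 coordinates are assumptions for illustration, and bonds are treated as undirected pairs.

```python
import itertools
import numpy as np

def line_graph(pos, bonds):
    """Build line-graph edges from an undirected bond list.

    pos   : (N, 3) atomic coordinates
    bonds : list of (i, j) atom-index pairs
    Returns a list of (bond_a, bond_b, angle_in_radians) for every pair of
    bonds that share exactly one atom.
    """
    triplets = []
    for a, b in itertools.combinations(range(len(bonds)), 2):
        shared = set(bonds[a]) & set(bonds[b])
        if len(shared) != 1:
            continue
        c = shared.pop()                                  # common atom
        u = pos[(set(bonds[a]) - {c}).pop()] - pos[c]
        w = pos[(set(bonds[b]) - {c}).pop()] - pos[c]
        cos_angle = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))
        angle = float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
        triplets.append((a, b, angle))
    return triplets

# Idealized SiO4 tetrahedron: Si at the origin, four O at alternating cube corners.
pos = np.array([[0, 0, 0], [1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], float)
bonds = [(0, 1), (0, 2), (0, 3), (0, 4)]   # the four Si-O bonds
angles = line_graph(pos, bonds)            # six O-Si-O angle edges
```

For this ideal tetrahedron each of the six line-graph edges carries the tetrahedral angle arccos(−1/3), about 109.5°; message passing on this second graph is what lets bond-angle information reach the atom features.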
| Fig. 9 (a) The DimeNet architecture represents distances ($d_{ji}$) using spherical Bessel functions and distances ($d_{kj}$) with angles ($a_{(kj,ji)}$) using a 2D spherical Fourier-Bessel basis. An embedding block generates initial message embeddings ($m_{ji}$), which are updated through multiple interaction blocks via directional message passing using neighboring messages ($m_{kj}$), 2D representations ($a_{\mathrm{SBF}}^{(kj,ji)}$), and distance representations ($e_{\mathrm{RBF}}^{(ji)}$). Each block outputs transformed embeddings using $e_{\mathrm{RBF}}^{(ji)}$ and sums them per atom, and the outputs of all layers are finally summed to yield the prediction. Reproduced with permission.102 Copyright 2022, the Authors. (b) An undirected crystal graph representation and the corresponding line graph construction of a SiO4 polyhedron. For clarity, only Si–O bonds are shown. The ALIGNN convolution layer alternates message passing between the bond graph (left) and the line graph (bond adjacency graph, right). Reproduced with permission.103 Copyright 2021, Springer Nature. | ||
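To make the distance representation in panel (a) more concrete, the sketch below evaluates a zeroth-order spherical Bessel radial basis of the form sqrt(2/c)·sin(nπd/c)/d, which is the form we recall from the DimeNet formulation. It should be read as an assumption-laden illustration: the smooth cutoff envelope used in practice is omitted, and the function name `bessel_rbf` and the default values of `cutoff` and `num_radial` are illustrative choices, not taken from the text above.

```python
import math


def bessel_rbf(d, cutoff=5.0, num_radial=6):
    """Expand a distance d (0 < d <= cutoff) in a sine-based Bessel radial basis.

    Each basis function is sqrt(2/c) * sin(n*pi*d/c) / d for n = 1..num_radial.
    The envelope function that enforces a smooth cutoff is omitted here.
    """
    prefactor = math.sqrt(2.0 / cutoff)
    return [
        prefactor * math.sin(n * math.pi * d / cutoff) / d
        for n in range(1, num_radial + 1)
    ]


features_mid = bessel_rbf(2.5)   # distance in the middle of the cutoff range
features_edge = bessel_rbf(5.0)  # every basis function vanishes at the cutoff
print([round(x, 4) for x in features_mid])
```

Because every basis function varies continuously with distance, a small displacement of an atom produces a small change in the feature vector, which is the property that distinguishes continuous from discrete filters.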
| Fig. 10 Schematic illustration of the GAME-Net workflow. Starting from the DFT FG-dataset of small adsorbates (3315 samples), adsorption systems are converted into graph representations to train the proposed GNN architecture. GAME-Net is then applied to predict adsorption energies of larger molecules (C < 23) on metal surfaces in the BM-dataset, eliminating the need for costly DFT calculations. Reproduced with permission.26 Copyright 2023, Springer Nature. | ||
Geometric graph features naturally encode three-dimensional structural information. This makes them well-suited for complex catalytic systems, including alloys, surfaces, and interfaces. Nevertheless, these features face several challenges. They are sensitive to unrelaxed structures and highly dependent on precise atomic positions, often requiring large training datasets for complex systems. Their high computational complexity also places greater demands on algorithms and hardware.
| Fig. 11 (a) Schematic illustration of the persistent homology methodology for point clouds. Each point cloud is transformed into a geometric object through filtration, during which topological data features (e.g., holes) appear and disappear. Their birth and death values are recorded in a persistence diagram (x = birth, y = death, persistence = y − x), capturing the topological evolution of the data. The persistence diagram enables distinguishing point clouds of different shapes and clustering those with similar topology, as demonstrated by the representative classification plot. Reproduced with permission.108 Copyright 2021, Elsevier. (b) The flowchart illustrates the development of the topological descriptor for hMOF5035530. In the pore geometry barcode, the horizontal axis represents the filter radius, while the vertical axis indicates the number of barcodes. Reproduced with permission.117 Copyright 2024, MDPI. | ||
In heterogeneous catalysis, adsorption energies at active sites critically determine catalyst activity, selectivity, and stability. These energies strongly depend on local atomic configurations, coordination environments, and electronic structures, making the establishment of universal structure–performance relationships challenging. Topological data features provide a machine-learning-compatible representation that captures complex three-dimensional structures. For example, in metal–organic frameworks (MOFs), pore topology governs gas adsorption behavior. Yang et al. applied TDA to convert MOF crystal structures into descriptors quantifying pore connectivity, ring structures, and cavity distribution (Fig. 11b). When combined with an extreme gradient boosting (XGBoost) model, this approach substantially outperformed traditional geometric descriptors in predicting C1–C3 alkane adsorption performance.117
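As a minimal illustration of how birth and death values arise, the sketch below computes zero-dimensional persistent homology (connected components) of a small point cloud with a union-find structure. It is a simplified stand-in for a full TDA library: it covers only connected components, not holes or cavities, and it assumes the filtration value is the pairwise distance at which two points become connected (some conventions use half that distance). The function name `h0_persistence` and the toy point cloud are hypothetical.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return (birth, death) pairs for connected components of a point cloud.

    All components are born at 0; a component dies at the edge length where it
    merges into another one. One component never dies (death = infinity).
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Process candidate edges in order of increasing length (the filtration).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            pairs.append((0.0, length))
    pairs.append((0.0, math.inf))
    return pairs


# Two well-separated pairs of points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
diagram = h0_persistence(points)
finite_deaths = sorted(death for _, death in diagram if death != math.inf)
print(diagram)
```

The two short-lived components (death at 1.0) reflect the tight pairs, while the long-lived one (death at 9.0) reflects the gap between the two clusters. Persistence, death minus birth, therefore separates local from global structure.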
TDA can capture structural information that is difficult to encode using conventional graph-based representations. For example, in metal–nitrogen–carbon SACs, local curvature plays a critical role in modulating the geometric environment. However, such curvature does not directly alter bonding connectivity and therefore remains challenging to represent using standard graph encodings. Liang et al. developed the persistent homology-enhanced crystal graph convolutional neural network (PH-CGCNN) (Fig. 12), which uses persistent homology to embed curvature-induced microstructural variations into graph neural network features. The barcodes generated by persistent homology can effectively distinguish structural variations induced by different curvature conditions.
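Barcodes are variable-length sets of intervals, so they must be converted into fixed-length vectors before they can be combined with graph features. The sketch below shows one common vectorization, a Betti curve (the number of bars alive at each sampled filtration value), followed by simple concatenation with a graph-level embedding. How PH-CGCNN vectorizes and fuses its barcodes is not specified here, so both choices are assumptions, and the barcode, grid, and embedding values are made-up placeholders.

```python
def betti_curve(barcode, grid):
    """Count the bars (birth, death) that are alive at each filtration value."""
    return [
        sum(1 for birth, death in barcode if birth <= r < death)
        for r in grid
    ]


# Placeholder barcode: four bars given as (birth, death) pairs.
barcode = [(0.0, 1.0), (0.0, 1.0), (0.0, 9.0), (0.5, 2.0)]
grid = [0.0, 0.75, 1.5, 5.0]  # filtration values at which the curve is sampled

topo_features = betti_curve(barcode, grid)

# Placeholder graph-level embedding, e.g. the pooled output of a GNN.
graph_embedding = [0.12, -0.40, 0.88]

# Simple fusion by concatenation; the combined vector would feed a regressor.
combined = graph_embedding + [float(x) for x in topo_features]
print(topo_features, combined)
```

Two structures with identical bonding connectivity but different curvature yield the same graph embedding inputs yet different barcodes, so the appended topological block is what lets a downstream model tell them apart.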
| Fig. 12 Architecture of the PH-CGCNN model, combining atomic graph representations from standard CGCNN with persistent homology-derived curvature features to predict adsorption energies. Reproduced with permission.119 Copyright 2025, Elsevier. | ||
Beyond interpretation and prediction, topological approaches have been extended to the inverse design of active sites. Wang et al. developed a topology-based variational autoencoder framework (PGH-VAEs) that represents catalytic sites using persistent GLMY homology features (Fig. 13a–c). This method quantifies the relationship between three-dimensional structural sensitivity and adsorption properties, enabling interpretable design of high-entropy alloy active sites. Applied to the IrPdPtRhRu system, it revealed synergistic regulation of *OH adsorption energies by coordination and ligand effects. Latent-space analysis identified Pt–Pd bridging sites combined with distal Ru atoms as optimal configurations, exhibiting higher *OH adsorption energies than Pt(111). Incorporating second-neighbor Ru further modulated the d-band center, optimizing the ORR pathway and enhancing catalytic efficiency and poisoning resistance.120
| Fig. 13 (a) Schematic of feature construction, where coordination features are obtained via the PGH method and ligand features are represented using elemental properties. (b) Dataset construction using DFT and semi-supervised learning: a GBR model is first trained on DFT-calculated adsorption energies and then used to predict energies for additional simulated active sites, creating an expanded pseudo-labeled dataset for model training. (c) Framework of PGH-VAEs, showing modules for encoding, latent space visualization, sampling, and decoding to generate potential active sites. Reproduced with permission.120 Copyright 2025, Chinese Academy of Sciences. | ||
Zheng et al. proposed PH-SA (Fig. 14a–d) to efficiently explore active phase configurations using TDA. The method decomposes structures into atomic aggregates, identifies potential adsorption sites via persistent homology, and generates configurations through combinatorial enumeration. Machine learning force fields optimize these structures, and Pourbaix diagrams track phase evolution under external conditions. PH-SA samples surface, subsurface, and bulk sites for both slabs and clusters, overcoming limitations of intuition-based approaches. In Pd hydrogenation and Pt cluster oxidation, it accurately predicted structural rearrangements and reactivity, providing an efficient framework for discovering active phases and elucidating catalytic mechanisms.121
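The combinatorial enumeration step can be pictured with a short sketch. Given a list of candidate sites, the code generates every occupation pattern up to a chosen number of adsorbates. This is only an illustration of the counting involved, not the PH-SA code: the site labels and the function name `enumerate_configurations` are hypothetical, and symmetry-equivalent configurations are not removed here.

```python
from itertools import combinations


def enumerate_configurations(sites, max_adsorbates):
    """List all ways to occupy between 0 and max_adsorbates of the given sites."""
    configurations = []
    for n_occupied in range(max_adsorbates + 1):
        configurations.extend(combinations(sites, n_occupied))
    return configurations


# Hypothetical candidate sites, e.g. as identified by a topological analysis.
sites = ["top_1", "bridge_1", "hollow_1", "subsurface_1"]
configurations = enumerate_configurations(sites, max_adsorbates=2)

print(len(configurations))
for config in configurations[:6]:
    print(config)
```

With four sites and at most two adsorbates there are 1 + 4 + 6 = 11 configurations. The count grows combinatorially with the number of sites, which is why fast structure optimization with machine learning force fields matters for this kind of workflow.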
| Fig. 14 (a) The PH-SA decomposes material structures into small atomic aggregates, using persistent homology to identify potential interaction sites within each unit. Combining the sites from all aggregates yields the potential active sites for species across the entire structure. (b) Identified sites are used in combinatorial enumeration to generate a set of structures. (c) A machine learning force field (MLFF) is trained via transfer learning to improve computational efficiency. (d) The Pourbaix diagram under specific conditions is constructed to aid catalytic mechanism analysis. Reproduced with permission.121 Copyright 2025, Springer Nature. | ||
Algebraic topology methods provide multiscale representations for designing heterogeneous catalysts and predicting performance. Challenges include integrating element-specific information and combining topological descriptors with electronic structure features. Techniques such as attention mechanisms, multimodal learning, and dynamic descriptors can enhance the expressiveness and applicability of topological descriptors. With further development, these descriptors hold promise for elucidating catalytic mechanisms, guiding material design, and accelerating catalyst discovery.
Textual representations can complement traditional descriptors, supporting multimodal approaches for catalyst design. Catalyst generative pretrained transformer (CatGPT) generates chemically valid string representations of inorganic catalysts from text inputs. When fine-tuned on specialized datasets, such as binary alloys for the two-electron ORR, it can propose candidate structures tailored to specific catalytic applications.123 As a specific implementation of multimodal fusion, structure-text alignment aims to connect textual material descriptions with numerical structural features. Ock et al. developed a graph-assisted pretraining framework that integrates atomic structures with textual descriptions for predicting adsorption energies. As shown in Fig. 15a, the framework operates through two main stages: self-supervised graph-text alignment pretraining followed by supervised fine-tuning for energy prediction. This approach aligns graph embeddings from an equivariant graph neural network (EquiformerV2) with text embeddings from a Transformer language model (CatBERTa) in a shared latent space (Fig. 15b). The framework also incorporates a large language model (CrystalLLM) to generate structural descriptions from simplified chemical inputs, enabling reasonable energy predictions without complete atomic coordinates (Fig. 15c). This method offers a potential pathway for utilizing textual information from the literature to support catalyst screening.21
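Aligning two encoders in a shared latent space is often done with a symmetric contrastive objective, in which the graph and text embeddings of the same system are pulled together and those of different systems are pushed apart. The sketch below implements such a loss in plain Python. The specific objective used in the framework described above is not stated here, so the contrastive form, the `temperature` value, the function names, and the toy embeddings are all assumptions made for illustration.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the maximum for numerical stability
        log_partition = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def alignment_loss(graph_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    g = [normalize(v) for v in graph_embeddings]
    t = [normalize(v) for v in text_embeddings]
    # Cosine similarity between every graph and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(gi, tj)) / temperature for tj in t]
        for gi in g
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (
        diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_transposed)
    )


# Toy batch of two systems with 2D embeddings.
graph_batch = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[0.9, 0.1], [0.1, 0.9]]  # text i describes graph i
text_swapped = [[0.1, 0.9], [0.9, 0.1]]  # descriptions assigned to the wrong graph

loss_matched = alignment_loss(graph_batch, text_matched)
loss_swapped = alignment_loss(graph_batch, text_swapped)
print(loss_matched, loss_swapped)
```

The loss is small when each text embedding lies closest to its own graph embedding and large when the pairing is wrong, so minimizing it drives the text encoder toward the geometry-aware representation of the graph encoder.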
| Fig. 15 (a) The training involves two steps: graph-assisted pretraining followed by energy prediction fine-tuning. (b) The CatBERTa model is used as the text encoder. (c) The graph encoder, with the final-layer graph embeddings reshaped and max-pooled into a 1D format. The architecture is reproduced from the original EquiformerV2 publication.124 Reproduced with permission.21 Copyright 2025, Springer Nature. | ||
In multimodal catalyst screening, spectroscopic descriptors complement conventional structural representations and enable integrated AI-driven design. To improve catalyst screening efficiency, Yang et al. developed a cross-modal encoder–decoder framework that integrates spectroscopic and structural descriptors for comprehensive chemical representation. Using pretraining strategies based on property regression and masked-mode prediction, the model enables bidirectional translation between molecular geometries and vibrational spectra. In CO/NO adsorption on Ag/Au surfaces, the framework exploits the complementarity of infrared and Raman signals to accurately predict adsorption properties and internal coordinates (RMSE ≈ 0.01 Å). This method addresses the information insufficiency of single-modality representations and supports multi-objective prediction and data recovery in complex catalyst design.125 Subsequently, Zhao et al. extended multimodal fusion to organic molecular structure elucidation by proposing a framework based on one-dimensional convolutional neural networks (1D-CNNs). By integrating infrared, Raman, and nuclear magnetic resonance spectra, the method effectively leverages the complementary strengths of vibrational and magnetic resonance information, enabling automated identification and quantification of functional groups.126 Although several multimodal alignment approaches integrating spectra, structures, and text have emerged in recent years, such as TranSpec and SpecGNN, which enable bidirectional translation between vibrational spectra and SMILES representations, their application in catalysis remains relatively unexplored.127
Compared to traditional single-modality descriptors, multimodal feature representations capture the complex structure–property relationships in catalytic systems more comprehensively. As a result, they have emerged as a critical direction in structural feature engineering, showing great potential for enhancing both the generalization ability and the physical interpretability of ML models.
In summary, this review traces advances in feature engineering for heterogeneous catalysis, from empirical descriptors to data-driven methods. Early work relied on intuitive descriptors such as the atomic radius and electronegativity, which link electronic structure to activity but struggle with high-dimensional complexity. Symbolic regression methods, such as SISSO, automate the discovery of low-dimensional, interpretable descriptors, improving prediction and mechanistic insight. GNNs model catalysts as atomistic graphs, capturing local environments and multi-site interactions and thereby enhancing adsorption and activity predictions. TDA methods, such as persistent homology, provide multiscale insights into structural connectivity. Integrating multimodal data from computational, experimental, and literature sources enables mechanism-informed “super-descriptors” that bridge prediction and understanding. Overall, feature engineering advances both predictive accuracy and mechanistic insight, supporting the rational, data-driven design of next-generation heterogeneous catalysts.
| This journal is © the Owner Societies 2026 |