Open Access Article
Elena Patyukova*a,
Junwen Yin
*a,
Susmita Basakb,
Samuel Pinilla
bc,
Alin M. Elenaa and
Gilberto Teobaldi
b
aScientific Computing Department, Daresbury Laboratory, STFC, UKRI, UK. E-mail: patyukova@gmail.com; junwen.yin@stfc.ac.uk
bScientific Computing Department, Rutherford Appleton Laboratory, STFC, UKRI, UK
cDiamond Light Source, Harwell Science and Innovation Park, Didcot OX11 0DE, UK
First published on 15th June 2026
Performing density functional theory (DFT) calculations requires a careful choice of computational parameters to ensure convergence and obtain meaningful results. This represents a particularly important problem for high-throughput and agentic workflows, where due to computational cost, any additional convergence studies are preferably avoided. So, there is a need for tools and models which are able to predict DFT parameters from basic input information, such as a structure. In this work, we develop a machine learning approach to predict the appropriate k-point sampling in DFT calculations and generate the input files for Quantum Espresso self-consistent field calculations. To achieve this, we first generated a training dataset comprising over 20
000 materials, each with an energy convergence threshold of 1 meV per atom. Several ML models were evaluated for their ability to predict k-point distance, and uncertainty estimation was incorporated to guarantee that, for at least 85–95% of compounds, the predicted k-distance lies within the convergence region. The best-performing models are made publicly available through an open-access web application.
Choosing the right parameters is crucial for achieving accurate results while optimizing computational resources, as improper settings can lead to convergence issues or unnecessarily long computation times. However, choosing the optimal strategy for parameter choice is a non-trivial task, as, in general, the given level of accuracy in a property can be achieved with multiple sets of parameter combinations; the effects of different parameters are often interdependent and not well understood.
One of the most common approaches to parameter selection in high-throughput calculations is to fix all relevant parameters at certain values for all compounds, usually at sufficiently large k-point densities and energy/density cutoffs. The situation is further complicated by the fact that the rationale behind the chosen fixed parameter values is often not reported. In this situation, there is little control or understanding of the associated errors, and calculations often turn out to be partially over-converged and partially under-converged.
In recent years, there has been great interest and progress in automating high-throughput workflows both through rigid-logic automation systems such as AiiDA,1 Atomate2,2 Pyiron,3 and AFLOW4 and new emerging agentic LLM-driven frameworks.5–7 Fixed workflow systems provide robust workflows, data management tools, HPC-interfaces, input/output processing and analysis, and help standardise DFT simulations, result reproducibility, and reuse. Emerging agentic frameworks use more flexible approaches and aim at performing the full cycle of calculations and data analysis in an automated manner. Potentially, these frameworks can give non-experts and robots wide access to theoretical tools. However, the choice of calculation parameters for those workflows remains a problem in both cases.
In this work, we aim to improve on the conventional baseline of fixed parameters for single-point SCF energy calculations conducted using Quantum ESPRESSO8 by developing a tool using machine learning to predict optimal calculation parameters. By sharing this tool, we aim to help to improve the quality of high-throughput datasets and make calculations greener by reducing unnecessary calculation time.
We begin by reviewing prior work in the field. Based on this analysis and our own convergence studies, we identify an optimal strategy for data generation and subsequently construct the dataset. We then benchmark different models to determine which achieves the lowest prediction errors for k-point density. A key consideration in this task is uncertainty estimation, since the cost of prediction errors is asymmetric: underestimating the required k-point density can lead to inaccurate results, while overestimating it only increases computational expense. To address this, we predict the 0.85, 0.9, and 0.95 quantiles rather than the median (0.5 quantile), ensuring that at least 85%, 90%, and 95% of our predictions are not underestimated.
Finally, we encapsulate our findings in a practical web application that automatically generates input files for single-point SCF calculations in Quantum ESPRESSO for any given structure.
1. A large-scale convergence dataset of 20
178 compounds, generated with consistent SSSP-1.3 Efficiency pseudopotentials and cold smearing, providing reliable reference k-meshes for SCF single-point energy calculation, with a convergence criterion of 1 meV per atom.
2. A systematic evaluation of machine learning models (Random Forest, Gradient Boosting, CrabNet, CGCNN, and ALIGNN) and feature types (composition, structure, SOAP, lattice descriptors, and metallicity embeddings) for predicting optimal k-point densities.
3. A compact and interpretable rule-set derived via surrogate decision-tree modeling, offering intuitive insight into the key physical and chemical factors influencing k-point spacing.
4. An uncertainty estimation framework using conformalised quantile regression (CQR), ensuring that predictions satisfy probabilistic guarantees and avoid underestimating required k-point density.
5. A publicly available web application that automatically generates fully specified Quantum ESPRESSO SCF input files using the trained models.
6. Quantitative demonstration of computational savings, showing that ML-driven predictions reduce unnecessary k-mesh density and thereby lower computation time relative to fixed-density baselines.
000 materials, using the Vienna ab initio Simulation Package (VASP). The authors used projector-augmented wave (PAW) pseudopotentials and Gaussian smearing (with a fixed value of 0.01 eV) for all calculations. The authors tried to find correlations between k-points, cut-off parameters, and other material properties such as density, number of electrons, the slope of bands, number of band-crossings, the maximum plane-wave cutoff used for pseudopotential generation, crystal system, the number of unique species in the material, and exchange–correlation functional. They discovered that: (1) convergence is nearly independent of the exchange–correlation functional; (2) the converged plane wave cutoff is significantly correlated with the maximum cutoff used in pseudopotential generation (ENMAX) (Pearson's correlation ≃ 0.64); (3) strong correlation exists between the slope of a band crossing at the Fermi surface for metals and the required k-point density (Pearson's correlation ≃ 0.66); (4) certain preferences exist for using higher k-points (often in cubic and hexagonal) vs. higher plane-wave cut-off energies (often in triclinic) for different crystal systems; (5) the presence of some elements, for example, transition metals required on average high k-point densities; the presence of electronegative elements (O and F) required high cut-offs. The results imply that the derived convergence data would be expected to depend strongly on the pseudopotential used; therefore one cannot use them for calculations with Quantum ESPRESSO and Quantum ESPRESSO's set of pseudopotentials.
Janssen et al.10 conducted a systematic study aimed at understanding the optimal way to choose energy cutoffs and k-point density for DFT calculations with VASP (so smearing was fixed to default VASP values). They considered energy–volume dependence and derived quantities (the bulk modulus in particular as it is more sensitive to fitting/convergence errors) as the target properties to analyse convergence in several elemental metal compounds. They decomposed the total error into a systematic part, obtained by comparing results at finite cutoffs and k-point meshes against highly converged references, and a statistical part, identified as the residual oscillatory noise arising from discrete changes in the number of plane waves or k-points. This separation enabled the construction of error phase diagrams, distinguishing regimes where systematic or statistical errors dominate. Their analysis showed that systematic errors (e.g., from insufficient cutoffs) dominate for most elements, while statistical errors can prevail in transition metals such as Cu, Pd, and Ag. They concluded that in the statistical regime, it is most efficient to refine the cheaper parameter (typically the k-points), whereas in the systematic regime, both the cutoff energy and k-point density must be increased simultaneously.
The Quantum ESPRESSO community paid a lot of attention to the systematic verification of pseudopotentials and pseudopotential-based codes. The verified codes included those using different numerical basis sets, such as plane waves, Gaussian functions combined with plane waves, Daubechies wavelets, and atomic orbitals.11–23 They designed a testing protocol (SSSP – Standard Solid-State Pseudopotential protocol) to benchmark pseudopotentials in Quantum ESPRESSO across multiple tasks (equations of state, convergence of phonon frequencies, cohesive energies, pressures, and band structures with respect to plane-wave cutoffs). They tested available pseudopotentials on elemental crystals (with rare-earth nitrides and SiF4 being the exceptions). These investigations resulted in the list of recommended pseudopotentials (including ultrasoft, norm-conserving, and PAW) for each element in the periodic table and a recommended energy/density cut-off pair for these pseudopotentials. The authors also pointed out that the required cut-off depends strongly on the property of interest. They also noted the independence of convergence patterns for local/semi-local functionals (whereas for hybrid/meta-GGA, no dedicated pseudopotentials existed).
Nascimento et al.24 extended this line of work by analysing the interplay between k-point sampling and electronic smearing in Quantum ESPRESSO calculations. Building on the SSSP pseudopotential library—with its recommended plane-wave and charge-density cutoffs—they carried out a systematic study on a dataset of 269 experimentally known compounds (including oxides, metals, and insulators, both 3D and 2D) selected from the Materials Cloud databases, with unit cells containing up to 14 atoms. Their goal was to determine how different choices of k-point density and smearing affect both numerical accuracy and the fraction of calculations that fail to converge. The guiding principle was that sampling (statistical) errors from insufficient k-points should be suppressed in favour of controlled systematic errors introduced by finite smearing, since the latter are predictable and tunable. Using Marzari–Vanderbilt cold smearing, they identified three practical protocols—fast, balanced, and stringent—that offer trade-offs between efficiency and precision. The balanced protocol (smearing temperature σ = 0.02 Ry, k-point spacing λ = 0.15 Å−1) was shown to work well for most systems, while the stringent protocol (σ = 0.0125 Ry, λ = 0.1 Å−1) was necessary for challenging cases such as lanthanides and actinides. In this way, they provided ready-to-use parameter sets that minimize errors and computational cost while ensuring robust convergence across large, chemically diverse datasets.
In the past year, there has been rapid progress in LLM-based multi-agent systems designed for automated computational chemistry workflows, such as El Agente,5 DREAMS (Density Functional Theory Based Research Engine for Agentic Materials Simulation),6 and VASPilot.7 These agents typically employ multiple child agents to handle automated convergence schemes, either by leveraging existing recommendations from the literature and packages such as pymatgen or ASE or by generating a series of calculations with systematically varied parameters for convergence testing. They offer clear advantages in managing complex tasks—such as input file generation, job submission, error handling, and post-processing—without human intervention. However, a persistent challenge remains in determining the optimal set of calculation parameters that ensures accurate results while minimizing computational cost.
In summary, previous studies have progressively clarified the roles of cutoffs, k-point density, and smearing in determining the accuracy and efficiency of plane-wave DFT calculations. Nevertheless, the optimal choice of these parameters for high-throughput studies remains an open problem. In this work, we continue our work in this direction.
(1) The valence electrons are represented using the Standard Solid-State Pseudopotential (SSSP) library (version 1.3, PBEsol, efficiency set), with plane-wave cutoff energies determined following the efficiency-level recommendations.
(2) The Marzari–Vanderbilt cold-smearing scheme, with a smearing width of 0.01 Ry, is applied in all calculations. The choice of the cold smearing is justified by its ability to yield a free energy that is quite insensitive to variations in the smearing temperature. The smearing temperature is set to 0.01 Ry, which ensures smearing convergence for most materials according to Nascimento's report.24 This also implies that even for insulators, smearing is applied to the electronic states around the Fermi level, rather than using fixed occupations.
(3) The 20
187 structures were selected randomly from the MC3D PBEsol-v1 dataset, and no additional structural relaxation was performed.25
(4) Non-spin-polarized calculations were performed for all materials, and no magnetic configurations were taken into consideration.
(5) The k-mesh convergence is determined as follows: when the energy difference among three consecutive k-meshes becomes smaller than 1 meV per atom (an illustrative example is provided in Section 3 of the SI), the first point of these three is identified as the converged k-mesh. In the present workflow, the convergence criterion was defined using total-energy differences only, which means that atomic forces were not included in the dataset. This procedure does not guarantee that one has the optimal set of parameters, that all pockets of the Fermi surface are resolved, or that the total energy of compounds with a band gap below 0.14 eV is resolved correctly. However, it is an easy way to generate the data.
(6) We employ the definition of the k-point distance, as implemented in the AiiDA–quantumESPRESSO package,1,26 to generate various k-point meshes. The k-distance represents the maximum spacing (in Å−1) between adjacent k-points in reciprocal space, such that the number of k-points along each reciprocal lattice vector bi is given by ⌈|bi|/kdist⌉. Starting from a k-distance of 1.0 Å−1, we systematically scan all possible k-meshes by varying the k-distance in steps of 0.005 Å−1 (an illustrative example is provided in Section 3 of the SI). Throughout this work, we adopt Γ-centred meshes as a practical and consistent workflow choice, mainly to maintain compatibility with common follow-up calculations such as electronic-structure and phonon workflows.
The generated data were processed to form the dataset. This included reducing all systems to primitive cells and then removing all structural duplicates. To do this, we used pymatgen.27 The property that we predict (k-points distance) is intensive and should not depend on the size of the number of formula units or centering of the cell. Pre-processing ensures that the model predictions are consistent with this physical constraint. The number of k-points that enter the Quantum ESPRESSO input file is easily generated for any unit cell/supercell from the k-point distance.
178 unique crystal structures, encompassing all non-radioactive and some radioactive elements (see element occurrence frequencies in the SI, SI Fig. S1). The dataset exhibits a broad distribution in the number of atoms per primitive unit cell (SI Fig. S2) and includes compounds containing up to seven distinct elements (SI Fig. S3). Among the 230 crystallographic space groups, 200 are represented. Since the shape of the first Brillouin zone is determined by the Bravais lattice type, it is noteworthy that all 14 Bravais lattices are represented in the dataset (Fig. 1), ensuring comprehensive coverage of symmetry types relevant for k-point sampling.
Fig. 2 shows the distribution of compounds with respect to the target property (converged k-point distance) used for training the machine-learning models. The vertical dashed line indicates the recommended reference value of 0.06 Å−1 reported by Bosoni et al.,11 which is consistent with our convergence dataset: in most cases, the explicitly determined converged k-point spacing is larger than 0.06 Å−1, indicating that this literature value represents a conservative lower bound for reliable reference calculations. The overall distribution exhibits two distinct peaks, corresponding to metallic systems (lower maximum k-point distance) and insulators (higher distance). Fig. 3 presents analogous distributions for subsets of compounds constrained to individual Bravais lattices. There is a clear trend that more symmetric lattices, on average, require dense k-point meshes. So, we can conclude that both metallicity and lattice symmetry significantly influence the optimal k-point density.
![]() | ||
| Fig. 2 Distribution of compounds with respect to the converged k-point distance in the dataset. The vertical dashed line represents the max k-point distance of 0.06 Å−1), recommended for the reference calculations reported by Bosoni et al.11 | ||
![]() | ||
| Fig. 3 Separate distributions with respect to the maximum distance between k-points for compounds with a fixed Bravais lattice. | ||
To illustrate the efficiency of using an optimized k-mesh relative to commonly used databases, we compare it with several widely adopted materials datasets. For example, the MC3D PBEsol-v1 dataset25 employs a default k-point distance of 0.15 Å−1, which is substantially denser (i.e., over-converged) for approximately 93.4% of the materials in our generated database. The Materials Project28 uses a default k-point density of 1000/(number of atoms in the unit cell), which likewise results in a denser-than-necessary k-sampling for about 70% of our materials. Assuming identical settings for pseudopotentials, smearing, and symmetry reduction (as in MC3D PBEsol-v1), an optimized, convergence-verified k-mesh would enable significant reductions in symmetrized k-point counts—and therefore in overall computational energy consumption, as shown in Fig. 4.
To illustrate how our optimized k-meshes compare with commonly used database defaults, we benchmark them against several widely adopted materials datasets. For example, the MC3D PBEsol-v1 dataset25 employs a default k-point distance of 0.15 Å−1, which yields denser k-sampling than our convergence-verified setting for approximately 93.4% of the materials in our generated database. The Materials Project28 uses a default k-point density of 1000/(number of atoms in the unit cell), which likewise gives denser k-sampling than our converged setting for about 70% of the materials considered here. To better quantify the extent of this difference, Fig. 4 shows the distribution of the ratio between the default database k-point number and the converged k-point number. This indicates that the default settings are often more conservative than required by our convergence protocol and, in many cases, correspond to a substantially larger k-point-related workload. However, we note that the actual runtime impact depends on additional factors beyond the k-point number alone. In practice, the realized savings depend on how the electronic-structure code distributes work across parallelisation layers, including k-point pools, band groups, as well as on the communication characteristics of the underlying HPC architecture. Nevertheless, the reduction in required k-point sampling reported here remains a useful code-level proxy for potential efficiency gains across high-throughput workflows.
To evaluate the robustness of our non-spin-polarized calculations, we selected 268 materials evenly sampled across different categories, including metallic and semiconducting systems, the 14 Bravais lattice types, and four ranges of atomic counts (0–10, 10–20, 20–40, and >40 atoms per cell) for spin-polarized tests, for which the initial magnetic moments were set according to the MC3D protocols.25 As presented in Fig. 5, the results show that 82% of these materials converge to the same k-mesh, while the remaining 18% exhibit magnetic behaviours in the spin-polarized calculations. As we are unable to capture the magnetic behaviours in non-spin-polarized calculations, it is understandable that the converged k-mesh is different. We also notice that among all samples, 94% have denser k-meshes in the non-spin-polarized calculations than in the spin-polarized calculations, while 6% have slightly lower densities, indicating that the non-spin-polarized calculations generally provide well-converged results. We also plan to extend the dataset with spin-polarized calculations and to develop a separate machine-learning workflow to identify magnetic materials and adapt the k-point recommendation accordingly.
![]() | ||
| Fig. 5 Correlation between converged k-distances in non-spin-polarized and spin-polarized calculations. | ||
As presented in Fig. 6, the SSSP Efficiency library (version 1.3) was employed in our training data generation, as it is recommended for high-throughput calculations with affordable computational cost.29 We acknowledge that, in practical applications, users may prefer the SSSP Precision library to achieve results closer to all-electron calculations. This naturally raises the question of whether our dataset, trained with the Efficiency library, remains valid for predicting k-mesh settings when using the Precision library. To examine this, we selected 215 materials and performed non-spin-polarized calculations using the Precision pseudopotentials. The results show that in 90% of the cases, the Precision library yields a k-mesh that is either denser than or identical (78%) to that obtained with the Efficiency library. In the remaining 10%, 8% of the total materials show energy differences smaller than 2 meV/atom compared to the converged results, confirming the robustness of the Efficiency-trained dataset. More generally, transferability to other pseudopotential libraries or exchange–correlation treatments, including hybrid functionals, should not be assumed without further validation. To support this, we provide the full training and data-processing pipeline in our public repository, enabling users to retrain the models for alternative settings.
![]() | ||
| Fig. 6 Correlation between converged k-distances in SSSP-efficiency and SSSP-precision calculations. | ||
In the discussions above, we identified the main limitation of our current non-spin-polarized database—its inability to correctly describe magnetic materials. However, determining whether a material is magnetic or non-magnetic is often non-trivial. To provide an initial assessment of this issue, we used the MC3D25,30 database as a reference to identify approximately 2800 potentially magnetic materials in our dataset and performed spin-polarized calculations on them with finite initial magnetic moments. The initial magnetic moments were assigned based on the presence of d or f electrons: materials containing such elements were initialized with their maximum possible magnetic moments (i.e., 5 or 7 µB for d or f shells, respectively), while all others were assigned a small default value of 0.1 µB.25 We found that only about 43% of the non-spin-polarized calculations yield the same converged k-mesh as their corresponding spin-polarized counterparts for these magnetic materials, indicating that for magnetic materials, spin polarization can significantly affect the convergence behaviour and, consequently, the accuracy of the derived properties.
Random Forest Regression31 (RF) is a machine learning method that uses an ensemble of multiple (i.e. 100) decision trees, each trained on a bootstrap sample of the data with a random subset of features. The final prediction is obtained by averaging the predictions of individual trees, which reduces variance and overfitting. We use the Random Forest Regressor implemented in the scikit-learn32 library.
Gradient Boosting Regression33 (GBR) is another ensemble-based method, which, instead of averaging the predictions of multiple weak learners, builds them sequentially, where each subsequent tree attempts to correct the errors of the previous one. We use the Gradient Boosting Regressor implemented in scikit-learn.
The Random Forest Regressor and Gradient Boosting Regressor require features in a tabular format representing each compound as a vector derived from the compound's compositions or/and structure. We use the following sets of features:
(1) C = composition features. These are essentially the Magpie34 style features computed with the matminer library.35 We include ElementProperty, Stoichiometry, and ValenceOrbital. ElementProperty features are composed of statistics [“minimum”, “maximum”, “range”, “mean”, “avg_dev”, and “mode”] of the properties of atoms in the structure. The examples of properties are number, Mendeleev number, atomic weight, electronegativity, etc. Stoichiometric features depend only on the fractions of elements; these include the number of elements in the compound and several Lp norms of the fractions. ValenceOrbital features represent the average number and/or fraction of valence electrons in specified orbitals.
(2) S = structural features. For the first part of these features, we use GlobalSymmetryFeatures and DensityFeatures from matminer. These include space group number, crystal system, centrosymmetry, number of symmetry operations, density, volume per atom, and packing fraction. For the second part of these feature vectors, we use modified Smooth Overlap of Atomic Positions (SOAP) fingerprints.36 The modification consists of: (1) substituting all atoms with an ‘X’ atom type; (2) calculating SOAP fingerprints for the structure with anonymised atoms with the DScribe library;37 (3) averaging the resulting SOAP descriptors over all atoms. This modification is required to ensure that the features have the same size for all compounds.
(3) L = lattice features. Lattice features include a, b, c, α, β, and γ parameters for the unit cell, the inverse unit cell, and the Bravais lattice.
(4) JarvisCFID features were originally suggested for the prediction of k-point length by Choudhary and Tavazza9 and are available in a matminer library.35 These are complex feature vectors with a length of 1557, which include information about composition, unit cell, core charges, angular distribution function, radial distribution function, dihedral angle distribution function, and nearest neighbours.
(5) M = metallicity features. It is expected that the required k-point density strongly depends on whether a compound is metallic or insulating. In metals, a denser k-point mesh is required to accurately resolve the Fermi surface, whereas in semiconductors and insulators, the Brillouin zone integration converges more rapidly. The reason is that metallic systems have a discontinuity in electronic occupation at the Fermi level, making the total energy highly sensitive to how finely the electronic states are sampled. If the mesh is too coarse, errors in the total energy can reach tens of meV per atom, leading to unreliable stability comparisons and inaccuracies in derived properties such as formation energies or phonon spectra. Since metallicity is not known a priori for arbitrary compounds, we train a dedicated CGCNN model on Materials Project data to classify materials as metals or non-metals. By performing inference with a trained model for a new compound and extracting learned embeddings, we get numerical vectors which encode information about metallicity for each particular compound, and can be used as additional features in k-point prediction models. So, M features are those internal embeddings from the pre-trained CGCNN model. (The ‘is_metal’ dataset was downloaded from MP in July 2025 and contained ∼180
000 unique structures. Performance of the trained CGCNN model on the test set is as follows: accuracy = 0.84, F1_score = 0.83, Matthews Correlation Coefficient = 0.69. The model checkpoint can be found in the code repository.)
CrabNet38 is a composition-based deep learning model that adapts the Transformer architecture,39 originally developed for natural language processing, to materials science applications. In CrabNet, the elements in a chemical formula serve as tokens, analogous to words in a sentence. Each element is associated with a learned embedding vector, while stoichiometric fractions are incorporated through a modified positional-encoding scheme that enables the model to represent not only which elements are present but also their relative proportions. Passing these token-fraction embeddings through stacked self-attention layers allows CrabNet to capture complex inter-element interactions, modelling both pairwise and higher-order chemical relationships.
As input features, we use mat2vec embeddings40 together with extended cgcnn descriptors, cgcnn + edc. Mat2vec embeddings were developed by Tshitoyan et al.40 via a skip-gram variation of the Word2vec method trained on 3.3 million scientific abstracts and were originally used in the CrabNet model. CGCNN features were suggested in the CGCNN paper41 and include elemental attributes such as atomic number, atomic weight, and basic electronic properties. The extended cgcnn feature set (cgcnn + edc) augments the original cgcnn elemental descriptors with information from the SSSP 1.3 pseudopotential library (efficiency, PBEsol), including recommended energy and density cutoffs and the pseudopotential type (norm-conserving, ultrasoft, etc.). These quantities are element-specific because they depend on the corresponding pseudopotential. Their inclusion is motivated not by a fundamental dependence of k-point convergence on cutoff values, but by the practical observation that, in finite-accuracy high-throughput calculations, cutoff-related quantities can provide useful complementary information.9 In a style similar to the cgcnn features, we encode the numbers for cut-off values as one-hot feature vectors of a size suitable for the range of features. Finally, to reduce redundancy in cgcnn + edc descriptors, we perform PCA on them and remove dimensions with negligible information content.
The next model we try is the CGCNN,41 or the Crystal Graph Convolutional Neural Network, which is a graph neural network designed to represent crystalline materials as atom graphs. Here, we consider two ways to build an atom graph: (1) a radius graph, which assumes that two atoms are connected by an edge if the distance between them (or their periodic images) is smaller than a cutoff radius. Such a graph usually is a multigraph; (2) a Voronoi tessellation graph, which assumes that two atoms are connected by an edge if they share the face of the Voronoi polyhedra. Once the atom graph is constructed, the CGCNN employs a gated graph convolution operation for message passing, allowing atom embeddings to be updated based on their local chemical environments. By stacking multiple convolutional layers, the model captures both short- and longer-range structural information, making it effective for predicting a wide variety of material properties directly from crystal structures.
The ALIGNN42 conceptually follows the CGCNN model and extends this framework by introducing a line graph in addition to the atom graph, thereby explicitly encoding angular information between neighbouring bonds. This enriched representation enables the model to distinguish between the graphs that would be indistinguishable using only information about the atoms' identities and distances between them.
As input features for structure-based GNN models, we employ cgcnn features, extended cgcnn features (cgcnn + edc), and mat2vec features. In the study43 it was shown that, due to the limited power of the considered GNNs in representation, the GNN models fail to learn some properties of materials, which are commonly used as physical descriptors. The authors performed a thorough analysis of the issue and advocated for injecting the descriptors that are difficult for GNN models to learn after graph convolution layers. These include mainly features related to periodicity and symmetry, such as unit cell parameters, space group, crystal system, etc. As these properties are important for the prediction of the k-point distance, we use the modification suggested in (ref. 43) for our GNN models. The resulting models are named CGCNNd and ALIGNNd, respectively. The schematic of the CGCNNd model is shown in Fig. 7. To inform GNN models about the metallic properties of the compounds, we also use our metallicity features in CGCNNd and ALIGNNd.
Prediction uncertainty has two main sources: (1) aleatoric uncertainty, which represents statistical noise in the data; (2) epistemic uncertainty, which arises due to uncertainty in the model with which we fit the data; this uncertainty arises due to the finite number of points in the training or model misspecification. Since our objective is to estimate lower quantiles of the required k-point distance to avoid overestimation, we primarily focus on aleatoric uncertainty.
To estimate aleatoric uncertainty, we employ conformalised quantile regression,44 which we found to be the most effective approach for our application. This method inherits the strongest features of the predecessor methods: quantile regression and conformal prediction.
Quantile regression learns the conditional quantiles of a dependent variable as a linear function of the explanatory variables Qα(x) = y: F(y|x) ≥ α. To perform quantile regression, one uses a special asymmetric, pinball loss function:
![]() | (1) |
This way, one α-quantile is learned. We use this loss to train the ALIGNN and CGCNN models to predict conditional quantiles.
For a Random Forest model, quantiles are estimated in another way.45 Conditional quantiles can be inferred with Quantile Regression Forests (QRF). QRF are usually trained with a standard MSE loss and is a generalisation of random forests, which retain for each leaf, the empirical distribution of training targets rather than only their mean. The empirical distribution is the weighted distribution of observed response variables, where the weights attached to observations are identical to those the original random forest algorithm. And from the estimation of conditional distribution
(y|x = X), the α-quantile Qα(x) is obtained as follows:
| Qα(x) = inf{y:F(y|X = x) ≥ α} | (2) |
We use the implementation of the Quantile Regression Forests from the sklearn-quantile package.
Conformal prediction (CP) is a distribution-free framework that provides valid prediction intervals with valid finite-sample marginal coverage under the assumption of exchangeability of the samples. Unlike parametric methods, it does not rely on strong distributional assumptions and can be applied as a wrapper around any machine learning method. Given a regression model f(x), CP constructs a prediction interval:
| C(xn+1) = [f(xn+1) − βα, f(xn+1) + βα] | (3) |
| βα = Q(1−α)(1+1/|C|)({βi} for i ∈ C) | (4) |
Conformalised Quantile Regression (CQR) preserves the strong features of quantile regression and conformal predictions and provides theoretical guarantees of distribution-free finite-sample coverage with asymmetric, heteroscedastic intervals conditioned on data points. CQR begins with a model that predicts the lower and upper conditional quantiles Qα_lo(x) and Qα_hi(x). Then, conformity scores are calculated on calibration dataset C: βi = max{Qα_lo(Xi) − Yi, Yi − Qα_hi(Xi)}. Then the empirical quantile βα is calculated for the distribution of conformity scores as in the CP method. And the final confidence intervals are determined as follows:
| C(Xn+1) = [Qα_lo(Xn+1) − βα, Qα_hi(Xn+1) + βα] | (5) |
:
10
:
10 proportion. We perform a random search over the hyperparameter space on the validation dataset for all models. Then, for the best set of hyperparameters, we perform a manual hyperparameter search in the vicinity of it. For the final hyperparameter sets, all models are trained on 3–5 different data splits determined for randomly chosen random seeds, and their performance is averaged. Confidence intervals for performance are estimated under the assumption of a Gaussian distribution of errors in the performance metrics.
We evaluate model performance using mean absolute error (MAE), mean absolute percentage error (MAPE), mean squared error (MSE), the coefficient of determination (R2), the Spearman rank correlation coefficient, and the Kendall rank correlation coefficient. The inclusion of rank-based metrics (Spearman and Kendall) is motivated by the need to assess how well the models preserve the correct pairwise ordering of k-point densities. This allows us to quantify whether the ML models introduce additional randomness into the relative ordering of the samples, even when absolute errors remain small.
Table 1 shows the performance of the ensemble models, RF and GB, trained with different feature sets. In the table, the feature sets are C = composition features (size = 146), CS = composition + structure + SOAP (size = 404), CSL = composition + structure + SOAP + lattice (size = 419), CSLM = composition + structure + SOAP + lattice + metallicity (size = 483), and JarvisCFID (size = 1557). One can see that the performance of Random Forest in general is better than that of GB on this task. The best model, Random Forest + CSLM, achieves an R2 score of 0.703, MAE = 0.067, MAPE = 0.197, Spearman correlation = 0.86. Fig. 8 shows the scatter plot for predictions made by this model.
| # | Model | Features | MAE | MAPE | MSE | R2 | Spearman | Kendall |
|---|---|---|---|---|---|---|---|---|
| 1 | RF | C | 0.075 ± 0.0004 | 0.221 ± 0.003 | 0.0109 ± 0.0002 | 0.641 ± 0.005 | 0.821 ± 0.004 | 0.629 ± 0.002 |
| 2 | RF | CS | 0.0695 ± 0.0006 | 0.206 ± 0.005 | 0.0095 ± 0.0001 | 0.686 ± 0.008 | 0.850 ± 0.004 | 0.661 ± 0.005 |
| 3 | RF | CSL | 0.067 ± 0.002 | 0.197 ± 0.006 | 0.0091 ± 0.0003 | 0.702 ± 0.009 | 0.861 ± 0.006 | 0.674 ± 0.008 |
| 4 | RF | CSLM | 0.067 ± 0.001 | 0.197 ± 0.006 | 0.0090 ± 0.0003 | 0.703 ± 0.008 | 0.861 ± 0.006 | 0.676 ± 0.006 |
| 5 | RF | JarvisCFID | 0.078 ± 0.0007 | 0.234 ± 0.005 | 0.0113 ± 0.0002 | 0.628 ± 0.003 | 0.815 ± 0.004 | 0.618 ± 0.004 |
| 6 | GB | C | 0.080 ± 0.002 | 0.240 ± 0.005 | 0.0118 ± 0.0004 | 0.616 ± 0.007 | 0.804 ± 0.004 | 0.605 ± 0.004 |
| 7 | GB | CS | 0.0741 ± 0.0008 | 0.219 ± 0.004 | 0.0102 ± 0.0003 | 0.659 ± 0.009 | 0.833 ± 0.005 | 0.636 ± 0.005 |
| 8 | GB | CSL | 0.0731 ± 0.0008 | 0.214 ± 0.002 | 0.0100 ± 0.0003 | 0.668 ± 0.008 | 0.839 ± 0.005 | 0.642 ± 0.006 |
| 9 | GB | CSLM | 0.074 ± 0.002 | 0.218 ± 0.005 | 0.0102 ± 0.0005 | 0.660 ± 0.010 | 0.833 ± 0.008 | 0.640 ± 0.008 |
| 10 | GB | JarvisCFID | 0.0803 ± 0.0008 | 0.236 ± 0.005 | 0.0119 ± 0.0004 | 0.610 ± 0.010 | 0.807 ± 0.008 | 0.606 ± 0.006 |
Comparing the performance of the ensemble models with different feature sets, we can see that each subset of features contributes to the final result. To quantify the contribution of each block of features to the performance of the final model, we make predictions on modified representations of the samples from the test set. We shuffle features in each block of features and see how much the performance of the prediction drops. The results are shown in Table 2. It can be seen that each set of features contributes to the final prediction, with composition features having the largest contribution, followed by metallicity features and lattice features. From the viewpoint of general physics/chemistry knowledge, this rating seems to be reasonable and supports the considerations we used when constructing the list of features.
| # | Shuffled block | MAE | MAPE | MSE | R2 | ΔMAE | ΔR2 |
|---|---|---|---|---|---|---|---|
| 1 | — | 0.067 | 0.197 | 0.009 | 0.703 | 0 | 0 |
| 2 | Composition | 0.102 | 0.347 | 0.017 | 0.455 | −0.035 | −0.248 |
| 3 | Structure | 0.068 | 0.192 | 0.009 | 0.693 | −0.001 | −0.01 |
| 4 | SOAP | 0.074 | 0.211 | 0.011 | 0.644 | −0.007 | −0.059 |
| 5 | Lattice | 0.074 | 0.204 | 0.011 | 0.626 | −0.007 | −0.077 |
| 6 | Metallicity | 0.078 | 0.248 | 0.011 | 0.612 | −0.011 | −0.091 |
Next, we turn to the training of the CrabNet model, which is based on transformer architecture and takes composition information as the only input. CrabNet was trained with mat2vec and extended cgcnn features (cgcnn + edc). Notably, the performance of this composition-only model is significantly worse than that for models with some encoding of structural information. For k-point prediction models, one would expect this kind of behaviour for models that capture the essence of the task (Table 3).
| # | Model | Features | MAE | MAPE | MSE | R2 | Spearman | Kendall |
|---|---|---|---|---|---|---|---|---|
| 1 | CrabNet | mat2vec | 0.084 ± 0.001 | 0.248 ± 0.007 | 0.0134 ± 0.0003 | 0.56 ± 0.01 | 0.772 ± 0.005 | 0.575 ± 0.005 |
| 2 | CrabNet | cgcnn + edc | 0.084 ± 0.002 | 0.245 ± 0.004 | 0.0138 ± 0.0006 | 0.55 ± 0.02 | 0.77 ± 0.01 | 0.58 ± 0.01 |
Table 4 shows the performance of the GNN models. The CGCNN model was trained with cgcnn features and two ways to construct an atomic graph, standard radius graph (rg + cgcnn) and crystalline graph, which has edges only between atoms sharing Voronoi surfaces (crg + cgcnn). The standard radius graph performs better. Because of this we used the standard radius graph in all other cases. Next, we try different atomic input features for the CGCNN, the list includes cgcnn (rg + cgcnn), mat2vec (rg + mat2vec), and cgcnn + edc (rg + cgcnn + edc). From all these combinations the standard cgcnn features give the best performance. So, for this combination we try to inject additional features after the GNN layers. The schematic of the extended CGCNN model is shown in Fig. 7. The performance of the extended CGCNNd model is slightly improved compared to the vanilla model, both in the case of CSL combination of features (CSL = composition + structure (without SOAP, just matminer) + lattice) and metallicity features (m). The ALIGNN model was trained with the radius graph only (for construction of both atomic and linear graphs) with cgcnn and cgcnn + edc features, with the latter showing slightly better performance. So, we choose this combination for the extended ALIGNNd model taking in additionally CSL or metallicity features. The performance of the ALIGNNd + cgcnn + edc + CSL model is comparable to that of our best Random Forest model (Fig. 9).
| # | Model | Features | MAE | MAPE | MSE | R2 | Spearman | Kendall |
|---|---|---|---|---|---|---|---|---|
| 1 | CGCNN | rg + cgcnn | 0.074 ± 0.002 | 0.207 ± 0.004 | 0.0112 ± 0.0006 | 0.630 ± 0.010 | 0.827 ± 0.006 | 0.634 ± 0.008 |
| 2 | CGCNN | crg + cgcnn | 0.077 ± 0.001 | 0.223 ± 0.006 | 0.0115 ± 0.0003 | 0.630 ± 0.010 | 0.819 ± 0.006 | 0.626 ± 0.005 |
| 3 | CGCNN | rg + mat2vec | 0.0748 ± 0.0008 | 0.208 ± 0.005 | 0.0116 ± 0.0002 | 0.620 ± 0.005 | 0.822 ± 0.003 | 0.629 ± 0.003 |
| 4 | CGCNN | rg + cgcnn + edc | 0.075 ± 0.002 | 0.215 ± 0.004 | 0.012 ± 0.001 | 0.630 ± 0.020 | 0.819 ± 0.007 | 0.629 ± 0.008 |
| 5 | ALIGNN | rg + cgcnn | 0.0700 ± 0.0005 | 0.200 ± 0.001 | 0.0105 ± 0.0002 | 0.644 ± 0.008 | 0.837 ± 0.003 | 0.650 ± 0.003 |
| 6 | ALIGNN | rg + cgcnn + edc | 0.0717 ± 0.0007 | 0.194 ± 0.003 | 0.0106 ± 0.0002 | 0.654 ± 0.008 | 0.845 ± 0.004 | 0.656 ± 0.004 |
| 7 | CGCNNd | rg + cgcnn + CSL | 0.0737 ± 0.0007 | 0.206 ± 0.005 | 0.0111 ± 0.0004 | 0.647 ± 0.005 | 0.841 ± 0.001 | 0.649 ± 0.001 |
| 8 | CGCNNd | rg + cgcnn + M | 0.0740 ± 0.0002 | 0.209 ± 0.001 | 0.0117 ± 0.0008 | 0.630 ± 0.020 | 0.839 ± 0.004 | 0.648 ± 0.003 |
| 9 | ALIGNNd | rg + cgcnn + edc + CSL | 0.069 ± 0.002 | 0.186 ± 0.004 | 0.0099 ± 0.0003 | 0.680 ± 0.010 | 0.856 ± 0.005 | 0.669 ± 0.008 |
| 10 | ALIGNNd | rg + cgcnn + edc + M | 0.070 ± 0.001 | 0.190 ± 0.004 | 0.0104 ± 0.0002 | 0.662 ± 0.008 | 0.849 ± 0.003 | 0.661 ± 0.004 |
Across all models tested, the ensemble methods (Random Forest and Gradient Boosting) outperform composition-only neural models (CrabNet) while performing comparably to advanced graph neural networks (ALIGNN). The best-performing model overall is RF + CSLM, which reaches an R2 of 0.7, slightly surpassing the ALIGNNd + cgcnn + edc + CSL model.
An example of the final decision tree is shown in Fig. 10. One can see that the root is melting temperature. The melting temperature is not a true melting temperature, but a feature that is calculated as a weighted mean of the melting temperatures of atomic compounds composed of atoms in the composition formula, and the weights are calculated from atomic fractions in the composition. And the splitting value is 203 K. It is an extremely low melting temperature; the elements with melting temperatures below this threshold are H, N, O, F, Cl, Ne, Ar, Kr, Xe, and Rn. So, the compounds with melting temperature below 203 are those rich in these elements, and surprisingly, approximately 50% of such compounds in the dataset fall in this category. Splitting at the melting temperature is a very stable feature appearing in all decision trees that we built for different splits of data.
![]() | ||
| Fig. 10 Surrogate model for RF-CSLM predictions. Yellow rectangles are leaves with final values of the k-point distance. | ||
Decision subtrees for compounds with melting temperatures below 203 K, and above this value, are very different. Following the ‘yes’ branch, we arrive at the condition ‘space group number <38’. This is also a very stable feature that persists for all data splits. For a low-symmetry compound, the next feature to split upon is the mean atomic number. For high symmetry compounds, the split on the third level of the tree is on the first PCA component (most informative) of the metallicity vector. The Metal0 feature carries information about the metallicity of the compound. However, the correspondence of the metallicity property of the compound is not fully described by this feature alone, and the meaning of the numerical value can't be interpreted easily. The final split in the decision tree is based on mean atomic number, mean atomic row, and metal0.
Following the ‘no’ branch, we arrive and the condition metal0 ≤−1.58. The next level of the decision tree uses “mean covalent radius” or “min covalent radius” depending on the metallicity property, showing the importance of the covalent radius in decision making about the k-point distance. The final level of decision-making is composed of structural and electronic features.
Analysis of decision making by the RF-CSLM model shows that the k-point distance is determined by diverse behaviour, and apart from composition features, global symmetry descriptors, metallicity descriptors, and electronic configuration are important too. This conclusion additionally explains why CrabNet is so inefficient for this task, and why the GNN models perform worse than RF + CSLM.
The results of CQR for ALIGNNd + cgcnn + edc + CSLM are shown in Fig. 11–13 for CQR coverages of 0.9, 0.8, and 0.7. Light blue points and confidence intervals show the samples for which the true value is inside the confidence interval; dark blue points and confidence intervals show the samples for which the true value lies outside of the confidence interval. It can be seen that the CQR empirical coverage corresponds well to the target coverage. The mean size of the confidence intervals decreases with a reduction in coverage, as expected.
Table 5 shows the results of conformalised quantile regression for the best models. In all cases, the empirical coverage is consistent with the target coverage, and the confidence intervals have a reasonable size. This means that our models guarantee that a target fraction of compounds is calculated accurately, conditioned on training set distribution. ALIGNNd + cgcnn + edc + CSLM provides slightly better coverage, narrower confidence intervals, and correspondingly smaller errors than RF + CSLM. All ML models at all confidence levels provide smaller overall errors than the baseline model. So, using ML models for the prediction of k-point distances is practically beneficial.
| # | Model | Features | Target coverage | Target quantile | CQR interval coverage | CQR mean width | Empirical coverage | MAE | k-distance constant baseline | MAE baseline |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | RF | CSLM | 0.95 | 0.90 | 0.895 | 0.313 | 0.958 | 0.152 | 0.158 | 0.237 |
| 2 | RF | CSLM | 0.90 | 0.80 | 0.820 | 0.268 | 0.917 | 0.122 | 0.184 | 0.215 |
| 3 | RF | CSLM | 0.85 | 0.70 | 0.720 | 0.180 | 0.849 | 0.104 | 0.205 | 0.200 |
| 4 | ALIGNNd | rg + cgcnn + edc + CSL | 0.95 | 0.90 | 0.904 | 0.254 | 0.966 | 0.143 | 0.158 | 0.238 |
| 5 | ALIGNNd | rg + cgcnn + edc + CSL | 0.90 | 0.80 | 0.790 | 0.217 | 0.908 | 0.114 | 0.184 | 0.215 |
| 6 | ALIGNNd | rg + cgcnn + edc + CSL | 0.85 | 0.70 | 0.703 | 0.172 | 0.840 | 0.095 | 0.205 | 0.200 |
Fig. 14 and 15 show the distribution of errors for RF + CSLM (5th percentile), ALIGNN + cgcnn + edc + CSLM (10th percentile), and the baseline model for different quantiles. Dark blue colour shows the samples for which the predicted k-distance is larger than the true one. These figures show that distribution of errors for ML models is narrower than that for the baseline model, which results in smaller mean absolute errors.
Translation of the gain in k-distance prediction error into gains in computation time in general will depend on a lot of details, such as parallelisation strategy, hardware setup, materials calculated, etc. However, to demonstrate that the gain is present, we calculate the gain in terms of wall time using data that we obtained in the process of generating our dataset. The results are shown in Table 6. For our dataset, the mean wall time per sample is 352 s, which is calculated using the wall time corresponding to true converged k-distances. Both ML models adjusted to predict a quantile suitable for the required confidence, and the baseline model uses different k-distances, resulting in spending additional time on performing calculations. We calculate the excessive wall time as the difference between the wall time corresponding to k-distance used by ML-model/baseline model and the mean optimal wall time in cases when the predicted k-distance is smaller than or equal to the true one (converged calculation), and in the opposite case all computation time is wasted, so all the time predicted by the model is added to the excessive time:
![]() | (6) |
| # | Model | Features | Mean wall time (optimal), s | Confidence level | Excessive wall time (pred), s | Excessive wall time (dummy), s | Gain per sample, s | Gain, % |
|---|---|---|---|---|---|---|---|---|
| 1 | RF | CSLM | 352 | 0.95 | 388 | 497 | 109 | 22 |
| 2 | RF | CSLM | 352 | 0.90 | 277 | 318 | 41 | 13 |
| 3 | RF | CSLM | 352 | 0.85 | 240 | 223 | 17 | 7 |
| 4 | ALIGNNd | rg + cgcnn + edc + CSL | 352 | 0.95 | 368 | 468 | 100 | 21 |
| 5 | ALIGNNd | rg + cgcnn + edc + CSL | 352 | 0.90 | 247 | 311 | 64 | 21 |
| 6 | ALIGNNd | rg + cgcnn + edc + CSL | 352 | 0.85 | 217 | 217 | 0 | 0 |
Interestingly, the gain in computation time for ML prediction compared to the baseline model depends on the quantile. For higher quantiles and more accurate calculations, the gain is more noticeable. With a decrease in the quantile, the number of the mispredicted samples grows, and the gain in time for ML predictions compared to the baseline model disappears.
The input page of the application asks the user to choose the variant of SSSP1.3 PBEsol to use, efficiency or precision, the ML model, and the confidence level. Our models are trained on data generated with SSSP1.3 PBEsol efficiency, but as shown above, for a large fraction of compounds, the k-point convergence values remain the same for SSSP1.3 PBEsol precision.
Next, the user is prompted to upload the structure CIF file or query the structure from one of the databases. Then, the structure may be subjected to transformations: reduced to a primitive one, or multiplied to create a supercell. When all parameters are specified, the user is prompted to generate the input file. Input file generation can be made with the ASE deterministic function or the LLM (the choice is available). The file generator makes prediction of the k-point distance using a primitive cell of the compound (as all structures in the training dataset were pre-processed into primitive ones), makes prediction of the k-point density, and then calculates the k-mesh from the inverse unit cell and the predicted k-point distance.
However, several limitations remain. First, magnetic materials remain challenging: spin-polarized calculations often yield different k-mesh requirements, and our current dataset includes only non-spin-polarized data. Second, the dataset is tied to SSSP-1.3 PBEsol pseudopotentials; although we show that trends transfer to the Precision library, other pseudopotential families may require retraining. Third, while our models capture structural and compositional factors well, more expressive representations—such as equivariant GNNs or learned Brillouin-zone features—may further improve accuracy. Next, we did not try to pre-train GNN/CrabNet models on the metallicity dataset. Transfer learning could work better than feature injection. Finally, we could use a completely different approach for building an ML model: we could predict the total energy conditioned on the structure and other DFT parameters. Then the speed up of solving the convergence problem would be realized through using an ML surrogate to generate the convergence curve instead of predicting the converged k-distance. If successful, this approach could be applicable more generally. However, it is not immediately clear how to determine convergence in this case, as ML predictions of the total energy are expected to introduce errors, which could obscure the convergence point. Maybe a combination of both approaches could be used.
Despite these limitations, the current approach is successful in reaching the objective of improving over the constant baseline, which is currently a common choice. The integration with MCP servers also opens a path toward fully autonomous computational agents capable of end-to-end materials simulation.
000 convergence-tested materials, we benchmarked a broad set of models and demonstrated that both ensemble methods and graph neural networks achieve high predictive accuracy. Combining these models with conformalised quantile regression enables statistically guaranteed, underestimation-averse predictions.
Relative to fixed-density workflows, the model reduces unnecessary k-mesh sampling and thereby lowers computational cost while maintaining accuracy. The provided web platform makes this tool available. We also plan to release an MCP server in the near future to make our tool available in any MCP-compatible system.
Overall, this work establishes a practical and scalable route to data-driven parameter optimisation in DFT, opening doors to more efficient, greener, and more robust high-throughput materials simulations.
The archived version of the web application and associated source code used in this work is available from Zenodo at https://doi.org/10.5281/zenodo.20324660.
The archived version of the code used for training and evaluating the machine-learning models is available from Zenodo at https://doi.org/10.5281/zenodo.20324627.
The deployed web application is available at https://goldilocks.streamlit.app/.
Supplementary information (SI): additional details on the Goldilocks dataset, the QE input-generation application, k-distance and k-mesh generation, the convergence-definition example, the comparison with the JARVIS dataset, the 1/k-distance model test, and the metal/insulator k-distance distributions. See DOI: https://doi.org/10.1039/d5dd00565e.
| This journal is © The Royal Society of Chemistry 2026 |