Open Access Article
Sebastiaan P. Huber†
a,
Michail Minotakis†
b,
Marnik Bercx†ab,
Timo Reentsb,
Kristjan Eimre
ab,
Nataliya Paulish
b,
Nicolas Hörmannc,
Martin Uhrin
d,
Nicola Marzariabe and
Giovanni Pizzi
*ab
aTheory and Simulation of Materials (THEOS), National Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
bPSI Center for Scientific Computing, Theory, and Data, 5232 Villigen PSI, Switzerland. E-mail: giovanni.pizzi@psi.ch
cFritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin, Germany
dUniversité Grenoble Alpes, MIAI Cluster IA, SIMaP, 38000 Grenoble, France
eBremen Center for Computational Materials Science, MAPEX Center for Materials and Processes, University of Bremen, 28359 Bremen, Germany
First published on 18th February 2026
Density-functional theory (DFT) is a widely used method to compute properties of materials, which are often collected in databases and serve as valuable starting points for further studies. In this article, we present the Materials Cloud Three-Dimensional Structure Database (MC3D), an online database of computed three-dimensional (3D) inorganic crystal structures. Close to a million experimentally reported structures were imported from the COD, ICSD and MPDS databases; these were parsed and filtered to yield a collection of 72
589 unique and stoichiometric structures, of which 95% are, to date, classified as experimentally known. The geometries of structures with up to 64 atoms were then optimized using DFT with automated workflows and curated input protocols. The procedure was repeated for different functionals and computational protocols, generating three methodology-based MC3D subdatabases: PBE-v1, PBEsol-v1, and PBEsol-v2, with the latest containing 32
013 unique structures. All subdatabases of the MC3D are made available on the Materials Cloud portal, which provides a graphical interface to explore and download the data. The database includes the full provenance graph of all the calculations driven by the automated workflows, thus establishing full reproducibility of the results and more-than-FAIR procedures.
Nevertheless, computational databases might not always use a consistent setup to compute the properties of all materials. This, in turn, leads to potential small inconsistencies in the data across the periodic table, making more challenging to obtain accurate machine-learning models to predict properties of materials not present in the database, as already noted in ref. 52 and recently demonstrated by the accuracy of the PET-MAD model.35,53
In this article, we first present a set of automated workflows that import crystal structures from the Crystallographic Open Database (COD),54 the Inorganic Crystal Structure Database (ICSD),55 and the Materials Platform for Data Science (MPDS).56 Using these workflows, we import all entries (close to a million structures, mainly reported from experiments) and reduce them to a set of 72
589 unique bulk stoichiometric inorganic crystal structures, after several filtering steps that we describe in detail below. For the subset of structures not including lanthanides or actinides, and including at least all systems with up to 64 atoms, we compute the electronic ground state of their optimized geometries via DFT using the open-source Quantum ESPRESSO (QE)57,58 code, powered by the SIRIUS library.59 The computational inputs for the DFT simulations executed by the workflows are determined by fully automated protocols, requiring only the input structure as mandatory input, therefore enabling automated high-throughput execution of the workflows. The protocols are tuned to select input parameters that optimize the balance between precision and computational cost. Since the workflows are implemented in AiiDA,8,9,60 the entire provenance of each optimized geometry, including all raw simulation inputs and outputs, is automatically preserved in the database.
The calculations use both PBE61 and PBEsol,62 and different protocols for the input parameters. We refer to the collection of resulting databases of optimized geometries as the Materials Cloud Three-Dimensional Structure Database (MC3D). In particular, when considering the PBEsol functional and the most recent protocol
, we provide 32
013 unique structures computed and relaxed with DFT.
We make the MC3D database fully and publicly available through the Materials Cloud Archive.63 Moreover, for easier accessibility, we provide a dedicated web application in the Materials Cloud64 platform (https://www.materialscloud.org/mc3d), through which all data can be inspected interactively, as well as an OPTIMADE-compliant endpoint (implemented using
65) to access the data via a standardized API. In the following, we describe the data import and processing pipeline, the computational workflows, the content of the MC3D, as well as its web interface.
210 crystal structures from three of the largest experimental inorganic crystal structure databases: the COD54 (revision 213
553), the ICSD55 (version 2017.2), and the MPDS56 (version 1.0.2).
Fig. 1(a) shows a graphical representation of the pipeline that reduced this initial collection to 72
589 unique stoichiometric crystal structures. In the first step of the pipeline, the input structures—obtained in the Crystallographic Information File (CIF) format66—were parsed and validated. A number of CIF files had to be discarded due to invalid syntax or inconsistent information. From the 723
165 valid CIFs, the crystal structure was successfully parsed and the crystal lattice was normalized reducing it to the primitive cell (see the Methods for details on the validation and parsing).
The next step in the pipeline filtered out 244
962 structures that are not stoichiometric, i.e., those that contain partial occupancies. These structures are beyond the scope of the current study, due to the requirement of considering appropriately chosen large supercells to accommodate partial occupancies. From the remaining 478
203 crystal structures, 197
105 were found to be duplicates (details on how uniqueness is determined are given in the Methods section). The last filtering step in the pipeline removed 208
509 structures that include hydrogen and are available only in the COD database. As detailed in the Methods section, we apply this filter to effectively exclude molecular crystals, included in COD but not in ICSD nor in MPDS, since our study focuses on inorganic compounds.
The final collection of starting crystal structures, labeled MC3D-source, consists of 72
589 unique three-dimensional crystal structures. Of these, 3305 structures are explicitly reported to have a theoretical origin by the source database. Therefore, we identify no more than 69
284 experimentally known stoichiometric inorganics (see Methods for more details), a much smaller number than typically quoted. This first result already highlights the relatively small size of the space of experimentally known inorganic stoichiometric crystal structures. This might point to the actual size of this space, as well as the complexity of synthesizing and characterizing complex systems, especially those including several chemical elements (ternaries, quaternaries and beyond), as already highlighted in ref. 67.
Fig. 1(b) shows the Venn diagram of the distribution of structures in the MC3D-source from the three source databases. Out of the total 72
589 structures, 20
078 structures are present in all three databases, and 46
542 structures are present in at least two of the three databases. The COD, ICSD and MPDS each contain 2 373, 9264 and 14
410 structures that are unique to them, respectively (or 210
882 for COD, if we consider also structures with hydrogen, excluded from MC3D-source in our last filtering step as discussed above).
A subset of the 72
589 unique crystal structures of the MC3D-source was then passed through our automated workflow to optimize their geometry using DFT. First, structures containing lanthanides or actinides were not considered. This is because, on one hand, a physically accurate description of their electronic structure (in particular of the f electrons) would require a more sophisticated treatment than standard DFT. On the other hand, the pseudopotentials currently available for these elements, when available, have limited precision (see, e.g., Fig. S9.7 in the Supplementary Information of ref. 68) and also often lead to numerical instabilities and a significant failure rate. The subset of the MC3D-source database that excludes lanthanides and actinides contains 46
964 structures. Furthermore, in order to better exploit the available computational budget, we prioritized processing structures with smaller number of atoms in the unit cell. While also some larger structures have already been computed, in this paper we focus the analysis on the structures with up to 64 atoms, all of which were submitted to the ground-state atomic and geometric relaxation workflow.
We considered two different DFT functionals (PBE and PBEsol) and, in the PBEsol case, we run the workflow with two different versions of the input parameter protocols, the second one (v2) being refined and whose numerical precision was verified via extensive tests.69 The resulting optimized geometries form three subdatabases of MC3D, identified by the methodology labels
,
, and
. An overview of the differences in physical methods, input parameters, and pseudopotential libraries for these methodology labels is provided in the Methods section. The rest of this article focuses on the results of the most accurate and precise subdatabase,
, but the results of all three subdatabases are publicly available.
Out of the 38
739 structures that have been processed, the relaxation workflow successfully completed 33
142 structures, corresponding to a success rate of 85.5%. In the remaining cases, the workflow encountered errors that could not be solved by the automatic error handling mechanisms implemented (see the section on Automated error handling in the Methods for more details). These results are represented visually in Fig. 2, where the workflow results are plotted as a histogram as a function of (a) the number of elemental species and (b) the number of atomic sites in the structure.
Our complete relaxation workflow to compute the optimized geometry of a crystal structure and its ground-state electronic charge density comprises multiple steps executing the
code of the SIRIUS-accelerated59 QE package. However, the
code can encounter a variety of errors, ranging from issues with the compute hardware on which the code is run itself (e.g., insufficient job resources or node failures) to errors during code execution (e.g., numerical instabilities or convergence issues). These errors need to be handled as much as possible automatically by the workflow in order to be robust and scalable. Fig. 3 shows a visual representation of how often a workflow needed to restart the calculation to achieve successful completion, and of the most commonly observed errors. We first highlight that over 67% of the workflows completed successfully after the first execution of
(green bar in Fig. 3(a)) without triggering any error handling. We stress that this success rate could be achieved also thanks to the appropriate choices of input parameters defined by our protocols, that prevent certain failure modes, and by the use of advanced direct minimization algorithms for electronic convergence beyond the standard iterative self-consistent field (SCF) approach, such as direct-minimization strategies70,71 as implemented in the
72 plugin of SIRIUS library.59 The vast majority of errors come from the failure to converge to the required precision either the geometry optimization or the self-consistent field calculation (within the maximum allowed number of iterations). Notably, the error handling mechanism implemented in our workflows (see the Methods section for more details) manages to recover from these errors in over half of the cases in which a single
execution fails (orange bars in Fig. 3(a), with one or more workflow restarts). We emphasize that the error handler can decide to change the input of the calculation before resubmission, depending on the type of error reported in the code output. However, such changes do not include physical parameters of the calculation that would affect the results, but are limited only to those numerical aspects that can increase the chance of convergence (e.g., reducing the mixing parameters of the self-consistent cycle or changing diagonalization algorithm). Nevertheless, despite our error-handling mechanisms, for a number of crystal structures (14.5% of the total, corresponding to the rightmost red segment in Fig. 3) the workflow failed to complete after exhausting the configured maximum number of retries. Work on improving the error handling and robustness of the workflows is ongoing, and we expect to further increase the success rate in future versions of the MC3D. We also stress that, because of the nature of the errors, achieving a significant increase in convergence rate cannot be obtained solely by improving workflow error-handling mechanisms, but also requires enhancements to the underlying algorithms in the
code, as already addressed in this work by using the advanced convergence algorithms implemented in SIRIUS59 discussed above.
To make the MC3D easily accessible and explorable, we make it available as a Discover section (https://mc3d.materialscloud.org/) on the Materials Cloud64 portal at https://www.materialscloud.org/mc3d. Fig. 4 shows screenshots of the web application. The landing page allows the user to select a subdatabase of the MC3D and filter structures based on their composition via a periodic table. The matched structures are displayed in an index table that provides a link to a detail page together with several structural and material properties. The user can interactively edit the list of visible columns, adjust the row sorting, and apply column-based filtering. The detail pages give a detailed overview of the structural properties of the crystal as well as any other computed property. In particular, we provide powder X-ray diffraction (XRD) patterns computed using
73 based on Bragg's law and atomic scattering factors. Reflections can be plotted using Gaussian or Lorentzian peaks with an area corresponding to peak intensity, and the full width at half maximum (FWHM) parameter can be set by the user. Notably, the full data and calculation provenance is directly accessible from the web interface. Computed properties are decorated with an AiiDA icon linking to the source calculation in the Materials Cloud Explore section (https://www.materialscloud.org/explore/mc3d-pbesol-v2/). There, users can browse the full provenance graph of MC3D, extract raw inputs and outputs, and reconstruct how the optimized geometries were obtained. A more detailed description of the web platform is provided in the Methods section.
that originate from structures tagged as theoretical (see details in the Methods section). This information can be accessed by the corresponding column in the Materials Cloud web interface. We stress that, however, we need to rely on the information provided by the source databases which might not always be accurate or complete. Therefore, users should be aware that, in some rare cases, some of the structures in the MC3D might not be experimentally known, even if they are not flagged as theoretical.
For the 33
142 successfully completed workflows, the optimized geometries were validated by comparing the difference in volume between the initial and optimized crystal structures, as large volume changes could be an indication of problems with either the initial or optimized structure. Fig. 5 shows a histogram of the relative volume change in percent between the initial and optimized geometry for the PBE and PBEsol functionals. The majority of optimized geometries have a change in volume within ±5% (78.1% and 62.0% for PBEsol and PBE, respectively), highlighting that the optimized geometry (especially when using the PBEsol functional) is not too far away from the experimentally determined source crystal structure.
The changes in volume between the source and optimized geometry are due to various factors. First of all, the source databases include thousands of layered or lower-dimensional structures.74,75 However, since in this work we are not including van der Waals (vdW) corrections in the DFT calculations, the optimized geometries of these layered structures will typically have a potentially much larger cell volume than the source structure.74 To further investigate this aspect, we applied the
code,76 already used to generate the MC2D two-dimensional database,74,75,77 to determine the dimensionality of disconnected substructures in each crystal structure in the
dataset. We define the dimensionality of a material as the largest dimensionality that is found among any of the substructures. Using this classification, we indeed find that the 90% confidence interval of the relative volume changes is (−8.7%, 7.4%), (−10.2%, 21.0%), (−8.6%, 16.3%) and (−7.2%, 26.7%) for 3D, 2D, 1D and 0D structures, respectively, reflecting our expectation that the non-3D structures exhibit larger volume changes when relaxed without vdW corrections, especially for volume expansion.
Moreover, some source structures may have been characterized at high pressure and/or high temperature conditions (whereas our computational methods are limited to a temperature of 0 K and we considered a target zero pressure), which can also highly influence cell volume.
To account for these factors, we extracted the metadata from the source databases regarding the experimental conditions of the source structures, i.e., the temperature and pressure. Unfortunately, the conditions at which structures have been characterized are only very rarely accurately reported in the CIF source, making it difficult to reliably classify the structures. For those where such data is available, we classify them as high pressure (high temperature) if the pressure (temperature) is reported and above 5000 atm (100 °C). In Fig. 6(a) we show a histogram of the relative volume change for structures classified as above, only including those structures that report explicitly both pressure and temperature, highlighting that the vast majority of the outliers are structures that are either high pressure or high temperature.
However, this plot includes very few structures, and in particular only very few (21) explicitly report “ambient” conditions (according to our definition of both pressure and temperature below 5000 atm and 100 °C, respectively). We therefore also plot in Fig. 6(b) the distribution of implicitly defined ambient structures, i.e., those that only report one of the two conditions (pressure or temperature), and that condition is below our thresholds (i.e., we assume that the other condition is also ambient). Although the majority of the most extreme outliers are still high pressure or high temperature, there are more ambient structures whose volume changes significantly. To better understand these outliers, we investigated the source papers of the 7 explicit ambient outliers outside the [−5%, 5%] range, as well as the 43 implicit ambient outliers outside the [−50%, 50%] range. Among the explicit ambient structures, we found that 6 were actually extracted from a high pressure/temperature study (but this was not correctly reported in the source databases), and one was a layered structure. Among the implicit ambient structures, also in this case about half were part of high pressure/temperature studies, layered structures, or molecular crystals. For other structures, the atomic occupations were most likely not fully stoichiometric, even if reported as such in the CIF file. Therefore, although we cannot explain all ambient outliers, it is clear that many of them are related to a possible incorrect reporting of the conditions or of the crystal structure, thus highlighting the importance of careful curation of the source databases.
The 33
142 successfully optimized geometries were analyzed for uniqueness, using the same method used to determine the unique structures of all source structures imported from the COD, ICSD and MPDS. The analysis found that, after the geometry optimization, another 1129 structures had become duplicates and the final set contains 32
013 unique optimized geometries.
Finally, we compared this set of unique structures with two other computational materials databases: the Materials Project23 and the OQMD.25 These databases mostly use the ICSD as the source of input crystal structures, with the Materials Project recently also including in part structures from the MPDS, but none have considered structures from the COD yet. As already mentioned, another key difference with this work is that instead of QE as the main DFT code, the Materials Project and the OQMD typically use VASP.78,79 Our comparison shows that MC3D contains 3328 novel structures. Table 1 gives a detailed overview of the number of structures that have a space group or composition that does not occur in the Materials Project and/or in the OQMD (see the Methods section for more details on how structures are matched across the different databases).
's
compared to the reference databases OQMD and Materials Project (and their union). We distinguish between three categories of new structures: Subtype A: number of structures in the MC3D with a composition that is not found in the reference. Subtype B: number of structures in the MC3D with an existing composition but a different space group with respect to the reference. Subtype C: number of structures in the MC3D with existing composition and space group combination with respect to the reference, but non-duplicate according to the
In summary, we have presented MC3D, a database of relaxed geometries of three-dimensional stoichiometric inorganic structures computed using DFT. The database is built starting from almost a million experimentally known crystal structures imported from the COD, ICSD and MPDS databases. The structures were parsed and filtered to yield a collection of 72
589 unique stoichiometric inorganic crystal structures, which we refer to as the MC3D-source. We note that at most 69
284 of these are experimentally known, highlighting the small size of the space of experimentally known inorganic stoichiometric crystal structures. The geometries of a large fraction of these structures were then optimized with different functionals (PBE and PBEsol) and numerical protocols using a SIRIUS-enabled version of QE. The most accurate and precise subdatabase,
, contains 32
013 unique structures. These optimizations were performed using fully automated workflows implemented in AiiDA, automatically handling errors that occurred during the calculations, and preserving the full provenance of all data produced. We demonstrate the workflow robustness by showing that they successfully complete for over 85% of the structures, significantly improving upon the success rate without any automated error handling. The resulting database of optimized geometries as well as their underlying provenance is made available on the Materials Cloud Archive63 and can be browsed interactively on the Materials Cloud64 portal at https://www.materialscloud.org/mc3d, including interactive access to the full provenance graph of the calculations. The key distinguishing features of MC3D include: (1) the inclusion of structures from several source databases, thus extending the coverage of materials; (2) its focus on experimentally known compounds, so that top-performing materials identified from a screening study of MC3D are easily available; (3) the use of a consistent set of computational methods and input protocols to compute optimized geometries, providing a solid foundation for the creation of training sets for machine-learning applications; (4) the use of open-source software in the whole pipeline and in particular of QE (and SIRIUS) as the DFT engine, at variance with many other existing databases,22,23,25 thus providing the possibility to cross-validate results with different DFT implementations; and (5) the preservation of the full provenance of all data produced by the workflows, improving the reliability of derivative work based on MC3D and allowing other researchers to easily reproduce the results. MC3D thus constitutes a valuable resource for the materials science community complementing existing databases, and we expect it to be useful for applications including machine learning, materials discovery, and computational materials research.
class, to implement functionality that allows importing structure from an external database. An implementation for the COD and ICSD already existed in the
80 package, and the implementation for the MPDS was contributed for this work. A command line interface (CLI) was developed to download crystal structures through these importers and store them in an AiiDA8 database with associated metadata, such as the Uniform Resource Locator (URL) and unique identifier used by the source database. The code is made available through the
81 Python package v3.1.0.
The MPDS reports all of their structures as experimentally observed, and the experimental temperature and pressure condition data was retrieved from the
field of the API responses. For the COD, the OPTIMADE compliant API was used, and structures were flagged as theoretical if
field equaled
. Pressure and temperature data was retrieved from the fields
,
,
, and
. In the case of the ICSD, the
table in the 2020 MySQL version was parsed. Remarks “THE”, “ZTHE”, “ABIN”, “DFT” were flagged as theoretical, pressure information was parsed from the “PRE”, “ZPRE” fields, and temperature was parsed from the “TEM”, “ZTEM” fields.
Using these criteria, out of the 72
589 structures in MC3D-source we flag 3305 structures as theoretical, leaving out 69
284 structures that might be experimentally known. We note that 2214 of these have no flag, as e.g. the 2020 version of the ICSD database does not report anymore some of the structures originally imported from ICSD version 2017.2; these might also need to be flagged as deprecated in the future, as this might point to structures that have been retracted or corrected in later versions of the database. Because of the nature of the three source databases, we expect the vast majority of these structures to be experimentally known, even if we cannot exclude that some of them might be theoretical structures not correctly flagged as such in the source databases.
package v2.1.82,83 The
and
scripts were used to correct any incorrect syntax and remove unnecessary tags, respectively. The
script was run with the flags
,
and
. The
script was run with the flags
,
,
,
, and
. The cleaned CIF files were subsequently parsed using the
library v2018.12.12 (ref. 73) to extract the crystal structure defined through the cell basis vectors and atomic positions. Here a tolerance of 5 × 10−4 was employed in fractional coordinate distances to determine overlapping sites. The parsed crystal structure was then normalized and converted to primitive cell using
v1.8.1,84 using a symmetry precision
of 5 × 10−3 for the underlying
symmetry-detection code.85 The entire cleaning, parsing, and normalizing process is implemented in a single AiiDA workflow, called the
, which is available from the
81 Python package. Fig. 7 shows the provenance graph produced by the execution of a
.
After importing the CIF files with the
, we noticed a significant number of parsed structures had a reduced formula that is inconsistent with the formula reported in the original CIF. To address this, a post-processing step was executed that compares the chemical formula in the CIF with the chemical formula of the parsed structure. Structures with inconsistent formulae were flagged in the database, and not considered for the next step in filtering procedure described in Fig. 1 of the Results section. Table 2 gives a complete overview of the cleaning results for all structures imported from the three external databases.
workflows and post-processing formula check. Each row is a potential outcome of the process, where the first column describes the outcome and the remaining columns give the number of occurrences for CIFs imported from the COD, ICSD and MPDS, respectively. The “Different composition” and “Missing elements” outcomes are a result of the formula check, the other issues are direct failure modes (exit codes) of the
| Results of the CIF cleaning and parsing | Occurrences | |||
|---|---|---|---|---|
| COD | ICSD | MPDS | Total | |
| Successfully cleaned, parsed and normalized the crystal structure | 347 802 |
163 421 |
211 942 |
723 165 |
![]() |
||||
| Problematic CIFs | ||||
| Different composition | 36 358 |
7809 | 42 819 |
86 986 |
| Inconsistent formula: different compositions | 36 358 |
7809 | 42 819 |
86 986 |
| Missing elements | 9743 | 16 590 |
21 812 |
48 145 |
| Inconsistent formula: missing hydrogen | 7426 | 15 249 |
18 338 |
41 013 |
| Inconsistent formula: missing lithium | 42 | 77 | 133 | 252 |
| Inconsistent formula: missing other | 2275 | 1264 | 3341 | 6880 |
| Failed to parse | 8314 | 5978 | 28 060 |
42 352 |
| The CIF had invalid syntax that could not be corrected | 0 | 23 | 0 | 23 |
| The cleaned CIF defines no atomic sites | 1037 | 0 | 20 645 |
21 682 |
| The cleaned CIF defines sites with invalid atomic occupancies | 3136 | 3141 | 7415 | 13 692 |
| The cleaned CIF contains sites with unknown species | 1725 | 2814 | 0 | 4539 |
| The cleaned CIF defines sites with attached hydrogens with incomplete positional data | 2416 | 0 | 0 | 2416 |
| Symmetry issues | 117 | 73 | 372 | 562 |
| Failed to determine the primitive structure | 34 | 70 | 8 | 112 |
| Detected inconsistent symmetry operations | 83 | 3 | 364 | 450 |
| Total | 402 334 |
193 871 |
305 005 |
901 210 |
The last row shows the total number of files that have been imported and analyzed with the
. The bold rows indicate the corresponding (aggregated) category in the Sankey diagram in Fig. 1.
contain many duplicates. To determine the set of structures that are unique across all three databases, first the unique set of structures within each database was determined. The procedure first maps all the structures of a database on their chemical formula in the reduced Hill notation.86 One then only has to search for duplicates within each set of structures that share the same chemical formula, drastically reducing the computational complexity of the problem.
The second step is to separate the structures with the same formula by space group. This is mainly to avoid cases where the structure similarity algorithm is unable to properly differentiate distinct phases with very similar unit cells. The space group is determined using the
package with a
of 0.005. Using this strict approach, we give preference to cases where we consider two structures to be different that, in fact, should be flagged as similar. This is acceptable, since it is very likely that such cases will be detected as duplicates after the geometry has been optimized in the next step. As a consequence, we will be performing a geometry optimization for more structures than strictly necessary, but we will be less likely to discard distinct but similar crystal phases that should be considered independently.
For all structures that have the same reduced formula and space group, the similarity was analyzed using the
utility of the
73 package (v2023.5.10), with a fractional length tolerance of 0.2, a site tolerance of 0.3 and an angle tolerance of 5°. The options to convert to primitive cell or to attempt to match a supercell were disabled, since all structures had already been normalized using
in the previous step of the pipeline. Finally, within each family of duplicate structures, a single representative structure was selected as the prototype. Preference was given to structures originating from databases with more permissive usage terms, i.e. we selected from the COD first, followed by ICSD and MPDS.
509 structures originating from the COD were removed.
589 experimental unique inorganic crystal structures remains, which is referred to as the MC3D-source. The structures of the MC3D-source form the starting point of the next part of the work that seeks to computationally optimize their geometries.
The geometry optimization is performed in several steps, implemented as an AiiDA workflow, as shown in Fig. 8. The first step performs an initial geometry optimization with looser convergence parameters. This helps with two aspects: (1) it is less costly to obtain a reasonable first ground state geometry, improving the efficiency and (2) the less strict parameters makes it easier to converge for geometries far removed from the ground state, improving robustness. If the initial geometry optimization completes successfully, the optimization workflow is run again but this time with the convergence parameters as determined by the input protocol.
![]() | ||
| Fig. 8 Schematic diagram of the steps of the geometry optimization workflow, as described in the text. A dashed circular arrow around a block indicates the step can be repeated. | ||
Both the initial and full optimization steps are instances of the same workflow, run with different input parameters. Each workflow starts with generating the k-point mesh used to sample the Brillouin zone based on the input geometry, such that the k-point density in reciprocal space matches the minimum required density specified by the input protocol. The geometry is then optimized using QE's
code57,58 employing the SIRIUS library59 to allow running efficiently on Graphics Processing Unit (GPU) compute nodes of different architectures. The workflow automatically parses the output of the calculation and, in case of an error or a failure to converge, the input parameters are adjusted and restarted up to a maximum of 5 restarts. The logic for this automated workflow is described in the “Automated error handling” section of the Methods. The final optimization step is repeated until the following two conditions are met:
(1) The stresses in the final self-consistent field (SCF) performed by
are below the selected convergence threshold: during the geometry optimization, the basis sets used (i.e. the list of reciprocal-space G vectors), that depend not only on the energy cutoffs but also on the crystal geometry, are not updated at each step. The final SCF step that is performed at the end of the optimization recomputes the basis sets on the optimized unit cell. Large stresses in the final SCF indicate that the optimized geometry differs significantly from the initial geometry and the static basis sets used towards the end of the optimization were no longer converged enough. A restart of the optimization at the latest geometry is therefore performed.
(2) The k-point mesh corresponds to a lower density than the one dictated by the protocol: since the unit cell can change during the optimization, it is possible that, towards the end of the optimization cycle, the initial k-point mesh no longer satisfies the minimum required k-point density specified by the input protocol. In this case, a new k-point mesh is generated and another geometry optimization is performed.
subdatabase of the MC3D (see next subsection for the
and
subdatabases), pseudopotentials were taken from the Standard Solid State Pseudopotentials (SSSP) PBEsol61,62 efficiency v1.3 (ref. 87 and 88) library, which collects pseudopotentials from a number of libraries.89–95 SSSP provides a set of rigorously tested values for the recommended wavefunction and charge density energy cutoffs for each pseudopotential. For each structure, the highest value among those recommended for the elements in the composition of the material was selected. It should be noted here that recent tests versus an all-electron benchmark68 have found a suboptimal precision for the Hg and Au pseudopotentials provided by the SSSP v1.3 efficiency. Structures containing these elements are currently being recomputed with improved pseudopotentials and cutoffs, which will be added to the online database at a later point in time.
The magnetic and electronic properties cannot be reliably determined by just inspecting the geometry and composition of a crystal structure. Therefore, by default, all structures are considered to be magnetic and metallic in behavior in the initial geometry optimization. All calculations in the geometry optimization workflow are spin-polarized and are performed with smeared electronic occupations using the Marzari–Vanderbilt96 cold-smearing method with a broadening of 0.02 Ry (≈2.72 × 10−1 eV). Each calculation is initialized in a high-spin ferromagnetic configuration, where elements with partially occupied d or f orbitals are assigned a magnetic moment of 5µB or 7µB, respectively. For all other elements, the electrons are initialized to have a 10% surplus in the spin-up channel. To avoid an erroneous magnetic configuration being passed from the initial geometry optimization to the full one, magnetic moments are reinitialized in the second part of the workflow.
The Brillouin zone is sampled at k-points that are defined by a Monkhorst–Pack97 mesh including the Γ-point, where the distance between k-points in each reciprocal-space direction is at most 0.15 Å−1 (i.e., the algorithm chooses the smallest k-point mesh with at least the density required by the specified k-point distance). These values correspond to the extensively tested “balanced” protocol described in detail by de Miranda Nascimento et al.69
The threshold for electronic convergence in the SCF calculation is set to 0.2 × 10−9 Ry (≈2.72 × 10−9 eV) per atom. Geometry optimizations were stopped when the energy difference between two ionic steps was smaller than 10−5 Ry (≈1.36 × 10−4 eV) per atom, all forces on all atoms were smaller than 10−4 Ry bohr−1 (≈2.57 × 10−3 eV Å−1) and cell stress was smaller than 0.5 kbar (0.05 GPa).
,
, and
. The results discussed in this article focus on the most recent subdatabase,
. Table 3 provides an overview of the differences among the three methodology labels.
98 plugin package that publishes the workflow
The geometry optimization workflow, which for
and
used the workflows from
version 4, was changed for
to the workflows of
version 5, with the main differences being:
• For the MC3D
databases, the workflow ran a dedicated initial calculation of the electronic charge density to determine the electronic and magnetic character of the structure. This was dropped in
in favor of reinitializing the magnetic configuration for each geometry optimization;
• For the MC3D
databases, the workflow did not perform an initial geometry optimization with looser convergence parameters;
• For the MC3D
databases, the workflow only considered the volume difference between sequential geometry optimization runs in the convergence criteria;
As mentioned before, each methodology (physical method and computational protocol version) has been applied to (a subset of) the structures of the MC3D-source, which results in different optimized versions of these source structures. A naming convention was developed and adopted to unambiguously identify source structures of the MC3D-source and their optimized variants of the MC3D. All structures in the MC3D are given a unique identifier of the form
. Optimized geometries derived from MC3D-source structures are given a unique identifier of the form
. Here the
and
refer to the name of the main distinguishing physical methodology used in the optimization workflow and version of the corresponding computational protocol. These two concepts are separated because a change in physical methodology does not necessarily mean that its results supersede those of a different subdatabase. However, for a given physical method, multiple versions of the computational protocol may be used. These typically represent improvements, and the resulting optimized geometries should, in principle, supersede those produced by earlier protocol versions.
simulation code. As presented in the Results section, these calculations have various ways of failing that need to be dealt with automatically by the workflow as much as possible for the workflow to be scalable to tens of thousands of structures.
Therefore, a logical workflow was developed to wrap the calculation in a loop and run it until it completes successfully, whose logic is schematically represented in Fig. 9. Each time a calculation fails, predefined error handlers are called that inspect the results of the failed calculation and determine whether to abort the workflow or to perform an operation, such as changing selected input parameters, before restarting the calculation. The workflow automatically aborts if two calculations fail consecutively without the intervention of an error handler, or the number of iterations exceeds a predefined maximum to prevent the workflow from running indefinitely. The list of main error handlers includes, in order of executed priority:
• Insufficient bands: this handler performs a post-calculation check on successfully converged calculations to check if for any spin channel and k-point the highest band has an occupation >0.005, which indicates that insufficient bands have been used in the calculation. In this case, increase the number of bands (
) by 5% with a minimum of 4 bands.
• Diagonalization error: when the diagonalization algorithm fails, this handler systematically tries alternative algorithms in order of increasing robustness and computational cost (i.e., starting with the fastest but potentially less robust algorithm). The default algorithm is
, and alternatives are tried in the order:
,
, and finally
. If all algorithms have been tried, the calculation aborts.
• Out of walltime: handles calculations that exceeded the allocated walltime but shut down cleanly. Performs a full restart from the final structure (in case of a relaxation), charge density and wave functions.
• BFGS history failure: when the BFGS minimization algorithm fails, switches to the more robust
dynamics algorithm (or reduces the trust radius for
calculations), then restarts from the previous structure and charge density.
• Ionic convergence: handles other cases where ionic convergence thresholds were not met but the calculation shut down cleanly with a usable output structure. Restarts from the previous charge density using the last output structure.
• Electronic convergence: handles electronic convergence failures by reducing charge density mixing (
) by a factor of 0.8.
Although specifically developed for this work, the logic of this workflow is generic and can be used in combination with any calculation. It has since been contributed to the AiiDA workflow management system that, as of v1.1, provides it as an integrated component called the
(see AiiDA documentation99). There are now many AiiDA plugin packages that provide implementations of the
for a variety of codes.
, using version
of
. The only difference with the structure uniqueness analysis approach was that all cells were converted to primitive cell. This was disabled for the uniqueness analysis of the MC3D-source itself as that was already done that in a separate step before the analysis. However, this is not necessarily the case for the structures of the MP and OQMD, but even the structures of the MC3D-source may have changed significantly after the geometry optimization.
To compare structures of the MC3D, MP and OQMD, they were first grouped according to their space group and reduced chemical formula. Afterwards, each structure in the MC3D was compared against all the reference structures in the corresponding subgroup to identify whether any match is detected based on the
. To estimate the uncertainty of this analysis, since the assignment of the experimental reference to a certain structure might be handled differently in other databases, a matching was also performed on structures that refer to the same
(the internal identifier of structures in the ICSD, which is the source of the majority of structures in the MP and OQMD). For 657 (2.0%) out of 32
164 common
(including also tags that were filtered out as duplicates), the final structures reported in the MC3D and MP do not match. However, MP also considers the final structure to determine the reference, whereas we assign the reference database tag based on the initial structures. In 295 cases, the final MC3D structure does not match the initial structure, due to significant changes during geometry optimization, whereas the MP structure does. Further differences might be related to different XC-functionals, e.g., the PBE one adopted in the MP (the r2SCAN calculations are based on PBEsol geometry optimizations). This high agreement justifies the application of the
approach to estimate the amount of structures in the MC3D that are not found in the MP and OQMD.
,
, and
subdatabases of the MC3D are uploaded as AiiDA archive files and made publicly available on the Materials Cloud Archive. This data is imported into an AiiDA instance running on a Materials Cloud server, and served via the AiiDA REST API, which can be explored through the Explore section of the web application (https://www.materialscloud.org/explore/mc3d-pbesol-v2/). However, while this API is powerful and gives access to the full AiiDA data, it does not provide an intuitive interface to easily find and extract relevant data for non-experts and it does not allow to serve any additional metadata that is not integral part of the AiiDA provenance graph, such as the X-ray diffraction (XRD) data that was calculated as a post-processing step without AiiDA. Therefore, a metadata pipeline was developed which transforms the content of the MC3D into a reduced format that just contains the optimized geometries with their relevant properties and links to the nodes in the AiiDA provenance graph that computed those properties, as well any additional metadata and derived data (such as the XRD data). This metadata is stored in a MongoDB database which is then served via the Materials Cloud REST API, which is in turn consumed by the Discover section of the web application.
In addition to the MC3D explore and discover frontends and their related APIs, we also deploy an OPTIMADE50 compliant API to access the MC3D data. This makes it possible query the MC3D data in a standard manner, and explore it in any of the OPTIMADE-compliant clients available (such as the Materials Cloud OPTIMADE-client). We also provide an example Python script in Listing 1 (Chart 1) that demonstrates how to query the data programmatically, and convert structures to the Atomistic Simulation Environment12 format via the
65 library. Four example queries are provided, which query structures based on number of elements, chemical composition, the explicit MC3D ID, and the total magnetization.
![]() | ||
| Chart 1 Listing 1: Example script to query MC3D structures via the OPTIMADE API and convert into Atomic Simulation Environment (ase) [12] objects. The only Python dependency is optimade[ase]==1.3.1. | ||
package v3.1.0 (https://doi.org/10.5281/zenodo.18406626),100 which is made available under the MIT open-source license on GitHub at https://github.com/aiidateam/aiida-codtools. The automated workflows and input protocols to compute the optimized ground state electronic structure using Quantum ESPRESSO are bundled with the
package (https://doi.org/10.5281/zenodo.18406736),101 which is made available under the MIT open-source license on GitHub at https://github.com/aiidateam/aiida-quantumespresso. Both AiiDA plugin packages are distributed as installable packages through the Python Package Index, accessible at https://pypi.org/project/aiida-codtools and https://pypi.org/project/aiida-quantumespresso, respectively.
Footnote |
| † These authors contributed equally to this work. |
| This journal is © The Royal Society of Chemistry 2026 |