Adrian Xiao Bin Yong,*ab Tianyu Su,ab and Elif Ertekin*bc
aDepartment of Materials Science and Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. E-mail: axyong2@illinois.edu
bMaterials Research Laboratory, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
cDepartment of Mechanical Science and Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. E-mail: ertekin@illinois.edu
First published on 15th August 2024
Generative models have received significant attention in recent years for materials science applications, particularly in the area of inverse design for materials discovery. However, these models are usually assessed based on newly generated, unverified materials, using heuristic metrics such as charge neutrality, which provide a narrow evaluation of a model's performance. Also, current efforts for inorganic materials have predominantly focused on small, periodic crystals (≤20 atoms), even though the capability to generate large, more intricate and disordered structures would expand the applicability of generative modeling to a broader spectrum of materials. In this work, we present the Disordered Materials & Interfaces Benchmark (Dismai-Bench), a generative model benchmark that uses datasets of disordered alloys, interfaces, and amorphous silicon (256–264 atoms per structure). Models are trained on each dataset independently, and evaluated through direct structural comparisons between training and generated structures. Such comparisons are only possible because the material system of each training dataset is fixed. Benchmarking was performed on two graph diffusion models and two (coordinate-based) U-Net diffusion models. The graph models were found to significantly outperform the U-Net models due to the higher expressive power of graphs. While noise in the less expressive models can assist in discovering materials by facilitating exploration beyond the training distribution, these models face significant challenges when confronted with more complex structures. To further demonstrate the benefits of this benchmarking in the development process of a generative model, we considered the case of developing a point-cloud-based generative adversarial network (GAN) to generate low-energy disordered interfaces. We tested different GAN architectures and identified reasons for good/poor performance. We show that the best performing architecture, CryinGAN, outperforms the U-Net models, and is competitive against the graph models despite its lack of invariances and weaker expressive power. This work provides a new framework and insights to guide the development of future generative models, whether for ordered or disordered materials.
While generative modeling efforts for inorganic materials9,11,12 have primarily centered around simpler bulk crystals, there has been comparatively less emphasis on disordered systems, despite their relevance across a wide spectrum of applications.13–15 Disordered systems usually have complex and irregular structures, necessitating large atomic representations and requiring more powerful generative models than those developed for simple crystals. They include structures that completely lack crystal lattices, such as amorphous materials, as well as structures beyond bulk materials, such as surfaces and interfaces. In direct physical modeling, disordered materials are most typically represented by large so-called "supercells" that (spuriously) introduce periodicity at larger length scales. Databases of disordered materials are growing,16–18 offering compelling prospects for the inverse design of metal–organic frameworks, porous amorphous materials, amorphous battery materials, and more. Beyond materials discovery, generative modeling can be used to generate amorphous structures of arbitrarily large sizes upon training on smaller samples that capture material correlation lengths.19 This capability enables more thorough investigations into properties that are influenced by size effects, such as thermal conductivity and mechanical properties. Generative modeling can also be used to refine atomic structures to align with experimental observations,20 typically focusing on the refinement of disordered structures.21,22 Yet, the application of generative modeling to disordered systems remains limited, for example to generating 2D morphologies rather than precise atomic structures.19 Moreover, generative models have been reported to fail when applied to large systems.11,23 To reap the benefits of generative modeling for disordered systems, better generative models need to be developed and evaluated on such systems.
When building a generative model, two major design decisions are the type of generative model and the material representation (i.e., the input used to describe the material). The compatibility of these two choices is important as well. The types of generative models that have been used for materials include variational autoencoders (VAEs),6,7,9,11 generative adversarial networks (GANs),8,10,24,25 diffusion models,26–30 and language models.31–33 Generative models were initially developed with two main types of material representations: (1) voxels6,7,11,12 and (2) point clouds.8–10,25 The voxel representation is memory intensive, resulting in limitations in the voxelization resolution and thus the number of atoms (e.g., Court et al.11 restricted the number of atoms to ≤40 atoms per cell). Reconstruction issues were also reported11 for non-cubic cells. On the other hand, point clouds directly represent structures using their atomic coordinates and lattice parameters, making them highly scalable with the number of atoms. However, the design of point cloud architectures that are symmetry-invariant is not trivial, as the commonly used PointNet architecture34 does not include the desired symmetry invariances. More recently, other representations such as graphs,26,30 coordinate-based representations (e.g., UniMat,29 CrysTens28), and text-based representations31–33 have also been explored. Graphs are particularly attractive due to their symmetry invariances and strong expressive power, capturing both geometrical features and neighbor information. However, graph convolutions become computationally and memory intensive as the number of atoms increases.
To compare different generative models and make design choices, it is necessary to sufficiently evaluate the generated structures for their validity. Training generative models for materials discovery inherently makes the evaluation of the models' performance difficult. In more conventional problems such as image generation or speech synthesis, it is relatively easy to discern if the model has learned to generate realistic images or speech from the training data. However, it is much more difficult to determine if a newly generated material is realistic, or if the model is simply generating noise. Recent generative models26,27,29,33 have relied on limited and heuristic metrics (e.g., charge neutrality, material space coverage) to evaluate and compare between models, making it difficult to meaningfully assess model performance. One approach to circumventing the issues of evaluating on new, unknown materials is to instead train the model on a fixed set of materials (e.g., perovskites). Restricting the material space allows for easier determination of whether the correct structure is being generated, and direct comparison of the properties between the generated and training structures can be performed. In this scenario, however, the materials on which the model is trained should be sufficiently challenging to provide meaningful evaluation of the model's performance. In this regard, disordered materials are good candidates for the task, given that generative models can fail to generate even a single type of disordered material (e.g., amorphous silicon23).
In this work, we present the Disordered Materials & Interfaces Benchmark (Dismai-Bench), a generative model benchmark that uses datasets of an Fe60Ni20Cr20 austenitic stainless steel, a disordered Li3ScCl6(100)–LiCoO2(110) battery interface, and amorphous silicon. Dismai-Bench evaluates generative models on a wide range of material disorder ranging from structural to configurational (see Fig. 1). Structural disorder increases from left to right, and configurational disorder increases from right to left, in Fig. 1. The composition of each dataset is fixed, and each structure has 256–264 atoms. We selected four recent diffusion models to be benchmarked on Dismai-Bench, including two models that use graph representations (CDVAE26 & DiffCSP27) and two models that use coordinate-based representations (CrysTens28 & UniMat29). The models were trained on one dataset at a time, and the generated structures were compared with the training structures to obtain structural similarity metrics. These metrics quantify the model's ability to learn complex structural patterns found in disordered materials. We show that the graph models outperform the coordinate-based models due to the higher expressive power of graphs. The success of the less expressive models in materials discovery29,33 suggests that noisy models are better for discovering small crystals, but face challenges when tasked with generating larger, more complex structures.
To demonstrate the application of Dismai-Bench in the development of a generative model, we further considered the design of a GAN to generate low-interface-energy Li3ScCl6(100)–LiCoO2(110) interface structures. We chose the simple point cloud representation and tested multiple GAN architectures, including variants that supply explicit bond distance information to the GANs rather than atomic coordinates alone. Direct comparison between the generated and training structures identified the architecture that best achieved the intended goal, along with explanations for why the other architectures were less successful. We demonstrate that the best architecture, Crystal Interface Generative Adversarial Network (CryinGAN), can generate disordered interfaces with low interface energy and structural features similar to those of the training structures. Despite its design simplicity, CryinGAN outperforms the more recent coordinate-based diffusion models on Dismai-Bench. It does not outperform the graph diffusion models across all datasets, however, possibly as a result of its weaker expressive power and lack of invariances. Through this work, we present a novel framework for conducting meaningful comparisons between models, providing valuable insights into model weaknesses and failures to inform the design of future generative models.
The degree of SRO was quantified using the Warren–Cowley SRO parameter,

$$\alpha_{ij}^{m} = 1 - \frac{P_{ij}^{m}}{c_j} \qquad (1)$$

where $P_{ij}^{m}$ is the probability of finding an atom of type $j$ in the $m$-th nearest neighbor shell of an atom of type $i$, and $c_j$ is the overall concentration of type $j$. A value of zero indicates no preference (a random solid solution), while nonzero values indicate preferential neighbor arrangements.
The SRO distributions of the 300 K and 1500 K alloy training datasets are shown in ESI Fig. S1† for the 1st and 2nd nearest neighbor interactions. The 300 K dataset shows more prominent SRO than the 1500 K dataset: the SRO parameters tend to distribute away from zero at 300 K, and consistently distribute near zero at 1500 K. We also created two additional datasets by filtering the CEMC-generated structures such that the SRO parameters have narrow distributions within ±0.1 of the average values (see ESI Fig. S2†). A sufficiently large number of structures was generated to obtain 1500 structures for each dataset. We refer to the unfiltered and filtered datasets as the wide SRO and narrow SRO datasets, respectively. This SRO filtering was performed to enable comparison of generative model performance when trained on structures with a noisier SRO distribution (wide SRO dataset) versus a less noisy SRO distribution (narrow SRO dataset).
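For illustration, the SRO computation and filtering step can be sketched as follows, assuming pymatgen is available and `structures` holds the CEMC snapshots as pymatgen Structure objects (shown here for the Fe–Ni first shell only; the actual filtering considered all pair types and both neighbor shells):

```python
import numpy as np
from pymatgen.core import Element

def warren_cowley(structure, center="Fe", neighbor="Ni", cutoff=2.9):
    """First-shell Warren-Cowley parameter, eqn (1): alpha = 1 - P_ij / c_j.
    A cutoff of ~2.9 A isolates the 1st FCC shell for a = 3.6 A (d1 ~ 2.55 A)."""
    c_j = structure.composition.get_atomic_fraction(Element(neighbor))
    n_j = n_tot = 0
    for site, nbrs in zip(structure, structure.get_all_neighbors(cutoff)):
        if site.specie.symbol != center:
            continue
        n_tot += len(nbrs)
        n_j += sum(1 for nbr in nbrs if nbr.specie.symbol == neighbor)
    return 1.0 - (n_j / n_tot) / c_j

# Keep only snapshots whose SRO lies within +/-0.1 of the dataset average
alphas = np.array([warren_cowley(s) for s in structures])
narrow_sro = [s for s, a in zip(structures, alphas) if abs(a - alphas.mean()) <= 0.1]
```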
The Li3ScCl6(100)–LiCoO2(110) interface dataset was created by generating random interface structures and relaxing them. Each structure was first constructed by randomly generating 3 formula units of LSC atoms in the interface region between the LSC and LCO slabs. For each structure, the thickness of the interface region was randomly selected between 4 and 6 Å, and a random lateral displacement was applied to the LSC slab (translation allowed along the full range of both lateral directions). The randomly generated structures were relaxed using density functional theory (DFT) calculations.
To perform relaxations faster, we trained from scratch a machine learning interatomic potential, M3GNet,49 for the LSC–LCO interfaces. A total of 15484 training structures consisting of optimized structures and intermediate ionic steps of the DFT relaxations were used to train the M3GNet interatomic potential. The M3GNet model achieved low test set mean absolute errors (MAEs) of 2.70 meV per atom, 20.9 meV Å−1, and 0.0146 GPa for energy, force, and stress respectively. Machine learning interatomic potentials with similar (or higher) MAEs showed good performance in relaxations and molecular dynamics simulations when applied to other Li-ion conductors.49,50 We then relaxed randomly generated interface structures using the M3GNet interatomic potential. The M3GNet-relaxed structures were found to be near DFT convergence (refer to Table 5 in the Methods section). The interface energies of the relaxed structures were distributed across a wide range (approximately 1.4 J m−2) as shown in ESI Fig. S4.† We define structures with interface energies no higher than 0.4 J m−2 relative to the lowest energy structure to be low-interface-energy structures. For reference, the observation frequency of a grain boundary in aluminum metal decreases by 95% when the grain boundary energy increases by around 0.35 J m−2 (ref. 51). We assembled 1500 low-interface-energy structures (all relaxed by M3GNet only) as the disordered interface dataset.
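To illustrate the relaxation step, a minimal sketch using the open-source m3gnet package is shown below; `trained_potential` stands in for the Potential object fitted to the interface training data (how it is loaded depends on the package version), and the stopping criterion mirrors the 0.05 eV Å−1 force threshold described in the Methods section:

```python
from m3gnet.models import Relaxer

# relax_cell=False keeps the lattice fixed; only atom positions are relaxed
relaxer = Relaxer(potential=trained_potential, relax_cell=False)

result = relaxer.relax(structure, fmax=0.05)        # stop when max force < 0.05 eV/A
final_structure = result["final_structure"]
n_ionic_steps = len(result["trajectory"].energies)  # steps taken to converge
```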
CDVAE26 is a VAE that uses equivariant graph neural networks for its encoder and decoder. The encoder is a graph convolutional network that encodes material structures into latent representations (z). Three multilayer perceptrons (MLPs) are used to predict the composition, lattice parameters, and number of atoms from z. These predictions are used to initialize structures (corresponding to the sampled z), where the atoms are initialized at random positions. The decoder is a graph diffusion model that denoises both the atomic coordinates and atomic species of the atoms.
In our early tests of training CDVAE on Dismai-Bench datasets, CDVAE was found to fail in generating structures with the correct compositions (see ESI Fig. S5†). Although CDVAE was able to predict the compositions correctly from z, the compositions became incorrect after the atomic species of the atoms were denoised. Therefore, we modified CDVAE such that the atomic species denoising becomes an optional feature, and all CDVAE benchmarking was performed without atomic species denoising. We also tested the effect of atomic species denoising when CDVAE is trained on the MP-20 dataset,26 which includes structures from the Materials Project52 of various compositions. Here, denoising the atomic species was found to increase the composition accuracy of reconstructed structures from around 24% to 54%. However, denoising the atomic species appears to be detrimental for larger structures (such as those in the Dismai-Bench datasets), and fails even when all structures have the same composition.
DiffCSP27 is another graph diffusion model. The main feature introduced in DiffCSP is the ability to jointly denoise the atomic coordinates and lattice parameters. In contrast, CDVAE predicts the lattice parameters first, and they remain fixed throughout the diffusion steps. Note that DiffCSP does not denoise the atomic species. To better capture periodicity, DiffCSP uses fractional coordinates rather than the Cartesian coordinates used in CDVAE, and uses periodic translation invariant Fourier transformations in its message passing. However, DiffCSP does not include bond angle information in its graphs, whereas CDVAE does.
When DiffCSP was trained on the disordered interface dataset allowing joint atomic coordinate and lattice parameter diffusion, the generated structures were found to be of poor quality (refer to ESI Fig. S6a†). We modified DiffCSP to allow teacher forcing of lattice parameters during the initial training epochs, where the ground truth lattice parameters are used as input, and lattice cost is not used to update the model (only coordinate cost is used). The quality of the generated structures improved when trained with teacher forcing (see ESI Fig. S6b†). However, structures with the best quality were still obtained when DiffCSP was trained without any lattice diffusion, using the ground truth lattice parameters as input (see ESI Fig. S6c†). Therefore, all benchmarking of DiffCSP presented in Section 2.3 was performed without lattice denoising. This choice also provides a consistent comparison between DiffCSP and CDVAE, since CDVAE also does not perform lattice denoising. For reference, the Dismai-Bench metrics of DiffCSP trained with lattice denoising and teacher forcing are listed in ESI Table S1.† DiffCSP performance drops with lattice denoising across the metrics (see Section 2.3 for benchmarking details).
CrysTens28 is an image-like representation for materials. The pixel values of a CrysTens image are filled with information of the structure such as lattice parameters, fractional coordinates, and atomic number. Each CrysTens image has four channels, analogous to the RGB color channels of an image. The first channel includes a pairwise distance matrix between all atom pairs, and the remaining three channels include the pairwise Δx, Δy, and Δz matrices respectively. CrysTens is used with a 2D U-Net diffusion model for image generation, and the generated CrysTens images are reconstructed back into material structures. We made no major modifications to the baseline implementation. The dimensions of the CrysTens images were increased to fit Dismai-Bench structures (the original implementation only allowed up to 52 atoms). Refer to the Methods section for more details.
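A minimal sketch of the pairwise-geometry channels is given below; it is simplified in that the real CrysTens layout also stores the lattice parameters, fractional coordinates, and atomic numbers in reserved rows and columns of each channel:

```python
import numpy as np

def crystens_channels(frac_coords, n_max):
    """frac_coords: (N, 3) array of fractional coordinates; pads to n_max atoms."""
    n = len(frac_coords)
    img = np.zeros((4, n_max, n_max))
    delta = frac_coords[:, None, :] - frac_coords[None, :, :]  # (N, N, 3) differences
    img[0, :n, :n] = np.linalg.norm(delta, axis=-1)            # channel 1: pairwise distances
    img[1, :n, :n] = delta[..., 0]                             # channel 2: pairwise dx
    img[2, :n, :n] = delta[..., 1]                             # channel 3: pairwise dy
    img[3, :n, :n] = delta[..., 2]                             # channel 4: pairwise dz
    return img
```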
UniMat29 is a video-like representation for materials. Each frame of a UniMat video represents a single atom in the structure, and has three channels corresponding to the x, y, and z coordinates. The atomic species of the atom is indicated by the pixel location in the frame (e.g., top left pixel is H), where those pixel values correspond to the coordinates of the atom, and all other pixel values are set to −1. UniMat is used with a 3D U-Net diffusion model for video generation, and the generated UniMat videos are reconstructed back into materials. Despite the lack of any geometrical information beyond atomic coordinates, UniMat was shown to outperform CDVAE in discovering materials with lower formation energy (upon DFT relaxation of the generated structures). As the UniMat code is currently not openly available, we used an open-access implementation53 of the 3D U-Net model54 that the UniMat model was repurposed from. Besides this change, no major modifications were made to UniMat. The dimensions of the UniMat frames were decreased, since each Dismai-Bench structure has only five atomic species at most.
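The representation can be sketched as follows, assuming a reduced pixel grid (the published UniMat uses a periodic-table-shaped grid; since the Dismai-Bench structures contain at most five species, a 1 × 5 grid suffices, and `coords` and `species` describe one structure):

```python
import numpy as np

def unimat_video(coords, species, pixel_of, h, w):
    """One frame per atom, channels = (x, y, z); only the pixel assigned to the
    atom's species holds its coordinates, all other pixels are set to -1."""
    video = -np.ones((len(coords), 3, h, w))
    for i, (xyz, sp) in enumerate(zip(coords, species)):
        r, c = pixel_of[sp]
        video[i, :, r, c] = xyz
    return video

video = unimat_video(coords, species, h=1, w=5,
                     pixel_of={"Li": (0, 0), "Co": (0, 1), "Sc": (0, 2),
                               "Cl": (0, 3), "O": (0, 4)})
```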
Although the benchmark metrics for all five generative model architectures are listed in this section, only the results for CDVAE, DiffCSP, CrysTens, and UniMat will be discussed here, since the CryinGAN architecture has not been presented yet (see Section 2.4). The benchmark results for CryinGAN will be discussed in Section 2.6, along with an overall comparison of the general performance of all architectures.
The generated structures were relaxed, and the benchmark metrics were calculated by analyzing the local coordination environment (motifs) of the atoms. The CrystalNNFingerprint,55 which contains the coordination likelihoods and local structure order parameters of a given atom, was averaged across the structures generated by each model, and separately obtained for the training dataset as well. We calculated the fingerprints of the cations (Li, Co, and Sc) coordinated to the anions (Cl and O). We also appended the fraction of Cl and O neighbors in each motif to the fingerprints, so that the fingerprints contain both chemical and coordination information. For each model, the Euclidean distance between the average fingerprint of the training structures and the generated structures was calculated.
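A minimal sketch of this metric using matminer is shown below; for brevity it uses the default CrystalNN neighbor finding and omits the anion-only restriction and the appended Cl/O neighbor fractions (`train_structs` and `gen_structs` are lists of relaxed pymatgen Structures):

```python
import numpy as np
from matminer.featurizers.site import CrystalNNFingerprint

cnn_fp = CrystalNNFingerprint.from_preset("ops")  # coordination likelihoods + order parameters
CATIONS = {"Li", "Co", "Sc"}

def avg_cation_fingerprint(structures):
    per_struct = []
    for s in structures:
        fps = [cnn_fp.featurize(s, i)
               for i, site in enumerate(s) if site.specie.symbol in CATIONS]
        per_struct.append(np.mean(fps, axis=0))  # average over sites in a structure
    return np.mean(per_struct, axis=0)           # then average over structures

d_all = np.linalg.norm(avg_cation_fingerprint(train_structs)
                       - avg_cation_fingerprint(gen_structs))
```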
The benchmark metrics for the disordered interfaces are shown in Table 1. Parameters d_Li, d_Co, d_Sc, and d_all represent the fingerprint Euclidean distance of Li, Co, Sc, and all cations respectively. The percentages of structures that failed to be post-processed or relaxed are also listed. Notably, CDVAE is the only model that generates structures close to being fully relaxed, requiring only an average of 10 relaxation steps per structure (see ESI Table S2†). All other models required at least 68 relaxation steps on average, and the U-Net models have >50% failed structures. UniMat exhibited the highest % failed structures and d_all, likely because it uses a less expressive material representation (atomic coordinates only). CrysTens, despite the inclusion of bond distances, also has a high % failed structures. Although its Euclidean distance metrics are small, the low values are only achieved after relaxation (refer to Fig. 2 and Table S2† for unrelaxed structures). On the other hand, the graph diffusion models perform significantly better than the U-Net diffusion models due to the higher expressive power of graphs. Comparing between CDVAE and DiffCSP, CDVAE generates interface structures closer to convergence than DiffCSP, but upon relaxation, DiffCSP achieves lower distance metrics than CDVAE.
Model | d_Li (min, max) | d_Co (min, max) | d_Sc (min, max) | d_all (min, max) | % Struc. failed (min, max)
---|---|---|---|---|---|
CDVAE | 0.0664 (0.0601, 0.0697) | 0.0427 (0.0372, 0.0504) | 0.201 (0.186, 0.214) | 0.0547 (0.0508, 0.0589) | 0.00 (0.00, 0.00) |
DiffCSP | 0.0439 (0.0414, 0.0474) | 0.0272 (0.0240, 0.0298) | 0.0954 (0.0909, 0.102) | 0.0370 (0.0353, 0.0388) | 7.17 (6.00, 8.90) |
CrysTens | 0.0202 (0.0160, 0.0244) | 0.0141 (0.00707, 0.0239) | 0.0972 (0.0888, 0.106) | 0.0213 (0.0164, 0.0257) | 55.3 (50.6, 61.7) |
UniMat | 0.131 (0.0966, 0.154) | 0.235 (0.165, 0.292) | 0.101 (0.0891, 0.111) | 0.151 (0.111, 0.181) | 64.6 (60.6, 68.3) |
CryinGAN | 0.0538 (0.0466, 0.0600) | 0.0274 (0.0210, 0.0359) | 0.0888 (0.0838, 0.0963) | 0.0426 (0.0379, 0.0451) | 9.00 (7.20, 10.2) |
The benchmark metrics for amorphous Si are shown in Table 2. Parameters d_motif, d_RDF, and d_angle represent the Euclidean distance for the motif fingerprint, RDF, and bond angle distribution respectively. Both CDVAE and DiffCSP had 0% failed structures, but CDVAE-generated structures were significantly closer to being fully relaxed than DiffCSP-generated structures. CDVAE only required an average of 23 relaxation steps per structure, whereas DiffCSP required an average of 140 steps per structure (see ESI Table S3†). Similarly, all distance metrics of CDVAE are lower than those of DiffCSP. The lack of bond angle information in DiffCSP's graphs is likely one main contributor to these trends. The RDFs and bond angle distributions of CDVAE and DiffCSP are shown in ESI Fig. S7.† Both DiffCSP and CDVAE encode distance information in their graphs, and both show RDFs similar to the training dataset. However, there is a clear difference when comparing the bond angle distributions, where DiffCSP shows a larger discrepancy with the training dataset than CDVAE. These results indicate that bond angle information is particularly beneficial for helping generative models learn amorphous structures, since amorphous structures have more complicated arrangements of atoms that are not located on lattice sites.
Model | d_motif (min, max) | d_RDF (min, max) | d_angle (min, max) | % Struc. failed (min, max)
---|---|---|---|---|
CDVAE | 0.0402 (0.0396, 0.0412) | 0.392 (0.381, 0.408) | 0.00332 (0.00312, 0.00353) | 0.00 (0.00, 0.00) |
DiffCSP | 0.0647 (0.0462, 0.0908) | 1.39 (1.03, 1.69) | 0.0103 (0.00797, 0.0133) | 0.00 (0.00, 0.00) |
The generated structures were post-processed to remove atoms not on lattice sites. Any structure with >50 atoms removed (∼20% of all atoms) was considered a failed structure and rejected. Then, the fingerprint of each structure was calculated using a vector of conditional probabilities of observing each cluster (monomer/dimer) in the structure, up to the 7th neighbor shell. The Euclidean distance between the average cluster fingerprint of the training and generated structures, d_cluster, was calculated for each model.
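A minimal sketch of the cluster fingerprint is given below, assuming `neighbor_shells` has been precomputed from the ideal FCC lattice so that `neighbor_shells[m]` lists the site-index pairs of the (m+1)-th neighbor shell:

```python
import itertools
import numpy as np

SPECIES = sorted(["Fe", "Ni", "Cr"])                                # ["Cr", "Fe", "Ni"]
DIMERS = list(itertools.combinations_with_replacement(SPECIES, 2))  # 6 decorated pairs

def cluster_fingerprint(site_species, neighbor_shells):
    """Monomer probabilities plus dimer conditional probabilities per shell,
    e.g. P(Cr-Fe dimer | two 1st-nearest-neighbor sites)."""
    fp = [site_species.count(sp) / len(site_species) for sp in SPECIES]  # monomer terms
    for shell_pairs in neighbor_shells:                                  # shells 1..7
        counts = dict.fromkeys(DIMERS, 0)
        for i, j in shell_pairs:
            counts[tuple(sorted((site_species[i], site_species[j])))] += 1
        fp += [counts[d] / len(shell_pairs) for d in DIMERS]
    return np.array(fp)
```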
Benchmark metrics for the disordered alloys are shown in Table 3. The percentage of failed structures is 0% for DiffCSP and CryinGAN, while CDVAE exhibits a tiny percentage of failed structures. However, the percentage of structures with site vacancies is 90–100% for DiffCSP and CryinGAN. In comparison, CDVAE has significantly lower percentages of structures with vacancies: roughly 40% for 300 K structures and 60% for 1500 K structures. Although DiffCSP-generated structures had more well-defined lattices than CDVAE-generated structures, almost all of them had overlapping atoms on lattice sites, resulting in site vacancies.
Model | d_cluster (min, max) | % Struc. w/ vac. (min, max) | % Struc. failed (min, max)
---|---|---|---
300 K, narrow SRO | | |
CDVAE | 0.0604 (0.0540, 0.0675) | 35.8 (32.0, 41.2) | 0.13 (0.00, 0.40)
DiffCSP | 0.0645 (0.0566, 0.0792) | 94.3 (91.8, 96.5) | 0.00 (0.00, 0.00)
CryinGAN | 0.117 (0.106, 0.126) | 100 (100, 100) | 0.00 (0.00, 0.00)
300 K, wide SRO | | |
CDVAE | 0.0658 (0.0598, 0.0742) | 37.4 (34.0, 43.4) | 0.07 (0.00, 0.10)
DiffCSP | 0.103 (0.0831, 0.121) | 95.8 (95.4, 96.2) | 0.00 (0.00, 0.00)
CryinGAN | 0.125 (0.119, 0.129) | 100 (100, 100) | 0.00 (0.00, 0.00)
1500 K, narrow SRO | | |
CDVAE | 0.0621 (0.0589, 0.0665) | 54.8 (51.9, 56.8) | 0.37 (0.20, 0.50)
DiffCSP | 0.0308 (0.0304, 0.0313) | 91.7 (91.5, 91.8) | 0.00 (0.00, 0.00)
CryinGAN | 0.0643 (0.0629, 0.0659) | 100 (100, 100) | 0.00 (0.00, 0.00)
1500 K, wide SRO | | |
CDVAE | 0.0618 (0.0549, 0.0694) | 61.3 (55.9, 68.2) | 0.27 (0.20, 0.30)
DiffCSP | 0.0338 (0.0332, 0.0342) | 92.3 (89.2, 94.4) | 0.00 (0.00, 0.00)
CryinGAN | 0.0650 (0.0609, 0.0691) | 100 (100, 100) | 0.00 (0.00, 0.00)
For the 300 K structures with narrow SRO, the d_cluster values of CDVAE and DiffCSP are similar. However, for the 300 K structures with wide SRO, CDVAE has a lower d_cluster than DiffCSP. The wide SRO structures have less consistent SRO distributions, so DiffCSP had more difficulty learning the wide SRO structures than the narrow SRO structures. On the other hand, d_cluster only increased slightly for CDVAE between the narrow and wide SRO structures, indicating CDVAE's stronger ability to learn different degrees of SRO. For the 1500 K structures, which more closely resemble random solid solutions, there is little difference in d_cluster between the narrow and wide SRO structures. Here, d_cluster is lower for DiffCSP than for CDVAE. Some factors contributing to DiffCSP's better performance in generating random solid solutions may be its stronger ability to learn the FCC lattice, weaker ability to learn SRO patterns, and low number of site vacancies (∼1.5 vacancies per structure). The Warren–Cowley SRO parameter distributions for the 1st and 2nd nearest neighbor interactions are shown in ESI Fig. S9–11† for reference. Overall, the models were able to generate structures with SRO distributions similar to those of the training structures.
A typical GAN consists of two neural networks: a generator and a discriminator. The role of the generator is to generate material structures from input noise, whereas the role of the discriminator is to distinguish between the real (training) structures and the fake (generated) structures. The generator and discriminator compete with each other during training to progressively improve the quality of the generated structures. The point cloud representation was used for the GANs, and since all of the structures in the dataset share the same lattice parameters, each structure was represented by the fractional coordinates of its atoms only. We tested several different GAN architectures, as summarized in Fig. 5.
The base GAN model (CryinGAN) was adapted from the Composition-Conditioned Crystal GAN (CCCGAN) presented by Kim et al.,25 which was used to generate Mg–Mn–O ternary materials. One-dimensional (1D) convolutions were used in the discriminator to extract the latent features of structures, an inspiration taken from PointNet,34 a 3D object classification and segmentation network. These 1D convolutions have been similarly implemented in other point-cloud-based crystal generative models such as FTCP-VAE9 and CubicGAN.10 We simplified and generalized CCCGAN to be used for any periodic system with fixed lattice, composition, and number of atoms. We did not include any conditional generation capability in CryinGAN for simplicity. The discriminator of the original CCCGAN relied solely on atomic coordinates and lattice parameters to distinguish between real and fake structures. To further provide explicit bond distance information to the discriminator, we added a second discriminator to the CryinGAN model that has the same architecture, but accepts bond distances as input instead of coordinates.
The CryinGAN model architecture is shown in Fig. 6. The generator accepts random Gaussian noise as input, and produces fractional coordinates of structures as output. Of the two discriminators, one accepts fractional coordinates as input, and the other accepts bond distances (6 nearest neighbors of each atom) as input. In the first convolutional layer of the discriminators, the fractional coordinates/bond distances are convolved along each row separately (i.e., separate 1D convolutions for each atom). Note that CryinGAN is permutationally invariant to atom ordering within each atomic species block (e.g., the order of the 72 Li atoms does not matter). CryinGAN implements the Wasserstein loss, which was shown to provide more stable training by preventing the vanishing gradients of the traditional GAN.56,57 The discriminator loss function with a gradient penalty term for improved stability57 is:
$$L_{\mathrm{disc}} = \mathop{\mathbb{E}}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathop{\mathbb{E}}_{x \sim P_r}[D(x)] + \lambda_{\mathrm{gp}} \mathop{\mathbb{E}}_{\hat{x} \sim P_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\big] \qquad (2)$$

where $D$ is the discriminator, $P_r$ and $P_g$ are the real and generated data distributions, $\hat{x}$ are samples interpolated between pairs of real and generated structures, and $\lambda_{\mathrm{gp}}$ is the gradient penalty coefficient. The total discriminator loss combines the contributions of the coordinate and bond distance discriminators:
$$L_{\mathrm{disc,total}} = L_{\mathrm{disc,coord}} + \lambda L_{\mathrm{disc,bond}} \qquad (3)$$

where λ weights the contribution of the bond distance discriminator. The corresponding generator loss function is:
$$L_{\mathrm{gen}} = -\mathop{\mathbb{E}}_{\tilde{x} \sim P_g}[D(\tilde{x})] \qquad (4)$$

and the total generator loss is:
$$L_{\mathrm{gen,total}} = L_{\mathrm{gen,coord}} + \lambda L_{\mathrm{gen,bond}} \qquad (5)$$
Note that the base CryinGAN model uses only the fractional coordinate discriminator (λ = 0), which we use as the baseline reference model.
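A minimal PyTorch sketch of eqns (2)–(5) is shown below; `gen`, `disc_coord`, and `disc_bond` stand for the CryinGAN networks, `bond_dists` is assumed to map coordinates of shape (batch, N, 3) to each atom's 6 nearest-neighbor distances, and the gradient penalty coefficient of 10 follows the WGAN-GP paper57 rather than a value reported here:

```python
import torch

def gradient_penalty(disc, real, fake, lambda_gp=10.0):
    """Gradient penalty term of eqn (2), evaluated at interpolated samples."""
    eps = torch.rand(real.size(0), 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad, = torch.autograd.grad(disc(x_hat).sum(), x_hat, create_graph=True)
    return lambda_gp * ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

def critic_loss(disc, real, fake):
    # Wasserstein critic loss, eqn (2): E[D(fake)] - E[D(real)] + gradient penalty
    return disc(fake).mean() - disc(real).mean() + gradient_penalty(disc, real, fake)

# One training step of the two-discriminator setup, eqns (3) and (5)
fake = gen(torch.randn(batch_size, z_dim))
loss_disc = (critic_loss(disc_coord, real_coords, fake.detach())
             + lam * critic_loss(disc_bond, bond_dists(real_coords),
                                 bond_dists(fake.detach())))
loss_gen = -disc_coord(fake).mean() - lam * disc_bond(bond_dists(fake)).mean()
```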
We developed the GAN models with the intention of evaluating the best performing model by comparing DFT-relaxed training and generated structures. The training dataset was curated from structures relaxed with the M3GNet interatomic potential followed by DFT calculations (refer to Methods section). Note that the CryinGAN Dismai-Bench metrics shown in Section 2.3 were calculated using the same procedure as described previously. To study the effect of λ on CryinGAN model performance, we trained CryinGAN models on the (DFT-relaxed) dataset of interface structures with varying λ. A visualization of the training process is shown in ESI Movie S1,† where CryinGAN progressively learns to generate low interface energy structures over training epochs. Due to noise in the models, the generated structures often have a small fraction of atoms that are too close to each other. As λ increases, we found that the number of pairs of atoms generated too close together quickly decreases and then levels off around λ = 0.05–0.1 (see ESI Fig. S12†). This observation is an indicator that including the bond distance discriminator is helpful for training the generator to create structures with more reasonable atom–atom distances. The generated structures were then relaxed using M3GNet; refer to ESI Fig. S13† for examples of structures before and after relaxation.
The objective was to obtain a model with the lowest interface energy distribution. Fig. 7 shows the interface energy distributions of the models trained with different λ. For each λ, three separate models were trained, and the spread of the interface energy distributions is indicated by the shading (the distributions of all trained models, without shading, are provided in ESI Fig. S14†). The results indicate that as λ increases from 0 to 0.05, the interface energy distribution shifts to lower energies. However, as λ further increases to 0.1, the interface energy distribution shifts to higher energies. The coordination motif fingerprint distance between the generated and training structures shows the same trend (see ESI Fig. S15†), decreasing between λ = 0 and λ = 0.05, then increasing when λ > 0.05. These observations show that optimal values of λ improve model performance, but excessive weight on the bond distance discriminator causes the generator to prioritize the bond distances too much over positioning the atoms correctly. The use of a second discriminator does slow down training compared to using only a single discriminator, requiring around twice as long to train the model for the same number of epochs. However, we found that two discriminators still outperform a single discriminator given the same amount of training time (see ESI Fig. S16†), demonstrating the benefit of the bond distance discriminator. Whereas around 34% of the generated structures failed to converge during M3GNet relaxation for λ = 0, only around 11% did not converge for λ = 0.05. These results show that with an appropriately tuned λ, the bond distance discriminator improves model performance.
We further considered several alternative GAN architectures (refer to Fig. 5). CryinGAN-comb combines both discriminators into a single discriminator, circumventing the need to tune λ. CryinGAN-max uses max pooling in the discriminator, instead of average pooling as in CryinGAN. CryinGAN-mix uses the mix pooling operation proposed by Wang et al.,58 where both max and average pooling operations are used together. The type of pooling operation affects the sampling sensitivity of the discriminator and the overall performance of the GAN. The sampling sensitivity describes how sensitive the discriminator is to changes in point density or the sampling pattern of the input point cloud. Max pooling was reported to cause lower sampling sensitivity than average pooling.58 Overall, CryinGAN was found to outperform all of these alternative architectures; see ESI Supplementary Note 2† for further details.
The superior performance of CryinGAN over CryinGAN-max and CryinGAN-mix shows that higher sampling sensitivity is beneficial for learning atomic configurations. The sampling sensitivity can be further increased through the use of graph convolutions, where each atom is convoluted with its surrounding atoms and bonds. We attempted a graph convolutional discriminator by adapting the commonly used Crystal Graph Convolutional Neural Networks (CGCNN).59 However, the training losses diverged and it was not possible to train a GAN that could generate useful structures (see ESI Fig. S17†). This result is consistent with the findings of Wang et al.,58 whose graph convolutional GAN also failed to produce point clouds of 3D objects. Graph convolutions are highly sensitive to sampling, making them prone to overfocus on the sampling pattern of a point cloud instead of the overall structure. This overfocus bears similarity to the behavior observed here for CryinGAN at high λ values. For graph convolutions to be implemented in point-cloud-based GANs, we expect that a carefully designed architecture will be required to take advantage of its sampling sensitivity without destabilizing training.
We filtered CryinGAN structures with low interface energy (≤0 J m−2) for structural comparison with the training structures. We also compared the CryinGAN structures to a dataset of high-interface-energy structures (>0 J m−2) that were randomly generated and relaxed. We analyzed the coordination motif fingerprints of the cations (Li and Sc) in the interface region coordinated to the anions (Cl and O) to determine the structural similarities and differences. The comparison of Li motifs showed smaller differences (see ESI Supplementary Note 3† for more details), so we focus the discussion here on Sc motifs. Table 4 shows the Euclidean distance and cosine similarity between the average interface Sc fingerprint of the CryinGAN/high energy dataset and the training dataset. The CryinGAN dataset has a lower Euclidean distance and higher cosine similarity than the high energy dataset. These results indicate that the interface Sc atoms in the CryinGAN structures are more similar to the low-interface-energy training structures than to the high-interface-energy structures.
Dataset | Euclidean distance (95% CI) | Cosine similarity (95% CI) |
---|---|---|
CryinGAN | 0.2791 (0.2500 to 0.3077) | 0.9905 (0.9886 to 0.9926)
High energy | 0.7074 (0.6734 to 0.7429) | 0.9316 (0.9250 to 0.9380) |
Fig. 9 shows the distribution of selected Sc coordination motifs, focusing on the motifs that show the largest differences across the datasets. Each motif is further subdivided based on the number of O atoms in the coordination shell of Sc (the distribution of all coordination motifs is provided in ESI Fig. S18b† for reference). Compared to the training dataset, the high energy dataset shows significantly higher percentages of motifs with no O atoms present, and lower percentages of motifs with O atoms. This finding indicates that the lower frequency of bonding between O (in LCO) and Sc leads to weaker interface binding and higher interface energy. In contrast, the CryinGAN dataset shows a higher frequency of Sc–O bonding than the training dataset, with higher percentages of motifs with 2 O atoms (trigonal bipyramidal and pentagonal pyramidal), and lower percentages of motifs with no O atoms (trigonal bipyramidal and octahedral). The average O bond count per Sc atom provided in ESI Fig. S19b† also shows that the high energy structures have fewer Sc–O bonds, whereas the CryinGAN structures have a higher number of Sc–O bonds. This analysis shows that although the Sc coordination environments of the CryinGAN structures differ from the training structures due to a higher frequency of Sc–O bonding, the Sc–O bonds are still a feature of the low-interface-energy structures. We also analyzed the RDFs of the interface Li and Sc atoms. The RDFs confirm the trends revealed by the coordination motif analysis, in which the CryinGAN dataset shows higher similarity to the training dataset than the high energy dataset (see ESI Supplementary Note 4† for more details).
An overall ranking of all generative models based on their general performance on each Dismai-Bench dataset is visualized using a spider chart as shown in Fig. 10. The U-Net diffusion models struggled with most tasks in Dismai-Bench, while performing the best on the disordered interface dataset. Between the two, CrysTens performed marginally better on the disordered interfaces, probably as a result of the inclusion of distance matrices (albeit at a memory cost that scales with O(N²), N being the number of atoms). Also, both UniMat and CrysTens require data augmentation to compensate for their lack of all invariances, further increasing their memory requirements by orders of magnitude. We expect that the reason for the better performance on the disordered interface dataset is that the interface structures have sorted atom orderings that make learning easier for these models with image-/video-like representations. For example, the first 72 atoms are always Li, with the first 9 Li atoms always in the interface region, the next 9 Li atoms in the top layer of LCO, etc. In contrast, the other datasets do not have such well-sorted atom orderings. When CrysTens and UniMat were trained on disordered interface structures with randomized atom orderings, they were unable to generate the interface structures (see ESI Fig. S20†). On the other hand, CryinGAN is permutationally invariant, so it does not rely on the sorted atom orderings to perform well on the interface dataset, and was also able to generate the crystalline alloy structures.
The graph diffusion models demonstrated the best performance overall. Comparing the two models, CDVAE performed better on the amorphous Si and the 300 K alloy datasets, whereas DiffCSP performed better on the 1500 K alloy and interface datasets. CDVAE stands out as the only model that is able to generate structures that are nearly fully relaxed. In use cases where relaxation/post-processing is not feasible/practical (e.g., high-throughput materials discovery), CDVAE may show some advantages here. Although the graph models were successful in performing diffusion on atomic coordinates alone, they suffered when performing joint diffusion with atomic species or lattice parameters. When using atomic species diffusion, CDVAE was not able to generate the compositions correctly even though all Dismai-Bench datasets have fixed compositions. When using lattice diffusion, DiffCSP generated structures of lower quality. Such issues may be less severe when training on datasets with smaller number of atoms or lattice lengths, but future models can still improve upon their joint diffusion performance. To this end, Dismai-Bench provides a means for testing and evaluating different features/architectures (e.g., training DiffCSP with teacher forcing of the lattice parameters; see ESI Fig. S6†).
Despite the superior performance of graph models on Dismai-Bench, there are also reports that less expressive generative models (e.g., UniMat,29 language models31–33) can show comparable or even better performance in discovering novel materials than graph models. These seemingly conflicting observations suggest a need to reconsider material discovery strategies. Materials discovery can be formulated as an exploration–exploitation problem. Graph models are superior at learning structural patterns, and should likely learn and exploit more of the training distribution during generation. On the other hand, less expressive models learn noisier distributions, and can readily explore outside of the training distribution during generation. Considering the relatively small size of training datasets (∼10⁴ structures for the MP-20 dataset26) vs. the material space of possible compounds (∼10¹² possible combinations of quaternary compounds62), it is arguable that exploration is more important than exploitation in materials discovery. However, good performance for materials discovery should not be confused with the ability to learn well from training structures. While less expressive models may do well for the discovery of small structures, they may struggle when the size/complexity of the structures increases, as suggested by this work.
As the number of atoms and/or the structural complexity of training samples increases, the importance of symmetry invariances and the expressive capability of generative models also grows. Graphs serve as one type of invariant representation among alternatives like smooth overlap of atomic positions (SOAP) vectors63 and atom-centered symmetry functions (ACSFs).64 Exploring these invariant representations further is crucial, even if reconstruction back to atomic coordinates is necessary. Importantly, reconstruction does not necessarily pose a fundamental barrier, as demonstrated by Fung et al.,65 who achieved it through gradient-based optimization using automatic differentiation. The Dismai-Bench findings underscore the need for higher expressive power to model more complex structures, despite potential increases in computational and memory requirements. As generative modeling advances towards larger datasets and more complex structures, the development of models capable of parallelization will be a critical research direction.
In this work, we introduced Dismai-Bench as a novel method for assessing generative models and obtaining valuable insights into their performance. The development of generative models, and machine learning models in general, is iterative and often counterintuitive, as evidenced by the development of CryinGAN. Surprising findings include instances where an older generation GAN architecture has outperformed newer diffusion architectures, and that using two separate discriminators has yielded better results than using a single discriminator. Dismai-Bench proves effective in evaluating a generative model's capability to learn intricate structural patterns, particularly those present in disordered materials. Consequently, Dismai-Bench may contribute to improving generative models not only for ordered materials but also for disordered ones. However, Dismai-Bench's scope is limited in that it primarily assesses the accuracy of atomic positions, and does not evaluate a model's proficiency in learning compositions or lattice parameters, because Dismai-Bench operates by fixing the composition and lattice while varying atomic positions. Moreover, generative models incorporating symmetry constraints66,67 may not leverage Dismai-Bench for evaluation unless these constraints can be disabled. Dismai-Bench marks a significant initial stride towards robustly evaluating generative models, and we anticipate that new complementary benchmarks and datasets will emerge in the future.
To generate the Dismai-Bench alloy structures, only the chemical terms of the CE model were used to obtain structures in a non-magnetic state. A 4 × 4 × 4 conventional FCC supercell, containing 256 atoms (154 Fe atoms, 51 Ni atoms, 51 Cr atoms), was used. We performed MC simulations in the canonical ensemble based on the Metropolis–Hastings algorithm.69 Kawasaki dynamics70 for atom swaps was applied to ensure that the composition of the system remained fixed. The alloy structures were sampled at temperatures of 300 K and 1500 K, where 10 independent MC simulations were initialized for each temperature. For every MC simulation, 50000 passes were performed, and a MC snapshot was saved every 10 passes. For each temperature, 1500 structures were assembled (without any restriction on SRO distribution) as the wide SRO dataset. Another 1500 structures with SRO distributions within ±0.1 of the average SRO values were filtered as the narrow SRO dataset. Comparisons of the SRO between the Fe60Ni20Cr20 alloy modelled in Dismai-Bench and the original Fe56Cr21Ni23 alloy modelled in ref. 36 are shown in ESI Fig. S21c.† The Dismai-Bench alloy shows qualitatively similar SRO trends to the original alloy that was modelled with magnetism and a larger set of clusters.
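A minimal sketch of one MC pass is given below; `energy` is assumed to evaluate the chemical cluster-expansion energy (in eV) of a species arrangement on the fixed FCC sites:

```python
import math
import random

K_B = 8.617333e-5  # Boltzmann constant, eV/K

def mc_pass(species, energy, T):
    """One Metropolis pass with Kawasaki (swap) dynamics at temperature T (K)."""
    for _ in range(len(species)):                    # one attempted swap per site
        i, j = random.sample(range(len(species)), 2)
        if species[i] == species[j]:
            continue
        e_old = energy(species)
        species[i], species[j] = species[j], species[i]      # swap conserves composition
        d_e = energy(species) - e_old
        if d_e > 0 and random.random() >= math.exp(-d_e / (K_B * T)):
            species[i], species[j] = species[j], species[i]  # reject: undo the swap
    return species
```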
In a preliminary test calculation, the interface was constructed by simply placing the LCO(110) and LSC(100) slabs in contact with each other, and we obtained a DFT-relaxed structure similar to that depicted in Fig. 1. An interface region with disordered LSC atoms formed with a thickness of around 5 Å. Using this structure as reference, we then generated random interface structures using the CALYPSO package.78–80 We used LCO(110) and LSC(100) slabs with 4 and 7 layers respectively. For each structure, an interface region thickness was randomly chosen between 4 and 6 Å, then the region was randomly populated with 3 formula units of LSC (9 Li, 3 Sc, and 18 Cl atoms). A random lateral displacement (in-plane direction parallel to the interface) was also applied to the LSC slab. A vacuum spacing of 14 Å was included in all the interface structures. Each interface structure has a total of 264 atoms.
We relaxed the randomly generated interface structures with the M3GNet interatomic potential. Only the atom positions were allowed to relax, with the lateral lattice vectors fixed to the optimized values of the LCO slab (the elastic moduli of LCO81 are significantly larger than those of LSC82). Similar to the amorphous Si dataset, the structures were optimized using a conjugate gradient algorithm through the ASE package, and the relaxations were stopped when the force on each atom was below 0.05 eV Å−1. The normalized interface energies, E_int, of the relaxed structures were calculated using the following equation:
$$E_{\mathrm{int}} = \frac{E_{\mathrm{total}} - E_{\mathrm{LCO}} - E_{\mathrm{LSC}} - 3E_{\mathrm{LSC,f.u.}}}{A} \qquad (6)$$

where $E_{\mathrm{total}}$ is the total energy of the interface structure, $E_{\mathrm{LCO}}$ and $E_{\mathrm{LSC}}$ are the reference energies of the isolated LCO(110) and LSC(100) slabs, $E_{\mathrm{LSC,f.u.}}$ is the energy per formula unit of bulk LSC (accounting for the three formula units of LSC added to the interface region), and $A$ is the lateral cross-sectional area of the cell.
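A small numerical sketch of eqn (6) as reconstructed above is given below; the bookkeeping of the reference energies is an assumption, while the unit conversion (1 eV Å−2 = 16.0218 J m−2) is exact:

```python
import numpy as np

EV_PER_A2_TO_J_PER_M2 = 16.0218  # 1 eV/A^2 in J/m^2

def interface_energy(e_total, e_lco_slab, e_lsc_slab, e_lsc_fu, lattice, n_fu=3):
    """Eqn (6); `lattice` is the 3x3 cell matrix with rows 0 and 1 spanning
    the interface plane (reference-energy bookkeeping is an assumption)."""
    area = np.linalg.norm(np.cross(lattice[0], lattice[1]))  # lateral area, A^2
    e_int = (e_total - e_lco_slab - e_lsc_slab - n_fu * e_lsc_fu) / area
    return e_int * EV_PER_A2_TO_J_PER_M2
```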
For CrysTens, we augmented the disordered interface dataset by applying random permutations (shuffling) to the atoms in the interface region. CrysTens was found to perform poorly when data augmentation was performed by permuting all atoms (see ESI Fig. S20†). Each structure in the interface dataset was constructed with the same LSC and LCO slabs, as well as three formula units of LSC randomly generated in the interface region. As a result, the ordering of the atoms in the slabs is consistent across all structures, making it easier for CrysTens to learn these structures. Therefore, we only shuffled the ordering of the atoms in the interface region, where each structure was augmented 49 times. For amorphous Si and the alloys, which have no consistent atom orderings in their datasets, CrysTens was unable to generate meaningful structures even with data augmentation (see Fig. 3 and 4).
We also tested a graph convolutional neural network architecture by adapting CGCNN59 as the discriminator. Instead of predicting material properties, the output of CGCNN was used to estimate the Wasserstein distance between the training and generated structures. Default values for the hyperparameters were used, as provided in the code repository of CGCNN. The GAN was optimized using the Adam optimizer and a batch size of 32, where we tested learning rates of 10⁻², 10⁻³, 10⁻⁴, and 10⁻⁵. We found that the losses diverged and the GAN was unable to generate meaningful structures. We did not further pursue the development of graph convolutional neural networks as discriminators, considering similar failures reported in the literature for point cloud GANs.58
The best GAN architecture was found to be CryinGAN using both discriminators. For the disordered interface, λ = 0.05 was found to be optimal (see ESI Fig. S15†). For the alloys, λ = 0.1 was found to be optimal (see ESI Fig. S25†). For amorphous Si, CryinGAN was unable to generate meaningful structures (see Fig. 3), so it was not benchmarked for amorphous Si. Benchmark models were trained on the Dismai-Bench disordered interface and alloy datasets using λ = 0.05 and 0.1 respectively. For the disordered interface dataset, we did not perform any data augmentation: CryinGAN is permutationally invariant, and the model does not need to learn to generate rotated interface structures. Each structure in the dataset was originally generated with the LCO slab fixed and the LSC slab randomly displaced laterally (parallel to the interface). Training models on a dataset with lateral translation augmentations was found to slow down CryinGAN's learning, since it had to learn to generate structures with displaced LCO (and LSC) slabs. Therefore, we did not apply any translation augmentation either. For the alloy datasets, we performed data augmentation by translating the structures by integer multiples of the unit cell lattice constant (3.6 Å), as sketched below. Each structure is a 4 × 4 × 4 supercell, giving 4³ − 1 = 63 unique translations, so we augmented each structure 63 times. The disordered interface and alloy benchmark models were trained for 100000 and 1500 epochs respectively, and models from the last epoch were used for benchmarking. All models were trained with a batch size of 32.
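The translation augmentation operates on fractional coordinates, so that a lattice translation of a = 3.6 Å corresponds to a shift of 1/4 along each supercell axis; a minimal sketch:

```python
import itertools
import numpy as np

def translation_augment(frac):
    """frac: (N, 3) fractional coordinates of one 4 x 4 x 4 supercell."""
    augmented = []
    for shift in itertools.product(range(4), repeat=3):
        if shift == (0, 0, 0):
            continue                          # skip identity: 4**3 - 1 = 63 copies
        augmented.append((frac + np.array(shift) / 4.0) % 1.0)  # wrap into cell
    return augmented
```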
For each successfully relaxed structure, the CrystalNNFingerprint55 was calculated for the cations (Li, Co, and Sc), allowing only the anions (Cl and O) to be considered as neighbors. We also appended the fraction of Cl and O neighbors in each motif to the fingerprints, so that the fingerprints contain both chemical and coordination information. Average fingerprints were calculated by averaging the site fingerprints of each structure, then averaging across all structures. The Euclidean distances between the average fingerprint of the generated structures and the training structures were calculated.
For each successfully relaxed structure, the CrystalNNFingerprint55 was calculated for all Si atoms. The Euclidean distances between the average fingerprint of the generated structures and the training structures were calculated. The RDF of each structure was calculated using the vasppy85 package. The RDFs were calculated between 0.0 Å and 10.0 Å using a bin width of 0.02 Å. The RDFs were averaged across structures, and the Euclidean distances between the average RDF of the generated structures and the training structures were calculated. The bond angles of each structure were also calculated. The neighbors of each atom were determined using the CrystalNN55 algorithm. The bond angle distribution was calculated by binning the bond angles using a bin width of 0.5°. The bond angle distributions were averaged across structures, and the Euclidean distances between the average bond angle distribution of the generated structures and the training structures were calculated.
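A minimal sketch of the bond angle distribution metric is shown below, using the CrystalNN neighbor finding from pymatgen; the histogram normalization is an assumption, as only relative distributions matter for the Euclidean distance:

```python
import numpy as np
from pymatgen.analysis.local_env import CrystalNN

cnn = CrystalNN()

def bond_angle_distribution(structure, bin_width=0.5):
    angles = []
    for i, site in enumerate(structure):
        nbrs = [nn["site"] for nn in cnn.get_nn_info(structure, i)]
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                v1 = nbrs[a].coords - site.coords
                v2 = nbrs[b].coords - site.coords
                cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
                angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    hist, _ = np.histogram(angles, bins=np.arange(0.0, 180.0 + bin_width, bin_width),
                           density=True)
    return hist

d_angle = np.linalg.norm(
    np.mean([bond_angle_distribution(s) for s in gen_structs], axis=0)
    - np.mean([bond_angle_distribution(s) for s in train_structs], axis=0))
```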
The clusters (1 monomer and 7 dimers) of each structure were counted, where we considered up to the 7th neighbor shell. A fingerprint was constructed for each structure using the vector of conditional probabilities of observing each cluster (e.g., the probability of observing a 1st nearest neighbor Cr–Fe dimer given two nearest neighbor sites). The Euclidean distances between the average cluster fingerprint of the generated structures and the training structures were calculated.
Three datasets were used to perform the structural analysis: (1) CryinGAN-generated structures with low interface energy (≤0 J m−2), (2) randomly generated structures with low interface energy (i.e., the training structures), and (3) randomly generated structures with high interface energy (>0 J m−2). Each dataset contains 1500 structures that were relaxed using M3GNet followed by DFT calculations. The coordination environment of atoms in the interface region was compared between the datasets. The coordination motif fingerprints of Li and Sc atoms in the interface region were calculated for each dataset and averaged. The Euclidean distance and cosine similarity between the average fingerprint of the CryinGAN/high energy dataset and the training dataset were calculated. To visualize the coordination motif distribution of the datasets, the most likely coordination motif of each interface Li/Sc atom was identified by first selecting the coordination number with the highest likelihood, then selecting the coordination motif of this coordination number with the highest local structure order parameter (LoStOP).55 We only plotted coordination motifs that appeared in >1% of the interface Li/Sc atoms in the training structures. 95% bootstrap pivotal confidence intervals86 were calculated using 1000 bootstrap samples with 1500 structures in each sample.
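The pivotal (basic) bootstrap interval takes the form (2θ̂ − q_{1−α/2}, 2θ̂ − q_{α/2}), where q are quantiles of the bootstrap statistics; a minimal sketch, with `data` a list of per-structure fingerprints and `stat` the statistic of interest, is:

```python
import numpy as np

def pivotal_ci(data, stat, n_boot=1000, alpha=0.05, seed=0):
    """95% pivotal bootstrap CI: (2*theta - q_hi, 2*theta - q_lo)."""
    rng = np.random.default_rng(seed)
    theta = stat(data)
    n = len(data)
    boots = np.array([stat([data[k] for k in rng.integers(0, n, n)])
                      for _ in range(n_boot)])   # resample structures with replacement
    q_lo, q_hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return 2 * theta - q_hi, 2 * theta - q_lo
```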
We first performed structural relaxations on unit cells of LCO (space group R3̄m) and LSC (C2/m),41 allowing the cell shapes, cell volumes, and atom positions to relax until the force on each atom was below 0.001 eV Å−1. The Brillouin zone was sampled using a (9 × 9 × 1) gamma-centered k-point grid for LCO, and a (6 × 4 × 6) Monkhorst–Pack k-point grid for LSC. The relaxed unit cells were used to construct the LCO(110) and LSC(100) slabs as described in Section 4.1.3.
To generate the dataset for training the M3GNet interatomic potential, LSC(100)–LCO(110) interfaces were randomly generated as described in Section 4.1.3. A total of 350 structures were generated without mutual exchanges, and 600 structures were generated with mutual exchanges. Random mutual exchanges were performed between atoms in the interface region and the top layer of LCO, where we allowed up to 3 Sc ↔ Co, 3 Li ↔ Co, and 6 Cl ↔ O exchanges per structure. Structural relaxations were performed on all interface structures with the cell shapes and cell volumes fixed. The relaxations were performed in two stages. In the first stage, a kinetic energy cutoff of 374.3 eV was used, and only the atoms in the interface region were allowed to relax for 50 ionic steps. In the second stage, a kinetic energy cutoff of 450 eV was used, and all atoms were allowed to relax until the force on each atom was below 0.1 eV Å−1. The Brillouin zone was sampled at the gamma point only for both stages.
For structures that had been pre-relaxed using M3GNet (after the M3GNet model was trained), DFT relaxations were performed without the first stage relaxation (second stage only). The interface dataset for Dismai-Bench was created by relaxing randomly generated structures with M3GNet only. The interface dataset for developing CryinGAN was created by relaxing randomly generated structures with M3GNet followed by DFT calculations. No mutual ion exchanges were performed when the structures were generated, since the exchanges mostly led to high-interface-energy structures. The initial 950 structures that were relaxed using DFT calculations alone were used only for training the M3GNet model, not for training any generative model.
$$L = \mathrm{MSE}_E + \mathrm{MSE}_F + 0.1\,\mathrm{MSE}_S \qquad (7)$$

where $\mathrm{MSE}_E$, $\mathrm{MSE}_F$, and $\mathrm{MSE}_S$ are the mean squared errors of the predicted energies, forces, and stresses respectively.
We compared interface structures that were pre-relaxed with M3GNet with their final structures after subsequent DFT relaxation. Table 5 shows the errors between the M3GNet- and DFT-calculated total energies. The errors are generally low even when compared to the final DFT-relaxed structures, showing that the M3GNet relaxations are able to give accurate predictions of the energy, and yield structures close to DFT convergence.
Structure type | n steps, mean (std) | |ΔE_M-opt| (meV per atom), mean (std) | |ΔE_D-opt| (meV per atom), mean (std)
---|---|---|---
Low energy | 17.6 (25.7) | 3.14 (3.41) | 2.15 (1.42)
High energy | 56.1 (51.8) | 11.0 (15.3) | 4.75 (4.51)
All | 31.7 (41.8) | 6.03 (10.4) | 3.10 (3.21)
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00100a |