Nirmal Baishnab,
Ethan Herron,
Aditya Balu,
Soumik Sarkar,
Adarsh Krishnamurthy* and
Baskar Ganapathysubramanian*
Iowa State University, Ames, IA, USA. E-mail: adarsh@iastate.edu; baskar@iastate.edu
First published on 9th September 2025
The ability to generate 3D multiphase microstructures on demand with targeted attributes can greatly accelerate the design of advanced materials. Here, we present a conditional latent diffusion model (LDM) framework that rapidly synthesizes high-fidelity 3D multiphase microstructures tailored to user specifications. Using this approach, we generate diverse two-phase and three-phase microstructures at high resolution (volumes of 128 × 128 × 64 voxels, representing >10⁶ voxels each) within seconds, overcoming the scalability and time limitations of traditional simulation-based methods. Key design features, such as desired volume fractions and tortuosities, are incorporated as controllable inputs to guide the generative process, ensuring that the output structures meet prescribed statistical and topological targets. Moreover, the framework predicts corresponding manufacturing (processing) parameters for each generated microstructure, helping to bridge the gap between digital microstructure design and experimental fabrication. While demonstrated on organic photovoltaic (OPV) active-layer morphologies, the flexible architecture of our approach makes it readily adaptable to other material systems and microstructure datasets. By combining computational efficiency, adaptability, and experimental relevance, this framework addresses major limitations of existing methods and offers a powerful tool for accelerated materials discovery.
Various approaches have been explored for microstructure generation.9,10 Classical statistical methods, such as Markov random fields,11 Gaussian random fields,12 and descriptor-based reconstructions,13,14 can produce microstructures that match certain target statistics. While these methods have proven useful, they suffer from important limitations. In general, statistical models are computationally intensive and do not scale well to generating large 3D volumes or numerous samples. They often rely on strict assumptions (e.g. stationarity or isotropy of features) and tailored mathematical descriptors, which limits their flexibility and generalizability to different materials or complex structures. Adapting such models to incorporate new microstructural constraints or application-specific objectives is non-trivial and typically requires substantial rederivation or optimization changes. These challenges highlight the need for a more flexible, data-driven generative framework for microstructures.
Recently, deep generative models have shown great promise in capturing complex microstructural features from data.15,16 Approaches like variational autoencoders (VAEs),17 generative adversarial networks (GANs),18 and diffusion models (DMs)19 have been applied to microstructure generation tasks. VAEs can learn low-dimensional representations of microstructures but often produce blurry outputs that lack sharp detail.20 GAN-based models have succeeded in generating 3D microstructures with improved visual fidelity,21–23 but they do not allow user control over generated structures and are notorious for training instabilities.24 Moreover, GANs and similar networks can be computationally demanding for 3D data, sometimes requiring extensive resources for training and generation. Diffusion models offer even higher output quality, often surpassing GANs, but their iterative sampling process makes inference slow and resource-intensive.25 At this time, no prior generative approach has simultaneously provided high fidelity, user controllability, and computational efficiency for 3D microstructure generation.
Latent diffusion models (LDMs) have emerged as a compelling solution to address these gaps.26,27 LDMs combine the strengths of VAEs and DMs by operating in a compressed latent space to dramatically reduce computational costs while preserving the ability to generate high-quality, diverse microstructures. This latent-space approach yields orders-of-magnitude speed-ups over conventional pixel-space diffusion models. Importantly, LDM architectures naturally support conditioning mechanisms that enable users to steer generation towards desired attributes. They also exhibit more stable training dynamics and avoid mode collapse, yielding a broader variety of outputs compared to GANs.28–31 These advantages make LDMs well-suited for fast and controllable 3D microstructure synthesis.
To date, applying diffusion-based generative models to microstructure design has predominantly focused on unconditional generation.32–34 In our prior work, Herron et al.35 applied a diffusion model to 2D organic solar cell microstructures without enabling user-specified target features. While recent advances36,37 have begun exploring conditional generative approaches to microstructure reconstruction and design, these have typically not integrated predictions of corresponding manufacturing parameters. Our current work introduces a conditional latent diffusion modeling (LDM) framework that not only allows user-defined control over critical microstructural descriptors but also uniquely predicts manufacturing parameters likely to produce such microstructures experimentally. This two-fold capability addresses key challenges in computational materials design:38,39 not only can we generate microstructures with tailored properties, but we can also provide insight into how to manufacture them – thereby tackling the oft-cited “manufacturability gap” in microstructure design.
We demonstrate the framework using organic photovoltaic (OPV) active-layer microstructures as a representative example. OPV active layers typically consist of a donor material and an acceptor material, forming a complex two-phase (or three-phase with a mixed phase) morphology.40 Two microstructural descriptors are particularly crucial for OPV performance: the donor (acceptor) phase volume fraction and the tortuosity of the percolating pathways.41,42 The volume fraction (the ratio of donor to acceptor material in the blend) directly influences the balance between charge generation and transport, while tortuosity reflects the complexity of pathways that charge carriers must navigate to reach the electrodes. By conditioning on these properties in the LDM, we can generate microstructures that meet specific targets (e.g. a desired donor volume fraction and phase connectivity) known to optimize OPV efficiency. We quantify volume fraction and tortuosity for each generated sample using established computational techniques.43
The key contributions of this work include: (1) scalable high-resolution 3D microstructure generation: leveraging an LDM, we rapidly produce diverse multiphase 3D microstructures (including two-phase and three-phase examples) at a resolution of 128 × 128 × 64 voxels (over one million voxels each), which is orders of magnitude larger than those demonstrated in prior studies. Our approach generates these 3D microstructures in seconds per sample. (2) Conditional generation with user-defined features: our framework introduces controllability to microstructure synthesis by allowing users to specify target volume fractions and tortuosities; the LDM then generates microstructures that faithfully realize these input parameters, ensuring the output matches desired structural characteristics. (3) Linking microstructure to manufacturing: we integrate a predictive module that outputs relevant processing parameters (e.g. annealing or fabrication conditions) corresponding to each generated microstructure, facilitating a direct connection between the digital microstructure design and its experimental realization. These advances collectively overcome the scalability, controllability, and manufacturability limitations of existing methods. By enabling fast generation of application-specific microstructures along with guidance for their fabrication, our conditional LDM framework illustrates the promise of AI-driven approaches in computational materials science and microstructure design.
Our proposed generative modeling framework, schematically illustrated in Fig. 10, consists of three sequentially trained modules: a Variational Autoencoder (VAE), a Feature Predictor (FP), and the Latent Diffusion Model (LDM). Initially, the VAE compresses complex, high-dimensional 3D microstructures into compact latent representations, drastically reducing computational complexity. The FP network subsequently predicts relevant microstructural features (e.g., volume fractions and tortuosities) and manufacturing parameters directly from these latent representations. Finally, the conditional LDM leverages these predictions to generate realistic 3D microstructures, guided explicitly by user-specified conditions.
In the following sub-sections, we detail our evaluation of the framework's generative capabilities, including the quality and diversity of generated microstructures, the effectiveness of conditional sampling for targeted microstructure design, and the model's unique capacity to predict experimental manufacturing parameters.
Each generated microstructure spans a volume of 128 × 128 × 64 voxels, corresponding to over one million voxels (1 048 576), allowing detailed resolution of intricate morphological features. Importantly, our LDM framework achieves this generation within approximately 0.5 seconds per microstructure using an NVIDIA A100 GPU, significantly outperforming traditional physics-based simulation methods, which typically require hours or days of computation for similar-sized volumes.44–46 The transition from two-phase to three-phase systems maintains high quality and fidelity, demonstrating the flexibility and scalability of our framework. Without any modification to the core architecture, retraining on a three-phase dataset successfully generated microstructures exhibiting smaller domains and more complex, finely detailed features. This ease of adaptability underscores the potential for further extension of our approach to accommodate additional phases.
The LDM is conditioned on two crucial microstructural descriptors relevant to organic photovoltaics: the volume fractions and tortuosities of the phases (A, B, and the mixed phase). However, our flexible conditioning framework is easily extensible to other relevant morphological descriptors, depending on the application requirements (see additional examples provided in the SI Results). Fig. 2 illustrates representative examples of conditionally generated microstructures, clearly demonstrating the effectiveness of the model in synthesizing morphologies tailored to user-specified volume fractions and tortuosities.
To evaluate the model's ability to generate conditional outputs, we created 3200 microstructures with different targeted volume fractions and tortuosity values. We systematically compared these microstructure attributes with the user-specified conditioning parameters, as depicted in Fig. 3. Our analysis reveals a high degree of accuracy in conditional generation, achieving Pearson correlation coefficients (R²) of 0.93 or greater. This robust correlation underscores the LDM's effectiveness in adhering to precise user-defined constraints, thereby enabling targeted material design and optimization that surpasses prior methods in versatility and computational efficiency.22,23
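As a rough illustration of how such an agreement score can be computed, the sketch below evaluates the Pearson correlation between target and measured feature values and squares it. The function name `pearson_r` and the five data pairs are hypothetical and purely illustrative; they are not taken from the paper's results or code.

```python
import math

def pearson_r(targets, measured):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(targets)
    if n != len(measured) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_t = sum(targets) / n
    mean_m = sum(measured) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in zip(targets, measured))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_t == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_t * var_m)

# Hypothetical target vs. measured volume fractions (illustrative numbers only).
target_vf = [0.30, 0.40, 0.50, 0.60, 0.70]
measured_vf = [0.32, 0.39, 0.52, 0.58, 0.71]
r = pearson_r(target_vf, measured_vf)
r_squared = r ** 2
```

In practice one such score would be computed per conditioning feature, comparing the value requested by the user with the value measured on the generated microstructure.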
As with most data-driven generative frameworks, the proposed LDM model learns and reproduces the joint distribution of microstructural features in the training data. In physics-based datasets such as our CH dataset, certain features (e.g., volume fraction and tortuosity) naturally exhibit correlations due to underlying physical constraints. Consequently, the generative model tends to reflect these correlations and may struggle to generate feature combinations that are poorly represented or absent in the training dataset. However, the framework remains flexible and, in principle, capable of learning a broader range of feature combinations if provided with sufficiently diverse and decorrelated training data. As the diversity and coverage of the training dataset increase, the model's ability to generate microstructures with uncommon or more complex feature relationships is expected to improve accordingly.
We conducted experiments using more conditioning parameters to assess the framework's capacity for higher-dimensional conditioning. Appendix Fig. 16 presents the results with seven conditioning parameters. Increasing the number of conditioning parameters introduces two challenges. First, the model must learn more complex and potentially correlated feature relationships. Second, as the dimensionality grows, the volume of the conditioning space expands rapidly, resulting in a sparser sampling of the parameter space, which in turn demands a larger and more diverse training dataset. Fig. 16 shows a decline in R² between the input features and the measured features of the generated microstructures, yet the model still maintains strong correlations. If the conditioning features are not fully independent but exhibit correlations, care must be taken to ensure that valid and physically meaningful combinations are used during inference.
In such scenarios, dimensionality reduction techniques (e.g., principal component analysis or other embedding methods) may be employed to reduce the effective dimensionality of the conditioning space prior to model training.
Alternative conditional generative approaches for microstructure design have recently been reported. For example, Gao et al.36 introduced a deep learning framework for multi-scale prediction of mechanical properties from microstructural features in polycrystalline materials, while Lee and Yun37 developed a denoising diffusion-based method for generating three-dimensional anisotropic microstructures from two-dimensional micrographs. While these works incorporate conditional elements, they do not provide the combined capability of user-defined control over specific microstructural descriptors and simultaneous prediction of manufacturing parameters. Our conditional latent diffusion framework thus addresses a different design space—high-resolution descriptor-controlled generation.
Moreover, Fig. 4b presents contour plots predicting the manufacturing parameters—the blend ratio, the interaction parameter (χ), and the annealing time (timesteps)—required for realizing these microstructures. Notably, the LDM framework identifies multiple feasible fabrication pathways: a combination of higher χ values with shorter annealing durations, or lower χ values with extended annealing periods. This data-driven insight aligns well with the known physical behavior of phase-separating systems described by the Cahn–Hilliard model, where increased interaction parameters accelerate phase separation, thereby requiring less annealing time, whereas lower interaction parameters necessitate longer annealing to achieve comparable morphologies. This pathway prediction capability illustrates the integration of computational design with experimental manufacturability, thus significantly advancing current microstructure design methodologies.22,23 Such an approach could be expanded to include other manufacturing parameters, making the model applicable across various material systems and manufacturing processes.22,23
Although the training dataset includes time-dependent snapshots generated from Cahn–Hilliard simulations, the generative model itself operates solely on static 3D microstructures paired with their corresponding morphological descriptors. The time-dependent simulations are used primarily to provide a diverse and physically meaningful training set across a range of morphologies. The generative model remains agnostic to the physical dynamics or governing equations responsible for generating the dataset. This formulation enables flexible, descriptor-driven microstructure generation. Future work could explore the incorporation of additional physics-based constraints, such as mass conservation or dynamic evolution, for applications requiring dynamic modeling.
Using the experimental dataset, we generated 1000 microstructures conditioned on user-specified inputs. Fig. 5 shows the correlation between the specified inputs and the corresponding measured features, with Pearson R² values of 0.89 for volume fraction, 0.86 for acceptor tortuosity, and 0.77 for donor tortuosity. These values are somewhat lower than those observed for the CH dataset; however, they remain reasonably strong given the characteristics of the experimental data. First, the experimental microstructures are lower in resolution but contain finer-scale features, which limits the ability of the latent diffusion model to capture the details. Second, the experimental dataset is less diverse: the subvolumes are extracted from only two larger tomographic samples, with overlapping subregions, resulting in a narrower sampling of the feature space. These factors inherently constrain the achievable correlation between target features and generated structures. Nevertheless, the model captures the volume fraction with higher accuracy, as it is a simpler global descriptor. In contrast, tortuosity, a more localized and structurally complex feature, potentially requires better resolution and poses greater modeling challenges.
Additionally, Fig. 6a presents six representative microstructures generated from identical conditioning inputs (volume fraction: 0.5; donor and acceptor tortuosities: 0.2 each), illustrating notable morphological diversity. The kernel density estimation (KDE) plots shown in Fig. 6b confirm that the generated feature distributions are closely centered around the specified target values, with standard deviations of 0.02 or less, highlighting the precision and robustness of the conditional LDM in practical, experimental contexts.
Model component | CH dataset | Experimental dataset |
---|---|---|
Input size | 128 × 128 × 64 | 64 × 64 × 64 |
Latent dimension | 4 × 8 × 8 × 4 | 1 × 8 × 8 × 8 |
Conditional parameters | 4 | 3 |
Manufacturing parameters | 3 | 0 |
VAE size (MB) | 178.97 | 178.96 |
VAE parameters | ~46 million | ~46 million |
DDPM size (MB) | 575.55 | 575.23 |
DDPM parameters | ~150 million | ~150 million |
Feature predictor size (MB) | 2.51 | 0.63 |
Feature predictor parameters | ~657 thousand | ~164 thousand |
Fig. 7 presents a breakdown of inference performance for both datasets across varying batch sizes. For each dataset, we report the total inference time, along with a decomposition into denoising and decoding times. The model demonstrates parallel scalability up to a batch size of 32, beyond which the time per sample plateaus at approximately 0.5 s for the CH dataset and 0.8 s for the experimental dataset. Although the experimental dataset has a smaller total latent size, its latent representation has fewer channels and larger spatial dimensions per channel, which leads to less efficient parallelization. In contrast, the CH dataset, with more channels, better utilizes GPU parallelism at the kernel level. Across all configurations, denoising remains the dominant computational cost, while decoding contributes minimally. For example, at a batch size of 32, denoising takes over 200 times longer than decoding for both datasets. This behavior is consistent with diffusion models, where the denoising process involves iterative sampling—in our case, 1000 iterations per sample.
From these microstructures, we performed thresholding to obtain two-phase and three-phase representations. In the two-phase case, for example, voxels with values below 0.5 were assigned to one phase (0), while those above 0.5 were assigned to the other phase (1). The three-phase microstructures were generated by applying multi-level thresholding to the simulated continuous microstructure fields. Two threshold values were selected to partition the field into three distinct regions (donor, acceptor, and interface), each corresponding to one of the phases. There were no fixed threshold levels; the levels were adjusted based on the original microstructure to ensure that the interface did not become too thick compared to the donor and acceptor phases. Based on the thresholded microstructures, morphological descriptors such as volume fractions and tortuosities were calculated. The volume fraction was computed as the ratio of the number of voxels belonging to a given phase to the total number of voxels in the microstructure. In the context of OPV, tortuosity is quantified as the fraction of phase-connected voxels exhibiting straight rising paths (i.e., with a tortuosity of 1) to their respective electrodes.43,49 Specifically, donor tortuosity refers to the fraction of black voxels (donor phase) that are connected to the anode (top electrode or top edge of the microstructure) via straight rising paths, while acceptor tortuosity refers to the fraction of white voxels (acceptor phase) connected to the cathode (bottom electrode or bottom edge). Tortuosity is a critical microstructural descriptor in OPV because it captures the efficiency of charge transport pathways within the active layer. We used the graph-based tool GraSPI50 to compute these descriptors. GraSPI provides a suite of microstructural descriptors that are particularly relevant to the analysis and performance evaluation of organic solar cells.
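The descriptor definitions above can be illustrated with a small pure-Python sketch. It is a simplified stand-in, not GraSPI: a voxel is counted as having a straight rising path only if every voxel between it and the electrode in the same column belongs to the same phase. The function names, the `[z][y][x]` nested-list layout with `z = 0` as the top electrode, and the assignment of values exactly equal to the threshold to phase 0 are all assumptions made for this example.

```python
def threshold_two_phase(field, level=0.5):
    """Binarize a continuous field: values above `level` become 1, all others 0."""
    return [[[1 if v > level else 0 for v in row] for row in plane] for plane in field]

def volume_fraction(phases, phase):
    """Fraction of all voxels that belong to `phase`."""
    voxels = [v for plane in phases for row in plane for v in row]
    return sum(1 for v in voxels if v == phase) / len(voxels)

def straight_path_fraction(phases, phase, electrode="top"):
    """Fraction of `phase` voxels joined to the electrode by an unbroken vertical run."""
    nz, ny, nx = len(phases), len(phases[0]), len(phases[0][0])
    # Walk each column starting from the chosen electrode.
    z_order = range(nz) if electrode == "top" else range(nz - 1, -1, -1)
    total = connected = 0
    for y in range(ny):
        for x in range(nx):
            run_intact = True  # stays True while every voxel so far is `phase`
            for z in z_order:
                if phases[z][y][x] == phase:
                    total += 1
                    if run_intact:
                        connected += 1
                else:
                    run_intact = False
    return connected / total if total else 0.0

# Tiny illustrative field: 3 layers (z), 1 row (y), 2 columns (x).
field = [[[0.9, 0.1]], [[0.8, 0.7]], [[0.2, 0.6]]]
phases = threshold_two_phase(field)
vf_phase1 = volume_fraction(phases, 1)
top_fraction_phase1 = straight_path_fraction(phases, 1, electrode="top")
```

A graph-based tool such as GraSPI works from shortest paths through the connected phase, so its values can differ from this column-wise approximation on real morphologies.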
Fig. 8 shows snapshots from a single time series within the training dataset. The snapshots represent the temporal evolution of phase separation during the 3D simulation of the Cahn–Hilliard equation, illustrating the dynamic changes in microstructures over time. The Cahn–Hilliard model accounts for both thermodynamic forces and kinetic processes driving phase separation, providing insights into how processing conditions, such as annealing, influence the final morphology of the active layer. This understanding can aid the optimization of material processing to improve organic solar cell (OSC) performance.51,52
Fig. 8 A sequence of 10 snapshots from one time series out of 67 in the entire dataset, illustrating the evolution of phase separation in a 3D simulation of the Cahn–Hilliard equation.
In addition to the computational dataset, we also utilized voxelized experimental OPV morphologies from spin-cast P3HT:PCBM thin films fabricated using two different solvents: chlorobenzene (CB) and dichlorobenzene (DCB). These morphologies were fabricated and reconstructed using tomographic energy-filtered TEM (see Heiber et al.47 and Herzing et al.48 for details). The imaging volume had approximate dimensions of 1 μm × 1 μm × 100 nm, with the EF-TEM-based reconstruction achieving a voxel resolution of approximately 2.12 nm. The CB morphology is depicted in Fig. 9, where blue domains represent the electron-donating (donor) materials and red domains indicate the electron-accepting (acceptor) materials. The voxelized resolutions of the CB and DCB morphologies are 466 × 465 × 50 and 478 × 463 × 60, respectively. To generate a uniform dataset, we extracted cubic subvolumes spanning the full z-axis of each morphology and resized them to 64 × 64 × 64 using nearest-neighbor interpolation. In the x and y directions, we used a step size of 4 voxels, resulting in over 10 500 cubic subvolumes of size 64 × 64 × 64 from each of the two main morphologies. This process yielded a total of over 21 000 3D microstructures of size 64 × 64 × 64. A similar subvolume extraction and segmentation strategy has been used in related 3D microstructure studies,53 where high-resolution FIB-SEM images were segmented into voxelized phase maps for downstream model training. Similar to the synthetic dataset, this dataset was also divided into training and validation sets using an 80–20% split.
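The subvolume counts quoted above can be sanity-checked with a short calculation. The sketch assumes that each cube's side equals the z-depth of the morphology (since the cubes span the full z-axis), that the sliding window starts at the origin, and that only windows lying entirely inside the volume are kept. The helper name `count_subvolumes` is hypothetical.

```python
def count_subvolumes(nx, ny, nz, step=4):
    """Count nz-sided cubic windows in an (nx, ny, nz) volume, sliding by `step` in x and y."""
    side = nz  # each cube spans the full z-axis
    positions_x = (nx - side) // step + 1
    positions_y = (ny - side) // step + 1
    return positions_x * positions_y

cb = count_subvolumes(466, 465, 50)   # CB morphology  -> 10920
dcb = count_subvolumes(478, 463, 60)  # DCB morphology -> 10605
total = cb + dcb                      # -> 21525
```

Under these assumptions both morphologies give more than 10 500 windows and the total exceeds 21 000, which is consistent with the figures stated in the text.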
Fig. 10 Overview of the proposed LDM-based framework's three-step training process: VAE training and latent representation dataset creation, training of the FP, training of DM in the latent space.
Additionally, the scalability of LDMs enables them to manage larger datasets and more complex microstructures without a proportional increase in resource consumption, unlike traditional DMs. This combination of factors renders LDMs a more efficient and practical choice for generating detailed 3D microstructures in a resource-conscious manner. Our LDM framework comprises three components: a VAE, a Feature Predictor (FP), and a DM, which are trained sequentially. The encoder and decoder of the VAE are trained simultaneously to obtain the latent space from which the FP is trained. Once the VAE and FP are trained, we train the DM using the latent space and the predicted features.
To generate a sample z from the latent space, the VAE uses a random sample ε drawn from a standard normal distribution:
$$\varepsilon \sim \mathcal{N}(0, I) \tag{1}$$
$$z = \mu + \sigma \odot \varepsilon \tag{2}$$
The decoder maps the latent representation z back to the input space:
$$\hat{x} = \mathrm{Decoder}(z) \tag{3}$$
The loss function in VAEs consists of two terms, the reconstruction loss and the KL divergence:
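$$\mathcal{L}_{\mathrm{VAE}} = \mathcal{L}_{\mathrm{recon}}(x, \hat{x}) + D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big) \tag{4}$$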
This function balances the accuracy of reconstruction with the regularization of the latent space.
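A minimal sketch of the sampling step and the two-term loss follows, written in plain Python on flat lists for readability. It assumes a diagonal Gaussian posterior parameterized by a mean and log-variance, a standard normal prior, and a mean squared error reconstruction term; the function names and the `kl_weight` argument are illustrative and are not taken from the authors' code.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent elements."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def vae_loss(x, x_hat, mu, logvar, kl_weight=1.0):
    """Mean squared reconstruction error plus a weighted KL regularizer."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + kl_weight * kl_to_standard_normal(mu, logvar)

rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, -2.0], rng)  # one latent sample
```

The `kl_weight` argument controls the balance between reconstruction accuracy and latent-space regularization; a small value favors sharp reconstructions.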
The VAE is the entry point for our architecture. The VAE employed in this work consists of an encoder-decoder structure with residual blocks for feature extraction and reconstruction. The encoder comprises five 3D convolutional layers, each followed by Instance Normalization and a residual block to capture spatial dependencies in the input data. The latent space is parameterized by a mean (‘mu’) and log-variance (‘logvar’), both of which are obtained through additional 3D convolutional layers. The decoder mirrors the encoder's structure, using transposed convolutions to upsample the latent space back to the original input dimensions with residual blocks and Instance Normalization for stable training. A final Sigmoid activation is applied to the output to generate the reconstructed data. Once the VAE is trained, we use its encoder to compress microstructures with over a million voxels into a compact encoded representation of size 1024 (4 × 8 × 8 × 4), while for experimental VAE inputs of 64 × 64 × 64 (over 262 K voxels), the output is further reduced to 512 (1 × 8 × 8 × 8). This reduced-dimensional latent space, distinguished by its efficiently learned data distribution, facilitates more efficient and stable diffusion processes.
![]() | (5) |
![]() | (6) |
The neural network's primary role in a DM is to learn the inverse of the noise addition process. By systematically removing the noise added during the forward diffusion process, the network reconstructs the original data from its noisier versions. This process enables the generation of new, high-quality samples from completely random Gaussian noise. More concretely, once the DM has been trained, we can generate a new latent sample by starting with random Gaussian noise and iteratively applying the learned backward process $p_\theta(x_{t-1} \mid x_t)$. Specifically, we compute $x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z$, where $z \sim \mathcal{N}(0, I)$, for t = T to 1, yielding a new sample $x_0$.
Incorporating a conditional vector extends the DM architecture so that generation can be steered towards desired attributes. We embed the conditional vector c within both the embedding and decoder layers of the U-Net used in the diffusion process, so that the generative model Gθ receives c as an additional input to its noise prediction and denoising functions. The conditioning thereby guides each denoising step, and the generated samples align closely with the conditions encoded in c.
Our LDM model operates under a linear beta schedule, which dictates the noise addition and removal process across the diffusion stages. This schedule is precomputed and stored as buffers, allowing for consistent noise manipulation during both training and sampling phases. For this work we use a starting beta value of 1 × 10⁻⁴ and a final beta value of 0.02. The diffusion process involves progressively adding noise to the latent features and then denoising them through a series of timesteps to generate the final microstructure.
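The schedule described above can be sketched in a few lines of plain Python. The example assumes the standard DDPM forward process, in which a noisy latent at step t is drawn directly from the clean latent using the cumulative product of (1 − β); the 1000-step default matches the number of timesteps reported for training, while the function names and the toy latent are illustrative.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced betas from beta_start to beta_end (inclusive)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I); also return the noise."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise

betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
# Toy 4-element latent; a real latent here would hold 1024 or 512 values.
xt, noise = add_noise([0.5, -0.5, 1.0, 0.0], t=999, alpha_bars=alpha_bars, rng=random.Random(0))
```

Precomputing `betas` and `alpha_bars` once mirrors the idea of storing the schedule as buffers: the same values are then reused in every training step and every sampling step.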
To guide the diffusion process, the model employs two key embedding networks:
• Time embedding: this network converts the current timestep into an embedding, providing temporal guidance during the denoising phase.
• Context embedding: the context embedding network incorporates manufacturing features that condition the generation process, ensuring that the generated microstructures adhere to specific manufacturing parameters.
During the forward pass, the input 3D data is first encoded through the VAE to extract latent features. These features are then processed by a feature predictor model to obtain context features, specifically the initial four manufacturing features (e.g., two volume fractions and two tortuosities). These latent features are progressively diffused using the predefined beta schedule, with the U-Net model performing denoising at each timestep. The denoising process is informed by both time and context embeddings, enabling precise reconstruction of the microstructure. For new sample generation, the diffusion process is reversed, starting from pure noise and progressively refining the latent space into a structured representation conditioned on the context features.
The inference process begins with the pre-trained weights of the LDM, VAE decoder, and feature predictor. The VAE encoder is not required during inference. The process involves user input and random noise sampled in the latent space. The random noise is iteratively refined by the LDM, conditioned on the user inputs. After 1000 iterations, the denoised latent representation of the microstructure is obtained. This step is the most time-consuming during inference. However, despite the many iterations, the process remains highly efficient because the denoising occurs in latent space, which has roughly 1000 times fewer dimensions than the voxel space (1024 latent values versus 1 048 576 voxels for the CH dataset). The inference pipeline is demonstrated in Fig. 11. Once the denoised latent representation of the microstructure is obtained, it is passed through both the feature predictor and the VAE decoder. The feature predictor provides the manufacturing conditions, while the VAE decoder generates the final conditioned microstructure. Using NVIDIA A100 80 GB GPUs, it takes approximately 2 seconds to generate and save a single microstructure, including export to storage.
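The inference steps above can be outlined as follows. This is a sketch under stated assumptions rather than the authors' implementation: it uses the standard DDPM ancestral sampling update with the variance choice σ_t² = β_t, and the three callables (`eps_model`, `decoder`, `feature_predictor`) are hypothetical stand-ins for the trained U-Net, VAE decoder, and feature predictor.

```python
import math
import random

def sample_latent(eps_model, condition, latent_size, betas, rng):
    """DDPM ancestral sampling: start from Gaussian noise and denoise for len(betas) steps."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    x = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]
    for t in reversed(range(len(betas))):
        eps = eps_model(x, t, condition)  # predicted noise, conditioned on user inputs
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(1.0 - betas[t]) for xi, ei in zip(x, eps)]
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise is added at the final step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

def generate(eps_model, decoder, feature_predictor, condition, latent_size, betas, seed=0):
    """Denoise a latent, then decode it and predict its manufacturing parameters."""
    latent = sample_latent(eps_model, condition, latent_size, betas, random.Random(seed))
    return decoder(latent), feature_predictor(latent)

# Stand-in components so the sketch runs end to end (not trained models).
def toy_eps_model(x, t, condition):
    return [0.0 for _ in x]

def toy_decoder(latent):
    return [1 if v > 0 else 0 for v in latent]

def toy_feature_predictor(latent):
    return {"latent_mean": sum(latent) / len(latent)}

betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
condition = {"volume_fraction": 0.5}  # hypothetical conditioning input
microstructure, params = generate(toy_eps_model, toy_decoder, toy_feature_predictor,
                                  condition, latent_size=8, betas=betas)
```

Note that the encoder never appears: generation starts from noise in the latent space, and only the denoiser, decoder, and feature predictor are needed.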
The dataset consists of three microstructure types:
• Experimental microstructures (64 × 64 × 64): voxelized from real-world samples for modeling.
• 2-phase Cahn–Hilliard microstructures (128 × 128 × 64): thresholded from Cahn–Hilliard simulations.
• 3-phase Cahn–Hilliard microstructures (128 × 128 × 64): thresholded from Cahn–Hilliard simulations.
For each dataset type, pretrained generative model weights are provided:
• Denoising diffusion probabilistic model (DDPM)
In addition to the full datasets, smaller sample subsets are provided for testing and demonstration purposes.
All datasets and pretrained weights have been permanently archived on Zenodo: https://doi.org/10.5281/zenodo.17010419. The complete codebase has also been archived on Zenodo: https://doi.org/10.5281/zenodo.17029570.
For convenience, the dataset is additionally available on Hugging Face: https://huggingface.co/datasets/BGLab/microgen3D, and the latest development version of the code is available on GitHub: https://github.com/baskargroup/MicroGen3D.
Supplementary information: the training loss curves, additional inference examples, the feature distribution of the training data, and results from extended conditioning experiments. See DOI: https://doi.org/10.1039/d5dd00159e.
Fig. 12 Training log of the LDM. (a) Epoch progression over wall-clock training time. (b) Training and validation loss curves for all three components of the LDM framework.
The loss function for the VAE combined a Mean Squared Error (MSE) reconstruction loss and a Kullback–Leibler Divergence (KLD) loss,59 with a weight of 1 × 10⁻⁶ on the KLD term for regularizing the latent space. The goal was to keep both the KLD and reconstruction losses in the same order of magnitude. The feature predictor was trained using an MSE loss function to assess the accuracy of predictions by measuring the difference between predicted and actual feature values. The encoder of the pretrained VAE was kept frozen during the feature predictor training phase. For both the VAE and the feature predictor, the initial learning rate was set to 5 × 10⁻⁵, with a minimum of 5 × 10⁻⁷.
For the LDM, the diffusion process was divided into 1000 timesteps. The training objective was to minimize the MSE between the predicted noise and the actual noise added during the diffusion process. The initial and minimum learning rates were 1 × 10⁻⁶ and 1 × 10⁻⁷, respectively. The learning rate was selected based on the pioneering work in ref. 26, which demonstrated the effectiveness of using this order of magnitude in similar architectures. Both the VAE and the feature predictor were kept frozen during LDM training.
The training process for all models was conducted in a GPU-enabled environment, using an NVIDIA A100 GPU with 80 GB of memory. The entire framework was implemented in PyTorch and managed by PyTorch Lightning, which handled the training loop, logging, and checkpointing. Checkpoints were automatically saved based on the validation loss, ensuring that only the best-performing models were retained. Throughout the training, real-time progress and performance metrics were continuously logged using the WandB58 logger, providing detailed experiment tracking and facilitating reproducibility and scalability.
Fig. 14 shows some additional examples of three-phase microstructures generated by the conditional LDM, demonstrating sharper features and controlled generation, including cases with predominant phase B and specified volume fractions/tortuosities.
This journal is © The Royal Society of Chemistry 2025