Open Access Article

Daniel Gleaves, Nihang Fu, Edirisuriya M. Dilanga Siriwardane, Yong Zhao and Jianjun Hu*

Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29201, USA. E-mail: jianjunh@cse.sc.edu
First published on 27th January 2023
Data-driven generative deep learning models have recently emerged as one of the most promising approaches for new materials discovery. While generator models can generate millions of candidates, it is critical to train fast and accurate machine learning models to filter out stable, synthesizable materials with the desired properties. However, such efforts to build supervised regression or classification screening models have been severely hindered by the lack of unstable or unsynthesizable samples, which usually are not collected and deposited in materials databases such as ICSD and Materials Project (MP). At the same time, there is a significant amount of unlabeled data available in these databases. Here we propose a semi-supervised deep neural network (TSDNN) model for high-performance formation energy and synthesizability prediction, which is achieved via its unique teacher-student dual network architecture and its effective exploitation of the large amount of unlabeled data. For formation energy based stability screening, our semi-supervised classifier achieves an absolute 10.3% accuracy improvement compared to the baseline CGCNN regression model. For synthesizability prediction, our model significantly increases the baseline PU learning's true positive rate from 87.9% to 92.9% using 1/49 of the model parameters. To further prove the effectiveness of our models, we combined our TSDNN-energy and TSDNN-synthesizability models with our CubicGAN generator to discover novel stable cubic structures. Out of the 1000 candidate samples recommended by our models, 512 have negative formation energies as validated by our DFT formation energy calculations. Our experimental results show that our semi-supervised deep neural networks can significantly improve the screening accuracy in large-scale generative materials design. Our source code can be accessed at https://github.com/usccolumbia/tsdnn.
000 crystal materials compared to the almost infinite chemical space. To search for novel materials in uncharted chemical space, it is important to develop the capability to screen stable and synthesizable hypothetical materials12,13 out of the candidates generated by generative models or CSP algorithms and then apply high-performance materials property prediction models to find the desired candidates.14,15
Given a material's structure, its structural stability can be estimated by calculating its formation energy using first-principles computations such as density functional theory (DFT), and the phase stability of a structure can be quantified by the energy above the hull (Ehull).16 However, DFT based calculation of formation energy or Ehull is too computationally expensive, which has led to a large number of machine learning models for formation energy/enthalpy prediction17,18 based on composition alone18–25 or together with structures.15,21,26–28 However, despite the development of more than a dozen formation energy/enthalpy prediction models, they all suffer from a neglected strong bias in the training data: most of the training samples from the repositories of known materials are stable structures with negative formation energy. For example, out of the 138 613 samples in the Materials Project database, only 11 340 samples have positive formation energy. This makes it difficult to train good supervised classification or regression models that can differentiate stable materials from unstable candidates.
These methods usually formulate formation energy prediction as a regression problem, with models trained on a majority of negative formation energy samples. However, such formation energy prediction models are most interesting when they can be used to differentiate stable from non-stable hypothetical materials, most of which tend to be unstable and have positive formation energy. Despite the claimed high accuracy of these models,17,25 they are mainly evaluated on stable materials with negative formation energy, making their extrapolation performance on out-of-distribution non-stable materials with positive formation energy questionable.29 The question here is how to train ML models on samples with almost exclusively negative formation energy while expecting them to differentiate stable materials (negative formation energy) from unstable materials (positive formation energy). In addition, it has been argued that accurate prediction of formation energy alone does not correspond exactly to accurate prediction of stability, which is better measured by a convex hull construction in formation enthalpy (ΔHf)-composition space.25
Synthesizability of a hypothetical material is another important property needed for effective materials screening,30,31 and it is challenging to predict accurately.32 Many naive generative models for molecules have been found to generate unsynthesizable candidates.31 Unfortunately, synthesizability is much more challenging to predict using ML models or other computational methods.32,33 One approach is to predict the synthesis path given a material composition;34–37 however, these methods are newly emerging and cannot yet be applied at the scale of hypothetical materials screening. Another option is ML based models for materials synthesizability prediction. For inorganic materials, a recent study applied the positive and unlabelled semi-supervised machine learning algorithm (PU-learning)13 to predict synthesizability with promising results. Davariashtiyani et al. proposed a 3D voxel representation based convolutional network for synthesizability classification trained with 600 anomaly samples.38 However, the extrapolation power of their model is expected to be low due to their highly biased and limited selection of anomaly structures.
Semi-supervised learning39,40 has been widely and successfully used in computer vision,41 natural language processing,42 and medical diagnosis,43 mainly to address the scarcity of annotated data or simply to improve performance using unlabelled data. However, despite the well-known small-data issue in materials ML problems, semi-supervised learning has rarely been applied there, except in a few studies13,44,45 on materials synthesis classification, microstructure classification, and synthesizability prediction.13
SSL algorithms are developed on several fundamental assumptions40 including (1) the smoothness assumption: two samples close to each other in the input space tend to have similar labels; (2) low-density assumption: the decision boundary should not pass through high-density areas in the input space; (3) manifold assumption: data points on the same low-dimensional manifold should have the same label. These assumptions can be interpreted as specific instances of the cluster assumption: similar points tend to belong to the same group/cluster. There are two main categories of SSL algorithms including graph based transductive methods which focus on label propagation and inductive methods which aim to build a ML model f: x → y by incorporating unlabelled data either in pre-processing steps, directly inside the loss function, or via a pseudo-labeling step. SSL algorithms have demonstrated strong performance especially in the deep learning framework.42
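The pseudo-labeling flavor of inductive SSL can be illustrated with a tiny self-training loop. This is a hypothetical toy, not the paper's model: a 1D nearest-centroid classifier labels only the unlabeled points far from its decision boundary, per the low-density assumption, and retrains on them.

```python
def self_train(labeled, unlabeled, rounds=3, threshold=1.0):
    """Minimal inductive SSL via pseudo-labeling (toy example):
    repeatedly label the unlabeled points the classifier is most
    confident about and absorb them into the training set."""
    data = list(labeled)   # (x, y) pairs
    pool = list(unlabeled)
    for _ in range(rounds):
        # "train": compute class centroids from the current labeled set
        n0 = sum(1 for _, y in data if y == 0)
        n1 = sum(1 for _, y in data if y == 1)
        c0 = sum(x for x, y in data if y == 0) / max(1, n0)
        c1 = sum(x for x, y in data if y == 1) / max(1, n1)
        confident = []
        for x in pool:
            # low-density assumption: only trust points far from the boundary
            margin = abs(abs(x - c0) - abs(x - c1))
            if margin > threshold:
                confident.append((x, 0 if abs(x - c0) < abs(x - c1) else 1))
        pool = [x for x in pool if x not in {x_ for x_, _ in confident}]
        data.extend(confident)
    return data

# Two clusters around 0 and 10, but only two labeled points
labeled = [(0.0, 0), (10.0, 1)]
unlabeled = [0.5, 1.0, 9.0, 9.5]
grown = self_train(labeled, unlabeled)
```

After one round, all four unlabeled points receive the cluster-consistent pseudo-label, growing the labeled set from 2 to 6 samples.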
Here we propose a semi-supervised learning (SSL) approach for the materials formation energy and synthesizability prediction problems, considering both the database bias (most deposited samples are stable, synthesizable materials with negative formation energy) and the application scenario, in which the models must differentiate stable from unstable hypothetical materials. In this work, we exploit a deep learning based SSL framework, the teacher-student deep neural network (TSDNN),46 to address the lack of negative samples in synthesizability prediction and formation energy prediction. A TSDNN is characterized by a dual-network architecture: the teacher model is trained using a supervised signal plus an unsupervised feedback signal from the student network, which improves the teacher's pseudo-labeling capability, and the teacher provides pseudo-labels on unlabeled data for the student model to learn from. Unlike the previous positive-unlabeled SSL algorithm for synthesizability prediction,13 our TSDNN model has far fewer parameters while achieving 5.3% higher prediction accuracy, improving the true positive rate from 87.9% to 92.9% under the same performance evaluation. Extensive experiments on the formation energy classifiers also show that our TSDNN can screen for negative formation energy materials with 7.5% higher precision, a 10.3% higher F1 score, and 9.7% higher accuracy than the CGCNN regression model.
Our contributions in this paper can be summarized as follows:
• We identify the inherent dataset bias in formation energy and synthesizability prediction problems and propose to formulate both as semi-supervised classification problems.
• We exploit a novel teacher-student dual-network deep learning framework to achieve high-performance semi-supervised learning for both formation energy and synthesizability classification. Compared to previous approaches, our models achieve >10% performance improvement with much simpler model structures and a 98% smaller model size.
• We evaluate our algorithms on different dataset configurations and demonstrate the effectiveness and advantage of SSL for both problems.
• We apply our TSDNN based formation energy and synthesizability SSL model for screening new materials from the hypothetical cubic crystal materials and identify a set of new stable materials as verified by DFT formation energy calculations.
Fig. 1 PU-learning based dataset generation and training procedure for the TSDNN framework. (a) The first step of a TSDNN is to cluster positive and negative samples from the unlabeled set. Since there are only positive samples in our raw dataset, we use an iterative PU learning procedure13 to select the most likely negative samples from the unlabeled set. It starts with only positive (green) and unlabeled (gray) samples. It first randomly selects unlabeled samples (equal in number to the positive) as negative ones. A TSDNN model is then trained using these labels and used to classify all samples. This random sampling and prediction process is repeated 5 times and the classification scores are averaged for each material, as shown in the gradient bar. From this, we assemble a complete dataset: 9629 materials with the highest classification scores (P Test) are selected as the positive test set and the 9629 lowest (N Test) ones as the negative test set. The labeled training dataset (P labeled and N labeled) is selected as shown, and the middle section of uncertain classifications is left as the unknown set (unlabeled). A final fine-tuned TSDNN model is then trained using this clustered dataset. (b) A TSDNN model is trained using a teacher model and a student model. The teacher model is trained on labeled data (PL + NL) and predicts pseudo-labels (classification scores) for the unlabeled data (U). The student model learns from these pseudo-labels exclusively. The teacher model also receives a feedback signal46 from the student model based on the student model's loss calculated on the batch of labeled data. This allows the teacher model to be updated to optimize for the student model's performance. The student model is saved and used for testing and predictions.
Given a labeled dataset and an unlabeled dataset, the training process of the TSDNN goes as follows: first, a batch of labeled and a batch of unlabeled data are sampled. The teacher's loss is calculated on the labeled batch. The teacher model then provides pseudo-labels for the unlabeled batch for training the student network. The student model's loss is calculated on the labeled data both before and after the student model is updated with the pseudo-labels from the teacher model. The change in this performance from the teacher's pseudo-labels is used to calculate the student model's feedback signal, which is combined with the teacher's loss over labeled data to update the teacher model. This helps the student network to learn the true labels of a large set of unlabeled data by ensuring that the student model is clustering the unlabeled data consistent with the labeled dataset. The benefit of this is that a small labeled dataset can be used and augmented with a much larger unlabeled dataset, resulting in a more robust student model that has been trained on the unlabeled data.
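The training step above can be sketched in miniature. This is illustrative only: a one-parameter logistic model stands in for the graph neural networks, and the teacher's feedback update is a crude scalar nudge rather than the exact meta-gradient of ref. 46.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    # binary cross-entropy for a single prediction p against (soft) label y
    eps = 1e-9
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

class TinyModel:
    """Stand-in for a CGCNN-based encoder: a 1-parameter logistic classifier."""
    def __init__(self, w=0.0):
        self.w = w
    def predict(self, x):
        return sigmoid(self.w * x)
    def grad_step(self, pairs, lr=0.1):
        # one SGD step on binary cross-entropy; works with soft labels too
        g = sum((self.predict(x) - y) * x for x, y in pairs) / len(pairs)
        self.w -= lr * g

def tsdnn_step(teacher, student, labeled, unlabeled, lr=0.1):
    """One teacher-student update following the scheme described in the text:
    (1) teacher pseudo-labels the unlabeled batch, (2) student trains on those
    pseudo-labels, (3) the change in the student's labeled loss is the feedback
    signal that, together with the teacher's own supervised loss, updates the
    teacher (here as a simple scalar nudge, purely for illustration)."""
    loss_before = sum(bce(student.predict(x), y) for x, y in labeled)
    pseudo = [(x, teacher.predict(x)) for x in unlabeled]  # soft pseudo-labels
    student.grad_step(pseudo, lr)                          # student learns from teacher
    loss_after = sum(bce(student.predict(x), y) for x, y in labeled)
    feedback = loss_after - loss_before                    # < 0 means pseudo-labels helped
    teacher.grad_step(labeled, lr)                         # teacher's supervised loss
    teacher.w -= lr * feedback                             # illustrative feedback nudge
    return feedback
```

Iterating `tsdnn_step` on a separable toy problem drives both networks toward the correct decision boundary, with the student trained exclusively on the teacher's pseudo-labels.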
The loss functions of our student and teacher networks follow the teacher-student feedback formulation:46 the student takes a gradient step on the teacher's pseudo-labels for an unlabeled batch $x_u$,

$$\theta'_S = \theta_S - \eta_S \nabla_{\theta_S}\,\mathrm{CE}\big(T(x_u;\theta_T),\,S(x_u;\theta_S)\big) \qquad (1)$$

while the teacher minimizes its own supervised cross-entropy on the labeled batch $(x_l, y_l)$ plus the updated student's loss on the same labeled batch (the feedback term):

$$\mathcal{L}_T = \mathrm{CE}\big(y_l,\,T(x_l;\theta_T)\big) + \mathrm{CE}\big(y_l,\,S(x_l;\theta'_S(\theta_T))\big) \qquad (2)$$

where CE denotes the cross-entropy, $T$ and $S$ are the teacher and student networks with parameters $\theta_T$ and $\theta_S$, and $\eta_S$ is the student learning rate.
A feedback signal from the student model46 is additionally included to further optimize the teacher model by improving its pseudo-labeling. This reduces labeled data bias by introducing a dynamic teacher; while a static teacher model would replicate implicit biases, this dynamic teacher model is able to adapt to the full context of the unlabeled dataset, which in turn leads to a less biased final model.
Before TSDNN training can commence, the dataset must be prepared for our semi-supervised framework. In the case of synthesizability, there are only positive data, so we must first identify candidate negative samples. This is possible through clustering, since synthesizability is defined with respect to previously synthesized materials, and selecting optimal negative labels is integral to assembling an accurate labelled dataset. For formation energy classification, the two greatest challenges are the high density of materials with near-zero formation energies, as shown in Fig. 3, and the labelled dataset imbalance, with relatively few negative samples. Once these issues are resolved, the TSDNN model can be trained.
Fig. 3 Distribution of formation energy for the MP dataset with few positive values and the Cubic test dataset with many positive energies.
The PU learning framework is a modified transductive bagging support vector machine.49 In this framework, a model is trained with a random selection from the unlabeled dataset, equal in size to the positive class, treated as the negative class. The model then produces predictions on the remaining unlabeled data not chosen as the negative class. After a given number of iterations, the unlabeled scores are averaged into a final score. The motivation is to identify a cluster of samples that lie apart from the positive class. This is useful for identifying the highest- and lowest-scoring materials, but still leaves a large amount of uncertain data with scores near the classification boundary. Using our TSDNN semi-supervised framework, we train our final fine-tuned model on the new labeled dataset produced by the PU learning dataset generation step and use it to classify the remaining unknown data.
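The bagging procedure just described can be sketched as follows. Here `train_and_score` is a hypothetical callback standing in for training a classifier (a TSDNN or SVM) on positives plus pseudo-negatives and scoring the held-out unlabeled samples.

```python
import random

def pu_bagging_scores(positive, unlabeled, train_and_score, iters=5, seed=0):
    """Iterative PU-learning score averaging: each round draws a random
    pseudo-negative set from the unlabeled pool (same size as the positive
    set), trains a classifier, and scores the remaining unlabeled samples.
    Scores are averaged over the rounds in which each sample was scored.
    train_and_score(positives, negatives, to_score) -> {sample: score}."""
    rng = random.Random(seed)
    totals, counts = {}, {}
    for _ in range(iters):
        pseudo_neg = rng.sample(unlabeled, len(positive))
        rest = [u for u in unlabeled if u not in set(pseudo_neg)]
        for u, s in train_and_score(positive, pseudo_neg, rest).items():
            totals[u] = totals.get(u, 0.0) + s
            counts[u] = counts.get(u, 0) + 1
    return {u: totals[u] / counts[u] for u in totals}
```

The averaged scores are then thresholded as in Fig. 1: the extremes become the labeled positive/negative sets and the uncertain middle stays unlabeled.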
619 materials, 48 146 of which are ICSD entries and are labeled as positive samples to indicate that they are synthesizable. However, there are no ground-truth negative samples (un-synthesizable materials) in the downloaded MP dataset: it contains only ICSD entries and virtual materials, the latter with unknown synthesizability status. This lack of negative samples prevents a traditional supervised classification model from being trained as usual. To overcome this, we used the positive and unlabeled (PU) learning method13 discussed above to cluster the unlabeled data and identify materials with low synthesizability scores to serve as negative samples. We first remove 9629 randomly selected positive samples to be used as the test set prior to any training. Then, we generate the clustered unlabeled dataset as shown in Fig. 1a: our TSDNN model is trained for 5 independent iterations, and in each iteration a random subset of the unlabeled dataset, equal in size to the positive set, is selected as the negative set. A TSDNN model is trained on these data and makes predictions on the unlabeled samples not selected as the negative set. The predicted scores are averaged across the 5 iterations to produce the clustered dataset. We then selected the 9629 lowest-scored materials as negative samples for the final test set, ensuring that our final model accurately classifies the positive and negative sets determined by clustering. Next, the 38 517 lowest-scored remaining materials (all scored below 0.33) were selected to match the 38 517 positive samples for the labelled dataset. This provides a full labeled dataset with negative labels and a full test set on which accuracy can be determined. The remaining 29 327 samples are considered inconclusive and remain as the unlabeled set to be filtered by the final model.
This dataset could be directly used to train a supervised or semi-supervised model, as is done for the balanced TSDNN and supervised CGCNN models. However, since the negatively labeled materials are selected based on an imperfect model's predictions, false negatives are introduced into the training data, increasingly so for materials whose prediction scores lie closer to 0.5 than to 0.0. As a countermeasure, we leverage our semi-supervised model to gain insight into the dataset and select optimal negative samples. When trained with our semi-supervised model, the true negative rate is especially low compared to the true positive rate; however, when the threshold for selecting negative samples is moved lower from the 0.33 prediction score, this performance improves. We use this to determine the negative-class threshold that best balances the true positive and true negative rates, which leads to the improved performance of the unbalanced TSDNN.
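The threshold selection can be expressed as a simple sweep. The evaluation numbers below are hypothetical, and maximizing the smaller of the two rates is one reasonable reading of the balance criterion described above, not the paper's exact recipe.

```python
def pick_threshold(evaluations):
    """evaluations: {negative-class score cutoff: (TPR, TNR)}, where each
    (TPR, TNR) pair comes from retraining the model with pseudo-negatives
    selected below that cutoff. Picks the cutoff with the best balanced
    performance, i.e. the largest min(TPR, TNR)."""
    return max(evaluations, key=lambda t: min(evaluations[t]))

# hypothetical sweep: lowering the cutoff trades TPR for TNR
evals = {0.33: (0.93, 0.62), 0.25: (0.92, 0.78),
         0.15: (0.90, 0.86), 0.05: (0.83, 0.91)}
best = pick_threshold(evals)
```

With these illustrative numbers the sweep settles on the 0.15 cutoff, where neither rate is sacrificed badly.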
For the second model, the unseparated TSDNN, we use only materials with positive formation energies (n = 2444) as negative samples and an equal number of randomly selected materials with negative formation energy as positive samples. This selection aims for a representative distribution of positive samples while keeping the sample density low near the decision boundary: including samples with near-zero formation energies improves smoothness while still ensuring a low density near the classification threshold of 0.0 eV. This makes it a general screener for positive vs. negative formation energy, as opposed to the first approach, which is optimized for strictly low formation energy classification. This approach results in a high-precision model: 78.4% of samples with predicted scores greater than 0.5 have a formation energy below −2.0 eV, and 99.0% have a negative formation energy. It correctly classifies 57.8% of the samples with formation energies below −2.0 eV.
In both models, we use an unlabeled dataset of 500 000 CubicGAN-generated structures. These two models ensure that there is a low sample density at the classification threshold: using the dataset as-is with a threshold of 0.0 eV would result in a very high density of materials at the threshold, so we use the different thresholds and data-selection methods described above to account for this. Each model has distinct benefits best suited to different applications, as shown in Fig. 5.
We structure our datasets in this way to correct for the biases that models acquire from the unbalanced nature of formation energy datasets. As shown in Fig. 3a, the Materials Project has an overwhelming majority of <0 eV materials. If trained on the raw data, a model will likely be heavily biased toward predicting >0 eV materials as <0 eV. For this reason, we combine the benefit of our TSDNN model with a balanced dataset to remove this bias. It is of particular importance that the model be unbiased when used with generated materials, such as those produced by our CubicGAN, as they contain many more >0 eV materials. We seek to apply our method to provide superior screening performance in identifying low formation energy materials.
| Synthesizability model | Labeled | Src | Unlabeled | Src | Stability (formation energy) model | Labeled | Src | Unlabeled | Src |
|---|---|---|---|---|---|---|---|---|---|
| Supervised CGCNN | 77 035 | MP | 0 | N/A | Unseparated FE TSDNN | 4888 | MP | 500 000 | CG |
| Balanced TSDNN | 77 035 | MP | 29 327 | MP | Separated FE TSDNN | 11 078 | MP | 500 000 | CG |
| Unbalanced TSDNN | 45 165 | MP | 29 327 | MP | CGCNN regressor | 20 614 | MP | 0 | N/A |
| PU-learning13 | 46 781 | MP | 78 734 | MP | | | | | |
The supervised CGCNN and balanced TSDNN models use the same labeled datasets. The balanced TSDNN model is trained using the remaining samples as the unlabeled set. This uses the unoptimized dataset provided from the dataset generation step. The unbalanced TSDNN uses the optimized labeled dataset from the optimization step discussed in the formation energy classification performance section.
Because our CubicGAN generative model produces strictly cubic structures, we used only cubic Materials Project structures to train a formation-energy classification model to predict samples with negative formation energies. We used two selections of data for our formation energy models. The first model, the unseparated TSDNN, uses only materials with formation energies greater than 0.0 eV as negative data (n = 2444). We then randomly selected an equal number from the remaining samples as positive data (n = 2444). This yields a balanced labeled dataset with the full distribution of negative formation energy samples represented. The second model, the separated TSDNN, is trained using the samples in the lowest 25% of formation energies (n = 5539) as positive data and the highest-energy materials (n = 5539) as negative data. This excludes the range of materials close to 0.0 eV, separating the positive and negative classes in the input space and training the model to identify only low formation energy materials. The CGCNN regression model is trained using the full cubic training dataset. We validate our formation energy models' performance by testing them on our own dataset of cubic structures produced by the CubicGAN with DFT-calculated formation energies. For each model, we used a test set of 36 847 CubicGAN-generated structures with DFT-calculated formation energies, of which 16 407 have negative and 20 440 have positive formation energies.
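The two dataset selections described above can be sketched as follows. The 25% split and the class conventions (positive class = negative formation energy) follow the text; the helper itself, including its field layout of `(id, formation_energy)` pairs, is illustrative.

```python
import random

def build_fe_datasets(materials, seed=0):
    """materials: list of (structure_id, formation_energy) pairs.
    Returns (unseparated, separated) labeled datasets as (material, label)
    pairs, where label 1 = positive class (negative formation energy).
    Assumes there are at least as many negative-FE as positive-FE samples,
    as in the cubic MP data described above."""
    rng = random.Random(seed)
    neg_fe = [m for m in materials if m[1] < 0.0]   # stable side
    pos_fe = [m for m in materials if m[1] >= 0.0]  # unstable side
    # unseparated: all positive-FE samples as the negative class, an equal
    # random draw of negative-FE samples as the positive class
    unseparated = ([(m, 1) for m in rng.sample(neg_fe, len(pos_fe))]
                   + [(m, 0) for m in pos_fe])
    # separated: extremes only, leaving the near-zero middle out entirely
    by_fe = sorted(materials, key=lambda m: m[1])
    k = len(materials) // 4
    separated = [(m, 1) for m in by_fe[:k]] + [(m, 0) for m in by_fe[-k:]]
    return unseparated, separated
```

The unseparated set keeps the full formation-energy distribution balanced around 0.0 eV, while the separated set widens the gap between classes in the input space.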
545 713 novel CubicGAN-generated materials and selecting the top 1000 for analysis. We then perform DFT calculations of their formation energies to analyze their stability and likely synthesizability.
We show the results of our synthesizability prediction in Table 2. The balanced TSDNN was trained using the full labeled dataset and a comparatively small unlabeled dataset to compare it to the strictly supervised CGCNN classifier method. These two models have equivalent performance, with the supervised CGCNN achieving an 81.60% TPR and the balanced TSDNN achieving an 81.20% TPR. To improve on this and benefit from semi-supervised learning, we then use the optimized dataset described in the TSDNN-syn method section for training the unbalanced TSDNN model, which achieved the highest accuracy of 94.11% along with a TPR of 92.90%. We also evaluate this model by moving the test data into the unlabeled dataset for the seeded TSDNN test. We use this test to evaluate the pseudo-labeling ability of our teacher model and to show that the true labels of data in the unlabeled set are learned correctly. The seeded TSDNN achieves the highest TPR of 93.80% and an accuracy of 91.48%, which demonstrates accurate teacher pseudo-labelling for unlabeled data. It increased the TPR of the unbalanced TSDNN from 92.90% to 93.80%. This is the best comparison to real-world performance, as the unlabeled data would be the desired data to be classified.
| Model | TPR | Accuracy | Test Set |
|---|---|---|---|
| Supervised CGCNN (baseline) | 81.60% | 62.73% | 9629 holdout |
| Balanced TSDNN (ours) | 81.20% | 56.40% | 9629 holdout |
| Seeded TSDNN (ours) | 93.80% | 91.48% | 9629 unlabeled |
| Unbalanced TSDNN (ours) | 92.90% | 94.11% | 9629 holdout |
| PU-learning (baseline)13 | 87.90% | N/A | 9629 holdout |
In both the basic PU learning method for synthesizability13 and our TSDNN framework, a decision boundary of 0.5 is used for determining synthesizable vs. unsynthesizable materials. To show the consistency and performance of both models, Fig. 4 plots the predicted synthesizability scores for all the ICSD materials in our test set from the PU learning model against those predicted by our TSDNN model. The figure is divided into quadrants, each signifying agreement or disagreement between the PU learning method and our TSDNN framework. The top right quadrant signifies correct agreement, where both models correctly classify the materials as positive. As expected given the similarity in methods, both models correctly agree on 92.24% of samples. The bottom left quadrant, similarly, denotes incorrect agreement that the materials should be classified as negative; these are very few, totaling only 0.62% of samples. The bottom right quadrant signifies a disagreement in which the TSDNN model correctly classifies the materials and the PU learning method does not. The bottom right quadrant contains many more samples than the top left quadrant (5.91% vs. 1.22% of samples), indicating that many materials given very high prediction scores by our TSDNN model were incorrectly classified as unsynthesizable by the PU learning method. These results show that our model's improved true positive rate is not simply a result of materials being classified right at the 0.5 classification boundary.
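The quadrant bookkeeping for this positive-only test set (every ICSD sample's true label is positive, so a score of at least 0.5 counts as correct) might look like the sketch below, given parallel lists of scores from the two models.

```python
def quadrant_fractions(pu_scores, tsdnn_scores, boundary=0.5):
    """For a test set of truly-positive samples, return the fraction in each
    Fig. 4 quadrant: (both correct, both wrong, TSDNN-only correct,
    PU-only correct). Scores are parallel lists, one entry per material."""
    n = len(pu_scores)
    pairs = list(zip(pu_scores, tsdnn_scores))
    both = sum(1 for p, t in pairs if p >= boundary and t >= boundary)
    neither = sum(1 for p, t in pairs if p < boundary and t < boundary)
    tsdnn_only = sum(1 for p, t in pairs if p < boundary and t >= boundary)
    pu_only = n - both - neither - tsdnn_only
    return tuple(x / n for x in (both, neither, tsdnn_only, pu_only))
```

Applied to the real score lists, this would reproduce the 92.24%, 0.62%, 5.91%, and 1.22% figures quoted above.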
047 positive samples and 41 387 unlabeled samples. The withheld test set contained 15 747 positive samples and 55 138 unlabeled samples. We used the 15 747 true positive samples from the post-2018 test set to evaluate our model's performance as it would be used to make predictions on new data.
The second experiment was conducted in the same manner but withholding all materials containing the element Mn, to evaluate the model's reliance on similarity between materials. The training dataset consisted of 47 635 positive samples and 85 393 unlabeled samples, none containing Mn. The final test set contained 2159 positive samples and 11 132 unlabeled samples, each containing Mn. The results for each experiment are given in Table 3.
| Experiment | Model | TPR |
|---|---|---|
| Time-based splitting | PU-learning (baseline)13 | 86.20% |
| Time-based splitting | TSDNN (ours) | 91.65% |
| Element holdout | TSDNN (ours) | 72.67% |
We compare our time-based splitting validation with the standard PU learning method from ref. 13. They use an older version of the training dataset with data pre-2015, so we similarly used the latest 5 years of data for testing to match their dataset splitting. Our model's consistent performance when trained on historical data demonstrates our model's efficacy for use in real-world application for future predictions.
Similarly, in the element holdout experiment our model demonstrates the expected performance. Although this experiment removes the similarity to existing materials that real-world materials discovery typically relies on, the model shows that it can still perform reasonably well with little knowledge of the interactions of Mn.
Table 4 shows the classification performance of the three models on our test set of materials. Our unseparated TSDNN model achieves a 74.60% F1 score compared to the CGCNN regression model's 64.3%, an absolute 10.3% improvement from our semi-supervised learning approach. This model likewise achieves an accuracy of 74.0%, an absolute 9.7% improvement over the CGCNN model. Our separated TSDNN model shows that our approach can be tuned for higher precision by adjusting the training threshold, resulting in a high-confidence model: as the table shows, it can be tuned to achieve 100% precision for identifying candidates that are highly likely to be stable materials.
| Model | Precision | F1 score | Accuracy |
|---|---|---|---|
| CGCNN | 58.60% | 64.30% | 64.30% |
| Unseparated FE TSDNN | 66.10% | 74.60% | 74.00% |
| Separated FE TSDNN | 100.00% | 16.50% | 59.50% |
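The metrics in Table 4 follow the standard definitions and can be recomputed from confusion counts; the counts in the usage note below are illustrative, not the paper's.

```python
def metrics(tp, fp, tn, fn):
    """Precision, F1 score, and accuracy from a binary confusion matrix,
    with the positive class meaning 'negative formation energy' as above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, f1, accuracy
```

For example, a hypothetical model with 8 true positives, 2 false positives, 5 true negatives, and 5 false negatives scores 80.0% precision, about 69.6% F1, and 65.0% accuracy.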
To further illustrate the advantage of our TSDNN models, we show the formation energy distributions of the positively classified samples (with negative formation energies) from our test set using our classifiers and the baseline CGCNN regression model. As shown in Fig. 5a, our test set contains a large number of samples with positive formation energy, to fully test each model's ability to differentiate between samples with positive and negative formation energies. The desired formation energy distribution of screened samples is the bottom group of samples at around −2.0 eV. Fig. 5b shows that our separated TSDNN model recovers the desired sample group, with formation energies distributed around a peak of −2.2 eV, indicating that our separated FE TSDNN is effective for applications that require high certainty that a material will have a low formation energy, owing to its very high precision. For more general screening with a 0.0 eV threshold, our unseparated TSDNN model is more suitable (Fig. 5c). With the vast number of materials with formation energies very close to 0.0 eV, it is very challenging to train a model that accurately differentiates between small positive and small negative formation energies. As shown in Fig. 5d, the CGCNN model is unable to capture the full distribution of negative formation energy materials in the test set and has difficulty differentiating between samples with positive and negative formation energies. As shown in Table 4, our unseparated TSDNN model improves greatly on this, with a 7.5% increase in precision, a 10.3% increase in F1 score, and a 9.7% increase in accuracy over the CGCNN, making it preferable for applications that screen for stable materials (usually with negative formation energy).
Starting with 2.5 million candidate materials, we first apply our separated TSDNN model to classify them as having positive or negative formation energies; 918 686 of them are predicted to have a negative formation energy. We then select the 5000 of these materials with the highest prediction scores and apply our unbalanced TSDNN synthesizability model to predict their probability of being synthesizable. We finally select the top 1000 samples with the highest predicted synthesizability. These samples are sent for DFT relaxation and further validation.
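This two-stage funnel can be sketched as follows; the model callables and the convention that scores above 0.5 mean a predicted negative formation energy are assumptions standing in for the trained classifiers.

```python
def screen(candidates, fe_model, syn_model, fe_top=5000, final_top=1000):
    """Two-stage screening funnel as described above: keep candidates the
    formation-energy classifier marks as negative-FE (score >= 0.5), take
    the fe_top highest FE scores, rank those by synthesizability score, and
    return the final_top best for DFT validation. fe_model and syn_model
    are hypothetical callables mapping a structure to a score in [0, 1]."""
    stable = [c for c in candidates if fe_model(c) >= 0.5]
    stable.sort(key=fe_model, reverse=True)   # most confidently stable first
    shortlist = stable[:fe_top]
    shortlist.sort(key=syn_model, reverse=True)  # most synthesizable first
    return shortlist[:final_top]
```

With the paper's numbers, this funnel narrows 2.5 million candidates to 918 686 predicted-stable structures, a 5000-sample shortlist, and finally 1000 structures for DFT relaxation.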
| Chemical formula | Space group | E form (eV per atom) |
|---|---|---|
| TbLuO2 | 225 | −3.599 |
| HoPaF6 | 216 | −3.551 |
| RbPmF6 | 216 | −3.108 |
| Pm2IO6 | 225 | −2.785 |
| PaSnF6 | 216 | −2.427 |
| PaMoF6 | 216 | −2.351 |
| PaIF6 | 216 | −2.255 |
We plot the correlation between the TSDNN prediction score and the calculated formation energy of the selected materials in Fig. 7. Formation energy is not a suitable indicator of a material's synthesizability, so we do not expect a strong correlation, as these scores are the final predictions from our synthesizability prediction model.
Previously, CGCNN-based regression models have been used to screen for stable material candidates by predicted formation energy. The issue with using such models to screen for candidates with low formation energies is that dataset imbalance introduces model and prediction biases. As shown in Fig. 3a, only 8.2% of the MP database consists of materials with a formation energy greater than 0 eV. As a result, ML-based regression models bias their predictions heavily toward negative formation energies, as shown in Fig. 8.
Here we proposed a dual crystal graph convolutional neural network based semi-supervised learning framework for synthesizability and formation energy prediction. Comprehensive testing and validation show that our TSDNN models successfully exploit the unlabeled data in each use case, in conjunction with the existing labeled data, to predict synthesizability and formation energies accurately and effectively. Our TSDNN models can be paired with existing and future material generation models for efficient screening across a variety of applications, as demonstrated with our CubicGAN. Integrating our models with generative models enables a more efficient and reliable search for new materials. Whereas the CGCNN-based regression model misclassified a large group of materials as having positive formation energies because of the bias caused by the dataset imbalance, our semi-supervised TSDNN classification model reduces this bias, as it is designed with screening in mind from the start. Furthermore, by using our TSDNN framework in conjunction with our CubicGAN model, we were able to use the large amount of unscreened data as unlabeled data to train our model for improved performance.
We recognize that CGCNN is no longer the state-of-the-art graph neural network model for formation energy prediction, given the emergence of newer variants such as MEGNet,27 DeeperGATGNN,14 and ALIGNN.57 Our twin-network model can easily be combined with these algorithms to achieve even better performance in semi-supervised materials property prediction.
In this work, we use our recently developed CubicGAN algorithm8 to generate 10 million hypothetical ternary cubic crystal structures in three space groups (221, 225, and 216), which are reduced to 2.5 million unique candidate cubic structures. With such a high volume of candidates, finding stable and synthesizable ones is like finding a needle in a haystack. To address this challenge, we develop semi-supervised deep learning based classification models for identifying hypothetical material candidates with negative formation energy and high synthesizability, respectively.
Five independent models are trained under the PU learning framework, each using a random subset of the unlabeled samples as negative samples to complete a labeled dataset. Each model is trained on that iteration's labeled and unlabeled sets, with an 80%/20% training/validation split of the labeled set. After all iterations have completed, the prediction scores for each unlabeled sample are averaged. The unlabeled samples with the lowest average scores are then used as negative samples in a new labeled dataset to train a sixth and final model, along with the remaining unlabeled samples. This final model is used to make predictions and is evaluated on the initial test set withheld at the beginning.
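The PU-learning bagging step above can be sketched as follows. `train_and_score` is a hypothetical callable standing in for a full TSDNN training run: it takes the pseudo-negative sample ids and returns a prediction score per unlabeled sample (higher meaning more likely positive).

```python
import numpy as np

def pu_bagging_negatives(unlabeled_ids, train_and_score, n_iters=5, n_neg=100, seed=0):
    """Sketch of the averaged-score PU step: each iteration draws a random
    subset of the unlabeled pool as pseudo-negatives, trains a model, and
    scores every unlabeled sample; the samples with the lowest average score
    across iterations are returned as negatives for the final model."""
    rng = np.random.default_rng(seed)
    unlabeled_ids = np.asarray(unlabeled_ids)
    scores = np.zeros((n_iters, len(unlabeled_ids)))
    for i in range(n_iters):
        pseudo_neg = rng.choice(unlabeled_ids, size=n_neg, replace=False)
        scores[i] = train_and_score(pseudo_neg)
    avg = scores.mean(axis=0)
    # Lowest-average-score samples become the negatives for the final model.
    return unlabeled_ids[np.argsort(avg)[:n_neg]]
```

In the paper's setup `n_iters=5` and the remaining unlabeled samples stay unlabeled when the sixth model is trained.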
The hyper-parameters used to train our TSDNN models are listed in Table 6.
| Hyper-parameters | Value |
|---|---|
| Number of bagging iterations | 5 |
| Dataset holdout for testing | 20% |
| Holdout validation per iteration | 20% |
| Number of epochs per iteration | 100 |
| Learning rate | 0.001 |
| Momentum | 0.9 |
| Weight decay | 0 |
| Atomic feature length | 90 |
| Hidden feature length | 180 |
| Number of convolution layers | 3 |
| Number of hidden layers | 1 |
| Optimizer | SGD |
We follow the hyperparameters as specified in ref. 13 for direct comparison. Specific synthesizability dataset splitting procedures may be found in the TSDNN-syn section. Similarly, specific formation energy dataset splitting procedures may be found in the TSDNN-fe section. Each model is trained according to the general training procedure described above.
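With SGD, momentum 0.9, learning rate 0.001, and weight decay 0 from Table 6, one parameter update takes the standard momentum form. A minimal numpy sketch of that update rule (an illustration of the optimizer settings, not the authors' training code):

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.001, momentum=0.9, weight_decay=0.0):
    """One SGD-with-momentum update using the Table 6 hyperparameters:
    v <- momentum * v + (grad + weight_decay * param); param <- param - lr * v."""
    grads = grads + weight_decay * params     # weight decay is 0 in Table 6
    velocity = momentum * velocity + grads    # accumulate momentum
    params = params - lr * velocity           # take the descent step
    return params, velocity
```

Over the 100 epochs per iteration listed in Table 6, this update is applied to every weight in the teacher-student networks.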
The synthesizability models are evaluated using the true positive rate,

TPR = TP/(TP + FN) | (3)
We evaluate the TSDNN-fe models on three metrics with variable formation energy thresholds: accuracy, precision, and F1 score. We again use a prediction score boundary of 0.5 to determine a positive or negative sample classification. The accuracy metric is defined as
Accuracy = (TP + TN)/(TP + TN + FP + FN) | (4)
The precision and recall metrics can be expressed as
Precision = TP/(TP + FP) | (5)

Recall = TP/(TP + FN) | (6)

and the F1 score combines them as

F1 = 2 × (Precision × Recall)/(Precision + Recall) | (7)
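The four evaluation metrics can be computed together from the confusion-matrix counts; a minimal sketch:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from true/false positive and
    negative counts, as in eqn (4)-(7)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For example, 8 true positives, 5 true negatives, 2 false positives, and 1 false negative give an accuracy of 0.8125 and a precision of 0.8.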
In this work, we applied our models to screen the 2 545 713 hypothetical materials generated by our CubicGAN model. Overall, we screened the 918 686 materials that were positively classified by the formation energy model with our synthesizability prediction model. We selected the top 1000 of these final screened materials for DFT verification and found that 51.2% have negative formation energies. These results show that our TSDNN semi-supervised learning framework is effective for large-scale materials discovery screening.
This journal is © The Royal Society of Chemistry 2023