Sasan Amariamir,^a Janine George^ab and Philipp Benner^*a

^a Federal Institute of Materials Research and Testing, Unter den Eichen 87, 12205 Berlin, Germany. E-mail: philipp.benner@bam.de
^b Friedrich Schiller University Jena, Institute of Condensed Matter Theory and Solid-State Optics, Max-Wien-Platz 1, 07743 Jena, Germany
First published on 27th March 2025
Material discovery is a cornerstone of modern science, driving advancements in diverse disciplines from biomedical technology to climate solutions. Predicting synthesizability, a critical factor in realizing novel materials, remains a complex challenge due to the limitations of traditional heuristics and thermodynamic proxies. While stability metrics such as formation energy offer partial insights, they fail to account for kinetic factors and technological constraints that influence synthesis outcomes. These challenges are further compounded by the scarcity of negative data, as failed synthesis attempts are often unpublished or context-specific. We present SynCoTrain, a semi-supervised machine learning model designed to predict the synthesizability of materials. SynCoTrain employs a co-training framework leveraging two complementary graph convolutional neural networks: SchNet and ALIGNN. By iteratively exchanging predictions between classifiers, SynCoTrain mitigates model bias and enhances generalizability. Our approach uses Positive and Unlabeled (PU) learning to address the absence of explicit negative data, iteratively refining predictions through collaborative learning. The model demonstrates robust performance, achieving high recall on internal and leave-out test sets. By focusing on oxide crystals, a well-characterized material family with extensive experimental data, we establish SynCoTrain as a reliable tool for predicting synthesizability while balancing dataset variability and computational efficiency. This work highlights the potential of co-training to advance high-throughput materials discovery and generative research, offering a scalable solution to the challenge of synthesizability prediction.
Historically, heuristics based on physico-chemical principles, such as the Pauling rules4 or charge-balancing criteria,5 have been used to assess materials stability and synthesizability. Nevertheless, these simplified approaches have proven insufficient, as more than half of the experimental (already synthesized) materials in the Materials Project database6 do not meet these synthesizability criteria.5,7
More recently, materials scientists have often employed thermodynamic stability as a proxy for synthesizability, ignoring the effect of kinetic stabilization. This involves conducting first-principles calculations to estimate the formation energy of crystals and their distance from the convex hull. A negative formation energy, or a minimal distance from the convex hull, is commonly interpreted as an indicator of synthesizability.8–12 While stability contributes significantly to synthesizability, it is only one aspect of this complex issue. Many potentially interesting metastable materials do exist, even though their formation energies deviate from the ground state.8,11,13–15 Such materials can be synthesized under alternative thermodynamic conditions in which they are the ground state; once the favorable thermodynamic field is removed, they remain trapped in the metastable structure through kinetic stabilization.8 Conversely, many hypothetical stable materials in well-explored chemical spaces have never been synthesized, possibly because of a high activation-energy barrier separating them from common precursors.13–15 Beyond these theoretical and thermodynamic considerations, synthesizability is also a technological problem. Novel materials developed through cutting-edge methods were practically unsynthesizable before their synthesis routes were invented. For example, new high-entropy alloys with great potential for catalysis applications were recently synthesized using the carbothermal shock (CTS) method;16 their homogeneous components and uniform structures were not accessible through conventional synthesis methods. Other materials can only be synthesized under specific conditions, such as extremely high pressures.17
Estimating synthesizability from a material's structure, without a closed-form relation to solve, makes the problem an apt candidate for machine learning. This and many other challenges have made machine learning a key technique for accelerating materials discovery.18 In this work, we define a classification task with two classes of materials: synthesizable (the positive class) and unsynthesizable (the negative class). This classification comes with a few challenges and intricacies. The first is encoding material structures in a machine-readable format. Previous works have circumvented this challenge in creative ways, such as combining different elemental features,15,19 using text-mining algorithms to search the relevant literature for synthesizable materials,20 applying convolutional neural networks to images of crystal cells,21 or even analyzing the network of the materials-discovery timeline with respect to stability.22 Others,14,23 including this work, use graph convolutional neural networks (GCNNs) to encode and learn from crystal structures. While GCNNs are more complicated to implement, they encode more structural information than composition alone or than the approaches above, which represent the structure only indirectly through a proxy.
The second challenge of estimating synthesizability lies in the nature of the available data. Unlike a typical classification task, we do not have access to sufficient negative data. This is partly because unsuccessful synthesis attempts are typically neither published nor uploaded to public databases; attempts to use such failed experiments24 inevitably remain confined to local labs and a small class of materials. Moreover, synthesis success strongly depends on the synthesis conditions and available technology, so failure in one setting does not necessarily imply failure in a different lab with different methods or equipment. Finally, creating a proper negative set for training a classifier is a challenge in its own right.5 If the negative set is too different from real materials, it may not teach the model a meaningful decision boundary for detecting synthesizability; designing a realistic-looking negative set would require understanding the features that determine synthesizability in the first place.
The final challenge is a fundamental aspect of machine learning. Whichever model is chosen, it will inherently exhibit a certain degree of bias. Selecting one model over another introduces a possibly unintended bias, since the model's ability to generalize out of sample is, in part, predetermined by its architecture. This bias persists even in the best-performing models; in fact, a model with excellent benchmarks might perform worse than simpler models when predicting targets for out-of-distribution data,25 perhaps due to overfitting. The challenge is particularly pronounced when predicting synthesizability, where the objective is to forecast a target for new and often out-of-distribution data and the issue of generalization is most acute. The lack of negative data compounds this issue, as it makes performance metrics less reliable. One way to mitigate this is to leverage multiple models. An ensemble of models with diverse architectures and learning strategies can balance individual model biases, improve robustness, and provide a more reliable assessment of synthesizability. By aggregating predictions from multiple models, the approach reduces overfitting, enhances generalization, and compensates for the missing negative data, leading to more accurate and trustworthy synthesizability predictions.25
To address these challenges, we have developed a model ready for integration into high-throughput simulations and generative materials research, called SynCoTrain (pronounced similarly to 'synchrotron'). It is a semi-supervised classification model designed to predict the synthesizability of oxide crystals. SynCoTrain addresses the generalizability issue through co-training, an iterative semi-supervised learning scheme designed for scenarios with some positive data and abundant unlabeled data.26,27 It leverages the predictive power of two distinct classifiers to find and label positive data points among the unlabeled data. Different models have different biases, and by combining their predictions we can reduce these biases while retaining what each model learns about the target. We use the Atomistic Line Graph Neural Network (ALIGNN)28 and SchNetPack29,30 as our classifiers. Both are innovative GCNNs with distinct attributes: ALIGNN directly encodes both atomic bonds and bond angles into its architecture, a perspective that aligns with a chemist's view of the data, while SchNetPack uses continuous convolution filters suited to encoding atomic structures, which can be thought of as a physicist's perspective on the data.
At each step of co-training, SynCoTrain learns the distribution of synthesizable crystals through the Positive and Unlabeled learning (PU learning) method introduced by Mordelet and Vert.31 This base PU learning method, with a different classifier, has already been employed to predict synthesizability for all classes of crystals14 and for perovskites specifically.23 In this work, we use multiple PU learners as the building blocks of co-training. In each iteration, the learning agents exchange the knowledge they have gained from the data, and the labels are eventually decided by averaging their predictions. This process increases prediction reliability and accuracy, much like two experts who discuss and reconcile their views before finalizing a complex decision. This collaborative approach suggests that co-training is more likely to generalize to unseen data than a single model with equivalent classification metrics such as accuracy or recall.
We verify the performance of the model by its recall on an internal test set and a leave-out test set. We also evaluate the model further by predicting whether a crystal is stable for the same data points. Note that we do not aim for good performance in predicting stability; in fact, we expect poor overall performance due to the high contamination of the unlabeled data31 (see the ESI† for details). However, we compare the ground-truth recall in the stability task with the recall produced by PU learning to gauge the reliability of the latter.
We chose a single family of materials, oxides, to establish the utility of co-training in predicting materials properties. Oxides are a well-studied class of materials with a large amount of experimental data to learn from.32,33 More training data would typically decrease the classification error in machine learning; however, training across all available families of crystals would introduce greater variability in the dataset, potentially increasing the uncertainty and error margins of our results. In other words, the prediction quality for new materials would vary substantially. By achieving high recall values with oxides as our training data, we demonstrate the effectiveness of co-training while maintaining reasonable training times for our models. Our data stem from the Materials Project database,6 in which all crystal structures have been optimized with DFT and should therefore be of similar quality. In many cases, the starting structures for optimization were taken from the Inorganic Crystal Structure Database (ICSD).34 For training machine learning models, it is crucial to minimize obvious biases that can arise from combining data from different sources; such biases are easily picked up by machine learning models, leading to distorted performance metrics.35 To mitigate this risk, we rely exclusively on a single data source for training our model.
Less than 1% of the experimental data, namely those with an energy above the convex hull higher than 1 eV, were removed as potentially corrupt. The learning began with 10206 experimental and 31245 unlabeled data points.
Co-training consists of two separate iteration series, the results of which are averaged in the final step. In the first series, we start by training a base PU learner with an ALIGNN classifier. This is iteration '0' of co-training, and the step is called ALIGNN0. The learning agent predicts positive labels for some of the unlabeled data, creating a pseudo-positive class. This class is added to the original experimental data, expanding the initial positive class. Iteration '1' of this series trains a base PU learner with the other classifier, SchNet, on the newly expanded labels; this step is called coSchnet1. Each iteration provides newly expanded labels for the next, and the classifiers alternate between ALIGNN and SchNet, as shown in Fig. 1a.
Parallel to this series, we set up a mirror series in which iteration '0' begins with a SchNet-based PU learner; this step is called SchNet0. This series learns the data from a different, complementary view compared with the former series (see Fig. 1a) and continues in the same manner with alternating classifiers. The order of the steps in each series is given in Table 1.
| Co-training steps | Iteration '0' | Iteration '1' | Iteration '2' | Iteration '3' | Averaging scores |
|---|---|---|---|---|---|
| Training data source | Original labels | Labels expanded by iteration '0' | Labels expanded by iteration '1' | Labels expanded by iteration '2' | Scores provided by the optimal iteration |
| Training series 1 | ALIGNN0 | coSchnet1 | coAlignn2 | coSchnet3 | Synthesizability scores |
| Training series 2 | SchNet0 | coAlignn1 | coSchnet2 | coAlignn3 | |
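As a rough illustration of the procedure in Table 1, the Python sketch below wires the two alternating series together. The helper `run_pu_learner` is only a placeholder standing in for a full bagged PU learner (ALIGNN or SchNet in SynCoTrain), and the identifiers and toy scores are illustrative; only the control flow (alternating classifiers, label expansion with the 0.75 threshold, final averaging with a 0.5 cutoff) follows the description in the text.

```python
import numpy as np

def run_pu_learner(classifier_name, positive_ids, unlabeled_ids):
    """Placeholder for one bagged PU-learning step (ALIGNN or SchNet in the paper).
    Returns a pseudo synthesizability score in [0, 1] for every unlabeled id."""
    rng = np.random.default_rng(abs(hash(classifier_name)) % 2**32)
    return dict(zip(unlabeled_ids, rng.random(len(unlabeled_ids))))

def co_training_series(first_clf, second_clf, positive_ids, unlabeled_ids,
                       n_iterations=3, expand_threshold=0.75):
    """One co-training series: classifiers alternate and each iteration trains on
    labels expanded by the previous one (iterations '0', '1', '2')."""
    classifiers = [first_clf, second_clf]
    positives = list(positive_ids)
    scores = {}
    for it in range(n_iterations):
        clf = classifiers[it % 2]                          # e.g. ALIGNN0 -> coSchnet1 -> coAlignn2
        scores = run_pu_learner(clf, positives, unlabeled_ids)
        pseudo_positives = [i for i, s in scores.items() if s >= expand_threshold]
        positives = list(positive_ids) + pseudo_positives  # expanded labels for the next iteration
    return scores

positive_ids = [f"exp-{i}" for i in range(100)]            # toy experimental (positive) crystals
unlabeled_ids = [f"theo-{i}" for i in range(300)]          # toy theoretical (unlabeled) crystals

series_a = co_training_series("alignn", "schnet", positive_ids, unlabeled_ids)  # ALIGNN0 series
series_b = co_training_series("schnet", "alignn", positive_ids, unlabeled_ids)  # SchNet0 series

# Final step: average the two series at the optimal iteration and apply the 0.5 cutoff.
final_labels = {i: int(0.5 * (series_a[i] + series_b[i]) >= 0.5) for i in unlabeled_ids}
```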
Each base PU learner produces a synthesizability score between 0 and 1 for each unlabeled datum. This is done through 60 runs of the bagging method established by Mordelet and Vert,31 as illustrated in Fig. 1c. In each independent run of this ensemble learner, a random subset of the unlabeled data is sampled to play the role of negative data in training the classifier. For each data point, the predictions from the runs in which it was not part of the training set are averaged to yield its synthesizability score, interpreted as the predicted probability of being synthesizable. A threshold of 0.5 is applied to label each datum as either synthesizable (labeled 1) or not synthesizable (labeled 0).
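The bagging step itself can be sketched as follows. This is a minimal, self-contained version of the Mordelet–Vert transductive bagging scheme in which a scikit-learn random forest on precomputed feature vectors stands in for the graph neural networks used in the paper; the 60-run count and the 1:1 positive-to-pseudo-negative ratio follow the text, everything else is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bagged_pu_scores(X_pos, X_unlabeled, n_runs=60, seed=0):
    """Transductive bagging PU learning: in each run a random subset of the unlabeled
    data plays the role of the negative class, and only the unlabeled points left out
    of that run receive a prediction; their average over all runs is the score."""
    rng = np.random.default_rng(seed)
    n_u = len(X_unlabeled)
    votes = np.zeros(n_u)
    counts = np.zeros(n_u)
    for run in range(n_runs):
        neg_idx = rng.choice(n_u, size=len(X_pos), replace=False)  # pseudo-negatives (1:1 with positives)
        X_train = np.vstack([X_pos, X_unlabeled[neg_idx]])
        y_train = np.concatenate([np.ones(len(X_pos)), np.zeros(len(neg_idx))]).astype(int)
        clf = RandomForestClassifier(n_estimators=50, random_state=run).fit(X_train, y_train)
        out_of_sample = np.setdiff1d(np.arange(n_u), neg_idx)      # unlabeled points not used in this run
        votes[out_of_sample] += clf.predict(X_unlabeled[out_of_sample])
        counts[out_of_sample] += 1
    return votes / np.maximum(counts, 1)    # synthesizability score in [0, 1]

# toy usage with random 8-dimensional descriptors
rng = np.random.default_rng(1)
scores = bagged_pu_scores(rng.random((50, 8)), rng.random((200, 8)))
labels = (scores >= 0.5).astype(int)        # 0.5 cutoff: synthesizable (1) / not (0)
```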
After several iterations of co-training, the optimal iteration is chosen based on the prediction metrics (i.e. the recall rate); continuing further yields diminishing returns in performance while risking reinforcement of existing model bias. The scores provided by the two series at the optimal iteration are then averaged, and the 0.5 cutoff is applied to this averaged score to produce the final synthesizability labels. Once we have synthesizability labels for both the experimental and theoretical data, a simple machine learning task remains: we train a classifier on these labels and obtain a model that can predict synthesizability (see Fig. 1b).
In our study, we employ two distinct test sets to measure recall. The first is a dynamic test set, which varies with each run of base PU learning. The second is a leave-out test set that remains untouched during all training iterations. As a result, we obtain a 'recall range' between two distinct recall measures: the averaged recall on the dynamic test set and the leave-out recall. This provides more information than a single recall value. The construction and reasoning behind this are detailed in the ground-truth evaluation section.
The recall values for each iteration are shown in Fig. 2, with the two co-training series visualized separately to illustrate the recall changes at each step. Iteration '0' represents basic PU learning with isolated classifiers, without any co-training. The SchNet0 series plateaus somewhat at iteration '2', while the ALIGNN0 series still improves in recall; however, neither series makes a significant improvement in recall at iteration '3'. This suggests that a third iteration yields diminishing returns in terms of new learning while risking reinforcement of the models' biases through too many repetitions. Furthermore, the predicted positive rate increases in both series at iteration '3' without a meaningful increase in the recall range to justify it, meaning the model becomes more likely to classify a theoretical crystal as synthesizable without improving its understanding of synthesizability. This is akin to overfitting, where additional training steps do not yield better validation results. These factors indicate that iteration '2' is optimal; consequently, we omit the third iteration and use the results from iteration '2' as the source of synthesizability labels.
The synthesizability scores for the unlabeled data are the actual goal of this PU learning task. The distribution of these scores, alongside a high recall rate, gives a sense of the performance quality: a model that marks almost all crystals as synthesizable would have high recall but could not distinguish the two classes. Fig. 3 and 4 show the distributions of synthesizability scores at iterations '0' and '2' of co-training, respectively. Despite high recall values, the PU learners mark only about 20% of the unlabeled data as synthesizable. The synthesizability scores for the intermediate iterations can be found in the ESI.†
In the final step of co-training, the scores from iteration '2' are averaged and the final labels are assigned via a cutoff of 0.5, yielding the labels for training the synthesizability predictor. The recall range is now 95–97%, and 21% of the unlabeled data are predicted to be synthesizable (see Fig. 5). Of course, all experimental data, including the roughly 3% misclassified as unsynthesizable, are labeled as positive for training the synthesizability predictor.
Next, we examine the synthesizability score for the unlabeled data and its relationship to stability. Crystals with energy more than 1 eV above the convex hull are considered highly unstable and unlikely to be synthesizable. As shown in Fig. 6, the majority of our dataset consists of stable materials, indicating that our synthesizability predictions largely exclude unstable data. Furthermore, Fig. 6 reveals that unstable crystals are 2.5 times less likely to be classified as synthesizable than as non-synthesizable. However, among all crystals with energy less than 1 eV above the hull, only about 21% are classified as synthesizable. Additionally, we observe a sharp decline in energy above the hull when the synthesizability score increases slightly from zero (contour line). Conversely, materials that are confidently predicted to be synthesizable exhibit an increase in energy above the hull.
While one might expect stability to correlate directly with higher synthesizability scores, this trend is not strongly demonstrated here, likely due to the limited number of unstable crystals in our dataset. Although stability plays a significant role in synthesizability, it is not expected to be the sole determining factor. It is also important to acknowledge the inherent limitations of DFT, such as finite temperature effects and precision constraints, which may influence these observations.
In Fig. 7, we compare the energy above hull and formation energy for data with positive and negative labels. The left column presents the experimental data, while the right column corresponds to the actual task of distinguishing positive and negative classes within the unlabeled data. As expected, we observe a clustering of positive data around lower values of energy above hull, without any distinct density peaks in formation energy. This aligns with our expectations, as stability (and, by extension, synthesizability) is influenced more by relative energy states than by absolute energy values.
Using a leave-out test set is, on the other hand, the common practice in materials informatics. At the very least, a leave-out test set provides an evaluation that is more comparable with similar works on this topic.
Ultimately, the purpose of a measured test error is to approximate the expected generalization error. By using both test sets, we obtain two recall values and therefore more information about the model's performance. We chose a leave-out test set comprising 5% of the positive data, fixed for all runs; for the dynamic test set, 10% of the positive data is chosen anew at each run of PU learning.
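A minimal sketch of this two-test-set construction, assuming scikit-learn's splitting utilities (the data-handling code in SynCoTrain may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split

positive_ids = np.arange(10206)    # experimental (positive) crystals

# Fixed leave-out test set: 5% of the positives, held out of every PU run.
pool_ids, leaveout_ids = train_test_split(positive_ids, test_size=0.05, random_state=42)

def split_for_run(pool, run_seed):
    """Dynamic test set: roughly 10% of the positives, drawn afresh for each PU run."""
    train_ids, dynamic_test_ids = train_test_split(pool, test_size=0.10, random_state=run_seed)
    return train_ids, dynamic_test_ids

train_ids, dynamic_test_ids = split_for_run(pool_ids, run_seed=0)
```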
We use the same dataset as before, now including the outlier experimental structures with high energy above the hull that were previously excluded. This adjustment retains data for learning about higher-energy structures and provides a better benchmark for comparison with previous synthesizability works that used the outliers. These data points were classified into positive (stable) and negative (unstable) classes based on a cutoff in energy above the convex hull; for details, please see the ESI.† The key difference is that, unlike in a real PU learning task, all positive and negative labels are available for evaluation after training. A random subset of the positive class, with the same number of data points as the original experimental class, kept its positive label; we then hid the labels of the remaining data to manufacture a PU learning scenario. The models were trained on the stability PU data using the same code as in the synthesizability task. Having access to all the labels, we could estimate the ground-truth recall and compare it with the recall values produced by the two test sets.
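The construction of this ground-truth benchmark can be sketched as below; the energies, the stability cutoff, and the array sizes are toy values (the actual cutoff is given in the ESI), but the label-hiding logic follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
e_above_hull = rng.exponential(scale=0.2, size=41451)     # toy energies above hull (eV per atom)
true_label = (e_above_hull <= 0.1).astype(int)            # stability cutoff: an assumed value, see ESI

n_experimental = 10206                                    # size of the original experimental class
pos_idx = np.flatnonzero(true_label == 1)
keep_pos = rng.choice(pos_idx, size=min(n_experimental, len(pos_idx)), replace=False)

observed = np.zeros_like(true_label)                      # PU view of the stability data:
observed[keep_pos] = 1                                    # only these keep their positive label; all
                                                          # remaining points are treated as unlabeled.

def ground_truth_recall(predicted_labels, truth=true_label):
    """Recall against the full (hidden) labels, available here because stability is known."""
    return (predicted_labels[truth == 1] == 1).mean()
```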
As shown in Fig. 8, the recall values produced by both test-sets closely approximate the ground truth recall, confirming the reliability of using recall for evaluating the model's performance. In both co-training series, the leave-out recall value starts more optimistic than the ground truth, especially when high-energy experimental outliers are included in the PU learning. This optimistic recall was the reported recall value in the previous PU learning studies predicting synthesizability.14,23 From iteration ‘1’ of co-training the order flips and the dynamic test set becomes too optimistic. While there is no guarantee the ground-truth will always be found in the range between the two values, Fig. 8 illustrates why using both test-sets is worthwhile rather than just keeping one.
We selected SchNet as our classifier and achieved good results, though other classifiers like ALIGNN can also be trained using the same labels. Detailed training parameters are available in the METHODS section. The pretrained model is accessible in our repository (https://github.com/BAMeScience/SynCoTrainMP).
The trained model reached 90.5% accuracy on a test set comprising 5180 data points. To further evaluate the model's performance, we analyzed the synthesizability predictions for three additional datasets, focusing exclusively on oxides. These datasets originate from sources other than our training data and consequently exhibit different biases.35 First, we examined theoretical oxides from the Open Quantum Materials Database (OQMD),37 downloaded via the Jarvis Python package,38 after filtering out any crystals already present in the Materials Project's experimental data, leaving 23056 theoretical oxides. Second, we analyzed 14095 oxide crystals from the WBM dataset,39 which were generated by random sampling of elements in Materials Project structures with chemical-similarity safeguards based on ICSD data;34 we used the relaxed version of this dataset. Finally, we predicted the synthesizability of 6156 vanadium oxide crystals generated by iMatGen.11 Fig. 9 compares the synthesizability scores of these datasets with those of the theoretical portion of the test set. All these crystal structures and their predicted synthesizability scores are available for download in our GitHub repository.
Over half of the theoretical test-set data shows a synthesizability score close to zero, as expected, since previously synthesized crystals have been excluded by Materials Project. In contrast, the OQMD data shows roughly twice the proportion of synthesizable crystals, which may result from differing inclusion criteria between Materials Project and OQMD. We still observe a peak near a score of 1, possibly indicating synthesized crystals not listed in Materials Project. The iMatGen data show the lowest synthesizability, with multiple peaks at low scores, reflecting the artificial nature of these generated structures, which are often less realistic. The WBM dataset scores higher on average, without significant peaks. Despite being artificially generated, the WBM data employed mechanisms like chemical similarity to avoid unstable crystals. As a result, we observe more novel crystals with ambiguous synthesizability predictions, with scores around 0.5, and no clear peaks close to 0 or 1.
The decision thresholds of 0.5 and 0.75 were used as unbiased values for classification and class expansion, respectively. However, these thresholds are ultimately arbitrary and can be adjusted to the specific goals and applications: a more exploratory study could use a looser threshold to avoid overlooking potentially interesting novel structures, whereas a project with a tighter budget could employ a stricter threshold to save resources. Label distributions based on thresholds of 0.25 and 0.75 are illustrated as examples in Fig. 10. Compared with the unbiased threshold of 0.5 shown in Fig. 5, a cutoff of 0.25 is more lenient in classifying crystals as synthesizable, yet it still identifies only 26% of the theoretical oxides in the Materials Project as synthesizable, leaving the remaining data classified as unsynthesizable. Conversely, a threshold of 0.75 results in a more stringent classification, with only 17% of the theoretical oxides meeting this threshold; these oxides are, however, more likely to be synthesizable than those that did not make the cut.
Fig. 10 Label distribution based on 0.25 and 0.75 classification thresholds at the end of co-training for the first (a) and second (b) series.
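To make the effect of the cutoff concrete, a small snippet is shown below; the beta-distributed scores are synthetic stand-ins for the averaged co-training output, so the printed fractions are illustrative only.

```python
import numpy as np

# synthetic scores standing in for the averaged co-training output
scores = np.random.default_rng(0).beta(0.4, 1.6, size=31245)

for cutoff in (0.25, 0.50, 0.75):
    frac = (scores >= cutoff).mean()
    print(f"cutoff {cutoff:.2f}: {100 * frac:4.1f}% of theoretical oxides labeled synthesizable")
```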
In this work, we combined two different learners based on strong classifiers to reach a more reliable recall. New models for predicting materials properties are being developed rapidly and materials data are growing. Combining different tools, instead of creating one from scratch, is an untapped opportunity to learn more from the data already available in the materials space.
At each iteration, a base learner calculates a synthesizability score between 0 and 1 for both the unlabeled data and the experimental test data. To expand the positive class, unlabeled data points confidently classified as positive by the PU learner are selected; here, we use a threshold of 0.75, rather than 0.5, to determine which unlabeled data points are added to the original positive class. After iteration '2', the scores from both training series are averaged, and the 0.5 cutoff determines the final label.
The base learner was changed from the original naïve Bayes classifier to base PU learners equipped with graph convolutional neural networks. The different views of the data were achieved through the different data encodings of the two classifiers, ALIGNN and SchNet. Two parallel co-training series with alternating classifiers were carried out accordingly.
In this work, two base PU learners were built using the two classifiers. In both cases, a complete bagging of PU learning took 60 runs. Note that the separate runs of PU learning are not referred to as iterations, since each run is independent of the rest; this is not the case in co-training, where each iteration depends on the results produced by the previous one.
The training data at each PU learning run has a 1:1 ratio of positive and negative labels. The size of the training set increases after each co-training iteration, owing to the expansion of the positive class. Each run of PU learning predicts a label, 0 or 1, for the data points that did not take part in the training phase of that run. After the 60 runs, these predictions are averaged for each data point to produce the synthesizability score, also referred to as the predicted probability of synthesizability. The cutoff thresholds of 0.5 and 0.75 are used to predict the labels and to expand the positive class, respectively.
The SchNetPack model was originally designed for regression. To accommodate classification, a sigmoid non-linearity and a cutoff function were added to the final layer. We used version 1.0.0.dev0 of this package.
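Conceptually, the modification amounts to passing the network's scalar output through a sigmoid. The PyTorch sketch below is a generic stand-in rather than the actual SchNetPack 1.0.0.dev0 code, it omits the cutoff function mentioned above, and the toy backbone merely replaces the SchNet readout.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Wraps a scalar regression output with a sigmoid so the network emits a
    probability-like synthesizability score (generic stand-in, not SchNetPack code)."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        return self.sigmoid(self.backbone(x))

# toy backbone standing in for the SchNet readout network
backbone = nn.Sequential(nn.Linear(128, 64), nn.SiLU(), nn.Linear(64, 1))
model = ClassificationHead(backbone)
scores = model(torch.randn(4, 128))    # four crystals' pooled features -> scores in (0, 1)
```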
During initial tests, the predictor displayed a tendency to overestimate the positive class, likely due to overfitting to the data distribution of the Materials Project. To mitigate this, several regularization steps were introduced. First, noise was added to the labels by randomly selecting 5% of the positive class and flipping their labels from 1 to 0; an equal number of negative-class labels were flipped from 0 to 1. This small amount of label noise helps regularize the model, preventing the classifier from becoming overconfident in any class distribution.
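A minimal implementation of this symmetric label flipping, assuming the labels are stored in a NumPy array:

```python
import numpy as np

def flip_labels(y, frac=0.05, seed=0):
    """Symmetric label noise: flip `frac` of the positive labels to 0 and an equal
    number of negative labels to 1 (returns a copy)."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    positives = np.flatnonzero(y == 1)
    negatives = np.flatnonzero(y == 0)
    n_flip = int(frac * len(positives))
    y[rng.choice(positives, n_flip, replace=False)] = 0
    y[rng.choice(negatives, n_flip, replace=False)] = 1
    return y

y_noisy = flip_labels(np.array([1] * 100 + [0] * 100))
```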
Data augmentation was then employed, following a previously published method that showed significant improvements in predicting material properties. This approach perturbs atomic positions with Gaussian noise to generate slightly altered versions of the original data, which are used alongside the unperturbed data for training, doubling the size of the training set.
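A sketch of this perturbation-based augmentation on a bare coordinate array; the noise amplitude `sigma` is an assumed value, not the one used in the referenced method.

```python
import numpy as np

def augment_positions(cart_coords, sigma=0.02, seed=0):
    """Return the original Cartesian coordinates plus one copy whose atomic positions
    are perturbed with isotropic Gaussian noise (sigma in angstrom is an assumed value)."""
    rng = np.random.default_rng(seed)
    perturbed = cart_coords + rng.normal(0.0, sigma, size=cart_coords.shape)
    return [cart_coords, perturbed]                  # doubles the training set

coords = np.array([[0.0, 0.0, 0.0],                  # toy three-atom fragment
                   [1.9, 0.0, 0.0],
                   [0.0, 1.9, 0.0]])
augmented = augment_positions(coords)
```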
The SchNet model was used as the primary synthesizability predictor, with additional regularization techniques enhancing its generalizability. A weighted loss function was employed, with weights of 0.45:0.55 for the positive and negative classes, respectively. This adjustment subtly discouraged over-prediction of the positive class while maintaining model sensitivity.
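With a sigmoid output as sketched earlier, this weighting can be realized as a per-sample weighted binary cross-entropy; the snippet below is a possible PyTorch formulation, not the SynCoTrain code itself.

```python
import torch
import torch.nn.functional as F

def weighted_bce(probs, targets, w_pos=0.45, w_neg=0.55):
    """Binary cross-entropy with per-sample weights: 0.45 for positives, 0.55 for negatives."""
    weights = targets * w_pos + (1.0 - targets) * w_neg
    return F.binary_cross_entropy(probs, targets, weight=weights)

probs = torch.tensor([0.9, 0.2, 0.7, 0.1])      # model outputs after the sigmoid
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])    # synthesizability labels
loss = weighted_bce(probs, targets)
```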
Finally, dropout layers were added to the model for further regularization, with 10% dropout at the embedding layer and 20% at each convolutional layer. To manage the learning rate, a 'Cosine Annealing with Warm Restarts' scheduler was used, cycling the learning rate through phases that help the model escape local minima early in training while converging effectively later on. Early stopping was also implemented to prevent overtraining.
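The scheduler and early stopping can be combined roughly as follows. The network here is a placeholder with illustrative layer sizes; only the dropout rates, the scheduler choice, and the early-stopping pattern reflect the text, while T_0, patience, and the other hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder predictor: 10% dropout after the "embedding" stage, 20% after each
# convolution-like block, sigmoid output for the synthesizability score.
model = nn.Sequential(
    nn.Linear(128, 128), nn.Dropout(0.10),                 # embedding stage
    nn.Linear(128, 128), nn.SiLU(), nn.Dropout(0.20),      # convolution-like block 1
    nn.Linear(128, 128), nn.SiLU(), nn.Dropout(0.20),      # convolution-like block 2
    nn.Linear(128, 1), nn.Sigmoid(),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

best_val, patience, bad_epochs = float("inf"), 10, 0
x_val = torch.randn(64, 128)                               # placeholder validation batch
y_val = torch.randint(0, 2, (64, 1)).float()
for epoch in range(200):
    model.train()
    x = torch.randn(32, 128)                               # placeholder training batch
    y = torch.randint(0, 2, (32, 1)).float()
    loss = F.binary_cross_entropy(model(x), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()                                       # warm-restart learning-rate cycling

    model.eval()
    with torch.no_grad():
        val_loss = F.binary_cross_entropy(model(x_val), y_val).item()
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                         # early stopping
            break
```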
The Open Quantum Materials Database (OQMD)37 served as an external dataset that was not used in model training. The data were downloaded on 2023.12.12 through the Jarvis Python package,38 which provides easy access to this database.
The WBM dataset39 was made available through the Matbench Discovery41 project through figshare.42
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00394b

This journal is © The Royal Society of Chemistry 2025