Robert Tempke and Terence Musho*
Department of Mechanical, Materials and Aerospace Engineering, West Virginia University, P.O. Box 6106, Morgantown, WV, USA. E-mail: tdmusho@mail.wvu.edu
First published on 19th April 2024
The utilization of machine learning in Materials Science underscores the critical importance of the quality and quantity of data in training models effectively. Unlike fields such as image processing and natural language processing, there is limited availability of atomistic datasets, leading to biases in training data. Particularly in the domain of materials discovery, there exists an issue of continuity in atomistic datasets. Experimental data sourced from literature and patents is usually only available for favorable data, resulting in bias in the training dataset. This study focuses on developing a SMILES-based model that uses a variational autoencoder to generate a synthetic dataset of quantum materials for quantum sensing applications, with a focus on two-level quantum molecules that exhibit a dipole blockade. The proposed technique offers an improved sampling algorithm by incorporating newly generated data into the sampling algorithm to create a more normally distributed dataset. Through this technique, the study was able to generate over 1000000 candidate quantum materials from a small dataset of only 8000 materials. The generated dataset identified several iodine-containing molecules as promising single photon emitting materials for potential quantum sensing applications.
To achieve this goal, this study employs a deep learning technique known as a variational autoencoder (VAE), which is capable of synthesizing chemical species with specific chemical properties.7–10 The VAE forms a custom chemical compression intelligence that provides efficient generation of new specific chemical species by sampling the latent space of the VAE, which can be thought of as a representation of compressed chemical information.11–13 By training the neural network to learn the chemical and structural similarities of species with specific physical properties, we enable it to identify patterns in a higher dimensionality. See Fig. 1 for an illustration of this concept in application to this research.
This research is focused on the generation of candidate quantum materials, specifically single-photon source (SPS) materials, also referred to as UV/vis materials.14–18 These SPS materials rely on a two-level system of electronic states with the added complexity of a secondary interaction to form a resonant behavior. While it is beyond the scope of this paper to discuss all of the potential quantum material frameworks, the focus of this study is on the discovery of materials that exhibit SPS behavior with a strong dipole interaction. The creation of a chemically diverse dataset is critical to the development and training of accurate future machine learning algorithms. In the context of machine learning, the phrase “garbage in, garbage out” highlights the importance of high-quality data.19,20 Because the chemical space is continuous, the machine learning algorithm must be trained on a broad range of inputs and outputs; it can only be expected to predict well in regions that its training data represents.
Several studies have emphasized the issue of bias in experimental design and data collection, which ultimately leads to skewed and unreliable data. Griffiths et al. investigated biases in the natural sciences, focusing on the impact of data splitting, noisy datasets, and contextual variables on the outcome of experiments.21 Similarly, Kovács et al. highlighted the direct effects that biased and unbiased datasets can have on the quality of machine learning outputs.22 Glavatskikh et al. demonstrated how the lack of diversity in data limits the machine learning's potential to predict.23
The creation of an experimentally unbiased, continuous dataset is a costly and challenging task. However, our use of deep learning techniques, specifically VAEs, shows promise in generating a chemically diverse dataset with specific chemical properties.8,24,25 Through our research, we aim to contribute to the development of more accurate and reliable datasets to train machine learning algorithms in chemistry and materials science.
In the field of SPS materials, it is common to encounter incomplete or inaccurate data in the literature, making it unsuitable for machine learning algorithms. Zakutayev et al. have highlighted the significance of having sufficiently large and diverse datasets for the training of advanced machine learning algorithms in materials science.26 Their work emphasizes the importance of having a robust and extensive dataset to develop machine learning algorithms that can predict the structure, stability, and properties of various materials. Furthermore, it illustrates that existing machine learning algorithms can be quickly and easily adopted to address materials science problems, provided there is a suitable dataset for training purposes.5,27,28
In the context of UV/vis research on SPS materials, the work of Beard et al. stands out for its comprehensive collection of available materials and corresponding relevant calculations.29 The authors conducted an extensive search of over 400000 scientific documents to extract a database of just over 8000 unique compounds. Despite the use of state-of-the-art tools such as ChemDataExtractor, the process of creating a database for quantum materials is challenging. This is due to the wide variety of formatting among different scientific journals, discrepancies within the tools being used, and the lack of a standard set of ground truth rules for representing materials using the SMILES notation. Nevertheless, the database created by Beard et al. is currently the most complete UV/vis material dataset available.
SPS materials have been proposed for a variety of quantum applications, such as quantum communication, quantum computing, quantum information, and quantum precision metrology.16,17,30 To date, SPS materials have been demonstrated in quantum communication, remote sensing, and dipole gates.2,31,32 These materials exhibit a two-level system behavior, where a resonance is formed between a ground state and an excited state or between two excited states. In the application of quantum sensing or computing, the defining metric is the coherence time or the lifetime of the resonance. Often this involves several aspects of the material, which are deeply rooted in the atomic coordination of the atoms that make up the molecules. One of the targeted metrics in these organic SPS-based quantum sensing materials is discovering a material that exhibits both a strong photoabsorption strength and a strong dipole interaction. This type of molecule, when interacting with neighboring molecules, will exhibit a quantum resonance in which a single photon can exist on only one of the molecules' states at a time. The dipole interaction will shift or change the neighboring molecule's excited state energy level. This single photon–dipole interaction will give rise to a resonance with a precise energy that can be exploited for quantum sensing and other quantum applications. This two-level system can be described by a Hamiltonian, which is beyond the scope of this research but will be required for complete understanding and control of this quantum system.
One potential application that seems more near term is remote sensing, which has gained significant attention in recent years as quantum materials technology has improved. These remote sensing methods have been used to monitor contaminants in water, air quality trends, and dissolved nutrients in surface water, among many other applications.33–35 For example, Spangenberg et al. demonstrated how quantum materials could be combined to detect relative concentrations of mixtures within water in real-time.34 Fei et al. demonstrated that, with the right combination of machine learning algorithms and SPS material, the monitoring of groundwater contamination could be achieved.35 However, the limited dataset of 1665 materials used in Fei et al.'s work highlights the need for much larger datasets. Moreover, Mamede et al. further demonstrated the potential of applying machine learning to quantum materials by predicting the UV/vis absorption spectrum of organic molecules using fingerprints generated from 2D chemical structures. Their work yielded a sample size of approximately 75000 molecules using only information about the chemical structures.36
Recent studies by De Leonardis et al. and Richter et al. have demonstrated the potential of quantum materials in overcoming phase-matching challenges in remote sensing applications.31,37 These studies demonstrate the growing demand for new quantum material applications that can advance different fields. This highlights the need for more extensive datasets to support machine learning algorithms in the discovery of new SPS materials.
The hypothesized approach in this study is to create a material compression intelligence via the latent space representation of an existing SPS materials experimental database, using a VAE with a similar structure as AGoRaS.7 To differentiate this network from this previous study we will refer to the network as AGoRaS-Quantum. The latent representation of the materials can be sampled at various points to generate new SPS materials with desirable characteristics. The latent space can be viewed as a representation of the compressed structural and chemical information inherent in the species used to train the network. By populating the latent space with structurally and chemically similar SPS materials, the network can learn the underlying similarities between the materials, resulting in the generation of new materials that share the desired characteristics.38
The VAE's ability to represent data in n-dimensions and fit input nodes to probabilistic distributions, typically multivariate Gaussian distributions, offers a significant advantage over traditional methods of data representation.8,12,13,39 The methodology proposed in this study demonstrates the practicality and flexibility of the AGoRaS network in creating SPS materials for quantum applications, with the generation of single materials replacing the balanced chemical reactions generated in the original AGoRaS study.7
The generation of SPS materials has to be rigorously validated, following a similar methodology to the one outlined by Beard et al., using a workflow that facilitates data collection, network training, network testing, material generation, and material testing.29 The modular nature of this workflow enhances code quality and robustness while also enabling non-data scientists to employ the methodology with ease, thereby expanding the network's utility to researchers in different fields who lack data science expertise. This methodology has been successfully utilized in other generative networks such as ChatGPT, enabling non-data scientists to generate text with background knowledge beyond their expertise. The developed network aims to have the same utility, but with a narrower application focused on generating SMILES representations of molecules.
To simplify the network's predictions and improve reproducibility, the AGoRaS-Quantum network focused solely on predicting SMILES species and not on their associated properties, such as dipole moments and excitation wavelength.43 This decision allowed the network to focus on learning the underlying physical and chemical structural patterns rather than extending the prediction to properties, which could be calculated afterwards using quantum chemistry tools such as density functional theory (DFT).43–45 To generate new species, the network continued to use character-level embedding due to its advantages over word-level embedding in natural language processing. This allowed the generative network to use the information learned during training to generate new species based on the universal alphabet created from all the species in the dataset.46,47
The use of molecule embedding can improve the predictive power of machine learning models by formulating inputs into sequence embeddings.8,48–50 In this study, TensorFlow's built-in embedding techniques were used to create embeddings based on the universal alphabet created from the chemical species.51 This approach was inspired by Gaspar et al.'s work, which demonstrated that molecule embedding can be similar to NLP embeddings.49 By using sequence embeddings, the network can capture more of the structural and chemical similarities between the species, allowing it to generate new species with similar properties. The use of embedding techniques and a universal alphabet enables the AGoRaS-Quantum network to accurately represent chemical species and generate new species, making it a useful tool for materials science research.
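The character-level encoding described above can be sketched in a few lines of plain Python. This is only an illustration of the idea and not the code used in the study: the padding symbol, the function names, and the three example strings are assumptions introduced here, and the integer tokens would still need to be passed to an embedding layer.

```python
from typing import Dict, List

PAD = " "  # hypothetical padding symbol; assumed never to occur in a SMILES string


def build_alphabet(smiles_list: List[str]) -> Dict[str, int]:
    """Map every character seen in the dataset to an integer (0 is reserved for padding)."""
    chars = sorted(set("".join(smiles_list)))
    alphabet = {PAD: 0}
    for index, char in enumerate(chars, start=1):
        alphabet[char] = index
    return alphabet


def encode(smiles: str, alphabet: Dict[str, int], length: int) -> List[int]:
    """Encode one string as integer tokens, right-padded to a fixed length."""
    if len(smiles) > length:
        raise ValueError("string is longer than the fixed input length")
    tokens = [alphabet[char] for char in smiles]
    return tokens + [alphabet[PAD]] * (length - len(smiles))


def decode(tokens: List[int], alphabet: Dict[str, int]) -> str:
    """Invert encode(): map tokens back to characters and strip the padding."""
    reverse = {index: char for char, index in alphabet.items()}
    return "".join(reverse[token] for token in tokens).rstrip(PAD)


# Toy example: the alphabet is built from every species in the (tiny) dataset.
species = ["FIII", "C[I][I]II", "CCO"]
alphabet = build_alphabet(species)
max_len = max(len(s) for s in species)  # the longest string fixes the input length
encoded = [encode(s, alphabet, max_len) for s in species]
```

Because the alphabet is built from the dataset itself, any string composed of those characters can be encoded and decoded without loss, which is the property a character-level generator relies on.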
The sampled solution space is then decoded using a RepeatVector layer wrapped around the output of the latent space, which turns the data into a tensor that an LSTM layer can read. The LSTM layer's output is projected into a vector of length n, and this projection is used to calculate the loss of the network. AGoRaS-Quantum uses a sequence-to-sequence style loss function typical of variational autoencoders, and the Kullback–Leibler (KL) loss is used as the monitoring metric during training. The network was trained for 500 epochs using a batch size of 25, an embedding dimensionality of 500, and a latent dimensionality of 350. The KL weight used was 0.1, and the activation function was softmax. The optimizer was Adam, and the learning rate was set at 1 × 10⁻⁵. This structure closely mimics that of the original AGoRaS network for chemical reaction prediction, except for the input vector's length.
The model takes in a vector representation padded to the length of the longest SMILES string in the training dataset, which is then projected into a higher dimensionality space. The projected vectors are passed through a bidirectional LSTM layer with a recurrent dropout to extract the mean and log variance, which are then used to sample the solution space. A sequence-to-sequence style loss function is used to calculate the loss of the network, with the KL loss serving as the monitoring metric during training. The model's performance is governed by several hyperparameters, including batch size, embedding and latent dimensionality, KL weight, activation function, optimizer function, and learning rate. The AGoRaS-Quantum algorithm's combination of deep learning techniques and chemical domain knowledge allows it to generate new chemical species accurately and efficiently.
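The sampling step from a mean and log variance, and the KL term that is monitored, follow the standard VAE formulation, which can be written out explicitly. The sketch below uses plain Python rather than the study's TensorFlow code; the function names are hypothetical, and combining the terms as a weighted sum with the KL weight is an assumption about how the loss is assembled.

```python
import math
import random
from typing import List, Sequence


def sample_latent(mean: Sequence[float], log_var: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Reparameterization trick: z = mean + exp(0.5 * log_var) * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_divergence(mean: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over the latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mean, log_var))


def total_loss(reconstruction_loss: float, mean: Sequence[float],
               log_var: Sequence[float], kl_weight: float = 0.1) -> float:
    """Assumed combination: reconstruction term plus the weighted KL term."""
    return reconstruction_loss + kl_weight * kl_divergence(mean, log_var)
```

The KL term vanishes when the encoder output matches the standard normal prior (zero mean, zero log variance) and grows as the encoded distribution drifts away from it, which is why it is a convenient quantity to monitor during training.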
During the training process, a sequence-to-sequence loss function was employed to score the reconstructed string versus the original string. This approach enabled the validation of the VAE's ability to reconstruct the chemical equations with zero loss of information, which is indicative of a stable latent space. Given the small size of the data used in this study, it was essential to validate the stability of the latent space as much as possible. After it had been demonstrated that the network could reconstruct the test data, the remaining 90 percent of the data were also tested to further validate the stability of the latent space. Although the network should be able to reconstruct all the data used in training, this additional test served as a further validation of the latent space's stability.
Overall, validating the stability of the latent space is critical for this study. By demonstrating that the VAE can encode and decode the original chemical equations with zero loss of information, it is possible to confirm that the latent space is stable. This validation is especially important given the small size of the data used in this study.
Due to the probabilistic nature of the latent representation, an almost unlimited number of sample points can be taken to generate new species. However, this approach has diminishing returns as there are only a limited number of chemically feasible species that can be generated. Nevertheless, the directed sampling approach can still be a powerful tool for researchers to generate new species with specific characteristics of interest.
The performance of the AGoRaS-Quantum network is determined by comparing the number of generated species to the number of unique species. It was determined that 10% of the generated species are unique on the first iteration. That means nearly 10 million species need to be generated in order to discover 1 million unique species on the first iteration. The success rate was around 10% at first because the latent space was limited, since we started with only 8000 species. As the latent space was sampled further and more stable species were added to the list of stable species, the subsequent predictions improved, which increased the stability of the sampling. In the end, approximately 45% of generated species were stable per iteration. Approximately 5% of those were repeats of previously generated species. In principle, this would continue to improve as more of the latent space is sampled, because values can be sampled between known species in latent space.
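The arithmetic behind these yields is simple enough to check directly. In the short sketch below the function name is hypothetical, and the 40% figure is taken as the approximate fraction of new species per iteration once the repeats are excluded, which is an assumption about how the reported percentages combine.

```python
def samples_needed(target_unique: int, unique_percent: int) -> int:
    """Samples to draw so that unique_percent of them gives target_unique species."""
    if not 0 < unique_percent <= 100:
        raise ValueError("unique_percent must be in (0, 100]")
    # Integer ceiling division avoids floating-point round-off.
    return -(-target_unique * 100 // unique_percent)


first_pass = samples_needed(1_000_000, 10)  # 10% unique on the first iteration
later_pass = samples_needed(1_000_000, 40)  # assumed ~40% new species per iteration
```

With these inputs `first_pass` is 10 000 000 and `later_pass` is 2 500 000, so feeding stable species back into the sampling reduces the number of samples needed for the same number of new species by roughly a factor of four.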
The semi-empirical quantum chemistry model that was implemented in the automated Pipeline Pilot script was based on the Dassault Systèmes Materials Studio VAMP software package.56 Geometry optimization was conducted with the neglect of diatomic differential overlap (NDDO) approximation and the PM6 Hamiltonian, with auto multiplicity, and a spin state starting with the most rigorous of unrestricted Hartree–Fock (UHF), restricted Hartree–Fock (RHF), or annihilated unrestricted Hartree–Fock (A-UHF).57,58 Several spin states were tested based on convergence. A Pulay/IIS convergence scheme was selected with a convergence energy tolerance of 2 × 10⁻⁴. The thermodynamic information and total dipole moment were taken from the VAMP output. The thermodynamic information is based on the optimized atomistic description, semi-empirical molecular information, and electronic and phonon calculations.
The Pipeline Pilot script conducted a series of data preparation steps prior to the semi-empirical calculation. After the data were read in SMILES format, each SMILES string was checked for consistency, followed by building and cleaning of the molecule. The cleaning steps included centering the molecule, adding hydrogen, and conducting a quick empirical elastic relaxation of the structure to refine the initial geometry. The structure was then passed to a programmed series of VAMP calculations, starting with the most rigorous spin state and, in the case of failure, relaxing the spin state and retrying the calculation. In the event that the semi-empirical calculation failed for every spin state, the molecule was assumed to be unstable and removed from the dataset.
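The retry logic described above can be expressed as a small fallback loop. The sketch below is not the Pipeline Pilot script: `run_calculation` is a hypothetical stand-in for the semi-empirical call, and the default order of spin states is an assumption, since the text lists the three options without fixing the exact sequence.

```python
from typing import Callable, List, Optional, Sequence

# Assumed retry order; the exact sequence used by the script is not stated.
SPIN_STATES = ("UHF", "RHF", "A-UHF")


def first_converged(
    smiles: str,
    run_calculation: Callable[[str, str], Optional[dict]],
    spin_states: Sequence[str] = SPIN_STATES,
) -> Optional[dict]:
    """Try each spin state in order and return the first successful result.

    run_calculation is a hypothetical callable that returns a dict of
    properties on success and returns None or raises RuntimeError on failure.
    """
    for spin_state in spin_states:
        try:
            result = run_calculation(smiles, spin_state)
        except RuntimeError:
            result = None
        if result is not None:
            return {"smiles": smiles, "spin_state": spin_state, **result}
    return None  # every spin state failed: the molecule is treated as unstable


def screen(smiles_list: Sequence[str],
           run_calculation: Callable[[str, str], Optional[dict]]) -> List[dict]:
    """Keep only the molecules for which some spin state converged."""
    results = (first_converged(s, run_calculation) for s in smiles_list)
    return [r for r in results if r is not None]
```

Returning `None` for a molecule rather than raising keeps a long screening run going, which matters when hundreds of thousands of candidates are processed unattended.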
The semi-empirical calculation was chosen because of its throughput and robustness. Compared to all-electron density functional theory calculations, the semi-empirical calculations take one to two orders of magnitude less time to provide a prediction. This is critical to this approach, where we are aiming to predict the properties of hundreds of thousands of molecules. The reason to use the semi-empirical approach was to provide a quick estimate of the molecule stability and properties, which could later be investigated using higher fidelity models after the initial screening.
The semi-empirical model did provide an estimate of several SPS properties of the molecules, including but not limited to the formation energy, dipoles, and UV/vis properties. VAMP uses the calculated molecular wavefunctions to derive the dipole moment and associated excited state properties. This is done using the LCAO method of molecular orbitals rather than the standard MNDO Hamiltonian calculation.56,59 VAMP is also able to calculate accurate dipole moments using the Natural Atomic Orbital-Point Charge model for molecular electrostatic properties. All of the structures generated, along with their properties predicted with the semi-empirical model, can be found on the NOMAD materials science database (see Data Availability Section).
Fig. 4 TDDFT results for the FIII (H2FI3) molecule (A and B) and C[I][I]II (C and D). Subfigures A and C illustrate the ground-state local density of states surfaces. Yellow is associated with spin-up and blue with spin-down electrons. Subfigures B and D show the excited-state local density of states. Comparing the ground state to the excited states provides spatial information on the electron transitions outlined in Table 1. It is desirable to have a high overlap of these states to avoid decoherence of the states.
From | To | TD-ex [eV] | TD-ex [nm] | f-osc | Overlap |
---|---|---|---|---|---|
85 | 86+ | 1.31 | 950 | 0.000026 | 0.50 |
83 | 86− | 1.46 | 851 | 0.000110 | 0.87 |
81 | 86+ | 2.17 | 571 | 0.001437 | 0.39 |
80 | 86+ | 2.76 | 450 | 0.000796 | 0.44 |
From | To | TD-ex [eV] | TD-ex [nm] | f-osc | Overlap |
---|---|---|---|---|---|
111 | 112+ | 1.83 | 676 | 0.001886 | 0.61 |
109 | 112− | 1.99 | 622 | 0.000295 | 0.59 |
107 | 112+ | 2.25 | 550 | 0.082783 | 0.56 |
106 | 112+ | 2.32 | 535 | 0.003263 | 0.46 |
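The two TD-ex columns in the tables above express the same excitation in different units, related by λ = hc/E. A minimal sketch of the conversion follows; the function name is hypothetical, and hc ≈ 1239.84 eV nm is the standard value.

```python
HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV nm


def ev_to_nm(energy_ev: float) -> float:
    """Convert an excitation energy in eV to the corresponding wavelength in nm."""
    if energy_ev <= 0:
        raise ValueError("energy must be positive")
    return HC_EV_NM / energy_ev


# Energies from the first table, converted for comparison with its nm column.
for energy in (1.31, 1.46, 2.17, 2.76):
    print(f"{energy:.2f} eV -> {ev_to_nm(energy):.0f} nm")
```

Because the tabulated energies are rounded to two decimal places, the converted wavelengths agree with the tabulated ones only to within a few nanometres.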
The distribution of these atom counts can be seen in Fig. 2, which allows for a more in-depth analysis. Again, the training dataset is pictured in blue in the background with the generated dataset in the foreground in red. The y-axis in all the plots is the percentage of all molecules within that bin and the x-axis is the associated chemical property. It can be seen from Fig. 2 that the generated species and original species share an approximately normal distribution with two main differences. The first is the high percentage of species in the first two bins and the second is that the distributions are not centered around the same number of atoms. Both of these differences are due to the original dataset containing single-element molecules. Since they are already included in the original species and AGoRaS-Quantum cannot come up with new elements, it is impossible for it to share that feature. However, due to the existence of these small molecules, it biased the network into creating species that were on average smaller than the median number of atoms for the original dataset.
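Plotting the percentage of molecules per bin, rather than raw counts, is what allows datasets of very different sizes to be overlaid. A minimal sketch of that normalization is given below; the function name and the equal-width binning are assumptions, since the binning used for the figures is not described.

```python
from typing import List, Sequence


def percent_histogram(values: Sequence[float], low: float, high: float,
                      bins: int) -> List[float]:
    """Percentage of all values that fall in each equal-width bin on [low, high]."""
    if bins <= 0 or high <= low:
        raise ValueError("need bins > 0 and high > low")
    if not values:
        return [0.0] * bins
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if low <= value <= high:
            # A value equal to `high` is placed in the last bin.
            index = min(int((value - low) / width), bins - 1)
            counts[index] += 1
    # Normalize by the full dataset size so that bars are comparable across datasets.
    return [100.0 * count / len(values) for count in counts]
```

For example, `percent_histogram([1, 2, 3, 10], 0, 10, 2)` gives 75% in the first bin and 25% in the second, regardless of whether the dataset holds four values or four million with the same proportions.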
Of course, for these molecules to be useful as either a dataset for machine learning or as a database for potential experimentalists, it had to be shown that these generated species shared the same properties as the original species. It was decided to look at molecules containing no more than 10 atoms. This was due to the computational complexity and cost associated with the semi-empirical calculations. The criteria that were deemed the most important to compare between the datasets were those that can aid in identifying whether a material is an SPS material. These criteria included whether or not the calculated emission spectrum of the molecule exhibits a single strong peak (high oscillator strength) at an optical wavelength.
Using semi-empirical calculations it was possible to calculate the wavelength and emission strength of excited electrons. It is important to note that the intensity of the photon being emitted is difficult to compare between molecules but can be compared between other peaks in the spectrum for a given material to provide a normalized intensity. This is due to the Franck–Condon Principle, which explains the relative intensities of vibronic transitions. These intensities are the relation between the probability of a vibrational transition to the overlap of the vibrational wave functions. These calculations at each energy level led to the calculated emission spectra that will be used as validation of these materials. All electronic levels of each molecular species were determined at the standard state using the semi-empirical computational technique described earlier. Other properties of interest, but secondary to the peak strength, were the total dipole moment, vibrational spectrum, and vibrational strength.
Once it was confirmed that the generated data exhibited single-peak behavior, it was necessary to perform further analysis to confirm that the generated data shared similar value ranges with the original data. For this, overlapping histograms were determined to be an ideal way to show that the generated data had properties similar to those of the real data. Fig. 2D illustrates a histogram of the original and generated species’ peak oscillator strength. It can be seen that both the generated and original species have a semi-normal distribution with a slight left skew to the values. However, it appears that on average the generated materials have a slightly weaker peak oscillator strength.
In this type of behavior, both the lower average peak strength and the identical distribution of values are expected due to the bias in the training data. Since the network uses all of the original data as a starting point for sampling the latent space it will always return data of a similar distribution. This problem could easily be overcome by sampling only data from the underrepresented regions until a uniform distribution was created.
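One way to picture the suggested correction is to choose sampling seeds with a probability inversely proportional to how crowded their property bin is. The sketch below illustrates that suggestion only and is not something the study implemented; all names are hypothetical, and equal-width binning of a single property is an assumption.

```python
import random
from collections import Counter
from typing import List, Sequence


def bin_index(value: float, low: float, width: float, bins: int) -> int:
    """Index of the equal-width bin containing value, clamped to [0, bins - 1]."""
    return max(0, min(int((value - low) / width), bins - 1))


def balanced_seed_weights(values: Sequence[float], low: float, high: float,
                          bins: int) -> List[float]:
    """Weight each seed by 1 / (population of its bin).

    Every occupied bin then carries the same total weight, so sparsely
    populated regions are sampled as often as crowded ones.
    """
    width = (high - low) / bins
    indices = [bin_index(v, low, width, bins) for v in values]
    population = Counter(indices)
    return [1.0 / population[i] for i in indices]


def pick_seeds(seeds: Sequence[str], weights: Sequence[float], k: int,
               rng: random.Random) -> List[str]:
    """Draw k seeds (with replacement) according to the balancing weights."""
    return rng.choices(list(seeds), weights=list(weights), k=k)
```

Seeds drawn this way would still need to be encoded into the latent space and perturbed to generate new species; the weighting only changes where the sampling starts.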
The high percentage of generated molecules that produce weaker strengths is also a byproduct of the inherited bias. Because the training data are sourced from experimental results, only the best materials, i.e., the strongest emitters, are reported. This leaves many materials for AGoRaS-Quantum to generate that still meet the chemical and physical requirements but do not produce as strong a peak. Simply put, the area of the latent space that generates strong-peak materials is crowded, while the rest of the latent space is sparsely populated. As shown in study two, however, if the latent space were to continue to be sampled until the number of generated materials reached 100 times the number of original materials, then the generated distributions would be exactly that of the original materials.
Another important aspect of these types of materials is the wavelength at which these peaks occur. Fig. 2B depicts the excitation wavelength at which the molecule's peak oscillator strength occurs for the original and generated species. Once again it can immediately be seen that the original and generated materials follow a similar semi-normal distribution with a left skew. As with the previous figure, this could be corrected with a more directed sampling methodology. Another factor in the similarity of these distributions is that, unlike the other histograms that have been shown in this study, their values could theoretically be anything. The excitation wavelengths are calculated between 100 and 1400 nm, which helps to enforce an equal distribution of values within that range. An interesting find from Fig. 2B is that the original data has a disjointed distribution of values when the excitation wavelength is greater than 250 nm. The generated data shows a much more normal distribution as the values tail out to 1200 nm. This suggests that even if the training data has a disjointed distribution, a VAE will be able to generate a smooth distribution of the data.
The total dipole moments of the real and generated materials were also calculated, which was important in selecting a molecule that could potentially be operated in the dipole blockade quantum sensing application. The dipole is based on the partial charge and positions of the atoms. The overlaid histogram for the total dipole moments of the original and generated species can be seen in Fig. 2C. As we have seen previously, it is a semi-normal distribution with a left skew. It should be noted that there is a high percentage of dipole values around 0 Debye for the original species. This is due to the original dataset containing single-atom species, which would have zero dipole moment. It is interesting to note that, in Fig. 2B, the original data has a somewhat uneven distribution, whereas the generated data has a single-peak distribution. The network cannot generate any more single-element materials with new elements from the periodic table, so the zero-dipole materials are limited.
The filling in of the generated data represents an extremely important aspect of VAEs and especially of the AGoRaS-Quantum network. This stems from the ability of the network to map the latent probabilistic solution space of these materials to the real space. By sampling the latent space, the network is able to fill in the gaps between points and discover new spaces. The aim is to demonstrate that the new materials are filling in the solution space and therefore effectively removing the bias. To visualize this, a t-Distributed Stochastic Neighbor Embedding (t-SNE) was generated as shown in Fig. 5. The t-SNE algorithm is used primarily to explore and visualize high-dimensional data such as text. At its simplest, it allows a user to get an understanding of how data is arranged in high-dimensional space. The algorithm accomplishes this through an unsupervised learning method of stochastic neighbor embedding to give high-dimensional data a single point on a two-dimensional grid.
For this t-SNE algorithm, the only input was the SMILES representations of the molecules embedded as numbers, just as in the original training for the AGoRaS-Quantum network. The blue circles represent the generated dataset and the red circles represent the training data. Fig. 5A shows all 8000 species of the original data alongside a randomly selected 8000 of the generated species. Meanwhile, Fig. 5B also shows the 8000 original species but with 80000 randomly selected generated species. This is done to illustrate how, as more species are sampled, the latent space is filled in. It can be seen from Fig. 5 that AGoRaS-Quantum is starting to fill in the blank spaces in the latent space. It is interesting to note that most of the original species are concentrated within a small area in the latent space. Fig. 5B clearly illustrates how the network is beginning to fill in all of the available space with generated materials. It appears the areas around the original species are the most densely populated with generated materials. This would make sense, as the original species were used as entry points into the latent space for the initial sampling. Therefore, a high proportion of the early generated species would be located near the original species. Due to the memory cost, it was not possible to show how using 800000 species would look visually, but one can imagine an even more densely packed latent space that is expanding outward.
The novel aspect of the AGoRaS-Quantum network was its ability to generate a large quantity of new molecular species that were both stable and shared the same defining feature as the training dataset. This was an improvement over the previous AGoRaS sampling method in the ability to use the SMILES representations of the molecular species as starting points in sampling the latent space. This allowed for targeted sampling of the latent space to generate materials with specific types of properties. This is possible due to the ability of the VAE to gather knowledge of physics and chemistry from the dataset it is trained on and to generate new molecular species beyond the size and descriptions contained in the training data. We were able to achieve nearly 40 percent generation of new species on each sampling iteration.
This developed SMILES language modeling approach opens the possibilities for more in-depth analysis of these generative models for chemistry and materials science. For example, a traditional machine learning analysis could be performed on the AGoRaS-Quantum network in order to gain a better understanding of the underlying processes. The covariant estimates of the different parameters within the network would be one approach. This would also help quantify the overfitting of the latent space via the network's variance. Another interesting study, aimed at improving the network speed and efficiency, is the autonomous design of the network parameters. This study was based on parameters that were hand-tuned until a stable network could be created, which leaves a great opportunity for the design of a more memory-efficient network. Both the source code and datasets for this study are available (see Data Availability Section) in the hope that others will expand upon the network and apply this approach to other applications.
The source code corresponding to the machine learning model and the Pipeline Pilot script can be downloaded on GitHub at the following address: https://github.com/Dr-Musho-Research-Group/AGORAS_QUANTUM.
This journal is © The Royal Society of Chemistry 2024